CN117524353B - Molecular large model based on multidimensional molecular information, construction method and application - Google Patents


Info

Publication number: CN117524353B (application number CN202311574206.0A)
Authority: CN (China)
Prior art keywords: molecular, node, information, self, dimensional
Legal status: Active (granted)
Other languages: Chinese (zh)
Other versions: CN117524353A
Inventors: 申彦明, 马煜婷
Current and original assignee: Dalian University of Technology
Application filed by Dalian University of Technology
Priority to CN202311574206.0A
Publication of application CN117524353A; application granted and published as CN117524353B

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C: COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C 20/00: Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C 20/50: Molecular design, e.g. of drugs
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/2155: Generating training patterns characterised by the incorporation of unlabelled data, e.g. multiple instance learning [MIL], semi-supervised techniques using expectation-maximisation [EM] or naïve labelling
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C: COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C 20/00: Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C 20/70: Machine learning, data mining or chemometrics


Abstract

The invention provides a molecular large model based on multidimensional molecular information, together with its construction method and application. The method comprises: constructing an unsupervised pre-training data set, then preprocessing it and generating molecular conformations to obtain a molecular pre-training data set composed of molecular graphs; structurally encoding the molecular graphs in the pre-training data set to obtain initialized atomic features, which are input into a Transformer; integrating the shortest-path structural encoding, the edge-information encoding and the three-dimensional distance-pair encoding into the self-attention layers of the Transformer, where the three-dimensional distance-pair encoding interacts with the self-attention layers during training and the node-pair features of the self-attention layers are iteratively updated; and defining a joint two-dimensional and three-dimensional self-supervised learning task on the molecular graph, from which the molecular large model based on multidimensional molecular information is obtained after training. The invention can accelerate drug screening and assist drug research and development.

Description

Molecular large model based on multidimensional molecular information, construction method and application
Technical Field
The invention belongs to the field of artificial intelligence, and particularly discloses a molecular large model construction method based on multidimensional molecular information.
Background
Traditional drug development is a complex and time-consuming process involving many stages, such as potential-target identification, compound optimization and bioactivity evaluation, and it requires significant manpower, materials and funding. A large model can mine and analyze massive biomedical data and rapidly screen out potential drug molecules, thereby speeding up drug research and development, reducing manpower, material and cost inputs, and providing powerful support for innovation and development of the intelligent-medicine industry. Existing data sets such as ZINC contain only two-dimensional molecular information, which limits a model's ability to learn spatial information. Some public data sets do contain three-dimensional spatial information, such as PCQM4Mv2 or QM9, but their limited size cannot meet the demands of current molecular large models, so it is necessary to construct a large-scale molecular data set that simultaneously contains two-dimensional planar and three-dimensional stereo information. The success of large models is inseparable from the Transformer, yet the existing Transformer cannot fully characterize the structural information in a graph, which makes large-model learning on large-scale graph data difficult. In 2021, Ying et al. proposed Graphormer, which improves the modeling of structural information by introducing the graph's structural encodings into the Transformer, but this approach lacks the learning of three-dimensional information, limiting the model's scope of application.
Therefore, luo introduces three-dimensional information learning based on Graphormer and introduces three-dimensional position coding, however, the method only adds additional information in the attention matrix and cannot improve the modeling capability of three-dimensional coordinates, thereby limiting the application range of the model.
Existing pre-training models based on SMILES strings cannot capture structural information well, owing to the limitations of a sequence representation. For contrastive pre-training tasks, improper data augmentation can produce false positive samples, and this biases model learning. Another approach is generative pre-training: a certain proportion of atoms and edges is masked and the model is asked to predict the attributes of the masked parts; but such generative tasks can be too easy, since in chemistry nature contains only 118 elements and there is a severe data-imbalance problem. From this point of view, molecular representation learning may fail to make full use of chemical prior knowledge, which hurts performance. From the three-dimensional perspective, GEM proposed introducing the spatial structure of a compound into the pre-training encoding process, but it uses only a bond-angle graph as additional spatial information and lacks full use of the three-dimensional coordinates; Transformer-M first proposed using two-dimensional and three-dimensional information jointly for representation learning, but it is limited by the number of available three-dimensional equilibrium conformations and cannot be scaled to large data sets.
In summary, the existing methods each have certain limitations: (1) the biomedical field needs a molecular large model that can predict properties such as bioactivity and side effects to accelerate drug screening, or generate molecules with specific properties and structures to provide candidates for drug design and discovery; (2) most existing models depend on small-scale data sets, are limited by the number of available three-dimensional equilibrium conformations, and lack a large-scale graph data set suitable for training a molecular large model; (3) existing models have limited capacity to characterize molecular information and cannot fully learn three-dimensional spatial information, so a model with high expressive power is needed.
Disclosure of Invention
The invention provides a molecular large model based on multidimensional molecular information, together with its construction method and application, aiming to solve the following problems: the biomedical field lacks a large-scale graph data set usable for training molecular large models, and existing models have limited capacity to represent molecular information and cannot fully learn three-dimensional spatial information.
The invention provides a molecular large model construction method based on multi-dimensional molecular information, which comprises the following steps:
Constructing an unsupervised pre-training data set, and performing preprocessing and molecular conformation generation on it to obtain a molecular pre-training data set composed of molecular graphs;
structurally encoding the molecular graphs in the molecular pre-training data set to obtain initialized atomic features, and inputting the initialized atomic features into a Transformer;
integrating the shortest-path structural encoding, the edge-information encoding and the three-dimensional distance-pair encoding into the self-attention layers of the Transformer, letting the three-dimensional distance-pair encoding interact with the self-attention layers during training, and iteratively updating the node-pair features of the self-attention layers;
And defining a joint two-dimensional and three-dimensional self-supervised learning task on the molecular graph, obtaining the molecular large model based on multidimensional molecular information after training.
According to some embodiments of the present application, the preprocessing includes removing hydrogen atoms, removing charges, removing small fragments, removing chirality and standardizing tautomers, retaining a backbone structural representation of the molecule that consists of the molecule's ID number in the original database and its one-dimensional SMILES representation.
According to some embodiments of the present application, the molecular conformation generation is performed with the RDKit toolkit in the following steps:
generating a preliminary molecular conformation based on distance geometry;
refining the molecular conformation with the ETKDG method;
optimizing the molecular conformation with the MMFF force field.
According to some embodiments of the application, the Transformer comprises a plurality of Transformer blocks, each consisting of a self-attention layer and a feed-forward neural network layer, with a standard normalization operation applied to both.
According to some embodiments of the application, the molecular graph is defined by formula (1):
G = {X_atom, A, E, R} (1)
where X_atom ∈ R^(n×d) is the atomic node feature matrix, n is the number of atoms, d is the atomic feature dimension, and X_atom contains the intrinsic attributes of the atoms; A is the adjacency matrix of the molecular graph, covering its first-order topology; E is the set of edges of the molecular graph; and R ∈ R^(n×3) contains the three-dimensional geometric coordinates of the molecule.
According to some embodiments of the application, the initialized atomic features comprise the atomic node feature matrix, node degree encoding, random-walk position encoding and three-dimensional distance encoding, as shown in formula (2):
X_0 = [X_atom | X_degree | X_RW | X_3D] (2)
where X_degree is the node degree encoding, X_RW the random-walk position encoding and X_3D the three-dimensional distance encoding;
The node degree encoding X_degree is given by formula (3):
X_degree = f_α(D) (3)
where D is the degree matrix of the molecular graph and f_α is a mapping function of the degree information,
The random-walk position encoding X_RW is given by formula (4):
X_RW^i = [RW_ii, (RW^2)_ii, ..., (RW^m)_ii] (4)
where X_RW^i is the random-walk position encoding of node i, m is the dimension of the random-walk position encoding, and RW is the random-walk operator matrix given by formula (5):
RW = A·D^(-1) (5)
where D^(-1) is the inverse of the degree matrix,
The three-dimensional distance encoding X_3D is given by formula (6):
X_3D^i = (1/|U(i)|) Σ_(j∈U(i)) ||r_i − r_j|| (6)
where X_3D^i is the three-dimensional distance encoding of node i, U(i) is the set of neighbor nodes of node i, |U(i)| is the number of neighbors of node i, ||r_i − r_j|| is the distance between nodes i and j, and r_i and r_j are the coordinates of nodes i and j.
According to some embodiments of the present application, the shortest-path structural encoding, edge-information encoding and three-dimensional distance-pair encoding are integrated into the self-attention layers of the Transformer as bias terms, as shown in formula (7):
Att(X)^(l+1) = Att(X)^l + SPD + Edge + Ψ_3D (7)
where Att(X)^(l+1) is the (l+1)-th self-attention layer, Att(X)^l the l-th self-attention layer, SPD the shortest-path structural encoding, Edge the edge-information encoding, and Ψ_3D the three-dimensional distance-pair encoding,
The shortest-path structural encoding SPD is given by formula (8):
SPD = f_φ(F) (8)
where F holds the shortest paths between node pairs of the molecular graph obtained with the Floyd algorithm, and f_φ is a mapping function of the shortest path,
The edge-information encoding Edge is given by formula (9):
Edge = g_θ(E) (9)
where g_θ is a mapping function of the edge information,
The three-dimensional distance-pair encoding is given by formula (10):
Ψ_3D^(i,j,k) = exp(−(α_(i,j)·||r_i − r_j|| + β_(i,j) − μ_k)^2 / σ_k^2) (10)
where r_i and r_j are the coordinates of nodes i and j; α_(i,j), β_(i,j), μ_k and σ_k are learnable parameters; α_(i,j) and β_(i,j) are determined by the element types of the atomic nodes, so node pairs formed by different elements have different α_(i,j), β_(i,j); μ_k and σ_k are the parameters of the Gaussian-kernel mapping, and k indexes the Gaussian kernels;
The three-dimensional distance-pair encoding interacts with the self-attention layers of the Transformer, and the interaction and iterative update of the node-pair features are given by formulas (11)-(12):
p_(ij)^0 = Ψ_3D^(ij)·M (11)
p_(ij)^(l+1) = p_(ij)^l + ||_(h=1..H) (Q_l^h(i)·K_l^h(j)^T / √(d/H)) (12)
where p_(ij)^0 is the initial feature of node pair i-j, M is a mapping matrix, p_(ij)^l is the feature of node pair i-j at the l-th self-attention layer, H is the number of attention heads, d is the hidden-layer dimension, Q_l^h is the query of the h-th head of the l-th self-attention layer, and K_l^h is the key of the h-th head of the l-th self-attention layer,
The updated node-pair features serve as the bias term of the next self-attention layer.
According to some embodiments of the present application, the defined joint two-dimensional and three-dimensional self-supervised learning task on the molecular graph comprises a two-dimensional masked-node attribute prediction task and a three-dimensional coordinate denoising task;
The two-dimensional masked-node attribute prediction task uses the prediction of masked node attributes as the pre-training task: part of the graph's node features are masked in the input, and the model must learn molecular structural information to predict the masked attributes. Its loss function L_2D is given by formula (13):
L_2D = −Σ_(i∈M) log p(z_i | G_M) (13)
where p is a conditional probability, z_i is the output of the last Transformer block for node i, and G_M is the masked molecular graph;
The three-dimensional coordinate denoising task perturbs the molecular geometry by adding Gaussian noise ε to the atomic three-dimensional coordinates R of the input, and minimizes the difference between the noise predicted by the model and the injected noise. The model's predicted noise for the k-th coordinate dimension, ε̂_i^k, is given by formula (14):
ε̂_i^k = Σ_(j∈V) att_(ij)·Δ_(ij)^k·(x_j·W_1)·W_2 (14)
where att_(ij) is the attention score between nodes i and j, W_1 and W_2 are learnable parameters, Δ_(ij)^k is the k-th coordinate dimension of Δ_(ij), and Δ_(ij) is the relative position of nodes i and j, given by formula (15):
Δ_(ij) = (r_i − r_j) / ||r_i − r_j|| (15)
The loss function of the three-dimensional coordinate denoising task is given by formula (16):
L_3D = (1/|V|) Σ_(i∈V) ||ε_i − ε̂_i||^2 (16)
where V is the set of all nodes in the graph, |V| the number of nodes, ε_i the true coordinate noise of the i-th node, and ε̂_i the predicted coordinate noise of the i-th node;
The loss function of the molecular large model based on multidimensional molecular information is given by formula (17):
L = α·L_2D + β·L_3D (17)
where α is the loss weight of the two-dimensional masked-node attribute prediction task and β the loss weight of the three-dimensional coordinate denoising task.
The invention also provides a molecular large model based on the multi-dimensional molecular information, which is obtained by adopting the molecular large model construction method based on the multi-dimensional molecular information.
The invention also provides an application of the model in the biomedical field: a data set for a downstream task is fed into the multidimensional molecular large model for fine-tuning, yielding the output corresponding to that downstream task.
The molecular large model based on multidimensional molecular information, its construction method and its application can help the biomedical field better understand molecular structure and chemical principles from the perspective of artificial intelligence, thereby revealing the internal mechanisms of molecules. By fully exploiting two-dimensional and three-dimensional molecular information, the model learns effective molecular representations, broadly supports downstream tasks such as molecular property prediction and molecular generation, and accelerates drug research and development. Specifically:
(1) A large-scale graph pre-training data set is constructed that contains both two-dimensional and three-dimensional information, remedying the single-dimensional information mining of existing models;
(2) A graph representation learning method with strong expressive power is designed, which fully mines the topology of the graph, simulates potential-energy transformations in three-dimensional geometric space, and interacts with the attention matrix, enlarging the range of applicable downstream tasks;
(3) The multidimensional self-supervised learning tasks make full use of the two-dimensional and three-dimensional graph structure information, further improving the representation capability of the model.
Drawings
FIG. 1 is a schematic flow diagram of a molecular large model construction method based on multidimensional molecular information according to an embodiment of the invention;
FIG. 2 is a schematic flow chart of the Transformer according to an embodiment of the present invention.
Detailed Description
Embodiments of the present invention are described in further detail below with reference to the accompanying drawings and examples. The following examples are illustrative of the invention but are not intended to limit the scope of the invention.
The embodiment provides a molecular large model construction method based on multi-dimensional molecular information, as shown in fig. 1, comprising the following steps:
step 1: the method comprises the steps of constructing an unsupervised pre-training data set, wherein data of the constructed unsupervised pre-training data set are derived from a PubChem database and a ZINC database, each piece of molecular data in the unsupervised pre-training data set comprises an ID number of the molecular data in an original database and a one-dimensional molecular SMILES representation, 1.1 parts per million of molecular data is derived from a PubChem database, and 10 parts per million of molecular data is derived from the ZINC database.
SMILES is a linear notation that contains only simple atom and bond symbols and few grammar rules, yet can represent molecular information. Because it resembles text, molecular properties can be predicted by learning this sequence representation with a language model, borrowing the learning paradigms of natural language processing. This approach nevertheless has several problems. First, SMILES cannot fully capture molecular structural information, such as the similarity between two molecules, so the model cannot fully exploit structure and the final performance suffers. Second, one molecule can be written as several different SMILES strings, which biases learning and hurts performance. Finally, because the input exists only in SMILES form, the input formats of downstream tasks such as molecular property prediction are greatly restricted, and the approach cannot be applied directly to large-scale drug screening.
Since the same molecule can have several different SMILES forms, SMILES strings cannot be used directly to match and deduplicate compounds. A one-to-one correspondence between molecules and SMILES sequences is therefore guaranteed by preprocessing the unsupervised pre-training data set: removing hydrogen atoms, removing charges, removing small fragments, removing chirality and standardizing tautomers, while preserving the backbone structural representation of the molecule, i.e. its ID number in the original database and its one-dimensional SMILES representation.
Conformation generation on the unsupervised pre-training data set is performed with the RDKit toolkit in the following steps:
generating a preliminary molecular conformation based on distance geometry;
refining the molecular conformation with the ETKDG method;
optimizing the molecular conformation with the MMFF force field.
Finally, a 2D graph representation of each molecule is generated from the standardized SMILES sequence, and the atomic features, chemical-bond features, adjacency matrix of the molecular graph and the atomic three-dimensional coordinates produced by conformation generation are stored together in a .pt file so the model can load them directly.
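The preprocessing-plus-conformation pipeline above can be sketched with RDKit, the toolkit the embodiment names. This is a minimal illustration, not the patent's exact code: in RDKit the distance-geometry embedding and the ETKDG refinement are folded into a single `EmbedMolecule` call, and the example molecule (ethanol) is chosen here only for demonstration.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

def generate_conformation(smiles: str):
    """Sketch of the three-step conformation pipeline: distance-geometry
    embedding with ETKDG refinement, followed by MMFF optimization."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"unparsable SMILES: {smiles}")
    mol = Chem.AddHs(mol)                      # hydrogens are needed for embedding
    params = AllChem.ETKDGv3()                 # ETKDG: knowledge-based distance geometry
    params.randomSeed = 42                     # reproducible embedding
    if AllChem.EmbedMolecule(mol, params) != 0:
        raise RuntimeError("conformer embedding failed")
    AllChem.MMFFOptimizeMolecule(mol)          # refine with the MMFF force field
    mol = Chem.RemoveHs(mol)                   # the patent's preprocessing drops hydrogens
    return mol.GetConformer().GetPositions()   # (n_heavy_atoms, 3) coordinates

coords = generate_conformation("CCO")          # ethanol as a toy input
print(coords.shape)
```

The returned coordinate array is what would be stored in the .pt file alongside the atomic features and adjacency matrix.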
In this embodiment, the pre-training stage adopts the constructed unsupervised pre-training data set, and performs pre-training according to the set multi-dimensional self-supervision task, so that the model has a certain molecular characterization capability, and can be effectively generalized to various downstream tasks.
Step 2: structurally encode the molecular graphs in the molecular pre-training data set to obtain initialized atomic features, and input the initialized atomic features into the Transformer, as shown in fig. 2;
In molecular representation learning the Transformer can effectively capture global information, which is particularly important for molecular characterization. The Transformer is therefore used as the backbone of the molecular large-model framework; it comprises a plurality of Transformer blocks, each consisting of a self-attention layer and a feed-forward neural network layer, with a standard normalization operation applied to both.
The molecular graph is defined by formula (1):
G = {X_atom, A, E, R} (1)
where X_atom ∈ R^(n×d) is the atomic node feature matrix, n is the number of atoms, d is the atomic feature dimension, and X_atom contains the intrinsic attributes of the atoms; A is the adjacency matrix of the molecular graph, covering its first-order topology; E is the set of edges of the molecular graph; and R ∈ R^(n×3) contains the three-dimensional geometric coordinates of the molecule.
Position encoding is an indispensable component of the Transformer. When the node features are input, node degree encoding, random-walk position encoding and three-dimensional distance encoding are integrated alongside the atomic node feature matrix. For graph structures there is no fixed node order, which makes position encoding difficult. Compared with other schemes, random-walk-based position encoding has lower computational complexity, avoids the eigenvalue sign-ambiguity problem, computes positional information mainly from the adjacency matrix and the degrees, and can provide distinct position encodings for nodes with different k-hop topological neighborhoods.
The initialized atomic features comprise the atomic node feature matrix, node degree encoding, random-walk position encoding and three-dimensional distance encoding, as shown in formula (2):
X_0 = [X_atom | X_degree | X_RW | X_3D] (2)
where X_degree is the node degree encoding, X_RW the random-walk position encoding and X_3D the three-dimensional distance encoding;
The node degree encoding X_degree is given by formula (3):
X_degree = f_α(D) (3)
where D is the degree matrix of the molecular graph and f_α is a mapping function of the degree information,
The random-walk position encoding X_RW is given by formula (4):
X_RW^i = [RW_ii, (RW^2)_ii, ..., (RW^m)_ii] (4)
where X_RW^i is the random-walk position encoding of node i, m is the dimension of the random-walk position encoding, and RW is the random-walk operator matrix given by formula (5):
RW = A·D^(-1) (5)
where D^(-1) is the inverse of the degree matrix,
The three-dimensional distance encoding X_3D is given by formula (6):
X_3D^i = (1/|U(i)|) Σ_(j∈U(i)) ||r_i − r_j|| (6)
where X_3D^i is the three-dimensional distance encoding of node i, U(i) is the set of neighbor nodes of node i, |U(i)| is the number of neighbors of node i, ||r_i − r_j|| is the distance between nodes i and j, and r_i and r_j are the coordinates of nodes i and j.
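The structural encodings of formulas (3)-(6) can be sketched in NumPy. This is an illustrative sketch under stated assumptions: the learnable mapping f_α is omitted (raw degrees are returned), and the aggregated-neighbor-distance form of the 3D encoding follows the surrounding prose rather than the (unrendered) original formula.

```python
import numpy as np

def init_atom_features(A: np.ndarray, R: np.ndarray, m: int = 4):
    """Node degree, random-walk positional encoding (diagonals of RW^1..RW^m,
    with RW = A D^-1 as in formula (5)), and mean neighbor distance."""
    n = A.shape[0]
    deg = A.sum(axis=1)                          # node degrees (input to f_alpha)
    D_inv = np.diag(1.0 / deg)
    RW = A @ D_inv                               # formula (5): random-walk operator
    P = np.eye(n)
    x_rw = np.empty((n, m))
    for k in range(m):                           # formula (4): diagonals of powers
        P = P @ RW
        x_rw[:, k] = np.diag(P)
    x_3d = np.array([                            # formula (6): mean neighbor distance
        np.mean([np.linalg.norm(R[i] - R[j]) for j in np.nonzero(A[i])[0]])
        for i in range(n)
    ])
    return deg, x_rw, x_3d

# toy triangle "molecule": 3 mutually bonded atoms at unit spacing
A = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
R = np.array([[0.0, 0, 0], [1.0, 0, 0], [0.5, 0.8660254, 0]])
deg, x_rw, x_3d = init_atom_features(A, R)
print(deg)        # [2. 2. 2.]
print(x_3d)       # each atom is ~1.0 from both neighbors
```

On the triangle the first random-walk component is 0 (no 1-step return) and the second is 0.5 (return probability after 2 steps), illustrating how the encoding separates topological roles.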
Step 3: integrate the shortest-path structural encoding, the edge-information encoding and the three-dimensional distance-pair encoding into the self-attention layers of the Transformer; during training the three-dimensional distance-pair encoding interacts with the self-attention layers, and the node-pair features of the self-attention layers are iteratively updated;
The shortest-path structural encoding, edge-information encoding and three-dimensional distance-pair encoding are integrated into the self-attention layers of the Transformer as bias terms, as shown in formula (7):
Att(X)^(l+1) = Att(X)^l + SPD + Edge + Ψ_3D (7)
where Att(X)^(l+1) is the (l+1)-th self-attention layer, Att(X)^l the l-th self-attention layer, SPD the shortest-path structural encoding, Edge the edge-information encoding, and Ψ_3D the three-dimensional distance-pair encoding,
The shortest-path structural encoding SPD is given by formula (8):
SPD = f_φ(F) (8)
where F holds the shortest paths between node pairs of the molecular graph obtained with the Floyd algorithm, and f_φ is a mapping function of the shortest path,
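The matrix F of formula (8) is the classic all-pairs shortest-path table; a minimal NumPy Floyd(-Warshall) sketch (hop distances only, with the learned mapping f_φ omitted):

```python
import numpy as np

def floyd_shortest_paths(A: np.ndarray) -> np.ndarray:
    """All-pairs shortest hop distances for the SPD encoding of formula (8);
    F[i, j] is the bond-hop distance between atoms i and j."""
    n = A.shape[0]
    F = np.where(A > 0, 1.0, np.inf)      # direct bonds cost 1 hop
    np.fill_diagonal(F, 0.0)
    for k in range(n):                    # classic O(n^3) Floyd relaxation
        F = np.minimum(F, F[:, k:k+1] + F[k:k+1, :])
    return F

# chain "molecule" 0-1-2-3
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
F = floyd_shortest_paths(A)
print(F[0, 3])   # 3 hops from atom 0 to atom 3
```

Each entry of F would then be mapped to a learned scalar bias per attention head, as in Graphormer-style structural encodings.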
The edge-information encoding Edge is given by formula (9):
Edge = g_θ(E) (9)
where g_θ is a mapping function of the edge information,
The three-dimensional distance-pair encoding is given by formula (10):
Ψ_3D^(i,j,k) = exp(−(α_(i,j)·||r_i − r_j|| + β_(i,j) − μ_k)^2 / σ_k^2) (10)
where r_i and r_j are the coordinates of nodes i and j; α_(i,j), β_(i,j), μ_k and σ_k are learnable parameters; α_(i,j) and β_(i,j) are determined by the element types of the atomic nodes, so node pairs formed by different elements have different α_(i,j), β_(i,j); μ_k and σ_k are the parameters of the Gaussian-kernel mapping, and k indexes the Gaussian kernels;
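A NumPy sketch of one plausible reading of the Gaussian-kernel distance-pair encoding of formula (10): the pairwise distance is affinely rescaled by element-pair parameters (α, β) and expanded over K Gaussian kernels (μ, σ). The exact normalization of the kernel is an assumption; in the model all four parameter sets would be learnable rather than fixed as here.

```python
import numpy as np

def gaussian_pair_encoding(R, alpha, beta, mu, sigma):
    """Expand every pairwise distance into K Gaussian-kernel features.
    alpha, beta: (n, n) element-pair rescaling; mu, sigma: (K,) kernel params."""
    d = np.linalg.norm(R[:, None, :] - R[None, :, :], axis=-1)   # (n, n) distances
    z = alpha[..., None] * d[..., None] + beta[..., None]        # (n, n, 1) rescaled
    return np.exp(-((z - mu) ** 2) / (2 * sigma ** 2))           # (n, n, K) features

n, K = 3, 8
R = np.random.default_rng(0).normal(size=(n, 3))
alpha = np.ones((n, n))                  # stand-ins for the learned element-pair
beta = np.zeros((n, n))                  # parameters alpha_(i,j), beta_(i,j)
mu = np.linspace(0.0, 4.0, K)
sigma = np.full(K, 0.5)
psi = gaussian_pair_encoding(R, alpha, beta, mu, sigma)
print(psi.shape)   # (3, 3, 8): one K-dimensional feature per atom pair
```

The soft binning over kernel centers μ_k gives the model a smooth, differentiable view of interatomic distance, unlike a hard distance cutoff.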
The three-dimensional distance pair encoding interacts with the self-attention layers of the Transformer; the iterative update of the self-attention layers is shown in formulas (11)-(12):
wherein the initial feature of node pair i-j, the mapping matrix M, and the feature of node pair i-j at the l-th self-attention layer appear as defined in formulas (11)-(12); H is the number of attention heads, d is the dimension of the hidden layer, and the query and key of the h-th head of the l-th self-attention layer are those of the standard attention computation.
The updated node-pair features serve as the bias term of the next self-attention layer.
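The bias-term scheme of formula (7), in which the structural encodings are added to the attention logits before the softmax, can be sketched for a single attention head (generic shapes assumed; this is not the exact patented layer):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def biased_self_attention(X, Wq, Wk, Wv, spd, edge, psi3d):
    """Single-head self-attention with SPD / edge / 3D-pair bias matrices.

    X:  (n, d) node features; Wq, Wk, Wv: (d, d) projection matrices.
    spd, edge, psi3d: (n, n) bias matrices added to the attention logits.
    Returns the attended node features and the attention matrix.
    """
    d = X.shape[1]
    logits = (X @ Wq) @ (X @ Wk).T / np.sqrt(d)
    att = softmax(logits + spd + edge + psi3d)  # biases enter before the softmax
    return att @ (X @ Wv), att
```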
Step 4: define a joint two-dimensional and three-dimensional molecular-graph self-supervised learning task; after training, the molecular large model based on multidimensional molecular information is obtained.
The joint two-dimensional and three-dimensional molecular-graph self-supervised learning task comprises a two-dimensional masked-node attribute prediction task and a three-dimensional coordinate denoising task.
The two-dimensional masked-node attribute prediction task takes predicting the masked node attributes as the pre-training task: part of the node features of the input graph are masked, so that the model must learn molecular structure information to predict the masked attributes. The loss function L 2D of the two-dimensional masked-node attribute prediction task is shown in formula (13):
L2D=-∑i∈Mlogp(zi|GM) (13)
wherein p is the conditional probability, z i denotes the output of the last Transformer block at node i, M denotes the set of masked nodes, and G M denotes the masked molecular graph;
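The masked-attribute objective of formula (13) is a negative log-likelihood over the masked node set M; a minimal sketch with assumed tensor shapes:

```python
import numpy as np

def masked_node_loss(logits, labels, mask_idx):
    """L_2D = -sum_{i in M} log p(z_i | G_M): cross-entropy over masked nodes only.

    logits:   (n, C) per-node attribute logits from the last Transformer block.
    labels:   (n,)   true attribute class of every node.
    mask_idx: indices of the masked nodes M.
    """
    logits = logits - logits.max(axis=1, keepdims=True)              # numerically stable softmax
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -logp[mask_idx, labels[mask_idx]].sum()
```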
The three-dimensional coordinate denoising task perturbs the molecular geometry by adding Gaussian noise to the input atomic three-dimensional coordinate information, and trains the model to minimize the difference between the predicted noise value and the injected noise value; the model-predicted noise output for the k-th coordinate dimension is shown in formula (14):
wherein att ij denotes the attention score between node i and node j, the projection weights are learnable parameters, Δ ij k denotes the component of Δ ij in the k-th coordinate dimension, and Δ ij denotes the relative position information of node i and node j, as shown in formula (15):
the loss function of the three-dimensional coordinate denoising task is shown in formula (16):
wherein V denotes the set of all nodes in the graph, |V| denotes the number of nodes, ε i denotes the true coordinate noise of the i-th node, and its predicted counterpart denotes the coordinate noise predicted for the i-th node;
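The denoising objective can be sketched as below; a squared-error penalty between injected and predicted noise is assumed here, since formula (16) is not legible in this text, and the helper names are illustrative:

```python
import numpy as np

def coordinate_denoising_loss(coords, predict_noise, scale=0.1, rng=None):
    """Perturb equilibrium coordinates with Gaussian noise and score the model's estimate.

    coords:        (n, 3) equilibrium atomic coordinates.
    predict_noise: callable mapping noisy (n, 3) coords to predicted (n, 3) noise.
    Returns the mean squared error between predicted and injected noise
    (an assumed instantiation of formula (16)).
    """
    rng = np.random.default_rng() if rng is None else rng
    eps = scale * rng.standard_normal(coords.shape)  # injected Gaussian noise
    eps_hat = predict_noise(coords + eps)            # model's noise estimate
    return np.mean(np.sum((eps_hat - eps) ** 2, axis=1))
```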
The loss function of the molecular large model based on multidimensional molecular information is shown in formula (17):
L=αL2D+βL3D (17)
wherein α denotes the loss weight of the two-dimensional masked-node attribute prediction task, and β denotes the loss weight of the three-dimensional coordinate denoising task.
Through the design of the joint two-dimensional and three-dimensional molecular-graph self-supervised learning task, the model fuses molecular information from different views and can be saved as a .pt file, so that it can be further fine-tuned for downstream tasks with improved generalization.
This embodiment also provides a molecular large model based on multidimensional molecular information, obtained by the above molecular large model construction method based on multidimensional molecular information.
This embodiment also provides an application of the molecular large model based on multidimensional molecular information in the biomedical field: a downstream-task dataset is input into the model for fine-tuning to obtain the output corresponding to the downstream task, where the downstream-task datasets include molecular property prediction datasets, three-dimensional coordinate generation datasets and drug screening datasets. The model can fully mine molecular information in the biomedical field, learn molecular representations by modeling two-dimensional structural information and potential-energy transformations in three-dimensional geometric space, improve performance on downstream tasks such as molecular property prediction, target prediction and molecular synthesis, and accelerate drug screening, thereby providing important support for drug research and development.
This embodiment provides an application of the molecular large model based on multidimensional molecular information to a fine-tuning dataset in the biomedical field. In general, molecular label information can positively guide the performance of a trained model. After the molecular large model of this embodiment completes its pre-training tasks, it can produce three kinds of output (a graph feature vector, a node feature matrix and a three-dimensional coordinate matrix) that connect to various downstream tasks, for example: completing a molecular property prediction task with the graph feature vector, completing a molecular pose prediction task with the three-dimensional coordinate matrix, and so on.
The dataset used in this embodiment is the PCQM4Mv2 dataset. Using the HOMO-LUMO values provided in PCQM4Mv2, the supervised downstream task is defined as a regression task, predicting the quantum properties of the molecular graph, so as to optimize the model parameters obtained in the self-supervised learning process. The procedure includes:
Step 1: construct a supervised fine-tuning dataset whose data are drawn from the PCQM4Mv2 dataset; PCQM4Mv2 contains 3.4 million organic molecules and records the three-dimensional conformation at molecular equilibrium and the HOMO-LUMO energy gap, both calculated using density functional theory.
Step 2: preprocess the supervised fine-tuning dataset. By removing hydrogen atoms, removing charges, removing small fragments, removing chirality and standardizing tautomers, the backbone structural representation of each molecule is retained; this representation comprises the molecule's ID number in the original database and its one-dimensional SMILES representation, ensuring a one-to-one correspondence between molecules and SMILES sequences.
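The cleanup steps of Step 2 can be sketched with the RDKit toolkit; the exact options and ordering used in the embodiment are assumptions, but each operation below maps to one of the listed steps:

```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize

def preprocess_smiles(smiles: str) -> str:
    """Backbone cleanup: strip Hs, small fragments, charges, chirality,
    and canonicalize tautomers; returns a canonical SMILES string."""
    mol = Chem.MolFromSmiles(smiles)
    mol = Chem.RemoveHs(mol)                           # remove explicit hydrogens
    mol = rdMolStandardize.FragmentParent(mol)         # drop small fragments / counterions
    mol = rdMolStandardize.Uncharger().uncharge(mol)   # neutralize charges
    Chem.RemoveStereochemistry(mol)                    # remove chirality (in place)
    mol = rdMolStandardize.TautomerEnumerator().Canonicalize(mol)
    return Chem.MolToSmiles(mol)
```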
Step 3: in this embodiment, MAE is used as the loss function and an Adam optimizer is used to optimize the learnable parameters of the model; the hyperparameters are tuned on the validation set.
The MAE is calculated as shown in formula (18):
MAE = (1/N)∑ n |ŷ n − y n| (18)
wherein N denotes the number of all molecular graphs in the PCQM4Mv2 dataset, ŷ n denotes the output for the n-th molecule, and y n denotes the true label of the n-th molecule.
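Formula (18) is the standard mean absolute error; a minimal sketch:

```python
import numpy as np

def mae(y_pred, y_true):
    """Mean absolute error over the N molecular graphs: (1/N) * sum |y_hat_n - y_n|."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    return np.abs(y_pred - y_true).mean()
```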
The embodiments of the invention have been presented for purposes of illustration and description, and are not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art. The embodiments were chosen and described in order to best explain the principles of the invention and the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.

Claims (3)

1. The molecular large model construction method based on multidimensional molecular information is characterized by comprising the following steps:
Constructing an unsupervised pre-training dataset, and performing preprocessing and molecular conformation generation on the unsupervised pre-training dataset to obtain a molecular pre-training dataset formed of molecular graphs;
Carrying out structural encoding on the molecular graphs in the molecular pre-training dataset to obtain initialized atomic features, and inputting the initialized atomic features into a Transformer;
integrating the shortest-path structural encoding, the edge information encoding and the three-dimensional distance pair encoding into the self-attention layers of the Transformer, the three-dimensional distance pair encoding interacting with the self-attention layers of the Transformer during training and iteratively updating the node-pair features of the self-attention layers;
Defining a two-dimensional space and three-dimensional space joint molecular graph self-supervision learning task, and obtaining a molecular large model based on multidimensional molecular information after training;
The Transformer comprises a plurality of Transformer blocks, each block consisting of a self-attention layer and a feed-forward neural network layer, with a standard normalization operation performed on the self-attention layer and the feed-forward neural network layer;
the molecular graph is shown in formula (1):
G={Xatom,A,E,R} (1)
wherein X atom is the atomic node feature matrix, n denotes the number of atoms, d denotes the atomic feature dimension, X atom contains the intrinsic attributes of the atoms, A denotes the adjacency matrix of the molecular graph and covers its first-order topological information, E denotes the set of edges of the molecular graph, and R denotes the geometric space coordinates of the molecule in three-dimensional space;
The initialized atomic characteristics comprise an atomic node characteristic matrix, node degree codes, random walk position codes and three-dimensional distance codes, and the initialized atomic characteristics are shown in a formula (2):
X0=[Xatom|Xdegree|XRW|X3D] (2)
Wherein, X degree represents node degree code, X RW represents random walk position code, and X 3D represents three-dimensional distance code;
The node degree code X degree is shown in formula (3):
Xdegree=fα(D) (3)
Wherein D represents a degree matrix of the molecular graph, f α is a mapping function of the degree information,
The random walk position code X RW is shown in formula (4):
wherein the random-walk positional encoding of node i has dimension m, and RW is the random-walk operation result matrix, shown in formula (5):
RW=AD-1 (5)
wherein D -1 denotes the inverse of the degree matrix,
The three-dimensional distance encoding is shown in formula (6):
wherein the three-dimensional distance encoding of node i is computed over U(i), the set of neighbor nodes of node i; |U(i)| denotes the number of neighbors of node i, ||r i-r j|| denotes the distance between node i and node j, r i denotes the coordinate information of node i, and r j denotes the coordinate information of node j;
And the shortest-path structural encoding, the edge information encoding and the three-dimensional distance pair encoding are incorporated into the self-attention layer of the Transformer as bias terms, as shown in formula (7):
wherein Att(X) l+1 denotes the (l+1)-th self-attention layer, Att(X) l denotes the l-th self-attention layer, SPD denotes the shortest-path structural encoding, Edge denotes the edge information encoding, and the remaining bias term denotes the three-dimensional distance pair encoding,
The shortest-path structural encoding SPD is shown in formula (8):
wherein F is the matrix of shortest paths between node pairs in the molecular graph, obtained with the Floyd algorithm, and the shortest-path mapping function maps F to SPD,
The edge information encoding Edge is shown in formula (9):
Edge=gθ(E) (9)
where g θ is the mapping function for edge information,
The three-dimensional distance pair encoding is shown in formula (10):
wherein r i denotes the coordinate information of node i, r j denotes the coordinate information of node j, and α i,j, β i,j, μ k, σ k are learnable parameters; α i,j and β i,j are controlled by the element types of the atomic nodes, so node pairs formed by different elements have different α i,j and β i,j; μ k and σ k are the parameters of the Gaussian kernel mapping, and k denotes the number of Gaussian kernels;
The three-dimensional distance pair encoding interacts with the self-attention layers of the Transformer; the interaction and the iterative update of the node-pair feature matrix and node features by the self-attention layers are shown in formulas (11)-(12):
wherein the initial feature of node pair i-j, the mapping matrix M, and the feature of node pair i-j at the l-th self-attention layer appear as defined in formulas (11)-(12); H is the number of attention heads, d is the dimension of the hidden layer, and the query and key of the h-th head of the l-th self-attention layer are those of the standard attention computation,
The updated node-pair features serve as the bias term of the next self-attention layer;
The defined joint two-dimensional and three-dimensional molecular-graph self-supervised learning task comprises a two-dimensional masked-node attribute prediction task and a three-dimensional coordinate denoising task;
the two-dimensional masked-node attribute prediction task takes predicting the masked node attributes as the pre-training task: part of the node features of the input graph are masked, so that the model must learn molecular structure information to predict the masked attributes; the loss function L 2D of the two-dimensional masked-node attribute prediction task is shown in formula (13):
L2D=-∑i∈Mlogp(zi∣GM) (13)
wherein p is the conditional probability, z i denotes the output of the last Transformer block at node i, M denotes the set of masked nodes, and G M denotes the masked molecular graph;
the three-dimensional coordinate denoising task perturbs the molecular geometry by adding Gaussian noise to the input atomic three-dimensional coordinate information, and trains the model to minimize the difference between the predicted noise value and the injected noise value; the model-predicted noise output for the k-th coordinate dimension is shown in formula (14):
wherein att ij denotes the attention score between node i and node j, the projection weights are learnable parameters, Δ ij k denotes the component of Δ ij in the k-th coordinate dimension, and Δ ij denotes the relative position information of node i and node j, as shown in formula (15):
the loss function of the three-dimensional space coordinate denoising task is shown in formula (16):
wherein V denotes the set of all nodes in the graph, |V| denotes the number of nodes, ε i denotes the true coordinate noise of the i-th node, and its predicted counterpart denotes the coordinate noise predicted for the i-th node;
the loss function of the molecular large model based on the multidimensional molecular information is shown in a formula (17):
L=αL2D+βL3D (17)
wherein α denotes the loss weight of the two-dimensional masked-node attribute prediction task, and β denotes the loss weight of the three-dimensional coordinate denoising task.
2. The method of claim 1, wherein the preprocessing comprises removing hydrogen atoms, removing charges, removing small fragments, removing chirality and standardizing tautomers, and retaining a backbone structural representation of the molecules, wherein the backbone structural representation comprises the ID number of the molecule in an original database and a one-dimensional molecular SMILES representation.
3. The method for constructing a molecular large model based on multidimensional molecular information according to claim 1, wherein the molecular conformation generation process is performed with the RDKit toolkit and comprises the following steps:
generating a preliminary molecular conformation based on the distance geometry;
modifying the molecular conformation based on the ETKDG method;
The molecular conformation was optimized based on MMFF force field.
CN202311574206.0A 2023-11-23 2023-11-23 Molecular large model based on multidimensional molecular information, construction method and application Active CN117524353B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311574206.0A CN117524353B (en) 2023-11-23 2023-11-23 Molecular large model based on multidimensional molecular information, construction method and application


Publications (2)

Publication Number Publication Date
CN117524353A CN117524353A (en) 2024-02-06
CN117524353B true CN117524353B (en) 2024-05-10

Family

ID=89766059

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311574206.0A Active CN117524353B (en) 2023-11-23 2023-11-23 Molecular large model based on multidimensional molecular information, construction method and application

Country Status (1)

Country Link
CN (1) CN117524353B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117912575B (en) * 2024-03-19 2024-05-14 苏州大学 Atomic importance analysis method based on multi-dimensional molecular pre-training model

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113241128A (en) * 2021-04-29 2021-08-10 天津大学 Molecular property prediction method based on molecular space position coding attention neural network model
CN113299354A (en) * 2021-05-14 2021-08-24 中山大学 Small molecule representation learning method based on Transformer and enhanced interactive MPNN neural network
CN114566232A (en) * 2022-02-17 2022-05-31 北京百度网讯科技有限公司 Molecular characterization model training method and device and electronic equipment
WO2023029351A1 (en) * 2021-08-30 2023-03-09 平安科技(深圳)有限公司 Self-supervised learning-based method, apparatus and device for predicting properties of drug small molecules
CN115831261A (en) * 2022-11-14 2023-03-21 浙江大学杭州国际科创中心 Three-dimensional space molecule generation method and device based on multi-task pre-training inverse reinforcement learning
CN116052792A (en) * 2023-01-31 2023-05-02 杭州碳硅智慧科技发展有限公司 Training method and device for molecular optimal conformation prediction model
WO2023153882A1 (en) * 2022-02-11 2023-08-17 Samsung Display Co., Ltd. Method for optimizing properties of a molecule
CN116978483A (en) * 2023-07-31 2023-10-31 浙江大学 Molecular property prediction method and system based on graphic neural network and three-dimensional encoder
CN116978481A (en) * 2023-03-23 2023-10-31 腾讯科技(深圳)有限公司 Molecular attribute prediction method, device, electronic equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111755078B (en) * 2020-07-30 2022-09-23 腾讯科技(深圳)有限公司 Drug molecule attribute determination method, device and storage medium


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Molecular Transformer: A Model for Uncertainty-Calibrated Chemical Reaction Prediction;Philippe Schwaller ET AL;《ACS Cent. Sci.》;20190830;全文 *
Unified 2D and 3D Pre-Training of Molecular Representations;Jinhua Zhu ET AL;《https:arxiv.org/abs/2207.08806》;20220818;全文 *
Research progress of deep-learning-based 3D molecule generation models; SCIENTIA SINICA Chimica; 20230209; full text *
Feng Qianjin, Liu Runlan; Graph theory and topology of prescriptions: principles and research methods of chemical graph theory and molecular topology of prescription structures (continued); Journal of Shanxi College of Traditional Chinese Medicine; 20131028 (No. 05); full text *

Also Published As

Publication number Publication date
CN117524353A (en) 2024-02-06

Similar Documents

Publication Publication Date Title
Zhang et al. Hierarchical graph pooling with structure learning
Tahmasebi et al. Machine learning in geo-and environmental sciences: From small to large scale
Kampffmeyer et al. Rethinking knowledge graph propagation for zero-shot learning
Diallo et al. Deep embedding clustering based on contractive autoencoder
Li et al. Deep learning methods for molecular representation and property prediction
JP2023082017A (en) computer system
CN113707235A (en) Method, device and equipment for predicting properties of small drug molecules based on self-supervision learning
CN112905801B (en) Stroke prediction method, system, equipment and storage medium based on event map
CN117524353B (en) Molecular large model based on multidimensional molecular information, construction method and application
CN109858015A (en) A kind of semantic similarity calculation method and device based on CTW and KM algorithm
Song et al. Geologist-level wireline log shape identification with recurrent neural networks
CN111476261A (en) Community-enhanced graph convolution neural network method
Zhou et al. M-evolve: structural-mapping-based data augmentation for graph classification
Herath et al. Topologically optimal design and failure prediction using conditional generative adversarial networks
Rizvi et al. Spectrum of advancements and developments in multidisciplinary domains for generative adversarial networks (GANs)
CN116741307A (en) Three-dimensional molecular structure simulation method for synthesis and screening of lead compounds
Liu et al. A systematic machine learning method for reservoir identification and production prediction
CN112270950A (en) Fusion network drug target relation prediction method based on network enhancement and graph regularization
CN112529057A (en) Graph similarity calculation method and device based on graph convolution network
CN116208399A (en) Network malicious behavior detection method and device based on metagraph
Liu et al. Object detection via inner-inter relational reasoning network
Peng et al. Pocket-specific 3d molecule generation by fragment-based autoregressive diffusion models
Fan et al. Gated graph pooling with self-loop for graph classification
Zhu et al. Structural landmarking and interaction modelling: a “slim” network for graph classification
Rolon et al. A multi-class structured dictionary learning method using discriminant atom selection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant