CN117524353B - Molecular large model based on multidimensional molecular information, construction method and application - Google Patents
Molecular large model based on multidimensional molecular information, construction method and application
- Publication number
- CN117524353B CN117524353B CN202311574206.0A CN202311574206A CN117524353B CN 117524353 B CN117524353 B CN 117524353B CN 202311574206 A CN202311574206 A CN 202311574206A CN 117524353 B CN117524353 B CN 117524353B
- Authority
- CN
- China
- Prior art keywords
- molecular
- node
- information
- self
- dimensional
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16C—COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
- G16C20/00—Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
- G16C20/50—Molecular design, e.g. of drugs
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06F18/2155—Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the incorporation of unlabelled data, e.g. multiple instance learning [MIL], semi-supervised techniques using expectation-maximisation [EM] or naïve labelling
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16C—COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
- G16C20/00—Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
- G16C20/70—Machine learning, data mining or chemometrics
Abstract
The invention provides a molecular large model based on multidimensional molecular information, a construction method and an application thereof. The method comprises: constructing an unsupervised pre-training dataset, then preprocessing it and generating molecular conformations to obtain a molecular pre-training dataset composed of molecular graphs; structurally encoding the molecular graphs in the pre-training dataset to obtain initialized atomic features, which are input into a Transformer; integrating the shortest path structure encoding, the edge information encoding and the three-dimensional distance pair encoding into the self-attention layer of the Transformer, with the three-dimensional distance pair encoding interacting with the self-attention layer during training and the node pair features of the self-attention layer updated iteratively; and defining a joint two-dimensional and three-dimensional molecular graph self-supervised learning task, training on which yields the molecular large model based on multidimensional molecular information. The invention can accelerate drug screening and assist drug research and development.
Description
Technical Field
The invention belongs to the field of artificial intelligence, and particularly discloses a molecular large model construction method based on multidimensional molecular information.
Background
Traditional drug development is a complex and time-consuming process involving many stages, such as potential target identification, compound optimization and bioactivity evaluation, and it requires significant manpower, materials and funding. A large model can mine and analyze massive biomedical data and rapidly screen out potential drug molecules, which accelerates drug research and development, reduces labor, material and cost inputs, and provides strong support for innovation and development in the intelligent medicine industry. Existing datasets such as ZINC contain only two-dimensional molecular information, which limits a model's ability to learn spatial information. Some public datasets do contain three-dimensional spatial information, such as PCQM4Mv2 or QM9, but their limited size cannot meet the requirements of current molecular large models, so it is necessary to construct a large-scale molecular dataset that contains both two-dimensional planar and three-dimensional stereo information. The success of large models is inseparable from the Transformer, yet the standard Transformer cannot fully characterize the structural information in a graph, which makes large-model learning on large-scale graph data difficult. In 2021, Ying et al. proposed Graphormer, which improves the modeling of structural information by introducing graph structural encodings into the Transformer, but this approach lacks three-dimensional information learning, which limits the scope of application of the model.
Building on Graphormer, Luo et al. therefore introduced three-dimensional information learning with a three-dimensional position encoding; however, this method only adds extra information to the attention matrix and does not improve the modeling of three-dimensional coordinates, which again limits the scope of application of the model.
Existing pre-training models have further limitations. Sequence representations such as SMILES strings cannot capture structural information well, constrained as they are by their sequential form. For contrastive pre-training tasks, improper data augmentation produces false positive samples, and this biases what the model learns. Another approach is generative pre-training: a certain proportion of atoms and edges are masked, and the model is trained to predict the attributes of the masked parts. Generative pre-training tasks can, however, be too simple; in chemistry, nature contains only 118 elements and the data are severely imbalanced, so molecular representation learning may fail to fully exploit chemical prior knowledge, hurting performance. From the three-dimensional perspective, GEM introduces the spatial structure information of a compound into the pre-training encoding process, but it only uses a bond-angle graph as additional spatial information and does not fully exploit the three-dimensional coordinates; Transformer-M first proposed using two-dimensional and three-dimensional information jointly for representation learning, but it is limited by the number of available three-dimensional equilibrium conformations and cannot be widely applied to large-scale datasets.
In summary, the existing methods all have certain limitations: (1) the biomedical field needs a molecular large model that predicts molecular properties such as bioactivity and side effects to accelerate drug screening, or generates molecules with specific properties and structures to provide candidates for drug design and discovery; (2) most existing models depend on small-scale datasets, are limited by the number of available three-dimensional equilibrium conformations, and lack a large-scale graph dataset suitable for training a molecular large model; (3) existing models have limited ability to characterize molecular information and cannot fully learn three-dimensional spatial information, so a model with high expressive capability is needed.
Disclosure of Invention
The invention provides a molecular large model based on multidimensional molecular information, a construction method and an application thereof, aiming to solve the following problems: the existing biomedical field lacks a large-scale graph dataset usable for molecular large model training, and existing models have limited ability to represent molecular information and cannot fully learn three-dimensional spatial information.
The invention provides a molecular large model construction method based on multi-dimensional molecular information, which comprises the following steps:
Constructing an unsupervised pre-training data set, and performing preprocessing and molecular conformation generation processing on the unsupervised pre-training data set to obtain a molecular pre-training data set formed by a molecular diagram;
Carrying out structural coding on a molecular diagram in the molecular pre-training data set to obtain initialized atomic features, and inputting the initialized atomic features into a Transformer;
Integrating the shortest path structure encoding, the edge information encoding and the three-dimensional distance pair encoding into the self-attention layer of the Transformer, having the three-dimensional distance pair encoding interact with the self-attention layer of the Transformer during training, and iteratively updating the node pair features of the self-attention layer;
And defining a two-dimensional space and three-dimensional space combined molecular graph self-supervision learning task, and obtaining a molecular large model based on multidimensional molecular information after training.
A method of molecular large model construction based on multidimensional molecular information according to some embodiments of the present application, wherein the preprocessing includes removing hydrogen atoms, removing charges, removing small fragments, removing chirality and standardizing tautomers, retaining a backbone structural representation of the molecule that includes the ID number of the molecule in the original database and a one-dimensional molecular SMILES representation.
A method for constructing a molecular large model based on multidimensional molecular information according to some embodiments of the present application, wherein the molecular conformation generation process is performed with the RDKit toolkit in the following steps:
Generating a preliminary molecular conformation based on distance geometry;
Refining the molecular conformation with the ETKDG method;
Optimizing the molecular conformation with the MMFF force field.
According to some embodiments of the application, in the method for constructing the molecular large model based on multidimensional molecular information, the Transformer comprises a plurality of Transformer blocks, each consisting of a self-attention layer and a feed-forward neural network layer, with a normalization operation applied to the self-attention layer and the feed-forward neural network layer.
According to some embodiments of the application, in the molecular large model construction method based on multidimensional molecular information, the molecular graph is defined as shown in formula (1):

G = {X_atom, A, E, R} (1)

where X_atom ∈ ℝ^{n×d} is the atomic node feature matrix, n is the number of atoms, d is the atomic feature dimension, and X_atom contains the intrinsic attributes of the atoms; A is the adjacency matrix of the molecular graph, covering its 1st-order topology; E is the set of edges of the molecular graph; and R ∈ ℝ^{n×3} contains the three-dimensional geometric coordinates of the molecule.
According to some embodiments of the application, the initialized atomic features comprise the atomic node feature matrix, the node degree encoding, the random walk position encoding and the three-dimensional distance encoding, as shown in formula (2):

x^0 = [X_atom | X_degree | X_RW | X_3D] (2)

where X_degree denotes the node degree encoding, X_RW denotes the random walk position encoding, and X_3D denotes the three-dimensional distance encoding.

The node degree encoding X_degree is shown in formula (3):

X_degree = f_α(D) (3)

where D is the degree matrix of the molecular graph and f_α is a mapping function of the degree information.

The random walk position encoding X_RW is shown in formula (4):

X_RW^i = [RW_ii, (RW²)_ii, …, (RW^m)_ii] (4)

where X_RW^i is the random walk position encoding of node i, m is the dimension of the random walk position encoding, and RW is the random walk matrix, given by formula (5):

RW = A D^{-1} (5)

where D^{-1} is the inverse of the degree matrix.

The three-dimensional distance encoding X_3D is shown in formula (6):

X_3D^i = (1/|U(i)|) Σ_{j∈U(i)} ‖r_i − r_j‖ (6)

where X_3D^i is the three-dimensional distance encoding of node i, U(i) is the set of neighbor nodes of node i, |U(i)| is the number of neighbors of node i, ‖r_i − r_j‖ is the distance between nodes i and j, and r_i and r_j are the coordinates of nodes i and j.
According to the molecular large model construction method based on multidimensional molecular information in some embodiments of the present application, the shortest path structure encoding, the edge information encoding and the three-dimensional distance pair encoding are integrated into the self-attention layer of the Transformer as bias terms, as shown in formula (7):

Att(X)^{l+1} = softmax(Q^l (K^l)^T / √d + SPD + Edge + Φ_3D^l) V^l (7)

where Att(X)^{l+1} denotes the (l+1)-th self-attention layer, computed from the queries Q^l and keys K^l of the l-th self-attention layer; SPD denotes the shortest path structure encoding; Edge denotes the edge information encoding; and Φ_3D^l denotes the three-dimensional distance pair encoding.

The shortest path structure encoding SPD is shown in formula (8):

SPD = f_β(F) (8)

where F is the matrix of shortest paths between node pairs in the molecular graph obtained by the Floyd algorithm, and f_β is a mapping function of the shortest path.

The edge information encoding Edge is shown in formula (9):

Edge = g_θ(E) (9)

where g_θ is a mapping function of the edge information.

The three-dimensional distance pair encoding is shown in formula (10):

Φ_3D^{ij,k} = (1/(√(2π) σ_k)) · exp(−(α_{i,j} ‖r_i − r_j‖ + β_{i,j} − μ_k)² / (2 σ_k²)) (10)

where r_i and r_j are the coordinates of nodes i and j; α_{i,j}, β_{i,j}, μ_k, σ_k are learnable parameters; α_{i,j} and β_{i,j} are determined by the element types of the atomic node pair, so node pairs formed by different elements have different α_{i,j}, β_{i,j}; μ_k and σ_k are the parameters of the Gaussian kernel mapping; and k indexes the Gaussian kernels.
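For concreteness, the Gaussian-kernel distance pair encoding of formula (10) can be sketched in numpy as follows; the kernel centers, widths and the α, β values are illustrative constants here, whereas in the model they are learnable parameters.

```python
import numpy as np

def gaussian_pair_encoding(r_i, r_j, alpha, beta, mu, sigma):
    """Sketch of the Gaussian-kernel 3D distance pair encoding of formula (10):
    an affine transform of the pair distance, mapped through K Gaussian kernels."""
    d = np.linalg.norm(np.asarray(r_i) - np.asarray(r_j))  # ||r_i - r_j||
    x = alpha * d + beta                                   # pair-type-dependent affine transform
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

# Two nodes 1.5 units apart, K = 4 kernels spread over [0, 3] (illustrative values).
mu = np.linspace(0.0, 3.0, 4)
sigma = np.full(4, 0.5)
phi = gaussian_pair_encoding([0.0, 0.0, 0.0], [1.5, 0.0, 0.0],
                             alpha=1.0, beta=0.0, mu=mu, sigma=sigma)
print(phi.shape)  # one response value per Gaussian kernel
```

The kernels closest to the actual pair distance respond most strongly, giving the attention bias a soft, learnable notion of "how far apart" two atoms are.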
During training, the three-dimensional distance pair encoding interacts with the self-attention layer of the Transformer; the interaction and the iterative update of the self-attention layer are shown in formulas (11)-(12):

e_ij^0 = Φ_3D^{ij} M (11)

e_ij^{l+1} = e_ij^l + [q_i^{l,1} · k_j^{l,1} ‖ … ‖ q_i^{l,H} · k_j^{l,H}] / √(d/H) (12)

where e_ij^0 is the initial feature of node pair i-j, obtained by projecting the three-dimensional distance pair encoding Φ_3D^{ij} with the mapping matrix M; e_ij^l is the feature of node pair i-j at the l-th self-attention layer; H is the number of attention heads; d is the dimension of the hidden layer; q_i^{l,h} is the query of head h for node i at the l-th self-attention layer; and k_j^{l,h} is the corresponding key of head h at the l-th self-attention layer.
The updated node pair features serve as the bias term of the next self-attention layer.
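One plausible reading of formula (12) is that each attention head contributes a scaled query-key dot product to the running pair feature; a minimal numpy sketch under that assumption, with illustrative sizes and random values standing in for learned projections:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, H = 5, 16, 4        # nodes, hidden dim, attention heads (illustrative sizes)
d_h = d // H              # per-head dimension

# Queries and keys of layer l, already split into H heads.
Q = rng.normal(size=(H, n, d_h))
K = rng.normal(size=(H, n, d_h))

def pair_update(e_prev, Q, K):
    """One iteration of the node-pair feature update: head h contributes
    q_i . k_j / sqrt(d_h) for every node pair (i, j)."""
    contrib = np.einsum('hid,hjd->ijh', Q, K) / np.sqrt(d_h)  # shape (n, n, H)
    return e_prev + contrib

e0 = np.zeros((n, n, H))  # initial pair features (e.g. projected 3D pair encoding)
e1 = pair_update(e0, Q, K)
print(e1.shape)           # one H-dimensional feature per node pair
```

Repeating this across layers accumulates pair evidence that is then fed back as the attention bias of the next layer.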
According to the molecular large model construction method based on multidimensional molecular information in some embodiments of the present application, the defined joint two-dimensional and three-dimensional molecular graph self-supervised learning task comprises a two-dimensional masked node attribute prediction task and a three-dimensional coordinate denoising task.

The two-dimensional masked node attribute prediction task uses the prediction of masked node attributes as a pre-training task: part of the node features of the input graph are masked, and the model learns molecular structure information in order to predict the masked attributes. The loss function L_2D of this task is shown in formula (13):

L_2D = −Σ_{i∈M} log p(z_i | G_M) (13)

where p is a conditional probability, z_i is the output of the last Transformer block for node i, and G_M is the masked molecular graph.

The three-dimensional coordinate denoising task adds Gaussian noise ε to the input atomic three-dimensional coordinates, perturbing the molecular geometry, and trains the model to minimize the difference between the predicted noise and the injected noise. The noise predicted by the model for the k-th coordinate dimension of node i is shown in formula (14):

ε̂_i^k = Σ_{j∈V} att_ij (W Δ_ij^k + b) (14)

where att_ij is the attention score between nodes i and j, W and b are learnable parameters, Δ_ij^k is the component of Δ_ij in the k-th coordinate dimension, and Δ_ij is the relative position information of nodes i and j, as shown in formula (15):

Δ_ij = (r_i − r_j) / ‖r_i − r_j‖ (15)

The loss function of the three-dimensional coordinate denoising task is shown in formula (16):

L_3D = (1/|V|) Σ_{i∈V} ‖ε_i − ε̂_i‖² (16)

where V is the set of all nodes in the graph, |V| is the number of nodes, ε_i is the true coordinate noise of the i-th node, and ε̂_i is the predicted coordinate noise of the i-th node.

The overall loss function of the molecular large model based on multidimensional molecular information is shown in formula (17):

L = α L_2D + β L_3D (17)

where α is the loss weight of the two-dimensional masked node attribute prediction task and β is the loss weight of the three-dimensional coordinate denoising task.
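The joint objective of formulas (13), (16) and (17) can be sketched with toy numbers as follows; the logits, labels and noise below are random stand-ins for actual model outputs, and the weights α, β are hyperparameters.

```python
import numpy as np

def masked_node_loss(log_probs, labels, mask_idx):
    """L_2D of formula (13): negative log-likelihood of the true attribute
    of each masked node, summed over the masked set M."""
    return -sum(log_probs[i, labels[i]] for i in mask_idx)

def denoise_loss(eps_true, eps_pred):
    """L_3D of formula (16): squared error between injected and predicted
    coordinate noise, averaged over all |V| nodes."""
    diff = eps_true - eps_pred
    return np.mean(np.sum(diff ** 2, axis=1))

rng = np.random.default_rng(1)
n, n_types = 6, 10                       # toy graph: 6 atoms, 10 atom types
logits = rng.normal(size=(n, n_types))
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
labels = rng.integers(0, n_types, size=n)

eps_true = rng.normal(size=(n, 3))       # Gaussian noise added to coordinates
eps_pred = eps_true + 0.1 * rng.normal(size=(n, 3))  # imperfect prediction

L2D = masked_node_loss(log_probs, labels, mask_idx=[0, 3])  # 2 of 6 nodes masked
L3D = denoise_loss(eps_true, eps_pred)
alpha, beta = 1.0, 1.0                   # loss weights of formula (17)
L = alpha * L2D + beta * L3D
print(L2D, L3D, L)
```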
The invention also provides a molecular large model based on the multi-dimensional molecular information, which is obtained by adopting the molecular large model construction method based on the multi-dimensional molecular information.
The invention also provides an application of the model in the biomedical field: a dataset for a downstream task is input into the multidimensional molecular large model for fine-tuning, yielding the output corresponding to that downstream task.
The molecular large model based on multidimensional molecular information, its construction method and its application provided by the invention can help the biomedical field better understand molecular structure and chemical principles from the perspective of artificial intelligence, thereby revealing the internal mechanisms of molecules. By making full use of two-dimensional and three-dimensional molecular information, the model can effectively learn molecular representations, broadly assist downstream tasks such as molecular property prediction and molecule generation, and accelerate drug research and development. Specifically:
(1) A large-scale graph pre-training dataset is constructed that contains both two-dimensional and three-dimensional information, overcoming the single-modality information mining of existing models;
(2) A graph representation learning method with strong expressive capability is designed: the topology of the graph is fully mined, potential energy transformations in three-dimensional geometric space are simulated and made to interact with the attention matrix, which broadens the range of applicable downstream tasks;
(3) The multi-dimensional self-supervision learning task can fully utilize the two-dimensional and three-dimensional graph structure information, and further improve the representation capability of the model.
Drawings
FIG. 1 is a schematic flow diagram of a molecular large model construction method based on multidimensional molecular information according to an embodiment of the invention;
FIG. 2 is a schematic flow chart of the Transformer according to an embodiment of the present invention.
Detailed Description
Embodiments of the present invention are described in further detail below with reference to the accompanying drawings and examples. The following examples are illustrative of the invention but are not intended to limit the scope of the invention.
The embodiment provides a molecular large model construction method based on multi-dimensional molecular information, as shown in fig. 1, comprising the following steps:
Step 1: Construct an unsupervised pre-training dataset. The data of the constructed unsupervised pre-training dataset are drawn from the PubChem and ZINC databases; each molecular record in the dataset comprises the ID number of the molecule in its original database and a one-dimensional molecular SMILES representation, with approximately 110 million molecules drawn from the PubChem database and 10 million from the ZINC database.
SMILES is a linear notation that contains only simple atoms and bonds and few grammatical rules, yet can represent molecular information. Because it resembles text, learning patterns from natural language processing can be borrowed: a language model can learn this sequence representation and predict molecular properties from it. However, this method has several problems. First, SMILES cannot fully capture molecular structure information, such as the similarity between two molecules, so the model cannot fully exploit structural information, which affects final performance. Second, one molecule can be written as multiple SMILES strings, which biases learning and degrades performance. Finally, since the input exists only in SMILES form, the input format of downstream tasks such as molecular property prediction is severely restricted, and the method cannot be directly applied to large-scale drug screening.
Considering that the same molecule can have several different SMILES forms, the SMILES string cannot be used directly for compound matching and de-duplication. Therefore, when the unsupervised pre-training dataset is preprocessed, a one-to-one correspondence between molecules and SMILES sequences is guaranteed by removing hydrogen atoms, removing charges, removing small fragments, removing chirality and standardizing tautomers, preserving a backbone structural representation of the molecule that includes the ID number of the molecule in the original database and the one-dimensional molecular SMILES representation.
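As one possible realization of this preprocessing (the patent does not name specific library calls), RDKit's standardization utilities cover each listed step; the function below is a hypothetical sketch.

```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize

def standardize(smiles: str) -> str:
    """Hypothetical sketch of the preprocessing pipeline: strip small
    fragments and charges, remove chirality, canonicalize tautomers,
    and return a canonical SMILES usable for de-duplication."""
    mol = Chem.MolFromSmiles(smiles)                  # hydrogens stay implicit
    mol = rdMolStandardize.FragmentParent(mol)        # drop salts / small fragments
    mol = rdMolStandardize.Uncharger().uncharge(mol)  # neutralize charges
    Chem.RemoveStereochemistry(mol)                   # remove chirality (in place)
    mol = rdMolStandardize.TautomerEnumerator().Canonicalize(mol)
    return Chem.MolToSmiles(mol)                      # canonical backbone SMILES

# The sodium acetate salt and acetic acid collapse to the same backbone SMILES.
print(standardize("CC(=O)[O-].[Na+]"))
print(standardize("CC(=O)O"))
```

Mapping every input to one canonical backbone SMILES is what makes the subsequent matching and de-duplication reliable.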
Performing conformation generation on the unsupervised pre-training dataset comprises generating molecular conformations with the RDKit toolkit in the following steps:
Generating a preliminary molecular conformation based on distance geometry;
Refining the molecular conformation with the ETKDG method;
Optimizing the molecular conformation with the MMFF force field.
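These three steps map directly onto RDKit calls; a minimal sketch with illustrative inputs (the patent does not fix a seed or an example molecule):

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# Ethanol as an illustrative molecule; explicit hydrogens improve embedding.
mol = Chem.AddHs(Chem.MolFromSmiles("CCO"))

params = AllChem.ETKDGv3()     # distance geometry with ETKDG corrections
params.randomSeed = 42         # reproducible embedding (illustrative choice)
AllChem.EmbedMolecule(mol, params)     # steps 1-2: embed a 3D conformation

AllChem.MMFFOptimizeMolecule(mol)      # step 3: relax with the MMFF force field

coords = mol.GetConformer().GetPositions()  # (n_atoms, 3) array of 3D coordinates
print(coords.shape)
```

The resulting coordinate array is exactly the R component of the molecular graph stored for pre-training.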
Finally, a 2D graph representation of each molecule is generated from the standardized SMILES sequence, and the atomic features, chemical bond features, adjacency matrix of the molecular graph and the atomic three-dimensional coordinates produced by conformation generation are stored together in a .pt file so that the model can load them directly.
In this embodiment, the pre-training stage adopts the constructed unsupervised pre-training data set, and performs pre-training according to the set multi-dimensional self-supervision task, so that the model has a certain molecular characterization capability, and can be effectively generalized to various downstream tasks.
Step 2: carrying out structural coding on a molecular diagram in a molecular pre-training data set to obtain initialized atomic characteristics, and inputting the initialized atomic characteristics into a transducer, as shown in fig. 2;
In molecular representation learning, the Transformer can effectively capture global information, which is particularly important for molecular characterization. For this reason, the Transformer is used as the backbone network of the molecular large model framework; it comprises a plurality of Transformer blocks, each consisting of a self-attention layer and a feed-forward neural network layer, with a normalization operation applied to the self-attention layer and the feed-forward neural network layer.
The molecular graph is defined as shown in formula (1):

G = {X_atom, A, E, R} (1)

where X_atom ∈ ℝ^{n×d} is the atomic node feature matrix, n is the number of atoms, d is the atomic feature dimension, and X_atom contains the intrinsic attributes of the atoms; A is the adjacency matrix of the molecular graph, covering its 1st-order topology; E is the set of edges of the molecular graph; and R ∈ ℝ^{n×3} contains the three-dimensional geometric coordinates of the molecule.
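As a concrete illustration of formula (1), a hypothetical three-atom (water-like) molecular graph can be held in this form; the two toy features per atom and the approximate geometry are illustrative, not the patent's actual feature set:

```python
import numpy as np

# G = {X_atom, A, E, R} for a water-like molecule: n = 3 atoms, d = 2 toy features.
X_atom = np.array([[8.0, 2.0],      # O: (atomic number, degree) as toy features
                   [1.0, 1.0],      # H
                   [1.0, 1.0]])     # H
A = np.array([[0, 1, 1],            # adjacency: O bonded to both hydrogens
              [1, 0, 0],
              [1, 0, 0]])
E = [(0, 1), (0, 2)]                # edge set
R = np.array([[ 0.000, 0.000, 0.0], # 3D coordinates (approximate geometry)
              [ 0.757, 0.586, 0.0],
              [-0.757, 0.586, 0.0]])
G = {"X_atom": X_atom, "A": A, "E": E, "R": R}
print(X_atom.shape, A.shape, R.shape)   # (n, d), (n, n), (n, 3)
```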
Position encoding is an indispensable component of the Transformer. When node features are input, the node degree encoding, the random walk position encoding and the three-dimensional distance encoding are integrated in addition to the atomic node feature matrix. A graph structure has no fixed node ordering, which makes position encoding difficult; compared with other position encoding schemes, random-walk-based position encoding has lower computational complexity and avoids the eigenvalue sign ambiguity problem, since the position information is computed from the adjacency matrix and the node degrees alone, and it can provide distinct position encodings for nodes with different k-hop topological neighborhoods.
The initialized atomic characteristics comprise an atomic node characteristic matrix, node degree codes, random walk position codes and three-dimensional distance codes, and the initialized atomic characteristics are shown in a formula (2):
x0=[Xatom|Xdegree|XRW|X3D] (2)
Wherein, X degree represents node degree code, X RW represents random walk position code, and X 3D represents three-dimensional distance code;
The node degree code X degree is shown in formula (3):
xdegree=fα(D) (3)
Wherein D represents a degree matrix of the molecular graph, f α is a mapping function of the degree information,
The random walk position code XRW is shown in formula (4):
XRW_i = [RW_ii, (RW^2)_ii, ..., (RW^m)_ii] (4)
wherein XRW_i represents the random walk position code of node i, m represents the dimension of the random walk position code, and RW is the random walk operation result matrix, given by formula (5):
RW=AD-1 (5)
wherein D^(-1) represents the inverse of the degree matrix.
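The random-walk position code of formulas (4)-(5) can be sketched in pure Python: form RW = A·D^(-1) and read off each node's diagonal return probability for m successive walk steps. This is an illustrative sketch, not the patented implementation.

```python
# Sketch of the random-walk position code (formulas (4)-(5)).

def matmul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def rw_position_code(adj, m):
    n = len(adj)
    deg = [sum(row) for row in adj]                                   # node degrees (assumed > 0)
    rw = [[adj[i][j] / deg[j] for j in range(n)] for i in range(n)]   # RW = A · D^{-1}
    codes = [[] for _ in range(n)]
    power = [row[:] for row in rw]
    for _ in range(m):
        for i in range(n):
            codes[i].append(power[i][i])   # return probability of node i after k steps
        power = matmul(power, rw)          # advance to the next power of RW
    return codes

adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
print(rw_position_code(adj, m=2))  # [[0.0, 0.5], [0.0, 1.0], [0.0, 0.5]]
```

The code differs per node exactly when their k-hop neighbourhoods differ, which is the property the text relies on.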
The three-dimensional distance code X3D is shown in formula (6):
X3D_i = (1/|U(i)|) ∑j∈U(i) ||r_i - r_j|| (6)
wherein X3D_i represents the three-dimensional distance code of node i, U(i) represents the neighbour node set of node i, |U(i)| represents the number of neighbours of node i, ||r_i - r_j|| represents the distance between node i and node j, and r_i and r_j represent the coordinate information of node i and node j.
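On this reading, formula (6) reduces to the mean Euclidean distance from each node to its bonded neighbours. A small sketch, assuming three collinear heavy atoms spaced 1.5 apart:

```python
import math

# Sketch of the 3D distance code (formula (6)): for node i, the mean Euclidean
# distance to its neighbours U(i). The coordinates below are illustrative.

def distance_code(coords, adj):
    codes = []
    for i, ri in enumerate(coords):
        nbrs = [j for j, a in enumerate(adj[i]) if a]      # neighbour set U(i)
        dists = [math.dist(ri, coords[j]) for j in nbrs]   # ||r_i - r_j||
        codes.append(sum(dists) / len(nbrs))               # mean over |U(i)|
    return codes

coords = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
print(distance_code(coords, adj))  # [1.5, 1.5, 1.5]
```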
Step 3: the shortest path structure code, the edge information code and the three-dimensional distance pair code are integrated into the self-attention layer of the Transformer; the three-dimensional distance pair code interacts with the self-attention layer of the Transformer during training, and the node pair features of the self-attention layer are updated iteratively;
The shortest path structure code, the edge information code and the three-dimensional distance pair code are integrated into the self-attention layer of the Transformer as bias terms, as shown in formula (7):
Att(X)^(l+1) = Att(X)^l + SPD + Edge + Ψ3D (7)
wherein Att(X)^(l+1) represents the (l+1)-th self-attention layer, Att(X)^l represents the l-th self-attention layer, SPD represents the shortest path structure code, Edge represents the edge information code, and Ψ3D represents the three-dimensional distance pair code.
The shortest path structure code SPD is shown in formula (8):
SPD = fβ(F) (8)
wherein F is the matrix of shortest paths between node pairs in the molecular graph obtained by the Floyd algorithm, and fβ is a mapping function of the shortest path.
The Edge information code Edge is shown in formula (9):
Edge=gθ(E) (9)
wherein gθ is a mapping function of the edge information.
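The shortest-path matrix F of formula (8) can be obtained with Floyd's algorithm as named in the text. A minimal sketch: the mapping function of the shortest path is assumed to be the identity for illustration, and the resulting matrix would be added to the attention logits as a bias term per formula (7).

```python
# Sketch of the shortest-path structure code (formula (8)) via Floyd's
# (Floyd-Warshall) all-pairs shortest-path algorithm on an unweighted graph.

INF = float("inf")

def floyd_shortest_paths(adj):
    n = len(adj)
    # Initialize: 0 on the diagonal, 1 for bonded pairs, infinity otherwise.
    f = [[0 if i == j else (1 if adj[i][j] else INF) for j in range(n)] for i in range(n)]
    for k in range(n):                      # allow intermediate node k
        for i in range(n):
            for j in range(n):
                if f[i][k] + f[k][j] < f[i][j]:
                    f[i][j] = f[i][k] + f[k][j]
    return f

adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
print(floyd_shortest_paths(adj))  # [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
```

In the model, each shortest-path length would then pass through the learnable mapping before being added to the attention scores.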
The three-dimensional distance pair code is shown in formula (10):
Ψ_ij^k = exp(-(α_ij·||r_i - r_j|| + β_ij - μ_k)^2 / (2σ_k^2)) (10)
wherein r_i and r_j represent the coordinate information of node i and node j; α_ij, β_ij, μ_k and σ_k are learnable parameters; α_ij and β_ij are controlled by the element types of the atomic nodes, so node pairs formed by different elements have different α_ij and β_ij; μ_k and σ_k are the parameters of the Gaussian kernel mapping; and k represents the number of Gaussian kernels;
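The Gaussian-kernel distance pair code of formula (10) can be sketched as below. The exact parameterization is an assumption based on the text's description: an element-pair-dependent scale and shift (α, β) of the interatomic distance, followed by a bank of Gaussian kernels (μ_k, σ_k).

```python
import math

# Sketch of the Gaussian-kernel distance pair code (formula (10)), assuming
# psi_k = exp(-(alpha*dist + beta - mu_k)^2 / (2*sigma_k^2)).
# alpha and beta stand in for the element-pair-dependent learnable parameters.

def gaussian_pair_code(dist, alpha, beta, mus, sigmas):
    x = alpha * dist + beta   # element-pair-dependent affine transform of the distance
    return [math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) for mu, sigma in zip(mus, sigmas)]

# Three kernels centred at distances 1, 2 and 3.
code = gaussian_pair_code(dist=1.5, alpha=1.0, beta=0.0,
                          mus=[1.0, 2.0, 3.0], sigmas=[0.5, 0.5, 0.5])
print([round(c, 3) for c in code])  # highest response from the kernels nearest 1.5
```

Each node pair thus obtains a k-dimensional soft one-hot representation of its interatomic distance, which is what interacts with the self-attention layer.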
The three-dimensional distance pair code interacts with the self-attention layer of the Transformer; the iterative update process of the self-attention layer is shown in formulas (11)-(12):
p_ij^0 = SPD_ij + Edge_ij + Ψ_ij (11)
p_ij^(l+1) = p_ij^l + [Q_i^(l,1)(K_j^(l,1))^T/√(d/H); ...; Q_i^(l,H)(K_j^(l,H))^T/√(d/H)]·M (12)
wherein p_ij^0 represents the initial feature of node pair i-j, M represents the mapping matrix, p_ij^l represents the feature of node pair i-j at the l-th self-attention layer, H is the number of attention heads, d is the dimension of the hidden layer, Q_i^(l,h) is the query of the h-th head of the l-th self-attention layer, and K_j^(l,h) is the corresponding key.
The updated node pair feature serves as a bias term for the next self-attention layer.
Step 4: defining a two-dimensional and three-dimensional space joint molecular graph self-supervised learning task, and obtaining the molecular large model based on multidimensional molecular information after training.
The two-dimensional and three-dimensional space joint molecular graph self-supervised learning tasks comprise a two-dimensional masked node attribute prediction task and a three-dimensional coordinate denoising task;
The two-dimensional masked node attribute prediction task uses prediction of masked node attributes as the pre-training task: by masking part of the node features of the input graph, the model learns molecular structure information in order to predict the masked attributes. The loss function L2D of the two-dimensional masked node attribute prediction task is shown in formula (13):
L2D = -∑i∈M log p(zi | GM) (13)
wherein p is a conditional probability, M denotes the set of masked nodes, zi represents the output of the last Transformer block for node i, and GM represents the masked molecular graph;
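A minimal sketch of the masked-attribute loss of formula (13), assuming the model head outputs per-node class probabilities; the names and toy values are illustrative:

```python
import math

# Sketch of the 2D masked-node loss (formula (13)): negative log-likelihood of
# the true attribute over the masked node set M, given predicted probabilities.

def masked_node_loss(probs, true_labels, masked):
    """probs[i][c]: predicted probability of attribute class c for node i."""
    return -sum(math.log(probs[i][true_labels[i]]) for i in masked)

# Three nodes, two attribute classes; nodes 0 and 2 are masked.
probs = [[0.7, 0.3], [0.2, 0.8], [0.9, 0.1]]
loss = masked_node_loss(probs, true_labels=[0, 1, 0], masked=[0, 2])
print(round(loss, 4))
```

Lower loss means the model assigns higher probability to the true attributes of the masked nodes.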
The three-dimensional coordinate denoising task perturbs the molecular geometry by adding Gaussian noise ε to the input atomic three-dimensional coordinate information R, and minimizes the difference between the noise value predicted by the model and the input noise value. The noise predicted by the model for the k-th coordinate dimension is shown in formula (14):
ε̂_i^k = ∑j∈V att_ij·(W1·Δ_ij^k + W2) (14)
wherein att_ij represents the attention score between node i and node j, W1 and W2 represent learnable parameters, Δ_ij^k represents the component of Δ_ij in the k-th coordinate dimension, and Δ_ij represents the relative position information of node i and node j, as shown in formula (15):
Δ_ij = (r_i - r_j)/||r_i - r_j|| (15)
The loss function of the three-dimensional coordinate denoising task is shown in formula (16):
L3D = (1/|V|) ∑i∈V ||ε_i - ε̂_i||^2 (16)
wherein V represents the set of all nodes in the graph, |V| represents the number of nodes, ε_i represents the true coordinate noise of the i-th node, and ε̂_i represents the predicted coordinate noise of the i-th node;
The loss function of the molecular large model based on multidimensional molecular information is shown in formula (17):
L=αL2D+βL3D (17)
wherein α represents the loss weight of the two-dimensional masked node attribute prediction task, and β represents the loss weight of the three-dimensional coordinate denoising task.
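A sketch of the denoising loss of formula (16) and the weighted combination of formula (17); α = β = 1 and the toy noise values are assumed for illustration.

```python
# Sketch of the 3D coordinate-denoising loss (formula (16)) and the combined
# objective L = alpha*L2D + beta*L3D (formula (17)).

def l3d(true_noise, pred_noise):
    """Mean squared norm of the per-node noise prediction error."""
    n = len(true_noise)
    return sum(sum((t - p) ** 2 for t, p in zip(ti, pi))
               for ti, pi in zip(true_noise, pred_noise)) / n

# Two atoms, 3D noise vectors; a zero prediction for illustration.
true_eps = [[0.1, 0.0, 0.0], [0.0, 0.2, 0.0]]
pred_eps = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
loss_3d = l3d(true_eps, pred_eps)      # (0.01 + 0.04) / 2 = 0.025
l2d = 0.46                             # placeholder value for the 2D masked loss
alpha, beta = 1.0, 1.0                 # assumed loss weights
total = alpha * l2d + beta * loss_3d
print(round(total, 3))
```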
Owing to the design of the two-dimensional and three-dimensional joint molecular graph self-supervised learning task, the model can fuse molecular information from different views; it is saved as a .pt file so that it can be further fine-tuned on downstream tasks, improving generalization performance.
The embodiment also provides a molecular large model based on the multi-dimensional molecular information, which is obtained by adopting the molecular large model construction method based on the multi-dimensional molecular information.
The embodiment also provides an application of the molecular large model based on multidimensional molecular information in the biomedical field: a dataset for a downstream task is input into the molecular large model for fine-tuning to obtain the output corresponding to that task, wherein the downstream-task datasets include a molecular property prediction dataset, a three-dimensional coordinate generation dataset and a drug screening dataset. The model can fully mine molecular information in the biomedical field, learning molecular characterization by modeling two-dimensional structure information and potential-energy changes in three-dimensional geometric space. It improves performance on downstream tasks such as molecular property prediction, target prediction and molecular synthesis, and accelerates drug screening, thereby providing important support and assistance for drug research and development.
The embodiment provides an application of the molecular large model based on multidimensional molecular information to a fine-tuning dataset in the biomedical field. In general, molecular label information can positively guide the performance of a trained model. After the molecular large model in this embodiment completes the pre-training task, it can generate three types of output (a graph feature vector, a node feature matrix and a three-dimensional coordinate matrix) to connect with various downstream tasks, for example: completing a molecular property prediction task with the graph feature vector, or a molecular pose prediction task with the three-dimensional coordinate matrix.
The dataset used in this embodiment is the PCQM4Mv2 dataset. Using the HOMO-LUMO values provided in PCQM4Mv2, the supervised downstream task is defined as a regression task (predicting the quantum characteristics of the molecular graph) to optimize the model parameters learned in the self-supervised stage, comprising:
Step 1: a supervised fine-tuning dataset is constructed from the PCQM4Mv2 dataset, which contains 3.4 million organic molecules and records the three-dimensional conformation in the molecular equilibrium state and the HOMO-LUMO energy gap calculated using density functional theory.
Step 2: the supervised fine-tuning dataset is preprocessed; by removing hydrogen atoms, removing charges, removing small fragments, removing chirality and standardizing tautomers, the backbone structure representation of each molecule is retained, comprising the ID number of the molecule in the original database and a one-dimensional SMILES representation, ensuring a one-to-one correspondence between molecules and SMILES sequences.
Step 3: in this embodiment, MAE is used as the loss function and an Adam optimizer is used to optimize the model parameters; the hyperparameters are adjusted according to the validation set.
MAE calculation is shown in formula (18):
MAE = (1/N) ∑n=1..N |ŷ_n - y_n| (18)
wherein N represents the number of molecular graphs in the PCQM4Mv2 dataset, ŷ_n represents the model output for the n-th molecule, and y_n represents the true label of the n-th molecule.
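The MAE of formula (18) is a plain mean absolute error over the N molecules, e.g.:

```python
# Sketch of the MAE fine-tuning loss (formula (18)) over predicted and true
# HOMO-LUMO gaps; the values below are illustrative.

def mae(preds, labels):
    return sum(abs(p - y) for p, y in zip(preds, labels)) / len(labels)

print(mae([3.1, 4.0, 5.2], [3.0, 4.5, 5.0]))  # mean of |0.1|, |0.5|, |0.2| ≈ 0.2667
```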
The embodiments of the invention have been presented for purposes of illustration and description, and are not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art. The embodiments were chosen and described in order to best explain the principles of the invention and the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.
Claims (3)
1. The molecular large model construction method based on multidimensional molecular information is characterized by comprising the following steps:
Constructing an unsupervised pre-training data set, and performing preprocessing and molecular conformation generation on the unsupervised pre-training data set to obtain a molecular pre-training data set formed of molecular graphs;
Carrying out structural coding on the molecular graphs in the molecular pre-training data set to obtain initialized atomic characteristics, and inputting the initialized atomic characteristics into a Transformer;
integrating the shortest path structure code, the edge information code and the three-dimensional distance pair code into the self-attention layer of the Transformer, interacting the three-dimensional distance pair code with the self-attention layer of the Transformer in the training process, and iteratively updating the node pair features of the self-attention layer of the Transformer;
Defining a two-dimensional space and three-dimensional space joint molecular graph self-supervision learning task, and obtaining a molecular large model based on multidimensional molecular information after training;
The Transformer comprises a plurality of Transformer blocks; each Transformer block consists of a self-attention layer and a feedforward neural network layer, and a standard normalization operation is applied to both layers;
the molecular graph is shown in formula (1):
G={Xatom,A,E,R} (1)
wherein Xatom ∈ R^(n×d) is the atomic node feature matrix, n represents the number of atoms and d the atomic feature dimension; Xatom contains the inherent attributes of the atoms; A represents the adjacency matrix of the molecular graph and covers its 1st-order topology information; E represents the set of edges on the molecular graph; and R ∈ R^(n×3) represents the geometric coordinates of the molecule in three-dimensional space;
The initialized atomic characteristics comprise an atomic node characteristic matrix, node degree codes, random walk position codes and three-dimensional distance codes, and the initialized atomic characteristics are shown in a formula (2):
X0=[Xatom|Xdegree|XRW|X3D] (2)
Wherein, X degree represents node degree code, X RW represents random walk position code, and X 3D represents three-dimensional distance code;
The node degree code X degree is shown in formula (3):
Xdegree=fα(D) (3)
wherein D represents the degree matrix of the molecular graph and fα is a mapping function of the degree information,
The random walk position code XRW is shown in formula (4):
XRW_i = [RW_ii, (RW^2)_ii, ..., (RW^m)_ii] (4)
wherein XRW_i represents the random walk position code of node i, m represents the dimension of the random walk position code, and RW is the random walk operation result matrix, given by formula (5):
RW=AD-1 (5)
wherein D^(-1) represents the inverse of the degree matrix,
The three-dimensional distance code X3D is shown in formula (6):
X3D_i = (1/|U(i)|) ∑j∈U(i) ||r_i - r_j|| (6)
wherein X3D_i represents the three-dimensional distance code of node i, U(i) represents the neighbour node set of node i, |U(i)| represents the number of neighbours of node i, ||r_i - r_j|| represents the distance between node i and node j, and r_i and r_j represent the coordinate information of node i and node j;
and the shortest path structure code, the edge information code and the three-dimensional distance pair code are merged into the self-attention layer of the Transformer as bias terms, as shown in formula (7):
Att(X)^(l+1) = Att(X)^l + SPD + Edge + Ψ3D (7)
wherein Att(X)^(l+1) represents the (l+1)-th self-attention layer, Att(X)^l represents the l-th self-attention layer, SPD represents the shortest path structure code, Edge represents the edge information code, and Ψ3D represents the three-dimensional distance pair code,
The shortest path structure code SPD is shown in formula (8):
SPD = fβ(F) (8)
wherein F is the matrix of shortest paths between node pairs in the molecular graph obtained by the Floyd algorithm, and fβ is a mapping function of the shortest path,
The Edge information code Edge is shown in a formula (9):
Edge=gθ(E) (9)
wherein gθ is a mapping function of the edge information,
The three-dimensional distance pair code is shown in formula (10):
Ψ_ij^k = exp(-(α_ij·||r_i - r_j|| + β_ij - μ_k)^2 / (2σ_k^2)) (10)
wherein r_i and r_j represent the coordinate information of node i and node j; α_ij, β_ij, μ_k and σ_k are learnable parameters; α_ij and β_ij are controlled by the element types of the atomic nodes, so node pairs formed by different elements have different α_ij and β_ij; μ_k and σ_k are the parameters of the Gaussian kernel mapping; and k represents the number of Gaussian kernels;
The three-dimensional distance pair code interacts with the self-attention layer of the Transformer; the iterative update of the node pair features by the self-attention layer is shown in formulas (11)-(12):
p_ij^0 = SPD_ij + Edge_ij + Ψ_ij (11)
p_ij^(l+1) = p_ij^l + [Q_i^(l,1)(K_j^(l,1))^T/√(d/H); ...; Q_i^(l,H)(K_j^(l,H))^T/√(d/H)]·M (12)
wherein p_ij^0 represents the initial feature of node pair i-j, M represents the mapping matrix, p_ij^l represents the feature of node pair i-j at the l-th self-attention layer, H is the number of attention heads, d is the dimension of the hidden layer, Q_i^(l,h) is the query of the h-th head of the l-th self-attention layer, and K_j^(l,h) is the corresponding key,
The updated node pair characteristics are used as bias items of the self-attention layer of the next layer;
The defined two-dimensional and three-dimensional space joint molecular graph self-supervised learning tasks comprise a two-dimensional masked node attribute prediction task and a three-dimensional coordinate denoising task;
The two-dimensional masked node attribute prediction task uses prediction of masked node attributes as the pre-training task: by masking part of the node features of the input graph, the model learns molecular structure information in order to predict the masked attributes, and the loss function L2D of the two-dimensional masked node attribute prediction task is shown in formula (13):
L2D = -∑i∈M log p(zi | GM) (13)
wherein p is a conditional probability, M denotes the set of masked nodes, zi represents the output of the last Transformer block for node i, and GM represents the masked molecular graph;
The three-dimensional coordinate denoising task perturbs the molecular geometry by adding Gaussian noise ε to the input atomic three-dimensional coordinate information R, and minimizes the difference between the noise value predicted by the model and the input noise value; the noise predicted by the model for the k-th coordinate dimension is shown in formula (14):
ε̂_i^k = ∑j∈V att_ij·(W1·Δ_ij^k + W2) (14)
wherein att_ij represents the attention score between node i and node j, W1 and W2 represent learnable parameters, Δ_ij^k represents the component of Δ_ij in the k-th coordinate dimension, and Δ_ij represents the relative position information of node i and node j, as shown in formula (15):
Δ_ij = (r_i - r_j)/||r_i - r_j|| (15)
The loss function of the three-dimensional coordinate denoising task is shown in formula (16):
L3D = (1/|V|) ∑i∈V ||ε_i - ε̂_i||^2 (16)
wherein V represents the set of all nodes in the graph, |V| represents the number of nodes, ε_i represents the true coordinate noise of the i-th node, and ε̂_i represents the predicted coordinate noise of the i-th node;
the loss function of the molecular large model based on the multidimensional molecular information is shown in a formula (17):
L=αL2D+βL3D (17)
wherein α represents the loss weight of the two-dimensional masked node attribute prediction task, and β represents the loss weight of the three-dimensional coordinate denoising task.
2. The method of claim 1, wherein the preprocessing comprises removing hydrogen atoms, removing charges, removing small fragments, removing chirality and standardizing tautomers, retaining the backbone structure representation of the molecules, wherein the backbone structure representation comprises the ID number of the molecule in an original database and a one-dimensional molecular SMILES representation.
3. The method for constructing a molecular large model based on multidimensional molecular information according to claim 1, wherein the molecular conformation generation process comprises generating molecular conformations with the RDKit toolkit based on the following steps:
generating a preliminary molecular conformation based on the distance geometry;
modifying the molecular conformation based on the ETKDG method;
The molecular conformation was optimized based on MMFF force field.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311574206.0A CN117524353B (en) | 2023-11-23 | 2023-11-23 | Molecular large model based on multidimensional molecular information, construction method and application |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117524353A CN117524353A (en) | 2024-02-06 |
CN117524353B true CN117524353B (en) | 2024-05-10 |
Family
ID=89766059
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117912575B (en) * | 2024-03-19 | 2024-05-14 | 苏州大学 | Atomic importance analysis method based on multi-dimensional molecular pre-training model |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113241128A (en) * | 2021-04-29 | 2021-08-10 | 天津大学 | Molecular property prediction method based on molecular space position coding attention neural network model |
CN113299354A (en) * | 2021-05-14 | 2021-08-24 | 中山大学 | Small molecule representation learning method based on Transformer and enhanced interactive MPNN neural network |
CN114566232A (en) * | 2022-02-17 | 2022-05-31 | 北京百度网讯科技有限公司 | Molecular characterization model training method and device and electronic equipment |
WO2023029351A1 (en) * | 2021-08-30 | 2023-03-09 | 平安科技(深圳)有限公司 | Self-supervised learning-based method, apparatus and device for predicting properties of drug small molecules |
CN115831261A (en) * | 2022-11-14 | 2023-03-21 | 浙江大学杭州国际科创中心 | Three-dimensional space molecule generation method and device based on multi-task pre-training inverse reinforcement learning |
CN116052792A (en) * | 2023-01-31 | 2023-05-02 | 杭州碳硅智慧科技发展有限公司 | Training method and device for molecular optimal conformation prediction model |
WO2023153882A1 (en) * | 2022-02-11 | 2023-08-17 | Samsung Display Co., Ltd. | Method for optimizing properties of a molecule |
CN116978483A (en) * | 2023-07-31 | 2023-10-31 | 浙江大学 | Molecular property prediction method and system based on graphic neural network and three-dimensional encoder |
CN116978481A (en) * | 2023-03-23 | 2023-10-31 | 腾讯科技(深圳)有限公司 | Molecular attribute prediction method, device, electronic equipment and storage medium |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111755078B (en) * | 2020-07-30 | 2022-09-23 | 腾讯科技(深圳)有限公司 | Drug molecule attribute determination method, device and storage medium |
Non-Patent Citations (4)
Title |
---|
Molecular Transformer: A Model for Uncertainty-Calibrated Chemical Reaction Prediction; Philippe Schwaller et al.; ACS Cent. Sci.; 2019-08-30 *
Unified 2D and 3D Pre-Training of Molecular Representations; Jinhua Zhu et al.; https://arxiv.org/abs/2207.08806; 2022-08-18 *
Research Progress of Deep-Learning-Based 3D Molecule Generation Models; Science China: Chemistry; 2023-02-09 *
Graph Theory and Topology of Prescriptions: Principles and Research Methods of Chemical Graph Theory and Molecular Topology of Prescription Structure (Continued); Feng Qianjin, Liu Runlan; Journal of Shanxi University of Traditional Chinese Medicine; 2013-10-28 (No. 05) *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||