CN116168775A - Molecular multi-modal model training and application method, storage medium and chip
- Publication number
- CN116168775A (application number CN202211099018.2A)
- Authority
- CN
- China
- Prior art keywords
- molecular
- graph
- text
- model
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16C—COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
- G16C20/00—Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
- G16C20/70—Machine learning, data mining or chemometrics
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/126—Character encoding
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/194—Calculation of difference between files
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16C—COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
- G16C20/00—Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
- G16C20/90—Programming languages; Computing architectures; Database systems; Data warehousing
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Abstract
The invention realizes a molecular multi-modal model training and application method, a storage medium, a chip and a system by a method in the field of network security. First, the candidate document set interacts with the sub-topics or queries through the Encoder structure of a Transformer. After representations of the documents and sub-topics are obtained, combination weights are modeled from the selected documents, all candidate documents and the sub-topics; explicit scores and implicit scores are obtained through this interaction, and the two are finally combined into the final diversification score using the updated weights. The method designs a combined explicit and implicit feature model that dynamically adjusts the weights at different steps of different queries, so as to improve the diversification of search results. The model is trained with a list-pairwise loss function in the LambdaRank style, and experiments on the model demonstrate its effectiveness and interpretability.
Description
Technical Field
The invention relates to the technical field of machine learning, in particular to a molecular multi-modal model training and application method, a storage medium and a chip.
Background
Understanding molecule-related knowledge and discovering molecular properties are critical to scientific exploration in fields such as biomedicine, chemistry and materials. Traditional exploration methods require professionals to perform large numbers of wet biochemical experiments, which is both expensive and time-consuming. With the progress of deep learning, scientific exploration using artificial intelligence, such as predicting molecular properties and generating candidate molecules, has become possible, and considerable progress has been made.
However, unlike humans, who understand molecules through multiple modalities, most existing artificial intelligence models address a single cognitive ability (e.g., property prediction, molecule generation, literature understanding) on a single modality of molecules (e.g., molecular graph, SMILES string, text). These models fall into two main categories. Language-based models take as input natural language related to molecular knowledge and/or SMILES strings. For example, molecular property prediction models were designed for SMILES strings in "SMILES-BERT: Large scale unsupervised pre-training for molecular property prediction" by Sheng Wang et al. and in "ChemBERTa: Large-scale self-supervised pretraining for molecular property prediction" by Seyone Chithrananda et al.; "SciBERT: A pretrained language model for scientific text" by Iz Beltagy et al., "Biomedical event extraction based on knowledge-driven tree-LSTM" by Diya Li et al., and "BioBERT: a pre-trained biomedical language representation model for biomedical text mining" by Jinhyuk Lee et al. focus on learning from biochemical text; Zheni Zeng et al., in "A deep-learning system bridging molecule structure and biomedical text with comprehension comparable to human professionals", developed a deep learning system that jointly learns from molecule-related text and molecular SMILES strings to establish relationships between them; and in the molecule generation models of "Multi-constraint molecular generation based on conditional transformer, knowledge distillation and reinforcement learning" by Jike Wang et al., "Improving de novo molecular design with curriculum learning" by Jeff Guo et al., "Language models can learn complex molecular distributions" by Daniel Flam-Shepherd et al., and "Optimizing molecules using efficient queries from property evaluations" by Samuel C. Hoffman et al., the generated molecules are represented as SMILES strings.
Graph-based models can only handle molecular graphs. Current graph neural network (GNN) based molecular property prediction models learn from molecular graphs, and generative models learn directly from graph data to generate molecular graphs. Training these models requires a large number of manual annotations or a collection of specific properties, while other properties of the molecule and related conditions are typically ignored. Thus, each of these models can handle only one modality of molecules and cannot obtain a comprehensive understanding of them.
Disclosure of Invention
Therefore, the invention first provides a molecular multi-modal model training and application method, a storage medium and a chip, in which a graph encoder and a text encoder are jointly learned from multi-modal molecular data so as to associate a molecular graph with its biomedical text description, thereby overcoming the limitation of existing machine learning models that handle only a single modality of molecular data.
The invention first provides a molecular multi-modal model training and application method, comprising the following steps:
S100, establishing a data collection unit: molecular graphs and their weakly semantically related text data are extracted from published SCI papers to construct a molecular graph-text dataset;
S200, establishing a molecular multi-modal model and its pre-training unit: a molecular multi-modal model comprising a graph encoder and a text encoder is constructed and trained by contrastive learning;
S300, establishing a text-based molecular graph generation unit, and applying the model to different downstream tasks such as cross-modal retrieval and molecular graph generation from text descriptions.
The method by which the data collection unit constructs the dataset comprises:
S201, collecting the names, synonyms and SMILES strings of the first 50K molecular compounds in PubChem;
S202, for each collected molecule, converting its SMILES string into a molecular graph using the smiles2graph function provided by OGB;
S203, using the name of each molecule as a query, retrieving sentences containing the name from the abstract, introduction and conclusion sections of published scientific papers in the medical, biological, chemical and computer science fields of the S2ORC database; recording each retrieved sentence together with its adjacent sentences as a paragraph in a document; if fewer than two paragraphs are retrieved by name, searching again using the synonyms or aliases of the molecule as queries; and terminating the retrieval for a molecule early once a specified number of paragraphs or a specified document size has been reached;
S204, obtaining molecular graph-document pairs to form a multi-modal molecular dataset.
The molecular multi-modal model consists of a graph encoder and a text encoder, which extract a molecular graph representation and a text representation, respectively. A graph isomorphism network (GIN) is used as the graph encoder and the language model BERT as the text encoder. During training, the model additionally uses a similarity calculation module, which uses two mapping heads to map the molecular graph representation and the text representation into a joint representation space and then computes the cosine similarity of the mapped features.
The pre-training method of the molecular multi-modal model specifically comprises:
S401, initializing the graph encoder with the self-supervised pre-trained weights of a graph isomorphism network, and initializing the text encoder with the pre-trained BERT weights of Sci-BERT or KV-PLM;
S402, in each training epoch, sequentially sampling mini-batches of N molecular graph-text pairs from the training set;
S403, for each mini-batch of graphs $\{G_1, \dots, G_N\}$, generating two different augmentations of each graph by random node dropping and random subgraph sampling, respectively, giving 2N augmented graphs in total, and feeding these graphs into the graph encoder to obtain their representation vectors, so that the i-th graph $G_i$ has two representations, one for each of its augmented graphs;
S404, randomly extracting two different sentences from the document corresponding to each molecule and encoding them with the text encoder, so that the i-th graph $G_i$ has two text representations, one for each sentence describing it; since each molecular graph in the mini-batch corresponds to two different sentences, 2N text representations are obtained;
S405, for the i-th graph $G_i$, computing the total multi-view loss, which comprises four contrastive losses, one for each of the four cross-modal pairs formed by its two graph representations and its two text representations; in each loss, τ is a temperature parameter and sim(·,·) is the similarity calculation module, which first projects the graph and text representations to the same dimension and then computes the cosine similarity between the projected vectors;
S406, calculating the contrastive loss of the graph modality: for the i-th graph, this is a contrastive loss between the representations of its two augmented graphs, with the augmented graphs of the other molecules in the mini-batch serving as negatives;
S407, calculating the total loss L as the sum, over all samples in the mini-batch, of the cross-modal losses and the graph-modality losses, where λ is a hyper-parameter that balances the cross-modal loss against the graph-modality loss;
S408, for each mini-batch, updating the parameters of the graph encoder, the text encoder and the mapping heads by back-propagating the total loss L, until all mini-batches in the current epoch have been processed;
S409, repeating S402-S408 until the preset maximum number of epochs is reached.
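The losses of S405 to S407 can be written out as follows. This is a reconstruction under stated assumptions rather than the patent's own formulas: the symbols $\tilde z_i^G, \hat z_i^G$ (representations of the two augmented graphs of $G_i$) and $\tilde z_i^T, \hat z_i^T$ (representations of its two sentences) are notation introduced here, the softmax form follows from the description of a normalized cross-entropy loss with temperature τ, and attaching λ to the graph-modality term is an assumption.

```latex
% Generic contrastive term: the i-th pair is positive, the other N-1 candidates are negatives
\ell_i(u, v) = -\log
  \frac{\exp\big(\mathrm{sim}(u_i, v_i)/\tau\big)}
       {\sum_{j=1}^{N} \exp\big(\mathrm{sim}(u_i, v_j)/\tau\big)}

% S405: four cross-modal terms for graph G_i
\ell_i^{\mathrm{cm}} =
  \ell_i(\tilde z^G, \tilde z^T) + \ell_i(\tilde z^G, \hat z^T)
  + \ell_i(\hat z^G, \tilde z^T) + \ell_i(\hat z^G, \hat z^T)

% S406: graph-modality term for graph G_i
\ell_i^{\mathrm{g}} = \ell_i(\tilde z^G, \hat z^G)

% S407: total loss over the mini-batch (placement of \lambda assumed)
L = \sum_{i=1}^{N} \big( \ell_i^{\mathrm{cm}} + \lambda\, \ell_i^{\mathrm{g}} \big)
```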
The text-based molecular graph generation method comprises the following steps:
The method uses a molecular multi-modal model trained by the above pre-training method together with a pre-trained molecular generator that samples from a random seed and allows gradients to be back-propagated through it; the parameters of both the molecular multi-modal model and the molecular generator are kept fixed;
S501, inputting a text $x_T$ describing a molecule;
S502, initializing a generation seed q and setting q as a learnable parameter;
S503, generating a molecular graph $x_G$ from q using the pre-trained molecular generator;
S504, feeding $x_G$ and $x_T$ into the graph encoder and the text encoder of the pre-trained molecular multi-modal model, respectively, to extract the corresponding graph representation $z_G$ and text representation $z_T$, and computing their negative similarity with the similarity calculation module of the molecular multi-modal model as the loss function:
$\ell_q = -\mathrm{sim}(z_G, z_T)/\tau$,
S505, back-propagating the loss $\ell_q$ and updating the seed q by gradient descent;
S506, repeating steps S503-S505 until the preset maximum number of iterations is reached;
S507, feeding the optimized q to the molecular generator to generate and output the final molecular graph.
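The loop S501 to S507 can be illustrated with a toy sketch. Everything below is a stand-in: `generate`, `encode_graph` and the fixed vector `z_T` are hypothetical placeholders for the frozen molecular generator, the frozen graph encoder and the encoded text, and a finite-difference gradient is swapped in for back-propagation so that the example needs only the standard library.

```python
import math

TAU = 0.1  # temperature; 0.1 is the value used in pre-training


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def generate(q):
    """Hypothetical frozen generator: a fixed linear map from seed to 'graph' features."""
    return [q[0] + 0.5 * q[1], q[1] - 0.3 * q[0], 0.2 * q[0]]


def encode_graph(x_g):
    """Hypothetical frozen graph encoder (identity in this toy)."""
    return x_g


def loss(q, z_t):
    """S504: l_q = -sim(z_G, z_T) / tau."""
    return -cosine(encode_graph(generate(q)), z_t) / TAU


def numeric_grad(q, z_t, eps=1e-5):
    """Central finite differences, standing in for back-propagation (S505)."""
    grad = []
    for k in range(len(q)):
        plus = q[:k] + [q[k] + eps] + q[k + 1:]
        minus = q[:k] + [q[k] - eps] + q[k + 1:]
        grad.append((loss(plus, z_t) - loss(minus, z_t)) / (2 * eps))
    return grad


def optimize_seed(z_t, q0, lr=0.01, steps=200):
    """S503-S506: only the seed q is updated; generator and encoders stay fixed."""
    q = list(q0)
    for _ in range(steps):
        grad = numeric_grad(q, z_t)
        q = [qk - lr * gk for qk, gk in zip(q, grad)]
    return q


z_T = [0.8, -0.2, 0.1]   # stand-in for the encoded text x_T (S501)
q_init = [1.0, 1.0]      # S502: initial seed
q_opt = optimize_seed(z_T, q_init)
final_graph = generate(q_opt)  # S507: generate from the optimized seed
```

In a real setting the seed would be a tensor with gradients enabled and the update would come from automatic differentiation; the structure of the loop is the same.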
The invention also provides a storage medium storing the molecular multi-modal model training and application method, and a chip using the storage medium.
The invention has the technical effects that:
The invention provides a molecular multi-modal model (MoMu), which implicitly establishes a connection between molecular structures and language descriptions. Because it can handle multiple modalities of molecules, the model can be applied to a very wide range of downstream tasks.
For generating molecules from descriptions of molecular function, the text-based molecular graph generation method can directly generate new molecules from a text description of the required conditions. This addresses the inability of the prior art to generate valid molecules from descriptions, and for a description containing several conditions the method generates molecular structures that satisfy as many of them as possible. In contrast to existing AI-based molecule generation methods, which can only generate molecules with pre-specified properties, the present method adaptively generates molecule candidates from input text that can describe any desired condition or conditions. Owing to its strong generalization ability and imagination, the pre-trained molecular multi-modal model can promote scientific exploration in many molecule-related fields such as biology, chemistry, materials and medicine.
The molecular multi-modal model MoMu provided by the invention is pre-trained on paired multi-modal data consisting of molecular graphs and their weakly related biochemical descriptions retrieved from publicly available SCI papers. Because it can process molecules in multiple modalities, the pre-trained model can be applied to a very wide range of downstream tasks; the invention therefore also provides a zero-shot molecule generation method and a molecule captioning method based on the pre-trained model. Experimental results show that the pre-trained model generalizes well across a wide range of downstream tasks, including cross-modal molecule retrieval, molecule captioning, zero-shot molecule generation and molecular property prediction.
Drawings
FIG. 1 is a flow chart of the collection of the graph-text dataset of the present invention;
FIG. 2 is a schematic diagram of the architecture and pre-training principle of the molecular multi-modal model of the present invention;
FIG. 3 is a schematic block diagram of the text-based molecular graph generation method of the present invention;
FIG. 4 shows results of the text-based molecular graph generation method of the present invention for some input texts describing function;
FIG. 5 shows results of the text-based molecular graph generation method of the present invention for some input texts describing structure.
Detailed Description
The following is a preferred embodiment of the present invention and a technical solution of the present invention is further described with reference to the accompanying drawings, but the present invention is not limited to this embodiment.
The principles and features of the present invention are described below with reference to the drawings. The examples are given to illustrate the invention and are not to be construed as limiting its scope. The invention is described more particularly by way of example in the following paragraphs with reference to the drawings. Advantages and features of the invention will become more apparent from the following description and from the claims. It should be noted that the drawings are in a very simplified form and are not drawn to precise scale; they serve merely to aid, conveniently and clearly, in describing embodiments of the invention.
The invention provides a molecular multi-modal model training and application method, and a storage medium and a chip implemented on the basis of the method.
The molecular multi-modal model training and application method comprises three constituent units: data collection; the molecular multi-modal (MoMu) model and its pre-training; and text-based molecular graph generation.
A data collection unit:
The data collection unit constructs a dataset of molecular graph-text pairs, which is used to pre-train the molecular multi-modal model.
The dataset is constructed as follows. First, the names, synonyms and SMILES strings of the first 50K molecular compounds in PubChem are collected; the PubChem database contains basic information on more than 150 million chemicals. To obtain molecular graphs of the collected compounds, each SMILES string is converted into a molecular graph using the smiles2graph function provided by OGB. Text related to the corresponding molecules is then retrieved from published scientific papers in the S2ORC database to serve as weak semantic supervision. S2ORC is a corpus containing 136 million papers from different fields; only papers from the medical, biological, chemical and computer science fields are extracted, as they are more likely to contain descriptions related to molecules. To avoid, as far as possible, special characters in the text that are related to experimental data, retrieval is performed only from the abstract, introduction and conclusion sections of each extracted paper.
As shown in FIG. 1, for each molecule, sentences containing its name are first retrieved using the name as a query. Each retrieved sentence and its neighboring sentences are recorded as a paragraph in a document. If fewer than two paragraphs are retrieved by name, the search is repeated using the synonyms or aliases of the molecule as queries. Retrieval for a molecule is terminated early once 5,000 paragraphs have been retrieved or the document size exceeds 500 MB. A corresponding text description could not be retrieved for every one of the 50,000 molecules. In the end, 15,613 molecular graph-document pairs were obtained to form the multi-modal molecular dataset. The collected documents contain about 37 million paragraphs in total. In each pair, the sentences in the document carry weakly related semantic information about the corresponding molecular graph.
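The retrieval procedure of FIG. 1 can be sketched as follows. This is a minimal illustration under assumptions: the function names are hypothetical, the sentence splitter is a naive regular expression (the patent does not specify one), "neighboring sentences" is taken to mean one sentence on each side, synonym hits are appended to any name hits, and the 500 MB size limit is left out for brevity.

```python
import re


def split_sentences(text):
    """Naive sentence splitter (assumption: the source does not specify one)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def retrieve_paragraphs(queries, sections, max_paragraphs):
    """Return windows of [previous, matching, next] sentences mentioning any query."""
    paragraphs = []
    for section in sections:
        sentences = split_sentences(section)
        for i, sentence in enumerate(sentences):
            if any(q.lower() in sentence.lower() for q in queries):
                window = sentences[max(0, i - 1): i + 2]
                paragraphs.append(" ".join(window))
                if len(paragraphs) >= max_paragraphs:
                    return paragraphs  # early termination
    return paragraphs


def build_document(name, synonyms, sections, max_paragraphs=5000, min_by_name=2):
    """Query by name first; fall back to synonyms if fewer than two paragraphs are found."""
    paragraphs = retrieve_paragraphs([name], sections, max_paragraphs)
    if len(paragraphs) < min_by_name and synonyms:
        remaining = max_paragraphs - len(paragraphs)
        for p in retrieve_paragraphs(synonyms, sections, remaining):
            if p not in paragraphs:  # skip windows already found by name
                paragraphs.append(p)
    return paragraphs
```

Here `sections` stands for the abstract, introduction and conclusion texts of the pre-filtered papers; the returned list of paragraphs is the document paired with the molecule's graph.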
Molecular multi-modal (MoMu) model and pre-training unit thereof:
The overall architecture and pre-training process of the molecular multi-modal (MoMu) model are shown in FIG. 2. MoMu consists of a graph encoder and a text encoder, which encode the molecular graph and the text, respectively, into a joint representation space. A graph isomorphism network (GIN) is used as the graph encoder, and the widely used language model BERT is used as the text encoder.
Unlike the general image-text data used to train image-text multi-modal foundation models, graph-text data related to molecules is scarce and is insufficient for training the molecular graph and text encoders from scratch. Just as humans need the ability to read graphs and language before acquiring expert knowledge, an AI system that learns expert molecular knowledge needs to be built on pre-trained general-purpose graph and text encoders. Thus, the graph encoder is initialized with self-supervised pre-trained GIN weights, and the text encoder is initialized with the pre-trained BERT weights provided by Sci-BERT or KV-PLM. MoMu initialized with the weights of Sci-BERT and KV-PLM is denoted MoMu-S and MoMu-K, respectively.
MoMu is then trained on the collected paired dataset. For each graph-document pair in a mini-batch, two separate graphs are created from the molecular graph using two different types of graph augmentation, following the data augmentation introduced by GraphCL. Specifically, two types of augmentation are considered: node dropping and subgraph sampling. Node dropping randomly discards a portion of the vertices of the original graph; for molecular graphs, the absence of certain atoms (e.g., some hydrogen atoms in a compound) does not change the semantic information. Subgraph sampling extracts a subgraph from the original graph using a random walk; the properties of a molecule bear some similarity to those of the molecule formed by its subgraph, for example when both contain the same functional groups. Two different sentences are then randomly extracted from the document. Thus, each modality has two samples carrying the same semantic information.
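The two augmentations can be sketched on a plain adjacency-dictionary graph. This is an illustration under assumptions, not the GraphCL implementation: the function names and the `{node: set_of_neighbors}` representation are hypothetical, and the default rates are the 10% drop rate and 80% subgraph size stated in the implementation details.

```python
import random


def drop_nodes(adj, drop_rate=0.10, rng=random):
    """Node dropping: remove a random fraction of nodes and their incident edges."""
    nodes = list(adj)
    n_drop = int(len(nodes) * drop_rate)
    dropped = set(rng.sample(nodes, n_drop))
    return {
        u: {v for v in nbrs if v not in dropped}
        for u, nbrs in adj.items()
        if u not in dropped
    }


def random_walk_subgraph(adj, keep_rate=0.80, rng=random, max_steps=10_000):
    """Subgraph sampling: random walk until keep_rate of the nodes have been visited."""
    nodes = list(adj)
    target = max(1, int(len(nodes) * keep_rate))
    current = rng.choice(nodes)
    visited = {current}
    for _ in range(max_steps):
        if len(visited) >= target:
            break
        neighbors = list(adj[current])
        if not neighbors:  # isolated node: the walk cannot continue
            break
        current = rng.choice(neighbors)
        visited.add(current)
    # Keep only edges whose endpoints were both visited
    return {u: {v for v in adj[u] if v in visited} for u in visited}


def two_views(adj, rng=random):
    """Produce the two augmented views of one molecular graph."""
    return drop_nodes(adj, rng=rng), random_walk_subgraph(adj, rng=rng)
```

Passing a seeded `random.Random` instance as `rng` makes the augmentations reproducible.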
For the graph modality, given a mini-batch of N molecular graph-document pairs with graphs $\{G_1, \dots, G_N\}$, each graph yields two different augmentations, producing 2N augmented graphs in total. These graphs are then fed into the graph encoder to obtain their representation vectors, two for each graph $G_i$, one per augmented version. Meanwhile, the text encoder encodes two different sentences describing $G_i$ into two text representations; since each molecular graph in the mini-batch corresponds to two different sentences, 2N text representations are obtained. Thus, for the i-th graph $G_i$, the total multi-view loss comprises four contrastive losses, one for each of the four cross-modal pairs formed by its two graph representations and its two text representations.
Each of these is a contrastive loss with temperature parameter τ, in which the similarity calculation module first projects the graph and text representations to the same dimension and then computes the cosine similarity between the projected vectors. All four cross-modal contrastive losses have the same form.
To further enhance the representation ability of the graph encoder, contrastive learning is also applied within the graph modality. Specifically, by minimizing a normalized cross-entropy loss, the features of a positive pair are pulled together while the features of negative pairs are pushed apart, where a positive pair consists of two augmentations of the same molecular graph and a negative pair comes from different molecular graphs. Based on the previous definitions, the graph-modality contrastive loss of the i-th graph is this loss applied to the representations of its two augmented graphs.
Here τ is the temperature parameter, and the final loss is computed over all samples in the mini-batch.
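A minimal standard-library sketch of the loss just described is given below. It is written under assumptions: inputs are taken to be already-projected vectors, the names `info_nce` and `batch_loss` are hypothetical, only the graph-to-text direction of each cross-modal pair is included, and the balance factor `lam` is applied to the graph-modality term.

```python
import math


def cosine(u, v):
    """Cosine similarity between two projected vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def info_nce(anchors, candidates, tau=0.1):
    """Per-sample normalized cross-entropy: candidate i is the positive for anchor i."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, c) / tau for c in candidates]
        peak = max(logits)  # subtract the maximum for numerical stability
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        losses.append(log_denominator - logits[i])
    return losses


def batch_loss(g1, g2, t1, t2, tau=0.1, lam=1.0):
    """Sum over the mini-batch of four cross-modal terms plus lam times the graph term.

    g1, g2: representations of the two augmented views of each graph
    t1, t2: representations of the two sentences sampled for each graph
    """
    total = 0.0
    for graphs in (g1, g2):
        for texts in (t1, t2):
            total += sum(info_nce(graphs, texts, tau))
    total += lam * sum(info_nce(g1, g2, tau))
    return total
```

With correctly aligned graph and text representations the loss is close to zero, and it grows when the pairing is shuffled, which is the behavior the pre-training objective relies on.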
The pre-trained MoMu can process both molecular graphs and natural language text in a unified manner and acquires general, transferable knowledge from these heterogeneous data, which generalizes readily to different downstream tasks.
Implementation details:
A 5-layer GIN with a hidden dimension of 300 is used as the graph encoder. The text encoder is a BERT with a hidden size of 768. Two multi-layer perceptrons project the graph and sentence features into the same feature space, each with an output dimension of 256. For the two graph augmentations, the node drop rate is 10% and the sampled subgraph contains 80% of the original graph. The input to the text modality consists of two sentences randomly selected from the document of the corresponding graph. The model is pre-trained for 300 epochs using the AdamW optimizer with a learning rate of 0.0001 and a weight decay of 1e-5. τ is set to 0.1 and the batch size to 256. The entire pre-training process is implemented in PyTorch and run on 8 NVIDIA Tesla V100 PCIe 32GB GPUs.
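The hyper-parameters above can be gathered into a single configuration object, as in the sketch below. The class and field names are hypothetical; the values are those stated in the text. The step count uses the 15,613 graph-document pairs reported for the dataset and assumes that the last, smaller mini-batch of each epoch is kept.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class PretrainConfig:
    gin_layers: int = 5
    gin_hidden_dim: int = 300
    bert_hidden_dim: int = 768
    projection_dim: int = 256      # output size of each mapping head
    node_drop_rate: float = 0.10
    subgraph_ratio: float = 0.80
    learning_rate: float = 1e-4    # AdamW
    weight_decay: float = 1e-5
    epochs: int = 300
    temperature: float = 0.1       # tau
    batch_size: int = 256


def steps_per_epoch(num_pairs, batch_size, drop_last=False):
    """Number of mini-batches per epoch; keeping the last partial batch is an assumption."""
    if drop_last:
        return num_pairs // batch_size
    return math.ceil(num_pairs / batch_size)


cfg = PretrainConfig()
total_steps = cfg.epochs * steps_per_epoch(15_613, cfg.batch_size)
```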
Application of a molecular multi-modal model in a cross-modal retrieval task. Because the MoMu model provided by the invention is pre-trained by matching weakly relevant text with corresponding molecular patterns, it is able to process the patterns and text modalities of the molecules.
We evaluate its performance in cross-modal retrieval. Given a molecular graph, graph-to-text (G-T) retrieval aims to retrieve the text description most relevant to the molecule. Conversely, given a paragraph of text, text-to-graph (T-G) retrieval aims to retrieve the molecular graph it most closely describes. MoMu is evaluated on the PCdes dataset, which contains SMILES strings and paired text descriptions for 15K molecules in PubChem. The dataset is divided into a training set of 10,500 pairs, a validation set of 1,500 pairs and a test set of 3,000 pairs (two SMILES strings in the test set cannot be converted to graphs by RDKit, so the remaining 2,998 pairs are used for testing). We convert the SMILES string in each pair into a molecular graph. In the G-T/T-G task, the MoMu graph/text encoder extracts the representation of the query graph/text, the MoMu text/graph encoder extracts the representations of all candidate texts/graphs to be retrieved, the cosine similarity between the query representation and every candidate representation is computed, and the candidates are ranked in descending order of similarity. Following the experimental setup in [11], retrieval is performed within randomly sampled mini-batches (64 pairs per batch) and over all test pairs, and the average top-1 accuracy and the top-20 recall (mean ± standard deviation) are reported, respectively. In that work, one sentence is randomly extracted from the text of each molecule for retrieval; we refer to this setting as sentence-level retrieval. We additionally evaluate a setting that uses the complete paragraph description of each molecule, referred to as paragraph-level retrieval.
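The ranking and scoring procedure described above can be sketched as follows. The vectors stand in for projected MoMu representations, `rank_keys` and `retrieval_metrics` are hypothetical names, and the sketch assumes that query i is paired with candidate i, as in a test set of matched graph-text pairs.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def rank_keys(query, keys):
    """Indices of the candidates, sorted by decreasing cosine similarity to the query."""
    sims = [cosine(query, k) for k in keys]
    return sorted(range(len(keys)), key=lambda j: sims[j], reverse=True)

def retrieval_metrics(queries, keys, k=20):
    """Top-1 accuracy and recall@k, assuming queries[i] is paired with keys[i]."""
    top1_hits = 0
    topk_hits = 0
    for i, query in enumerate(queries):
        ranking = rank_keys(query, keys)
        top1_hits += ranking[0] == i
        topk_hits += i in ranking[:k]
    return top1_hits / len(queries), topk_hits / len(queries)

# Toy graph-to-text retrieval with three pairs (k=2 because the pool is tiny).
graph_reps = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
text_reps = [[0.9, 0.1, 0.0], [0.2, 0.3, 0.9], [0.1, 0.5, 0.6]]
top1_accuracy, recall_at_2 = retrieval_metrics(graph_reps, text_reps, k=2)
```

Swapping the two argument lists gives the text-to-graph direction; averaging the metrics over many sampled batches of 64 pairs would correspond to the batch-level setting.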
Table 1 compares the method of the present invention with Sci-BERT [8], KV-PLM and a KV-PLM variant (the two KV-PLM models differ in how SMILES tokens are handled). For a fair comparison, all methods are fine-tuned on the PCdes training set. Across the different settings of the G-T and T-G tasks, MoMu-S and MoMu-K outperform the other methods, which retrieve directly from SMILES strings. Compared with KV-PLM, which jointly models molecular SMILES strings and language text, the MoMu of the present invention links molecular structure and natural language description more effectively.
Table 1: Performance of different methods on the PCdes dataset in graph-to-text (G-T) retrieval and text-to-graph (T-G) retrieval. The sentence-level retrieval results of Sci-BERT and KV-PLM are those reported in the literature.
Because some of the 15K molecular graph-text pairs in PCdes may already have been collected as pre-training data for MoMu, 5,562 graph-text pairs with molecule IDs between 50,000 and 100,000 were collected from PubChem; these were not used for pre-training. Table 2 shows the comparison with Sci-BERT and KV-PLM on this zero-shot retrieval test set. MoMu-S and MoMu-K perform significantly better than Sci-BERT and KV-PLM, further demonstrating the generalization ability of MoMu. On both datasets, MoMu-S and MoMu-K perform comparably, i.e., initializing the MoMu text encoder with KV-PLM does not yield better performance than initializing it with Sci-BERT. This suggests that structural information learned from one-dimensional SMILES strings does not transfer easily to structured molecular graphs, whereas MoMu captures structural information directly with a graph neural network under the supervision of the language descriptions.
Table 2: Performance of different methods on our collected dataset in zero-shot graph-to-text (G-T) retrieval and zero-shot text-to-graph (T-G) retrieval.
Text-based molecular graph generation unit:
As shown in FIG. 3, the zero-shot text-to-graph molecule generation method consists of the MoMu-based similarity measurement module and a molecular generator that is based on random-seed sampling and allows gradient back-propagation. The invention is illustrated with the flow-based molecular generator MoFlow as an example. MoFlow defines a parameterized invertible mapping from a Gaussian distribution to the molecular distribution. A molecular graph G consists of an atom matrix $V \in \{0,1\}^{N \times C_a}$ and a bond matrix $E \in \{0,1\}^{N \times N \times C_b}$, where N is the number of atoms in the molecule and $C_a$ and $C_b$ are the numbers of atom types and bond types. If the n-th atom belongs to the c-th atom type, $V_{n,c} = 1$; otherwise $V_{n,c} = 0$. If the bond between the n-th atom and the n'-th atom is of the c'-th bond type, $E_{n,n',c'} = 1$; otherwise $E_{n,n',c'} = 0$. MoFlow contains a graph conditional flow, $q_v = f_c(V \mid E)$, which encodes the atom matrix V given the bond matrix E into a latent variable $q_v$, and a bond flow (gflow), $q_e = f_g(E)$, which encodes the bond matrix E into a latent variable $q_e$. $f_c$ and $f_g$ are implemented by graph convolutional neural networks based on graph coupling layers. The concatenation $q = [q_v; q_e]$ follows a Gaussian distribution P(q). After MoFlow is trained, a variable q can be sampled from P(q) and split into the two parts $q_v$ and $q_e$, which are input to the reverse graph conditional flow $f_c^{-1}$ and the reverse gflow $f_g^{-1}$ to obtain the probability matrices:
$\tilde{E} = f_g^{-1}(q_e), \quad \tilde{V} = f_c^{-1}(q_v \mid \mathrm{GN}(\tilde{E}))$
where $\tilde{E}_{n,n',c'}$ is the predicted probability that the bond between the n-th atom and the n'-th atom belongs to the c'-th bond type, and $\tilde{V}_{n,c}$ is the probability that the n-th atom belongs to the c-th atom type. V and E are obtained by applying the argmax operation over the last dimension of $\tilde{V}$ and $\tilde{E}$. GN is the graph normalization layer. By sampling different q from P(q), MoFlow can generate different novel and valid molecules.
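As a small illustration of the final decoding step, the sketch below turns probability matrices into one-hot atom and bond matrices by taking the argmax over the last dimension. Nested lists stand in for tensors, the probabilities are made up for the example, and `one_hot_argmax`, `decode_atoms` and `decode_bonds` are hypothetical names.

```python
def one_hot_argmax(probs):
    """One-hot vector marking the most probable type (first index wins ties)."""
    best = max(range(len(probs)), key=lambda c: probs[c])
    return [1 if c == best else 0 for c in range(len(probs))]

def decode_atoms(v_prob):
    """N x C_a probability matrix -> one-hot atom matrix V."""
    return [one_hot_argmax(row) for row in v_prob]

def decode_bonds(e_prob):
    """N x N x C_b probability tensor -> one-hot bond matrix E."""
    return [[one_hot_argmax(cell) for cell in row] for row in e_prob]

# Toy case: N = 2 atoms, C_a = 3 atom types, C_b = 2 bond types.
v_prob = [[0.7, 0.2, 0.1],
          [0.1, 0.3, 0.6]]
e_prob = [[[0.9, 0.1], [0.2, 0.8]],
          [[0.2, 0.8], [0.9, 0.1]]]
V = decode_atoms(v_prob)
E = decode_bonds(e_prob)
```

The argmax is not differentiable, which is why the generation procedure feeds the probabilities rather than the decoded matrix into the graph encoder during optimization.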
The zero-shot text-to-graph molecule generation method takes a text description $x_T$ as input. The only learnable parameter in the method is q, which is initialized by random sampling from P(q). All parameters of the pre-trained MoMu and MoFlow are frozen. The input $x_T$ is fed into the MoMu text encoder to obtain the text representation $z_T$. q is input to MoFlow to obtain $\tilde{V}$ and $\tilde{E}$. To keep all operations differentiable, and thus allow gradient back-propagation, $\tilde{V}$ rather than V is input to the MoMu graph encoder to obtain the graph representation $z_G$. The trained graph encoder GIN contains embeddings for all atom types and bond types. The original V serves as an index that selects the corresponding atom-type embedding in the first layer. When $\tilde{V}$ is used instead, the representation obtained for each atom is a weighted sum of all atom embeddings, with $\tilde{V}$, the probability of each atom type at each node, acting as the attention scores. The loss function is the negative cosine similarity between the projected $z_T$ and $z_G$:
$\ell_q = -\mathrm{sim}(z_G, z_T)/\tau,$
where $\mathrm{sim}(z_G, z_T)$ computes the similarity between the projected representations in MoMu. q is updated by back-propagating the gradient of $\ell_q$, using the Adam optimizer. After up to 500 update iterations, the optimized q is input to MoFlow to obtain $\tilde{V}$ and $\tilde{E}$. Finally, the argmax operation is applied over the last dimension of $\tilde{V}$ and $\tilde{E}$ to obtain the molecular graph G = (V, E).
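The seed-optimization loop can be sketched end to end on a toy problem. Everything here is a stand-in chosen so the example runs with the standard library alone: a row-wise softmax replaces MoFlow, a mean of soft atom embeddings replaces the GIN encoder and projection head, the embeddings in `ATOM_EMB` are hypothetical, a finite-difference gradient replaces back-propagation, and plain gradient descent replaces Adam. Only the structure (learnable q, soft weighted-sum embeddings, loss equal to the negative similarity divided by τ, final argmax) follows the text.

```python
import math
import random

TAU = 0.1
ATOM_EMB = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical embeddings of 3 atom types

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [x / total for x in exps]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def graph_representation(q):
    """Toy generator + encoder: probabilities -> soft atom embeddings -> mean pooling."""
    v_prob = [softmax(row) for row in q]
    dim = len(ATOM_EMB[0])
    # Soft embedding: probability-weighted sum of all atom-type embeddings.
    soft = [[sum(p * ATOM_EMB[c][d] for c, p in enumerate(row)) for d in range(dim)]
            for row in v_prob]
    return [sum(h[d] for h in soft) / len(soft) for d in range(dim)]

def loss(q, z_text):
    """Negative cosine similarity between graph and text representations, over tau."""
    return -cosine(graph_representation(q), z_text) / TAU

def numeric_gradient(q, z_text, eps=1e-5):
    """Central finite differences, used here in place of back-propagation."""
    grad = [[0.0] * len(row) for row in q]
    for n, row in enumerate(q):
        for c in range(len(row)):
            original = row[c]
            row[c] = original + eps
            up = loss(q, z_text)
            row[c] = original - eps
            down = loss(q, z_text)
            row[c] = original
            grad[n][c] = (up - down) / (2 * eps)
    return grad

def optimize_seed(q, z_text, steps=300, lr=0.2):
    """Update only q; the toy generator and encoder have no trainable parameters."""
    for _ in range(steps):
        grad = numeric_gradient(q, z_text)
        for n in range(len(q)):
            for c in range(len(q[n])):
                q[n][c] -= lr * grad[n][c]
    return q

rng = random.Random(0)
z_text = [0.0, 1.0]  # toy text representation
q_init = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(4)]  # 4 atoms
loss_before = loss(q_init, z_text)
q_opt = optimize_seed([row[:] for row in q_init], z_text)
loss_after = loss(q_opt, z_text)
atom_types = [max(range(3), key=lambda c: row[c]) for row in q_opt]  # final argmax
```

In a real setting the gradient would come from automatic differentiation through the frozen generator and encoder, and the toy loss bound of -1/τ would be approached only as far as the generator allows.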
The molecules generated by the method of the invention from descriptions of molecular functions are shown in FIG. 4. For the description "fluorescent molecules", MolT5 fails to produce a valid molecule, whereas the method of the present invention produces molecules containing conjugated double bonds or conjugated structures. For the description "the molecule includes a hydroxyl group and a carboxyl group, is capable of decomposing to produce ammonia gas, and has an oxygen content exceeding 20%", various molecules with a hydroxyl group, a high oxygen content and a nitrogen atom capable of yielding ammonia are successfully generated, so three-quarters of the conditions are satisfied. Unlike existing AI-based molecular generation methods, which can only generate molecules with pre-specified properties, the present method adaptively generates molecular candidates from input text that may describe any desired condition or set of conditions. The last description specifies three desired molecular properties, namely high water solubility, high barrier permeability and low toxicity, which can be evaluated with fine-tuned property prediction models. Based on MoMu-S and MoMu-K, the method of the present invention generates different molecules with high permeability, low toxicity and high water solubility.
The molecules generated from descriptions of molecular structure are shown in FIG. 5. For descriptions containing nucleophilic groups, the method of the invention produces different molecules with amino groups, hydroxyl groups or double bonds. For descriptions containing electrophilic groups, the method is able to generate different molecules with carbonyl groups, alkyl-like groups or halogen atoms, even though formal charges are suppressed. For descriptions containing hydrophilic groups, the method is able to produce molecules of different structures that contain hydroxyl, amino or aldehyde groups. For descriptions containing lipophilic groups, the method produces molecules of different structures that contain alkyl groups, halogen atoms or benzene rings.
Parts of the present invention that are not described in detail are well known to those skilled in the art.
Claims (7)
1. A molecular multi-modal model training and application method, characterized in that the method comprises the following steps:
S100, establishing a data collection unit, extracting molecular graphs and their weakly semantically related text data from published SCI papers, and constructing a molecular graph dataset;
S200, establishing a molecular multi-modal model and its pre-training unit, constructing the molecular multi-modal model comprising a graph encoder and a text encoder, and training the model by a contrastive learning method;
S500, establishing a text-based molecular graph generation unit, applying the model to different downstream tasks such as cross-modal retrieval and molecular graph generation based on text description, and finally outputting the generated molecular graph.
2. The molecular multi-modal model training and application method according to claim 1, wherein the method by which the data collection unit constructs the dataset comprises:
S201, collecting the names, synonyms and SMILES strings of the first 50K molecular compounds in PubChem;
S202, for each collected molecule, converting the SMILES string into a molecular graph using the smiles2graph function provided by OGB;
S203, using the name of the molecule as a query, retrieving sentences containing the name from the abstract, introduction and conclusion sections of published scientific papers in the medical, biological, chemical and computer science fields in the S2ORC database, and recording each retrieved sentence together with its adjacent sentences in a document as one paragraph; if fewer than two paragraphs are retrieved by name, retrieving again with the synonyms or aliases of the molecule as queries; and terminating the retrieval for the molecule early once a specified number of paragraphs or a specified document size has been reached;
S204, obtaining molecular graph-document pairs to form a multi-modal molecular dataset.
3. The molecular multi-modal model training and application method according to claim 2, wherein: the molecular multi-modal model consists of a graph encoder and a text encoder, which extract a molecular graph representation and a text representation, respectively; a graph isomorphism network (GIN) is used as the graph encoder and the language model BERT is used as the text encoder; in the training stage the model additionally uses a similarity calculation module, which uses two mapping heads to map the molecular graph representation and the text representation, respectively, into a joint representation space and then calculates the cosine similarity of the mapped features.
4. The molecular multi-modal model training and application method according to claim 3, wherein the pre-training method of the molecular multi-modal model specifically comprises the following steps:
S401, initializing the graph encoder with self-supervised pre-trained weights of a graph isomorphism network, and initializing the text encoder with the pre-trained BERT weights of Sci-BERT or KV-PLM;
S402, for each training epoch, sequentially sampling mini-batches of N molecular graph-text pairs from the training data;
S403, for each mini-batch $\{G_1, ..., G_N\}$, generating two different augmentations of each graph by random node deletion and random subgraph sampling, respectively, giving 2N augmented graphs in total, and inputting these graphs into the graph encoder to obtain their representation vectors, two for each graph $G_i$, one per augmentation;
S404, randomly extracting two different sentences from the document corresponding to each molecule, and obtaining through the text encoder the representations of the two sentences describing the i-th graph $G_i$; each molecular graph in a mini-batch corresponds to two different sentences, giving 2N text representations;
S405, for the i-th graph $G_i$, the total multi-view loss comprises four contrastive losses, one for each of the four cross-modal pairs formed by its two graph representations and its two text representations: [equation image not reproduced]
where τ is a temperature parameter and sim(·,·) is the similarity calculation module, which first projects the graph and text representations to the same dimension and then calculates the cosine similarity between the projected vectors;
S406, calculating the contrastive loss of the graph modality, the graph-modal contrastive loss of the i-th graph being: [equation image not reproduced]
S407, calculating the sum of the losses of all samples in the batch: [equation image not reproduced]
where λ is the balance factor between the cross-modal loss and the graph-modal loss and is a hyper-parameter;
S408, for each batch, updating the parameters of the graph encoder, the text encoder and the mapping heads by back-propagating the total loss L, until all batches in the current epoch have been processed;
S409, repeating S402-S408 until the preset maximum number of epochs is reached.
5. The molecular multi-modal model training and application method according to claim 4, wherein the text-based molecular graph generation method comprises the following steps:
the method uses the molecular multi-modal model trained by the pre-training method and a pre-trained molecular generator that is based on random-seed sampling and allows gradient back-propagation, the parameters of both the molecular multi-modal model and the molecular generator being fixed;
S501, inputting a text $x_T$ describing a molecule;
S502, initializing a generation seed q and setting q as a learnable parameter;
S503, generating a molecular graph $x_G$ from q using the trained molecular generator;
S504, sending $x_G$ and $x_T$ to the graph encoder and the text encoder of the pre-trained molecular multi-modal model, respectively, extracting the corresponding graph representation $z_G$ and text representation $z_T$, and calculating the negative of their similarity with the similarity calculation module of the molecular multi-modal model as the loss function:
$\ell_q = -\mathrm{sim}(z_G, z_T)/\tau,$
S505, back-propagating the loss $\ell_q$ and updating the seed q by a gradient descent method;
S506, repeating steps S503-S505 until the preset maximum number of iterations is reached;
S507, sending the optimized q to the molecular generator, generating the final molecular graph and outputting it.
6. A storage medium, characterized in that it stores a program implementing the molecular multi-modal model training and application method as defined in any one of claims 1-5.
7. A chip, characterized in that it uses the storage medium of claim 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211099018.2A CN116168775A (en) | 2022-09-07 | 2022-09-07 | Molecular multi-mode model training and application method, storage medium and chip |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116168775A true CN116168775A (en) | 2023-05-26 |
Family
ID=86413784
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211099018.2A Pending CN116168775A (en) | 2022-09-07 | 2022-09-07 | Molecular multi-mode model training and application method, storage medium and chip |
Publication | Publication Date | Title |
---|---|---|
CN111159223B (en) | Interactive code searching method and device based on structured embedding | |
CN110110324B (en) | Biomedical entity linking method based on knowledge representation | |
CN109670177A (en) | One kind realizing the semantic normalized control method of medicine and control device based on LSTM | |
CN115048447B (en) | Database natural language interface system based on intelligent semantic completion | |
CN111858940B (en) | Multi-head attention-based legal case similarity calculation method and system | |
CN111640471A (en) | Method and system for predicting activity of drug micromolecules based on two-way long-short memory model | |
CN111666762B (en) | Intestinal cancer diagnosis electronic medical record attribute value extraction method based on multitask learning | |
CN116204674B (en) | Image description method based on visual concept word association structural modeling | |
CN114913938B (en) | Small molecule generation method, equipment and medium based on pharmacophore model | |
CN111126040A (en) | Biomedical named entity identification method based on depth boundary combination | |
CN110428907A (en) | A kind of text mining method and system based on unstructured electronic health record | |
CN112151127A (en) | Unsupervised learning drug virtual screening method and system based on molecular semantic vector | |
Schäfer et al. | UMLS mapping and Word embeddings for ICD code assignment using the MIMIC-III intensive care database | |
CN113836896A (en) | Patent text abstract generation method and device based on deep learning | |
CN110299194B (en) | Similar case recommendation method based on comprehensive feature representation and improved wide-depth model | |
CN111581964A (en) | Theme analysis method for Chinese ancient books | |
CN111145914A (en) | Method and device for determining lung cancer clinical disease library text entity | |
CN112035629B (en) | Method for implementing question-answer model based on symbolized knowledge and neural network | |
Hu et al. | An integrated pipeline model for biomedical entity alignment | |
CN111782818A (en) | Device, method and system for constructing biomedical knowledge graph and memory | |
CN115964475A (en) | Dialogue abstract generation method for medical inquiry | |
CN116629361A (en) | Knowledge reasoning method based on ontology learning and attention mechanism | |
CN116168775A (en) | Molecular multi-mode model training and application method, storage medium and chip | |
CN115033706A (en) | Method for automatically complementing and updating knowledge graph | |
Wang et al. | Bi-directional joint embedding of encyclopedic knowledge and original text for chinese medical named entity recognition |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||