CN116168775A - Molecular multi-mode model training and application method, storage medium and chip - Google Patents

Info

Publication number
CN116168775A
Authority
CN
China
Prior art keywords
molecular
graph
text
model
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211099018.2A
Other languages
Chinese (zh)
Inventor
苏冰
文继荣
杜大钊
杨钊
周彧杰
李江梦
孙浩
卢志武
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Renmin University of China
Original Assignee
Renmin University of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Renmin University of China filed Critical Renmin University of China
Priority to CN202211099018.2A priority Critical patent/CN116168775A/en
Publication of CN116168775A publication Critical patent/CN116168775A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/70 Machine learning, data mining or chemometrics
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/126 Character encoding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00 Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/90 Programming languages; Computing architectures; Database systems; Data warehousing
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Chemical & Material Sciences (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention realizes a molecular multi-mode model training and application method, a storage medium, a chip and a system by a method in the field of network security. First, candidate document sets interact with sub-topics or queries through the Encoder structure of a Transformer. After representations of the documents and sub-topics are obtained, combination weights are modeled from the selected documents, all candidate documents and the sub-topics; explicit scores and implicit scores are obtained through interaction, and finally the explicit and implicit scores are combined into a final diversification score using the updated weights. The method designs a model that combines explicit and implicit features and dynamically adjusts their weights at different steps of different queries, so as to improve the diversification of search results. The model is trained with a list-pairwise LambdaRank loss function, and experimental results demonstrate the effectiveness and interpretability of the model.

Description

Molecular multi-mode model training and application method, storage medium and chip
Technical Field
The invention relates to the technical field of machine learning, in particular to a molecular multi-modal model training and application method, a storage medium and a chip.
Background
Knowledge about molecules and the discovery of molecular properties are critical to scientific exploration in fields such as biomedicine, chemistry and materials science. Traditional exploration requires professionals to carry out large numbers of wet-lab biochemical experiments, which are both expensive and time-consuming. With the progress of deep learning, using artificial intelligence for scientific exploration, such as predicting molecular properties and generating candidate molecules, has become possible, and considerable progress has been made.
However, unlike humans, who understand molecules through multiple modalities, most existing artificial intelligence models target a single cognitive ability (e.g., property prediction, molecule generation, literature understanding) on a single modality of the molecule (e.g., molecular graph, SMILES string, text). These models fall into two main categories. Language-based models take natural language related to molecular knowledge and/or SMILES strings as input. For example, molecular property prediction models for SMILES strings were designed by Sheng Wang et al. in "SMILES-BERT: Large scale unsupervised pre-training for molecular property prediction" and by Seyone Chithrananda et al. in "ChemBERTa: Large-scale self-supervised pretraining for molecular property prediction"; the work of Iz Beltagy et al. ("SciBERT: A pretrained language model for scientific text"), Diya Li et al. ("Biomedical event extraction based on knowledge-driven tree-LSTM") and Jinhyuk Lee et al. ("BioBERT: a pre-trained biomedical language representation model for biomedical text mining") focused on learning from biochemical text; Zheni Zeng et al., in "A deep-learning system bridging molecule structure and biomedical text with comprehension comparable to human professionals", developed a deep learning system that jointly learns from molecule-related text and molecular SMILES strings to establish relationships between them; and in the molecule generation models of Jike Wang et al. ("Multi-constraint molecular generation based on conditional transformer, knowledge distillation and reinforcement learning"), Jeff Guo et al. ("Improving de novo molecular design with curriculum learning"), Daniel Flam-Shepherd et al. ("Language models can learn complex molecular distributions") and Samuel C. Hoffman et al. ("Optimizing molecules using efficient queries from property evaluations"), the generated molecules are represented as SMILES strings. Graph-based models, by contrast, can only handle molecular graphs.
Current graph neural network (GNN) based molecular property prediction models learn from molecular graphs, and graph-based generative models learn directly from graph data to generate molecular graphs. Training these models requires a large number of manual annotations or a collection of specific properties, while other properties of the molecule and related conditions are typically ignored. These models therefore handle only one form of the molecule and cannot obtain a comprehensive understanding of it.
Disclosure of Invention
To this end, the invention provides a molecular multi-modal model training and application method, a storage medium and a chip. A graph encoder and a text encoder are jointly learned from multi-modal molecular data so as to associate a molecular graph with its biomedical text descriptions, thereby overcoming the limitations of existing machine learning models that rely on single-modality molecular data.
The invention first provides a molecular multi-modal model training and application method, which comprises the following steps:
S100, establishing a data collection unit: molecular graphs and their weakly semantically related text data are extracted from published SCI papers, and a molecular graph-text data set is constructed;
S200, establishing a molecular multi-modal model and its pre-training unit: a molecular multi-modal model comprising a graph encoder and a text encoder is constructed and trained by contrastive learning;
S300, establishing a text-based molecular graph generation unit, and applying the model to different downstream tasks such as cross-modal retrieval and text-based molecular graph generation.
The data set of the data collection unit is constructed as follows:
S201, collecting the names, synonyms and SMILES strings of the first 50K compounds in PubChem;
S202, for each collected molecule, converting the SMILES string into a molecular graph using the smiles2graph function provided by OGB;
S203, using the name of the molecule as a query, retrieving sentences containing the name from the abstract, introduction and conclusion sections of published scientific papers in the medicine, biology, chemistry and computer science fields of the S2ORC database; recording each retrieved sentence together with its adjacent sentences in a document as one paragraph; if fewer than two paragraphs are retrieved by name, searching again using the synonyms or aliases of the molecule as queries; and terminating the retrieval for the molecule early once a specified number of paragraphs or a specified document size is reached;
S204, obtaining molecular graph-document pairs to form the multi-modal molecular data set.
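The retrieval logic of S203 can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the corpus is a plain list of sentences, matching is case-insensitive substring search, and the function names `retrieve_paragraphs` and `build_document` are hypothetical. The document-size limit is omitted, and only the paragraph-count limit is modelled.

```python
from typing import List


def retrieve_paragraphs(queries: List[str], sentences: List[str], limit: int) -> List[str]:
    """Return each sentence that mentions a query, joined with its neighbours."""
    lowered = [q.lower() for q in queries]
    paragraphs: List[str] = []
    for i, sentence in enumerate(sentences):
        if len(paragraphs) >= limit:
            break  # early termination once enough paragraphs are collected
        if any(q in sentence.lower() for q in lowered):
            # One paragraph = previous sentence + hit + next sentence.
            paragraphs.append(" ".join(sentences[max(0, i - 1): i + 2]))
    return paragraphs


def build_document(name: str, synonyms: List[str], sentences: List[str],
                   min_paragraphs: int = 2, max_paragraphs: int = 5000) -> List[str]:
    """Query by name first; fall back to synonyms if too few paragraphs are found."""
    paragraphs = retrieve_paragraphs([name], sentences, max_paragraphs)
    if len(paragraphs) < min_paragraphs and synonyms:
        budget = max_paragraphs - len(paragraphs)
        extra = retrieve_paragraphs(synonyms, sentences, budget)
        paragraphs += [p for p in extra if p not in paragraphs]
    return paragraphs
```

Calling `build_document("aspirin", ["acetylsalicylic acid"], corpus)` would first search by name and only consult the synonym list when fewer than two paragraphs come back.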
The molecular multi-modal model consists of a graph encoder and a text encoder, which extract a molecular graph representation and a text representation, respectively. A Graph Isomorphism Network (GIN) is used as the graph encoder, and the language model BERT is used as the text encoder. During training the model additionally uses a similarity calculation module, which uses two projection heads to map the molecular graph representation and the text representation into a joint representation space and then calculates the cosine similarity of the projected features.
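A minimal sketch of the similarity calculation module follows. It assumes, for brevity, that each projection head is a single linear map given as a list of weight rows (the embodiment described later uses multi-layer perceptrons); the names `linear`, `cosine` and `similarity` are illustrative, not taken from the source.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def linear(x: Vector, weight: Matrix) -> Vector:
    """Project x with a weight matrix stored as a list of output rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b)


def similarity(graph_feat: Vector, text_feat: Vector,
               w_graph: Matrix, w_text: Matrix) -> float:
    """Map both modalities into the joint space, then compare by cosine."""
    return cosine(linear(graph_feat, w_graph), linear(text_feat, w_text))
```

The two encoders may output features of different sizes; the projection heads bring both to the same dimension so that the cosine is well defined.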
The pre-training method of the molecular multi-modal model specifically comprises the following steps:
S401, initializing the graph encoder with the self-supervised pre-trained weights of a Graph Isomorphism Network, and initializing the text encoder with the pre-trained BERT weights of Sci-BERT or KV-PLM;
S402, in each training epoch, sequentially sampling mini-batches of N molecular graph-text pairs from the training set;
S403, for each mini-batch $\{G_1, \dots, G_N\}$, generating two different augmentations of each graph by random node dropping and random subgraph sampling, respectively, giving 2N augmented graphs in total, and inputting these graphs into the graph encoder to obtain their representation vectors
$$\{z_1^G, \tilde{z}_1^G, \dots, z_N^G, \tilde{z}_N^G\},$$
where $z_i^G$ and $\tilde{z}_i^G$ denote the representations of the two augmented versions of the i-th graph $G_i$;
S404, randomly extracting two different sentences from the document corresponding to each molecule; the representations obtained by the text encoder for the two sentences describing the i-th graph $G_i$ are denoted $z_i^T$ and $\tilde{z}_i^T$. Each molecular graph in the mini-batch corresponds to two different sentences, yielding 2N text representations
$$\{z_1^T, \tilde{z}_1^T, \dots, z_N^T, \tilde{z}_N^T\};$$
S405, for the i-th graph $G_i$, the total multi-view loss includes four contrastive losses between the four cross-modal representation pairs $(z_i^G, z_i^T)$, $(\tilde{z}_i^G, z_i^T)$, $(z_i^G, \tilde{z}_i^T)$ and $(\tilde{z}_i^G, \tilde{z}_i^T)$, wherein
the contrastive loss for $(z_i^G, z_i^T)$ is
$$\ell_i^{(z_i^G, z_i^T)} = -\log \frac{\exp\left(\mathrm{sim}(z_i^G, z_i^T)/\tau\right)}{\sum_{j=1}^{N} \exp\left(\mathrm{sim}(z_i^G, z_j^T)/\tau\right)},$$
the contrastive loss for $(\tilde{z}_i^G, z_i^T)$ is
$$\ell_i^{(\tilde{z}_i^G, z_i^T)} = -\log \frac{\exp\left(\mathrm{sim}(\tilde{z}_i^G, z_i^T)/\tau\right)}{\sum_{j=1}^{N} \exp\left(\mathrm{sim}(\tilde{z}_i^G, z_j^T)/\tau\right)},$$
the contrastive loss for $(z_i^G, \tilde{z}_i^T)$ is
$$\ell_i^{(z_i^G, \tilde{z}_i^T)} = -\log \frac{\exp\left(\mathrm{sim}(z_i^G, \tilde{z}_i^T)/\tau\right)}{\sum_{j=1}^{N} \exp\left(\mathrm{sim}(z_i^G, \tilde{z}_j^T)/\tau\right)},$$
and the contrastive loss for $(\tilde{z}_i^G, \tilde{z}_i^T)$ is
$$\ell_i^{(\tilde{z}_i^G, \tilde{z}_i^T)} = -\log \frac{\exp\left(\mathrm{sim}(\tilde{z}_i^G, \tilde{z}_i^T)/\tau\right)}{\sum_{j=1}^{N} \exp\left(\mathrm{sim}(\tilde{z}_i^G, \tilde{z}_j^T)/\tau\right)},$$
where $\tau$ is a temperature parameter and $\mathrm{sim}(\cdot,\cdot)$ is the similarity calculation module, which first projects its two inputs to the same dimension and then calculates the cosine similarity between the projection vectors;
S406, calculating the contrastive loss within the graph modality; the graph-modality contrastive loss for the i-th graph is:
$$\ell_i^{(z_i^G, \tilde{z}_i^G)} = -\log \frac{\exp\left(\mathrm{sim}(z_i^G, \tilde{z}_i^G)/\tau\right)}{\sum_{j=1}^{N} \exp\left(\mathrm{sim}(z_i^G, \tilde{z}_j^G)/\tau\right)};$$
S407, calculating the sum of the losses of all samples in the batch:
$$L = \sum_{i=1}^{N} \left[ \ell_i^{(z_i^G, z_i^T)} + \ell_i^{(\tilde{z}_i^G, z_i^T)} + \ell_i^{(z_i^G, \tilde{z}_i^T)} + \ell_i^{(\tilde{z}_i^G, \tilde{z}_i^T)} + \lambda\, \ell_i^{(z_i^G, \tilde{z}_i^G)} \right],$$
where $\lambda$ is a hyper-parameter that balances the cross-modal loss against the graph-modality loss;
S408, for each batch, back-propagating the total loss L to update the parameters of the graph encoder, the text encoder and the projection heads, until all batches of the current epoch have been processed;
S409, repeating S402-S408 until the preset maximum number of epochs is reached.
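The loss computation of S405-S407 can be illustrated with a small pure-Python sketch. It assumes the representations have already been passed through the projection heads, uses a softmax-style contrastive loss with the other samples of the batch as negatives, and combines the terms as "four cross-modal losses plus λ times the graph-modality loss"; that weighting, and the names `info_nce` and `multiview_batch_loss`, are assumptions made for illustration rather than the patent's exact formula.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(p * q for p, q in zip(a, b))
    return dot / (math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b)))


def info_nce(anchors: List[Vector], candidates: List[Vector], i: int, tau: float) -> float:
    """-log softmax of sim(anchor_i, candidate_i) over all candidates in the batch."""
    logits = [cosine(anchors[i], c) / tau for c in candidates]
    peak = max(logits)  # subtract the maximum for numerical stability
    log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_denominator - logits[i]


def multiview_batch_loss(zg: List[Vector], zg_aug: List[Vector],
                         zt: List[Vector], zt_aug: List[Vector],
                         lam: float = 1.0, tau: float = 0.1) -> float:
    """Sum over the batch of four cross-modal terms plus lam * graph-modality term."""
    total = 0.0
    for i in range(len(zg)):
        cross = (info_nce(zg, zt, i, tau) + info_nce(zg_aug, zt, i, tau)
                 + info_nce(zg, zt_aug, i, tau) + info_nce(zg_aug, zt_aug, i, tau))
        graph = info_nce(zg, zg_aug, i, tau)
        total += cross + lam * graph
    return total
```

When graph and text representations of the same molecule point in the same direction the loss is close to zero, and it grows when the pairing inside the batch is scrambled, which is the signal that back-propagation in S408 exploits.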
The text-based molecular graph generation method is as follows:
The method uses the molecular multi-modal model trained by the above pre-training method together with a pre-trained molecular generator that samples from a random seed and allows gradients to be back-propagated; the parameters of both the molecular multi-modal model and the molecular generator are frozen;
S501, inputting a text $x_T$ describing a molecule;
S502, initializing a generation seed q and setting q as a learnable parameter;
S503, generating a molecular graph $x_G$ from q using the pre-trained molecular generator;
S504, feeding $x_G$ and $x_T$ to the graph encoder and the text encoder of the pre-trained molecular multi-modal model, respectively, to extract the corresponding graph representation $z_G$ and text representation $z_T$, and using the similarity calculation module of the molecular multi-modal model to compute their negative similarity as the loss function:
$$\ell_q = -\mathrm{sim}(z_G, z_T)/\tau;$$
S505, back-propagating the loss $\ell_q$ and updating the seed q by gradient descent;
S506, repeating steps S503-S505 until the preset maximum number of iterations is reached;
S507, feeding the optimized q to the molecular generator, and generating and outputting the final molecular graph.
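The seed-optimization loop of S502-S506 can be sketched with a toy example. Everything below is a stand-in chosen so the code runs without a deep learning framework: `toy_generator` is a hypothetical frozen linear map that plays the role of "generator followed by graph encoder", the text representation is a fixed vector, and central finite differences replace back-propagation. Only the structure of the loop (frozen models, learnable seed, loss equal to negative similarity divided by τ) follows the description.

```python
import math
from typing import Callable, List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(p * q for p, q in zip(a, b))
    return dot / (math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b)))


def toy_generator(q: Vector) -> Vector:
    """Hypothetical frozen map from seed q to a graph representation z_G."""
    return [q[0] + 0.5 * q[1], q[1] - 0.5 * q[0]]


def seed_loss(q: Vector, z_text: Vector, tau: float) -> float:
    """Loss of S504: negative similarity between z_G and z_T, scaled by tau."""
    return -cosine(toy_generator(q), z_text) / tau


def numerical_grad(f: Callable[[Vector], float], q: Vector, eps: float = 1e-5) -> Vector:
    """Central finite differences, standing in for back-propagation."""
    grad = []
    for k in range(len(q)):
        up = q[:k] + [q[k] + eps] + q[k + 1:]
        down = q[:k] + [q[k] - eps] + q[k + 1:]
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


def optimise_seed(q: Vector, z_text: Vector, steps: int = 200,
                  lr: float = 0.05, tau: float = 0.1) -> Vector:
    """Gradient descent on the seed only; the generator stays frozen."""
    for _ in range(steps):
        grad = numerical_grad(lambda v: seed_loss(v, z_text, tau), q)
        q = [qi - lr * gi for qi, gi in zip(q, grad)]
    return q
```

In a real system the generator and both encoders would be neural networks and the gradient would come from automatic differentiation; the final step S507 would then decode the optimized seed into a molecular graph.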
The invention also provides a storage medium on which the molecular multi-modal model training and application method is stored, and a chip using the storage medium.
The technical effects of the invention are as follows:
The invention provides a molecular multi-modal model (MoMu), which implicitly establishes a connection between molecular structure and language description. Because it can handle multiple modalities of molecules, the model can be applied to a very wide range of downstream tasks.
With the text-based molecular graph generation method, new molecules can be generated directly from a text description of the required conditions, such as a description of molecular function. This addresses the inability of the prior art to generate valid molecules from descriptions, and for a description containing several conditions the method can generate molecular structures that satisfy as many of them as possible. In contrast to existing AI-based molecule generation methods, which can only generate molecules with pre-specified properties, the present method adaptively generates candidate molecules from input text that may describe any desired condition or conditions. Owing to its strong generalization and imaginative capabilities, the pre-trained molecular multi-modal model can promote scientific exploration in many molecule-related fields such as biology, chemistry, materials and medicine.
The molecular multi-modal model MoMu provided by the invention is pre-trained on paired multi-modal data consisting of molecular graphs and their weakly related biochemical descriptions retrieved from publicly available SCI papers. Because it can process molecules in multiple modalities, the pre-trained model can be applied to a very wide range of downstream tasks; accordingly, the invention provides a zero-shot molecule generation method and a molecule description method based on the pre-trained model. Experimental results show that the pre-trained model generalizes well across a wide range of downstream tasks, including cross-modal molecule retrieval, molecule captioning, zero-shot molecule generation and molecular property prediction.
Drawings
FIG. 1 is a flow chart of the collection of the graph-text data set of the present invention;
FIG. 2 is a schematic diagram of the architecture and pre-training principle of the molecular multi-modal model of the present invention;
FIG. 3 is a schematic block diagram of a text-based molecular diagram generation method of the present invention;
FIG. 4 is a graphical representation of the results of the text-based molecular diagram generation method of the present invention for some text input with respect to a functional description.
FIG. 5 is a graphical representation of the results of the text-based molecular graph generation method of the present invention for some text input with respect to structural descriptions.
Detailed Description
The following is a preferred embodiment of the present invention and a technical solution of the present invention is further described with reference to the accompanying drawings, but the present invention is not limited to this embodiment.
The principles and features of the present invention are described below with reference to the drawings, the examples are illustrated for the purpose of illustrating the invention and are not to be construed as limiting the scope of the invention. The invention is more particularly described by way of example in the following paragraphs with reference to the drawings. Advantages and features of the invention will become more apparent from the following description and from the claims. It should be noted that the drawings are in a very simplified form and are all to a non-precise scale, merely for convenience and clarity in aiding in the description of embodiments of the invention.
The invention provides a molecular multi-mode model training and application method, and a storage medium and a chip realized based on the method.
The molecular multi-modal model training and application method comprises three constituent units: data collection, the molecular multi-modal (MoMu) model and its pre-training, and text-based molecular graph generation.
A data collection unit:
The data collection unit constructs a data set of molecular graph-text pairs, which is used to pre-train the molecular multi-modal model.
The data set is constructed as follows. First, the names, synonyms and SMILES strings of the first 50K compounds in PubChem are collected; the PubChem database contains basic information on more than 150 million chemicals. To obtain molecular graphs of the collected compounds, each SMILES string is converted into a molecular graph using the smiles2graph function provided by OGB. Text related to the corresponding molecules is then retrieved from published scientific papers in the S2ORC database to serve as weak semantic supervision. S2ORC is a corpus containing 136 million papers from different fields; only papers from the medicine, biology, chemistry and computer science fields are extracted, as they are more likely to contain molecule-related descriptions. To avoid, as far as possible, special characters related to experimental data, retrieval is performed only on the abstract, introduction and conclusion sections of each extracted paper.
As shown in FIG. 1, for each molecule, sentences containing its name are first retrieved using the name as a query. Each retrieved sentence and its neighboring sentences are recorded in a document as one paragraph. If fewer than two paragraphs are retrieved by name, the search is repeated using the synonyms or aliases of the molecule as queries. Retrieval for a molecule is terminated early once 5,000 paragraphs have been retrieved or the document size exceeds 500 MB. Not all 50,000 molecules have a corresponding text description that can be retrieved. In the end, 15,613 molecular graph-document pairs are obtained to form the multi-modal molecular data set, and the collected documents contain about 37 million paragraphs in total. In each pair, the sentences in the document contain semantic information weakly related to the corresponding molecular graph.
Molecular multi-modal (MoMu) model and its pre-training unit:
The overall architecture and pre-training process of the molecular multi-modal (MoMu) model are shown in FIG. 2. MoMu consists of a graph encoder and a text encoder, which encode the molecular graph and the text, respectively, into a joint representation space. A Graph Isomorphism Network (GIN) is used as the graph encoder, and the widely used language model BERT is used as the text encoder.
Unlike the general image-text data used to train image-text multi-modal foundation models, graph-text data about molecules is relatively scarce and is insufficient for training molecular graph and text encoders from scratch. Just as humans need the ability to recognize graphs and language before they can learn specialized knowledge, an artificial intelligence model that is to learn specialized molecular knowledge needs to be built on already trained general-purpose graph and text encoders. Thus, the graph encoder is initialized with the self-supervised pre-trained weights of GIN, and the text encoder is initialized with the pre-trained BERT weights provided by Sci-BERT or KV-PLM. MoMu initialized with the weights of Sci-BERT and KV-PLM is denoted MoMu-S and MoMu-K, respectively.
MoMu is then trained on the collected paired data set. For each graph-document pair in a mini-batch, two separate graphs are created from the molecular graph using two different types of graph augmentation, following the data augmentations introduced by GraphCL. Two types are considered: node dropping and subgraph sampling. Node dropping randomly discards a portion of the vertices of the original graph; for molecular graphs, the absence of certain atoms (e.g., some hydrogen atoms in a compound) does not change the semantic information. Subgraph sampling extracts a subgraph from the original graph using a random walk; the properties of a molecule bear a certain similarity to those of the molecules formed by its subgraphs, for example because they contain the same functional groups. Two different sentences are then randomly extracted from the document. Each modality thus has two samples containing the same semantic information.
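The two graph augmentations can be sketched on a plain adjacency-list graph. This is an illustrative simplification: a molecule is a dictionary mapping each atom index to its neighbours, atom and bond features are ignored, and the function names and the `max_steps` safeguard are assumptions. The default rates (10% of nodes dropped, subgraph of 80% of the nodes) are the values given in the implementation details of this description.

```python
import random
from typing import Dict, List, Optional

Graph = Dict[int, List[int]]


def induced_subgraph(adjacency: Graph, keep: set) -> Graph:
    """Keep only the chosen nodes and the edges between them."""
    return {u: [v for v in nbrs if v in keep]
            for u, nbrs in adjacency.items() if u in keep}


def drop_nodes(adjacency: Graph, drop_rate: float = 0.1,
               rng: Optional[random.Random] = None) -> Graph:
    """Node dropping: randomly discard a fraction of the vertices."""
    rng = rng or random.Random()
    nodes = list(adjacency)
    dropped = set(rng.sample(nodes, int(len(nodes) * drop_rate)))
    return induced_subgraph(adjacency, set(nodes) - dropped)


def random_walk_subgraph(adjacency: Graph, keep_ratio: float = 0.8,
                         rng: Optional[random.Random] = None,
                         max_steps: int = 1000) -> Graph:
    """Subgraph sampling: collect nodes visited by a random walk."""
    rng = rng or random.Random()
    target = max(1, int(len(adjacency) * keep_ratio))
    current = rng.choice(list(adjacency))
    visited = {current}
    for _ in range(max_steps):
        if len(visited) >= target or not adjacency[current]:
            break
        current = rng.choice(adjacency[current])
        visited.add(current)
    return induced_subgraph(adjacency, visited)
```

Applying `drop_nodes` and `random_walk_subgraph` to the same molecular graph yields the two views that form a positive pair in the graph-modality contrastive loss.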
For the graph modality, given a mini-batch $\{G_1, \dots, G_N\}$ of N molecular graph-document pairs, two different augmentations are produced for each graph, giving 2N augmented graphs in total. These graphs are then input into the graph encoder to obtain their representation vectors
$$\{z_1^G, \tilde{z}_1^G, \dots, z_N^G, \tilde{z}_N^G\},$$
where $z_i^G$ and $\tilde{z}_i^G$ denote the representations of the two augmented versions of the i-th graph $G_i$. Likewise, the representations obtained by the text encoder for two different sentences describing $G_i$ are denoted $z_i^T$ and $\tilde{z}_i^T$. Each molecular graph in the mini-batch corresponds to two different sentences, yielding 2N text representations
$$\{z_1^T, \tilde{z}_1^T, \dots, z_N^T, \tilde{z}_N^T\}.$$
Thus, for the i-th graph $G_i$, the total multi-view loss includes four contrastive losses between four cross-modal representation pairs, i.e. $(z_i^G, z_i^T)$, $(\tilde{z}_i^G, z_i^T)$, $(z_i^G, \tilde{z}_i^T)$ and $(\tilde{z}_i^G, \tilde{z}_i^T)$. For simplicity, only the contrastive loss for $(z_i^G, z_i^T)$ is written out:
$$\ell_i^{(z_i^G, z_i^T)} = -\log \frac{\exp\left(\mathrm{sim}(z_i^G, z_i^T)/\tau\right)}{\sum_{j=1}^{N} \exp\left(\mathrm{sim}(z_i^G, z_j^T)/\tau\right)},$$
where $\tau$ is a temperature parameter and $\mathrm{sim}(\cdot,\cdot)$ first projects its two inputs to the same dimension and then calculates the cosine similarity between the projection vectors. The other three cross-modal contrastive losses have the same form.
To further enhance the representation capability of the graph encoder, contrastive learning is also applied within the graph modality. Specifically, the features of a positive pair are pulled together and the features of negative pairs are pushed apart by minimizing a normalized cross-entropy loss, where a positive pair consists of two augmentations of the same molecular graph and negative pairs come from different molecular graphs. Based on the definitions above, the graph-modality contrastive loss of the i-th graph is:
$$\ell_i^{(z_i^G, \tilde{z}_i^G)} = -\log \frac{\exp\left(\mathrm{sim}(z_i^G, \tilde{z}_i^G)/\tau\right)}{\sum_{j=1}^{N} \exp\left(\mathrm{sim}(z_i^G, \tilde{z}_j^G)/\tau\right)},$$
where $\tau$ is the temperature parameter. The final loss is calculated over all samples in the mini-batch.
The pre-trained MoMu can process both molecular graphs and natural language text in a unified manner and obtains generic, transferable knowledge from these heterogeneous data, which generalizes readily to different downstream tasks.
Implementation details:
A 5-layer GIN with a hidden dimension of 300 is used as the graph encoder. The text encoder is a BERT with a hidden size of 768. Two multi-layer perceptrons project the graph and sentence features into the same feature space, each with an output dimension of 256. For the two graph augmentations, the node drop rate is 10% and the sampled subgraph contains 80% of the original graph. The input to the text modality is two sentences randomly selected from the document of the corresponding graph. The model is pre-trained for 300 epochs using the AdamW optimizer with a learning rate of 0.0001 and a weight decay of 1e-5. τ is set to 0.1 and the batch size to 256. The entire pre-training process is implemented in PyTorch and run on 8 NVIDIA Tesla V100 PCIe 32GB GPUs.
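For reference, the hyper-parameters listed above can be gathered into a single configuration object. The values are those stated in the implementation details; the class name `PretrainConfig`, the field names and the helper `steps_per_epoch` are hypothetical, and the helper assumes the last incomplete batch is kept, which the description does not specify.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class PretrainConfig:
    gin_layers: int = 5            # depth of the GIN graph encoder
    gin_hidden_dim: int = 300      # hidden dimension of the GIN
    bert_hidden_dim: int = 768     # hidden size of the BERT text encoder
    projection_dim: int = 256      # output size of each projection MLP
    node_drop_rate: float = 0.10   # fraction of nodes removed by node dropping
    subgraph_ratio: float = 0.80   # fraction of nodes kept by subgraph sampling
    learning_rate: float = 1e-4    # AdamW learning rate
    weight_decay: float = 1e-5     # AdamW weight decay
    epochs: int = 300
    temperature: float = 0.1       # tau in the contrastive losses
    batch_size: int = 256
    num_gpus: int = 8

    def steps_per_epoch(self, num_pairs: int) -> int:
        """Number of mini-batches per epoch, assuming the last partial batch is kept."""
        return math.ceil(num_pairs / self.batch_size)


config = PretrainConfig()
```

With the 15,613 graph-document pairs of the collected data set, `config.steps_per_epoch(15613)` gives the number of parameter updates per epoch under that assumption.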
Application of the molecular multi-modal model to cross-modal retrieval. Because the MoMu model provided by the invention is pre-trained by matching weakly related text with the corresponding molecular graphs, it can process both the graph and text modalities of molecules.
Its performance is evaluated on cross-modal retrieval. Given a molecular graph, graph-to-text (G-T) retrieval aims to retrieve the text description most relevant to the molecule. Conversely, given a text paragraph, text-to-graph (T-G) retrieval aims to retrieve the molecular graph it most likely describes. MoMu is evaluated on the PCdes data set, which contains SMILES strings and paired text descriptions for 15K molecules in PubChem. The data set is divided into a training set of 10,500 pairs, a validation set of 1,500 pairs and a test set of 3,000 pairs (two SMILES strings in the test set cannot be converted to graphs by RDKit, so the remaining 2,998 pairs are used for testing). The SMILES string in each pair is converted into a molecular graph. In the G-T/T-G task, the MoMu graph/text encoder extracts a representation of the query graph/text, the MoMu text/graph encoder extracts representations of all candidate texts/graphs to be retrieved, the cosine similarity between the query representation and every candidate representation is calculated, and the candidates are ranked in descending order of similarity. Following the experimental setup in document [11], retrieval is performed both within randomly sampled mini-batches (64 pairs per batch) and over all test pairs, and the average top-1 accuracy and the top-20 recall (mean ± standard deviation) are reported, respectively. In the literature, one sentence is randomly extracted from the text corresponding to each molecule for retrieval; this setting is referred to here as sentence-level retrieval. A setting that uses the complete paragraph description of each molecule, referred to as paragraph-level retrieval, is also evaluated.
Table 1 compares the method of the invention with Sci-BERT [8], KV-PLM and KV-PLM* (KV-PLM* differs from KV-PLM in how SMILES strings are tokenized). All methods are fine-tuned on the PCdes training set for a fair comparison. Across the different settings of the G-T and T-G tasks, MoMu-S and MoMu-K outperform the other methods, which retrieve directly from SMILES strings. Compared with KV-PLM, which jointly models molecular SMILES and language text, MoMu better links molecular structure and natural language description.
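The ranking-based metrics used in this evaluation can be sketched as follows. The sketch assumes a square similarity matrix in which entry (i, j) is the cosine similarity between query i and candidate j, and in which candidate i is the correct match for query i; the function names are illustrative. Top-1 accuracy is simply recall at k = 1.

```python
from typing import List


def rank_of_match(similarity_row: List[float], true_index: int) -> int:
    """1-based rank of the correct candidate when sorted by descending similarity."""
    target = similarity_row[true_index]
    return 1 + sum(1 for score in similarity_row if score > target)


def recall_at_k(similarity: List[List[float]], k: int) -> float:
    """Fraction of queries whose correct candidate is ranked within the top k."""
    hits = sum(1 for i, row in enumerate(similarity) if rank_of_match(row, i) <= k)
    return hits / len(similarity)
```

For graph-to-text retrieval the rows are graph queries and the columns are candidate texts; for text-to-graph retrieval the matrix is transposed.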
Table 1. Performance of different methods on the PCdes data set for graph-to-text (G-T) retrieval and text-to-graph (T-G) retrieval; the sentence-level retrieval results of Sci-BERT and KV-PLM are those reported in the literature.
Figure BDA0003836647490000091
Because some of the 15K molecular graph-text pairs in PCdes may have been collected as pre-training data for MoMu, 5,562 graph-text pairs with molecule IDs ranging from 50,000 to 100,000, which were not used for pre-training, were collected from PubChem. Table 2 shows the comparison with Sci-BERT and KV-PLM on this zero-shot retrieval test set. MoMu-S and MoMu-K perform significantly better than Sci-BERT and KV-PLM, further demonstrating the generalization ability of MoMu. On both data sets, MoMu-S and MoMu-K perform comparably, i.e., initializing the text encoder of MoMu with KV-PLM does not give better performance than initializing it with Sci-BERT. This suggests that structural information learned from one-dimensional SMILES strings does not transfer easily to structured molecular graphs, whereas MoMu captures structural information directly with a graph neural network under the supervision of language descriptions.
Table 2. Performance of different methods on the collected data set for zero-shot graph-to-text (G-T) retrieval and zero-shot text-to-graph (T-G) retrieval.
Figure BDA0003836647490000101
Text-based molecular graph generation unit:
As shown in FIG. 3, the zero-shot text-to-graph molecule generation method consists of the MoMu similarity measurement module and a molecule generator that is based on random seed sampling and allows gradients to be back-propagated. The invention is illustrated using the flow-based molecule generator MoFlow as an example. MoFlow defines a parameterized invertible mapping from a Gaussian distribution to the molecular distribution. A molecular graph $G$ consists of an atom matrix $V \in \{0,1\}^{N \times C_a}$ and a bond matrix $E \in \{0,1\}^{N \times N \times C_b}$, where $N$ is the number of atoms in the molecule, and $C_a$ and $C_b$ are the numbers of atom types and bond types. If the $n$-th atom belongs to the $c$-th atom type, $V_{n,c} = 1$; otherwise $V_{n,c} = 0$. If the bond between the $n$-th atom and the $n'$-th atom is of the $c'$-th bond type, $E_{n,n',c'} = 1$; otherwise $E_{n,n',c'} = 0$. MoFlow contains a graph conditional flow $q_v = f_c(V \mid E)$, which encodes the atom matrix $V$, given the bond matrix $E$, into a latent variable $q_v$, and a gflow $q_e = f_g(E)$, which encodes the bond matrix $E$ into a latent variable $q_e$. $f_c$ and $f_g$ are implemented by graph convolutional neural networks based on graph coupling layers. The concatenation $q = [q_v; q_e]$ of $q_v$ and $q_e$ obeys a Gaussian distribution $P(q)$. After MoFlow has been trained, a variable $q$ can be sampled from $P(q)$ and split into the two parts $q_v$ and $q_e$, which are input to the inverse graph conditional flow $f_c^{-1}$ and the inverse gflow $f_g^{-1}$ to obtain the probability matrices:
$\tilde{E} = f_g^{-1}(q_e)$,
$\tilde{V} = f_c^{-1}(q_v \mid GN(\tilde{E}))$,
where $\tilde{E}_{n,n',c'}$ is the predicted probability that the bond between the $n$-th atom and the $n'$-th atom belongs to the $c'$-th bond type, and $\tilde{V}_{n,c}$ is the probability that the $n$-th atom belongs to the $c$-th atom type. $V$ and $E$ are obtained by taking the index of the maximum value (argmax) over the last dimension of $\tilde{V}$ and $\tilde{E}$. $GN$ is the graph normalization layer. By sampling different $q$ from $P(q)$, MoFlow can generate different novel and valid molecules.
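The final decoding step, taking the index of the maximum value over the last dimension of each probability matrix, can be illustrated as follows. This is a small sketch with made-up probabilities for a two-atom molecule; the helper names are hypothetical and nothing here calls MoFlow itself.

```python
def argmax(values):
    """Index of the largest entry in a list."""
    return max(range(len(values)), key=lambda i: values[i])

def one_hot(index, size):
    """One-hot list of the given size with a 1 at position index."""
    return [1 if i == index else 0 for i in range(size)]

def decode_atoms(v_prob):
    """N x C_a probability matrix -> one-hot atom matrix V."""
    return [one_hot(argmax(row), len(row)) for row in v_prob]

def decode_bonds(e_prob):
    """N x N x C_b probability tensor -> one-hot bond matrix E."""
    return [[one_hot(argmax(cell), len(cell)) for cell in row] for row in e_prob]

# Hypothetical generator outputs: N = 2 atoms, C_a = 3 atom types, C_b = 2 bond types.
v_prob = [[0.7, 0.2, 0.1],
          [0.1, 0.3, 0.6]]
e_prob = [[[0.9, 0.1], [0.2, 0.8]],
          [[0.2, 0.8], [0.9, 0.1]]]

V = decode_atoms(v_prob)   # V[n][c] == 1 when atom n has type c
E = decode_bonds(e_prob)   # E[n][m][c] == 1 when the bond n-m has type c
```

The argmax is not differentiable, which is why the generation method described next feeds the probabilities, rather than the one-hot matrices, to the graph encoder during optimization.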
The zero-shot text-to-graph molecule generation method takes a text description $x_T$ as input. The only learnable parameter in the method is $q$, which is initialized by random sampling from $P(q)$. All parameters of the pre-trained MoMu and MoFlow are frozen. The input $x_T$ is fed into the MoMu text encoder to obtain the text representation $z_T$. $q$ is input to MoFlow to obtain the probability matrices $\tilde{V}$ and $\tilde{E}$. To keep all operations differentiable, and thus allow gradients to be back-propagated, $\tilde{V}$ rather than $V$ is input to the MoMu graph encoder to obtain the graph representation $z_G$. The trained graph encoder GIN contains embeddings of all atom types and bond types. The original $V$ serves as an index for selecting the corresponding atom-type embedding in the first layer. When $\tilde{V}$ is used instead, the representation obtained for each atom is in fact a weighted sum of all atom embeddings, with $\tilde{V}$ supplying the probabilities over the atom types of each node as attention scores. The loss function is the negative cosine similarity between the projected $z_T$ and $z_G$:
$\ell_q = -\mathrm{sim}(z_G, z_T)/\tau$,
where $\mathrm{sim}(z_G, z_T)$ computes the similarity between the projected representations in MoMu. $q$ is updated by back-propagating the gradient of $\ell_q$, using the Adam optimizer. After repeated updates, for up to 500 iterations, the optimized $q$ is obtained and input to MoFlow to produce the probability matrices $\tilde{V}$ and $\tilde{E}$. Finally, taking the index of the maximum value over the last dimension of $\tilde{V}$ and $\tilde{E}$ yields the molecular graph $G = (V, E)$.
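The optimization of the seed q can be sketched with a deliberately tiny stand-in model. Everything in this sketch is an assumption made for illustration: `toy_generator` (a row-wise softmax) replaces the frozen MoFlow, mean pooling over soft atom embeddings replaces the GIN encoder, the embedding table and text feature are made-up constants, and plain gradient descent with finite-difference gradients replaces Adam with back-propagation. Only the structure of the loop follows the description: frozen models, q as the single learnable parameter, a loss of the form -sim(z_G, z_T)/τ, and a final argmax.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of logits."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def toy_generator(q, n_types):
    """Hypothetical frozen generator: seed q -> soft atom matrix (N x C_a)."""
    return [softmax(q[i:i + n_types]) for i in range(0, len(q), n_types)]

def soft_graph_feature(v_soft, atom_emb):
    """Each atom feature is a probability-weighted sum of all atom-type
    embeddings; the graph feature is their mean (stand-in for the encoder)."""
    dim = len(atom_emb[0])
    atoms = [[sum(p * atom_emb[c][d] for c, p in enumerate(row)) for d in range(dim)]
             for row in v_soft]
    return [sum(atom[d] for atom in atoms) / len(atoms) for d in range(dim)]

def loss_fn(q, z_text, atom_emb, tau=0.1):
    """Negative cosine similarity between graph and text features, over tau."""
    v_soft = toy_generator(q, len(atom_emb))
    return -cosine(soft_graph_feature(v_soft, atom_emb), z_text) / tau

def numeric_grad(f, q, eps=1e-5):
    """Central finite differences, used here in place of back-propagation."""
    grad = []
    for i in range(len(q)):
        up = q[:i] + [q[i] + eps] + q[i + 1:]
        down = q[:i] + [q[i] - eps] + q[i + 1:]
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad

ATOM_EMB = [[1.0, 0.0], [0.0, 1.0]]    # frozen embeddings of 2 atom types (made up)
Z_TEXT = [0.0, 1.0]                    # frozen text feature (made up)
q = [1.0, 0.0, 0.5, -0.5, 2.0, 0.0]    # seed for 3 atoms; fixed here, not sampled

def objective(x):
    return loss_fn(x, Z_TEXT, ATOM_EMB)

initial_loss = objective(q)
for _ in range(200):                   # the description allows up to 500 iterations
    grad = numeric_grad(objective, q)
    q = [qi - 0.1 * gi for qi, gi in zip(q, grad)]
final_loss = objective(q)

# Final hard decoding: argmax over the last dimension of the soft atom matrix.
v_final = toy_generator(q, len(ATOM_EMB))
atom_types = [max(range(len(row)), key=lambda c: row[c]) for row in v_final]
```

In this toy setting the text feature points along the embedding of atom type 1, so the optimization moves every atom from type 0 toward type 1. With the real models, the same loop would be written in an automatic-differentiation framework so that the gradient flows through the generator and the graph encoder.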
The molecules generated by the method of the invention from descriptions of molecular functions are shown in FIG. 4. For the description "fluorescent molecules", MolT5 cannot produce a valid molecule, whereas the method of the present invention produces molecules with conjugated double bonds or conjugated structures. For the description "the molecule includes a hydroxyl group and a carboxyl group, is capable of decomposing to produce ammonia gas, and has an oxygen content exceeding 20%", various molecules having a hydroxyl group, a high oxygen content, and a nitrogen atom from which ammonia can be produced are successfully generated, so three of the four conditions are satisfied. Unlike existing AI-based molecule generation methods, which can only generate molecules with pre-specified attributes, the present method adaptively generates molecule candidates from input text that may describe any desired condition or combination of conditions. In the last description, three desired molecular properties are specified: high water solubility, high barrier permeability and low toxicity, all of which can be evaluated by a fine-tuned property prediction model. The method of the present invention, based on either MoMu-S or MoMu-K, generates different molecules with high permeability, low toxicity and high water solubility.
The molecules generated from descriptions of molecular structure are shown in FIG. 5. For descriptions containing nucleophilic groups, the method of the invention produces different molecules with amino groups, hydroxyl groups, or double bonds. For descriptions containing electrophilic groups, the method of the present invention generates different molecules having carbonyl groups, alkyl-like groups or halogen atoms, even though formal charges are suppressed. For descriptions containing hydrophilic groups, the method of the present invention produces molecules of different structures containing hydroxyl, amino or aldehyde groups. For descriptions containing lipophilic groups, the method of the present invention produces molecules of different structures containing alkyl groups, halogen atoms, or benzene rings.
Parts of the present invention that are not described in detail are well known to those skilled in the art.

Claims (7)

1. A molecular multi-modal model training and application method, characterized in that the method comprises the following steps:
S100, establishing a data collection unit: molecular graphs and their weakly semantically related text data are extracted from published SCI papers to construct a molecular graph dataset;
S200, establishing a molecular multi-modal model and its pre-training unit: a molecular multi-modal model comprising a graph encoder and a text encoder is constructed and trained by contrastive learning;
S500, establishing a text-based molecular graph generation unit: the model is applied to different downstream tasks, such as cross-modal retrieval and molecular graph generation from text descriptions, and the generated molecular graph is finally output.
2. The molecular multi-modal model training and application method according to claim 1, characterized in that the data collection unit constructs the dataset as follows:
S201, collecting the names, synonyms and SMILES strings of the first 50K molecular compounds in PubChem;
S202, for each collected molecule, converting the SMILES string into a molecular graph using the smiles2graph function provided by OGB;
S203, using the name of the molecule as a query, searching the abstract, introduction and conclusion sections of published scientific papers in the medical, biological, chemical and computer science fields of the S2ORC database for sentences containing the name, and recording each retrieved sentence together with its adjacent sentences in a document as one paragraph; if fewer than two paragraphs are retrieved by name, searching again using the synonyms or aliases of the molecule as queries; the search for a molecule terminates early once a specified number of paragraphs or a specified document size has been reached;
S204, obtaining molecular graph-document pairs to form the multi-modal molecular dataset.
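The paragraph collection of step S203 can be sketched as follows. This is an illustrative sketch only: the function name, the plain substring matching, the one-sentence window on each side, and the `max_paragraphs` limit are assumptions, and the sentences are made-up examples rather than S2ORC data.

```python
def collect_paragraphs(sentences, name, synonyms=(), min_paragraphs=2, max_paragraphs=5):
    """Collect (previous, hit, next) sentence windows that mention a molecule."""
    paragraphs = []

    def search(term):
        term = term.lower()
        for i, sentence in enumerate(sentences):
            if len(paragraphs) >= max_paragraphs:   # early termination
                return
            if term in sentence.lower():
                # Keep the matching sentence together with its neighbours.
                window = " ".join(sentences[max(0, i - 1):i + 2])
                if window not in paragraphs:
                    paragraphs.append(window)

    search(name)
    if len(paragraphs) < min_paragraphs:            # fall back to synonyms
        for alias in synonyms:
            search(alias)
    return paragraphs

# Made-up sentences standing in for the text of retrieved papers.
sentences = [
    "Pain relief is a common goal.",
    "Aspirin inhibits cyclooxygenase.",
    "It is widely used.",
    "Acetylsalicylic acid is absorbed quickly.",
    "Doses vary.",
]
paragraphs = collect_paragraphs(sentences, "aspirin", synonyms=("acetylsalicylic acid",))
```

Here the name matches only one sentence, so the synonym query is used as a fallback and contributes a second paragraph. A production version would need word-boundary matching so that short names do not match inside longer words.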
3. The molecular multi-modal model training and application method according to claim 2, characterized in that: the molecular multi-modal model consists of a graph encoder and a text encoder, which extract the molecular graph representation and the text representation respectively; a graph isomorphism network (GIN) is used as the graph encoder and the language model BERT is used as the text encoder; in the training stage the model additionally uses a similarity calculation module, which uses two mapping heads to map the molecular graph representation and the text representation into a joint representation space and then computes the cosine similarity of the mapped features.
4. The molecular multi-modal model training and application method according to claim 3, characterized in that the pre-training method of the molecular multi-modal model specifically comprises the following steps:
S401, initializing the graph encoder with the self-supervised pre-trained weights of a graph isomorphism network, and initializing the text encoder with the pre-trained BERT weights of Sci-BERT or KV-PLM;
S402, in each training epoch, sequentially sampling mini-batches of N molecular graph-text pairs from the training set;
S403, for each mini-batch {G_1, ..., G_N}, generating two different augmented graphs from each graph by random node dropping and random subgraph extraction respectively, giving 2N augmented graphs in total, and inputting these graphs into the graph encoder to obtain their representation vectors [Figure FDA0003836647480000011], where [Figure FDA0003836647480000012] and [Figure FDA0003836647480000013] denote the representations of the two augmented graphs of the i-th graph G_i;
S404, randomly drawing two different sentences from the document corresponding to each molecule; the representations obtained by the text encoder for the two different sentences describing the i-th graph G_i are denoted [Figure FDA0003836647480000021]; each molecular graph in a mini-batch corresponds to two different sentences, giving 2N text representations [Figure FDA0003836647480000022];
S405, for the i-th graph G_i, the total multi-view loss comprises four contrastive losses, one for each of the four cross-modal representation pairs [Figure FDA0003836647480000023] and [Figure FDA0003836647480000024], where:
the contrastive loss for the pair [Figure FDA0003836647480000025] is: [Figure FDA0003836647480000026]
the contrastive loss for the pair [Figure FDA0003836647480000027] is: [Figure FDA0003836647480000028]
the contrastive loss for the pair [Figure FDA0003836647480000029] is: [Figure FDA00038366474800000210]
the contrastive loss for the pair [Figure FDA00038366474800000211] is: [Figure FDA00038366474800000212]
where τ is a temperature parameter and [Figure FDA00038366474800000213] is the similarity calculation module, which first projects [Figure FDA00038366474800000214] and [Figure FDA00038366474800000215] to the same dimension and then computes the cosine similarity between the projected vectors;
S406, computing the graph-modal contrastive loss; the graph-modal contrastive loss for the i-th graph is: [Figure FDA00038366474800000216]
S407, computing the sum of the losses of all samples in the batch: [Figure FDA00038366474800000217], where λ is a hyper-parameter that balances the cross-modal loss against the graph-modal loss;
S408, for each mini-batch, updating the parameters of the graph encoder, the text encoder and the mapping heads by back-propagating the total loss L, until all mini-batches in the current epoch have been processed;
S409, repeating S402-S408 until the preset maximum number of epochs is reached.
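The loss structure of steps S405-S407 can be sketched as follows. The exact formulas are given only as images in this text, so this sketch assumes a standard InfoNCE-style contrastive loss, applies λ to the graph-modal term, and sums over the batch without averaging; all three choices, the function names, and the toy features are assumptions. The features are treated as already projected by the mapping heads.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def info_nce(anchors, targets, i, tau=0.1):
    """-log of the softmax probability that anchor i picks its own target i
    among all targets in the batch (assumed form of each contrastive loss)."""
    logits = [cosine(anchors[i], target) / tau for target in targets]
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_denominator - logits[i]

def cross_modal_loss(graph_views, text_views, i, tau=0.1):
    """Four losses for sample i: each of the two augmented-graph views
    against each of the two sentence views."""
    return sum(info_nce(g, t, i, tau) for g in graph_views for t in text_views)

def total_loss(graph_views, text_views, lam=1.0, tau=0.1):
    """Sum over the batch of the cross-modal loss plus lam times the
    graph-modal loss between the two augmented views of each graph."""
    n = len(graph_views[0])
    cross = sum(cross_modal_loss(graph_views, text_views, i, tau) for i in range(n))
    graph = sum(info_nce(graph_views[0], graph_views[1], i, tau) for i in range(n))
    return cross + lam * graph

# Made-up projected features for a batch of N = 2 molecules.
graphs_a = [[1.0, 0.0], [0.0, 1.0]]    # first augmented view of each graph
graphs_b = [[0.9, 0.1], [0.1, 0.9]]    # second augmented view
texts_a = [[1.0, 0.1], [0.1, 1.0]]     # first sampled sentence per molecule
texts_b = [[0.8, 0.2], [0.2, 0.8]]     # second sampled sentence

aligned = total_loss([graphs_a, graphs_b], [texts_a, texts_b])
# Swapping the sentences between the two molecules breaks the pairing.
mismatched = total_loss([graphs_a, graphs_b], [texts_a[::-1], texts_b[::-1]])
```

The correctly paired batch yields a lower loss than the swapped one, which is the signal that drives the encoders and mapping heads during pre-training.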
5. The molecular multi-modal model training and application method according to claim 4, characterized in that the text-based molecular graph generation method is as follows:
the method uses the molecular multi-modal model trained by the pre-training method and a pre-trained molecule generator that is based on random seed sampling and allows gradients to be back-propagated, the parameters of both the molecular multi-modal model and the molecule generator being fixed;
S501, inputting a text $x_T$ describing a molecule;
S502, initializing a generation seed $q$ and setting $q$ as a learnable parameter;
S503, generating a molecular graph $x_G$ from $q$ using the pre-trained molecule generator;
S504, feeding $x_G$ and $x_T$ to the graph encoder and the text encoder of the pre-trained molecular multi-modal model respectively, extracting the corresponding graph representation $z_G$ and text representation $z_T$, and using the similarity calculation module of the molecular multi-modal model to compute their negative similarity as the loss function:
$\ell_q = -\mathrm{sim}(z_G, z_T)/\tau$;
S505, back-propagating the loss $\ell_q$ and updating the seed $q$ by gradient descent;
S506, repeating steps S503-S505 until the preset maximum number of iterations is reached;
S507, feeding the optimized $q$ to the molecule generator to generate and output the final molecular graph.
6. A storage medium, characterized in that it stores the molecular multi-modal model training and application method according to any one of claims 1-5.
7. A chip, characterized in that it uses the storage medium according to claim 6.
CN202211099018.2A 2022-09-07 2022-09-07 Molecular multi-mode model training and application method, storage medium and chip Pending CN116168775A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211099018.2A CN116168775A (en) 2022-09-07 2022-09-07 Molecular multi-mode model training and application method, storage medium and chip

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211099018.2A CN116168775A (en) 2022-09-07 2022-09-07 Molecular multi-mode model training and application method, storage medium and chip

Publications (1)

Publication Number Publication Date
CN116168775A true CN116168775A (en) 2023-05-26

Family

ID=86413784

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211099018.2A Pending CN116168775A (en) 2022-09-07 2022-09-07 Molecular multi-mode model training and application method, storage medium and chip

Country Status (1)

Country Link
CN (1) CN116168775A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination