CN115762659A - Molecular pre-training representation method and system fusing a SMILES sequence and a molecular graph

Info

Publication number
CN115762659A
Authority
CN
China
Prior art keywords
molecular
smiles
representation
diagram
level
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211282025.6A
Other languages
Chinese (zh)
Inventor
Lan Yanyan
Ma Weiying
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN202211282025.6A
Publication of CN115762659A
Legal status: Pending (Current)

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Image Analysis (AREA)

Abstract

The invention provides a molecular pre-training representation method and system fusing a SMILES sequence and a molecular graph, comprising the following steps: acquiring a character sequence in SMILES form and a molecular graph; inputting the SMILES character sequence and the molecular graph into a pre-trained Transformer model for processing at both the molecular level and the fragment level, and outputting a molecular representation vector; and using the molecular representation vector as the input of a downstream task to complete the molecular representation. The invention solves the problems that existing molecular representation schemes are not comprehensive and cannot achieve the desired effect.

Description

Molecular pre-training representation method and system fusing a SMILES sequence and a molecular graph
Technical Field
The invention relates to the technical field of compound molecule representation, and in particular to a molecular pre-training representation method and system fusing a SMILES sequence and a molecular graph.
Background
Most molecular representation schemes employ a single data form for pre-training, such as natural-language methods applied to SMILES strings or graph pre-training methods applied to molecular graphs. Recently, some schemes have used a two-tower model to exploit two data forms simultaneously, such as DMP using SMILES and the molecular graph, MM-Deacon using SMILES and IUPAC names, and CLOOME using the molecular graph and cell microscopy images. Such schemes use a separate branch to encode each data form, and the resulting branch representations participate in training together through a final loss function.
Molecular representation schemes that use a single data form are limited: adopting the SMILES sequence is not conducive to capturing the topological structure of the molecule, while adopting the molecular graph can cause problems such as over-smoothing of the graph model. Using both SMILES and the molecular graph allows their advantages to complement each other and yields a more comprehensive molecular representation. However, current schemes that adopt two data forms, such as DMP, use a two-tower architecture to encode the data forms separately; they lack interaction between SMILES and the molecular graph and ignore the fine-grained correspondence between them.
Disclosure of Invention
The invention provides a molecular pre-training representation method and system fusing a SMILES sequence and a molecular graph, which are used to solve the problems that existing molecular representation schemes are not comprehensive and cannot achieve the desired effect.
The molecular pre-training representation method fusing a SMILES sequence and a molecular graph provided by the invention comprises the following steps:
acquiring a character sequence in SMILES form and a molecular graph;
inputting the SMILES character sequence and the molecular graph into a pre-trained Transformer model for processing at both the molecular level and the fragment level, and outputting a molecular representation vector;
and using the molecular representation vector as the input of a downstream task to complete the molecular representation.
According to the molecular pre-training representation method fusing the SMILES sequence and the molecular graph provided by the invention, inputting the SMILES character sequence and the molecular graph into a pre-trained Transformer model for processing specifically comprises:
at the molecular level, passing the SMILES character sequence through a tokenizer and a linear embedding layer to obtain a first embedding vector for each character;
passing the molecular graph through a graph neural network to obtain a second embedding vector for each node;
and inputting the first embedding vectors and the second embedding vectors together into the Transformer model for processing, as sketched below.
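To make the molecular-level input pipeline concrete, the following is a minimal PyTorch-style sketch. It is illustrative only: the SMILES tokenization regex, the stand-in GNN, and all names and dimensions are assumptions for exposition, not the exact tokenizer, embedding layer, or graph neural network of the invention.

```python
# Illustrative sketch of the molecular-level inputs (assumed components).
import re
import torch
import torch.nn as nn

# A simplified SMILES tokenizer: bracket atoms, two-letter elements,
# then single characters (an assumption, not the patent's tokenizer).
SMILES_TOKEN_RE = re.compile(
    r"(\[[^\]]+\]|Br|Cl|Si|@@|[BCNOPSFIbcnops0-9=#$%()+\-/\\.@])")

def tokenize_smiles(smiles: str):
    """Split a SMILES string into tokens t_1..t_n."""
    return SMILES_TOKEN_RE.findall(smiles)

class SmilesEmbedder(nn.Module):
    """Token ids -> first embedding vectors (one per SMILES token)."""
    def __init__(self, vocab_size: int, dim: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)  # the linear embedding layer

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        return self.embed(token_ids)                # shape (n, D)

class SimpleGNN(nn.Module):
    """A stand-in graph neural network: atom features -> node embeddings."""
    def __init__(self, atom_feat_dim: int, dim: int, n_layers: int = 3):
        super().__init__()
        self.proj = nn.Linear(atom_feat_dim, dim)
        self.layers = nn.ModuleList([nn.Linear(dim, dim) for _ in range(n_layers)])

    def forward(self, atom_feats: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        h = self.proj(atom_feats)                   # shape (m, D)
        for layer in self.layers:
            h = torch.relu(layer(adj @ h))          # neighbor aggregation via adjacency
        return h                                    # second embedding vectors
```

Both outputs share the same dimension D, so they can be fed jointly into the single Transformer described below.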
According to the molecular pre-training representation method fusing the SMILES sequence and the molecular graph provided by the invention, inputting the SMILES character sequence and the molecular graph into a pre-trained Transformer model for processing further comprises:
at the fragment level, performing fragment division on the molecular graph by the BRICS algorithm to obtain divided molecular graph fragments;
and labeling the SMILES character sequence according to a labeling rule to obtain a labeled character sequence, as sketched below.
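The patent names BRICS for fragment division, and RDKit ships an implementation of it. Below is a hedged sketch (the helper name, the addDummies choice, and the aspirin example are ours) that cuts a molecule at BRICS bonds and labels each atom with the index of the fragment it falls in, which is the kind of atom-to-fragment assignment the fragment-level tasks rely on.

```python
# Hedged sketch: BRICS fragmentation with RDKit, labeling every atom
# with the index of the fragment it belongs to.
from rdkit import Chem
from rdkit.Chem import BRICS

def brics_fragment_labels(smiles: str):
    """Return a list frag_id where frag_id[i] is the BRICS fragment of atom i."""
    mol = Chem.MolFromSmiles(smiles)
    # Bonds that BRICS would cut, as bond indices in the original molecule.
    brics_bonds = [mol.GetBondBetweenAtoms(i, j).GetIdx()
                   for (i, j), _labels in BRICS.FindBRICSBonds(mol)]
    if not brics_bonds:                      # molecule is a single fragment
        return [0] * mol.GetNumAtoms()
    broken = Chem.FragmentOnBonds(mol, brics_bonds, addDummies=False)
    frag_id = [0] * mol.GetNumAtoms()
    # With addDummies=False the original atom indices are preserved.
    for k, atom_ids in enumerate(Chem.GetMolFrags(broken)):
        for a in atom_ids:
            frag_id[a] = k
    return frag_id

print(brics_fragment_labels("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin
```

Assigning each SMILES character to a fragment then follows from the atom order of the SMILES string.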
According to the molecular pre-training representation method fusing the SMILES sequence and the molecular graph provided by the invention, the fragment-level tasks include: cross-modal masking and fragment alignment;
the cross-modal masking is performed at the fragment level and the character level respectively;
the fragment-level mask is provided with the complete information of the other modality;
the character-level mask is provided with a single modality only.
According to the molecular pre-training representation method fusing the SMILES sequence and the molecular graph provided by the invention, the fragment alignment contrastively learns the SMILES fragment representation, obtained through a multi-head attention mechanism, against the average-pooled molecular graph fragment representation, and captures the fine-grained correspondence between modalities through the Transformer model.
According to the molecular pre-training representation method fusing the SMILES sequence and the molecular graph provided by the invention, using the molecular representation vector as the input of a downstream task to complete the molecular representation specifically comprises:
the molecular-level task predicts positive and negative pairs of the SMILES character sequence and the molecular graph, judging whether the two come from the same molecule;
predicting molecular fingerprints and functional-group information through domain-knowledge learning, strengthening the learning of chemical domain knowledge;
and through pre-training, the Transformer model simultaneously learns cross-modal information at both the fragment level and the molecule level, yielding an enhanced molecular representation, as sketched below.
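As a sketch of the two molecule-level objectives just listed, the snippet below puts a matching head (same molecule or not) and a knowledge head (fingerprint plus functional-group bits) on top of the pooled representation. The head names, bit counts, and equal loss weighting are assumptions, not the invention's exact design.

```python
# Illustrative molecule-level heads: (1) SMILES-graph matching on the pooled
# representation x_cls, (2) predicting fingerprint / functional-group bits.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoleculeLevelHeads(nn.Module):
    def __init__(self, dim: int, n_fp_bits: int = 2048, n_func_groups: int = 85):
        super().__init__()
        self.match_head = nn.Linear(dim, 2)              # same molecule or not
        self.knowledge_head = nn.Linear(dim, n_fp_bits + n_func_groups)

    def forward(self, x_cls: torch.Tensor):
        return self.match_head(x_cls), self.knowledge_head(x_cls)

def molecule_level_loss(heads, x_cls, is_match, fp_and_fg):
    """is_match: (B,) class indices; fp_and_fg: (B, n_bits) multi-label targets."""
    match_logits, know_logits = heads(x_cls)
    l_match = F.cross_entropy(match_logits, is_match)
    l_know = F.binary_cross_entropy_with_logits(know_logits, fp_and_fg)
    return l_match + l_know                              # equal weighting assumed
```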
The invention also provides a molecular pre-training representation system fusing a SMILES sequence and a molecular graph, comprising:
a data acquisition module for acquiring a character sequence in SMILES form and a molecular graph;
a processing module for inputting the SMILES character sequence and the molecular graph into a pre-trained Transformer model for processing at both the molecular level and the fragment level, and outputting a molecular representation vector;
and a molecular representation module for using the molecular representation vector as the input of a downstream task to complete the molecular representation.
The invention also provides an electronic device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the molecular pre-training representation method fusing a SMILES sequence and a molecular graph as described above.
The invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the molecular pre-training representation method fusing a SMILES sequence and a molecular graph as described in any of the above.
The invention also provides a computer program product comprising a computer program which, when executed by a processor, implements the molecular pre-training representation method fusing a SMILES sequence and a molecular graph as described in any of the above.
According to the molecular pre-training representation method and system fusing the SMILES sequence and the molecular graph, the limitations of existing hybrid methods are overcome through a Transformer model, and fine-grained cross-modal semantics between the SMILES sequence and the molecular graph are captured. Corresponding SMILES and molecular graph fragments are obtained and, together with their labels, used as heterogeneous input to the pre-training tasks; a shared Transformer is used for deep cross-modal fusion. By introducing two new fragment-level pre-training tasks, namely multi-level cross-modal masking and fragment-level alignment, the fine-grained correspondence between modalities is captured, the correspondence between SMILES and the molecular graph at different levels is fully exploited, and the molecular representation is enhanced.
Drawings
In order to illustrate the technical solutions of the present invention or the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is apparent that the drawings in the following description show some embodiments of the present invention, and that those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is the first schematic flow chart of the molecular pre-training representation method fusing a SMILES sequence and a molecular graph provided by the present invention;
FIG. 2 is the second schematic flow chart of the molecular pre-training representation method fusing a SMILES sequence and a molecular graph provided by the present invention;
FIG. 3 is the third schematic flow chart of the molecular pre-training representation method fusing a SMILES sequence and a molecular graph provided by the present invention;
FIG. 4 is the fourth schematic flow chart of the molecular pre-training representation method fusing a SMILES sequence and a molecular graph provided by the present invention;
FIG. 5 is the fifth schematic flow chart of the molecular pre-training representation method fusing a SMILES sequence and a molecular graph provided by the present invention;
FIG. 6 is a schematic structural diagram of the molecular pre-training representation system fusing a SMILES sequence and a molecular graph provided by the present invention;
FIG. 7 is a schematic structural diagram of an electronic device provided by the present invention.
Reference numerals:
110: a data acquisition module; 120: a processing module; 130: a molecular representation module;
710: a processor; 720: a communication interface; 730: a memory; 740: a communication bus.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be described clearly and completely below with reference to the accompanying drawings. It is apparent that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments obtained by a person skilled in the art based on the embodiments of the present invention without inventive effort fall within the scope of protection of the present invention.
A molecular pre-training representation method fusing a SMILES sequence and a molecular graph according to the present invention is described below with reference to FIGS. 1 to 5, and includes:
S100, acquiring a character sequence in SMILES form and a molecular graph;
S200, inputting the SMILES character sequence and the molecular graph into a pre-trained Transformer model for processing at both the molecular level and the fragment level, and outputting a molecular representation vector;
S300, using the molecular representation vector as the input of a downstream task to complete the molecular representation.
Through a single-tower model (a shared Transformer backbone), the invention designs different pre-training tasks at the molecular fragment level and the whole-molecule level, fully exploits the correspondence between SMILES and the molecular graph at different levels, and enhances the molecular representation.
Inputting the SMILES character sequence and the molecular graph into a pre-trained Transformer model for processing specifically comprises the following steps:
S101, at the molecular level, passing the SMILES character sequence through a tokenizer and a linear embedding layer to obtain a first embedding vector for each character;
S102, passing the molecular graph through a graph neural network to obtain a second embedding vector for each node;
and S103, inputting the first embedding vectors and the second embedding vectors together into the Transformer model for processing.
Due to the sequential format of SMILES strings, NLP-based pre-training methods have recently been applied to SMILES and have achieved competitive results. For example, SMILES-BERT and ChemBERTa, based on BERT and RoBERTa respectively, perform masked language modeling (MLM) on SMILES. SMILES-Transformer and X-MOL utilize an encoder-decoder architecture, learning representations by generating SMILES. Owing to the successful use of large-scale language models, these models can be pre-trained with large-scale molecular data, such as the 1.1 billion molecules used in X-MOL.
In general, pre-training methods based on the molecular graph can be divided into three categories: generative, predictive, and contrastive methods. The goal of generative methods is to reconstruct the original molecular graph, for example by generating a motif tree of the molecule using a predefined motif vocabulary. Predictive methods construct contexts according to the structure of the graph and adopt context-aware masking for self-supervised learning; for example, neighbors in the molecular graph are defined as contexts, and the relevant context on the molecular graph is used to predict masked atoms/edges/motifs. Contrastive methods mainly apply contrastive learning on the molecular graph; for example, the graph is first transformed by masking atoms/edges/subgraphs, positive and negative pairs are constructed, and a contrastive loss is applied. In addition, some contrastive-learning work augments data using chemical knowledge.
Since one molecule can take various forms, such as SMILES, the molecular graph, IUPAC names, and cell-based microscopy images, some hybrid approaches have been proposed to combine different modalities to learn a unified representation: the molecular graph has been combined with SMILES, IUPAC names with SMILES, and cell-based microscopy images with the molecular graph. However, these methods all use a dual-stream model architecture, where a separate encoder is first employed to obtain a representation for each data form, and these representations are then aligned with further losses, such as view consistency (in DMP), multilingual alignment (in MM-Deacon), and the InfoLOOB contrastive loss (in CLOOME). These dual-stream models are good at learning representations of each form, but have difficulty capturing the rich alignment information between the different forms.
The invention focuses on the correspondence between SMILES and the molecular graph; the basic idea of fine-grained alignment, although designed here for SMILES and the molecular graph, is also applicable to combinations of other data forms, such as IUPAC names and cell-based microscopy images. In the future we will investigate how to design a generic framework to integrate different data forms. Our work is also inspired by several recent vision-language multimodal studies, which can be divided into dual-stream and single-stream approaches. Dual-stream methods such as CLIP, ALIGN and WenLan are good at capturing weak correlations between vision and language, while single-stream methods such as Oscar are more suitable for modeling strong correlations. SMILES and molecular graphs typically have a strong association, i.e., SMILES provides a detailed description of the molecular graph, much like the strong association between images and captions in image captioning tasks, so a single-stream structure is more suitable for fusion between the different data forms of a molecule.
In the present invention, the first and second embedding vectors map a given SMILES string and molecular graph to vectors in the embedding layer for further computation. For the input SMILES, a tokenizer parses it into a series of tokens S = [t_1, t_2, …, t_n], and an embedding layer is then applied to obtain the SMILES embedding, denoted s = [s_1, s_2, …, s_n], where s_i is the D-dimensional embedding of token t_i. The input graph is represented as G = {V, E}, where V is the vertex set, with v_i ∈ V denoting the i-th atom, and E is the edge set, with e_ij ∈ E denoting the edge between the i-th and j-th atoms. The graph embedding g = [g_1, g_2, …, g_m] is obtained, where m = |V| is the number of atoms and g_i is the D-dimensional vector of the i-th atom.
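As a small worked example of this notation, the snippet below builds V and E for ethanol ("CCO") with RDKit; representing atoms by atomic number alone is a simplification for illustration.

```python
# Worked example of the notation above for ethanol ("CCO").
from rdkit import Chem

mol = Chem.MolFromSmiles("CCO")
V = [atom.GetAtomicNum() for atom in mol.GetAtoms()]                    # v_1..v_m
E = [(b.GetBeginAtomIdx(), b.GetEndAtomIdx()) for b in mol.GetBonds()]  # e_ij
print(V)  # [6, 6, 8] -> m = 3 atoms
print(E)  # [(0, 1), (1, 2)]
```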
Inputting the SMILES character sequence and the molecular graph into a pre-trained Transformer model for processing further comprises the following steps:
S201, at the fragment level, performing fragment division on the molecular graph by the BRICS algorithm to obtain divided molecular graph fragments;
and S202, labeling the SMILES character sequence according to a labeling rule to obtain a labeled character sequence.
Transformer layer: after the embedding layer, the obtained SMILES embedding s is summed with its positional embedding p_s, the result is concatenated with the graph embedding g, and a Transformer encoder, denoted θ, is applied to perform multi-modal fusion, including both inter-modal and intra-modal fusion. For each molecule an embedding x = θ(s + p_s, g) is thereby obtained, where x = (x_1, x_2, …, x_n, x_{n+1}, …, x_{n+m}), and the first n elements and the last m elements of x are the resulting SMILES and molecular graph representations, respectively. Then, an average pooling operation is applied over x to obtain the final molecular representation, denoted x_cls = AvgPool(x_1, …, x_{n+m}). Based on x_cls, different losses can be designed to drive the pre-training process; considering the characteristics of the SMILES-molecular graph relationship, both local (i.e., fragment-level) and global (i.e., molecule-level) losses are designed, as sketched below.
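The fusion step above can be summarized in a few lines of PyTorch; the encoder depth, head count, and the random tensors standing in for s, p_s, and g are placeholders, not the invention's configuration.

```python
# Sketch of the single-stream fusion: sum SMILES embeddings with positional
# embeddings, concatenate with graph node embeddings, run a shared Transformer
# encoder, then average-pool to obtain x_cls. Shapes are assumptions.
import torch
import torch.nn as nn

dim, n, m = 256, 12, 9                         # D, SMILES tokens, atoms
s = torch.randn(n, dim)                        # token embeddings s_1..s_n
p_s = torch.randn(n, dim)                      # positional embeddings for SMILES
g = torch.randn(m, dim)                        # GNN node embeddings g_1..g_m

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
    num_layers=6)

x_in = torch.cat([s + p_s, g], dim=0).unsqueeze(0)   # (1, n+m, D)
x = encoder(x_in)                                    # x_1..x_{n+m}
x_cls = x.mean(dim=1)                                # average pooling -> x_cls
```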
The idea of the molecular graph fragment decomposition algorithm is to first apply a well-established method to decompose the graph into different fragments, and then assign each SMILES character to a corresponding fragment according to the SMILES definition. Specifically, the BRICS algorithm is used to divide the molecular graph into different fragments, K_j fragments for the j-th molecule; for convenience, K_j is written as K below. Fragment membership can be represented by two label vectors, l^s for the SMILES tokens and l^g for the graph nodes, indicating which fragment each token or atom belongs to, where l^g_i denotes the fragment ID of the i-th atom and takes values from 0 to K-1.
The fragment-level tasks include: cross-modal masking and fragment alignment.
S301, the cross-modal masking is performed at the fragment level and the character level respectively;
S302, the fragment-level mask is provided with the complete information of the other modality;
S303, the character-level mask is provided with a single modality only.
Multi-level cross-modal masking recovers the masked SMILES tokens or molecular graph fragments using all the information of the adjacent text and of the other modality.
The objective is defined as predicting the masked SMILES tokens or molecular graph fragments using all the information of their surrounding context and of the other modality (or of its negative examples). In this way, both inter-modal and intra-modal information can be exploited. Since individual tokens are small compared with fragments, context information within the same modality is sufficient for masked language modeling at the token level; these masking strategies are compared in ablation studies.
First, the character-level mask is introduced. For a given SMILES-molecular graph pair (S, G), a negative pair (S, G′) is constructed by randomly replacing the graph G with another molecular graph G′ from the training set. Thereafter, the input tokens of the SMILES and the atoms of the graph are masked independently at the same ratio r. Specifically, for the SMILES, ⌈|S|·r⌉ tokens are randomly selected from S for masking, and the type of each masked token is predicted.
For the molecular graph, ⌈|V|·r⌉ atoms are randomly selected and their initial features are reset: each selected atom, together with its adjacent edges, is replaced by a predefined masked atom, and node embeddings are then obtained on the masked graph using the GNN. The context information of the masked atoms is predicted following the GROVER approach, and the total loss is defined as follows:

L_t-CMM = −log P(s_m | S_\m, G′_\m) − log P(g′_m | S_\m, G′_\m),

where s_m and g′_m denote the masked SMILES tokens and masked atoms, and S_\m and G′_\m denote the remaining context in the SMILES S and the graph G′. For the cross-modal mask at the fragment level (f-CMM), ⌈r·K⌉ fragments are randomly selected, and for each selected fragment one modality is chosen with probability 0.5 to be masked. For the fragment-level mask, the data processing and prediction targets are the same as for the token-level mask, except that the fragment is used as the masking unit. Thus, the loss can be written as:

L_f-CMM = −log P(s_m | S_\m, G) − log P(g_m | S, G_\m),

where s_m and g_m denote the masked SMILES fragments and graph fragments respectively, and S_\m and G_\m denote the contexts in S and G.
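A minimal sketch of the two mask-selection rules described above; the mask ratio r = 0.15 and the helper names are assumptions for illustration.

```python
# Hedged sketch of selecting mask targets for token-level (t-CMM) and
# fragment-level (f-CMM) masking; ratios and RNG handling are illustrative.
import math
import random

def select_token_masks(n_tokens: int, n_atoms: int, r: float = 0.15):
    """Independently mask ceil(|S|*r) SMILES tokens and ceil(|V|*r) atoms."""
    toks = random.sample(range(n_tokens), math.ceil(n_tokens * r))
    atoms = random.sample(range(n_atoms), math.ceil(n_atoms * r))
    return toks, atoms

def select_fragment_masks(n_fragments: int, r: float = 0.15):
    """Pick ceil(K*r) fragments, then mask each in one random modality."""
    chosen = random.sample(range(n_fragments), math.ceil(n_fragments * r))
    # For every chosen fragment, mask either its SMILES span or its atoms.
    return [(k, "smiles" if random.random() < 0.5 else "graph") for k in chosen]
```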
Fragment alignment contrastively learns the SMILES fragment representation, obtained through a multi-head attention mechanism, against the average-pooled molecular graph fragment representation, and captures the fine-grained correspondence between modalities through the Transformer model.
Fragment alignment (FLA) in the invention is a fine-grained cross-modal alignment strategy. The basic idea is to obtain mutually corresponding fragment representations from a SMILES-molecular graph pair and then align them with a contrastive loss. For the k-th fragment, the SMILES-side representation h^s_k is obtained from a multi-head attention layer over the tokens of the fragment, while the graph-side representation h^g_k is obtained directly by average pooling over the embeddings of the relevant atoms. Clearly, h^s_k and h^g_k form a directly aligned positive pair, since they represent the same fragment in different modalities. Negative pairs are constructed as follows: for each SMILES fragment h^s_k, the negative graph fragment set N_s includes all the other fragments of the same graph as well as the fragments of other molecular graphs; identical fragments in different graphs are still regarded as negatives here because their context information differs. Likewise, for each graph fragment h^g_k, a corresponding negative set N_g of SMILES fragments is obtained. The loss is a combination of SMILES-based and graph-based contrastive losses, i.e., L_FLA = L_s + L_g, where

L_s = −(1/K) Σ_{k=1..K} log [ exp(cos(h^s_k, h^g_k)/τ) / Σ_{h ∈ {h^g_k} ∪ N_s} exp(cos(h^s_k, h)/τ) ],

L_g = −(1/K) Σ_{k=1..K} log [ exp(cos(h^g_k, h^s_k)/τ) / Σ_{h ∈ {h^s_k} ∪ N_g} exp(cos(h^g_k, h)/τ) ],

where cos(·,·) denotes the cosine similarity function and τ denotes a temperature hyperparameter.
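The FLA loss is an InfoNCE-style contrastive objective. The sketch below implements the cosine-similarity form with temperature τ; treating every off-diagonal fragment in the batch as a negative is a simplification of how N_s and N_g are realized, not necessarily the invention's exact construction.

```python
# Sketch of the fragment-alignment (FLA) contrastive loss: cosine similarity
# with temperature tau between SMILES-side and graph-side fragment vectors.
import torch
import torch.nn.functional as F

def fla_loss(h_s: torch.Tensor, h_g: torch.Tensor, tau: float = 0.1):
    """h_s, h_g: (K, D) fragment representations; row k is a positive pair."""
    h_s = F.normalize(h_s, dim=-1)
    h_g = F.normalize(h_g, dim=-1)
    sim = h_s @ h_g.t() / tau                       # cos(h^s_i, h^g_j) / tau
    labels = torch.arange(h_s.size(0), device=sim.device)
    l_s = F.cross_entropy(sim, labels)              # SMILES -> graph direction
    l_g = F.cross_entropy(sim.t(), labels)          # graph -> SMILES direction
    return l_s + l_g
```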
Using the molecular representation vector as the input of a downstream task to complete the molecular representation specifically comprises:
S401, the molecular-level task predicts positive and negative pairs of the SMILES character sequence and the molecular graph, judging whether the two come from the same molecule;
S402, predicting molecular fingerprints and functional-group information through domain-knowledge learning, strengthening the learning of chemical domain knowledge;
S403, through pre-training, the Transformer model simultaneously learns cross-modal information at both the fragment level and the molecule level, yielding an enhanced molecular representation.
The invention was tested on 8 downstream tasks, including molecular property prediction and drug-drug interaction prediction; compared with various baseline models, the results are significantly improved on 6 of these tasks.
Referring to Tables 1 and 2, for molecular property prediction the scheme selects 4 classification tasks (BBBP, SIDER, ClinTox, HIV) and 3 regression tasks (FreeSolv, ESOL, Lipophilicity) from MoleculeNet. It can be seen that, compared with supervised, SMILES-based, molecular-graph-based, and hybrid approaches, the scheme performs best on all 3 regression tasks and best or second-best on the classification tasks.
Table 1. Molecular property prediction results
In Table 1, bold numbers indicate the best results and underlined numbers indicate the second-best results.
Table 2. Drug-drug interaction prediction results
In Table 2, bold numbers indicate the best results and underlined numbers indicate the second-best results.
The molecular pre-training representation method fusing the SMILES sequence and the molecular graph overcomes the limitations of existing hybrid methods through a Transformer model and captures fine-grained cross-modal semantics between the SMILES sequence and the molecular graph. Corresponding SMILES and molecular graph fragments are obtained and, together with their labels, used as heterogeneous input to the pre-training tasks; a shared Transformer is used for deep cross-modal fusion. By introducing two new fragment-level pre-training tasks, namely multi-level cross-modal masking and fragment-level alignment, the fine-grained correspondence between modalities is captured, the correspondence between SMILES and the molecular graph at different levels is fully exploited, and the molecular representation is enhanced.
Referring to FIG. 6, the invention also discloses a molecular pre-training representation system fusing a SMILES sequence and a molecular graph, the system comprising:
a data acquisition module 110 for acquiring a character sequence in SMILES form and a molecular graph;
a processing module 120 for inputting the SMILES character sequence and the molecular graph into a pre-trained Transformer model for processing at both the molecular level and the fragment level, and outputting a molecular representation vector;
and a molecular representation module 130 for using the molecular representation vector as the input of a downstream task to complete the molecular representation.
At the molecular level, the processing module passes the SMILES character sequence through a tokenizer and a linear embedding layer to obtain a first embedding vector for each character; passes the molecular graph through a graph neural network to obtain a second embedding vector for each node; and inputs the first and second embedding vectors together into the Transformer model for processing.
At the fragment level, the processing module performs fragment division on the molecular graph by the BRICS algorithm to obtain divided molecular graph fragments, and labels the SMILES character sequence according to a labeling rule to obtain a labeled character sequence.
The fragment-level tasks include: cross-modal masking and fragment alignment;
the cross-modal masking is performed at the fragment level and the character level respectively;
the fragment-level mask is provided with the complete information of the other modality;
the character-level mask is provided with a single modality only.
Fragment alignment contrastively learns the SMILES fragment representation, obtained through a multi-head attention mechanism, against the average-pooled molecular graph fragment representation, and captures the fine-grained correspondence between modalities through the Transformer model.
The molecular representation module uses the molecular representation vector as the input of a downstream task to complete the molecular representation, specifically:
the molecular-level task predicts positive and negative pairs of the SMILES character sequence and the molecular graph, judging whether the two come from the same molecule;
molecular fingerprints and functional-group information are predicted through domain-knowledge learning, strengthening the learning of chemical domain knowledge;
and through pre-training, the Transformer model simultaneously learns cross-modal information at both the fragment level and the molecule level, yielding an enhanced molecular representation.
The molecular pre-training representation system fusing the SMILES sequence and the molecular graph overcomes the limitations of existing hybrid methods through a Transformer model and captures fine-grained cross-modal semantics between the SMILES sequence and the molecular graph. Corresponding SMILES and molecular graph fragments are obtained and, together with their labels, used as heterogeneous input to the pre-training tasks; a shared Transformer is used for deep cross-modal fusion. By introducing two new fragment-level pre-training tasks, namely multi-level cross-modal masking and fragment-level alignment, the fine-grained correspondence between modalities is captured, the correspondence between SMILES and the molecular graph at different levels is fully exploited, and the molecular representation is enhanced.
FIG. 7 illustrates the physical structure of an electronic device. As shown in FIG. 7, the electronic device may include: a processor 710, a communication interface 720, a memory 730, and a communication bus 740, wherein the processor 710, the communication interface 720, and the memory 730 communicate with each other via the communication bus 740. The processor 710 may call logic instructions in the memory 730 to perform the molecular pre-training representation method fusing a SMILES sequence and a molecular graph, the method comprising: acquiring a character sequence in SMILES form and a molecular graph;
inputting the SMILES character sequence and the molecular graph into a pre-trained Transformer model for processing at both the molecular level and the fragment level, and outputting a molecular representation vector;
and using the molecular representation vector as the input of a downstream task to complete the molecular representation.
In addition, the logic instructions in the memory 730 may be implemented as software functional units and, when sold or used as an independent product, stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, or the part thereof that substantially contributes over the prior art, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods according to the embodiments of the present invention. The aforementioned storage media include various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
In another aspect, the present invention also provides a computer program product comprising a computer program storable on a non-transitory computer-readable storage medium; when the computer program is executed by a processor, the computer can execute the molecular pre-training representation method fusing a SMILES sequence and a molecular graph provided by the above methods, the method comprising: acquiring a character sequence in SMILES form and a molecular graph; inputting the SMILES character sequence and the molecular graph into a pre-trained Transformer model for processing at both the molecular level and the fragment level, and outputting a molecular representation vector; and using the molecular representation vector as the input of a downstream task to complete the molecular representation.
In yet another aspect, the present invention further provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the molecular pre-training representation method fusing a SMILES sequence and a molecular graph provided by the above methods, the method comprising: acquiring a character sequence in SMILES form and a molecular graph; inputting the SMILES character sequence and the molecular graph into a pre-trained Transformer model for processing at both the molecular level and the fragment level, and outputting a molecular representation vector; and using the molecular representation vector as the input of a downstream task to complete the molecular representation.
The above-described apparatus embodiments are merely illustrative; the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, i.e., they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment, and a person of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment may be implemented by software plus a necessary general hardware platform, or by hardware. Based on this understanding, the above technical solutions may be embodied in the form of a software product stored in a computer-readable storage medium, such as a ROM/RAM, magnetic disk, or optical disk, including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute the methods described in the embodiments or parts thereof.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solution of the present invention, not to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced, and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A molecular pre-training representation method fusing a SMILES sequence and a molecular graph, comprising:
acquiring a character sequence in SMILES form and a molecular graph;
inputting the SMILES character sequence and the molecular graph into a pre-trained Transformer model for processing at both the molecular level and the fragment level, and outputting a molecular representation vector;
and using the molecular representation vector as the input of a downstream task to complete the molecular representation.
2. The molecular pre-training representation method fusing a SMILES sequence and a molecular graph of claim 1, wherein inputting the SMILES character sequence and the molecular graph into a pre-trained Transformer model for processing specifically comprises:
at the molecular level, passing the SMILES character sequence through a tokenizer and a linear embedding layer to obtain a first embedding vector for each character;
passing the molecular graph through a graph neural network to obtain a second embedding vector for each node;
and inputting the first embedding vectors and the second embedding vectors together into the Transformer model for processing.
3. The molecular pre-training representation method fusing a SMILES sequence and a molecular graph of claim 2, wherein inputting the SMILES character sequence and the molecular graph into a pre-trained Transformer model for processing further comprises:
at the fragment level, performing fragment division on the molecular graph by the BRICS algorithm to obtain divided molecular graph fragments;
and labeling the SMILES character sequence according to a labeling rule to obtain a labeled character sequence.
4. The molecular pre-training representation method fusing a SMILES sequence and a molecular graph of claim 3, wherein the fragment-level tasks include: cross-modal masking and fragment alignment;
the cross-modal masking is performed at the fragment level and the character level respectively;
the fragment-level mask is provided with the complete information of the other modality;
the character-level mask is provided with a single modality only.
5. The molecular pre-training representation method fusing a SMILES sequence and a molecular graph of claim 4, wherein the fragment alignment contrastively learns the SMILES fragment representation, obtained through a multi-head attention mechanism, against the average-pooled molecular graph fragment representation,
and captures the fine-grained correspondence between modalities through the Transformer model.
6. The molecular pre-training representation method fusing a SMILES sequence and a molecular graph of claim 4, wherein using the molecular representation vector as the input of a downstream task to complete the molecular representation comprises:
the molecular-level task predicts positive and negative pairs of the SMILES character sequence and the molecular graph, judging whether the two come from the same molecule;
predicting molecular fingerprints and functional-group information through domain-knowledge learning, strengthening the learning of chemical domain knowledge;
and through pre-training, the Transformer model simultaneously learns cross-modal information at both the fragment level and the molecule level, yielding an enhanced molecular representation.
7. A molecular pre-training representation system fusing a SMILES sequence and a molecular graph, the system comprising:
a data acquisition module for acquiring a character sequence in SMILES form and a molecular graph;
a processing module for inputting the SMILES character sequence and the molecular graph into a pre-trained Transformer model for processing at both the molecular level and the fragment level, and outputting a molecular representation vector;
and a molecular representation module for using the molecular representation vector as the input of a downstream task to complete the molecular representation.
8. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the molecular pre-training representation method fusing a SMILES sequence and a molecular graph according to any one of claims 1 to 6.
9. A non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the molecular pre-training representation method fusing a SMILES sequence and a molecular graph according to any one of claims 1 to 6.
10. A computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the molecular pre-training representation method fusing a SMILES sequence and a molecular graph according to any one of claims 1 to 6.
CN202211282025.6A 2022-10-19 2022-10-19 Molecular pre-training representation method and system fusing a SMILES sequence and a molecular graph Pending CN115762659A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202211282025.6A | 2022-10-19 | 2022-10-19 | Molecular pre-training representation method and system fusing a SMILES sequence and a molecular graph

Publications (1)

Publication Number | Publication Date
CN115762659A | 2023-03-07

Family

ID=85353925

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202211282025.6A | Molecular pre-training representation method and system fusing a SMILES sequence and a molecular graph | 2022-10-19 | 2022-10-19

Country Status (1)

Country Link
CN (1) CN115762659A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116612835A (en) * 2023-07-18 2023-08-18 微观纪元(合肥)量子科技有限公司 Training method for compound property prediction model and prediction method for compound property
CN116612835B (en) * 2023-07-18 2023-10-10 微观纪元(合肥)量子科技有限公司 Training method for compound property prediction model and prediction method for compound property
CN117153294A (en) * 2023-10-31 2023-12-01 烟台国工智能科技有限公司 Molecular generation method of single system
CN117153294B (en) * 2023-10-31 2024-01-26 烟台国工智能科技有限公司 Molecular generation method of single system


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination