CN112786108B - Training method, device, equipment and medium of molecular understanding model

Training method, device, equipment and medium of molecular understanding model

Info

Publication number
CN112786108B
Authority
CN
China
Prior art keywords
molecular
output
sequence
processing
network
Prior art date
Legal status
Active
Application number
CN202110082654.3A
Other languages
Chinese (zh)
Other versions
CN112786108A (en)
Inventor
李宇琨
张涵
肖东凌
孙宇
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202110082654.3A priority Critical patent/CN112786108B/en
Publication of CN112786108A publication Critical patent/CN112786108A/en
Application granted granted Critical
Publication of CN112786108B publication Critical patent/CN112786108B/en


Classifications

    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G06F18/24: Pattern recognition; Analysing; Classification techniques
    • G06N3/044: Neural networks; Architecture, e.g. interconnection topology; Recurrent networks, e.g. Hopfield networks
    • G06N3/045: Neural networks; Architecture, e.g. interconnection topology; Combinations of networks
    • G06N3/084: Neural networks; Learning methods; Backpropagation, e.g. using gradient descent
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Biology (AREA)
  • Medical Informatics (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Biotechnology (AREA)
  • Epidemiology (AREA)
  • Public Health (AREA)
  • Databases & Information Systems (AREA)
  • Bioethics (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Machine Translation (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a training method, device, equipment and medium for a molecular understanding model, and relates to the field of computer technology, in particular to artificial intelligence fields such as natural language processing and deep learning. The training method comprises the following steps: obtaining pre-training data, the pre-training data comprising a first molecular representation sequence sample and a second molecular representation sequence sample, which are two different molecular representation sequence samples of the same molecule; processing the first molecular representation sequence sample with the molecular understanding model to obtain a pre-training output; and calculating a pre-training loss function according to the pre-training output and the second molecular representation sequence sample, and updating parameters of the molecular understanding model according to the pre-training loss function. The present disclosure can improve the molecular understanding effect of the molecular understanding model.

Description

Training method, device, equipment and medium of molecular understanding model
Technical Field
The present disclosure relates to the field of computer technology, in particular to artificial intelligence fields such as natural language processing and deep learning, and more particularly to a training method, device, equipment and medium for a molecular understanding model.
Background
Artificial Intelligence (AI) is the discipline that studies how to make computers simulate certain human thought processes and intelligent behaviors (e.g., learning, reasoning, thinking and planning), and it covers both hardware-level and software-level techniques. Artificial intelligence hardware technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage and big data processing; artificial intelligence software technologies mainly include computer vision, speech recognition, natural language processing, machine learning/deep learning, big data processing and knowledge graph technologies.
The Simplified Molecular Input Line Entry Specification (SMILES) is a specification that unambiguously describes a molecule with a string of American Standard Code for Information Interchange (ASCII) characters. Based on SMILES, a molecule may be represented as one or more SMILES sequences. With the development of deep learning, deep learning techniques can be applied to the fields of physics and chemistry.
In the related art, molecular understanding is performed by training a Bidirectional Encoder Representations from Transformers (BERT) model on a single SMILES sequence of a molecule using a Masked Language Model (MLM) task.
Disclosure of Invention
The present disclosure provides a training method, apparatus, device and medium for a molecular understanding model.
According to an aspect of the present disclosure, there is provided a training method of a molecular understanding model, including: obtaining pre-training data, the pre-training data comprising: a first molecular representation sequence sample and a second molecular representation sequence sample, the first molecular representation sequence sample and the second molecular representation sequence sample being two different molecular representation sequence samples of the same molecule; processing the first molecular representation sequence sample by adopting the molecular understanding model to obtain a pre-training output; and calculating a pre-training loss function according to the pre-training output and the second molecular representation sequence sample, and updating parameters of the molecular understanding model according to the pre-training loss function.
According to another aspect of the present disclosure, there is provided a molecular processing method based on a molecular model, the molecular model including a molecular understanding model obtained using two different molecular representation sequence samples of the same molecule and an output network, the molecular processing method including: processing a molecular application input with the molecular understanding model to obtain a hidden layer output, wherein the molecular application input comprises a fixed identifier when the output network is a molecular generation network; and processing the hidden layer output with the output network to obtain a molecular application output.
According to another aspect of the present disclosure, there is provided a training apparatus of a molecular understanding model, including: the acquisition module is used for acquiring pre-training data, wherein the pre-training data comprises: a first molecular representation sequence sample and a second molecular representation sequence sample, the first molecular representation sequence sample and the second molecular representation sequence sample being two different molecular representation sequence samples of the same molecule; the processing module is used for processing the first molecular representation sequence sample by adopting the molecular understanding model so as to obtain a pre-training output; and the updating module is used for calculating a pre-training loss function according to the pre-training output and the second molecular representation sequence sample, and updating parameters of the molecular understanding model according to the pre-training loss function.
According to another aspect of the present disclosure, there is provided a molecular processing device based on a molecular model including a molecular understanding model and an output network, the molecular processing device including: the first processing module is used for processing the molecular application input by adopting the molecular understanding model to obtain hidden layer output, and the molecular application input comprises a fixed identifier when the output network is a molecular generation network; and the second processing module is used for processing the hidden layer output by adopting the output network so as to obtain molecular application output.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of the above aspects.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method according to any one of the above aspects.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements a method according to any of the above aspects.
According to the technical scheme, the molecular understanding effect of the molecular understanding model can be improved.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic diagram according to a first embodiment of the present disclosure;
FIG. 2 is a schematic diagram according to a second embodiment of the present disclosure;
FIG. 3 is a schematic diagram according to a third embodiment of the present disclosure;
FIG. 4 is a schematic diagram according to a fourth embodiment of the present disclosure;
FIG. 5 is a schematic diagram according to a fifth embodiment of the present disclosure;
FIG. 6 is a schematic diagram according to a sixth embodiment of the present disclosure;
FIG. 7 is a schematic diagram according to a seventh embodiment of the present disclosure;
FIG. 8 is a schematic diagram according to an eighth embodiment of the present disclosure;
FIG. 9 is a schematic diagram according to a ninth embodiment of the present disclosure;
FIG. 10 is a schematic diagram according to a tenth embodiment of the present disclosure;
FIG. 11 is a schematic illustration according to an eleventh embodiment of the present disclosure;
FIG. 12 is a schematic diagram of an electronic device that can be used to implement the training method of the molecular understanding model or the molecular processing method of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
When deep learning is applied to the fields of physics and chemistry, a large number of compound molecules can be converted into SMILES sequences, and the SMILES sequences can be input into a BERT model as text for pre-training to obtain a pre-trained model; the pre-trained model can then be fine-tuned based on a downstream molecular task.
In the related art, a single SMILES sequence of a molecule is input into a BERT model and pre-training is performed based on an MLM task to obtain a pre-trained model for molecular understanding. Because only a single SMILES sequence is used, the characteristics of SMILES sequences are not fully exploited, so that the molecular understanding effect of the pre-trained molecular understanding model is poor.
In order to solve the problem of poor molecular understanding effect of the molecular understanding model existing in the related art, the present disclosure provides some examples as follows.
Fig. 1 is a schematic diagram according to a first embodiment of the present disclosure. The embodiment provides a training method of a molecular understanding model, which comprises the following steps:
101. Obtaining pre-training data, the pre-training data comprising: a first molecular representation sequence sample and a second molecular representation sequence sample, the first molecular representation sequence sample and the second molecular representation sequence sample being two different molecular representation sequence samples of the same molecule.
102. Processing the first molecular representation sequence sample with the molecular understanding model to obtain a pre-training output.
103. Calculating a pre-training loss function according to the pre-training output and the second molecular representation sequence sample, and updating parameters of the molecular understanding model according to the pre-training loss function.
A molecule is a physicochemical concept: the smallest relatively stable unit of a substance that can exist independently while maintaining the physicochemical properties of that substance. A molecule is composed of atoms, which are bound together by certain forces in a certain order and arrangement; this order and arrangement may be referred to as the molecular structure. Thus, a molecule can be characterized by its atoms and its molecular structure, and the physicochemical properties of a molecule depend not only on the kind and number of its constituent atoms but also on its molecular structure.
Natural Language Understanding (NLU) is an important component of Natural Language Processing (NLP); the core task of NLU is to convert natural language into a formal language that machines can process, establishing a connection between natural language and machines.
Similar to natural language understanding, molecular understanding refers to converting a molecular representation sequence into a molecular understanding representation, i.e., a representation that can be processed by a machine. For example, the molecular understanding representation may include a probability distribution vector for each time step, where the i-th (i = 1, ..., n) element of the probability distribution vector is the probability of the i-th word in the vocabulary and n is the size of the vocabulary. In the embodiments of the present disclosure, the vocabulary refers to the vocabulary corresponding to the molecular representation sequence; taking a SMILES sequence as an example of the molecular representation sequence, since a SMILES sequence is an ASCII string, the corresponding vocabulary may be an ASCII vocabulary.
In some embodiments, the molecular representation sequence is a SMILES sequence. By adopting SMILES sequences, the fact that the same molecule corresponds to multiple SMILES sequences can be fully exploited; compared with understanding a molecule through a single SMILES sequence, different SMILES sequences of the same molecule allow the molecule to be understood better, thereby improving the molecular understanding effect of the molecular understanding model.
Based on SMILES, different SMILES sequences of the same molecule can be obtained. For example, referring to fig. 2, multiple SMILES sequences 202 can be obtained for the same molecule 201.
Further, two different SMILES sequences may be randomly selected from the plurality of SMILES sequences. For example, for the molecule 201 shown in fig. 2, the two SMILES sequences obtained may be the first and the third, namely: CC(Oc1ccccc1C(=O)O)=O and C(c1c(cccc1)OC(=O)C)(=O)O.
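As an illustration of how such a pair of samples could be assembled, the sketch below enumerates several SMILES strings of one molecule and randomly picks two of them; it assumes the RDKit toolkit (specifically Chem.MolToSmiles with doRandom=True), which the disclosure itself does not prescribe.

```python
# Illustrative only: the disclosure does not prescribe a toolkit; this sketch
# assumes RDKit for enumerating different SMILES sequences of the same molecule.
import random
from rdkit import Chem

mol = Chem.MolFromSmiles("CC(Oc1ccccc1C(=O)O)=O")  # e.g. molecule 201

# Collect several distinct SMILES strings that all describe the same molecule.
variants = {Chem.MolToSmiles(mol, canonical=False, doRandom=True) for _ in range(50)}

# Randomly pick two different molecular representation sequence samples.
first_sample, second_sample = random.sample(sorted(variants), 2)
print(first_sample)
print(second_sample)
```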
To distinguish the training phase from the application phase, the data used in the training phase may be referred to as samples; for example, data in the application phase may be referred to as molecular representation sequences, while data in the training phase may be referred to as molecular representation sequence samples. Thus, in the training phase, two different molecular representation sequence samples of the same molecule, which may be referred to as the first molecular representation sequence sample and the second molecular representation sequence sample, can be obtained in the manner described above.
After the first molecular representation sequence sample and the second molecular representation sequence sample are obtained, the first molecular representation sequence sample can be input into the molecular understanding model. The molecular understanding model initially processes the first molecular representation sequence sample with initial parameters, and its output is referred to as the pre-training output. A pre-training loss function can then be calculated based on the pre-training output and the second molecular representation sequence sample, and the parameters of the molecular understanding model can be updated based on the pre-training loss function until the pre-training loss function converges; the parameters at convergence are taken as the final parameters of the molecular understanding model. The pre-training loss function is not limited here; it is, for example, a negative log-likelihood (NLL) function.
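A minimal sketch of one such pre-training step is given below, assuming PyTorch, a character-level vocabulary and a model exposing an encoder_input/decoder_input interface; these names and the use of teacher forcing are illustrative assumptions rather than the concrete implementation of the disclosure.

```python
# A minimal pre-training step, assuming PyTorch; the model interface
# (encoder_input / decoder_input) and teacher forcing are illustrative.
import torch
import torch.nn.functional as F

def pretrain_step(model, optimizer, first_ids, second_ids):
    """first_ids: token ids of the first SMILES sample (encoder input).
    second_ids: token ids of the second SMILES sample (expected output)."""
    decoder_input = second_ids[:, :-1]          # decoder sees the target shifted right
    target = second_ids[:, 1:]

    # Pre-training output: one probability distribution (as logits) per time step.
    logits = model(encoder_input=first_ids, decoder_input=decoder_input)

    # Negative log-likelihood of the second sample under the pre-training output.
    log_probs = F.log_softmax(logits, dim=-1)
    loss = F.nll_loss(log_probs.transpose(1, 2), target)

    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```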
In some embodiments, as shown in fig. 3, the molecular understanding model may include an input layer and a hidden layer. The input layer may be an embedding layer for converting an input sequence into an input vector, and the hidden layer may specifically include an encoder 301 and a decoder 302. Taking a SMILES sequence as an example of the molecular representation sequence, when training the molecular understanding model, the first SMILES sequence sample is converted into an input vector by the embedding layer, the input vector is input into the encoder, and the pre-training output is obtained through processing by the encoder and the decoder. The pre-training output is a probability distribution vector for each time step; then, based on the pre-training output and the expected output sequence sample corresponding to the time steps, i.e., the second SMILES sequence sample, the pre-training loss function can be calculated, so that the parameters of the molecular understanding model are updated based on the pre-training loss function.
In some embodiments, the encoder includes a first self-attention (self-attention) layer that employs a bi-directional self-attention mechanism; and/or the decoder comprises a second self-attention layer, the second self-attention layer employing a unidirectional self-attention mechanism.
With the encoder adopting a bidirectional self-attention mechanism and the decoder adopting a unidirectional self-attention mechanism, different self-attention mechanisms can be applied to different inputs; the implementation is more flexible, and the molecular understanding effect of the molecular understanding model can be improved.
In some embodiments, the encoder further comprises a first shared network, the decoder further comprises a second shared network, the first and second shared networks having the same network structure and network parameters.
The encoder and the decoder adopt a shared network, so that the same characteristics can be better utilized in the encoding and decoding processes, and the molecular understanding effect of the molecular understanding model is improved.
For example, referring to fig. 4, the encoder and the decoder may be implemented based on a Transformer network. In fig. 4, both the encoder and the decoder are composed of a plurality of Transformer layers, the structure of each Transformer layer being, for example, that of an encoder layer of the Transformer network, and the structures of the respective Transformer layers being the same. In each Transformer layer, the decoder has a structure similar to that of the encoder, i.e., it may include a self-attention layer and a shared network; for distinction, the layers in the encoder may be referred to as the first self-attention layer and the first shared network, and the layers in the decoder as the second self-attention layer and the second shared network. The difference is that, as shown in fig. 4, the first self-attention layer 401 in the encoder is a bidirectional self-attention layer while the second self-attention layer 402 in the decoder is a unidirectional self-attention layer; the shared network of the two may be the feed-forward layer of the encoder layer of the Transformer network.
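The sketch below illustrates one plausible reading of such a Transformer layer pair in PyTorch: the encoder layer uses unmasked (bidirectional) self-attention, the decoder layer applies unidirectional (causally masked) self-attention over the encoded output followed by the generated output, and both call the very same feed-forward module as the shared network. Layer normalization, dropout and the dimensions are simplified assumptions, not the disclosure's exact architecture.

```python
# One plausible sketch of an encoder layer and a decoder layer sharing the
# same feed-forward network, in PyTorch; layer norm and dropout are omitted.
import torch
import torch.nn as nn

class SharedFFNLayerPair(nn.Module):
    def __init__(self, d_model=256, n_heads=8, d_ff=1024):
        super().__init__()
        self.enc_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.dec_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # First and second shared network: same structure and same parameters.
        self.shared_ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )

    def encode(self, x):
        # Bidirectional self-attention: every position attends to all positions.
        h, _ = self.enc_attn(x, x, x)
        return self.shared_ffn(x + h)

    def decode(self, memory, y):
        # Unidirectional self-attention over [encoded output ; generated output]:
        # the encoded prefix is fully visible, generated positions cannot look ahead.
        seq = torch.cat([memory, y], dim=1)
        m, n = memory.size(1), memory.size(1) + y.size(1)
        mask = torch.triu(torch.ones(n, n, dtype=torch.bool, device=y.device), diagonal=1)
        mask[:, :m] = False  # all positions may attend to the whole encoded prefix
        h, _ = self.dec_attn(seq, seq, seq, attn_mask=mask)
        out = self.shared_ffn(seq + h)
        return out[:, m:]    # keep only the generated-output positions
```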
A sequence refers to a combination of a plurality of sequence units, and the sequence units may differ depending on the application scenario; for example, in the field of Chinese NLP, a sequence unit may be a single Chinese character.
In the physicochemical field to which embodiments of the present disclosure relate, a sequence unit may be a character that characterizes a molecule; for example, for a SMILES sequence, the sequence unit is an ASCII character, such as C, O, etc., as depicted in fig. 2.
When a sequence is output, the sequence units may be output one by one; for example, for the three characters A, B and C, A may be output at the first time step, B at the second time step, and C at the third time step. In the embodiments of the present disclosure, the characters that have already been output may be used when outputting the current character; for example, the character B may be output based on the already-output character A, and the character C may be output based on the characters A and B.
Accordingly, in some embodiments, processing the first molecular representation sequence sample using the molecular understanding model to obtain the pre-training output includes: performing bidirectional self-attention processing on the first molecular representation sequence sample with the first self-attention layer of the encoder to obtain a bidirectional self-attention processing result; processing the bidirectional self-attention processing result with the first shared network portion of the encoder to obtain an encoded output; performing unidirectional self-attention processing on the encoded output and the generated output with the second self-attention layer of the decoder to obtain a unidirectional self-attention processing result; and processing the unidirectional self-attention processing result with the second shared network portion of the decoder to obtain the pre-training output.
For example, referring to fig. 5, the embedding layer 501 converts the first SMILES sequence sample into a first input vector, which sequentially passes through the first self-attention layer and the first shared network of the encoder 502 to produce an encoded vector that is fed to the decoder. The other input of the decoder is a second input vector obtained by converting the already generated output sequence through the embedding layer 501. The encoded vector and the second input vector sequentially pass through the second self-attention layer and the second shared network of the decoder 503, which then produces the pre-training output, specifically a probability distribution vector. The currently generated sequence unit can then be determined based on the probability distribution vector and, as the generated output for the subsequent time step, fed back into the decoder after passing through the embedding layer; sequence units are thus generated one by one until a terminator is generated.
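A schematic version of this step-by-step generation loop is shown below; greedy selection of the most probable sequence unit and the model interface are assumptions made only for illustration.

```python
# A schematic generation loop; greedy decoding and the model interface are
# assumptions for illustration only.
import torch

@torch.no_grad()
def generate(model, encoder_ids, bos_id, eos_id, max_len=128):
    """Generate sequence units one by one until the terminator is produced."""
    generated = [bos_id]
    for _ in range(max_len):
        decoder_ids = torch.tensor([generated])
        logits = model(encoder_input=encoder_ids, decoder_input=decoder_ids)
        probs = logits[0, -1].softmax(dim=-1)  # distribution of the current time step
        next_id = int(probs.argmax())          # currently generated sequence unit
        if next_id == eos_id:                  # stop once the terminator is generated
            break
        generated.append(next_id)
    return generated[1:]
```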
Through the generation flow of the pre-training output, the accuracy of the pre-training output can be improved, and then the molecular understanding effect of the molecular understanding model is improved.
In this embodiment, by training the molecular understanding model with two different molecular representation sequence samples of the same molecule, the characteristics of the molecular representation sequence can be fully utilized, and the molecular understanding effect of the molecular understanding model can be improved compared with training the model with a single molecular representation sequence sample.
The above embodiments, alone or in combination, describe the pre-training process of the molecular understanding model. The molecular understanding model can therefore be used as a pre-trained model, which can then be fine-tuned to obtain a fine-tuned model for downstream molecular processing tasks. The fine-tuned model may be referred to as a molecular model; the training process of the molecular model is described below.
Fig. 6 is a schematic diagram according to a sixth embodiment of the present disclosure. The present embodiment provides a training method of a molecular model, as shown in fig. 6, including:
601. Acquiring fine-tuning training data.
602. Fine-tuning the molecular understanding model with the fine-tuning training data to obtain the molecular model.
The molecular understanding model can be obtained by training according to any of the above embodiments.
The fine-tuning training data may be selected according to the molecular processing task.
In some embodiments, the molecular processing tasks may include: molecular prediction tasks, and/or molecular generation tasks. Further, the molecular prediction tasks may include: molecular classification tasks, and/or molecular regression tasks. Further, the molecular generation task may include: generating new molecules, generating new molecules with specific properties, generating optimized molecules.
For molecular prediction tasks:
the corresponding fine-tuning training data may be referred to as first fine-tuning training data, the first fine-tuning training data comprising: a first input sample and a first output sample. The first input sample is a molecular representation sequence sample, and the first output sample is label data corresponding to the molecular representation sequence sample: if the prediction is a classification, the label data is a classification label; and/or, if the prediction is a regression, the label data is a regression label.
The classification labels can be manually annotated according to actual needs, for example, labeled DNA sequences or proteins; labeled proteins include labeled seed storage proteins, labeled isozymes, labeled allelic enzymes, and the like, where isozymes are different molecular forms of enzymes encoded at multiple gene loci and allelic enzymes are different molecular forms of enzymes encoded by different alleles of the same gene locus. The regression labels can likewise be manually annotated according to actual needs.
For molecular generation tasks:
the corresponding fine-tuning training data may be referred to as second fine-tuning training data, which also differs among the three molecular generation tasks described above.
Molecular generation tasks corresponding to the generation of new molecules:
the second fine-tuning training data comprises: a plurality of sets of sample pairs, each set of sample pairs comprising: a second input sample and a second output sample, the second input sample comprising a fixed identifier, the second output sample being a molecular representation sequence sample, and the molecular representation sequence samples in each set of sample pairs being molecular representation sequence samples of similar molecules meeting a preset similarity condition. The similarity condition may be set according to practical requirements, for example, molecules having the same atomic composition and a similar molecular structure may be taken as similar molecules. The determination conditions for atomic composition similarity and/or molecular structure similarity can likewise be set according to actual requirements.
A molecular generation task corresponding to the generation of new molecules with specific properties:
the second fine-tuning training data comprises: a plurality of sets of sample pairs, each set of sample pairs comprising: a second input sample and a second output sample, the second input sample comprising a fixed identifier and an attribute sample, the second output sample being a molecular representation sequence sample, and the molecular representation sequence samples in each set of sample pairs being molecular representation sequence samples of similar molecules that have the property and meet a preset similarity condition. The similarity condition can be set as in the new-molecule generation task described above. Unlike the new-molecule generation task, the task here additionally requires that the new molecule have a specific property; therefore, the input sample also includes an attribute sample, and the molecule corresponding to the selected output sample also needs to have the property corresponding to the attribute sample. A property refers to a biological, physical or chemical property of the molecule, such as toxicity or activity. In practical applications, attribute values of the respective properties may be preconfigured; one property is then selected as the attribute sample, a molecular representation sequence of a similar molecule having the selected property is taken as the second output sample, a vector corresponding to the fixed identifier containing the attribute sample information is taken as the input vector of the molecular understanding model, and the input vector and the second output sample are used for training to obtain the molecular model for the corresponding task. The vector corresponding to the fixed identifier containing the attribute sample information can be obtained in the same way as in the application phase described later.
Corresponding to a molecular generation task of generating optimized molecules:
the second fine-tuning training data comprises: a plurality of sets of sample pairs, each set of sample pairs comprising a second input sample and a second output sample; the second input sample comprises a fixed identifier and an input molecular representation sequence sample, the second output sample is an output molecular representation sequence sample, and the molecule corresponding to the output molecular representation sequence sample is an optimized molecule of the molecule corresponding to the input molecular representation sequence sample. The optimized molecule can be selected as required, for example, a molecule having a certain property may be used as the optimized molecule of the molecule to be optimized.
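For concreteness, the three kinds of second fine-tuning sample pairs described above could be organized as sketched below; the [CLS] identifier, the property name and the placeholder strings are hypothetical and only illustrate the data layout, not data from the disclosure.

```python
# Hypothetical layouts of the three kinds of second fine-tuning sample pairs;
# the [CLS] identifier, the property name and the placeholder strings are
# illustrative, not data from the disclosure.
new_molecule_pair = (
    ["[CLS]"],                               # input: fixed identifier only
    "<SMILES of a similar molecule>",        # output: molecular representation sequence
)

property_molecule_pair = (
    (["[CLS]"], {"activity": 1.0}),          # input: fixed identifier + attribute sample
    "<SMILES of a similar molecule having the property>",
)

optimized_molecule_pair = (
    (["[CLS]"], "<SMILES of the molecule to be optimized>"),
    "<SMILES of the optimized molecule>",
)
```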
In this embodiment, the molecular understanding model is fine-tuned to obtain a molecular model, so that the molecular model is applicable to various downstream molecular tasks, which reduces the training workload and improves training efficiency.
The above embodiments illustrate the training process of a molecular model, based on which the molecules can be processed during the application phase to accomplish various molecular processing tasks.
Fig. 7 is a schematic diagram of a seventh embodiment of the present disclosure, which provides a molecular processing method based on a molecular model, where the molecular model includes a molecular understanding model and an output network, the molecular understanding model being obtained using two different molecular representation sequence samples of the same molecule. The processing method includes:
701. Processing the molecular application input with the molecular understanding model to obtain a hidden layer output, wherein the molecular application input comprises a fixed identifier when the output network is a molecular generation network.
702. Processing the hidden layer output with the output network to obtain a molecular application output.
The output network differs depending on the molecular processing task.
For example, the output network is a molecular prediction network corresponding to the molecular prediction task, and the output network is a molecular generation network corresponding to the molecular generation task.
Further, the molecular prediction network and/or the molecular generation network may also be different depending on the particular molecular prediction task and/or the particular molecular generation task.
In addition, the molecular application input and the molecular application output also differ depending on the molecular processing task.
For molecular prediction tasks:
referring to fig. 8, the molecular application input is: the molecular representation sequence to be predicted, exemplified by a SMILES sequence in fig. 8. The molecular application output is: a predicted value. The molecular representation sequence to be predicted can be a single molecular representation sequence or a concatenation of multiple molecular representation sequences.
After the SMILES sequence is input into the molecular understanding model 801, a predicted value corresponding to the SMILES sequence is output through the molecular prediction network 802 serving as the output network; the predicted value may be a classification value and/or a regression value.
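As an illustration only, a molecular prediction network could be as simple as a linear head over a pooled hidden state, as sketched below; pooling the first position and using a single linear layer are assumptions, since the disclosure does not fix the structure of the prediction network.

```python
# A hedged sketch of a molecular prediction network; pooling the first hidden
# state and a single linear layer are assumptions, not the disclosure's design.
import torch
import torch.nn as nn

class MolecularPredictionHead(nn.Module):
    def __init__(self, d_model=256, num_outputs=2):
        super().__init__()
        # num_outputs: number of classes for classification, or 1 for regression.
        self.proj = nn.Linear(d_model, num_outputs)

    def forward(self, hidden_states):
        pooled = hidden_states[:, 0]   # e.g. hidden state at the first position
        return self.proj(pooled)       # classification logits or regression value
```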
For molecular generation tasks:
the output network is a molecular generation network; the molecular application input includes: a fixed identifier; and the molecular application output includes: a molecular representation sequence.
Molecular generation tasks corresponding to the generation of new molecules:
referring to the left-hand diagram of fig. 9, the molecular application input is: a fixed identifier; the molecular application output is: the molecular representation sequence of a new molecule, exemplified by a SMILES sequence in fig. 9. The fixed identifier may be, for example, [CLS], or it may be a start identifier or the like; in addition, there may be one or more fixed identifiers, which may include, for example, a start identifier and an end identifier.
After the fixed identifier is input into the molecular understanding model 901, a SMILES sequence of a new molecule is output through the molecular generation network 902 serving as the output network.
A molecular generation task corresponding to the generation of new molecules with specific properties:
referring to the middle diagram of fig. 9, the molecular application input is: a fixed identifier and specific attribute information; the molecular application output is: a molecular representation sequence of a new molecule having the specific property.
After the fixed identifier and the specific attribute information are input into the molecular understanding model 901, the embedding layer may convert them into a vector corresponding to the fixed identifier containing the specific attribute information, and a SMILES sequence of a molecule having the specific property is then output through the molecular generation network 902 serving as the output network.
In fig. 9, the vector corresponding to the fixed identifier and the vector corresponding to the fixed identifier containing the specific attribute information are shown with different fill patterns. The vector corresponding to the fixed identifier containing the specific attribute information may be obtained by multiplying the value corresponding to the fixed identifier by the attribute value corresponding to the specific attribute information and converting the product into a vector with the embedding layer; alternatively, the embedding layer may include a character embedding layer and an attribute embedding layer, where the character embedding layer converts the fixed identifier into a fixed identifier vector, the attribute embedding layer converts the attribute value of the specific attribute information into an attribute vector, and the fixed identifier vector and the attribute vector are then added to obtain the vector corresponding to the fixed identifier containing the specific attribute information.
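The two alternatives can be sketched as follows; the dimensions, the [CLS] identifier id and the use of linear layers to embed numeric attribute values are illustrative assumptions.

```python
# Hedged sketches of the two alternatives; dimensions, the [CLS] id and the
# linear embeddings of numeric attribute values are illustrative assumptions.
import torch
import torch.nn as nn

d_model, vocab_size, cls_id = 256, 128, 1

# Alternative 1: multiply the identifier value by the attribute value, then embed.
value_embedding = nn.Linear(1, d_model)
def conditioned_vector_v1(attr_value: float) -> torch.Tensor:
    product = torch.tensor([[float(cls_id) * attr_value]])
    return value_embedding(product)

# Alternative 2: character embedding of [CLS] plus an attribute embedding, added.
char_embedding = nn.Embedding(vocab_size, d_model)
attr_embedding = nn.Linear(1, d_model)
def conditioned_vector_v2(attr_value: float) -> torch.Tensor:
    cls_vec = char_embedding(torch.tensor([cls_id]))         # (1, d_model)
    attr_vec = attr_embedding(torch.tensor([[attr_value]]))  # (1, d_model)
    return cls_vec + attr_vec
```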
Corresponding to a molecular generation task of generating optimized molecules:
referring to the right-hand diagram of fig. 9, the molecular application input is: a fixed identifier and a molecular representation sequence to be optimized; the molecular application output is: an optimized molecular representation sequence.
After the fixed identifier and the SMILES sequence to be optimized are input into the molecular understanding model 901, the optimized SMILES sequence is output through the molecular generation network 902 serving as the output network.
In some embodiments, processing the hidden layer output with the output network to obtain a molecular application output includes: searching for the molecular application output corresponding to the hidden layer output with the output network, wherein the search includes: a random sampling search, or a beam search.
Further, as shown in the left and middle diagrams of fig. 9, a random sampling search may be employed when generating new molecules or new molecules with specific properties, so that a wider range of new molecules can be obtained; as shown in the right-hand diagram of fig. 9, a beam search may be employed when generating optimized molecules, so that more targeted and accurate optimized molecules can be obtained.
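The two search strategies can be sketched as follows; step_probs, which returns the probability distribution of the next sequence unit given a prefix, stands in for the molecular generation network and is an assumed interface.

```python
# Hedged sketches of the two search strategies; step_probs(prefix), returning
# the next-unit probability distribution, stands in for the generation network.
import torch

def random_sampling_search(step_probs, bos_id, eos_id, max_len=128):
    # Sampling each unit from its distribution explores a wider range of molecules.
    seq = [bos_id]
    for _ in range(max_len):
        nxt = int(torch.multinomial(step_probs(seq), 1))
        if nxt == eos_id:
            break
        seq.append(nxt)
    return seq[1:]

def beam_search(step_probs, bos_id, eos_id, beam_size=4, max_len=128):
    # Keeping only the top-scoring partial sequences yields more targeted outputs.
    beams = [([bos_id], 0.0)]
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos_id:
                candidates.append((seq, score))
                continue
            logp = torch.log(step_probs(seq))
            top_vals, top_ids = logp.topk(beam_size)
            for v, i in zip(top_vals.tolist(), top_ids.tolist()):
                candidates.append((seq + [i], score + v))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        if all(seq[-1] == eos_id for seq, _ in beams):
            break
    best = beams[0][0]
    return [u for u in best[1:] if u != eos_id]
```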
In this embodiment, the molecular model is obtained by fine-tuning the molecular understanding model, and by using the molecular model, various downstream molecular tasks can be performed; the complexity of molecular generation can be reduced by performing the molecular generation task based on the fixed identifier. In addition, different molecular tasks, such as a molecular prediction task and/or a molecular generation task, can be accomplished by varying the output network and the molecular application input.
Fig. 10 is a schematic diagram according to a tenth embodiment of the present disclosure. The present embodiment provides a training device for molecular understanding model, as shown in fig. 10, the device 1000 includes: an acquisition module 1001, a processing module 1002, and an update module 1003.
The obtaining module 1001 is configured to obtain pre-training data, where the pre-training data includes: a first molecular representation sequence sample and a second molecular representation sequence sample, the first molecular representation sequence sample and the second molecular representation sequence sample being two different molecular representation sequence samples of the same molecule; the processing module 1002 is configured to process the first molecular representation sequence sample using the molecular understanding model to obtain a pre-training output; the updating module 1003 is configured to calculate a pre-training loss function according to the pre-training output and the second molecular representation sequence sample, and update parameters of the molecular understanding model according to the pre-training loss function.
In some embodiments, the molecular understanding model includes an encoder and a decoder; the encoder includes a first self-attention layer employing a bi-directional self-attention mechanism; and/or the decoder comprises a second self-attention layer, the second self-attention layer employing a unidirectional self-attention mechanism.
In some embodiments, the encoder further comprises a first shared network portion, the decoder further comprises a second shared network portion, the first and second shared network portions having the same network structure and network parameters.
In some embodiments, the processing module 1002 is specifically configured to: performing bi-directional self-attention processing on the first molecular representation sequence samples with the first self-attention layer of the encoder to obtain bi-directional self-attention processing results; processing the bi-directional self-attention processing result with the first shared network portion of the encoder to obtain a coded output; performing unidirectional self-attention processing on the encoded output and the generated output using the second self-attention layer of the decoder to obtain unidirectional self-attention processing results; and processing the unidirectional self-attention processing result by adopting the second shared network part of the decoder to obtain the pre-training output.
In some embodiments, the first molecular representation sequence sample is a SMILES sequence sample; and/or, the second molecular representation sequence sample is a SMILES sequence sample.
In this embodiment, by training the molecular understanding model with two different molecular representation sequence samples of the same molecule, the characteristics of the molecular representation sequence can be fully utilized, and the molecular understanding effect of the molecular understanding model can be improved.
Fig. 11 is a schematic diagram according to an eleventh embodiment of the present disclosure. The present embodiment provides a molecular processing device based on a molecular model, where the molecular model includes a molecular understanding model and an output network, the molecular understanding model being obtained using two different molecular representation sequence samples of the same molecule. The molecular processing device 1100 includes: a first processing module 1101 and a second processing module 1102.
The first processing module 1101 is configured to process a molecular application input by using the molecular understanding model to obtain a hidden layer output, where the molecular application input includes a fixed identifier when the output network is a molecular generation network; the second processing module 1102 is configured to process the hidden layer output by using the output network to obtain a molecular application output.
In some embodiments, when the output network is a molecular generation network, the molecular application output includes a molecular representation sequence, wherein if the molecular generation network is used to generate a new molecule, the molecular representation sequence is a molecular representation sequence of the new molecule; alternatively, if the molecular generation network is used to generate new molecules with specific properties, the molecular application inputs further include: information of the specific attribute; the molecular representation sequence is a molecular representation sequence of a new molecule having the specific property; alternatively, if the molecular generation network is used to generate optimized molecules, the molecular application inputs further include: the sequence of the molecular representation to be optimized; the molecular representation sequence is an optimized molecular representation sequence.
In some embodiments, the output network is a molecular prediction network; the molecular application inputs include: the molecule to be predicted represents the sequence; the molecular application output includes: and the molecule to be predicted represents a predicted value corresponding to the sequence.
In this embodiment, the molecular model is obtained by fine-tuning the molecular understanding model and can be applied to various downstream molecular tasks, and the complexity of molecular generation can be reduced by performing the molecular generation task based on the fixed identifier.
It is to be understood that the same or corresponding content in different embodiments of the disclosure may be referred to each other, and that content not described in detail in an embodiment may be referred to related content of other embodiments.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 12 shows a schematic block diagram of an example electronic device 1200 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile apparatuses, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing apparatuses. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 12, the electronic device 1200 includes a computing unit 1201 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 1202 or a computer program loaded from a storage unit 1208 into a Random Access Memory (RAM) 1203. In the RAM 1203, various programs and data required for the operation of the electronic device 1200 may also be stored. The computing unit 1201, the ROM 1202, and the RAM 1203 are connected to each other via a bus 1204. An input/output (I/O) interface 1205 is also connected to the bus 1204.
Various components in the electronic device 1200 are connected to the I/O interface 1205, including: an input unit 1206 such as a keyboard, mouse, etc.; an output unit 1207 such as various types of displays, speakers, and the like; a storage unit 1208 such as a magnetic disk, an optical disk, or the like; and a communication unit 1209, such as a network card, modem, wireless communication transceiver, etc. The communication unit 1209 allows the electronic device 1200 to exchange information/data with other devices through a computer network, such as the internet, and/or various telecommunications networks.
The computing unit 1201 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 1201 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, Digital Signal Processors (DSPs), and any suitable processor, controller, microcontroller, etc. The computing unit 1201 performs the various methods and processes described above, such as the training method of the molecular understanding model or the molecular processing method. For example, in some embodiments, the training method of the molecular understanding model or the molecular processing method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 1208. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 1200 via the ROM 1202 and/or the communication unit 1209. When the computer program is loaded into the RAM 1203 and executed by the computing unit 1201, one or more steps of the training method of the molecular understanding model or the molecular processing method described above may be performed. Alternatively, in other embodiments, the computing unit 1201 may be configured to perform the training method of the molecular understanding model or the molecular processing method in any other suitable way (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems On Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs, which may be executed and/or interpreted on a programmable system including at least one programmable processor; the programmable processor may be a special purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or cloud host, which is a host product in a cloud computing service system and overcomes the defects of difficult management and weak service scalability found in traditional physical hosts and Virtual Private Server (VPS) services. The server may also be a server of a distributed system or a server combined with a blockchain.
It should be appreciated that steps may be reordered, added or deleted using the various forms of flows shown above. For example, the steps recited in the present disclosure may be performed in parallel, sequentially or in a different order, as long as the desired results of the technical solutions of the present disclosure can be achieved; no limitation is imposed herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (10)

1. A method of training a molecular understanding model, comprising:
obtaining pre-training data, the pre-training data comprising: a first molecular representation sequence sample and a second molecular representation sequence sample, the first molecular representation sequence sample and the second molecular representation sequence sample being two different molecular representation sequence samples of the same molecule;
processing the first molecular representation sequence sample by adopting the molecular understanding model to obtain a pre-training output;
calculating a pre-training loss function according to the pre-training output and the second molecular representation sequence sample, and updating parameters of the molecular understanding model according to the pre-training loss function;
wherein the molecular understanding model includes an encoder and a decoder; the encoder includes a first self-attention layer employing a bi-directional self-attention mechanism; and/or the decoder comprises a second self-attention layer, the second self-attention layer adopting a unidirectional self-attention mechanism;
wherein the encoder further comprises a first shared network portion, the decoder further comprises a second shared network portion, the first and second shared network portions having the same network structure and network parameters;
wherein said processing said first molecular representation sequence sample using said molecular understanding model to obtain a pre-training output comprises:
performing bi-directional self-attention processing on the first molecular representation sequence samples with the first self-attention layer of the encoder to obtain bi-directional self-attention processing results;
processing the bi-directional self-attention processing result with the first shared network portion of the encoder to obtain an encoded output;
performing unidirectional self-attention processing on the encoded output and the generated output using the second self-attention layer of the decoder to obtain unidirectional self-attention processing results;
processing the unidirectional self-attention processing result with the second shared network portion of the decoder to obtain the pre-training output;
wherein the first molecular representation sequence sample is a SMILES sequence sample; and/or, the second molecular representation sequence sample is a SMILES sequence sample;
wherein molecular understanding refers to converting a sequence of molecular representations into a molecular understanding representation, which is a representation that a machine is capable of handling.
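By way of a non-limiting illustration only (not part of the claims), the pre-training step of claim 1 can be sketched roughly as follows in PyTorch. Every name below (MolecularUnderstandingModel, encode, decode, pretrain_step, the vocabulary and layer sizes) is an assumption introduced for this sketch rather than something disclosed by the patent; SMILES tokenization and padding are omitted.

import torch
import torch.nn as nn

class MolecularUnderstandingModel(nn.Module):
    # Encoder-decoder sketch: bi-directional encoder attention, causal decoder
    # attention, and one shared feed-forward portion used by both sides.
    def __init__(self, vocab_size=128, d_model=256, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # first self-attention layer (encoder): bi-directional
        self.enc_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # second self-attention layer (decoder): uni-directional via a causal mask
        self.dec_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # shared network portion: the same module (same structure and parameters)
        # serves as both the "first" and the "second" shared portion
        self.shared = nn.Sequential(
            nn.Linear(d_model, d_model), nn.GELU(), nn.Linear(d_model, d_model))
        self.lm_head = nn.Linear(d_model, vocab_size)

    def encode(self, src_ids):
        x = self.embed(src_ids)
        x, _ = self.enc_attn(x, x, x)        # bi-directional self-attention processing
        return self.shared(x)                # encoded output

    def decode(self, tgt_ids, memory):
        y = self.embed(tgt_ids)
        h = torch.cat([memory, y], dim=1)    # attend over encoded output + generated output
        n = h.size(1)                        # single joint causal pass, kept simple here
        causal = torch.triu(torch.ones(n, n, dtype=torch.bool, device=h.device), 1)
        h, _ = self.dec_attn(h, h, h, attn_mask=causal)   # uni-directional self-attention
        h = self.shared(h)                   # second shared portion (tied to the encoder's)
        return self.lm_head(h[:, memory.size(1):])        # logits at the target positions

def pretrain_step(model, optimizer, smiles_a_ids, smiles_b_ids):
    # read the first SMILES sample of a molecule, reconstruct the second one
    memory = model.encode(smiles_a_ids)
    logits = model.decode(smiles_b_ids[:, :-1], memory)   # teacher forcing
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), smiles_b_ids[:, 1:].reshape(-1))
    optimizer.zero_grad()
    loss.backward()                          # back-propagate the pre-training loss
    optimizer.step()                         # update the model parameters
    return loss.item()

# toy usage with random ids standing in for tokenized SMILES pairs
model = MolecularUnderstandingModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
smiles_a = torch.randint(0, 128, (8, 40))
smiles_b = torch.randint(0, 128, (8, 40))
print(pretrain_step(model, optimizer, smiles_a, smiles_b))
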
2. A molecular processing method based on a molecular model, the molecular model including a molecular understanding model and an output network, the molecular understanding model being obtained using two different molecular representation sequence samples of the same molecule, the molecular processing method comprising:
processing the molecular application input by adopting the molecular understanding model to obtain hidden layer output, wherein the molecular application input comprises a fixed identifier when the output network is a molecular generation network;
processing the hidden layer output by adopting the output network to obtain molecular application output;
wherein the molecular understanding model is trained using the method of claim 1.
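As a purely illustrative sketch of claim 2 (again not part of the claims), the downstream molecular model could wrap the pre-trained understanding model and an output network as below; MolecularModel, start_id, and the encode() method are assumptions carried over from the sketch following claim 1.

import torch
import torch.nn as nn

class MolecularModel(nn.Module):
    def __init__(self, understanding_model, output_network, start_id=1):
        super().__init__()
        self.understanding = understanding_model   # pre-trained molecular understanding model
        self.output_network = output_network       # molecular generation or prediction network
        self.start_id = start_id                   # fixed identifier used for generation

    def forward(self, input_ids=None):
        if input_ids is None:
            # generation case: the molecular application input is the fixed identifier
            input_ids = torch.tensor([[self.start_id]])
        hidden = self.understanding.encode(input_ids)  # hidden-layer output
        return self.output_network(hidden)             # molecular application output
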
3. The method of claim 2, wherein, when the output network is a molecular generation network, the molecular application output comprises a sequence of molecular representations, wherein,
if the molecular generation network is used for generating new molecules, the molecular representation sequence is the molecular representation sequence of the new molecules; or,
if the molecular generation network is used to generate new molecules having a specific attribute, the molecular application input further includes: information of the specific attribute; the molecular representation sequence is a molecular representation sequence of a new molecule having the specific attribute; or,
if the molecular generation network is used to generate optimized molecules, the molecular application input further includes: the molecular representation sequence to be optimized; the molecular representation sequence is an optimized molecular representation sequence;
wherein the specific attribute refers to a biological, physical, or chemical property of the molecule.
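The three generation modes of claim 3 differ only in what accompanies the fixed identifier. A hypothetical helper, with the token ids and concatenation order assumed purely for illustration:

def build_generation_input(start_id, attribute_ids=None, molecule_to_optimize_ids=None):
    ids = [start_id]                           # fixed identifier, always present
    if attribute_ids is not None:              # generate a new molecule having a specific attribute
        ids += list(attribute_ids)
    if molecule_to_optimize_ids is not None:   # optimize an existing molecule
        ids += list(molecule_to_optimize_ids)
    return ids                                 # bare fixed identifier => plain new-molecule generation
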
4. The method of claim 2, wherein,
when the output network is a molecular prediction network, the molecular application input includes: a molecular representation sequence to be predicted; and the molecular application output includes: a predicted value corresponding to the molecular representation sequence to be predicted.
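For claim 4, a minimal sketch of a molecular prediction network: pool the hidden-layer output over the sequence and regress a single property value. The mean-pooling choice and layer sizes are assumptions for illustration only.

import torch.nn as nn

class MolecularPredictionNetwork(nn.Module):
    def __init__(self, d_model=256):
        super().__init__()
        self.head = nn.Linear(d_model, 1)

    def forward(self, hidden):                 # hidden: (batch, seq_len, d_model)
        pooled = hidden.mean(dim=1)            # average over sequence positions
        return self.head(pooled)               # predicted value for the molecule
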
5. A training device for a molecular understanding model, comprising:
the acquisition module is used for acquiring pre-training data, wherein the pre-training data comprises: a first molecular representation sequence sample and a second molecular representation sequence sample, the first molecular representation sequence sample and the second molecular representation sequence sample being two different molecular representation sequence samples of the same molecule;
the processing module is used for processing the first molecular representation sequence sample by adopting the molecular understanding model so as to obtain a pre-training output;
the updating module is used for calculating a pre-training loss function according to the pre-training output and the second molecular representation sequence sample and updating parameters of the molecular understanding model according to the pre-training loss function;
wherein the molecular understanding model includes an encoder and a decoder; the encoder includes a first self-attention layer employing a bi-directional self-attention mechanism; and/or the decoder comprises a second self-attention layer, the second self-attention layer adopting a unidirectional self-attention mechanism;
wherein the encoder further comprises a first shared network portion, the decoder further comprises a second shared network portion, the first and second shared network portions having the same network structure and network parameters;
The processing module is specifically configured to:
performing bi-directional self-attention processing on the first molecular representation sequence samples with the first self-attention layer of the encoder to obtain bi-directional self-attention processing results;
processing the bi-directional self-attention processing result with the first shared network portion of the encoder to obtain an encoded output;
performing unidirectional self-attention processing on the encoded output and the generated output using the second self-attention layer of the decoder to obtain unidirectional self-attention processing results;
processing the unidirectional self-attention processing result with the second shared network portion of the decoder to obtain the pre-training output;
wherein the first molecular representation sequence sample is a SMILES sequence sample; and/or, the second molecular representation sequence sample is a SMILES sequence sample;
wherein molecular understanding refers to converting a sequence of molecular representations into a molecular understanding representation, which is a representation that a machine is capable of handling.
6. A molecular processing device based on a molecular model, the molecular model including a molecular understanding model and an output network, the molecular understanding model being derived using two different molecular representation sequence samples of the same molecule, the molecular processing device comprising:
the first processing module is used for processing the molecular application input by adopting the molecular understanding model to obtain hidden layer output, and the molecular application input comprises a fixed identifier when the output network is a molecular generation network;
the second processing module is used for processing the hidden layer output by adopting the output network so as to obtain molecular application output;
wherein the molecular understanding model is trained using the apparatus of claim 5.
7. The apparatus of claim 6, wherein the molecular application output comprises a sequence of molecular representations when the output network is a molecular generation network,
if the molecular generation network is used for generating new molecules, the molecular representation sequence is the molecular representation sequence of the new molecules; or,
if the molecular generation network is used to generate new molecules having a specific attribute, the molecular application input further includes: information of the specific attribute; the molecular representation sequence is a molecular representation sequence of a new molecule having the specific attribute; or,
if the molecular generation network is used to generate optimized molecules, the molecular application input further includes: the molecular representation sequence to be optimized; the molecular representation sequence is an optimized molecular representation sequence;
wherein the specific attribute refers to a biological, physical, or chemical property of the molecule.
8. The apparatus of claim 6, wherein,
when the output network is a molecular prediction network, the molecular application input includes: a molecular representation sequence to be predicted; and the molecular application output includes: a predicted value corresponding to the molecular representation sequence to be predicted.
9. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the training method of claim 1 or the processing method of any one of claims 2-4.
10. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the training method of claim 1 or the processing method of any one of claims 2-4.
CN202110082654.3A 2021-01-21 2021-01-21 Training method, device, equipment and medium of molecular understanding model Active CN112786108B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110082654.3A CN112786108B (en) 2021-01-21 2021-01-21 Training method, device, equipment and medium of molecular understanding model

Publications (2)

Publication Number Publication Date
CN112786108A CN112786108A (en) 2021-05-11
CN112786108B true CN112786108B (en) 2023-10-24

Family

ID=75758044

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110082654.3A Active CN112786108B (en) 2021-01-21 2021-01-21 Training method, device, equipment and medium of molecular understanding model

Country Status (1)

Country Link
CN (1) CN112786108B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114694769B (en) * 2022-03-28 2024-09-10 北京百度网讯科技有限公司 Molecular representation method, training method and device for molecular representation model
CN114937478B (en) * 2022-05-18 2023-03-10 北京百度网讯科技有限公司 Method for training a model, method and apparatus for generating molecules
CN115565607B (en) * 2022-10-20 2024-02-23 抖音视界有限公司 Method, device, readable medium and electronic equipment for determining protein information
CN117153294B (en) * 2023-10-31 2024-01-26 烟台国工智能科技有限公司 Molecular generation method of single system

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11138392B2 (en) * 2018-07-26 2021-10-05 Google Llc Machine translation using neural network models
US11651860B2 (en) * 2019-05-15 2023-05-16 International Business Machines Corporation Drug efficacy prediction for treatment of genetic disease

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105308604A (en) * 2013-04-23 2016-02-03 菲利普莫里斯生产公司 Systems and methods for using mechanistic network models in systems toxicology
CN110598206A (en) * 2019-08-13 2019-12-20 平安国际智慧城市科技股份有限公司 Text semantic recognition method and device, computer equipment and storage medium
CN110534164A (en) * 2019-09-26 2019-12-03 广州费米子科技有限责任公司 Drug molecule generation method based on deep learning
CN110929869A (en) * 2019-12-05 2020-03-27 同盾控股有限公司 Attention model training method, device, equipment and storage medium
CN111640471A (en) * 2020-05-27 2020-09-08 牛张明 Method and system for predicting activity of drug micromolecules based on two-way long-short memory model
CN111916067A (en) * 2020-07-27 2020-11-10 腾讯科技(深圳)有限公司 Training method and device of voice recognition model, electronic equipment and storage medium
CN111967224A (en) * 2020-08-18 2020-11-20 深圳市欢太科技有限公司 Method and device for processing dialog text, electronic equipment and storage medium
CN112016300A (en) * 2020-09-09 2020-12-01 平安科技(深圳)有限公司 Pre-training model processing method, pre-training model processing device, downstream task processing device and storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Distributed representations of words and phrases and their compositionality; Mikolov T et al.; Advances in Neural Information Processing Systems; full text *
Improved model and tuning method for natural language understanding in BERT-based task-oriented dialogue systems; Zhou Qi'an; Li Zhoujun; 中文信息学报 (Journal of Chinese Information Processing), No. 5; full text *
A survey of pre-training techniques for natural language processing; Li Zhoujun; Fan Yu; Wu Xianjie; 计算机科学 (Computer Science), No. 3; full text *

Also Published As

Publication number Publication date
CN112786108A (en) 2021-05-11

Similar Documents

Publication Publication Date Title
CN112786108B (en) Training method, device, equipment and medium of molecular understanding model
CN113705187B (en) Method and device for generating pre-training language model, electronic equipment and storage medium
CN113239705B (en) Pre-training method and device of semantic representation model, electronic equipment and storage medium
CN113408299B (en) Training method, device, equipment and storage medium of semantic representation model
CN112528655B (en) Keyword generation method, device, equipment and storage medium
CN114970522B (en) Pre-training method, device, equipment and storage medium of language model
CN115309877B (en) Dialogue generation method, dialogue model training method and device
CN112507706B (en) Training method and device for knowledge pre-training model and electronic equipment
CN113053367B (en) Speech recognition method, speech recognition model training method and device
CN113553412B (en) Question-answering processing method, question-answering processing device, electronic equipment and storage medium
CN113407698B (en) Method and device for training and recognizing intention of intention recognition model
CN113239157B (en) Method, device, equipment and storage medium for training conversation model
CN112559885A (en) Method and device for determining training model of map interest point and electronic equipment
CN114218940B (en) Text information processing and model training method, device, equipment and storage medium
CN114492426B (en) Sub-word segmentation method, model training method, device and electronic equipment
CN115358243A (en) Training method, device, equipment and storage medium for multi-round dialogue recognition model
CN115640520A (en) Method, device and storage medium for pre-training cross-language cross-modal model
CN112989797B (en) Model training and text expansion methods, devices, equipment and storage medium
CN113468857A (en) Method and device for training style conversion model, electronic equipment and storage medium
CN116257611B (en) Question-answering model training method, question-answering processing device and storage medium
CN112507705A (en) Position code generation method and device and electronic equipment
CN115130470B (en) Method, device, equipment and medium for generating text keywords
CN114758649B (en) Voice recognition method, device, equipment and medium
CN113204616B (en) Training of text extraction model and text extraction method and device
CN112507712B (en) Method and device for establishing slot identification model and slot identification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant