CN117275570A - Training method of protein model, and acquisition method and device of protein data


Info

Publication number
CN117275570A
Authority
CN
China
Prior art keywords
protein, sample, model, training, data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311195860.0A
Other languages
Chinese (zh)
Inventor
陈致远
薛洋
陈天浩
方晓敏
张肖男
何径舟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202311195860.0A
Publication of CN117275570A
Legal status: Pending


Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B15/00 ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G16B15/20 Protein or domain folding
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B5/00 ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Biophysics (AREA)
  • Bioethics (AREA)
  • Epidemiology (AREA)
  • Artificial Intelligence (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Chemical & Material Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Physiology (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract

The disclosure provides a training method for a protein model and a method and device for acquiring protein data, relating to the technical field of artificial intelligence, and in particular to the technical fields of biological computing, deep learning, and large models. The method includes: acquiring multi-modal protein sample data; associating protein sample data of any two modalities to obtain candidate training samples; and training a protein model based on a plurality of the candidate training samples. Because candidate training samples are obtained from protein sample data of any two modalities, the modality combinations covered by the candidate training samples span multiple modalities, so the protein model learns multi-modal protein sample data during training, supports input and output of multi-modal data, and is applicable to a variety of protein design scenarios.

Description

Training method of protein model, and acquisition method and device of protein data
Technical Field
The present disclosure relates to the field of artificial intelligence, in particular to the technical fields of biological computing, deep learning, and large models, and more particularly to a training method for a protein model, a method for acquiring protein data, an apparatus, an electronic device, a storage medium, and a computer program product.
Background
With the continuous development of artificial intelligence technology, large models offer advantages such as good generalization and are widely applied in fields such as information extraction, text credibility evaluation, and machine translation. However, protein-model training methods in the related art suffer from poor applicability of the trained protein model.
Disclosure of Invention
The disclosure provides a training method of a protein model, an acquisition method and device of protein data, electronic equipment, a storage medium and a computer program product.
According to a first aspect of the present disclosure, a method for training a protein model is provided, comprising: acquiring multi-modal protein sample data; correlating protein sample data of any two modes to obtain candidate training samples; training a protein model based on a plurality of the candidate training samples.
According to a second aspect of the present disclosure, there is provided a method for acquiring protein data, including: acquiring protein data of a first modality; inputting the protein data of the first mode into a protein model, and outputting the protein data of at least one second mode by the protein model, wherein the protein model is obtained by adopting the training method of the protein model proposed by the first aspect.
According to a third aspect of the present disclosure, there is provided a training device for a protein model, comprising: the acquisition module is used for acquiring multi-mode protein sample data; the association module is used for associating protein sample data of any two modes to obtain candidate training samples; and the training module is used for training the protein model based on a plurality of candidate training samples.
According to a fourth aspect of the present disclosure, there is provided an acquisition apparatus of protein data, including: the first acquisition module is used for acquiring protein data of a first modality; the second obtaining module is configured to input the protein data of the first modality into a protein model, and output the protein data of at least one second modality from the protein model, where the protein model is obtained by using the training method of the protein model set forth in the first aspect.
According to a fifth aspect of the present disclosure, there is provided an electronic device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the training method of the protein model set forth in the first aspect or the acquiring method of the protein data set forth in the second aspect.
According to a sixth aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing the computer to execute the training method of the protein model set forth in the first aspect or the acquisition method of the protein data set forth in the second aspect.
According to a seventh aspect of the present disclosure, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the training method of the protein model set forth in the first aspect or the acquisition method of the protein data set forth in the second aspect.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a flow chart of a training method of a protein model according to an embodiment of the disclosure;
FIG. 2 is a flow chart of a method of training a protein model according to another embodiment of the present disclosure;
FIG. 3 is a flow chart of a method of training a protein model according to another embodiment of the present disclosure;
FIG. 4 is a flow chart of a method of training a protein model according to another embodiment of the present disclosure;
FIG. 5 is a flow chart of a method of training a protein model according to another embodiment of the present disclosure;
FIG. 6 is a flow chart of a method of training a protein model according to another embodiment of the present disclosure;
FIG. 7 is a flow chart of a method of training a protein model according to another embodiment of the present disclosure;
FIG. 8 is a flow chart of a method of training a protein model according to another embodiment of the present disclosure;
FIG. 9 is a flow chart of a method of training a protein model according to another embodiment of the present disclosure;
FIG. 10 is a schematic diagram of a protein model according to an embodiment of the present disclosure;
FIG. 11 is a flow chart of a method for acquiring protein data according to an embodiment of the disclosure;
FIG. 12 is a schematic structural view of a training device for protein models according to an embodiment of the present disclosure;
FIG. 13 is a schematic diagram of a protein data acquisition device according to an embodiment of the disclosure;
fig. 14 is a schematic block diagram of an electronic device of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
AI (Artificial Intelligence) is a technical science that studies and develops theories, methods, techniques, and application systems for simulating, extending, and expanding human intelligence. AI technology currently offers high automation, high accuracy, and low cost, and is widely applied.
Biological computing (biocomputing) refers to computational models that use biomacromolecules as "data" and is mainly divided into three types: protein computing, RNA (Ribonucleic Acid) computing, and DNA (Deoxyribonucleic Acid) computing. The term also refers to a sub-field of computer science and computer engineering that builds computers using bioengineering and biological materials; like bioinformatics, it is an interdisciplinary science that uses computers to store and process biological data.
DL (Deep Learning) is a new research direction in the field of ML (Machine Learning). It learns the intrinsic rules and representation hierarchies of sample data so that a machine can analyze and learn like a person and can recognize data such as text, images, and sound; it is widely applied in speech and image recognition.
A large model is a machine learning model with a large parameter scale and high complexity. It requires large amounts of computing resources and memory for training and storage, and often requires distributed computing and dedicated hardware-acceleration techniques. Large models have stronger generalization and expressive capability.
Fig. 1 is a flow chart of a training method of a protein model according to an embodiment of the disclosure. As shown in fig. 1, the method includes:
S101, acquiring multi-mode protein sample data.
It should be noted that, the execution body of the training method of the protein model according to the embodiment of the present disclosure may be a hardware device having data information processing capability and/or software necessary for driving the hardware device to operate. Alternatively, the execution body may include a workstation, a server, a computer, a user terminal, and other intelligent devices. The user terminal comprises, but is not limited to, a mobile phone, a computer, intelligent voice interaction equipment, intelligent household appliances, vehicle-mounted terminals and the like.
The modalities of the protein sample data are not unduly limited and may include, for example, description, sequence, structure, and generation condition; accordingly, the protein sample data may include a protein sample description, a protein sample sequence, a protein sample structure, protein sample generation conditions, and the like. The protein sample description, the protein sample sequence, and the sample generation conditions are all in text format, while the protein sample structure is in picture format and may include, for example, a two-dimensional picture, a three-dimensional picture, or the like.
It should be noted that a protein sample description refers to a sample description of the properties, functions, and so on of a protein, for example a description of the chemical properties of the protein or of its binding activity toward different receptors. Examples include: "the binding of heme should be stable and able to withstand environmental changes", "in terms of general function, the protein needs to support oxygen transport from the lungs to various peripheral tissues", and "the protein can target the hemoglobin complex and perform heme binding and oxygen binding".
The protein sample sequence refers to the sample amino-acid sequence constituting the protein and may include, for example, the identifiers of the sample amino acids constituting the protein, where an identifier of a sample amino acid may include its name, number, and the like.
The protein sample structure refers to the sample spatial structure, sample three-dimensional structure, and so on of the protein, and may include the sample primary, secondary, tertiary, and quaternary structures of the protein, as well as the positions of the amino acids, the ways in which the amino acids are connected (such as peptide bonds, hydrogen bonds, and disulfide bonds), the spatial arrangement of the polypeptide chains, and the like.
The sample generation conditions of the protein refer to the sample conditions under which the protein data are to be generated, and may include, for example, the shape, class, and target conjugate of the protein, non-generation conditions, and the like.
In one embodiment, the protein sample description and the sample generation conditions are colloquial, natural-language text.
In one embodiment, the number of modalities of protein sample data is at least three.
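To make the data layout concrete, the following is a minimal sketch of how one record of multi-modal protein sample data might be represented in code; the use of a Python dataclass and the field names are illustrative assumptions, not details from the disclosure.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ProteinSample:
    """One multi-modal protein record; any field may be missing for a given protein."""
    description: Optional[str] = None           # natural-language description of properties/functions
    sequence: Optional[str] = None              # amino-acid sequence text
    structure: Optional[str] = None             # path to a structure picture / structure file
    generation_condition: Optional[str] = None  # text describing shape, class, target conjugate, ...
```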
S102, correlating protein sample data of any two modes to obtain candidate training samples.
It should be noted that each candidate training sample includes two pieces of protein sample data, and the modality combination corresponding to a candidate training sample can be obtained from the modalities of those two pieces of protein sample data; when there are multiple modalities, there are correspondingly multiple modality combinations among the candidate training samples.
It should be noted that, the correlation of protein sample data of any two modes may be implemented by any data correlation method in the related art, which is not limited herein.
In one embodiment, the correlating the protein sample data of any two modes may include establishing a mapping relationship, a mapping table, a correspondence relationship, and the like between the protein sample data of any two modes.
For example, if the protein sample data includes protein sample data of modality a, protein sample data of modality B, and protein sample data of modality C, the protein sample data of modality a and the protein sample data of modality B may be correlated to obtain a candidate training sample 1, the protein sample data of modality a and the protein sample data of modality C may be correlated to obtain a candidate training sample 2, and the protein sample data of modality B and the protein sample data of modality C may be correlated to obtain a candidate training sample 3.
In one embodiment, correlating the protein sample data of any two modalities may include correlating the protein sample data of the first modality with the protein sample data of the second modality using the protein sample data of the first modality as a tag, and/or correlating the protein sample data of the first modality with the protein sample data of the second modality using the protein sample data of the second modality as a tag.
For example, if the protein sample data includes protein sample descriptions, protein sample sequences, protein sample structures, and protein sample generation conditions, associating the protein sample data of any two modalities to obtain candidate training samples may include at least the following implementations (a minimal code sketch follows the list):
Mode 1: using the protein sample sequence as the label, associate the protein sample description with the protein sample sequence to obtain a first training sample.
Mode 2: using the protein sample description as the label, associate the protein sample sequence with the protein sample description to obtain a second training sample.
Mode 3: using the protein sample sequence as the label, associate the protein sample structure with the protein sample sequence to obtain a third training sample.
Mode 4: using the protein sample description as the label, associate the protein sample structure with the protein sample description to obtain a fourth training sample.
Mode 5: using the protein sample structure as the label, associate the protein sample description with the protein sample structure to obtain a fifth training sample.
Mode 6: using the protein sample structure as the label, associate the protein sample sequence with the protein sample structure to obtain a sixth training sample.
Mode 7: using the protein sample structure as the label, associate the protein sample generation condition with the protein sample structure to obtain a seventh training sample.
Mode 8: using the protein sample sequence as the label, associate the protein sample generation condition with the protein sample sequence to obtain an eighth training sample.
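The eight modes above all follow the same pattern: pick two modalities, use one piece of protein sample data as the input and the other as the label. The sketch below, which builds on the hypothetical ProteinSample record shown earlier, generates such pairs; for brevity it enumerates every ordered pair of modalities, whereas the disclosure lists the eight specific combinations above, so a filter on the allowed combinations would be applied in practice.

```python
from itertools import permutations

MODALITIES = ("description", "sequence", "structure", "generation_condition")

def build_candidate_samples(sample: "ProteinSample"):
    """Associate protein sample data of any two modalities; the second modality serves as the label."""
    candidates = []
    for input_mod, label_mod in permutations(MODALITIES, 2):
        input_data = getattr(sample, input_mod)
        label_data = getattr(sample, label_mod)
        if input_data is None or label_data is None:
            continue  # skip pairs for which this protein lacks data
        candidates.append({
            "input_modality": input_mod,
            "label_modality": label_mod,
            "input": input_data,
            "label": label_data,
        })
    return candidates
```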
S103, training the protein model based on the plurality of candidate training samples.
It should be noted that the protein model is not unduly limited; for example, the protein model includes a large model, such as a Transformer model, where the Transformer model is a neural network model based on the self-attention mechanism.
It should be noted that, training the protein model based on a plurality of candidate training samples may be implemented by any model training method in the related art, which is not limited herein.
In one embodiment, training the protein model based on the plurality of candidate training samples includes inputting non-tagged protein sample data in the candidate training samples to the protein model, outputting protein prediction results from the protein model, and training the protein model based on the tagged protein sample data and the protein prediction results in the candidate training samples.
For example, a loss function of the protein model can be obtained from the protein sample data serving as the label in the candidate training sample and the protein prediction result; model parameters of the protein model are updated based on the loss function, and the next candidate training sample is then used to continue training the protein model with the adjusted model parameters until a model-training end condition is reached.
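As an illustration of the loop just described, the following sketch assumes a PyTorch-style protein model and a task-appropriate loss function; tokenisation/encoding of the raw sample data and the exact training-end condition are omitted, and all names are assumptions rather than details from the disclosure.

```python
import torch

def train_protein_model(model, candidate_samples, loss_fn, num_epochs=1, lr=1e-4):
    """Train the protein model on candidate training samples, i.e. (input, label) pairs."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    for _ in range(num_epochs):                          # "training end condition" simplified to epoch count
        for sample in candidate_samples:
            prediction = model(sample["input"])          # non-label protein sample data goes in
            loss = loss_fn(prediction, sample["label"])  # compare against the label protein sample data
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()                             # update parameters, then continue with the next sample
    return model
```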
According to the training method of the protein model in this embodiment, multi-modal protein sample data are acquired, protein sample data of any two modalities are associated to obtain candidate training samples, and the protein model is trained based on a plurality of the candidate training samples. Because the candidate training samples are built from protein sample data of any two modalities, the modality combinations covered by the candidate training samples span multiple modalities; the protein model therefore learns multi-modal protein sample data during training, supports input and output of multi-modal data, and is applicable to a variety of protein design scenarios.
In the above embodiment, training the protein model based on the plurality of candidate training samples in step S103 can be further understood with reference to fig. 2, which is a flow chart of a training method of a protein model according to another embodiment of the disclosure. As shown in fig. 2, the method includes:
S201, multi-mode protein sample data are acquired.
S202, correlating protein sample data of any two modes to obtain candidate training samples.
For the relevant content of steps S201-S202, refer to the above embodiment, and are not repeated here.
S203, based on the modes corresponding to the two protein sample data in the candidate training samples, a mode combination corresponding to the candidate training samples is obtained.
For example, continue with candidate training samples 1 to 3 from the above embodiment.
The modalities corresponding to the two protein sample data in the candidate training sample 1 include the modality A, B, and the modality A, B can be used as the modality combination corresponding to the candidate training sample 1.
The modalities corresponding to the two protein sample data in the candidate training sample 2 include the modality A, C, and the modality A, C may be used as the modality combination corresponding to the candidate training sample 2.
The modalities corresponding to the two protein sample data in the candidate training sample 3 include the modality B, C, and the modality B, C may be used as the modality combination corresponding to the candidate training sample 3.
In one embodiment, obtaining the modality combination corresponding to a candidate training sample based on the modalities of the two pieces of protein sample data it contains includes splicing, in order, the modality of the non-label protein sample data in the candidate training sample with the modality of the label protein sample data in the candidate training sample to obtain the modality combination corresponding to the candidate training sample.
For example, continue with the first to eighth training samples from the above embodiments.
The modality of the non-label protein sample data in the first training sample is description and the modality of the label protein sample data is sequence, so (description, sequence) can be taken as the modality combination corresponding to the first training sample.
The modality of the non-label protein sample data in the second training sample is sequence and the modality of the label protein sample data is description, so (sequence, description) can be taken as the modality combination corresponding to the second training sample.
The modality of the non-label protein sample data in the third training sample is structure and the modality of the label protein sample data is sequence, so (structure, sequence) can be taken as the modality combination corresponding to the third training sample.
The acquiring process of the mode combinations corresponding to the fourth to eighth training samples may refer to the relevant content of the first to third training samples, which is not described herein again.
S204, dividing the candidate training samples of the same modal combination into the same candidate training sample set.
For example, continuing to take the first to eighth training samples in the above embodiment as an example, the mode combinations corresponding to the first to eighth training samples are different, the first training samples may be divided into the first training sample set, the second training samples may be divided into the second training sample set, the third training samples may be divided into the third training sample set, the fourth training samples may be divided into the fourth training sample set, the fifth training samples may be divided into the fifth training sample set, the sixth training samples may be divided into the sixth training sample set, the seventh training samples may be divided into the seventh training sample set, and the eighth training samples may be divided into the eighth training sample set.
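A minimal sketch of this grouping step follows, reusing the candidate-sample dictionaries from the earlier sketch; the tuple key used as the "modality combination" is an illustrative choice.

```python
from collections import defaultdict

def group_by_modality_combination(candidate_samples):
    """Divide candidate training samples with the same modality combination into the same set."""
    sample_sets = defaultdict(list)
    for sample in candidate_samples:
        # splice the modality of the non-label data with the modality of the label data, in order
        combination = (sample["input_modality"], sample["label_modality"])
        sample_sets[combination].append(sample)
    return dict(sample_sets)
```

Each resulting set could then be handed to the training loop sketched earlier, one set after another or in parallel, as discussed below.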
S205, training the protein model based on the plurality of candidate training sample sets.
It should be noted that, training the protein model based on a plurality of candidate training sample sets may be implemented by any model training method in the related art, which is not limited herein.
It should be noted that the order in which the candidate training sample sets are used to train the protein model is not unduly limited; for example, the training processes over the multiple candidate training sample sets may run in parallel, or they may run in series.
In one embodiment, training the protein model based on the plurality of candidate training sample sets may include sorting the plurality of candidate training sample sets and, starting from the first candidate training sample set in that order, training the protein model with the currently traversed candidate training sample set until the last candidate training sample set has been traversed. In this way, the protein model can be trained on the plurality of candidate training sample sets in sequence.
According to the training method of the protein model in this embodiment, the modality combination corresponding to each candidate training sample is obtained from the modalities of the two pieces of protein sample data it contains, candidate training samples with the same modality combination are divided into the same candidate training sample set, and the protein model is trained based on the plurality of candidate training sample sets. The modality combinations of the candidate training samples are thus taken into account when generating the plurality of candidate training sample sets used to train the protein model.
In the above embodiment, regarding the correlation of the protein sample data of any two modalities in step S102 to obtain the candidate training samples, and regarding the training of the protein model based on the plurality of candidate training samples in step S103, it can be further understood with reference to fig. 3, and fig. 3 is a flow chart of a training method of the protein model according to another embodiment of the disclosure, as shown in fig. 3, the method includes:
S301, multi-mode protein sample data are acquired.
S302, using the protein sample sequence as a label, and correlating the protein sample description with the protein sample sequence to obtain a first training sample.
For the relevant content of steps S301 to S302, refer to the above embodiment, and are not repeated here.
S303, inputting the protein sample description in the first training sample into a large model, and outputting a first protein prediction sequence by the large model.
S304, training the large model based on the protein sample sequence in the first training sample and the first protein prediction sequence.
In embodiments of the present disclosure, the protein model comprises a large model.
It should be noted that, the training of the large model based on the protein sample sequence and the first protein prediction sequence in the first training sample may be implemented by using any training method of the large model in the related art, which is not limited herein.
In one embodiment, training the large model based on the protein sample sequence and the first protein prediction sequence in the first training sample includes obtaining a loss function of the large model based on the protein sample sequence and the first protein prediction sequence in the first training sample, updating model parameters of the large model based on the loss function, and returning to use the next first training sample, and continuing training the large model with the model parameters adjusted until a model training end condition is reached.
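For this "description to sequence" case, the loss computation might look like the following sketch, which assumes a Hugging Face-style causal language model and tokenizer (the prompt-masking value of -100 follows that library's convention); these are assumptions for illustration, not details fixed by the disclosure.

```python
import torch

def description_to_sequence_loss(large_model, tokenizer, description: str, sample_sequence: str):
    """Cross-entropy loss of the large model predicting the sequence label from the description prompt."""
    prompt_ids = tokenizer(description, return_tensors="pt").input_ids
    label_ids = tokenizer(sample_sequence, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, label_ids], dim=1)
    # supervise only the sequence positions; prompt positions are ignored by the loss
    labels = torch.cat([torch.full_like(prompt_ids, -100), label_ids], dim=1)
    outputs = large_model(input_ids=input_ids, labels=labels)  # HF-style causal LM returns .loss
    return outputs.loss
```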
According to the training method of the protein model, a protein sample sequence is used as a label, protein sample description and the protein sample sequence are associated to obtain a first training sample, the protein sample description in the first training sample is input into a large model, a first protein prediction sequence is output by the large model, and the large model is trained based on the protein sample sequence and the first protein prediction sequence in the first training sample. Therefore, the protein model comprises a large model, and the large model can be trained based on the first training sample, so that the large model can learn protein sample data of description and sequence modes in the training process, the large model can realize the input of the description mode data and the output of the sequence mode data, namely, the large model is suitable for a protein design scene of description-sequence.
In the above embodiment, regarding the correlation of the protein sample data of any two modalities in step S102 to obtain the candidate training samples, and regarding the training of the protein model based on the plurality of candidate training samples in step S103, it can be further understood with reference to fig. 4, and fig. 4 is a flow chart of a training method of the protein model according to another embodiment of the disclosure, as shown in fig. 4, the method includes:
S401, multi-mode protein sample data are acquired.
S402, using the protein sample description as a label, and correlating the protein sample sequence with the protein sample description to obtain a second training sample.
For the relevant content of steps S401 to S402, refer to the above embodiment, and are not repeated here.
S403, inputting the protein sample sequence in the second training sample into the large model, and outputting the first protein prediction description by the large model.
S404, training the large model based on the protein sample description and the first protein prediction description in the second training sample.
In embodiments of the present disclosure, the protein model comprises a large model.
It should be noted that, the training of the large model based on the protein sample description and the first protein prediction description in the second training sample may be implemented by using any training method of the large model in the related art, which is not limited herein.
In one embodiment, training the large model based on the protein sample description and the first protein prediction description in the second training sample includes obtaining a loss function of the large model based on the protein sample description and the first protein prediction description in the second training sample, updating model parameters of the large model based on the loss function, and returning to use the next second training sample, and continuing training the large model with the model parameters adjusted until a model training end condition is reached.
According to the training method of the protein model, protein sample description is used as a label, a protein sample sequence and the protein sample description are associated to obtain a second training sample, the protein sample sequence in the second training sample is input into the large model, the large model outputs a first protein prediction description, and the large model is trained based on the protein sample description and the first protein prediction description in the second training sample. Therefore, the protein model comprises a large model, and the large model can be trained based on the second training sample, so that the large model can learn protein sample data of the description and sequence modes in the training process, and the large model can realize the input of sequence mode data and the output of description mode data, namely the large model is suitable for a protein design scene of 'sequence-description'.
In the above embodiment, regarding the correlation of the protein sample data of any two modalities in step S102 to obtain the candidate training samples, and regarding the training of the protein model based on the plurality of candidate training samples in step S103, it can be further understood with reference to fig. 5, and fig. 5 is a flow chart of a training method of the protein model according to another embodiment of the disclosure, as shown in fig. 5, the method includes:
S501, multi-mode protein sample data are acquired.
S502, using the protein sample sequence as a label, and correlating the protein sample structure with the protein sample sequence to obtain a third training sample.
For the relevant content of steps S501-S502, refer to the above embodiment, and are not repeated here.
S503, inputting the protein sample structure in the third training sample to the encoder, and outputting the first structure code by the encoder.
S504, inputting the first structural code into a large model, and outputting a second protein prediction sequence by the large model.
S505, training the encoder and the large model based on the protein sample sequence in the third training sample and the second protein prediction sequence.
In embodiments of the present disclosure, the protein model includes an encoder and a large model.
The protein sample structure in the third training sample is input to the encoder, the protein sample structure in the third training sample is encoded by the encoder, and the first structure code is output. The encoding of the protein sample structure may be implemented by any method for encoding structural data in the related art, and is not limited herein.
It should be noted that, based on the protein sample sequence and the second protein prediction sequence in the third training sample, the training of the encoder and the large model may be implemented by any model training method in the related art, which is not limited herein.
In one embodiment, training the encoder and the large model based on the protein sample sequence and the second protein prediction sequence in the third training sample includes: obtaining a loss function of the protein model based on the protein sample sequence and the second protein prediction sequence in the third training sample, updating model parameters of the encoder and the large model based on the loss function, and returning to use the next third training sample to continue training the encoder and the large model with the adjusted model parameters until a model-training end condition is reached.
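One plausible realisation of the encoder, sketched below, maps a rendering of the protein sample structure to a short sequence of embedding vectors (the "structure code") that can be prepended to the large model's input embeddings; the convolutional backbone, its dimensions, and the prefix-token scheme are all assumptions made for illustration.

```python
import torch
import torch.nn as nn

class StructureEncoder(nn.Module):
    """Illustrative encoder: maps a protein structure picture to 'structure code' embeddings."""
    def __init__(self, hidden_dim: int = 1024, num_tokens: int = 16):
        super().__init__()
        self.backbone = nn.Sequential(            # stand-in for a real image/graph backbone
            nn.Conv2d(3, 64, kernel_size=3, padding=1),
            nn.AdaptiveAvgPool2d(4),
            nn.Flatten(),
            nn.Linear(64 * 16, hidden_dim * num_tokens),
        )
        self.num_tokens, self.hidden_dim = num_tokens, hidden_dim

    def forward(self, structure_image: torch.Tensor) -> torch.Tensor:
        # structure_image: (batch, 3, H, W) rendering of the protein sample structure
        code = self.backbone(structure_image)
        return code.view(-1, self.num_tokens, self.hidden_dim)  # structure code as LM prefix tokens
```

Because the second protein prediction sequence is produced from this structure code, the gradient of the sequence loss flows into both the large model and the encoder, which is how the two can be trained together as described above.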
According to the training method of the protein model, a protein sample sequence is used as a label, a protein sample structure and the protein sample sequence are correlated to obtain a third training sample, the protein sample structure in the third training sample is input to an encoder, the encoder outputs a first structure code, the first structure code is input to a large model, the large model outputs a second protein prediction sequence, and the encoder and the large model are trained based on the protein sample sequence and the second protein prediction sequence in the third training sample. Therefore, the protein model comprises an encoder and a large model, and the encoder and the large model can be trained based on a third training sample, so that the protein model can learn protein sample data of a structure and a sequence mode in the training process, the protein model can realize the input of the structure mode data and the output of the sequence mode data, namely the protein model is suitable for a protein design scene of 'structure-sequence'.
In the above embodiment, regarding the correlation of the protein sample data of any two modalities in step S102 to obtain candidate training samples, and regarding the training of the protein model based on the plurality of candidate training samples in step S103, it can be further understood with reference to fig. 6, and fig. 6 is a flow chart of a training method of the protein model according to another embodiment of the disclosure, as shown in fig. 6, the method includes:
S601, multi-mode protein sample data is acquired.
S602, using the protein sample description as a label, and correlating the protein sample structure with the protein sample description to obtain a fourth training sample.
For the relevant content of steps S601-S602, reference may be made to the above embodiments, and details are not repeated here.
S603, inputting the protein sample structure in the fourth training sample to an encoder, and outputting a second structure code by the encoder.
S604, inputting the second structural code into the large model, and outputting a second protein prediction description by the large model.
S605, training the encoder and the large model based on the protein sample description and the second protein prediction description in the fourth training sample.
In embodiments of the present disclosure, the protein model includes an encoder and a large model.
The protein sample structure in the fourth training sample is input to the encoder, the protein sample structure in the fourth training sample is encoded by the encoder, and the second structure code is output. The encoding of the protein sample structure may be implemented by any method for encoding structural data in the related art, and is not limited herein.
It should be noted that, based on the protein sample description and the second protein prediction description in the fourth training sample, the training of the encoder and the large model may be implemented by any model training method in the related art, which is not limited herein.
In one embodiment, the encoder and the large model are trained based on the protein sample description and the second protein prediction description in the fourth training sample, including obtaining a loss function of the protein model based on the protein sample description and the second protein prediction description in the fourth training sample, updating model parameters of the encoder and the large model based on the loss function, and returning to use the next fourth training sample, and continuing to train the encoder and the large model for adjusting the model parameters until a model training end condition is reached.
According to the training method of the protein model, protein sample description is used as a label, the protein sample structure and the protein sample description are correlated to obtain a fourth training sample, the protein sample structure in the fourth training sample is input to an encoder, the encoder outputs a second structure code, the second structure code is input to a large model, the large model outputs a second protein prediction description, and the encoder and the large model are trained based on the protein sample description and the second protein prediction description in the fourth training sample. Therefore, the protein model comprises an encoder and a large model, and the encoder and the large model can be trained based on a fourth training sample, so that the protein model can learn protein sample data of a structure and a description mode in the training process, the protein model can realize the input of the structure mode data and the output of the description mode data, namely the protein model is suitable for a protein design scene of 'structure-description'.
In the above embodiment, regarding the correlation of the protein sample data of any two modalities in step S102 to obtain candidate training samples, and regarding the training of the protein model based on the plurality of candidate training samples in step S103, it can be further understood with reference to fig. 7, and fig. 7 is a flow chart of a training method of the protein model according to another embodiment of the disclosure, as shown in fig. 7, the method includes:
S701, multi-mode protein sample data is acquired.
S702, using the protein sample structure as a label, and correlating the protein sample description with the protein sample structure to obtain a fifth training sample.
For the relevant content of steps S701-S702, refer to the above embodiments, and are not repeated here.
S703, inputting the protein sample description in the fifth training sample into the large model, and outputting the first structural feature by the large model.
S704, inputting the first structural feature into a generator, and outputting the first protein prediction structure by the generator.
S705, training the large model and generator based on the protein sample structure and the first protein prediction structure in the fifth training sample.
In an embodiment of the present disclosure, the protein model includes a large model and a generator.
The first structural feature refers to a structural feature of a protein, and is in a text format. Inputting the protein sample description in the fifth training sample into a large model, predicting structural features of the protein by the large model based on the protein sample description in the fifth training sample, and outputting the first structural features.
The generation of the first protein prediction structure based on the first structural feature may be performed by any method of generating structural data in the related art, and is not limited thereto.
It should be noted that, based on the protein sample structure and the first protein prediction structure in the fifth training sample, the training of the large model and the generator may be implemented by any model training method in the related art, which is not limited herein.
In one embodiment, training the large model and the generator based on the protein sample structure and the first protein prediction structure in the fifth training sample includes obtaining a loss function of the protein model based on the protein sample structure and the first protein prediction structure in the fifth training sample, updating model parameters of the large model and the generator based on the loss function, and returning to use the next fifth training sample, and continuing training the large model and the generator for adjusting the model parameters until a model training end condition is reached.
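The generator is only loosely constrained by the disclosure (the first structural feature is said to be text, and the generator turns it into a protein prediction structure). The sketch below therefore assumes the structural feature has already been embedded into a fixed-size vector, and maps it to per-residue 3D coordinates; the fixed residue count and the coordinate representation are simplifying assumptions.

```python
import torch
import torch.nn as nn

class StructureGenerator(nn.Module):
    """Illustrative generator: maps structural-feature vectors to per-residue 3D coordinates."""
    def __init__(self, feature_dim: int = 1024, max_residues: int = 512):
        super().__init__()
        self.coord_head = nn.Sequential(
            nn.Linear(feature_dim, feature_dim),
            nn.ReLU(),
            nn.Linear(feature_dim, max_residues * 3),
        )
        self.max_residues = max_residues

    def forward(self, structural_features: torch.Tensor) -> torch.Tensor:
        # structural_features: (batch, feature_dim) pooled from the large model's output
        coords = self.coord_head(structural_features)
        return coords.view(-1, self.max_residues, 3)  # predicted protein structure as coordinates
```

A structure loss such as a mean-squared-error (or RMSD-style) distance between the predicted coordinates and the protein sample structure label could then drive the joint update of the large model and the generator.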
According to the training method of the protein model, a protein sample structure is used as a label, protein sample description and the protein sample structure are associated to obtain a fifth training sample, the protein sample description in the fifth training sample is input to a large model, first structural features are output by the large model, the first structural features are input to a generator, a first protein prediction structure is output by the generator, and the large model and the generator are trained based on the protein sample structure and the first protein prediction structure in the fifth training sample. Therefore, the protein model comprises a large model and a generator, and the large model and the generator can be trained based on the fifth training sample, so that the protein model can learn protein sample data of the description and structure modes in the training process, and the protein model can realize the input of description mode data and the output of structure mode data, namely the protein model is suitable for a protein design scene of 'description-structure'.
In the above embodiment, regarding the correlation of the protein sample data of any two modalities in step S102 to obtain the candidate training samples, and regarding the training of the protein model based on the plurality of candidate training samples in step S103, it can be further understood with reference to fig. 8, and fig. 8 is a flow chart of a training method of the protein model according to another embodiment of the disclosure, as shown in fig. 8, the method includes:
S801, multi-mode protein sample data is acquired.
S802, taking the protein sample structure as a label, and correlating the protein sample sequence with the protein sample structure to obtain a sixth training sample.
For the relevant content of step S801, refer to the above embodiment, and are not described herein.
S803, inputting the protein sample sequence in the sixth training sample into the large model, and outputting the second structural feature by the large model.
S804, inputting the second structural feature to a generator, and outputting the second protein predicted structure by the generator.
S805, training the large model and generator based on the protein sample structure and the second protein prediction structure in the sixth training sample.
In an embodiment of the present disclosure, the protein model includes a large model and a generator.
The second structural feature refers to a structural feature of the protein, and is in a text format. And inputting the protein sample sequence in the sixth training sample into the large model, predicting the structural characteristics of the protein based on the protein sample sequence in the sixth training sample by the large model, and outputting the second structural characteristics.
The generation of the second protein predicted structure based on the second structural feature may be performed by any method of generating structural data in the related art, and is not limited thereto.
It should be noted that, based on the protein sample structure and the second protein prediction structure in the sixth training sample, the training of the large model and the generator may be implemented by any model training method in the related art, which is not limited herein.
In one embodiment, training the large model and the generator based on the protein sample structure and the second protein prediction structure in the sixth training sample includes obtaining a loss function of the protein model based on the protein sample structure and the second protein prediction structure in the sixth training sample, updating model parameters of the large model and the generator based on the loss function, and returning to use the next sixth training sample, and continuing to train the large model and the generator for adjusting the model parameters until a model training end condition is reached.
According to the training method of the protein model, a protein sample structure is used as a label, a protein sample sequence and the protein sample structure are associated to obtain a sixth training sample, the protein sample sequence in the sixth training sample is input to a large model, the large model outputs second structural features, the second structural features are input to the generator, the generator outputs a second protein prediction structure, and the large model and the generator are trained based on the protein sample structure and the second protein prediction structure in the sixth training sample. Therefore, the protein model comprises a large model and a generator, and the large model and the generator can be trained based on the sixth training sample, so that the protein model can learn protein sample data of the structure and sequence modes in the training process, and the protein model can realize the input of sequence mode data and the output of structure mode data, namely the protein model is suitable for a protein design scene of 'sequence-structure'.
In the above embodiment, regarding the correlation of the protein sample data of any two modalities in step S102 to obtain candidate training samples, and regarding the training of the protein model based on the plurality of candidate training samples in step S103, it can be further understood with reference to fig. 9, and fig. 9 is a flow chart of a training method of the protein model according to another embodiment of the disclosure, as shown in fig. 9, the method includes:
S901, multi-mode protein sample data is acquired.
S902, using the protein sample structure as a label, correlating the protein sample generation condition with the protein sample structure to obtain a seventh training sample.
For the relevant content of steps S901-S902, refer to the above embodiments, and are not repeated here.
S903, the sample generation condition in the seventh training sample is input to the large model, the sample generation condition in the seventh training sample is encoded by the large model, and the condition encoding is output.
S904, inputting the condition code into a diffusion model, and outputting a third protein prediction structure by the diffusion model.
S905, training a large model and a diffusion model based on the protein sample structure and the third protein prediction structure in the seventh training sample.
In embodiments of the present disclosure, the protein model includes a large model and a diffusion model.
It should be noted that, the encoding of the sample generation condition may be implemented by any encoding method of text data in the related art, which is not limited herein, for example, may include an encoding method of embedding a vector.
The generation of the third protein prediction structure based on the conditional encoding may be performed using any diffusion model in the related art, and is not limited thereto.
It should be noted that, based on the protein sample structure and the third protein prediction structure in the seventh training sample, the training of the large model and the diffusion model may be implemented by any model training method in the related art, which is not limited herein.
In one embodiment, training the large model and the diffusion model based on the protein sample structure and the third protein prediction structure in the seventh training sample includes obtaining a loss function of the protein model based on the protein sample structure and the third protein prediction structure in the seventh training sample, updating model parameters of the large model and the diffusion model based on the loss function, and returning to use a next seventh training sample, and continuing to train the large model and the diffusion model for adjusting the model parameters until a model training end condition is reached.
In one embodiment, before the condition encoding is input into the diffusion model, the method further includes obtaining a plurality of protein sample structures and training the diffusion model based on the plurality of protein sample structures. In this way, the diffusion model can be trained in advance with a plurality of protein sample structures, so that the diffusion model learns a large number of protein sample structures during training and can generate general protein sample structures.
It should be noted that, based on a plurality of protein sample structures, the training of the diffusion model may be implemented by using any training method of the diffusion model in the related art, which is not limited herein.
In one embodiment, training the diffusion model based on the plurality of protein sample structures may include adding noise to a protein sample structure, inputting the noise-added protein sample structure to the diffusion model, outputting predicted noise from the diffusion model, and training the diffusion model based on the predicted noise, the protein sample structure, and the noise-added protein sample structure.
For example, training the diffusion model based on the prediction noise, the protein sample structure, and the noise-added protein sample structure may include obtaining a loss function of the diffusion model based on the prediction noise, the protein sample structure, and the noise-added protein sample structure, updating model parameters of the diffusion model based on the loss function, and returning to use the next protein sample structure, and continuing to train the diffusion model for adjusting the model parameters until a model training end condition is reached.
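In the spirit of standard denoising-diffusion training, a single pre-training step could look like the following sketch; the cosine noise schedule, the mean-squared-error objective, and the model's (noisy_structure, t) call signature are standard choices assumed here rather than specifics from the disclosure.

```python
import torch
import torch.nn.functional as F

def diffusion_training_step(diffusion_model, structure: torch.Tensor, num_steps: int = 1000):
    """Add noise to a protein sample structure, predict the noise, and return the training loss."""
    t = torch.randint(0, num_steps, (structure.shape[0],), device=structure.device)
    alpha_bar = torch.cos(0.5 * torch.pi * t.float() / num_steps) ** 2   # simple cosine schedule
    alpha_bar = alpha_bar.view(-1, *([1] * (structure.dim() - 1)))
    noise = torch.randn_like(structure)
    noisy_structure = alpha_bar.sqrt() * structure + (1 - alpha_bar).sqrt() * noise  # noise-added structure
    predicted_noise = diffusion_model(noisy_structure, t)   # hypothetical model predicts the added noise
    return F.mse_loss(predicted_noise, noise)               # train the prediction to match the true noise
```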
According to the training method of the protein model, a protein sample structure is used as a label, a sample generation condition of a protein and the protein sample structure are related to obtain a seventh training sample, the sample generation condition in the seventh training sample is input into a large model, the sample generation condition in the seventh training sample is encoded by the large model, condition encoding is output, the condition encoding is input into a diffusion model, a third protein prediction structure is output by the diffusion model, and the large model and the diffusion model are trained based on the protein sample structure and the third protein prediction structure in the seventh training sample. Therefore, the protein model comprises a large model and a diffusion model, and the large model and the diffusion model can be trained based on a seventh training sample, so that the protein model can learn the structure and generate protein sample data of a condition mode in the training process, the protein model can realize the input of the condition mode data and the output of the structure mode data, namely the protein model is suitable for a protein design scene of generating a condition-structure.
On the basis of any of the above embodiments, the protein model includes a large model, and before training the protein model based on the plurality of candidate training samples in step S103, the method further includes pre-training the large model based on sample texts from a plurality of knowledge domains. In this way, the large model can learn sample texts from multiple knowledge domains during pre-training and can generate general text data.
It should be noted that the knowledge domains are not particularly limited and may include, for example, medicine, meteorology, and literature.
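A hedged sketch of such pre-training is given below; the next-token objective, the tokenizer, and the domain corpora are assumptions made for illustration and are not prescribed by the disclosure.

import torch
import torch.nn as nn

def pretrain_on_domains(model: nn.Module,
                        optimizer: torch.optim.Optimizer,
                        domain_batches) -> None:
    """domain_batches yields token-id tensors of shape [batch, seq_len] drawn from
    sample texts of several knowledge domains (e.g. medical, meteorological, literary)."""
    loss_fn = nn.CrossEntropyLoss()
    for tokens in domain_batches:
        logits = model(tokens[:, :-1])                      # predict each next token
        loss = loss_fn(logits.reshape(-1, logits.shape[-1]),
                       tokens[:, 1:].reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()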
On the basis of any of the above embodiments, as shown in fig. 10, the protein model includes a large model, an encoder, a generator, and a diffusion model.
The large model may be trained based on the first through seventh training samples, the encoder may be trained based on the third and fourth training samples, the generator may be trained based on the fifth and sixth training samples, and the diffusion model may be trained based on the seventh training sample.
The training processes of the large model, the encoder, the generator, and the diffusion model are described in the above embodiments and are not repeated here.
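To summarize the correspondence stated above, the sketch below simply records, for each of the seven training-sample types, the input and label modalities and the sub-models of FIG. 10 that are updated; the dictionary layout and modality names are illustrative assumptions.

from typing import Dict, Tuple

# sample type -> (input modality, label modality, sub-models updated)
ROUTING: Dict[int, Tuple[str, str, Tuple[str, ...]]] = {
    1: ("description", "sequence", ("large_model",)),
    2: ("sequence", "description", ("large_model",)),
    3: ("structure", "sequence", ("encoder", "large_model")),
    4: ("structure", "description", ("encoder", "large_model")),
    5: ("description", "structure", ("large_model", "generator")),
    6: ("sequence", "structure", ("large_model", "generator")),
    7: ("generation_condition", "structure", ("large_model", "diffusion_model")),
}

def submodels_for_sample(sample_type: int) -> Tuple[str, ...]:
    """Return which sub-models a candidate training sample of the given type trains."""
    return ROUTING[sample_type][2]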
Fig. 11 is a flow chart of a method for acquiring protein data according to an embodiment of the disclosure. As shown in fig. 11, the method includes:
S1101, acquiring protein data of a first modality.
S1102, inputting the protein data of the first modality into a protein model, and outputting protein data of at least one second modality from the protein model, wherein the protein model is obtained by the training method of the protein model described above.
It should be noted that, the execution body of the protein data acquisition method according to the embodiment of the present disclosure may be a hardware device having a data information processing capability and/or software necessary for driving the hardware device to operate. Alternatively, the execution body may include a workstation, a server, a computer, a user terminal, and other intelligent devices. The user terminal comprises, but is not limited to, a mobile phone, a computer, intelligent voice interaction equipment, intelligent household appliances, vehicle-mounted terminals and the like.
The first modality and the second modality are different modalities. For details of the first modality and the second modality, refer to the above embodiments, which are not repeated here.
In an embodiment of the disclosure, inputting the protein data of the first modality into the protein model, outputting the protein data of the at least one second modality from the protein model may include the following several possible implementations:
Mode 1, a protein description is input into a protein model, and a protein sequence and/or a protein structure are output from the protein model.
Thus, the protein model in the method can realize the input of description modal data and the output of sequence and/or structure modal data, that is, the protein model is applicable to a protein design scene of 'description-sequence and/or structure'.
For example, continuing with FIG. 10 as an example, a protein description may be input to a large model from which a protein sequence is output.
For example, continuing with the example of FIG. 10, a protein description may be input to a large model, structural features may be output by the large model, structural features may be input to a generator, and protein structure may be output by the generator.
Mode 2, inputting a protein sequence into a protein model, outputting a protein description and/or a protein structure from the protein model.
Thus, the protein model in the method can realize the input of sequence modal data and the output of description and/or structure modal data, namely, the protein model is applicable to a protein design scene of sequence-description and/or structure.
For example, continuing with FIG. 10 as an example, a protein sequence may be input to a large model from which a protein description is output.
For example, continuing with the example of FIG. 10, a protein sequence may be input to a large model, structural features may be output from the large model, structural features may be input to a generator, and protein structure may be output from the generator.
Mode 3, inputting the protein structure into a protein model, outputting a protein description and/or a protein sequence from the protein model.
Thus, the protein model in the method can realize the input of structural mode data and the output of description and/or sequence mode data, namely, the protein model is applicable to the protein design scene of 'structure-description and/or sequence'.
For example, continuing with the example of FIG. 10, a protein structure may be input to an encoder, the structure code may be output by the encoder, the structure code may be input to a large model, and the protein description may be output by the large model.
For example, continuing with the example of FIG. 10, the protein structure may be input to an encoder, the structural code may be output by the encoder, the structural code may be input to a large model, and the protein sequence may be output by the large model.
Mode 4, the protein production conditions are input into the protein model, and the protein structure is output from the protein model.
Therefore, the protein model in the method can realize the input of the generation condition modal data and the output of the structural modal data, namely, the protein model is suitable for a protein design scene of 'generation condition-structure'.
For example, continuing with the example of FIG. 10, the protein production conditions may be input to the large model, which encodes the production conditions and outputs a condition code; the condition code may then be input to the diffusion model, which outputs the protein structure.
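Purely as an illustrative sketch, the four input-output routes (Modes 1 to 4) through the components of FIG. 10 at acquisition time could be dispatched as follows; the callables and the target keyword argument are stand-ins assumed for this example, not an interface defined by the disclosure.

def acquire_protein_data(first_modality: str, data,
                         large_model, encoder, generator, diffusion_model) -> dict:
    if first_modality == "description":        # Mode 1: description -> sequence and/or structure
        sequence = large_model(data, target="sequence")
        structure = generator(large_model(data, target="structural_features"))
        return {"sequence": sequence, "structure": structure}
    if first_modality == "sequence":           # Mode 2: sequence -> description and/or structure
        description = large_model(data, target="description")
        structure = generator(large_model(data, target="structural_features"))
        return {"description": description, "structure": structure}
    if first_modality == "structure":          # Mode 3: structure -> description and/or sequence
        structure_code = encoder(data)
        return {"description": large_model(structure_code, target="description"),
                "sequence": large_model(structure_code, target="sequence")}
    if first_modality == "generation_condition":   # Mode 4: generation condition -> structure
        return {"structure": diffusion_model(large_model(data, target="condition_code"))}
    raise ValueError(f"unsupported first modality: {first_modality}")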
According to the above protein data acquisition method, protein data of a first modality are acquired, the protein data of the first modality are input into a protein model, and protein data of at least one second modality are output from the protein model, wherein the protein model is obtained by the above training method of the protein model. The protein model can realize the input and output of multi-modality data, that is, the protein model is applicable to various protein design scenes.
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, and disclosure of the user's personal information involved are all in compliance with the relevant laws and regulations and do not violate public order and good morals.
According to an embodiment of the present disclosure, the present disclosure further provides a training device for a protein model, which is configured to implement the training method for a protein model.
FIG. 12 is a block diagram of a training apparatus for protein models according to an embodiment of the present disclosure.
As shown in fig. 12, a training apparatus 1200 for a protein model includes: an acquisition module 1201, an association module 1202, and a training module 1203.
An acquisition module 1201, configured to acquire multi-modal protein sample data;
the association module 1202 is configured to associate protein sample data of any two modalities to obtain a candidate training sample;
the training module 1203 is configured to train the protein model based on the plurality of candidate training samples.
In one embodiment of the present disclosure, the training module 1203 is further configured to: obtain a modality combination corresponding to each candidate training sample based on the modalities corresponding to the two protein sample data items in the candidate training sample; divide candidate training samples having the same modality combination into the same candidate training sample set; and train the protein model based on a plurality of the candidate training sample sets.
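A minimal sketch of this grouping step is shown below, assuming each candidate training sample is represented as a small record; the field names are illustrative only.

from collections import defaultdict
from typing import Dict, List, Tuple

def group_by_modality_combination(candidate_samples: List[dict]) -> Dict[Tuple[str, str], List[dict]]:
    """Each sample is assumed to look like
    {"input_modality": "description", "label_modality": "sequence", "input": ..., "label": ...}."""
    sample_sets: Dict[Tuple[str, str], List[dict]] = defaultdict(list)
    for sample in candidate_samples:
        combination = (sample["input_modality"], sample["label_modality"])
        sample_sets[combination].append(sample)   # same combination -> same candidate set
    return dict(sample_sets)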
In one embodiment of the present disclosure, the association module 1202 is further configured to: use a protein sample sequence as a label and correlate the protein sample description with the protein sample sequence to obtain a first training sample;
the protein model includes a large model, and the training module 1203 is further configured to: input the protein sample description in the first training sample into the large model, so that the large model outputs a first protein prediction sequence; and train the large model based on the protein sample sequence in the first training sample and the first protein prediction sequence.
In one embodiment of the present disclosure, the association module 1202 is further configured to: take the protein sample description as a label and correlate the protein sample sequence with the protein sample description to obtain a second training sample;
the protein model includes a large model, and the training module 1203 is further configured to: input the protein sample sequence in the second training sample into the large model, so that the large model outputs a first protein prediction description; and train the large model based on the protein sample description in the second training sample and the first protein prediction description.
In one embodiment of the present disclosure, the association module 1202 is further configured to: take the protein sample sequence as a label and correlate the protein sample structure with the protein sample sequence to obtain a third training sample;
the protein model includes an encoder and a large model, and the training module 1203 is further configured to: input the protein sample structure in the third training sample into the encoder, so that the encoder outputs a first structure code; input the first structure code into the large model, so that the large model outputs a second protein predicted sequence; and train the encoder and the large model based on the protein sample sequence in the third training sample and the second protein predicted sequence.
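For illustration, a hedged sketch of this 'structure-sequence' path is given below; the token representation of the amino-acid sequence, the cross-entropy loss, and all interface names are assumptions of the sketch.

import torch
import torch.nn as nn

def train_structure_to_sequence(encoder: nn.Module, large_model: nn.Module,
                                optimizer: torch.optim.Optimizer,
                                sample_structure: torch.Tensor,
                                sample_sequence_ids: torch.Tensor) -> float:
    structure_code = encoder(sample_structure)        # first structure code
    logits = large_model(structure_code)              # [batch, seq_len, vocab] over amino-acid tokens
    # The protein sample sequence serves as the label for the predicted sequence.
    loss = nn.functional.cross_entropy(logits.reshape(-1, logits.shape[-1]),
                                       sample_sequence_ids.reshape(-1))
    optimizer.zero_grad()
    loss.backward()   # updates both the encoder and the large model via a shared optimizer
    optimizer.step()
    return loss.item()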
In one embodiment of the present disclosure, the association module 1202 is further configured to: take the protein sample description as a label and correlate the protein sample structure with the protein sample description to obtain a fourth training sample;
the protein model includes an encoder and a large model, and the training module 1203 is further configured to: input the protein sample structure in the fourth training sample into the encoder, so that the encoder outputs a second structure code; input the second structure code into the large model, so that the large model outputs a second protein prediction description; and train the encoder and the large model based on the protein sample description in the fourth training sample and the second protein prediction description.
In one embodiment of the present disclosure, the association module 1202 is further configured to: take the protein sample structure as a label and correlate the protein sample description with the protein sample structure to obtain a fifth training sample;
the protein model includes a large model and a generator, and the training module 1203 is further configured to: input the protein sample description in the fifth training sample into the large model, so that the large model outputs a first structural feature; input the first structural feature into the generator, so that the generator outputs a first protein predicted structure; and train the large model and the generator based on the protein sample structure in the fifth training sample and the first protein predicted structure.
In one embodiment of the present disclosure, the association module 1202 is further configured to: take the protein sample structure as a label and correlate the protein sample sequence with the protein sample structure to obtain a sixth training sample;
the protein model includes a large model and a generator, and the training module 1203 is further configured to: input the protein sample sequence in the sixth training sample into the large model, so that the large model outputs a second structural feature; input the second structural feature into the generator, so that the generator outputs a second protein predicted structure; and train the large model and the generator based on the protein sample structure in the sixth training sample and the second protein predicted structure.
In one embodiment of the present disclosure, the association module 1202 is further configured to: take the protein sample structure as a label and correlate the protein sample generation condition with the protein sample structure to obtain a seventh training sample;
the protein model includes a large model and a diffusion model, and the training module 1203 is further configured to: input the sample generation condition in the seventh training sample into the large model, so that the large model encodes the sample generation condition and outputs a condition code; input the condition code into the diffusion model, so that the diffusion model outputs a third protein predicted structure; and train the large model and the diffusion model based on the protein sample structure in the seventh training sample and the third protein predicted structure.
In one embodiment of the present disclosure, before the condition code is input into the diffusion model, the training module 1203 is further configured to: acquire a plurality of protein sample structures; and train the diffusion model based on the plurality of protein sample structures.
In one embodiment of the present disclosure, the protein model includes a large model, and the training module 1203 is further configured to, before training the protein model based on the plurality of candidate training samples, pre-train the large model based on sample texts of a plurality of knowledge domains.
According to the above training device for a protein model, multi-modality protein sample data are obtained, protein sample data of any two modalities are associated to obtain candidate training samples, and the protein model is trained based on a plurality of candidate training samples. In this way, candidate training samples can be obtained from protein sample data of any two modalities, and the protein model can be trained on the plurality of modality combinations corresponding to the candidate training samples, so that the protein model learns multi-modality protein sample data during training. The protein model can therefore realize the input and output of multi-modality data, that is, the protein model is applicable to various protein design scenes.
According to an embodiment of the present disclosure, the present disclosure further provides a protein data acquisition device, which is configured to implement the above protein data acquisition method.
Fig. 13 is a block diagram of an apparatus for acquiring protein data according to an embodiment of the present disclosure.
As shown in fig. 13, the protein data acquisition apparatus 1300 includes: a first acquisition module 1301 and a second acquisition module 1302.
A first obtaining module 1301, configured to obtain protein data of a first modality;
the second obtaining module 1302 is configured to input the protein data of the first modality to a protein model, and output, by the protein model, protein data of at least one second modality, where the protein model is obtained by using the training method of the protein model.
In one embodiment of the present disclosure, the second obtaining module 1302 is further configured to: input a protein description into the protein model, and output a protein sequence and/or a protein structure from the protein model.
In one embodiment of the present disclosure, the second obtaining module 1302 is further configured to: input a protein sequence into the protein model, and output a protein description and/or a protein structure from the protein model.
In one embodiment of the present disclosure, the second obtaining module 1302 is further configured to: input a protein structure into the protein model, and output a protein description and/or a protein sequence from the protein model.
In one embodiment of the present disclosure, the second obtaining module 1302 is further configured to: input protein production conditions into the protein model, and output a protein structure from the protein model.
According to the above protein data acquisition device, protein data of a first modality are acquired, the protein data of the first modality are input into the protein model, and protein data of at least one second modality are output from the protein model, wherein the protein model is obtained by the above training method of the protein model. The protein model can realize the input and output of multi-modality data, that is, the protein model is applicable to various protein design scenes.
According to embodiments of the present disclosure, the present disclosure also proposes an electronic device, a readable storage medium and a computer program product.
Fig. 14 shows a schematic block diagram of an example electronic device 1400 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 14, the apparatus 1400 includes a computing unit 1401 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 1402 or a computer program loaded from a storage unit 1408 into a Random Access Memory (RAM) 1403. In the RAM 1403, various programs and data required for the operation of the device 1400 can also be stored. The computing unit 1401, the ROM 1402, and the RAM 1403 are connected to each other through a bus 1404. An input/output (I/O) interface 1405 is also connected to the bus 1404.
Various components in the device 1400 are connected to the I/O interface 1405, including: an input unit 1406 such as a keyboard, a mouse, or the like; an output unit 1407 such as various types of displays, speakers, and the like; a storage unit 1408 such as a magnetic disk, an optical disk, or the like; and a communication unit 1409 such as a network card, a modem, a wireless communication transceiver, and the like. The communication unit 1409 allows the device 1400 to exchange information/data with other devices through a computer network such as the Internet and/or various telecommunication networks.
The computing unit 1401 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 1401 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 1401 performs the respective methods and processes described above, for example, a training method of a protein model, an acquisition method of protein data. For example, in some embodiments, the training method of the protein model, the acquisition method of the protein data may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 1408. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 1400 via the ROM 1402 and/or the communication unit 1409. When the computer program is loaded into the RAM 1403 and executed by the computing unit 1401, one or more steps of the training method of the protein model, the acquisition method of the protein data described above may be performed. Alternatively, in other embodiments, the computing unit 1401 may be configured to perform the training method of the protein model, the acquisition method of the protein data, by any other suitable means (e.g. by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, field programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which the user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with the user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local area networks (LANs), wide area networks (WANs), and the Internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
According to an embodiment of the present disclosure, there is also provided a computer program product, including a computer program, where the computer program, when executed by a processor, implements the steps of the training method for protein model and the acquiring method for protein data described in the foregoing embodiments of the present disclosure.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the disclosed aspects are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (35)

1. A method of training a protein model, comprising:
acquiring multi-modal protein sample data;
correlating protein sample data of any two modes to obtain candidate training samples;
training a protein model based on a plurality of the candidate training samples.
2. The method of claim 1, wherein the training a protein model based on the plurality of candidate training samples comprises:
obtaining a mode combination corresponding to the candidate training sample based on modes corresponding to two protein sample data in the candidate training sample;
dividing the candidate training samples of the same modal combination into the same candidate training sample set;
training the protein model based on a plurality of the candidate training sample sets.
3. The method of claim 1, wherein correlating the protein sample data of any two modalities to obtain candidate training samples comprises:
using a protein sample sequence as a label, and correlating the protein sample description with the protein sample sequence to obtain a first training sample;
the protein model comprises a large model, the training of the protein model based on a plurality of the candidate training samples comprises:
inputting protein sample descriptions in the first training sample into the large model, and outputting a first protein prediction sequence by the large model;
training the large model based on the protein sample sequence in the first training sample and the first protein prediction sequence.
4. The method of claim 1, wherein correlating the protein sample data of any two modalities to obtain candidate training samples comprises:
taking the protein sample description as a label, and correlating the protein sample sequence with the protein sample description to obtain a second training sample;
the protein model comprises a large model, the training of the protein model based on a plurality of the candidate training samples comprises:
inputting a protein sample sequence in the second training sample into the large model, and outputting a first protein prediction description by the large model;
training the large model based on the protein sample description in the second training sample and the first protein prediction description.
5. The method of claim 1, wherein correlating the protein sample data of any two modalities to obtain candidate training samples comprises:
taking the protein sample sequence as a label, and correlating the protein sample structure with the protein sample sequence to obtain a third training sample;
the protein model includes an encoder and a large model, the training the protein model based on a plurality of the candidate training samples, comprising:
inputting protein sample structures in the third training samples to the encoder, outputting a first structure code by the encoder;
inputting the first structure code to the large model, outputting a second protein predicted sequence from the large model;
training the encoder and the large model based on the protein sample sequence in the third training sample and the second protein predicted sequence.
6. The method of claim 1, wherein correlating the protein sample data of any two modalities to obtain candidate training samples comprises:
taking the protein sample description as a label, and correlating the protein sample structure with the protein sample description to obtain a fourth training sample;
the protein model includes an encoder and a large model, the training the protein model based on a plurality of the candidate training samples, comprising:
inputting protein sample structures in the fourth training samples to the encoder, outputting a second structure code by the encoder;
inputting the second structure code to the large model, outputting a second protein prediction description from the large model;
training the encoder and the large model based on a protein sample description in the fourth training sample and the second protein prediction description.
7. The method of claim 1, wherein correlating the protein sample data of any two modalities to obtain candidate training samples comprises:
taking the protein sample structure as a label, and correlating the protein sample description with the protein sample structure to obtain a fifth training sample;
the protein model includes a large model and a generator, the training the protein model based on a plurality of the candidate training samples, comprising:
inputting protein sample descriptions in the fifth training sample into the large model, and outputting first structural features by the large model;
inputting the first structural feature to the generator, outputting a first protein predicted structure by the generator;
training the large model and the generator based on protein sample structure and the first protein prediction structure in the fifth training sample.
8. The method of claim 1, wherein correlating the protein sample data of any two modalities to obtain candidate training samples comprises:
taking the protein sample structure as a label, and correlating the protein sample sequence with the protein sample structure to obtain a sixth training sample;
the protein model includes a large model and a generator, the training the protein model based on a plurality of the candidate training samples, comprising:
inputting a protein sample sequence in the sixth training sample into the large model, and outputting a second structural feature by the large model;
inputting the second structural feature to the generator, outputting a second protein predicted structure by the generator;
training the large model and the generator based on protein sample structure and the second protein prediction structure in the sixth training sample.
9. The method of claim 1, wherein correlating the protein sample data of any two modalities to obtain candidate training samples comprises:
taking the protein sample structure as a label, and correlating the protein sample generation condition with the protein sample structure to obtain a seventh training sample;
the protein model comprises a large model and a diffusion model, the training of the protein model based on a plurality of the candidate training samples comprises:
inputting sample generation conditions in the seventh training sample into the large model, encoding the sample generation conditions in the seventh training sample by the large model, and outputting condition codes;
inputting the condition code to the diffusion model, outputting a third protein predicted structure from the diffusion model;
training the large model and the diffusion model based on protein sample structure and the third protein prediction structure in the seventh training sample.
10. The method of claim 9, wherein the inputting the condition code to the diffusion model is preceded by:
acquiring a plurality of protein sample structures;
the diffusion model is trained based on a plurality of the protein sample structures.
11. The method of any of claims 1-10, wherein the protein model comprises a large model, and the method further comprises, prior to the training of the protein model based on the plurality of candidate training samples:
pre-training the large model based on sample texts of a plurality of knowledge domains.
12. A method of acquiring protein data, comprising:
acquiring protein data of a first modality;
inputting the protein data of the first modality into a protein model, outputting the protein data of at least one second modality from the protein model, wherein the protein model is obtained by using the training method of the protein model according to any one of claims 1-11.
13. The method of claim 12, wherein the inputting the protein data of the first modality into a protein model, outputting the protein data of at least one second modality from the protein model, comprises:
protein descriptions are input to the protein model from which protein sequences and/or protein structures are output.
14. The method of claim 12, wherein the inputting the protein data of the first modality into a protein model, outputting the protein data of at least one second modality from the protein model, comprises:
protein sequences are input to the protein model from which protein descriptions and/or protein structures are output.
15. The method of claim 12, wherein the inputting the protein data of the first modality into a protein model, outputting the protein data of at least one second modality from the protein model, comprises:
the protein structure is input to the protein model, from which protein descriptions and/or protein sequences are output.
16. The method of claim 12, wherein the inputting the protein data of the first modality into a protein model, outputting the protein data of at least one second modality from the protein model, comprises:
the protein production conditions are input to the protein model, and the protein structure is output from the protein model.
17. A protein model training device comprising:
the acquisition module is used for acquiring multi-mode protein sample data;
the association module is used for associating protein sample data of any two modes to obtain candidate training samples;
and the training module is used for training the protein model based on a plurality of candidate training samples.
18. The apparatus of claim 17, wherein the training module is further configured to:
obtaining a mode combination corresponding to the candidate training sample based on modes corresponding to two protein sample data in the candidate training sample;
dividing the candidate training samples of the same modal combination into the same candidate training sample set;
training the protein model based on a plurality of the candidate training sample sets.
19. The apparatus of claim 17, wherein the association module is further configured to:
using a protein sample sequence as a label, and correlating the protein sample description with the protein sample sequence to obtain a first training sample;
the protein model includes a large model, the training module is further configured to:
inputting protein sample descriptions in the first training sample into the large model, and outputting a first protein prediction sequence by the large model;
training the large model based on the protein sample sequence in the first training sample and the first protein prediction sequence.
20. The apparatus of claim 17, wherein the association module is further configured to:
taking the protein sample description as a label, and correlating the protein sample sequence with the protein sample description to obtain a second training sample;
the protein model includes a large model, the training module is further configured to:
inputting a protein sample sequence in the second training sample into the large model, and outputting a first protein prediction description by the large model;
training the large model based on the protein sample description in the second training sample and the first protein prediction description.
21. The apparatus of claim 17, wherein the association module is further configured to:
taking the protein sample sequence as a label, and correlating the protein sample structure with the protein sample sequence to obtain a third training sample;
the protein model includes an encoder and a large model, the training module is further configured to:
inputting protein sample structures in the third training samples to the encoder, outputting a first structure code by the encoder;
inputting the first structure code to the large model, outputting a second protein predicted sequence from the large model;
training the encoder and the large model based on the protein sample sequence in the third training sample and the second protein predicted sequence.
22. The apparatus of claim 17, wherein the association module is further configured to:
taking the protein sample description as a label, and correlating the protein sample structure with the protein sample description to obtain a fourth training sample;
the protein model includes an encoder and a large model, the training module is further configured to:
inputting protein sample structures in the fourth training samples to the encoder, outputting a second structure code by the encoder;
inputting the second structure code to the large model, outputting a second protein prediction description from the large model;
training the encoder and the large model based on a protein sample description in the fourth training sample and the second protein prediction description.
23. The apparatus of claim 17, wherein the association module is further configured to:
taking the protein sample structure as a label, and correlating the protein sample description with the protein sample structure to obtain a fifth training sample;
the protein model includes a large model and a generator, the training module further configured to:
inputting protein sample descriptions in the fifth training sample into the large model, and outputting first structural features by the large model;
inputting the first structural feature to the generator, outputting a first protein predicted structure by the generator;
training the large model and the generator based on protein sample structure and the first protein prediction structure in the fifth training sample.
24. The apparatus of claim 17, wherein the association module is further configured to:
taking the protein sample structure as a label, and correlating the protein sample sequence with the protein sample structure to obtain a sixth training sample;
the protein model includes a large model and a generator, the training module further configured to:
inputting a protein sample sequence in the sixth training sample into the large model, and outputting a second structural feature by the large model;
inputting the second structural feature to the generator, outputting a second protein predicted structure by the generator;
training the large model and the generator based on protein sample structure and the second protein prediction structure in the sixth training sample.
25. The apparatus of claim 17, wherein the association module is further configured to:
taking the protein sample structure as a label, and correlating the protein sample generation condition with the protein sample structure to obtain a seventh training sample;
the protein model includes a large model and a diffusion model, the training module is further configured to:
inputting sample generation conditions in the seventh training sample into the large model, encoding the sample generation conditions in the seventh training sample by the large model, and outputting condition codes;
inputting the condition code to the diffusion model, outputting a third protein predicted structure from the diffusion model;
training the large model and the diffusion model based on protein sample structure and the third protein prediction structure in the seventh training sample.
26. The apparatus of claim 25, wherein the training module, prior to the inputting of the condition code to the diffusion model, is further configured to:
acquiring a plurality of protein sample structures;
the diffusion model is trained based on a plurality of the protein sample structures.
27. The apparatus of any of claims 17-26, wherein the protein model comprises a large model, and the training module, prior to training the protein model based on the plurality of candidate training samples, is further configured to:
pre-train the large model based on sample texts of a plurality of knowledge domains.
28. An acquisition device for protein data, comprising:
the first acquisition module is used for acquiring protein data of a first modality;
a second acquisition module, configured to input the protein data of the first modality into a protein model, and output the protein data of at least one second modality from the protein model, where the protein model is obtained by using the training method of the protein model according to any one of claims 1-11.
29. The apparatus of claim 28, wherein the second acquisition module is further configured to:
protein descriptions are input to the protein model from which protein sequences and/or protein structures are output.
30. The apparatus of claim 28, wherein the second acquisition module is further configured to:
protein sequences are input to the protein model from which protein descriptions and/or protein structures are output.
31. The apparatus of claim 28, wherein the second acquisition module is further configured to:
the protein structure is input to the protein model, from which protein descriptions and/or protein sequences are output.
32. The apparatus of claim 28, wherein the second acquisition module is further configured to:
the protein production conditions are input to the protein model, and the protein structure is output from the protein model.
33. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-16.
34. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1-16.
35. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any of claims 1-16.
CN202311195860.0A 2023-09-15 2023-09-15 Training method of protein model, and acquisition method and device of protein data Pending CN117275570A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311195860.0A CN117275570A (en) 2023-09-15 2023-09-15 Training method of protein model, and acquisition method and device of protein data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311195860.0A CN117275570A (en) 2023-09-15 2023-09-15 Training method of protein model, and acquisition method and device of protein data

Publications (1)

Publication Number Publication Date
CN117275570A true CN117275570A (en) 2023-12-22

Family

ID=89200039

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311195860.0A Pending CN117275570A (en) 2023-09-15 2023-09-15 Training method of protein model, and acquisition method and device of protein data

Country Status (1)

Country Link
CN (1) CN117275570A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117711525A (en) * 2024-02-05 2024-03-15 北京悦康科创医药科技股份有限公司 Activity prediction model training and activity prediction related products
CN117711525B (en) * 2024-02-05 2024-05-10 北京悦康科创医药科技股份有限公司 Activity prediction model training and activity prediction related products


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination