CN115565607B

CN115565607B - Method, device, readable medium and electronic equipment for determining protein information

Info

Publication number: CN115565607B
Application number: CN202211289842.4A
Authority: CN
Inventors: 边成; 张志诚; 李永会
Original assignee: Douyin Vision Co Ltd
Current assignee: Douyin Vision Co Ltd
Priority date: 2022-10-20
Filing date: 2022-10-20
Publication date: 2024-02-23
Anticipated expiration: 2042-10-20
Also published as: CN115565607A

Abstract

The present disclosure relates to a method, apparatus, readable medium and electronic device for determining protein information, comprising: acquiring a target protein sequence of protein information to be determined; inputting the target protein sequence into a target protein representation model to obtain a target protein representation output by the target protein representation model; determining protein information of a target protein sequence from the target protein representation, the protein information including at least one of protein structural information, protein functional information, protein stability information, and protein interaction information; the target protein representation model is pre-generated through a plurality of sample sets, the sample sets comprise sample protein sequences and sample sequence information corresponding to the sample protein sequences, the sample sequence information comprises sample protein representations, sample gene ontology features and sample relationship information, and the sample relationship information is used for representing the relationship between the sample protein sequences and the sample gene ontology features.

Description

Method, device, readable medium and electronic equipment for determining protein information

Technical Field

The present disclosure relates to the field of computer technology, and in particular, to a method, an apparatus, a readable medium, and an electronic device for determining protein information.

Background

Proteins are fundamental to all life and are the most basic and important components of the body's cells. Predicting protein structures helps in understanding how proteins function and is very important for biology, medicine and pharmacy. In the related art, a protein pre-training model is generated from a conventional Transformer model and then fine-tuned to predict protein structure. Because the prediction builds on the protein pre-training model, the accuracy of that model directly affects the accuracy of the predicted protein structure. Therefore, how to improve the accuracy of the pre-training model is a problem to be solved.

Disclosure of Invention

This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.

In a first aspect, the present disclosure provides a method of determining protein information, the method comprising:

acquiring a target protein sequence of protein information to be determined;

inputting the target protein sequence into a target protein representation model to obtain a target protein representation output by the target protein representation model;

determining protein information of the target protein sequence from the target protein representation, the protein information including at least one of protein structure information, protein function information, protein stability information, and protein interaction information;

the target protein representation model is pre-generated through a plurality of sample sets, the sample sets comprise sample protein sequences and sample sequence information corresponding to the sample protein sequences, the sample sequence information comprises sample protein representations, sample gene ontology features and sample relationship information, and the sample relationship information is used for representing the relationship between the sample protein sequences and the sample gene ontology features.

In a second aspect, the present disclosure provides an apparatus for determining protein information, the apparatus comprising:

the first acquisition module is used for acquiring a target protein sequence of protein information to be determined;

the second acquisition module is used for inputting the target protein sequence into a target protein representation model so as to acquire a target protein representation output by the target protein representation model;

a determining module for determining protein information of the target protein sequence from the target protein representation, the protein information comprising at least one of protein structure information, protein function information, protein stability information, and protein interaction information;

In a third aspect, the present disclosure provides a computer readable medium having stored thereon a computer program which when executed by a processing device performs the steps of the method of the first aspect of the present disclosure.

In a fourth aspect, the present disclosure provides an electronic device comprising:

a storage device having at least one computer program stored thereon;

at least one processing means for executing the at least one computer program in the storage means to carry out the steps of the method of the first aspect of the present disclosure.

Through the technical scheme, the target protein sequence of the protein information to be determined is obtained; inputting the target protein sequence into a target protein representation model to obtain a target protein representation output by the target protein representation model; determining protein information of the target protein sequence from the target protein representation, the protein information including at least one of protein structure information, protein function information, protein stability information, and protein interaction information; the target protein representation model is pre-generated through a plurality of sample sets, the sample sets comprise sample protein sequences and sample sequence information corresponding to the sample protein sequences, the sample sequence information comprises sample protein representations, sample gene ontology features and sample relationship information, and the sample relationship information is used for representing the relationship between the sample protein sequences and the sample gene ontology features. That is, the training set for training a target protein representation model according to the present disclosure includes, in addition to a sample protein sequence and a sample protein representation corresponding to the sample protein sequence, a sample gene ontology feature corresponding to the sample protein sequence and sample relationship information corresponding to the sample protein sequence, by which the protein representation capability of the target protein representation model can be improved, so that the accuracy of the target protein representation model is higher, and thus, the target protein representation determined according to the target protein representation model is also more accurate, thereby improving the accuracy of target protein information determined according to the target protein representation.

Additional features and advantages of the present disclosure will be set forth in the detailed description which follows.

Drawings

The above and other features, advantages, and aspects of embodiments of the present disclosure will become more apparent by reference to the following detailed description when taken in conjunction with the accompanying drawings. The same or similar reference numbers will be used throughout the drawings to refer to the same or like elements. It should be understood that the figures are schematic and that elements and components are not necessarily drawn to scale. In the drawings:

FIG. 1 is a flowchart illustrating a method of determining protein information according to an exemplary embodiment of the present disclosure;

FIG. 2 is a flow chart illustrating a method of model generation according to an exemplary embodiment of the present disclosure;

FIG. 3 is a schematic representation of a protein representation according to an exemplary embodiment of the present disclosure;

FIG. 4 is a model training schematic diagram illustrating one exemplary embodiment of the present disclosure;

FIG. 5 is a block diagram illustrating an apparatus for determining protein information according to an exemplary embodiment of the present disclosure;

FIG. 6 is a block diagram illustrating another apparatus for determining protein information according to an exemplary embodiment of the present disclosure;

FIG. 7 is a block diagram of an electronic device according to an exemplary embodiment of the present disclosure.

Detailed Description

Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure have been shown in the accompanying drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but are provided to provide a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are for illustration purposes only and are not intended to limit the scope of the present disclosure.

It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order and/or performed in parallel. Furthermore, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.

The term "including" and variations thereof as used herein are intended to be open-ended, i.e., "including, but not limited to". The term "based on" means "based at least in part on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Related definitions of other terms will be given in the description below.

It should be noted that the terms "first," "second," and the like in this disclosure are merely used to distinguish between different devices, modules, or units and are not used to define an order or interdependence of functions performed by the devices, modules, or units.

It should be noted that the modifiers "a" and "a plurality" in this disclosure are intended to be illustrative rather than limiting, and those of ordinary skill in the art will appreciate that they should be understood as "one or more" unless the context clearly indicates otherwise.

The names of messages or information interacted between the various devices in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of such messages or information.

The present disclosure is described below in connection with specific embodiments.

FIG. 1 is a flowchart illustrating a method of determining protein information, as shown in FIG. 1, according to an exemplary embodiment of the present disclosure, which may include:

S101, acquiring a target protein sequence of protein information to be determined.

Wherein the target protein sequence may be any length protein sequence, which is not limited in the present disclosure.

S102, inputting the target protein sequence into a target protein representation model to obtain a target protein representation output by the target protein representation model.

Wherein the target protein representation may include the sequence length and the hidden dimension (hidden dim) of the target protein sequence; for example, the target protein representation may be $P \in \mathbb{R}^{L \times D}$, where L is the sequence length and D is the hidden dimension. The target protein representation model is pre-generated through a plurality of sample sets, the sample sets comprise sample protein sequences and sample sequence information corresponding to the sample protein sequences, the sample sequence information comprises sample protein representations, sample gene ontology features and sample relationship information, and the sample relationship information is used for representing the relationship between the sample protein sequences and the sample gene ontology features.

In this step, the target protein sequence may be input into the target protein representation model, and the sequence length and the hidden dimension of the target protein sequence may be determined by the target protein representation model, to obtain a target protein representation corresponding to the target protein sequence.

S103, determining protein information of the target protein sequence according to the target protein representation.

Wherein the protein information may include at least one of protein structure information, protein function information, protein stability information, and protein interaction information.

In this step, after the target protein representation corresponding to the target protein sequence is obtained, the protein information of the target protein sequence can be determined by a pre-generated information determination model. For example, the information determination model may be a structure prediction model, and the structure information of the target protein sequence may be obtained after inputting the target protein representation into the structure prediction model.

With this method, the training set for training the target protein representation model includes, in addition to the sample protein sequence and the sample protein representation corresponding to the sample protein sequence, the sample gene ontology feature and the sample relationship information corresponding to the sample protein sequence. The sample gene ontology feature and the sample relationship information can improve the protein representation capability of the target protein representation model, so that the model is more accurate, the target protein representation determined by the model is more accurate, and the accuracy of the protein information determined from the target protein representation is improved.
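Steps S101 to S103 can be sketched as a two-stage pipeline. The following is only an illustration under stated assumptions: `toy_representation_model` and `toy_structure_model` are hypothetical stand-ins (they are not the trained models of the disclosure), and the hidden dimension of 4 is arbitrary.

```python
from typing import Callable, List

# A representation has one row per residue (L rows) and D columns (hidden dim).
Representation = List[List[float]]


def determine_protein_information(
    sequence: str,
    representation_model: Callable[[str], Representation],
    information_model: Callable[[Representation], str],
) -> str:
    """S101-S103: sequence -> target protein representation -> protein information."""
    representation = representation_model(sequence)  # S102
    return information_model(representation)         # S103


# Hypothetical stand-in for the target protein representation model.
def toy_representation_model(sequence: str, hidden_dim: int = 4) -> Representation:
    # One D-dimensional row per residue, so the output has shape L x D.
    return [[(ord(aa) % 7) / 7.0] * hidden_dim for aa in sequence]


# Hypothetical stand-in for an information determination (structure) model.
def toy_structure_model(representation: Representation) -> str:
    # Label each residue "H" or "C" by thresholding its mean activation.
    return "".join("H" if sum(row) / len(row) > 0.5 else "C" for row in representation)


rep = toy_representation_model("MNPRKK")
labels = determine_protein_information("MNPRKK", toy_representation_model, toy_structure_model)
```

Keeping the representation model and the information determination model as separate callables mirrors the text: the same representation could be passed to a different downstream model (function, stability, interaction) without changing the first stage.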

FIG. 2 is a flow chart of a model generation method, as shown in FIG. 2, according to an exemplary embodiment of the present disclosure, which may include:

S21, acquiring a plurality of sample sets.

The sample sets may be taken from a protein knowledge graph disclosed in the prior art. For example, the sample sets may be selected from the proteins in the Gene Ontology (GO) database, which stores protein triples. Each triple includes a protein sequence, a gene ontology feature corresponding to the protein sequence, and an association relationship corresponding to the protein sequence, and the gene ontology feature and the association relationship are both described in text. For example, the triple may be ("mnprkkrllvivlfgigagiglvv.", "part of", "cytosolic large ribosomal subunit: the large subunit of a ribosome located in the cytosol."), where the first element is the protein sequence, "part of" is the association relationship corresponding to the protein sequence, and "cytosolic large ribosomal subunit: the large subunit of a ribosome located in the cytosol." is the gene ontology feature corresponding to the protein sequence.

In this step, a plurality of protein triples may be obtained from the GO database. For each protein triple, the protein sequence in the triple is taken as the sample protein sequence, the gene ontology feature and the association relationship in the triple are taken as the sample gene ontology feature and the sample relationship information corresponding to the sample protein sequence, and the sequence length and the hidden dimension of the sample protein sequence are determined, so as to obtain a sample protein representation corresponding to the sample protein sequence.
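One way to turn such triples into sample sets is sketched below. This is a minimal sketch, assuming that the middle element of each triple is the relation label and the last element is the textual gene ontology description. The class name `SampleSet`, the function `build_sample_sets`, and the short sequence "mktayiak" are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class SampleSet:
    protein_sequence: str  # sample protein sequence
    relation: str          # sample relationship information (assumed: middle element)
    go_feature: str        # sample gene ontology feature (assumed: last element)


def build_sample_sets(triples: Iterable[Tuple[str, str, str]]) -> List[SampleSet]:
    """Convert (sequence, relation, GO text) triples into sample sets."""
    samples = []
    for sequence, relation, go_feature in triples:
        sequence = sequence.strip()
        if not sequence:
            continue  # skip entries without a usable sequence
        samples.append(SampleSet(sequence, relation.strip(), go_feature.strip()))
    return samples


triples = [
    (
        "mktayiak",  # hypothetical sequence, not taken from the GO database
        "part of",
        "cytosolic large ribosomal subunit: the large subunit of a ribosome located in the cytosol.",
    ),
    ("   ", "part of", "entry with an empty sequence, dropped by the filter"),
]
samples = build_sample_sets(triples)
```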

S22, determining a current sample set from a plurality of sample sets, taking a preset protein representation model as a current protein representation model, and circularly executing a model training step according to the current sample set until the trained current protein representation model meets a preset iteration stopping condition, and taking the trained current protein representation model as the target protein representation model.

Wherein the model training step comprises: determining a first loss value corresponding to the current sample set through the current protein representation model; determining a second loss value corresponding to the current sample set through a pre-generated relation determination model; determining a target loss value according to the first loss value and the second loss value; under the condition that the current protein representation model does not meet the preset iteration stopping condition according to the target loss value, updating parameters of the current protein representation model according to the target loss value to obtain a trained current protein representation model, taking the trained current protein representation model as a new current protein representation model, and determining a new current sample set from a plurality of sample sets. The relationship determination model may be an MLP (Multilayer Perceptron, multi-layer perceptron) model.

In this step, any one of the plurality of sample sets may be used as the current sample set. After the current sample set is determined, the preset protein representation model is obtained and used as the current protein representation model, the first loss value corresponding to the current sample set is determined through the current protein representation model, and the second loss value corresponding to the current sample set is determined through the relation determination model.

In one possible implementation, a sample protein sequence corresponding to the current sample set may be input to the current protein representation model to obtain a predicted protein representation output by the current protein representation model, and a first loss value corresponding to the current sample set may be determined according to the predicted protein representation and the sample protein representation corresponding to the current sample set. The first loss value may be determined by an MLM (Masked Language Model) loss function, for example.
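A masked-language-model loss of this kind is commonly computed as the mean cross-entropy over the masked positions. The sketch below assumes that the predicted representation has already been projected to per-position scores (logits) over an amino-acid vocabulary; that projection, the function name `masked_lm_loss`, and the toy vocabulary size of 4 are assumptions, since the disclosure does not detail them.

```python
import math
from typing import List, Sequence


def masked_lm_loss(
    logits: Sequence[Sequence[float]],  # one row of vocabulary scores per position
    targets: Sequence[int],             # true token index at each position
    masked_positions: Sequence[int],    # positions that were masked in the input
) -> float:
    """Mean cross-entropy over the masked positions only."""
    if not masked_positions:
        raise ValueError("at least one masked position is required")
    total = 0.0
    for pos in masked_positions:
        row = logits[pos]
        peak = max(row)  # subtract the maximum for numerical stability
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_z - row[targets[pos]]
    return total / len(masked_positions)


# Hypothetical example: 3 positions, vocabulary of 4 tokens, position 1 masked.
uniform = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]
confident = [[0.0, 0.0, 0.0, 0.0], [0.0, 5.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
loss_uniform = masked_lm_loss(uniform, targets=[2, 1, 3], masked_positions=[1])
loss_confident = masked_lm_loss(confident, targets=[2, 1, 3], masked_positions=[1])
```

With uniform scores the loss equals the logarithm of the vocabulary size, and it falls as the score of the correct token rises.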

The current protein representation model includes a Fourier transform layer and a protein representation determination layer, and the Fourier transform layer can perform two-dimensional Fourier transform processing. For example, a sample protein sequence corresponding to the current sample set can be input into the current protein representation model, sample dimension information of the sample protein sequence can be determined by the Fourier transform layer, and the predicted protein representation can be determined by the protein representation determination layer according to the sample dimension information.

FIG. 3 is a schematic diagram of a protein representation model according to an exemplary embodiment of the present disclosure. As shown in FIG. 3, the current protein representation model further includes an embedding layer, and the protein representation determination layer includes an Add & Norm layer, a feed-forward network, and a further Add & Norm layer. An output of the embedding layer is coupled to an input of the Fourier transform layer, and an output of the Fourier transform layer is coupled to an input of the protein representation determination layer. After the sample protein sequence is input into the current protein representation model, the embedding layer determines a sequence embedding vector corresponding to the sample protein sequence, and the sequence embedding vector is input into the Fourier transform layer. The Fourier transform layer performs a one-dimensional Fourier transform along the sequence dimension of the sample protein sequence and a one-dimensional Fourier transform along the hidden dimension, to obtain sample dimension information comprising the sample sequence dimension and the sample hidden dimension. The sample dimension information may then be input to the protein representation determination layer, through which the predicted protein representation is obtained.

In addition, because Fourier transform processing has relatively low time complexity, processing by the Fourier transform layer can greatly reduce the calculation time when the protein sequence is relatively long, thereby improving model training efficiency.
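The two one-dimensional transforms described above together amount to a two-dimensional discrete Fourier transform of the L × D embedding matrix. The sketch below uses a naive O(n²) transform so that it depends only on the standard library; a practical layer would use a fast Fourier transform, which is where the low time complexity comes from. Keeping only the real part of the result is an assumption borrowed from FNet-style mixing layers; the disclosure does not state how complex values are handled. The names `dft` and `fourier_mixing` are hypothetical.

```python
import cmath
from typing import List, Sequence


def dft(values: Sequence[complex]) -> List[complex]:
    """Naive one-dimensional discrete Fourier transform (O(n^2))."""
    n = len(values)
    return [
        sum(values[k] * cmath.exp(-2j * cmath.pi * f * k / n) for k in range(n))
        for f in range(n)
    ]


def fourier_mixing(embeddings: Sequence[Sequence[float]]) -> List[List[float]]:
    """Apply a 1D DFT along the sequence dimension, then along the hidden dimension."""
    seq_len, hidden = len(embeddings), len(embeddings[0])
    # Sequence dimension: transform each column; along_sequence[d][i] is column d, row i.
    along_sequence = [
        dft([embeddings[i][d] for i in range(seq_len)]) for d in range(hidden)
    ]
    # Hidden dimension: transform each row of the intermediate result.
    along_hidden = [
        dft([along_sequence[d][i] for d in range(hidden)]) for i in range(seq_len)
    ]
    # Assumption: keep only the real part so later layers see real values.
    return [[z.real for z in row] for row in along_hidden]


mixed = fourier_mixing([[1.0, 2.0], [3.0, 4.0]])  # a 2 x 2 toy embedding matrix
```

Because the two-dimensional transform is separable, applying the hidden-dimension transform first would give the same result; the order here simply follows the description. The output keeps the L × D shape, so it can be fed directly into the Add & Norm and feed-forward stages.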

After obtaining the predicted protein representation corresponding to the current sample set, inputting the sample gene ontology features corresponding to the current sample set into a preset feature extraction model to obtain gene feature information output by the preset feature extraction model; inputting the gene characteristic information and the predicted protein representation corresponding to the current sample set into the relation determination model to obtain predicted relation information output by the relation determination model; and determining a second loss value corresponding to the current sample set according to the prediction relation information and the sample relation information corresponding to the current sample set. The preset feature extraction model may be a feature extraction model mature in the prior art, for example, the preset feature extraction model may be a BioBERT model.

For example, the sample gene ontology feature corresponding to the current sample set may be input into the preset feature extraction model to obtain the gene feature information corresponding to the current sample set, and the gene feature information and the predicted protein representation corresponding to the current sample set may be input into the relationship determination model to obtain the predicted relationship information corresponding to the current sample set. And then, determining a second loss value corresponding to the current sample set according to the prediction relation information and the sample relation information corresponding to the current sample set through a cross entropy loss function.
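The second loss value can be illustrated with a small multi-layer perceptron followed by a cross-entropy loss. This is a sketch under several assumptions that the disclosure does not specify: the predicted protein representation is mean-pooled over the sequence dimension, the pooled vector is concatenated with the gene feature vector, the perceptron has one hidden layer with ReLU activation, and the relation types are treated as classes. All weights and inputs are hypothetical toy values.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def linear(weights: Matrix, bias: Vector, x: Vector) -> Vector:
    """Affine map: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relation_logits(protein_rep: Matrix, gene_feature: Vector,
                    w1: Matrix, b1: Vector, w2: Matrix, b2: Vector) -> Vector:
    """Score each relation class from a protein representation and a gene feature."""
    dim = len(protein_rep[0])
    # Assumption: mean-pool the L x D representation into a single D-vector.
    pooled = [sum(row[d] for row in protein_rep) / len(protein_rep) for d in range(dim)]
    hidden = [max(0.0, h) for h in linear(w1, b1, pooled + gene_feature)]  # ReLU
    return linear(w2, b2, hidden)


def cross_entropy(logits: Vector, target_index: int) -> float:
    """Cross-entropy between softmax(logits) and a one-hot target."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_z - logits[target_index]


# Toy inputs: L = 2, D = 2, a 1-dimensional gene feature, two relation classes.
protein_rep = [[1.0, 0.0], [0.0, 1.0]]
gene_feature = [1.0]
w1, b1 = [[1.0, 1.0, 0.0], [0.0, 0.0, 1.0]], [0.0, 0.0]
w2, b2 = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
logits = relation_logits(protein_rep, gene_feature, w1, b1, w2, b2)
second_loss = cross_entropy(logits, target_index=0)
```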

After determining the first loss value and the second loss value, the target loss value may be determined based on the first loss value and the second loss value. For example, the sum of the first loss value and the second loss value may be taken as the target loss value, or different weights may be set in advance for the first loss value and the second loss value, for example, the first weight corresponding to the first loss value is 0.6, the second weight corresponding to the second loss value is 0.4, the product of the first loss value and the first weight is calculated to obtain a first target loss value, the product of the second loss value and the second weight is calculated to obtain a second target loss value, and the sum of the first target loss value and the second target loss value is taken as the target loss value.
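Both ways of combining the loss values described above reduce to a weighted sum. A minimal sketch follows; the function name `target_loss` and the example loss values 2.0 and 1.0 are hypothetical, while the weights 0.6 and 0.4 are the ones given in the text.

```python
def target_loss(first_loss: float, second_loss: float,
                first_weight: float = 1.0, second_weight: float = 1.0) -> float:
    """Weighted sum of the two loss values; unit weights give the plain sum."""
    return first_weight * first_loss + second_weight * second_loss


plain_sum = target_loss(2.0, 1.0)          # sum of the two loss values
weighted = target_loss(2.0, 1.0, 0.6, 0.4) # 0.6 * 2.0 + 0.4 * 1.0
```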

After determining the target loss value, a preset loss value threshold may be obtained. In the case that the target loss value is less than or equal to the preset loss value threshold, it may be determined that the current protein representation model meets the preset iteration stopping condition, and the current protein representation model is taken as the target protein representation model. In the case that the target loss value is greater than the preset loss value threshold, it may be determined that the current protein representation model does not meet the preset iteration stopping condition; the parameters of the current protein representation model are then updated according to the target loss value to obtain a trained current protein representation model, the trained current protein representation model is taken as a new current protein representation model, and a new current sample set is determined from the plurality of sample sets, for example at random.

After obtaining the new current protein representation model and the new current sample set, the model training step may be continued until it is determined that the target loss value is less than or equal to the preset loss value threshold, and the finally determined new current protein representation model is taken as the target protein representation model.
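The loop described above can be sketched as follows. The callbacks `compute_losses` and `update_parameters` are hypothetical placeholders for the two loss computations and the parameter update. The `max_steps` guard is an added safety measure that is not part of the disclosure, and the plain sum is used as the target loss value for simplicity.

```python
import random
from typing import Callable, Sequence, Tuple, TypeVar

Sample = TypeVar("Sample")


def train_until_threshold(
    sample_sets: Sequence[Sample],
    compute_losses: Callable[[Sample], Tuple[float, float]],
    update_parameters: Callable[[float], None],
    loss_threshold: float,
    max_steps: int = 10_000,  # assumption: guard against a loop that never converges
    seed: int = 0,
) -> int:
    """Repeat the model training step until the target loss reaches the threshold.

    Returns the number of training steps that were executed.
    """
    rng = random.Random(seed)
    current = rng.choice(sample_sets)  # any sample set may serve as the first one
    for step in range(1, max_steps + 1):
        first_loss, second_loss = compute_losses(current)
        target = first_loss + second_loss
        if target <= loss_threshold:   # preset iteration stopping condition met
            return step
        update_parameters(target)      # update the current model from the target loss
        current = rng.choice(sample_sets)  # pick a new current sample set at random
    raise RuntimeError("stopping condition not met within max_steps")


# Toy model: a single scale factor that is halved on every update.
state = {"scale": 1.0}


def toy_losses(sample: float) -> Tuple[float, float]:
    return state["scale"] * sample, state["scale"] * sample


def toy_update(target_loss_value: float) -> None:
    state["scale"] *= 0.5


steps = train_until_threshold([1.0, 1.0, 1.0], toy_losses, toy_update, loss_threshold=0.5)
```

In the toy run the target loss is 2.0, then 1.0, then 0.5, so the loop stops on the third step after two updates.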

FIG. 4 is a schematic diagram of model training according to an exemplary embodiment of the present disclosure. As shown in FIG. 4, a sample protein sequence is input into the current protein representation model to obtain a predicted protein representation corresponding to the sample protein sequence, and the first loss value is determined. The sample gene ontology feature is input into the preset feature extraction model to obtain gene feature information corresponding to the sample protein sequence. The predicted protein representation and the gene feature information are then input into the relationship determination model, and the second loss value is determined from the output of the relationship determination model and the sample relationship information corresponding to the sample protein sequence.

In one possible implementation manner, in a case that it is determined that the current protein representation model does not meet the preset stopping iteration condition according to the target loss value, updating parameters of the relationship determination model according to the target loss value, obtaining a trained relationship determination model, and taking the trained relationship determination model as a new relationship determination model. In an example, when the target loss value is greater than the preset loss value threshold, the parameters of the relation determination model are updated synchronously according to the target loss value to obtain a new relation determination model, and the second loss value is determined by the new relation determination model when the model training step is performed next time. In this way, more accurate prediction relation information can be obtained, so that the second loss value determined according to the prediction relation information is more accurate, and the accuracy of the target protein representation model is further improved.

With this model training method, the sample gene ontology features also participate in the training of the target protein representation model, so that the protein representation capability of the target protein representation model is stronger and its accuracy is improved. In addition, during model training, the dimension information of the sample protein sequence is determined through the Fourier transform layer, which greatly reduces the amount of calculation and improves model training efficiency.

FIG. 5 is a block diagram of an apparatus for determining protein information, as shown in FIG. 5, according to an exemplary embodiment of the present disclosure, which may include:

a first acquisition module 501 for acquiring a target protein sequence of protein information to be determined;

a second acquisition module 502 for inputting the target protein sequence into a target protein representation model, so as to acquire a target protein representation output by the target protein representation model;

a determining module 503 for determining protein information of the target protein sequence from the target protein representation, the protein information comprising at least one of protein structure information, protein function information, protein stability information, and protein interaction information;

Optionally, fig. 6 is a block diagram of another apparatus for determining protein information according to an exemplary embodiment of the present disclosure, as shown in fig. 6, the apparatus further comprising:

a model training module 504 for obtaining a plurality of the sample sets; determining a current sample set from a plurality of sample sets, taking a preset protein representation model as a current protein representation model, and circularly executing a model training step according to the current sample set until the trained current protein representation model is determined to meet a preset stopping iteration condition, and taking the trained current protein representation model as the target protein representation model; the model training step comprises the following steps: determining a first loss value corresponding to the current sample set through the current protein representation model; determining a second loss value corresponding to the current sample set through a pre-generated relation determination model; determining a target loss value according to the first loss value and the second loss value; under the condition that the current protein representation model does not meet the preset iteration stopping condition according to the target loss value, updating parameters of the current protein representation model according to the target loss value to obtain a trained current protein representation model, and taking the trained current protein representation model as a new current protein representation model.

Optionally, the model training module 504 is further configured to:

inputting a sample protein sequence corresponding to the current sample set into the current protein representation model to obtain a predicted protein representation output by the current protein representation model;

and determining a first loss value corresponding to the current sample set according to the predicted protein representation and the sample protein representation corresponding to the current sample set.

Optionally, the model training module 504 is further configured to:

inputting the sample gene ontology features corresponding to the current sample set into a preset feature extraction model to obtain gene feature information output by the preset feature extraction model;

inputting the gene characteristic information and the predicted protein representation corresponding to the current sample set into the relation determination model to obtain predicted relation information output by the relation determination model;

and determining a second loss value corresponding to the current sample set according to the prediction relation information and the sample relation information corresponding to the current sample set.

Optionally, the current protein representation model includes a fourier transform layer and a protein representation determination layer, the model training module 504 is further configured to:

inputting the sample protein sequence corresponding to the current sample set into the current protein representation model, determining sample dimension information of the sample protein sequence corresponding to the current sample set through the Fourier transform layer, and determining the predicted protein representation through the protein representation determination layer according to the sample dimension information.

Optionally, the model training module 504 is further configured to:

under the condition that the current protein representation model does not meet the preset iteration stopping condition according to the target loss value, updating parameters of the relation determination model according to the target loss value to obtain a trained relation determination model, and taking the trained relation determination model as a new relation determination model.

With the above apparatus, the sample sets used to train the target protein representation model include, in addition to the sample protein sequence and its corresponding sample protein representation, the sample gene ontology features and the sample relationship information corresponding to the sample protein sequence. The sample gene ontology features and the sample relationship information improve the representation capability of the target protein representation model, so that the model is more accurate, the target protein representation it outputs is more accurate, and the protein information determined from that representation is more accurate as well.

The specific manner in which the various modules of the apparatus in the above embodiments perform their operations has been described in detail in connection with the method embodiments and will not be repeated here.

Referring now to fig. 7, a schematic diagram of an electronic device 600 suitable for use in implementing embodiments of the present disclosure is shown. The terminal devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), in-vehicle terminals (e.g., in-vehicle navigation terminals), and the like, and stationary terminals such as digital TVs, desktop computers, and the like. The electronic device shown in fig. 7 is merely an example and should not be construed to limit the functionality and scope of use of the disclosed embodiments.

As shown in fig. 7, the electronic device 600 may include a processing means (e.g., a central processing unit, a graphic processor, etc.) 601, which may perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage means 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data required for the operation of the electronic apparatus 600 are also stored. The processing device 601, the ROM 602, and the RAM 603 are connected to each other through a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.

In general, the following devices may be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, and the like; an output device 607 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 608 including, for example, magnetic tape, hard disk, etc.; and a communication device 609. The communication means 609 may allow the electronic device 600 to communicate with other devices wirelessly or by wire to exchange data. While fig. 7 shows an electronic device 600 having various means, it is to be understood that not all of the illustrated means are required to be implemented or provided. More or fewer devices may be implemented or provided instead.

In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a non-transitory computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via communication means 609, or from storage means 608, or from ROM 602. The above-described functions defined in the methods of the embodiments of the present disclosure are performed when the computer program is executed by the processing device 601.

It should be noted that the computer readable medium described in the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. 
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, fiber optic cables, RF (radio frequency), and the like, or any suitable combination of the foregoing.

In some implementations, clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed networks.

The computer readable medium may be contained in the electronic device; or may exist alone without being incorporated into the electronic device.

The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquiring a target protein sequence of protein information to be determined; inputting the target protein sequence into a target protein representation model to obtain a target protein representation output by the target protein representation model; determining protein information of the target protein sequence from the target protein representation, the protein information including at least one of protein structure information, protein function information, protein stability information, and protein interaction information; the target protein representation model is pre-generated through a plurality of sample sets, the sample sets comprise sample protein sequences and sample sequence information corresponding to the sample protein sequences, the sample sequence information comprises sample protein representations, sample gene ontology features and sample relationship information, and the sample relationship information is used for representing the relationship between the sample protein sequences and the sample gene ontology features.

Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages, including, but not limited to, object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).

The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The modules described in the embodiments of the present disclosure may be implemented in software or hardware. In some cases, the name of a module does not limit the module itself; for example, the first acquisition module may also be described as "a module for acquiring a target protein sequence of protein information to be determined".

The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), an Application Specific Standard Product (ASSP), a system on a chip (SOC), a Complex Programmable Logic Device (CPLD), and the like.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

In accordance with one or more embodiments of the present disclosure, example 1 provides a method of determining protein information, comprising: acquiring a target protein sequence of protein information to be determined; inputting the target protein sequence into a target protein representation model to obtain a target protein representation output by the target protein representation model; determining protein information of the target protein sequence from the target protein representation, the protein information including at least one of protein structure information, protein function information, protein stability information, and protein interaction information; the target protein representation model is pre-generated through a plurality of sample sets, the sample sets comprise sample protein sequences and sample sequence information corresponding to the sample protein sequences, the sample sequence information comprises sample protein representations, sample gene ontology features and sample relationship information, and the sample relationship information is used for representing the relationship between the sample protein sequences and the sample gene ontology features.
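The three steps of example 1 (acquire a sequence, obtain its representation, determine protein information from the representation) can be sketched as a small pipeline. The stand-ins below are purely hypothetical: an amino-acid frequency vector takes the place of the trained target protein representation model, and a placeholder scoring function takes the place of a real downstream predictor. Neither reflects the models described in the source.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def represent(sequence):
    """Stand-in for the target protein representation model: residue frequencies."""
    counts = [sequence.count(a) for a in AMINO_ACIDS]
    total = sum(counts) or 1
    return [c / total for c in counts]


def determine_protein_info(representation, heads):
    """Apply one downstream head per kind of protein information."""
    return {name: head(representation) for name, head in heads.items()}


heads = {
    # Placeholder score (fraction of a fixed residue subset); not a real predictor.
    "stability": lambda rep: sum(rep[AMINO_ACIDS.index(a)] for a in "AILMFVW"),
}

target_sequence = "MKVLA"                        # step 1: acquire the target sequence
target_representation = represent(target_sequence)   # step 2: obtain the representation
info = determine_protein_info(target_representation, heads)  # step 3: protein information
```

Further heads for structure, function, or interaction information would be added to `heads` in the same way, all sharing the one representation.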

According to one or more embodiments of the present disclosure, example 2 provides the method of example 1, the target protein representation model being pre-generated by: acquiring a plurality of sample sets; determining a current sample set from a plurality of sample sets, taking a preset protein representation model as a current protein representation model, and circularly executing a model training step according to the current sample set until the trained current protein representation model is determined to meet a preset iteration stopping condition, and taking the trained current protein representation model as the target protein representation model; the model training step comprises the following steps: determining a first loss value corresponding to the current sample set through the current protein representation model; determining a second loss value corresponding to the current sample set through a pre-generated relation determination model; determining a target loss value according to the first loss value and the second loss value; and under the condition that the current protein representation model does not meet the preset iteration stopping condition according to the target loss value, updating parameters of the current protein representation model according to the target loss value to obtain a trained current protein representation model, taking the trained current protein representation model as a new current protein representation model, and determining a new current sample set from a plurality of sample sets.
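The loop in example 2 can be sketched with a toy model. All specifics are assumptions for illustration: the "model" is a single scalar parameter, both losses are squared errors (the second is only a stand-in for the relation-based loss), the target loss is their weighted sum, the stopping condition is a loss threshold with a step cap, and the update is plain gradient descent.

```python
import random


def target_loss(first_loss, second_loss, weight=1.0):
    """Combine both losses; a weighted sum is an assumption, not stated in the source."""
    return first_loss + weight * second_loss


def train(sample_sets, param=0.0, lr=0.1, threshold=1e-6, max_steps=500, seed=0):
    """Toy training loop over (input, representation label, relation label) triples."""
    rng = random.Random(seed)
    current = rng.choice(sample_sets)              # determine a current sample set
    loss = float("inf")
    for _ in range(max_steps):
        x, rep_label, rel_label = current
        pred = param * x                           # predicted protein representation
        loss_1 = (pred - rep_label) ** 2           # first loss value
        loss_2 = (pred - rel_label) ** 2           # second loss value (stand-in)
        loss = target_loss(loss_1, loss_2)
        if loss < threshold:                       # preset iteration stopping condition
            break
        grad = 2 * (pred - rep_label) * x + 2 * (pred - rel_label) * x
        param -= lr * grad                         # update model parameters
        current = rng.choice(sample_sets)          # determine a new current sample set
    return param, loss


# Hypothetical sample sets, all consistent with param == 2.0.
trained_param, final_loss = train([(1.0, 2.0, 2.0), (0.5, 1.0, 1.0)])
```

The parameter returned once the stopping condition holds plays the role of the target protein representation model.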

According to one or more embodiments of the present disclosure, example 3 provides the method of example 2, the determining, by the current protein representation model, a first loss value corresponding to the current sample set comprising: inputting a sample protein sequence corresponding to the current sample set into the current protein representation model to obtain a predicted protein representation output by the current protein representation model; and determining a first loss value corresponding to the current sample set according to the predicted protein representation and the sample protein representation corresponding to the current sample set.

According to one or more embodiments of the present disclosure, example 4 provides the method of example 3, the determining, by a pre-generated relationship determination model, a second loss value corresponding to the current sample set comprising: inputting the sample gene ontology features corresponding to the current sample set into a preset feature extraction model to obtain gene feature information output by the preset feature extraction model; inputting the gene characteristic information and the predicted protein representation corresponding to the current sample set into the relation determination model to obtain predicted relation information output by the relation determination model; and determining a second loss value corresponding to the current sample set according to the prediction relation information and the sample relation information corresponding to the current sample set.

In accordance with one or more embodiments of the present disclosure, example 5 provides the method of example 3, the current protein representation model including a Fourier transform layer and a protein representation determination layer, the inputting the sample protein sequence corresponding to the current sample set into the current protein representation model to obtain the predicted protein representation output by the current protein representation model comprising: inputting the sample protein sequence corresponding to the current sample set into the current protein representation model, determining sample dimension information of the sample protein sequence corresponding to the current sample set through the Fourier transform layer, and determining the predicted protein representation through the protein representation determination layer according to the sample dimension information.

According to one or more embodiments of the present disclosure, example 6 provides the method of example 2, the method further comprising: when it is determined according to the target loss value that the current protein representation model does not meet the preset iteration stopping condition, updating parameters of the relation determination model according to the target loss value to obtain a trained relation determination model, and taking the trained relation determination model as a new relation determination model.

In accordance with one or more embodiments of the present disclosure, example 7 provides an apparatus for determining protein information, comprising: the first acquisition module is used for acquiring a target protein sequence of protein information to be determined; the second acquisition module is used for inputting the target protein sequence into a target protein representation model so as to acquire a target protein representation output by the target protein representation model; a determining module for determining protein information of the target protein sequence from the target protein representation, the protein information comprising at least one of protein structure information, protein function information, protein stability information, and protein interaction information; the target protein representation model is pre-generated through a plurality of sample sets, the sample sets comprise sample protein sequences and sample sequence information corresponding to the sample protein sequences, the sample sequence information comprises sample protein representations, sample gene ontology features and sample relationship information, and the sample relationship information is used for representing the relationship between the sample protein sequences and the sample gene ontology features.

According to one or more embodiments of the present disclosure, example 8 provides the apparatus of example 7, further comprising a model training module for: acquiring a plurality of the sample sets; determining a current sample set from the plurality of sample sets, taking a preset protein representation model as a current protein representation model, and cyclically executing a model training step according to the current sample set until the trained current protein representation model is determined to meet a preset iteration stopping condition, and then taking the trained current protein representation model as the target protein representation model; the model training step comprises: determining a first loss value corresponding to the current sample set through the current protein representation model; determining a second loss value corresponding to the current sample set through a pre-generated relation determination model; determining a target loss value according to the first loss value and the second loss value; and, when it is determined according to the target loss value that the current protein representation model does not meet the preset iteration stopping condition, updating parameters of the current protein representation model according to the target loss value to obtain a trained current protein representation model, and taking the trained current protein representation model as a new current protein representation model.

According to one or more embodiments of the present disclosure, example 9 provides a computer-readable medium having stored thereon a computer program which, when executed by a processing device, implements the steps of the method of any one of examples 1 to 6.

In accordance with one or more embodiments of the present disclosure, example 10 provides an electronic device, comprising: a storage device having a computer program stored thereon; processing means for executing the computer program in the storage means to implement the steps of the method of any one of examples 1 to 6.

The foregoing description is only of the preferred embodiments of the present disclosure and an explanation of the technical principles employed. Persons skilled in the art will appreciate that the scope of the present disclosure is not limited to technical solutions formed by the specific combinations of features described above, but also covers other technical solutions formed by any combination of those features or their equivalents without departing from the spirit of the disclosure, for example, technical solutions formed by substituting the above features with technical features having similar functions disclosed in (but not limited to) the present disclosure.

Moreover, although operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of the present disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.

Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are example forms of implementing the claims.

Claims

1. A method of determining protein information, the method comprising:

acquiring a target protein sequence of protein information to be determined;

the target protein representation model is pre-generated through a plurality of sample sets, the sample sets comprise sample protein sequences and sample sequence information corresponding to the sample protein sequences, the sample sequence information comprises sample protein representations, sample gene ontology features and sample relationship information, and the sample relationship information is used for representing the relationship between the sample protein sequences and the sample gene ontology features;

the target protein representation model is pre-generated by:

acquiring a plurality of sample sets;

determining a current sample set from a plurality of sample sets, taking a preset protein representation model as a current protein representation model, and circularly executing a model training step according to the current sample set until the trained current protein representation model is determined to meet a preset iteration stopping condition, and taking the trained current protein representation model as the target protein representation model;

the model training step comprises the following steps:

determining a first loss value corresponding to the current sample set through the current protein representation model;

determining a second loss value corresponding to the current sample set through a pre-generated relation determination model;

determining a target loss value according to the first loss value and the second loss value;

under the condition that the current protein representation model does not meet the preset iteration stopping condition according to the target loss value, updating parameters of the current protein representation model according to the target loss value to obtain a trained current protein representation model, taking the trained current protein representation model as a new current protein representation model, and determining a new current sample set from a plurality of sample sets;

wherein the current protein representation model includes a Fourier transform layer and a protein representation determination layer, and determining, by the current protein representation model, a first loss value corresponding to the current sample set includes:

inputting a sample protein sequence corresponding to the current sample set into the current protein representation model, determining sample dimension information of the sample protein sequence corresponding to the current sample set through the Fourier transform layer, and determining a predicted protein representation through the protein representation determination layer according to the sample dimension information;

determining the first loss value from the predicted protein representation;

the determining, by the Fourier transform layer, sample dimension information of a sample protein sequence corresponding to the current sample set includes:

performing, through the Fourier transform layer, a one-dimensional Fourier transform on the sequence dimension and the hidden dimension of the sample protein sequence corresponding to the current sample set to obtain the sample dimension information.

2. The method of claim 1, wherein determining, by the current protein representation model, a first loss value corresponding to the current sample set comprises:

3. The method of claim 2, wherein determining, by the pre-generated relationship determination model, a second loss value corresponding to the current sample set comprises:

4. The method according to claim 1, wherein the method further comprises:

when it is determined according to the target loss value that the current protein representation model does not meet the preset iteration stopping condition, updating parameters of the relation determination model according to the target loss value to obtain a trained relation determination model, and taking the trained relation determination model as a new relation determination model.

5. An apparatus for determining protein information, the apparatus comprising:

the apparatus further comprises a model training module, wherein the model training module is used for acquiring a plurality of sample sets; determining a current sample set from the plurality of sample sets, taking a preset protein representation model as a current protein representation model, and cyclically executing a model training step according to the current sample set until the trained current protein representation model is determined to meet a preset iteration stopping condition, and then taking the trained current protein representation model as the target protein representation model; the model training step comprises: determining a first loss value corresponding to the current sample set through the current protein representation model; determining a second loss value corresponding to the current sample set through a pre-generated relation determination model; determining a target loss value according to the first loss value and the second loss value; and, when it is determined according to the target loss value that the current protein representation model does not meet the preset iteration stopping condition, updating parameters of the current protein representation model according to the target loss value to obtain a trained current protein representation model, and taking the trained current protein representation model as a new current protein representation model;

determining the first loss value from the predicted protein representation;

6. A computer readable medium on which a computer program is stored, characterized in that the program, when being executed by a processing device, carries out the steps of the method according to any one of claims 1-4.

7. An electronic device, comprising:

a storage device having a computer program stored thereon;

processing means for executing said computer program in said storage means to carry out the steps of the method according to any one of claims 1-4.