CN117672364A - Method, equipment and medium for predicting protein mutation stability

Method, equipment and medium for predicting protein mutation stability

Info

Publication number
CN117672364A
Authority
CN
China
Prior art keywords
protein
mutation
information
training
data
Prior art date
Legal status
Granted
Application number
CN202311758039.5A
Other languages
Chinese (zh)
Other versions
CN117672364B (en)
Inventor
许锦波
井晓阳
王效涛
王腾龙
谭无为
Current Assignee
Shanghai Molecular Heart Intelligent Technology Co., Ltd.
Original Assignee
Shanghai Molecular Heart Intelligent Technology Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Shanghai Molecular Heart Intelligent Technology Co., Ltd.
Priority to CN202311758039.5A
Publication of CN117672364A
Application granted
Publication of CN117672364B
Legal status: Active
Anticipated expiration

Abstract

It is an object of the present application to provide a method, device and medium for predicting protein mutation stability. The method comprises: acquiring a pre-trained protein language model and a pre-trained protein structure model; constructing a graph neural network based on protein mutation training data, the pre-trained protein language model and the pre-trained protein structure model; training a protein mutation stability prediction model based on protein fitness data and protein mutation free energy change data, in combination with the graph neural network; and determining, based on target protein mutation data, the protein mutation stability information corresponding to the target protein mutation data by using the pre-trained protein language model, the pre-trained protein structure model, the graph neural network and the protein mutation stability prediction model. By fusing information comprehensively, the method effectively integrates multiple kinds of feature data to predict the stability change caused by a protein mutation, and effectively improves the comprehensiveness and accuracy of predicting the influence of protein mutations.

Description

Method, equipment and medium for predicting protein mutation stability
Technical Field
The application relates to the technical field of bioinformatics, in particular to a technology for predicting protein mutation stability.
Background
At present, the change in thermal stability of a protein after mutation can be determined by thermodynamic experiments, such as differential scanning calorimetry or spectroscopic methods. Such experiments provide direct observation of changes in the folding state of the protein, but they have long experimental cycles and high cost. The stability change caused by a mutation can also be predicted by using machine learning and related techniques to learn the rules of protein mutation from large-scale data of known protein mutants and their corresponding stability changes. This approach is highly efficient, but its prediction accuracy is low.
Disclosure of Invention
It is an object of the present application to provide a method, apparatus and medium for predicting protein mutation stability.
According to one aspect of the present application, there is provided a method for predicting protein mutation stability, the method comprising:
acquiring a pre-trained protein language model and a pre-trained protein structure model;
constructing a graph neural network based on protein mutation training data, the pre-trained protein language model and the pre-trained protein structure model, wherein the protein mutation training data comprises first protein sequence information, and structure information and mutation information corresponding to the first protein sequence information;
training a protein mutation stability prediction model based on protein fitness data and protein mutation free energy change data, in combination with the graph neural network; and
determining, based on target protein mutation data, protein mutation stability information corresponding to the target protein mutation data by using the pre-trained protein language model, the pre-trained protein structure model, the graph neural network and the protein mutation stability prediction model, wherein the target protein mutation data comprises target protein sequence information, and structure information and mutation information corresponding to the target protein sequence information.
According to another aspect of the present application, there is provided a method for constructing a protein mutation stability prediction model, the method comprising:
acquiring a pre-trained protein language model and a pre-trained protein structure model;
constructing a graph neural network based on protein mutation training data, the pre-trained protein language model and the pre-trained protein structure model, wherein the protein mutation training data comprises first protein sequence information, and structure information and mutation information corresponding to the first protein sequence information; and
training a protein mutation stability prediction model based on protein fitness data and protein mutation free energy change data, in combination with the graph neural network.
According to one aspect of the present application, there is provided a computer device for predicting protein mutation stability, comprising a memory, a processor and a computer program stored on the memory, wherein the processor executes the computer program to carry out the steps of any one of the methods described above.
According to one aspect of the present application, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of any one of the methods described above.
According to one aspect of the present application, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the steps of any one of the methods described above.
According to one aspect of the present application, there is provided an apparatus for predicting protein mutation stability, the apparatus comprising:
a one-one module 11, configured to acquire a pre-trained protein language model and a pre-trained protein structure model;
a one-two module 12, configured to construct a graph neural network based on protein mutation training data, the pre-trained protein language model and the pre-trained protein structure model, wherein the protein mutation training data comprises first protein sequence information, and structure information and mutation information corresponding to the first protein sequence information;
a one-three module 13, configured to train a protein mutation stability prediction model based on protein fitness data and protein mutation free energy change data, in combination with the graph neural network;
and a one-four module 14, configured to determine, based on target protein mutation data, protein mutation stability information corresponding to the target protein mutation data by using the pre-trained protein language model, the pre-trained protein structure model, the graph neural network and the protein mutation stability prediction model, wherein the target protein mutation data comprises target protein sequence information, and structure information and mutation information corresponding to the target protein sequence information.
According to one aspect of the present application, there is provided an apparatus for constructing a protein mutation stability prediction model, the apparatus comprising:
a two-one module 21, configured to acquire a pre-trained protein language model and a pre-trained protein structure model;
a two-two module 22, configured to construct a graph neural network based on protein mutation training data, the pre-trained protein language model and the pre-trained protein structure model, wherein the protein mutation training data comprises first protein sequence information, and structure information and mutation information corresponding to the first protein sequence information;
and a two-three module 23, configured to train a protein mutation stability prediction model based on protein fitness data and protein mutation free energy change data, in combination with the graph neural network.
Compared with the prior art, the present application acquires a pre-trained protein language model and a pre-trained protein structure model; constructs a graph neural network based on protein mutation training data, the pre-trained protein language model and the pre-trained protein structure model, wherein the protein mutation training data comprises first protein sequence information, and structure information and mutation information corresponding to the first protein sequence information; trains a protein mutation stability prediction model based on protein fitness data and protein mutation free energy change data, in combination with the graph neural network; and determines, based on target protein mutation data, the corresponding protein mutation stability information by using the pre-trained protein language model, the pre-trained protein structure model, the graph neural network and the protein mutation stability prediction model, wherein the target protein mutation data comprises target protein sequence information, and structure information and mutation information corresponding to the target protein sequence information. By comprehensively using a pre-trained protein language model, a pre-trained protein structure model and a graph neural network, the method effectively integrates protein fitness data and protein mutation free energy change data, establishes a link from protein sequence and structure to stability, accurately evaluates the mutual influence among amino acid residues, predicts the stability change caused by a mutation, and effectively improves the comprehensiveness and accuracy of predicting the influence of protein mutations.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the detailed description of non-limiting embodiments, made with reference to the following drawings, in which:
FIG. 1 shows a flow chart of a method for predicting protein mutation stability in accordance with one embodiment of the present application;
FIG. 2 shows a flow chart for predicting protein mutation stability according to one embodiment of the present application;
FIG. 3 illustrates a flow chart of a method for constructing a protein mutation stability prediction model according to an embodiment of the present application;
FIG. 4 shows a block diagram of an apparatus for predicting protein mutation stability in accordance with one embodiment of the present application;
FIG. 5 shows a block diagram of an apparatus for constructing a protein mutation stability prediction model according to an embodiment of the present application;
FIG. 6 illustrates an exemplary system that may be used to implement various embodiments described herein.
The same or similar reference numbers in the drawings refer to the same or similar parts.
Detailed Description
The present application is described in further detail below with reference to the accompanying drawings.
In one typical configuration of the present application, the terminal, the devices of the service network, and the trusted party each include one or more processors (e.g., central processing units (Central Processing Unit, CPU)), input/output interfaces, network interfaces, and memory.
The memory may include non-permanent memory, random access memory (Random Access Memory, RAM) and/or non-volatile memory in a computer-readable medium, such as read-only memory (Read Only Memory, ROM) or flash memory (Flash Memory). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (Phase-Change Memory, PCM), programmable random access memory (Programmable Random Access Memory, PRAM), static random access memory (Static Random Access Memory, SRAM), dynamic random access memory (Dynamic Random Access Memory, DRAM), other types of random access memory (RAM), read-only memory (Read-Only Memory, ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (Compact Disc Read-Only Memory, CD-ROM), digital versatile discs (Digital Versatile Disc, DVD) or other optical storage, magnetic cassettes, magnetic tape storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
The device referred to in the present application includes, but is not limited to, a user device, a network device, or a device formed by integrating a user device and a network device through a network. The user device includes, but is not limited to, any mobile electronic product capable of human-computer interaction with a user (for example, through a touch pad), such as a smart phone or a tablet computer, and the mobile electronic product may adopt any operating system, such as the Android operating system or the iOS operating system. The network device includes an electronic device capable of automatically performing numerical calculation and information processing according to preset or stored instructions, and its hardware includes, but is not limited to, a microprocessor, an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a programmable logic device (Programmable Logic Device, PLD), a field-programmable gate array (Field Programmable Gate Array, FPGA), a digital signal processor (Digital Signal Processor, DSP), an embedded device, and the like. The network device includes, but is not limited to, a computer, a network host, a single network server, a set of multiple network servers, or a cloud of servers; here, the cloud is composed of a large number of computers or network servers based on cloud computing (Cloud Computing), where cloud computing is a kind of distributed computing: a virtual supercomputer composed of a group of loosely coupled computers. The network includes, but is not limited to, the Internet, wide area networks, metropolitan area networks, local area networks, VPN networks, wireless ad hoc networks (Ad Hoc networks), and the like. Preferably, the device may also be a program running on the user device, the network device, or a device formed by integrating the user device with the network device, the touch terminal, or the network device with the touch terminal through a network.
Of course, those skilled in the art will appreciate that the above-described devices are merely examples; other existing or future devices, if applicable to the present application, are also intended to fall within the scope of protection of the present application and are incorporated herein by reference.
In the description of the present application, the meaning of "a plurality" is two or more, unless explicitly defined otherwise.
FIG. 1 shows a flow chart of a method for predicting protein mutation stability according to one embodiment of the present application, the method comprising step S11, step S12, step S13 and step S14. In step S11, the device 1 acquires a pre-trained protein language model and a pre-trained protein structure model. In step S12, the device 1 constructs a graph neural network based on protein mutation training data, the pre-trained protein language model and the pre-trained protein structure model, where the protein mutation training data includes first protein sequence information, and structure information and mutation information corresponding to the first protein sequence information. In step S13, the device 1 trains a protein mutation stability prediction model based on protein fitness data and protein mutation free energy change data, in combination with the graph neural network. In step S14, the device 1 determines, based on target protein mutation data, protein mutation stability information corresponding to the target protein mutation data by using the pre-trained protein language model, the pre-trained protein structure model, the graph neural network and the protein mutation stability prediction model, where the target protein mutation data includes target protein sequence information, and structure information and mutation information corresponding to the target protein sequence information.
In step S11, the device 1 acquires a pre-trained protein language model and a pre-trained protein structure model. In some embodiments, the device 1 includes, but is not limited to, a user device or a network device with information processing or computing capabilities, such as a tablet computer, a computer or a server. In some embodiments, the pre-trained protein language model and the pre-trained protein structure model may be trained by the device 1 on the corresponding data, or may be obtained by the device 1 from other devices.
In some embodiments, step S11 includes: step S111 (not shown), in which the device 1 constructs a pre-trained protein language model based on protein sequence data; and step S112 (not shown), in which the device 1 constructs a pre-trained protein structure model based on protein structure data. In some embodiments, the pre-trained protein language model and the pre-trained protein structure model learn the characteristics and rules of proteins from large amounts of protein sequence data and protein structure data, so that information such as protein interactions, functional regions and structures can be predicted.
In some embodiments, step S111 includes: step S1111 (not shown), in which the device 1 acquires the protein sequence data; and step S1112 (not shown), in which the device 1 constructs a pre-trained protein language model through unsupervised training based on the protein sequence data. In some embodiments, during unsupervised training, the device 1 may build the pre-trained protein language model using a masking mechanism. For example, a part of the amino acid positions in each protein sequence in the protein sequence data is masked, and the masked protein sequence is input into the protein language model to predict the types of the masked amino acids, thereby training the model.
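To make the masking mechanism concrete, the following is a minimal sketch in Python/PyTorch of masked amino-acid pre-training. It is illustrative only: the application does not disclose the architecture, vocabulary layout, masking ratio or any hyperparameters, so all of these are assumptions.

```python
# Minimal masked-amino-acid pre-training sketch (illustrative; the actual
# model, vocabulary and masking ratio are not disclosed and are assumed).
import torch
import torch.nn as nn

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MASK_ID = len(AMINO_ACIDS)                  # assumed id of the [MASK] token
VOCAB = len(AMINO_ACIDS) + 1

class TinyProteinLM(nn.Module):
    def __init__(self, d_model=64, nhead=4, nlayers=2):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, nlayers)
        self.head = nn.Linear(d_model, VOCAB)    # predicts the residue type

    def forward(self, tokens):                    # tokens: (batch, length)
        return self.head(self.encoder(self.embed(tokens)))

def mask_tokens(tokens, ratio=0.15):
    """Hide a random fraction of positions; unmasked positions get label
    -100 so the loss is computed only on the masked residues."""
    masked = tokens.clone()
    hide = torch.rand(tokens.shape) < ratio
    labels = torch.where(hide, tokens, torch.full_like(tokens, -100))
    masked[hide] = MASK_ID
    return masked, labels

model = TinyProteinLM()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
batch = torch.randint(0, len(AMINO_ACIDS), (8, 50))    # toy sequence batch
inputs, labels = mask_tokens(batch)
loss = nn.functional.cross_entropy(model(inputs).transpose(1, 2), labels,
                                   ignore_index=-100)
loss.backward()
optimizer.step()
```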
In some embodiments, step S1111 includes: the device 1 acquires original protein sequence data, and preprocesses the original protein sequence data to obtain the protein sequence data. In some embodiments, the device 1 may extract the original protein sequence data from a corresponding protein database (e.g., UniProt, ProteinAtlas or InterPro) or from related literature (e.g., related papers, reports or patent documents). In some embodiments, the preprocessing operations include, but are not limited to, deleting repeated sequences from the original protein sequence data, filtering the original protein sequence data, and normalizing the original protein sequence data. For example, the device 1 may determine similarity information between protein sequences in the original protein sequence data and, based on the similarity information, delete repeated sequences whose similarity is too high. The device 1 may also filter out low-quality protein sequences (e.g., protein sequences whose length does not meet requirements) based on sequence length or the like. To facilitate model processing, the device 1 may also normalize original protein sequence data from different sources (e.g., by one-hot encoding based on the amino acid composition of the protein sequence) and convert it into the same representation.
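A minimal preprocessing sketch along these lines follows. Exact-duplicate removal stands in here for the similarity-based deduplication described above, and the length thresholds are assumptions rather than values from the application.

```python
# Illustrative preprocessing: deduplicate, filter by length and alphabet,
# then one-hot encode (thresholds are assumed for illustration).
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def preprocess(raw_seqs, min_len=30, max_len=1024):
    unique = list(dict.fromkeys(raw_seqs))        # drop exact duplicates
    return [s for s in unique                     # drop low-quality sequences
            if min_len <= len(s) <= max_len and set(s) <= set(AMINO_ACIDS)]

def one_hot(seq):
    """Normalize one sequence into an (L, 20) one-hot matrix."""
    out = np.zeros((len(seq), len(AMINO_ACIDS)), dtype=np.float32)
    out[np.arange(len(seq)), [AA_INDEX[aa] for aa in seq]] = 1.0
    return out

seqs = preprocess(["MKTAYIAKQR" * 4, "MKTAYIAKQR" * 4, "MK"])
print(len(seqs), one_hot(seqs[0]).shape)          # -> 1 (40, 20)
```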
In some embodiments, the protein structure data may be obtained by the device 1 from a corresponding protein structure database (e.g., the PDB (Protein Data Bank)). In some embodiments, the device 1 may also build the pre-trained protein structure model using a masking mechanism: a portion of the structures in the protein structure data is masked, and the model is trained to predict the masked structure information from the known structure information, thereby obtaining the pre-trained protein structure model. The device 1 may also train the model to predict amino acid sequences or protein side-chain structures from the protein backbone structure, thereby obtaining the pre-trained protein structure model.
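As an illustration of this structure-masking objective, the sketch below hides a fraction of Cα coordinates and regresses them from the surrounding context with a small Transformer encoder; the architecture and masking ratio are assumptions, not details from the application.

```python
# Illustrative structure-masking objective: hide some residues' backbone
# coordinates and regress them from context (architecture is assumed).
import torch
import torch.nn as nn

class MaskedCoordModel(nn.Module):
    def __init__(self, d_model=64, nhead=4, nlayers=2):
        super().__init__()
        self.proj_in = nn.Linear(3, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, nlayers)
        self.proj_out = nn.Linear(d_model, 3)

    def forward(self, coords):                    # coords: (batch, L, 3)
        return self.proj_out(self.encoder(self.proj_in(coords)))

coords = torch.randn(2, 60, 3)                    # toy Cα traces
hide = torch.rand(2, 60) < 0.15                   # residues to mask
masked = coords.clone()
masked[hide] = 0.0                                # zero out hidden positions

model = MaskedCoordModel()
pred = model(masked)
loss = ((pred[hide] - coords[hide]) ** 2).mean()  # MSE on masked residues
loss.backward()
```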
In step S12, the device 1 constructs a graph neural network based on protein mutation training data, the pre-trained protein language model and the pre-trained protein structure model, where the protein mutation training data includes first protein sequence information, and structure information and mutation information corresponding to the first protein sequence information. In some embodiments, the device 1 constructs the graph neural network by combining the protein mutation training data, the pre-trained protein language model and the pre-trained protein structure model, so that the graph neural network fuses the pre-trained protein language model, the pre-trained protein structure model, and information such as protein sequence, structure and mutation.
In some embodiments, step S12 includes: step S121 (not shown), in which the device 1 performs information integration based on the protein mutation training data, the pre-trained protein language model and the pre-trained protein structure model to obtain corresponding encoding information, where the protein mutation training data includes first protein sequence information, and structure information and mutation information corresponding to the first protein sequence information; and step S122 (not shown), in which the device 1 constructs a corresponding graph neural network based on the encoding information. For example, the device 1 integrates the protein mutation training data and its outputs from the pre-trained protein language model and the pre-trained protein structure model to obtain the corresponding encoding information, determines a graph structure corresponding to the first protein sequence information based on the encoding information, and trains the graph neural network on this graph structure. The graph structure includes node information and edge information: the node information includes the encoding information, and the edge information includes spatial relationship information between amino acid residues in the first protein sequence (e.g., inter-residue distances). Based on the graph structure, the graph neural network is constructed using self-supervised learning (e.g., using generative models or contrastive learning). For example, part of the content of the node information or edge information is masked, and the graph neural network is trained by predicting the masked content. The graph neural network can refine the input encoding information through a message-passing mechanism, thereby jointly optimizing the fusion of the pre-trained protein language model, the pre-trained protein structure model, and information such as protein sequence, structure and mutation.
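A minimal sketch of such a residue graph and one message-passing update over it follows; the 8 Å contact cutoff and the mean-aggregation update rule are assumptions made for illustration.

```python
# Illustrative residue graph: nodes carry per-residue encodings, edges link
# residues whose Cα atoms lie within a cutoff (cutoff value is assumed).
import torch

def build_graph(coords, cutoff=8.0):
    """coords: (L, 3) Cα positions -> (edge_index, edge_dist)."""
    dist = torch.cdist(coords, coords)             # (L, L) pairwise distances
    adj = (dist < cutoff) & ~torch.eye(len(coords), dtype=torch.bool)
    edge_index = adj.nonzero(as_tuple=False).t()   # (2, E)
    return edge_index, dist[adj]

class MessagePassing(torch.nn.Module):
    """One mean-aggregation message-passing step with a residual update."""
    def __init__(self, dim):
        super().__init__()
        self.lin = torch.nn.Linear(2 * dim, dim)

    def forward(self, h, edge_index):
        src, dst = edge_index
        msg = self.lin(torch.cat([h[src], h[dst]], dim=-1))
        agg = torch.zeros_like(h).index_add_(0, dst, msg)
        deg = torch.zeros(len(h), 1).index_add_(0, dst,
                                                torch.ones(src.shape[0], 1))
        return h + agg / deg.clamp(min=1)

coords, h = torch.randn(60, 3), torch.randn(60, 128)  # toy structure/encodings
edge_index, _ = build_graph(coords)
h = MessagePassing(128)(h, edge_index)                # refined encodings
```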
In some embodiments, step S121 includes: for the first protein sequence information and the structure information corresponding to the first protein sequence information, the device 1 determines corresponding first characterization information and second characterization information using the pre-trained protein language model and the pre-trained protein structure model, respectively; determines corresponding global structure information and local structure information based on the structure information and mutation information corresponding to the first protein sequence information; and determines the corresponding encoding information based on the first characterization information, the second characterization information, the global structure information and the local structure information. For example, the device 1 obtains the first characterization information (an embedding) from the first protein sequence information using the pre-trained protein language model, obtains the second characterization information from the corresponding structure information using the pre-trained protein structure model, and concatenates the first characterization information, the second characterization information, the global structure information and the local structure information to form the encoding information. The global structure information is the backbone information corresponding to the first protein sequence, and the local structure information is the backbone and side-chain information near the mutated residue of the first protein sequence.
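The concatenation step itself is simple; the sketch below shows it with made-up feature dimensions (e.g., 1280 for the language-model embedding), since the application does not disclose the actual sizes.

```python
# Illustrative fusion of the four feature blocks into one per-residue
# encoding; all dimensions are placeholders.
import torch

L = 60                                   # number of residues
seq_emb     = torch.randn(L, 1280)       # from the protein language model
struct_emb  = torch.randn(L, 384)        # from the protein structure model
global_feat = torch.randn(L, 64)         # backbone (global) structure features
local_feat  = torch.randn(L, 64)         # backbone/side-chain features near
                                         # the mutated residue (local)
encoding = torch.cat([seq_emb, struct_emb, global_feat, local_feat], dim=-1)
print(encoding.shape)                    # torch.Size([60, 1792])
```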
In step S13, the device 1 trains a protein mutation stability prediction model based on the protein fitness data and the protein mutation free energy change data, in combination with the graph neural network. In some embodiments, the protein fitness data includes second protein sequence information, structure information and mutation information corresponding to the second protein sequence information, and fitness information of the mutants corresponding to the second protein sequence information. The protein mutation free energy change data includes the mutation free energy change data (ddG) corresponding to the second protein sequence information. In some embodiments, the device 1 may use transfer learning to combine the protein fitness data and the protein mutation free energy change data for model training, capturing the association between protein mutations and both function and stability, to obtain the protein mutation stability prediction model. In some embodiments, the protein mutation stability prediction model comprises a number of nonlinear layers. In some embodiments, the mean square error (MSE) or the Spearman correlation coefficient (Spearman's correlation coefficient, SCC) may be employed as the loss function during model training, with training based on a gradient descent strategy.
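A sketch of one training step is given below: MSE is used as the differentiable training loss and the Spearman coefficient as an evaluation metric (the Spearman coefficient itself is not differentiable, so using it directly as a loss would require a smoothed surrogate). The head's shape and the pooling are assumptions.

```python
# Illustrative training step for the stability predictor: MSE loss for
# gradient descent, Spearman correlation for evaluation.
import torch
import torch.nn as nn
from scipy.stats import spearmanr

head = nn.Sequential(nn.Linear(1792, 256), nn.ReLU(),
                     nn.Linear(256, 1))        # "a number of nonlinear layers"
optimizer = torch.optim.Adam(head.parameters(), lr=1e-4)

enc = torch.randn(32, 1792)        # pooled per-protein encodings (toy batch)
ddg = torch.randn(32)              # measured ddG labels (toy)

pred = head(enc).squeeze(-1)
loss = nn.functional.mse_loss(pred, ddg)
loss.backward()
optimizer.step()

rho, _ = spearmanr(pred.detach().numpy(), ddg.numpy())
print(f"Spearman correlation: {rho:.3f}")
```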
In some embodiments, step S13 includes: the device 1 trains a protein fitness model based on the protein fitness data, in combination with the graph neural network; and then trains the protein fitness model on the protein mutation free energy change data to obtain the protein mutation stability prediction model. For example, the device 1 obtains the encoding information corresponding to the second protein sequence information from the protein fitness data through the pre-trained protein language model, the pre-trained protein structure model and the graph neural network. Using transfer learning, a protein fitness model is first trained on the encoding information corresponding to the second protein sequence information combined with the fitness information, and the protein mutation stability prediction model is then obtained by further training on the ddG information on the basis of the protein fitness model.
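The two-stage transfer-learning recipe can be sketched as follows: a shared trunk is first trained with a fitness head, then reused and fine-tuned with a ddG head. Layer sizes, step counts and learning rates are assumptions.

```python
# Illustrative two-stage transfer learning: stage 1 fits fitness, stage 2
# reuses the trunk and fine-tunes for ddG (all hyperparameters assumed).
import torch
import torch.nn as nn

trunk = nn.Sequential(nn.Linear(1792, 256), nn.ReLU())   # shared encoder top
fit_head, ddg_head = nn.Linear(256, 1), nn.Linear(256, 1)

def train_stage(head, x, y, steps=200):
    params = list(trunk.parameters()) + list(head.parameters())
    opt = torch.optim.Adam(params, lr=1e-4)
    for _ in range(steps):
        opt.zero_grad()
        loss = nn.functional.mse_loss(head(trunk(x)).squeeze(-1), y)
        loss.backward()
        opt.step()

x_fit, y_fit = torch.randn(256, 1792), torch.randn(256)  # fitness data (toy)
x_ddg, y_ddg = torch.randn(64, 1792), torch.randn(64)    # ddG data (toy)

train_stage(fit_head, x_fit, y_fit)   # stage 1: protein fitness model
train_stage(ddg_head, x_ddg, y_ddg)   # stage 2: stability prediction model
```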
In step S14, the device 1 determines, based on the target protein mutation data, protein mutation stability information corresponding to the target protein mutation data by using the pre-trained protein language model, the pre-trained protein structure model, the graph neural network and the protein mutation stability prediction model, where the target protein mutation data includes target protein sequence information, and structure information and mutation information corresponding to the target protein sequence information. For example, the target protein mutation data is passed through the pre-trained protein language model, the pre-trained protein structure model and the graph neural network to obtain the corresponding target encoding information, and the target encoding information is passed through the protein mutation stability prediction model to obtain the corresponding protein mutation stability information. The protein mutation stability information includes the free energy change information before and after mutation of the target protein sequence.
In some embodiments, referring to the flow chart for predicting protein mutation stability shown in FIG. 2, step S14 includes: for the target protein sequence information and the structure information corresponding to the target protein sequence information, the device 1 determines corresponding first target characterization information and second target characterization information using the pre-trained protein language model and the pre-trained protein structure model, respectively; determines corresponding target encoding information using the graph neural network based on the first target characterization information, the second target characterization information and the target protein mutation data; and determines the protein mutation stability information corresponding to the target protein mutation data through the protein mutation stability prediction model based on the target encoding information. Here, the first target characterization information and the second target characterization information are determined in the same or a similar manner as the first characterization information and the second characterization information in step S121, which is therefore not repeated here and is incorporated by reference. The device 1 determines the corresponding global and local structure information based on the target protein mutation data, determines the corresponding target encoding information through the graph neural network based on the first target characterization information, the second target characterization information and the corresponding global and local structure information, and then predicts the mutation stability. In this way, multiple kinds of features are integrated through the graph neural network to predict protein mutation stability, and both fitness and ddG data are used in training the protein mutation stability prediction model, which effectively improves the comprehensiveness and accuracy of the prediction.
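Wiring the full inference pass together might look like the sketch below; every component is a toy stand-in for the corresponding trained model, and the shapes are placeholders.

```python
# Illustrative end-to-end inference wiring: fuse features, refine with the
# graph network, pool, and score stability (all components are stand-ins).
import torch
import torch.nn as nn

class IdentityGNN(nn.Module):          # stand-in for the trained GNN
    def forward(self, h, edge_index):
        return h

def predict_stability(seq_emb, struct_emb, global_feat, local_feat,
                      edge_index, gnn, head):
    enc = torch.cat([seq_emb, struct_emb, global_feat, local_feat], dim=-1)
    enc = gnn(enc, edge_index)         # message-passing refinement
    return head(enc.mean(dim=0))       # pool residues -> ddG estimate

L = 60
feats = (torch.randn(L, 1280), torch.randn(L, 384),
         torch.randn(L, 64), torch.randn(L, 64))
edge_index = torch.randint(0, L, (2, 200))       # toy residue graph
head = nn.Linear(1792, 1)                        # stand-in stability head
ddg = predict_stability(*feats, edge_index, IdentityGNN(), head)
print(ddg.item())
```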
In some embodiments, step S13 further includes: the device 1 fine-tunes the protein mutation stability prediction model based on a target dataset, where the target dataset matches the target protein mutation data. For example, to achieve more accurate stability prediction, the device 1 may further fine-tune the protein mutation stability prediction model for a particular prediction task by acquiring a dataset matched to that task; the target dataset includes protein mutation free energy change data matching the target protein mutation data to be predicted.
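Such task-specific fine-tuning might look like the following sketch, where the pretrained trunk is frozen and only the prediction head is updated on the task-matched ddG data; the freezing choice and learning rate are assumptions.

```python
# Illustrative fine-tuning on a task-matched ddG dataset: freeze the trunk,
# update only the head with a small learning rate (both choices assumed).
import torch
import torch.nn as nn

trunk = nn.Sequential(nn.Linear(1792, 256), nn.ReLU())   # pretrained trunk
ddg_head = nn.Linear(256, 1)
for p in trunk.parameters():
    p.requires_grad_(False)            # keep the pretrained trunk fixed

opt = torch.optim.Adam(ddg_head.parameters(), lr=1e-5)
x_task, y_task = torch.randn(32, 1792), torch.randn(32)  # target dataset (toy)
for _ in range(50):
    opt.zero_grad()
    loss = nn.functional.mse_loss(ddg_head(trunk(x_task)).squeeze(-1), y_task)
    loss.backward()
    opt.step()
```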
FIG. 3 shows a flow chart of a method for constructing a protein mutation stability prediction model according to one embodiment of the present application, the method comprising step S21, step S22 and step S23. In step S21, the device 2 acquires a pre-trained protein language model and a pre-trained protein structure model. In step S22, the device 2 constructs a graph neural network based on protein mutation training data, the pre-trained protein language model and the pre-trained protein structure model, where the protein mutation training data includes first protein sequence information, and structure information and mutation information corresponding to the first protein sequence information. In step S23, the device 2 trains a protein mutation stability prediction model based on protein fitness data and protein mutation free energy change data, in combination with the graph neural network. Here, the specific implementations of step S21, step S22 and step S23 shown in FIG. 3 are the same as or similar to those of step S11, step S12 and step S13, respectively, and are therefore not repeated here and are incorporated by reference.
In some embodiments, training of the protein mutation stability prediction model and prediction of protein mutation stability may be performed on different devices. The device 2 assembles the pre-trained protein language model, the pre-trained protein structure model, the graph neural network and the protein mutation stability prediction model into a comprehensive information-fusion system for protein mutation stability prediction. Other devices can use this system directly: target protein mutation data are input into the system, and the corresponding protein mutation stability information is obtained. Within the system, the target protein mutation data is passed through the pre-trained protein language model, the pre-trained protein structure model and the graph neural network to obtain the corresponding target encoding information, and the target encoding information is passed through the protein mutation stability prediction model to obtain the corresponding protein mutation stability information. The specific process is the same as or similar to step S14 described above and is therefore not repeated here, but is incorporated by reference.
FIG. 4 shows a block diagram of an apparatus for predicting protein mutation stability according to one embodiment of the present application. The device 1 includes a one-one module 11, a one-two module 12, a one-three module 13 and a one-four module 14. The one-one module 11 acquires a pre-trained protein language model and a pre-trained protein structure model; the one-two module 12 constructs a graph neural network based on protein mutation training data, the pre-trained protein language model and the pre-trained protein structure model, where the protein mutation training data includes first protein sequence information, and structure information and mutation information corresponding to the first protein sequence information; the one-three module 13 trains a protein mutation stability prediction model based on protein fitness data and protein mutation free energy change data, in combination with the graph neural network; and the one-four module 14 determines, based on target protein mutation data, protein mutation stability information corresponding to the target protein mutation data by using the pre-trained protein language model, the pre-trained protein structure model, the graph neural network and the protein mutation stability prediction model, where the target protein mutation data includes target protein sequence information, and structure information and mutation information corresponding to the target protein sequence information. Here, the specific implementations of the one-one module 11, the one-two module 12, the one-three module 13 and the one-four module 14 shown in FIG. 4 are the same as or similar to those of step S11, step S12, step S13 and step S14, respectively, and are therefore not repeated here and are incorporated by reference.
In some embodiments, the one-one module 11 includes a one-one-one unit 111 (not shown) and a one-one-two unit 112 (not shown). The unit 111 constructs a pre-trained protein language model based on protein sequence data; the unit 112 constructs a pre-trained protein structure model based on protein structure data. The implementations of the unit 111 and the unit 112 are the same as or similar to those of step S111 and step S112, respectively, and are therefore not repeated here and are incorporated by reference.
In some embodiments, the unit 111 includes a subunit 1111 (not shown) and a subunit 1112 (not shown). The subunit 1111 acquires the protein sequence data; the subunit 1112 constructs a pre-trained protein language model through unsupervised training based on the protein sequence data. The implementations of the subunit 1111 and the subunit 1112 are the same as or similar to those of step S1111 and step S1112, respectively, and are therefore not repeated here and are incorporated by reference.
In some embodiments, the one-two module 12 includes a unit 121 (not shown) and a unit 122 (not shown). The unit 121 performs information integration based on the protein mutation training data, the pre-trained protein language model and the pre-trained protein structure model to obtain corresponding encoding information, where the protein mutation training data includes first protein sequence information, and structure information and mutation information corresponding to the first protein sequence information; the unit 122 constructs a corresponding graph neural network based on the encoding information. The implementations of the unit 121 and the unit 122 are the same as or similar to those of step S121 and step S122, respectively, and are therefore not repeated here and are incorporated by reference.
FIG. 5 shows a block diagram of an apparatus for constructing a protein mutation stability prediction model according to one embodiment of the present application. The device 2 includes a two-one module 21, a two-two module 22 and a two-three module 23. The two-one module 21 acquires a pre-trained protein language model and a pre-trained protein structure model; the two-two module 22 constructs a graph neural network based on protein mutation training data, the pre-trained protein language model and the pre-trained protein structure model, where the protein mutation training data includes first protein sequence information, and structure information and mutation information corresponding to the first protein sequence information; and the two-three module 23 trains a protein mutation stability prediction model based on protein fitness data and protein mutation free energy change data, in combination with the graph neural network. Here, the specific implementations of the two-one module 21, the two-two module 22 and the two-three module 23 shown in FIG. 5 are the same as or similar to those of step S21, step S22 and step S23, respectively, and are therefore not repeated here and are incorporated by reference.
FIG. 6 illustrates an exemplary system that may be used to implement various embodiments described herein; in some embodiments, as shown in fig. 6, the system 300 can function as any of the devices of the various described embodiments. In some embodiments, system 300 can include one or more computer-readable media (e.g., system memory or NVM/storage 320) having instructions and one or more processors (e.g., processor(s) 305) coupled with the one or more computer-readable media and configured to execute the instructions to implement the modules to perform the actions described herein.
For one embodiment, the system control module 310 may include any suitable interface controller to provide any suitable interface to at least one of the processor(s) 305 and/or any suitable device or component in communication with the system control module 310.
The system control module 310 may include a memory controller module 330 to provide an interface to the system memory 315. Memory controller module 330 may be a hardware module, a software module, and/or a firmware module.
The system memory 315 may be used, for example, to load and store data and/or instructions for the system 300. For one embodiment, system memory 315 may include any suitable volatile memory, such as, for example, a suitable DRAM. In some embodiments, the system memory 315 may comprise double data rate type four synchronous dynamic random access memory (DDR4 SDRAM).
For one embodiment, system control module 310 may include one or more input/output (I/O) controllers to provide an interface to NVM/storage 320 and communication interface(s) 325.
For example, NVM/storage 320 may be used to store data and/or instructions. NVM/storage 320 may include any suitable nonvolatile memory (e.g., flash memory) and/or may include any suitable nonvolatile storage device(s) (e.g., one or more Hard Disk Drives (HDDs), one or more Compact Disc (CD) drives, and/or one or more Digital Versatile Disc (DVD) drives).
NVM/storage 320 may include storage resources that are physically part of the device on which system 300 is installed or which may be accessed by the device without being part of the device. For example, NVM/storage 320 may be accessed over a network via communication interface(s) 325.
Communication interface(s) 325 may provide an interface for system 300 to communicate over one or more networks and/or with any other suitable device. The system 300 may wirelessly communicate with one or more components of a wireless network in accordance with any of one or more wireless network standards and/or protocols.
For one embodiment, at least one of the processor(s) 305 may be packaged together with logic of one or more controllers (e.g., memory controller module 330) of the system control module 310. For one embodiment, at least one of the processor(s) 305 may be packaged together with logic of one or more controllers of the system control module 310 to form a System In Package (SiP). For one embodiment, at least one of the processor(s) 305 may be integrated on the same die as logic of one or more controllers of the system control module 310. For one embodiment, at least one of the processor(s) 305 may be integrated on the same die with logic of one or more controllers of the system control module 310 to form a system on chip (SoC).
In various embodiments, the system 300 may be, but is not limited to being: a server, workstation, desktop computing device, or mobile computing device (e.g., laptop computing device, handheld computing device, tablet, netbook, etc.). In various embodiments, system 300 may have more or fewer components and/or different architectures. For example, in some embodiments, system 300 includes one or more cameras, keyboards, liquid Crystal Display (LCD) screens (including touch screen displays), non-volatile memory ports, multiple antennas, graphics chips, application Specific Integrated Circuits (ASICs), and speakers.
In addition to the methods and apparatus described in the above embodiments, the present application also provides a computer-readable storage medium storing computer code which, when executed, performs the method described in any of the above embodiments.
The present application also provides a computer program product which, when executed by a computer device, performs the method described in any of the above embodiments.
The present application also provides a computer device, comprising:
one or more processors;
a memory for storing one or more computer programs;
wherein the one or more computer programs, when executed by the one or more processors, cause the one or more processors to implement the method described in any of the above embodiments.
It should be noted that the present application may be implemented in software and/or a combination of software and hardware, for example, using Application Specific Integrated Circuits (ASIC), a general purpose computer or any other similar hardware device. In one embodiment, the software programs of the present application may be executed by a processor to implement the steps or functions as described above. Likewise, the software programs of the present application (including associated data structures) may be stored on a computer readable recording medium, such as RAM memory, magnetic or optical drive or diskette and the like. In addition, some steps or functions of the present application may be implemented in hardware, for example, as circuitry that cooperates with the processor to perform various steps or functions.
Furthermore, portions of the present application may be implemented as a computer program product, such as computer program instructions, which when executed by a computer, may invoke or provide methods and/or techniques in accordance with the present application by way of operation of the computer. Those skilled in the art will appreciate that the form of computer program instructions present in a computer readable medium includes, but is not limited to, source files, executable files, installation package files, etc., and accordingly, the manner in which the computer program instructions are executed by a computer includes, but is not limited to: the computer directly executes the instruction, or the computer compiles the instruction and then executes the corresponding compiled program, or the computer reads and executes the instruction, or the computer reads and installs the instruction and then executes the corresponding installed program. Herein, a computer-readable medium may be any available computer-readable storage medium or communication medium that can be accessed by a computer.
Communication media includes media whereby a communication signal containing, for example, computer readable instructions, data structures, program modules, or other data, is transferred from one system to another. Communication media may include conductive transmission media such as electrical cables and wires (e.g., optical fibers, coaxial, etc.) and wireless (non-conductive transmission) media capable of transmitting energy waves, such as acoustic, electromagnetic, RF, microwave, and infrared. Computer readable instructions, data structures, program modules, or other data may be embodied as a modulated data signal, for example, in a wireless medium, such as a carrier wave or similar mechanism, such as that embodied as part of spread spectrum technology. The term "modulated data signal" means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. The modulation may be analog, digital or hybrid modulation techniques.
By way of example, and not limitation, computer-readable storage media may include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data. For example, computer-readable storage media include, but are not limited to, volatile memory such as random access memory (RAM, DRAM, SRAM); non-volatile memory such as flash memory, various read-only memories (ROM, PROM, EPROM, EEPROM), magnetic and ferromagnetic/ferroelectric memory (MRAM, FeRAM); magnetic and optical storage devices (hard disk, tape, CD, DVD); and other media, now known or later developed, that can store computer-readable information/data for use by a computer system.
An embodiment according to the present application comprises an apparatus comprising a memory for storing computer program instructions and a processor for executing the program instructions, wherein the computer program instructions, when executed by the processor, trigger the apparatus to operate a method and/or a solution according to the embodiments of the present application as described above.
It will be evident to those skilled in the art that the present application is not limited to the details of the foregoing illustrative embodiments, and that the present application may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the application being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is evident that the word "comprising" does not exclude other elements or steps, and that the singular does not exclude a plurality. A plurality of units or means recited in the apparatus claims can also be implemented by means of one unit or means in software or hardware. The terms first, second, etc. are used to denote a name, but not any particular order.

Claims (13)

1. A method for predicting protein mutation stability, wherein the method comprises:
acquiring a pre-trained protein language model and a pre-trained protein structure model;
constructing a graph neural network based on protein mutation training data, the pre-trained protein language model and the pre-trained protein structure model, wherein the protein mutation training data comprises first protein sequence information, and structure information and mutation information corresponding to the first protein sequence information;
training a protein mutation stability prediction model based on protein fitness data and protein mutation free energy change data, in combination with the graph neural network; and
determining, based on target protein mutation data, protein mutation stability information corresponding to the target protein mutation data by using the pre-trained protein language model, the pre-trained protein structure model, the graph neural network and the protein mutation stability prediction model, wherein the target protein mutation data comprises target protein sequence information, and structure information and mutation information corresponding to the target protein sequence information.
2. The method of claim 1, wherein the acquiring a pre-trained protein language model and a pre-trained protein structure model comprises:
constructing a pre-trained protein language model based on protein sequence data; and
constructing a pre-trained protein structure model based on protein structure data.
3. The method of claim 2, wherein constructing a pre-trained protein language model based on the protein sequence data comprises:
acquiring the protein sequence data;
constructing a pre-trained protein language model through unsupervised training based on the protein sequence data.
4. The method of claim 3, wherein the obtaining the protein sequence data comprises:
acquiring original protein sequence data;
and preprocessing the original protein sequence data to obtain protein sequence data.
5. The method of any one of claims 1 to 4, wherein the constructing a graph neural network based on protein mutation training data, the pre-trained protein language model and the pre-trained protein structure model, wherein the protein mutation training data comprises first protein sequence information, and structure information and mutation information corresponding to the first protein sequence information, comprises:
performing information integration based on the protein mutation training data, the pre-trained protein language model and the pre-trained protein structure model to obtain corresponding encoding information, wherein the protein mutation training data comprises first protein sequence information, and structure information and mutation information corresponding to the first protein sequence information; and
constructing a corresponding graph neural network based on the encoding information.
6. The method of claim 5, wherein the performing information integration based on the protein mutation training data, the pre-trained protein language model and the pre-trained protein structure model to obtain corresponding encoding information, wherein the protein mutation training data comprises first protein sequence information, and structure information and mutation information corresponding to the first protein sequence information, comprises:
for the first protein sequence information and the structure information corresponding to the first protein sequence information, determining corresponding first characterization information and second characterization information using the pre-trained protein language model and the pre-trained protein structure model, respectively;
determining corresponding global structure information and local structure information based on the structure information and mutation information corresponding to the first protein sequence information; and
determining corresponding encoding information based on the first characterization information, the second characterization information, the global structure information and the local structure information.
7. The method of claim 1, wherein the training a protein mutation stability prediction model based on the protein fitness data and the protein mutation free energy change data in combination with the graph neural network comprises:
training a protein fitness model based on the protein fitness data, in combination with the graph neural network; and
training the protein fitness model based on the protein mutation free energy change data to obtain the protein mutation stability prediction model.
8. The method of claim 1, wherein the training a protein mutation stability prediction model based on the protein fitness data and the protein mutation free energy change data in combination with the graph neural network further comprises:
fine-tuning the protein mutation stability prediction model based on a target dataset, wherein the target dataset matches the target protein mutation data.
9. The method of claim 1, wherein the determining, based on the target protein mutation data, protein mutation stability information corresponding to the target protein mutation data by using the pre-trained protein language model, the pre-trained protein structure model, the graph neural network and the protein mutation stability prediction model, wherein the target protein mutation data comprises target protein sequence information, and structure information and mutation information corresponding to the target protein sequence information, comprises:
for the target protein sequence information and the structure information corresponding to the target protein sequence information, determining corresponding first target characterization information and second target characterization information using the pre-trained protein language model and the pre-trained protein structure model, respectively;
determining corresponding target coding information by using the graph neural network based on the first target characterization information, the second target characterization information and the target protein mutation data;
and determining protein mutation stability information corresponding to the target protein mutation data through the protein mutation stability prediction model based on the target coding information.
10. A method for constructing a protein mutation stability prediction model, wherein the method comprises:
acquiring a pre-trained protein language model and a pre-trained protein structure model;
constructing a graph neural network based on protein mutation training data, the pre-trained protein language model and the pre-trained protein structure model, wherein the protein mutation training data comprises first protein sequence information, and structure information and mutation information corresponding to the first protein sequence information; and
training a protein mutation stability prediction model based on protein fitness data and protein mutation free energy change data, in combination with the graph neural network.
11. A computer device for predicting protein mutation stability, comprising a memory, a processor and a computer program stored on the memory, wherein the processor executes the computer program to implement the steps of the method of any one of claims 1 to 10.
12. A computer-readable storage medium having stored thereon a computer program/instructions which, when executed by a processor, implement the steps of the method of any one of claims 1 to 10.
13. A computer program product comprising a computer program which, when executed by a processor, implements the steps of the method according to any one of claims 1 to 10.
CN202311758039.5A (priority date 2023-12-19, filing date 2023-12-19): Method, equipment and medium for predicting protein mutation stability (Active; granted as CN117672364B)

Priority Applications (1)

Application Number: CN202311758039.5A | Priority Date: 2023-12-19 | Filing Date: 2023-12-19 | Title: Method, equipment and medium for predicting protein mutation stability

Publications (2)

Publication Number | Publication Date
CN117672364A (application publication) | 2024-03-08
CN117672364B (grant publication) | 2024-05-14

Family

Family ID: 90076953


Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120265513A1 (en) * 2011-04-08 2012-10-18 Jianwen Fang Methods and systems for designing stable proteins
US20190371437A1 (en) * 2018-06-04 2019-12-05 Samsung Electronics Co., Ltd. Method and system for predicting change in functional property of biomolecule
US20210027860A1 (en) * 2019-07-23 2021-01-28 The Florida State University Research Foundation, Inc. Methods of Designing and Predicting Proteins
CN114898811A (en) * 2022-05-26 2022-08-12 清华大学 Training method and device of protein training model, electronic equipment and storage medium
CN116935952A (en) * 2023-09-18 2023-10-24 浙江大学杭州国际科创中心 Method and device for training protein prediction model based on graph neural network
CN117095753A (en) * 2023-08-17 2023-11-21 东北大学秦皇岛分校 Protein stability prediction method and network APP

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LIU Jianguo et al., "Improving the prediction accuracy of protein site-directed mutation stability based on evolutionary information", Acta Biophysica Sinica, No. 05, 15 October 2009 (2009-10-15), pages 40-45 *

Also Published As

Publication number Publication date
CN117672364B (en) 2024-05-14


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant