CN116844632A - Method and device for determining antibody sequence structure - Google Patents


Info

Publication number
CN116844632A
CN116844632A
Authority
CN
China
Prior art keywords
protein
sequence
antibody
structure prediction
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310833816.1A
Other languages
Chinese (zh)
Other versions
CN116844632B (en)
Inventor
井晓阳 (Xiaoyang Jing)
许锦波 (Jinbo Xu)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Molecular Heart Technology Co ltd
Original Assignee
Beijing Molecular Heart Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Molecular Heart Technology Co., Ltd.
Priority to CN202310833816.1A
Publication of CN116844632A
Application granted
Publication of CN116844632B
Legal status: Active

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B15/00 ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G16B15/20 Protein or domain folding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/042 Knowledge-based neural networks; Logical representations of neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding


Abstract

The application provides a method and a device for determining the structure of an antibody sequence. The method comprises: training a protein language model using a first protein sequence, based on a deep self-attention transformation (Transformer) network architecture; training a protein structure prediction model based on the protein language model, a second protein sequence, and the corresponding protein structure information; on the basis of the protein structure prediction model, training an antibody structure prediction model using antibody sequences and the corresponding antibody structure information; and determining corresponding antibody structure prediction information using the antibody structure prediction model. By first training a general protein structure prediction model on top of a protein language model and then fine-tuning that model with antibody structures to obtain an antibody structure prediction model, the application reduces the number of antibody samples required, improves the generalization of the antibody structure prediction model, and improves the accuracy and efficiency of structure prediction.

Description

Method and device for determining antibody sequence structure
Technical Field
The application relates to the technical field of bioinformatics, and in particular to techniques for determining the structure of an antibody sequence.
Background
Antibodies are important proteins that are widely used in medicine, biology, and related fields. The antibody structure provides information on the molecular architecture, conformation, and specific functional sites of an antibody, helping to elucidate the relationship between an antibody's structure and its function. Predicting and analyzing antibody structures can provide valuable information for drug design: through computational modeling and analysis, antibody properties can be improved, for example by enhancing binding affinity, reducing immunogenicity, or increasing stability. Currently, antibody structures are mainly obtained experimentally, through techniques such as X-ray crystallography and nuclear magnetic resonance, but these experiments consume a great deal of time and money. Although computational methods for predicting antibody structure, such as deep neural networks and large-scale structure sampling, have emerged, their accuracy and generality are limited by the extreme diversity of antibody complementarity determining regions (Complementarity Determining Region, CDR) and the relatively small number of antibodies with currently known structures.
Disclosure of Invention
It is an object of the present application to provide a method and apparatus for determining the structure of an antibody sequence.
According to one aspect of the present application there is provided a method for determining the structure of an antibody sequence, the method comprising:
training to obtain a protein language model by using a first protein sequence, based on a deep self-attention transformation (Transformer) network architecture;
training to obtain a protein structure prediction model based on the protein language model, a second protein sequence, and protein structure information corresponding to the second protein sequence;
training to obtain an antibody structure prediction model, on the basis of the protein structure prediction model, based on the protein language model, an antibody sequence, and antibody structure information corresponding to the antibody sequence;
and determining antibody structure prediction information corresponding to a target antibody sequence by using the protein language model and the antibody structure prediction model, based on the target antibody sequence.
According to one aspect of the present application there is provided a computer device for determining the structure of an antibody sequence comprising a memory, a processor and a computer program stored on the memory, characterised in that the processor executes the computer program to carry out the steps of any of the methods described above.
According to one aspect of the present application there is provided a computer readable storage medium having stored thereon a computer program, characterized in that the computer program when executed by a processor implements the steps of any of the methods described above.
According to one aspect of the present application there is provided a computer program product comprising a computer program, characterized in that the computer program when executed by a processor implements the steps of any of the methods described above.
According to one aspect of the present application there is provided an apparatus for determining the structure of an antibody sequence, the apparatus comprising:
the system comprises a one-to-one module, a first protein sequence and a second protein sequence, wherein the one-to-one module is used for obtaining a protein language model based on a deep self-attention transformation network architecture through training;
the two-module is used for training to obtain a protein structure prediction model based on the protein language model and protein structure information of the second protein sequence corresponding to the second protein sequence;
the three modules are used for training to obtain an antibody structure prediction model based on the protein language model and the antibody structure information of the antibody sequence corresponding to the antibody sequence and the protein structure prediction model;
and the four modules are used for determining the antibody structure prediction information corresponding to the target antibody sequence based on the target antibody sequence by utilizing the protein language model and the antibody structure prediction model.
Compared with the prior art, the application trains and obtains a protein language model by using a first protein sequence, based on a deep self-attention transformation (Transformer) network architecture; trains and obtains a protein structure prediction model based on the protein language model, a second protein sequence, and protein structure information corresponding to the second protein sequence; trains and obtains an antibody structure prediction model, on the basis of the protein structure prediction model, based on the protein language model, an antibody sequence, and antibody structure information corresponding to the antibody sequence; and determines antibody structure prediction information corresponding to a target antibody sequence by using the protein language model and the antibody structure prediction model, based on the target antibody sequence. By first training a general protein structure prediction model on top of a protein language model and then fine-tuning that model with antibody structures to obtain an antibody structure prediction model, the application reduces the number of antibody samples required, improves the generalization of the antibody structure prediction model, and improves the accuracy and efficiency of structure prediction.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the detailed description of non-limiting embodiments, made with reference to the accompanying drawings in which:
FIG. 1 shows a flow chart of a method for determining the structure of an antibody sequence according to one embodiment of the application;
FIG. 2 shows a block diagram of an apparatus for determining the structure of an antibody sequence according to one embodiment of the application;
FIG. 3 illustrates an exemplary system that may be used to implement various embodiments described in the present application.
The same or similar reference numbers in the drawings refer to the same or similar parts.
Detailed Description
The application is described in further detail below with reference to the accompanying drawings.
In one exemplary configuration of the application, the terminal, the device of the service network, and the trusted party each include one or more processors (e.g., central processing units (Central Processing Unit, CPU)), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (Random Access Memory, RAM), and/or non-volatile memory, such as read-only memory (Read Only Memory, ROM) or flash memory (Flash Memory). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PCM), programmable random access memory (Programmable Random Access Memory, PRAM), static random access memory (SRAM), dynamic random access memory (Dynamic Random Access Memory, DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (Compact Disc Read-Only Memory, CD-ROM), digital versatile discs (Digital Versatile Disc, DVD) or other optical storage, magnetic cassettes, magnetic tape or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
The device includes, but is not limited to, a user device, a network device, or a device formed by integrating a user device and a network device through a network. The user device includes, but is not limited to, any mobile electronic product capable of human-computer interaction with a user (for example, via a touch pad), such as a smart phone or a tablet computer; the mobile electronic product may run any operating system, such as the Android or iOS operating system. The network device includes an electronic device capable of automatically performing numerical calculation and information processing according to preset or stored instructions; its hardware includes, but is not limited to, microprocessors, application-specific integrated circuits (Application Specific Integrated Circuit, ASIC), programmable logic devices (Programmable Logic Device, PLD), field-programmable gate arrays (Field Programmable Gate Array, FPGA), digital signal processors (Digital Signal Processor, DSP), embedded devices, and the like. The network device includes, but is not limited to, a computer, a network host, a single network server, a set of multiple network servers, or a cloud of servers; here, the cloud is composed of a large number of computers or network servers based on cloud computing (Cloud Computing), a kind of distributed computing in which a virtual supercomputer is formed by a group of loosely coupled computers. The network includes, but is not limited to, the Internet, wide area networks, metropolitan area networks, local area networks, VPN networks, wireless ad hoc networks, and the like. Preferably, the device may also be a program running on the user device, the network device, or a device formed by integrating the user device and the network device, the touch terminal, or the network device and the touch terminal through a network.
Of course, those skilled in the art will appreciate that the above devices are merely examples; other existing or future devices, where applicable to the present application, are also intended to fall within the scope of protection of the present application and are incorporated herein by reference.
In the description of the present application, the meaning of "a plurality" is two or more unless explicitly defined otherwise.
FIG. 1 shows a flow chart of a method for determining the structure of an antibody sequence according to one embodiment of the application; the method comprises step S11, step S12, step S13, and step S14. In step S11, the device 1 trains and obtains a protein language model by using a first protein sequence, based on a deep self-attention transformation (Transformer) network architecture; in step S12, the device 1 trains and obtains a protein structure prediction model based on the protein language model, a second protein sequence, and protein structure information corresponding to the second protein sequence; in step S13, the device 1 trains and obtains an antibody structure prediction model, on the basis of the protein structure prediction model, based on the protein language model, an antibody sequence, and antibody structure information corresponding to the antibody sequence; in step S14, the device 1 determines antibody structure prediction information corresponding to a target antibody sequence by using the protein language model and the antibody structure prediction model, based on the target antibody sequence.
In step S11, the device 1 trains and obtains a protein language model by using the first protein sequence, based on the deep self-attention transformation network architecture. In some embodiments, the device 1 includes, but is not limited to, a user device or a network device with information processing or computing capabilities, e.g., a tablet, a computer, or a server. In some embodiments, the deep self-attention transformation network comprises a Transformer model. The first protein sequence comprises a plurality of protein sequences encoded in a form readable by the Transformer model.
In some embodiments, the step S11 includes: the device 1 trains and obtains the protein language model by applying a masking mechanism to the first protein sequence, based on the deep self-attention transformation network architecture. For example, a portion of the amino acid positions in each sequence of the first protein sequence is masked, and the masked protein sequences are input into the protein language model to predict the types of the masked amino acids, thereby training the model. In some embodiments, during training of the protein language model, the device 1 may employ a supervised learning approach and optimize the model parameters of the protein language model by minimizing a loss function, such as a cross-entropy loss function.
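The masked-prediction step described above can be sketched in pure Python. This is an illustrative sketch only: the patent does not specify a token vocabulary, mask fraction, or implementation, so the `MASK` token and the 15% default here are assumptions borrowed from common masked-language-model practice.

```python
import random

MASK = "<mask>"  # illustrative mask token; the patent does not define a vocabulary


def mask_sequence(seq, mask_frac=0.15, rng=None):
    """Mask a fraction of amino acid positions in a protein sequence.

    Returns the masked token list plus a {position: original residue}
    mapping used as the prediction targets for the language model.
    """
    rng = rng or random.Random(0)              # fixed seed for reproducibility
    n_mask = max(1, int(len(seq) * mask_frac))
    positions = rng.sample(range(len(seq)), n_mask)
    tokens = list(seq)
    targets = {}
    for p in positions:
        targets[p] = tokens[p]                 # remember what was hidden
        tokens[p] = MASK                       # hide it from the model
    return tokens, targets


# a short, purely hypothetical protein sequence
tokens, targets = mask_sequence("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
```

The language model would then be trained to recover `targets` from `tokens`, with a cross-entropy loss over the masked positions as the text describes.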
In some embodiments, the step S11 further includes: in step S111 (not shown), the device 1 obtains a third protein sequence; in step S112 (not shown), the device 1 performs a preprocessing operation on the third protein sequence to obtain the first protein sequence. In some embodiments, the device 1 may extract the corresponding protein sequences, as the third protein sequence, from a protein database (e.g., UniProt, ProteinAtlas, or InterPro) or from the related literature (e.g., related papers, reports, or patent documents). In some embodiments, to ensure the training quality of the protein language model, the device 1 may perform corresponding preprocessing operations on the third protein sequence and use the resulting first protein sequence for model training. In some embodiments, the preprocessing operation includes at least any one of: removing repeated sequences from the third protein sequence; filtering the third protein sequence; normalizing the third protein sequence; and encoding the third protein sequence in a predetermined form. In some embodiments, for identical protein sequences in the third protein sequence, the device 1 may retain only one copy. In some embodiments, the device 1 may also filter out low-quality protein sequences in the third protein sequence; for example, it may filter out protein sequences that do not meet a sequence length requirement. In some embodiments, the device 1 may also normalize protein sequences from different sources whose expression forms differ, converting them to the same expression form. In some embodiments, the device 1 may also encode each sequence in the third protein sequence (e.g., one-hot encoding based on the amino acid composition of the protein sequence), converting the third protein sequence into a form readily usable by the deep self-attention transformation network.
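The deduplication, length filtering, normalization, and one-hot encoding operations listed above might look like the following minimal sketch. The function names, length thresholds, and the 20-letter amino acid alphabet are illustrative assumptions, not taken from the patent.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def preprocess(seqs, min_len=20, max_len=1024):
    """Normalize, length-filter, and deduplicate raw protein sequences."""
    seen, kept = set(), []
    for s in seqs:
        s = s.upper().strip()                  # simple normalization step
        if not (min_len <= len(s) <= max_len):
            continue                           # filter out-of-range sequences
        if s in seen:
            continue                           # keep only one copy of duplicates
        seen.add(s)
        kept.append(s)
    return kept


def one_hot(seq):
    """Encode a sequence as a list of 20-dimensional indicator vectors."""
    idx = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    return [[1 if i == idx[aa] else 0 for i in range(20)] for aa in seq]


cleaned = preprocess(["MKTAYIAKQRQISFVKSHFSRQ",
                      "mktayiakqrqisfvkshfsrq ",   # duplicate after normalization
                      "MKT"])                      # too short, filtered out
```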
In some embodiments, the preprocessing operation includes filtering the third protein sequence, and the step S112 includes: the device 1 determines a plurality of protein clusters based on similarity information between the protein sequences in the third protein sequence, and determines the corresponding first protein sequence based on the plurality of protein clusters. For example, the device 1 may determine the similarity information between protein sequences in the third protein sequence using an algorithm such as MMseqs2 (Many-against-Many sequence searching), CD-HIT, or PSI-BLAST, and cluster the sequences based on the similarity information to determine the plurality of protein clusters. The device 1 may then select one or more protein sequences from each protein cluster to form the corresponding first protein sequence.
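As a toy stand-in for similarity-based clustering with tools like MMseqs2 or CD-HIT, the following greedy sketch clusters sequences by pairwise identity and keeps one representative per cluster. The identity measure and the 0.9 threshold are illustrative assumptions; real tools use alignment-based scores and optimized index structures.

```python
def identity(a, b):
    """Fraction of matching positions over the shorter length
    (a crude stand-in for an alignment-based similarity score)."""
    n = min(len(a), len(b))
    return sum(x == y for x, y in zip(a, b)) / n if n else 0.0


def greedy_cluster(seqs, threshold=0.9):
    """Greedy clustering: a sequence joins the first cluster whose
    representative it matches above `threshold`, else it founds a
    new cluster. Returns a list of clusters (lists of sequences)."""
    clusters = []
    for s in seqs:
        for c in clusters:
            if identity(s, c[0]) >= threshold:
                c.append(s)
                break
        else:
            clusters.append([s])
    return clusters


clusters = greedy_cluster(["AAAAAAAAAA", "AAAAAAAAAT", "GGGGGGGGGG"])
# one representative per cluster forms the training set
representatives = [c[0] for c in clusters]
```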
In step S12, the device 1 trains and obtains a protein structure prediction model based on the protein language model, a second protein sequence, and protein structure information corresponding to the second protein sequence. In some embodiments, the device 1 first obtains the second protein sequence for training and the corresponding protein structure information, and trains the protein structure prediction model using them. For example, the second protein sequence may be extracted from a protein database (e.g., UniProt, ProteinAtlas, or InterPro) or from the related literature (e.g., related papers, reports, or patent documents). Similar to the processing of the third protein sequence described above, the device 1 may also apply similar preprocessing operations to the second protein sequence, so that the preprocessed second protein sequence is more conducive to subsequent model training. The protein structure information may be determined experimentally (e.g., by X-ray crystallography, nuclear magnetic resonance, etc.) or by existing protein structure prediction methods (e.g., AlphaFold2, RoseTTAFold, etc.). To facilitate model training, the protein structure information may be represented using three-dimensional coordinates.
In some embodiments, the step S12 includes: in step S121 (not shown), the device 1 determines first coding information corresponding to the second protein sequence based on the protein language model; in step S122 (not shown), the device 1 trains and obtains the protein structure prediction model based on the first coding information and the protein structure information corresponding to the second protein sequence. For example, the device 1 first inputs the second protein sequence into the protein language model and obtains the corresponding first coding information (embedding), which contains co-evolution information of the protein sequence. Training the protein structure prediction model on first coding information that contains co-evolution information, together with the protein structure information, helps improve the structure prediction accuracy of the protein structure prediction model. In some embodiments, the device 1 performs model training on a designed deep learning model that uses the first coding information to make the corresponding protein structure prediction, thereby obtaining the protein structure prediction model. The deep learning model includes, but is not limited to, models based on an attention mechanism (e.g., a Transformer model). In some embodiments, the step S122 includes: the device 1 trains and obtains the protein structure prediction model by minimizing a loss function, based on the first coding information and the protein structure information corresponding to the second protein sequence. For example, during training, the device 1 may optimize the model parameters of the protein structure prediction model by minimizing a loss function, including but not limited to a mean squared error loss function or a structure-related loss function.
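A mean squared error loss over predicted three-dimensional coordinates, one of the loss choices mentioned above, can be written as a small sketch. Note that a real pipeline would typically use a superposition-invariant loss (e.g., RMSD after alignment); this naive version is for illustration only and is not from the patent.

```python
def coord_mse(pred, true):
    """Mean squared error between predicted and reference 3-D
    coordinates, averaged over residues; one simple choice of loss
    for fitting a structure prediction head on top of the
    language-model embeddings."""
    assert len(pred) == len(true), "coordinate lists must align"
    total = 0.0
    for (x1, y1, z1), (x2, y2, z2) in zip(pred, true):
        total += (x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2
    return total / len(pred)
```

During training, the model parameters would be adjusted to drive `coord_mse(model(embedding), true_coords)` toward zero.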
In step S13, the device 1 trains and obtains an antibody structure prediction model based on the protein language model, an antibody sequence, and antibody structure information corresponding to the antibody sequence. In some embodiments, the device 1 first obtains the antibody sequences and the corresponding antibody structure information for training, and performs model training on the basis of the protein structure prediction model obtained in step S12 to obtain the corresponding antibody structure prediction model. The antibody sequences may be obtained from a corresponding antibody database or from the literature. Similar to the treatment of the aforementioned third protein sequence, the device 1 may also apply similar preprocessing operations to the antibody sequences. The antibody structure information is determined by experiment (e.g., X-ray crystallography, nuclear magnetic resonance, etc.). To facilitate model training, the antibody structure information may be represented using three-dimensional coordinates. Because determining antibody structure information experimentally consumes a large amount of time and resources, the number of antibody sequences with known structure information available for training is relatively small, and models trained directly on antibody sequences and their corresponding structures therefore often have poor accuracy and generality. By training the antibody structure prediction model on the basis of the protein structure prediction model, the application effectively improves the performance of the resulting antibody structure prediction model and avoids this problem.
In some embodiments, the step S13 includes: the device 1 determines second coding information corresponding to the antibody sequence based on the protein language model; and trains to obtain the antibody structure prediction model, on the basis of the protein structure prediction model, based on the second coding information and the antibody structure information corresponding to the antibody sequence. For example, the device 1 first inputs the antibody sequence into the protein language model to obtain the corresponding second coding information (embedding), which contains co-evolution information of the protein sequence. The device 1 then performs training starting from the protein structure prediction model obtained in step S12. During training, the device 1 may also optimize the corresponding model parameters by minimizing a loss function, including but not limited to a mean squared error loss function or a structure-related loss function. In some embodiments, given the difficulty of predicting the complementarity determining regions (Complementarity Determining Region, CDR) of an antibody sequence, the weight of the complementarity determining regions may be increased during training to obtain better predictions for those regions.
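Increasing the loss weight on CDR positions, as described above, might be sketched like this. The weight value of 3.0 and the boolean-mask representation are illustrative assumptions; the patent does not specify either.

```python
def weighted_coord_mse(pred, true, cdr_mask, cdr_weight=3.0):
    """Per-residue weighted coordinate loss: residues flagged as CDR
    (cdr_mask[i] truthy) contribute `cdr_weight` times more, so a
    fine-tuned model focuses on the hard-to-predict CDR loops.
    The weight 3.0 is an illustrative choice, not from the patent."""
    num, den = 0.0, 0.0
    for p, t, is_cdr in zip(pred, true, cdr_mask):
        w = cdr_weight if is_cdr else 1.0
        err = sum((a - b) ** 2 for a, b in zip(p, t))  # squared distance
        num += w * err
        den += w
    return num / den  # weighted mean over residues


# second residue is (hypothetically) inside a CDR, so its error counts 3x
loss = weighted_coord_mse([(2, 0, 0), (1, 0, 0)],
                          [(0, 0, 0), (0, 0, 0)],
                          [False, True])
```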
In step S14, the device 1 determines antibody structure prediction information corresponding to the target antibody sequence by using the protein language model and the antibody structure prediction model, based on the target antibody sequence. In some embodiments, the step S14 includes: the device 1 determines third coding information corresponding to the target antibody sequence based on the protein language model; and determines the antibody structure prediction information corresponding to the target antibody sequence by using the antibody structure prediction model, based on the third coding information. For example, the device 1 may first encode the target antibody sequence (e.g., one-hot encoding based on its amino acid composition) to convert it into a form readily usable by the protein language model, obtain the third coding information through the protein language model, and then perform antibody structure prediction for the target antibody sequence using the antibody structure prediction model.
In some embodiments, the method further comprises: in step S15 (not shown), the device 1 performs structural optimization on the antibody structure prediction information and determines corresponding target antibody structure prediction information. For example, for the antibody structure prediction information obtained in step S14, the device 1 may optimize it based on molecular dynamics simulation, statistical potentials, or empirical energy functions, iteratively adjusting the position of each atom in the antibody structure prediction information to finally obtain the target antibody structure prediction information with the lowest energy.
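The iterative, energy-lowering adjustment of atom positions described above can be illustrated with a toy example: finite-difference gradient descent on a simple energy that penalizes deviation of consecutive Cα positions from an ideal spacing. The energy form, the 3.8 Å spacing, and the step parameters are all illustrative assumptions; real refinement would use molecular dynamics or statistical/empirical force fields, as the text notes.

```python
def energy(coords, ideal=3.8):
    """Toy energy: squared deviation of consecutive atom distances
    from an ideal spacing (3.8 is a typical Ca-Ca distance in angstroms)."""
    e = 0.0
    for (x1, y1, z1), (x2, y2, z2) in zip(coords, coords[1:]):
        d = ((x2 - x1) ** 2 + (y2 - y1) ** 2 + (z2 - z1) ** 2) ** 0.5
        e += (d - ideal) ** 2
    return e


def refine(coords, steps=200, lr=0.05, h=1e-4):
    """Iteratively nudge atom positions downhill on the energy surface
    using central finite-difference gradients (gradient descent)."""
    coords = [list(c) for c in coords]
    for _ in range(steps):
        grads = []
        for i in range(len(coords)):
            g = []
            for k in range(3):
                coords[i][k] += h
                e_plus = energy(coords)
                coords[i][k] -= 2 * h
                e_minus = energy(coords)
                coords[i][k] += h          # restore the coordinate
                g.append((e_plus - e_minus) / (2 * h))
            grads.append(g)
        for i in range(len(coords)):       # apply all updates together
            for k in range(3):
                coords[i][k] -= lr * grads[i][k]
    return [tuple(c) for c in coords]


start = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (6.0, 0.0, 0.0)]
refined = refine(start)
```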
In some embodiments, the step S15 includes: the device 1 constructs a corresponding complex structure based on the antibody structure prediction information, in combination with a target antigen sequence corresponding to the target antibody sequence; and performs structural optimization on the antibody structure prediction information based on the complex structure to determine the corresponding target antibody structure prediction information. For example, when performing structural optimization on the antibody structure prediction information, the device 1 may combine the target antibody sequence with the corresponding target antigen sequence to construct the complex structure, and then, based on the complex structure, optimize the structure of the complementarity determining regions of the target antibody sequence to further improve the optimization effect.
Fig. 2 shows a block diagram of an apparatus for determining the structure of an antibody sequence according to an embodiment of the present application. The device 1 includes a one-one module 11, a one-two module 12, a one-three module 13, and a one-four module 14. The one-one module 11 trains and obtains a protein language model by using a first protein sequence, based on a deep self-attention transformation (Transformer) network architecture; the one-two module 12 trains and obtains a protein structure prediction model based on the protein language model, a second protein sequence, and protein structure information corresponding to the second protein sequence; the one-three module 13 trains and obtains an antibody structure prediction model, on the basis of the protein structure prediction model, based on the protein language model, an antibody sequence, and antibody structure information corresponding to the antibody sequence; the one-four module 14 determines antibody structure prediction information corresponding to a target antibody sequence by using the protein language model and the antibody structure prediction model, based on the target antibody sequence. Here, the specific embodiments of the one-one module 11, the one-two module 12, the one-three module 13, and the one-four module 14 shown in Fig. 2 are the same as or similar to those of the foregoing step S11, step S12, step S13, and step S14, respectively, and are therefore not repeated here but incorporated by reference.
In some embodiments, the module 11 includes a unit 111 (not shown) and a unit 112 (not shown). The unit 111 obtains a third protein sequence; the unit 112 performs a preprocessing operation on the third protein sequence to obtain the first protein sequence. The embodiments of the units 111 and 112 are the same as or similar to those of steps S111 and S112, respectively, and are therefore not described in detail here but incorporated by reference.
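A toy version of the preprocessing performed by the unit 112 might look as follows: exact duplicates are removed, the remaining sequences are greedily clustered by a k-mer Jaccard similarity, one representative per cluster is kept as the first protein sequence set, and each representative is encoded in a predetermined integer form. The 3-mer similarity measure, the 0.5 threshold, and the greedy clustering are illustrative assumptions; the application does not specify which similarity measure or clustering method is used.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_TO_ID = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}  # 0 reserved for padding/mask

def kmer_similarity(a: str, b: str, k: int = 3) -> float:
    """Jaccard similarity over k-mer sets (an assumed stand-in for the
    patent's 'similarity information' between protein sequences)."""
    ka = {a[i:i + k] for i in range(len(a) - k + 1)}
    kb = {b[i:i + k] for i in range(len(b) - k + 1)}
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)

def preprocess(third_sequences, sim_threshold: float = 0.5):
    # Remove exact duplicates while preserving order.
    unique = list(dict.fromkeys(third_sequences))
    # Greedy clustering: a sequence joins the first cluster whose
    # representative is similar enough, otherwise it starts a new cluster.
    clusters = []
    for seq in unique:
        for cluster in clusters:
            if kmer_similarity(cluster[0], seq) >= sim_threshold:
                cluster.append(seq)
                break
        else:
            clusters.append([seq])
    # One representative per cluster forms the first protein sequence set.
    representatives = [c[0] for c in clusters]
    # Encode each representative in a predetermined integer form.
    encoded = [[AA_TO_ID[aa] for aa in seq] for seq in representatives]
    return representatives, encoded
```

This covers, in miniature, all four preprocessing options listed in claim 3: duplicate removal, filtering via clusters, a normalized representative set, and encoding in a predetermined form.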
In some embodiments, the module 12 includes a unit 121 (not shown) and a unit 122 (not shown). The unit 121 determines first coding information corresponding to the second protein sequence based on the protein language model; the unit 122 trains the protein structure prediction model based on the first coding information and the protein structure information corresponding to the second protein sequence. Here, the embodiments of the units 121 and 122 are the same as or similar to those of steps S121 and S122, respectively, and are therefore not described in detail here but incorporated by reference.
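The training performed by the unit 122 — fitting a structure prediction model to the first coding information and the protein structure information by minimizing a loss function (see claim 7) — can be sketched in miniature with a linear head trained by gradient descent on mean squared error. The linear model, the loss, and all dimensions are illustrative assumptions; in the application itself the predictor is a deep network and the structure information is richer than a flat target matrix.

```python
import numpy as np

def train_structure_head(encodings, targets, lr=0.1, steps=1000):
    """Fit a linear head W mapping per-residue encodings (n, d) to structure
    targets (n, k) by full-batch gradient descent on mean squared error."""
    rng = np.random.default_rng(0)
    n, d = encodings.shape
    k = targets.shape[1]
    W = rng.normal(scale=0.01, size=(d, k))  # small random initialization
    for _ in range(steps):
        residual = encodings @ W - targets
        grad = 2.0 * encodings.T @ residual / n  # d(MSE)/dW
        W -= lr * grad
    return W
```

The sketch only shows the shape of the training step — predictions from the language-model encodings compared against structure information through a loss that is driven down iteratively.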
In some embodiments, the device 1 further includes a module 15 (not shown), which performs structural optimization on the antibody structure prediction information and determines corresponding target antibody structure prediction information. The embodiment of the module 15 is the same as or similar to that of step S15, and is therefore not described in detail here but incorporated by reference.
FIG. 3 illustrates an exemplary system that may be used to implement various embodiments described herein. In some embodiments, as shown in FIG. 3, the system 300 can serve as any of the devices of the various described embodiments. In some embodiments, the system 300 may include one or more computer-readable media (e.g., system memory 315 or NVM/storage 320) having instructions, and one or more processors (e.g., processor(s) 305) coupled with the one or more computer-readable media and configured to execute the instructions so as to implement the modules and perform the actions described in the present application.
For one embodiment, the system control module 310 may include any suitable interface controller to provide any suitable interface to at least one of the processor(s) 305 and/or any suitable device or component in communication with the system control module 310.
The system control module 310 may include a memory controller module 330 to provide an interface to the system memory 315. Memory controller module 330 may be a hardware module, a software module, and/or a firmware module.
The system memory 315 may be used, for example, to load and store data and/or instructions for the system 300. For one embodiment, the system memory 315 may include any suitable volatile memory, such as a suitable DRAM. In some embodiments, the system memory 315 may comprise double data rate type 4 synchronous dynamic random access memory (DDR4 SDRAM).
For one embodiment, system control module 310 may include one or more input/output (I/O) controllers to provide an interface to NVM/storage 320 and communication interface(s) 325.
For example, NVM/storage 320 may be used to store data and/or instructions. NVM/storage 320 may include any suitable nonvolatile memory (e.g., flash memory) and/or may include any suitable nonvolatile storage device(s) (e.g., one or more Hard Disk Drives (HDDs), one or more Compact Disc (CD) drives, and/or one or more Digital Versatile Disc (DVD) drives).
NVM/storage 320 may include storage resources that are physically part of the device on which system 300 is installed or which may be accessed by the device without being part of the device. For example, NVM/storage 320 may be accessed over a network via communication interface(s) 325.
Communication interface(s) 325 may provide an interface for system 300 to communicate over one or more networks and/or with any other suitable device. The system 300 may wirelessly communicate with one or more components of a wireless network in accordance with any of one or more wireless network standards and/or protocols.
For one embodiment, at least one of the processor(s) 305 may be packaged together with logic of one or more controllers (e.g., memory controller module 330) of the system control module 310. For one embodiment, at least one of the processor(s) 305 may be packaged together with logic of one or more controllers of the system control module 310 to form a System In Package (SiP). For one embodiment, at least one of the processor(s) 305 may be integrated on the same die as logic of one or more controllers of the system control module 310. For one embodiment, at least one of the processor(s) 305 may be integrated on the same die with logic of one or more controllers of the system control module 310 to form a system on chip (SoC).
In various embodiments, the system 300 may be, but is not limited to: a server, a workstation, a desktop computing device, or a mobile computing device (e.g., a laptop computing device, a handheld computing device, a tablet, a netbook, etc.). In various embodiments, the system 300 may have more or fewer components and/or different architectures. For example, in some embodiments, the system 300 includes one or more cameras, keyboards, liquid crystal display (LCD) screens (including touch screen displays), non-volatile memory ports, multiple antennas, graphics chips, application-specific integrated circuits (ASICs), and speakers.
In addition to the methods and apparatus described in the above embodiments, the present application also provides a computer-readable storage medium storing computer code which, when executed, performs a method as described in any one of the preceding claims.
The application also provides a computer program product which, when executed by a computer device, performs a method as claimed in any preceding claim.
The present application also provides a computer device comprising:
one or more processors;
a memory for storing one or more computer programs;
the one or more computer programs, when executed by the one or more processors, cause the one or more processors to implement the method of any preceding claim.
It should be noted that the present application may be implemented in software and/or a combination of software and hardware, for example, using application-specific integrated circuits (ASICs), a general purpose computer, or any other similar hardware device. In one embodiment, a software program of the present application may be executed by a processor to perform the steps or functions described above. Likewise, the software programs of the present application (including associated data structures) may be stored on a computer readable recording medium, such as RAM, a magnetic or optical drive, a diskette, or the like. In addition, some steps or functions of the present application may be implemented in hardware, for example, as circuitry that cooperates with the processor to perform the various steps or functions.
Furthermore, portions of the present application may be implemented as a computer program product, such as computer program instructions, which when executed by a computer, may invoke or provide methods and/or techniques in accordance with the present application by way of operation of the computer. Those skilled in the art will appreciate that the form of computer program instructions present in a computer readable medium includes, but is not limited to, source files, executable files, installation package files, etc., and accordingly, the manner in which the computer program instructions are executed by a computer includes, but is not limited to: the computer directly executes the instruction, or the computer compiles the instruction and then executes the corresponding compiled program, or the computer reads and executes the instruction, or the computer reads and installs the instruction and then executes the corresponding installed program. Herein, a computer-readable medium may be any available computer-readable storage medium or communication medium that can be accessed by a computer.
Communication media include media by which a communication signal containing, for example, computer readable instructions, data structures, program modules, or other data is transferred from one system to another. Communication media may include wired transmission media, such as electrical cables, wires (e.g., coaxial cable), and optical fibers, as well as wireless (non-wired) media capable of transmitting energy waves, such as acoustic, electromagnetic, RF, microwave, and infrared media. Computer readable instructions, data structures, program modules, or other data may be embodied, for example, as a modulated data signal in a wireless medium, such as a carrier wave or a similar mechanism (e.g., one embodied as part of spread spectrum technology). The term "modulated data signal" means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. The modulation may be an analog, digital, or hybrid modulation technique.
By way of example, and not limitation, computer-readable storage media may include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules, or other data. For example, computer-readable storage media include, but are not limited to: volatile memory, such as random access memory (RAM, DRAM, SRAM); nonvolatile memory, such as flash memory, various read-only memories (ROM, PROM, EPROM, EEPROM), and magnetic and ferromagnetic/ferroelectric memories (MRAM, FeRAM); magnetic and optical storage devices (hard disk, tape, CD, DVD); or any other medium, now known or later developed, that can store computer-readable information/data for use by a computer system.
An embodiment according to the application comprises an apparatus comprising a memory for storing computer program instructions and a processor for executing the program instructions, wherein the computer program instructions, when executed by the processor, trigger the apparatus to operate a method and/or a solution according to the embodiments of the application as described above.
It will be evident to those skilled in the art that the application is not limited to the details of the foregoing illustrative embodiments, and that the application may be embodied in other specific forms without departing from its spirit or essential characteristics. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the application being indicated by the appended claims rather than by the foregoing description; all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the apparatus claims may also be implemented by a single unit or means in software or hardware. The terms "first", "second", and the like are used to denote names and do not indicate any particular order.

Claims (13)

1. A method for determining the structure of an antibody sequence, wherein the method comprises:
training to obtain a protein language model by utilizing a first protein sequence based on a deep self-attention transformation network architecture;
training to obtain a protein structure prediction model based on the protein language model and protein structure information corresponding to a second protein sequence;
training to obtain an antibody structure prediction model based on the protein language model and antibody structure information corresponding to an antibody sequence;
and determining antibody structure prediction information corresponding to the target antibody sequence by using the protein language model and the antibody structure prediction model based on the target antibody sequence.
2. The method of claim 1, wherein the training to obtain a protein language model by utilizing a first protein sequence based on the deep self-attention transformation network architecture further comprises:
obtaining a third protein sequence;
and performing a preprocessing operation on the third protein sequence to obtain the first protein sequence.
3. The method of claim 2, wherein the preprocessing operation comprises at least any one of:
removing repeated sequences from the third protein sequence;
filtering the third protein sequence;
normalizing the third protein sequence;
the third protein sequence is encoded in a predetermined form.
4. A method according to claim 3, wherein the preprocessing operation comprises filtering the third protein sequence, and wherein the performing the preprocessing operation on the third protein sequence to obtain the first protein sequence comprises:
determining a plurality of protein clusters based on similarity information between protein sequences in the third protein sequence;
determining the corresponding first protein sequence based on the plurality of protein clusters.
5. The method of claim 1, wherein the training to obtain a protein language model by utilizing a first protein sequence based on the deep self-attention transformation network architecture comprises:
training to obtain the protein language model by combining a masking mechanism with the first protein sequence based on the deep self-attention transformation network architecture.
6. The method of claim 1, wherein the training to obtain a protein structure prediction model based on the protein language model and the protein structure information corresponding to the second protein sequence comprises:
determining first coding information corresponding to the second protein sequence based on the protein language model;
and training to obtain the protein structure prediction model based on the first coding information and the protein structure information corresponding to the second protein sequence.
7. The method of claim 6, wherein training to obtain the protein structure prediction model based on the first encoding information and protein structure information corresponding to the second protein sequence comprises:
and training to obtain the protein structure prediction model by minimizing a loss function based on the first coding information and the protein structure information corresponding to the second protein sequence.
8. The method of claim 1, wherein the training to obtain an antibody structure prediction model based on the protein language model and the antibody structure information corresponding to the antibody sequence comprises:
determining second coding information corresponding to the antibody sequence based on the protein language model;
and training, on the basis of the protein structure prediction model, to obtain the antibody structure prediction model based on the second coding information and the antibody structure information corresponding to the antibody sequence.
9. The method of claim 1, wherein the determining, based on the target antibody sequence, the antibody structure prediction information corresponding to the target antibody sequence using the protein language model and the antibody structure prediction model comprises:
determining third coding information corresponding to the target antibody sequence based on the protein language model;
and determining the antibody structure prediction information corresponding to the target antibody sequence by using the antibody structure prediction model based on the third coding information.
10. The method of claim 1, wherein the method further comprises:
and carrying out structural optimization on the antibody structure prediction information, and determining corresponding target antibody structure prediction information.
11. The method of claim 10, wherein the performing structural optimization on the antibody structure prediction information and determining corresponding target antibody structure prediction information comprises:
based on the antibody structure prediction information, combining a target antigen sequence corresponding to the target antibody sequence to construct a corresponding complex structure;
and carrying out structural optimization on the antibody structure prediction information based on the complex structure, and determining corresponding target antibody structure prediction information.
12. A computer device for determining the structure of an antibody sequence, comprising a memory, a processor and a computer program stored on the memory, characterized in that the processor executes the computer program to carry out the steps of the method according to any one of claims 1 to 11.
13. A computer readable storage medium having stored thereon a computer program/instructions which, when executed by a processor, perform the steps of the method according to any one of claims 1 to 11.
CN202310833816.1A 2023-07-07 2023-07-07 Method and device for determining antibody sequence structure Active CN116844632B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310833816.1A CN116844632B (en) 2023-07-07 2023-07-07 Method and device for determining antibody sequence structure


Publications (2)

Publication Number Publication Date
CN116844632A 2023-10-03
CN116844632B CN116844632B (en) 2024-02-09

Family

ID=88168594

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310833816.1A Active CN116844632B (en) 2023-07-07 2023-07-07 Method and device for determining antibody sequence structure

Country Status (1)

Country Link
CN (1) CN116844632B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022194434A1 (en) * 2021-03-16 2022-09-22 Deepmind Technologies Limited Predicting complete protein representations from masked protein representations
CN115527605A (en) * 2022-11-04 2022-12-27 南京理工大学 Antibody structure prediction method based on depth map model
CN116189776A (en) * 2022-12-20 2023-05-30 重庆邮电大学 Antibody structure generation method based on deep learning
CN116189769A (en) * 2022-12-12 2023-05-30 百图生科(北京)智能技术有限公司 Training method of neural network and method for predicting protein structure
WO2023107580A1 (en) * 2021-12-08 2023-06-15 The Johns Hopkins University Generative language models and related aspects for peptide and protein sequence design




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant