CN115881220B - Antibody structure prediction processing method and device - Google Patents

Info

Publication number
CN115881220B
Authority
CN
China
Prior art keywords
training
segment
atom
model
fragment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310114453.6A
Other languages
Chinese (zh)
Other versions
CN115881220A (en
Inventor
刘旭阳
邓镇丰
顾睿初
温翰
张林峰
孙伟杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Shenshi Technology Co ltd
Original Assignee
Beijing Shenshi Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Shenshi Technology Co ltd
Priority to CN202310114453.6A
Publication of CN115881220A
Application granted
Publication of CN115881220B
Legal status: Active
Anticipated expiration

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A: TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00: Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10: Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

An embodiment of the invention relates to a processing method and device for antibody structure prediction. The method comprises: training an antibody structure prediction model for each of a plurality of preset iteration counts to obtain a plurality of groups of structure prediction model parameters; training an antibody structure scoring model; obtaining an FV fragment sequence; traversing the groups of structure prediction model parameters, setting the parameters of the antibody structure prediction model to the currently traversed group, inputting the heavy chain and light chain residue sequences into the resulting antibody structure prediction model, and predicting the three-dimensional structure of the FV fragment to obtain a corresponding FV fragment structure; inputting each of the M FV fragment structures thus obtained into the antibody structure scoring model for confidence scoring; and selecting, from the M scores, the FV fragment structure with the maximum score and outputting it as the optimal FV fragment structure. By combining the antibody structure prediction model with the antibody structure scoring model, the invention can improve structure prediction accuracy.

Description

Antibody structure prediction processing method and device
Technical Field
The invention relates to the technical field of data processing, in particular to a method and a device for processing antibody structure prediction.
Background
An antibody consists of two light chains and two heavy chains, which assemble into a large Y-shaped complex. The variable fragment (FV) region is responsible for antigen binding via a set of complementarity determining regions (CDRs), and the high flexibility of the CDR H3 region makes accurate antibody structures difficult to obtain. Existing approaches to antibody structure prediction include experiment-based prediction and intelligent-model-based prediction. Experiment-based prediction is slow and costly. For model-based prediction, the AlphaFold model is currently in common use. In specific applications, however, we find that because the AlphaFold model is not designed specifically for antibodies, training antibody structure prediction in the conventional AlphaFold manner does not always yield good prediction accuracy.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a processing method, a device, electronic equipment and a computer-readable storage medium for antibody structure prediction. An antibody structure prediction model is built in advance based on the model structure of AlphaFold-Multimer, and a corresponding antibody structure scoring model is built based on the SEGNNs network. The antibody structure prediction model is then trained based on M preset iteration counts n_i to obtain M corresponding groups of structure prediction model parameters, and the antibody structure scoring model is trained. The M groups of structure prediction model parameters are each set into the antibody structure prediction model to obtain M antibody structure prediction models with different parameters. Thereafter, each time a pair of light chain and heavy chain residue sequences is obtained, the pair is input into each of the M antibody structure prediction models for structure prediction, the M predicted structures are each given a confidence score by the antibody structure scoring model, and the predicted structure with the highest score is output as the optimal FV fragment structure. The invention remedies the lack, in conventional schemes, of a model targeted at three-dimensional structure prediction of the antibody FV fragment, and the combined operation of the antibody structure prediction model and the antibody structure scoring model improves the prediction accuracy of the three-dimensional structure of the antibody FV fragment.
To achieve the above object, in a first aspect, an embodiment of the present invention provides a method for processing antibody structure prediction, the method comprising:
training the antibody structure prediction model based on a plurality of preset iteration counts n_i to obtain a corresponding plurality of groups of structure prediction model parameters; training the antibody structure scoring model; 1 ≤ i ≤ M, where M is a positive integer greater than or equal to 1;
obtaining the residue sequence of an antibody FV fragment as a corresponding first FV fragment sequence; the first FV fragment sequence comprises a first heavy chain residue sequence and a first light chain residue sequence;
traversing the plurality of groups of structure prediction model parameters; during traversal, taking the currently traversed group as the current structure prediction model parameters, setting the model parameters of the antibody structure prediction model to the current structure prediction model parameters, and taking the antibody structure prediction model thus set as the current antibody structure prediction model; inputting the first heavy chain residue sequence and the first light chain residue sequence into the current antibody structure prediction model for FV fragment three-dimensional structure prediction to obtain a corresponding first FV fragment structure;
inputting each of the M first FV fragment structures obtained into the trained antibody structure scoring model for confidence scoring to obtain a corresponding first structure score;
and selecting, from the M first structure scores obtained, the first FV fragment structure corresponding to the maximum score and outputting it as the optimal FV fragment structure.
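The traversal-and-selection logic of the first aspect can be sketched as follows. This is a minimal illustration only: `select_best_fv_structure`, `toy_predict` and the toy score table are hypothetical names introduced here, and the prediction and scoring models are passed in as plain callables rather than real networks.

```python
from typing import Callable, Optional, Sequence, Tuple, TypeVar

Params = TypeVar("Params")
Structure = TypeVar("Structure")


def select_best_fv_structure(
    param_sets: Sequence[Params],
    predict: Callable[[Params, str, str], Structure],
    score: Callable[[Structure], float],
    heavy_seq: str,
    light_seq: str,
) -> Tuple[Structure, float]:
    """Predict one structure per parameter group and keep the highest-scoring one."""
    if not param_sets:
        raise ValueError("at least one parameter group is required (M >= 1)")
    best_structure: Optional[Structure] = None
    best_score = float("-inf")
    for params in param_sets:                      # traverse the M parameter groups
        structure = predict(params, heavy_seq, light_seq)
        current = score(structure)                 # confidence score of this prediction
        if current > best_score:
            best_structure, best_score = structure, current
    return best_structure, best_score


# Toy stand-ins (assumptions): a "structure" is just a labelled string.
def toy_predict(params: str, heavy: str, light: str) -> str:
    return f"structure-from-{params}"


toy_scores = {
    "structure-from-ckpt_a": 0.62,
    "structure-from-ckpt_b": 0.91,
    "structure-from-ckpt_c": 0.75,
}
best, best_score = select_best_fv_structure(
    ["ckpt_a", "ckpt_b", "ckpt_c"], toy_predict, toy_scores.__getitem__,
    "EVQLVESGG", "DIQMTQSPS",
)
```

In this sketch ties are resolved in favour of the earliest parameter group; the text does not say how ties are handled.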
Preferably, the antibody structure prediction model is obtained by porting the AlphaFold-Multimer model from the machine learning framework JAX to the machine learning framework PyTorch and optimizing its structure; the AlphaFold-Multimer model is a multimer structure prediction model implemented on the basis of the AlphaFold model; the model structures of the antibody structure prediction model and of the AlphaFold-Multimer model are consistent with that of the AlphaFold model;
the antibody structure scoring model includes a SEGNNs network and a weighted pooling module.
Preferably, training the antibody structure prediction model based on the plurality of preset iteration counts n_i to obtain the corresponding plurality of groups of structure prediction model parameters specifically comprises:
training the antibody structure prediction model in the multimer structure prediction training manner of the AlphaFold-Multimer model, and taking the trained antibody structure prediction model as a corresponding first training model;
after the first training model is obtained, extracting all antibody structure data from a protein three-dimensional structure database to form a corresponding first data set; the first data set comprises a plurality of first antibody structure data; the first antibody structure data includes first FV fragment data; the first FV fragment data includes a first fragment residue sequence and a corresponding first fragment three-dimensional structure tag; the first fragment residue sequence comprises a first fragment light chain residue sequence and a first fragment heavy chain residue sequence; the first fragment three-dimensional structure tag comprises a first fragment atom set and a first fragment atom connection bond set; the first fragment atom set includes a plurality of first fragment atoms; each first fragment atom corresponds to a group of atomic feature data, including an atom identifier, an atom name, atomic coordinates, an atom element type consisting of an element type and an atom type, a residue type identifier and an atomic charge; the first fragment atom connection bond set comprises a plurality of first fragment atom connection bonds; each first fragment atom connection bond corresponds to a group of bond information, including a bond identifier, a bond head atom identifier, a bond tail atom identifier and a bond type;
constructing a secondary training data set from the first data set to obtain a corresponding first training data set; the first training data set includes a plurality of first training FV fragment data; the first training FV fragment data comprises a first training fragment light chain residue sequence, a first training fragment heavy chain residue sequence and a corresponding first training fragment three-dimensional structure tag;
performing secondary training on the first training model with the first training data set according to each iteration count n_i to obtain a corresponding group of structure prediction model parameters; the M groups of structure prediction model parameters thus obtained form the corresponding plurality of groups of structure prediction model parameters.
Further, constructing the secondary training data set from the first data set to obtain the corresponding first training data set specifically includes:
constructing the first training data set, initialized as empty;
adding the first fragment light chain residue sequence, the first fragment heavy chain residue sequence and the first fragment three-dimensional structure tag of the first FV fragment data of each first antibody structure data in the first data set to the first training data set as the corresponding first training fragment light chain residue sequence, first training fragment heavy chain residue sequence and first training fragment three-dimensional structure tag, forming corresponding first training FV fragment data;
identifying a preset data augmentation mode; if the data augmentation mode is a first mode, performing motion simulation on the first fragment three-dimensional structure tag of each first antibody structure data based on a preset dynamics simulator, and sampling FV fragment three-dimensional structures during the motion simulation based on a preset motion simulation sampling rule to obtain a plurality of sampled FV fragment three-dimensional structures; if the data augmentation mode is a second mode, generating diffusion sample data from the first fragment three-dimensional structure tag of each first antibody structure data based on a preset diffusion model to obtain a plurality of sampled FV fragment three-dimensional structures; taking each sampled FV fragment three-dimensional structure as a corresponding first training fragment three-dimensional structure tag, taking the first fragment light chain residue sequence and the first fragment heavy chain residue sequence of the first FV fragment data from which the sampled structure was derived as the corresponding first training fragment light chain residue sequence and first training fragment heavy chain residue sequence, and adding the first training FV fragment data consisting of these three items to the first training data set; the data augmentation mode includes the first mode and the second mode; the dynamics simulator includes a simulator based on the principle of enhanced dynamics and a simulator based on the principle of molecular dynamics.
Further, performing secondary training on the first training model with the first training data set according to each iteration count n_i to obtain a corresponding group of structure prediction model parameters specifically comprises:
step 51, initializing the count value of a first training counter to 0, and taking the current iteration count n_i as a corresponding first maximum count value;
step 52, randomly sampling one first training FV fragment data from the first training data set as the corresponding current training FV fragment data;
step 53, taking the first training fragment light chain residue sequence, the first training fragment heavy chain residue sequence and the corresponding first training fragment three-dimensional structure tag of the current training FV fragment data as the corresponding current training light chain residue sequence, current training heavy chain residue sequence and current fragment three-dimensional structure tag;
step 54, inputting the current training light chain residue sequence and the current training heavy chain residue sequence into the first training model for FV fragment three-dimensional structure prediction to obtain a corresponding first training FV fragment structure; substituting the first training FV fragment structure and the current fragment three-dimensional structure tag into the model loss function of the first training model for loss calculation to obtain a corresponding first loss value;
step 55, evaluating the first loss value against a preset first convergence loss range; if the first loss value does not fall within the first convergence loss range, going to step 56; if it does, going to step 57;
step 56, substituting the model parameters of the first training model into the model loss function of the first training model to construct a corresponding first objective function; solving for the model parameters of the first training model in the direction that minimizes the first objective function, and taking the result as corresponding first updated model parameters; updating the model parameters of the first training model based on the first updated model parameters; and, when the update succeeds, returning to step 54 to continue training;
step 57, adding 1 to the count value of the first training counter; checking whether the incremented count value is greater than the first maximum count value; if the count value is less than or equal to the first maximum count value, selecting the next first training FV fragment data from the first training data set as the new current training FV fragment data and returning to step 53 to continue training; if the count value is greater than the first maximum count value, stopping the current round of secondary training and saving the current model parameters of the first training model as a corresponding group of structure prediction model parameters.
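Steps 51 to 57 can be sketched as a counter-controlled loop that is run once per iteration count n_i, each time starting from the first training model. The sketch below is a toy under stated assumptions: the model is a one-parameter linear fit, plain gradient descent stands in for the unspecified solver, the "next" sample is drawn at random, and `max_updates_per_sample` is a safety cap that does not appear in the described steps. As the steps are worded, the counter is compared with "greater than", so a round handles n_i + 1 samples before it stops.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

Sample = Tuple[float, float]  # toy (input, target) pair standing in for sequences + structure tag


def secondary_training(
    start_params: List[float],
    dataset: Sequence[Sample],
    n_i: int,
    loss_fn: Callable[[List[float], Sample], float],
    grad_fn: Callable[[List[float], Sample], List[float]],
    loss_tolerance: float,
    lr: float = 0.05,
    max_updates_per_sample: int = 1000,  # safety cap (assumption, not in the described steps)
    seed: int = 0,
) -> List[float]:
    rng = random.Random(seed)
    params = list(start_params)            # every round starts from the first training model
    counter, max_count = 0, n_i            # step 51
    while True:
        sample = rng.choice(dataset)       # step 52, also used for the "next" sample of step 57
        for _ in range(max_updates_per_sample):
            if loss_fn(params, sample) <= loss_tolerance:       # steps 54 and 55
                break
            grads = grad_fn(params, sample)                     # step 56: gradient descent update
            params = [p - lr * g for p, g in zip(params, grads)]
        counter += 1                       # step 57
        if counter > max_count:
            return params                  # saved as one group of model parameters


def toy_loss(params: List[float], sample: Sample) -> float:
    x, y = sample
    return (params[0] * x - y) ** 2


def toy_grad(params: List[float], sample: Sample) -> List[float]:
    x, y = sample
    return [2.0 * (params[0] * x - y) * x]


toy_data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
checkpoints: Dict[int, List[float]] = {
    n: secondary_training([0.0], toy_data, n, toy_loss, toy_grad, loss_tolerance=1e-6)
    for n in (1, 3, 5)   # M = 3 hypothetical iteration counts
}
```

Each entry of `checkpoints` plays the role of one group of structure prediction model parameters.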
Preferably, training the antibody structure scoring model specifically includes:
selecting the first FV fragment data of any specified number K of first antibody structure data from the first data set to form a corresponding second data set; the specified number K is a positive integer greater than or equal to 1;
traversing each first FV fragment data of the second data set; during traversal, taking the currently traversed first FV fragment data as the corresponding current FV fragment data; training the antibody structure scoring model according to the current FV fragment data; when this training finishes, moving on to the next first FV fragment data and continuing the traversal until training on the last first FV fragment data of the second data set is complete; when the traversal ends, fixing the model parameters of the antibody structure scoring model at their latest values;
and, if the model parameters are fixed successfully, regarding the current antibody structure scoring model as the fully trained antibody structure scoring model.
Further, training the antibody structure scoring model according to the current FV fragment data specifically includes:
taking the first fragment light chain residue sequence, the first fragment heavy chain residue sequence and the first fragment three-dimensional structure tag of the current FV fragment data as the corresponding current light chain residue sequence, current heavy chain residue sequence and current FV fragment tag structure; the current FV fragment tag structure includes the first fragment atom set and the first fragment atom connection bond set; the first fragment atom set includes a plurality of the first fragment atoms; each first fragment atom corresponds to a group of atomic feature data, including an atom identifier, an atom name, atomic coordinates, an atom element type consisting of an element type and an atom type, a residue type identifier and an atomic charge; the first fragment atom connection bond set comprises a plurality of the first fragment atom connection bonds; each first fragment atom connection bond corresponds to a group of bond information, including a bond identifier, a bond head atom identifier, a bond tail atom identifier and a bond type;
performing CDRH3 residue fragment sequence identification on the current heavy chain residue sequence based on the IMGT database to obtain a corresponding first heavy chain CDRH3 fragment sequence;
marking the three-dimensional structure region corresponding to the first heavy chain CDRH3 fragment sequence in the current FV fragment tag structure as a corresponding CDRH3 tag region, and marking the three-dimensional structure regions other than the CDRH3 tag region as corresponding non-CDRH3 tag regions; marking the first fragment atoms located at the edge of the CDRH3 tag region as corresponding first edge atoms, and marking all first fragment atoms in the non-CDRH3 tag region whose atomic distance to a first edge atom does not exceed a preset first distance threshold as corresponding first neighborhood atoms; forming a corresponding first tag atom set from all first fragment atoms of the CDRH3 tag region and all first neighborhood atoms; extracting, from the first fragment atom connection bond set, the first fragment atom connection bonds whose bond head atom identifier or bond tail atom identifier matches a first fragment atom in the first tag atom set, to form a corresponding first tag connection bond set; the first tag atom set and the first tag connection bond set form a corresponding first tag data set; the first distance threshold defaults to 10 Å;
performing constrained motion simulation on the current FV fragment tag structure based on the principle of constrained dynamics, and sampling FV fragment three-dimensional structures during the constrained motion simulation based on a preset constrained motion simulation sampling rule to obtain a plurality of first FV fragment sampling structures; the first FV fragment sampling structure includes a second fragment atom set and a second fragment atom connection bond set; the second fragment atom set includes a plurality of second fragment atoms; each second fragment atom corresponds to a group of atomic feature data, including an atom identifier, an atom name, atomic coordinates, an atom element type consisting of an element type and an atom type, a residue type identifier and an atomic charge; the second fragment atom connection bond set comprises a plurality of second fragment atom connection bonds; each second fragment atom connection bond corresponds to a group of bond information, including a bond identifier, a bond head atom identifier, a bond tail atom identifier and a bond type;
marking the three-dimensional structure region corresponding to the first heavy chain CDRH3 fragment sequence in each first FV fragment sampling structure as a corresponding CDRH3 region, and marking the three-dimensional structure regions other than the CDRH3 region as corresponding non-CDRH3 regions; marking the second fragment atoms located at the edge of the CDRH3 region as corresponding second edge atoms, and marking all second fragment atoms in the non-CDRH3 region whose atomic distance to a second edge atom does not exceed the first distance threshold as corresponding second neighborhood atoms; forming a corresponding first training atom set from all second fragment atoms of the CDRH3 region and all second neighborhood atoms; extracting, from the second fragment atom connection bond set, the second fragment atom connection bonds whose bond head atom identifier or bond tail atom identifier matches a second fragment atom in the first training atom set, to form a corresponding first training connection bond set; the first training atom set and the first training connection bond set form a corresponding first training data set;
traversing each first training data set; during traversal, taking the currently traversed first training data set as the corresponding current training data set; performing model training on the antibody structure scoring model according to the current training data set and the first tag data set; if the current model training succeeds, moving on to the next first training data set and continuing the traversal until the model training corresponding to the last first training data set succeeds.
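The extraction of the CDRH3 region plus its neighborhood atoms and bonds, as used for both the tag structure and the sampled structures, can be sketched as follows. This is a minimal illustration under assumptions: the `Atom` and `Bond` records are hypothetical and carry only the fields needed here, the text does not specify how edge atoms are determined so the caller supplies their identifiers, and the coordinates are toy values rather than a real structure.

```python
import math
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple


@dataclass(frozen=True)
class Atom:
    atom_id: int
    coord: Tuple[float, float, float]   # coordinates in angstroms
    in_cdrh3: bool                      # True if the atom lies in the CDRH3 region


@dataclass(frozen=True)
class Bond:
    bond_id: int
    head_id: int
    tail_id: int


def extract_cdrh3_subset(
    atoms: Iterable[Atom],
    bonds: Iterable[Bond],
    edge_ids: Set[int],
    threshold: float = 10.0,            # default first distance threshold from the text
) -> Tuple[Set[int], List[Bond]]:
    atoms = list(atoms)
    edge_coords = [a.coord for a in atoms if a.atom_id in edge_ids]
    selected: Set[int] = set()
    for a in atoms:
        if a.in_cdrh3:
            selected.add(a.atom_id)                       # every CDRH3 atom is kept
        elif any(math.dist(a.coord, e) <= threshold for e in edge_coords):
            selected.add(a.atom_id)                       # neighborhood atom
    # As worded in the text, a bond is kept if its head OR tail atom is selected.
    kept = [b for b in bonds if b.head_id in selected or b.tail_id in selected]
    return selected, kept


toy_atoms = [
    Atom(1, (0.0, 0.0, 0.0), True),
    Atom(2, (1.5, 0.0, 0.0), True),
    Atom(3, (8.0, 0.0, 0.0), False),
    Atom(4, (25.0, 0.0, 0.0), False),
    Atom(5, (40.0, 0.0, 0.0), False),
]
toy_bonds = [Bond(1, 1, 2), Bond(2, 2, 3), Bond(3, 3, 4), Bond(4, 4, 5)]
selected_ids, kept_bonds = extract_cdrh3_subset(toy_atoms, toy_bonds, edge_ids={1})
```

Because the bond rule uses "or", a kept bond may point at an atom outside the selected set (bond 3 above); a stricter variant would require both endpoints to be selected.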
Further preferably, performing model training on the antibody structure scoring model according to the current training data set and the first tag data set specifically includes:
step 81, performing E(3) equivariant graph construction on the current training data set to obtain a corresponding first training E3 equivariant graph; performing E(3) equivariant graph construction on the first tag data set to obtain a corresponding first tag E3 equivariant graph;
step 82, performing root mean square error calculation on the first training E3 equivariant graph and the first tag E3 equivariant graph to generate corresponding first tag RMSD data;
step 83, inputting the first training E3 equivariant graph into the SEGNNs network of the antibody structure scoring model for equivariant graph prediction to obtain a corresponding second training E3 equivariant graph; inputting the second training E3 equivariant graph into the weighted pooling module of the antibody structure scoring model for root mean square error calculation to obtain corresponding first training RMSD data;
step 84, inputting the first tag RMSD data and the first training RMSD data into the loss function of the antibody structure scoring model to calculate a corresponding second loss value; the loss function of the antibody structure scoring model defaults to an MSE loss function, which computes the mean square error between the input tag RMSD data and training RMSD data;
step 85, evaluating the second loss value against a preset second convergence loss range; if the second loss value does not fall within the second convergence loss range, going to step 86; if it does, going to step 87;
step 86, substituting the model parameters of the antibody structure scoring model into the loss function of the antibody structure scoring model to construct a corresponding second objective function; solving for the model parameters of the antibody structure scoring model in the direction that minimizes the second objective function, and taking the result as corresponding second updated model parameters; updating the model parameters of the antibody structure scoring model based on the second updated model parameters; and, when the update succeeds, returning to step 83 to continue training;
step 87, confirming that the current model training has succeeded.
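The two quantities that drive steps 82 and 84 can be sketched as below. This is a minimal illustration under assumptions not stated in the text: both structures are taken to contain the same atoms in the same order, no superposition is performed before the deviation is computed, and the value 0.5 stands in for whatever the scoring model would output.

```python
import math
from typing import Sequence, Tuple

Coord = Tuple[float, float, float]


def rmsd(a: Sequence[Coord], b: Sequence[Coord]) -> float:
    """Root mean square deviation between two equally ordered coordinate lists."""
    if len(a) != len(b) or not a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum((p - q) ** 2 for u, v in zip(a, b) for p, q in zip(u, v))
    return math.sqrt(squared / len(a))


def mse_loss(predicted: Sequence[float], target: Sequence[float]) -> float:
    """Mean square error between predicted and tag RMSD values."""
    if len(predicted) != len(target) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)


tag_coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
sampled_coords = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]   # every atom shifted by 1 along z
tag_rmsd = rmsd(sampled_coords, tag_coords)           # step 82: tag RMSD for this sample
loss = mse_loss([0.5], [tag_rmsd])                    # step 84 with a hypothetical model output
```

In the described method the regression target is computed from the sampled structure and the tag structure, while the model sees only the sampled structure, so at inference time it can estimate the deviation without a reference.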
Preferably, the first FV fragment structure comprises a third fragment atom set and a third fragment atom connection bond set; the third fragment atom set includes a plurality of third fragment atoms; each third fragment atom corresponds to a group of atomic feature data, including an atom identifier, an atom name, atomic coordinates, an atom element type consisting of an element type and an atom type, a residue type identifier and an atomic charge; the third fragment atom connection bond set comprises a plurality of third fragment atom connection bonds; each third fragment atom connection bond corresponds to a group of bond information, including a bond identifier, a bond head atom identifier, a bond tail atom identifier and a bond type.
Preferably, inputting each of the M first FV fragment structures obtained into the trained antibody structure scoring model for confidence scoring to obtain a corresponding first structure score specifically includes:
performing CDRH3 residue fragment sequence identification on the first heavy chain residue sequence corresponding to each first FV fragment structure based on the IMGT database to obtain a corresponding second heavy chain CDRH3 fragment sequence;
marking the three-dimensional structure region corresponding to the second heavy chain CDRH3 fragment sequence in each first FV fragment structure as a corresponding predicted CDRH3 tag region, and marking the three-dimensional structure regions other than the predicted CDRH3 tag region as corresponding non-predicted CDRH3 tag regions; marking the third fragment atoms located at the edge of the predicted CDRH3 tag region as corresponding third edge atoms, and marking all third fragment atoms in the non-predicted CDRH3 tag region whose atomic distance to a third edge atom does not exceed the first distance threshold as corresponding third neighborhood atoms; forming a corresponding first prediction atom set from all third fragment atoms of the predicted CDRH3 tag region and all third neighborhood atoms; extracting, from the third fragment atom connection bond set, the third fragment atom connection bonds whose bond head atom identifier or bond tail atom identifier matches a third fragment atom in the first prediction atom set, to form a corresponding first prediction connection bond set; the first prediction atom set and the first prediction connection bond set form a corresponding first prediction data set;
performing E(3) equivariant graph construction on each first prediction data set to obtain a corresponding first prediction E3 equivariant graph; inputting each first prediction E3 equivariant graph into the SEGNNs network of the antibody structure scoring model for equivariant graph prediction to obtain a corresponding second prediction E3 equivariant graph, and inputting the second prediction E3 equivariant graph into the weighted pooling module of the antibody structure scoring model for root mean square error calculation to obtain corresponding first predicted RMSD data;
querying a preset first correspondence table, which reflects the correspondence between root mean square error and confidence, according to the first predicted RMSD data corresponding to each first FV fragment structure, and extracting, as the corresponding first structure score, the first confidence field of the first correspondence record whose first root mean square error range field matches the current first predicted RMSD data; the first correspondence table comprises a plurality of first correspondence records; each first correspondence record includes the first root mean square error range field and the first confidence field; the smaller the root mean square error of the first root mean square error range field, the higher the confidence of the corresponding first confidence field.
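The final lookup from predicted RMSD to confidence can be sketched as a range table searched with `bisect`. The range boundaries and confidence values below are invented for illustration; the text only requires that smaller root mean square error ranges map to higher confidence.

```python
from bisect import bisect_right

# Hypothetical correspondence table: upper bounds of the RMSD ranges (in angstroms)
# and the confidence assigned to each range, including the open-ended last range.
RMSD_UPPER_BOUNDS = [1.0, 2.0, 4.0]
CONFIDENCES = [0.9, 0.7, 0.4, 0.1]


def confidence_from_rmsd(predicted_rmsd: float) -> float:
    """Return the confidence of the range record that contains predicted_rmsd."""
    if predicted_rmsd < 0:
        raise ValueError("RMSD cannot be negative")
    # Ranges are [0, 1), [1, 2), [2, 4) and [4, inf).
    return CONFIDENCES[bisect_right(RMSD_UPPER_BOUNDS, predicted_rmsd)]


scores = [confidence_from_rmsd(r) for r in (0.5, 1.0, 3.0, 10.0)]
```

Since the mapping is monotonically decreasing, choosing the maximum score among the M candidates amounts to choosing the candidate in the lowest predicted RMSD range.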
Further, the E(3) equivariant graph construction process specifically includes:
taking the currently input current training data set, first tag data set or first prediction data set as the corresponding first data set; taking the first training atom set and first training connecting bond set of the currently input current training data set, or the first tag atom set and first tag connecting bond set of the first tag data set, or the first prediction atom set and first prediction connecting bond set of the first prediction data set, as the corresponding first atom set and first connecting bond set; the first atom set includes a plurality of first atoms; each first atom corresponds to a group of atomic feature data, including an atom identifier, an atom name, atomic coordinates, an atomic element type consisting of an element type and an atom type, an identifier of the residue type to which the atom belongs, and an atomic charge; the first connecting bond set includes a plurality of first connecting bonds; each first connecting bond corresponds to a group of connecting bond information, including a connecting bond identifier, a bond head atom identifier, a bond tail atom identifier and a connecting bond type;
constructing an E(3) equivariant graph node as a corresponding first node for each first atom of the first atom set; setting the first node identifier, first node coordinates, first node element type and first node atom type of that first node according to the atom identifier, atomic coordinates, element type and atom type of the first atom;
constructing an E(3) equivariant graph edge as a corresponding first edge for each first connecting bond of the first connecting bond set; setting the first edge identifier, first edge head node identifier, first edge tail node identifier and first edge connecting bond type of that first edge according to the connecting bond identifier, bond head atom identifier, bond tail atom identifier and connecting bond type of the first connecting bond;
forming a corresponding current E(3) equivariant graph from all the obtained first nodes and all the first edges;
if the first data set obtained this time is the current training data set, outputting the current E(3) equivariant graph as the corresponding first training E(3) equivariant graph; if the first data set obtained this time is the first tag data set, outputting the current E(3) equivariant graph as the corresponding first tag E(3) equivariant graph; and if the first data set obtained this time is the first prediction data set, outputting the current E(3) equivariant graph as the corresponding first prediction E(3) equivariant graph.
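The graph construction steps above amount to a one-to-one mapping from atoms to nodes and from bonds to edges. A minimal sketch is given below; the class and field names (`Atom`, `Bond`, `Node`, `Edge`, `Graph`, `build_graph`) are hypothetical and chosen only to mirror the attributes listed in the text.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Atom:
    atom_id: int
    name: str
    coord: Tuple[float, float, float]
    element_type: str
    atom_type: str
    residue_type_id: str
    charge: float


@dataclass
class Bond:
    bond_id: int
    head_atom_id: int
    tail_atom_id: int
    bond_type: str


@dataclass
class Node:
    node_id: int
    coord: Tuple[float, float, float]
    element_type: str
    atom_type: str


@dataclass
class Edge:
    edge_id: int
    head_node_id: int
    tail_node_id: int
    bond_type: str


@dataclass
class Graph:
    nodes: List[Node]
    edges: List[Edge]


def build_graph(atoms: List[Atom], bonds: List[Bond]) -> Graph:
    """Map every atom to a node and every bond to an edge."""
    nodes = [Node(a.atom_id, a.coord, a.element_type, a.atom_type) for a in atoms]
    known_ids = {n.node_id for n in nodes}
    edges = []
    for b in bonds:
        # Reject bonds that reference atoms outside the atom set.
        if b.head_atom_id not in known_ids or b.tail_atom_id not in known_ids:
            raise ValueError(f"bond {b.bond_id} references an unknown atom")
        edges.append(Edge(b.bond_id, b.head_atom_id, b.tail_atom_id, b.bond_type))
    return Graph(nodes, edges)
```

Only the four node attributes and four edge attributes named in the text are copied; the remaining atomic features (name, residue type, charge) are not carried into the graph in this sketch.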
A second aspect of the embodiments of the present invention provides an apparatus for implementing the antibody structure prediction processing method according to the first aspect, where the apparatus includes: a model training module, a residue sequence acquisition module, an FV segment structure prediction module, a predicted structure scoring module and an optimal FV segment structure processing module;
the model training module is used for training the antibody structure prediction model based on a plurality of preset iteration counts n_i to obtain a corresponding plurality of groups of structure prediction model parameters, and for training the antibody structure scoring model; 1 ≤ i ≤ M, where M is a positive integer greater than or equal to 1;
the residue sequence acquisition module is used for acquiring the residue sequence of the antibody FV fragment as a corresponding first FV fragment sequence; the first FV fragment sequence comprising a first heavy chain residue sequence, a first light chain residue sequence;
the FV segment structure prediction module is used for traversing the structure prediction model parameters of the plurality of groups of structure prediction model parameters; the structure prediction model parameter of the current traversal is used as a corresponding current structure prediction model parameter, the model parameter of the antibody structure prediction model is set as the current structure prediction model parameter, and the antibody structure prediction model after the current setting is used as a corresponding current antibody structure prediction model; inputting the first heavy chain residue sequence and the first light chain residue sequence into the current antibody structure prediction model to perform FV segment three-dimensional structure prediction processing to obtain a corresponding first FV segment structure;
The prediction structure scoring module is used for inputting the obtained M first FV segment structures into the trained antibody structure scoring model respectively for confidence scoring to obtain corresponding first structure scores;
and the optimal FV segment structure processing module is used for selecting the first FV segment structure corresponding to the maximum score from the obtained M first structure scores as an optimal FV segment structure and outputting the optimal FV segment structure.
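The cooperation of the prediction, scoring and selection modules can be sketched as a simple loop over the M parameter groups. In this sketch `predict` and `score` are stand-in callables for the antibody structure prediction model and the antibody structure scoring model; their signatures, and the name `select_best_structure`, are assumptions made for illustration.

```python
from typing import Any, Callable, Optional, Sequence, Tuple


def select_best_structure(
    parameter_groups: Sequence[Any],
    predict: Callable[[Any, str, str], Any],
    score: Callable[[Any], float],
    heavy_chain: str,
    light_chain: str,
) -> Tuple[Any, float]:
    """Predict one structure per parameter group and keep the top-scoring one."""
    if not parameter_groups:
        raise ValueError("at least one parameter group is required")
    best: Optional[Tuple[Any, float]] = None
    for params in parameter_groups:
        # Set the model to the current parameter group and predict a structure.
        structure = predict(params, heavy_chain, light_chain)
        # Confidence-score the predicted structure.
        structure_score = score(structure)
        if best is None or structure_score > best[1]:
            best = (structure, structure_score)
    return best
```

If several structures share the maximum score, this sketch keeps the first one encountered; the source does not state a tie-breaking rule.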
A third aspect of an embodiment of the present invention provides an electronic device, including: memory, processor, and transceiver;
the processor is configured to couple to the memory, and read and execute the instructions in the memory, so as to implement the method steps described in the first aspect;
the transceiver is coupled to the processor and is controlled by the processor to transmit and receive messages.
A fourth aspect of the embodiments of the present invention provides a computer-readable storage medium storing computer instructions that, when executed by a computer, cause the computer to perform the instructions of the method of the first aspect.
The embodiments of the invention provide a processing method, a processing device, electronic equipment and a computer-readable storage medium for antibody structure prediction; an antibody structure prediction model is built in advance based on the model structure of AlphaFold-Multimer, and a corresponding antibody structure scoring model is built based on the SEGNNs network; then, based on M preset iteration counts n_i, the antibody structure prediction model is trained to obtain M corresponding groups of structure prediction model parameters, and the antibody structure scoring model is trained; the M groups of structure prediction model parameters are each substituted into the antibody structure prediction model to obtain M antibody structure prediction models with different parameters; then, each time a pair of light chain and heavy chain residue sequences is obtained, the pair is input into each of the M antibody structure prediction models for structure prediction, the M predicted structures are each confidence-scored by the antibody structure scoring model, and the predicted structure with the highest score is output as the optimal FV segment structure. The invention not only remedies the lack, in conventional schemes, of a model targeted at three-dimensional structure prediction of the antibody FV fragment, but also improves the prediction accuracy of the three-dimensional structure of the antibody FV fragment through the combined operation of the antibody structure prediction model and the antibody structure scoring model.
Drawings
FIG. 1 is a schematic diagram of a method for predicting antibody structure according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an antibody structure scoring model according to an embodiment of the present invention;
FIG. 3 is a block diagram of a processing device for predicting an antibody structure according to a second embodiment of the present invention;
FIG. 4 is a schematic structural diagram of an electronic device according to a third embodiment of the present invention.
Description of the embodiments
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail below with reference to the accompanying drawings, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
An embodiment of the present invention provides a method for processing antibody structure prediction, as shown in fig. 1, which is a schematic diagram of a method for processing antibody structure prediction according to an embodiment of the present invention, the method mainly includes the following steps:
step 1, training the antibody structure prediction model based on a plurality of preset iteration counts n_i to obtain a corresponding plurality of groups of structure prediction model parameters; and training the antibody structure scoring model;
here, 1 ≤ i ≤ M, where M is a preset positive integer greater than or equal to 1, and the M iteration counts n_i are all different from one another; the purpose of the current step 1 is to train the antibody structure prediction model and the antibody structure scoring model separately; the antibody structure prediction model is trained M times, once for each of the M iteration counts n_i, with the number of iterations of each training run determined by the corresponding n_i, so that M different groups of antibody structure prediction model parameters are obtained; the antibody structure scoring model is trained only once, yielding a single group of model parameters that is then frozen;
the method specifically comprises the following steps: step 11, training the antibody structure prediction model based on the plurality of preset iteration counts n_i to obtain the corresponding plurality of groups of structure prediction model parameters;
the method specifically comprises the following steps: step 111, training the antibody structure prediction model following the multimer structure prediction training procedure of the AlphaFold-Multimer model, and taking the trained antibody structure prediction model as the corresponding first training model;
the antibody structure prediction model in the embodiment of the invention is a prediction model obtained by migrating the AlphaFold-Multimer model from the machine learning framework JAX to the machine learning framework PyTorch and optimizing its structure; it predicts the corresponding FV segment three-dimensional structure from the input light chain and heavy chain residue sequences of an antibody FV segment, where the FV segment three-dimensional structure comprises the spatial distribution of all atoms in the segment and the spatial distribution of all atom connecting bonds in the segment; in the embodiment of the invention, the current step 111 trains the antibody structure prediction model, whose structure is consistent with the AlphaFold-Multimer model, following the training procedure of the AlphaFold-Multimer model, and the subsequent steps 112-115 then train the antibody structure prediction model further to improve it;
It should be noted that the above-mentioned AlphaFold model is a model for predicting the three-dimensional structure of a protein from its residue sequence; its specific implementation is described in the published paper Highly accurate protein structure prediction with AlphaFold; its input is a residue sequence and its output is a predicted three-dimensional protein structure, which likewise includes the spatial distribution of all residue atoms and the spatial distribution of all atom connecting bonds; the above-mentioned AlphaFold-Multimer model is a model for predicting multimer structures built on the AlphaFold model; its specific implementation is described in the published paper Protein complex prediction with AlphaFold-Multimer; its input is the sequences of the multiple residue chains of a multimer and its output is the three-dimensional structure of the multimer, which likewise includes the spatial distribution of all residue atoms and the spatial distribution of all atom connecting bonds; in essence, the AlphaFold-Multimer model is an upgrade of the AlphaFold model: its model structure is very similar to that of the AlphaFold model, and the changes made to handle multimers are confined to data preprocessing and part of the loss function; details can be found in the paper Protein complex prediction with AlphaFold-Multimer and are not repeated here;
The model structure of the antibody structure prediction model of the embodiment of the invention is consistent with the AlphaFold-Multimer model, and the model also uses the loss function of the AlphaFold-Multimer model; however, to improve the stability of the model and the prediction accuracy for the three-dimensional structure of the antibody FV fragment, the antibody structure prediction model of the embodiment of the invention applies the following optimizations on top of the AlphaFold model and the AlphaFold-Multimer model:
1) The model is migrated from the machine learning framework JAX to the machine learning framework PyTorch;
2) All ReLU activation functions in the AlphaFold model and the AlphaFold-Multimer model are replaced by Gaussian error linear unit (Gaussian Error Linear Units, GELUs) activation functions; for a detailed description of GELUs, reference may be made to the paper Gaussian Error Linear Units (GELUs); according to that paper, using GELUs instead of ReLU as the activation function can yield better model training results;
3) The OuterProductMean module in the AlphaFold model tends to produce large values, which easily leads to unstable training; to mitigate this, the antibody structure prediction model of the embodiment of the invention adds a post-processing layer to the output of the OuterProductMean module to reduce its values; the processing principle is: x = Linear(LayerNorm(x)), where Linear is a linear layer and LayerNorm is the well-known layer normalization function, which normalizes over the channel dimension;
4) In the AlphaFold model and the AlphaFold-Multimer model, most of the auxiliary heads other than the predicted-LDDT head process data with a single linear projection layer (Linear); the principle is: x = Linear(x); to improve the feature projection of this linear projection layer, the antibody structure prediction model of the embodiment of the invention enhances this step with an added GELUs activation function; the principle is: x = Linear(GELU(LayerNorm(x)));
5) In the technical scheme of the AlphaFold-Multimer model, a random mask is used when processing the MSA of homologous sequences in the multimer, which may cause data leakage when a prediction task is executed;
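The two formulas in items 3) and 4), x = Linear(LayerNorm(x)) and x = Linear(GELU(LayerNorm(x))), can be illustrated on a single feature vector. The sketch below deliberately uses plain Python lists rather than PyTorch so that it is self-contained; the layer normalization omits the learnable scale and shift, and the weights passed in are arbitrary placeholders, so this is an assumption-laden illustration rather than the model's actual layers.

```python
import math
from typing import List, Sequence


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def layer_norm(x: Sequence[float], eps: float = 1e-5) -> List[float]:
    """Normalize a vector to zero mean and unit variance over its channels."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(
    x: Sequence[float],
    weight: Sequence[Sequence[float]],
    bias: Sequence[float],
) -> List[float]:
    """Affine map y = W x + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def outer_product_mean_post(x, weight, bias):
    """Item 3): x = Linear(LayerNorm(x)), which tames large activations."""
    return linear(layer_norm(x), weight, bias)


def auxiliary_head(x, weight, bias):
    """Item 4): x = Linear(GELU(LayerNorm(x)))."""
    return linear([gelu(v) for v in layer_norm(x)], weight, bias)
```

With an identity weight matrix, `outer_product_mean_post([1000.0, 2000.0, 3000.0], ...)` returns values of magnitude close to 1, which shows how the normalization step bounds the large outputs mentioned in item 3).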
step 112, after obtaining the first training model, extracting all antibody structure data in the protein three-dimensional structure database to form a corresponding first data set;
wherein the first dataset comprises a plurality of first antibody structure data; the first antibody structure data includes first FV fragment data; the first FV fragment data comprises a first fragment residue sequence and a corresponding first fragment three-dimensional structure tag; the first fragment residue sequence comprises a first fragment light chain residue sequence and a first fragment heavy chain residue sequence; the first fragment three-dimensional structure tag comprises a first fragment atom set and a first fragment atom connecting bond set; the first fragment atom set includes a plurality of first fragment atoms; each first fragment atom corresponds to a group of atomic feature data, including an atom identifier, an atom name, atomic coordinates, an atomic element type consisting of an element type and an atom type, an identifier of the residue type to which the atom belongs, and an atomic charge; the first fragment atom connecting bond set comprises a plurality of first fragment atom connecting bonds; each first fragment atom connecting bond corresponds to a group of connecting bond information, including a connecting bond identifier, a bond head atom identifier, a bond tail atom identifier and a connecting bond type;
Here, the first training model is the antibody structure prediction model trained following the training procedure of the AlphaFold-Multimer model described in the paper Protein complex prediction with AlphaFold-Multimer; the protein three-dimensional structure database is also known as the PDB (Protein Data Bank) database; antibody structure data obtained from the PDB database is in the specified PDB data format; according to that format, the embodiment of the invention extracts, from the originally obtained antibody structure data, the residue sequence of the FV fragment as the corresponding first fragment residue sequence; extracts the residue sequences of the light chain and the heavy chain of the FV fragment as the corresponding first fragment light chain residue sequence and first fragment heavy chain residue sequence; extracts the features of all atoms in the FV fragment (such as the atom identifier, atom name, atomic coordinates, atomic element type consisting of an element type and an atom type, identifier of the residue type to which the atom belongs, atomic charge and the like) to form the corresponding first fragment atom set; and extracts the features of all atom connecting bonds in the FV fragment (such as the connecting bond identifier, bond head atom identifier, bond tail atom identifier, connecting bond type and the like) to form the corresponding first fragment atom connecting bond set;
Step 113, constructing a secondary training data set according to the first data set to obtain a corresponding first training data set;
wherein the first training data set comprises a plurality of first training FV fragment data; the first training FV segment data comprises a first training segment light chain residue sequence, a first training segment heavy chain residue sequence and a corresponding first training segment three-dimensional structural tag;
the method specifically comprises the following steps: step 1131, constructing a first training data set initialized to empty;
step 1132, adding the first segment light chain residue sequence, the first segment heavy chain residue sequence and the first segment three-dimensional structure tag of the first FV segment data of each first antibody structural data in the first dataset as corresponding first training segment light chain residue sequence, first training segment heavy chain residue sequence and first training segment three-dimensional structure tag to the first training dataset;
here, the current step directly generates corresponding training data from all the first FV fragment data in the first dataset and adds the corresponding training data to the first training dataset;
step 1133, identifying a preset data enhancement mode; if the data enhancement mode is the first mode, performing motion simulation on the first segment three-dimensional structure labels of the first antibody structure data based on a preset dynamics simulator, and performing FV segment three-dimensional structure sampling processing in the motion simulation process based on a preset motion simulation sampling rule so as to obtain a plurality of sampled FV segment three-dimensional structures; if the data enhancement mode is the second mode, performing diffusion sample data generation processing according to the first segment three-dimensional structure labels of the first antibody structure data based on a preset diffusion model so as to obtain a plurality of sampling FV segment three-dimensional structures; the obtained three-dimensional structures of the sampling FV segments are used as corresponding first training segment three-dimensional structure labels, the first segment light chain residue sequence and the first segment heavy chain residue sequence of the first FV segment data corresponding to the three-dimensional structures of the sampling FV segments are used as corresponding first training segment light chain residue sequence and first training segment heavy chain residue sequence, and the corresponding first training segment light chain residue sequence, first training segment heavy chain residue sequence and first training segment three-dimensional structure labels of the three-dimensional structures of the sampling FV segments form corresponding first training FV segment data to be added into the first training data set;
wherein the data enhancement mode comprises a first mode and a second mode; the dynamics simulators include simulators based on the principle of reinforced dynamics (Reinforced Dynamics, RiD) and simulators based on the principle of molecular dynamics (Molecular Dynamics, MD);
here, the current step takes each first FV fragment data in the first data set as an original structure, applies one of the two data enhancement modes to each original structure to obtain a plurality of new structures, and adds the new structures to the first training data set;
when the data enhancement mode is the first mode, the data enhancement processing is performed with a dynamics simulator; the principle is that a group of force fields is applied to the original structure to simulate its motion based on the reinforced dynamics principle or the molecular dynamics principle; during the simulation the original structure rotates, stretches, folds and so on, producing a series of changed structural conformations; sampling these changing conformations multiple times according to a preset motion simulation sampling rule yields a plurality of new structures, namely a plurality of sampled FV segment three-dimensional structures; the motion simulation sampling rule may be preset as sampling at a fixed time step or at a fixed number of simulation iterations, or may be set according to the actual application;
when the data enhancement mode is the second mode, the embodiment of the invention uses a diffusion model (Diffusion Model) for data enhancement processing; the diffusion model is a data generation model whose principle is described in the paper Diffusion Models: A Comprehensive Survey of Methods and Applications; in simple terms, noise is injected into the input original structure to generate a plurality of new structure samples, namely a plurality of sampled FV segment three-dimensional structures;
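The two data enhancement modes can be illustrated with two small helpers. Both are simplified sketches under stated assumptions: `sample_trajectory` assumes a simulator has already produced a list of conformations and only applies a fixed-stride sampling rule, and `add_gaussian_noise` shows only the noise-injection idea, not a trained diffusion model. The function names, the stride and the noise level are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

Coord = Tuple[float, float, float]


def sample_trajectory(frames: Sequence, stride: int) -> List:
    """First mode: keep every `stride`-th conformation of a simulated trajectory."""
    if stride < 1:
        raise ValueError("stride must be a positive integer")
    return list(frames[::stride])


def add_gaussian_noise(
    coords: Sequence[Coord], sigma: float, rng: random.Random
) -> List[Coord]:
    """Second mode (simplified): perturb every coordinate with Gaussian noise."""
    return [
        (x + rng.gauss(0.0, sigma), y + rng.gauss(0.0, sigma), z + rng.gauss(0.0, sigma))
        for x, y, z in coords
    ]


def augment_by_noise(
    coords: Sequence[Coord], n_samples: int, sigma: float, seed: int = 0
) -> List[List[Coord]]:
    """Generate several noisy copies of one original structure."""
    rng = random.Random(seed)
    return [add_gaussian_noise(coords, sigma, rng) for _ in range(n_samples)]
```

Each sampled or perturbed structure would then be paired with the light chain and heavy chain residue sequences of the original structure to form one new training record.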
step 114, performing secondary training on the first training model with the first training data set according to each iteration count n_i to obtain a corresponding group of structure prediction model parameters; and forming the corresponding plurality of groups of structure prediction model parameters from the M groups of structure prediction model parameters obtained;
here, the first training model is the antibody structure prediction model trained in step 111 following the multimer structure prediction training procedure of the AlphaFold-Multimer model; in the current step 114, after steps 112-113 have constructed a first training data set consisting entirely of antibody FV fragments, M independent rounds of training are performed on the first training model based on the first training data set, and after each independent round the model parameters obtained in that round are stored as one group of structure prediction model parameters; the purpose of this training is to further improve the model's prediction accuracy for antibody FV fragments;
the method specifically comprises the following steps: step 1141, performing secondary training on the first training model with the first training data set according to each iteration count n_i to obtain a corresponding group of structure prediction model parameters;
the method specifically comprises the following steps: step 11411, initializing the count value of the first training counter to 0; and taking the current iteration count n_i as the corresponding first maximum count value;
step 11412, randomly sampling a first training FV fragment data from the first training dataset as corresponding current training FV fragment data;
step 11413, using the first training fragment light chain residue sequence, the first training fragment heavy chain residue sequence and the corresponding first training fragment three-dimensional structure tag of the current training FV fragment data as the corresponding current training light chain residue sequence, the current training heavy chain residue sequence and the current fragment three-dimensional structure tag;
step 11414, inputting the current training light chain residue sequence and the current training heavy chain residue sequence into a first training model for FV fragment three-dimensional structure prediction processing to obtain a corresponding first training FV fragment structure; substituting the first training FV segment structure and the current segment three-dimensional structure label into a model loss function of a first training model to perform loss calculation to obtain a corresponding first loss value;
Here, the model loss function of the first training model in the current step is the overall loss function of the AlphaFold-Multimer model, which can be known by referring to paper Protein complex prediction with AlphaFold-Multimer, and is not repeatedly discussed herein;
step 11415, evaluating the first loss value based on a preset first convergence loss range; if the first loss value does not meet the first convergence loss range, go to step 11416; if the first loss value meets the first convergence loss range, go to step 11417;
here, the first convergence loss range is a preset loss value convergence range, and can be set based on practical application;
step 11416, substituting the model parameters of the first training model into the model loss function of the first training model to construct a corresponding first objective function; solving model parameters of the first training model towards the direction of enabling the first objective function to reach the minimum value, and taking the solved result as corresponding first updated model parameters; model parameter updating processing is carried out on the first training model based on the first updating model parameters; returning to step 11414 to continue training when the model parameter updating process is successful;
Step 11417, adding 1 to the count value of the first training counter; identifying whether the count value of the first training counter after adding 1 is larger than a first maximum count value; if the count value of the first training counter is less than or equal to the first maximum count value, selecting the next first training FV fragment data from the first training dataset as new current training FV fragment data, and returning to step 11413 to continue training; if the count value of the first training counter is larger than the first maximum count value, stopping the secondary training of the round and storing the current model parameters of the first training model as a corresponding group of structure prediction model parameters;
step 1142, forming corresponding multiple groups of structure prediction model parameters by the obtained M structure prediction model parameters;
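The control flow of steps 11411 to 11417 and step 1142 can be sketched with a toy model. The sketch below follows the counter logic as written (training stops once the counter exceeds n_i); the scalar linear model, the squared-error loss, the learning rate, the random choice of the next sample and the `max_inner_updates` guard are all assumptions introduced so that the example is runnable, and none of them is the actual model or loss.

```python
import random
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[float, float]  # (input, label) stand-in for sequences and structure tag


def secondary_training(
    params: float,
    dataset: Sequence[Sample],
    n_i: int,
    loss_fn: Callable[[float, Sample], float],
    update_fn: Callable[[float, Sample], float],
    converged: Callable[[float], bool],
    rng: random.Random,
    max_inner_updates: int = 1000,  # safety guard, not part of the described method
) -> float:
    count = 0                               # step 11411: counter = 0, maximum = n_i
    sample = rng.choice(dataset)            # step 11412: randomly sample training data
    while True:
        loss = loss_fn(params, sample)      # step 11414: predict and compute the loss
        updates = 0
        while not converged(loss) and updates < max_inner_updates:  # step 11415
            params = update_fn(params, sample)                      # step 11416
            loss = loss_fn(params, sample)                          # back to step 11414
            updates += 1
        count += 1                          # step 11417: increment the counter
        if count > n_i:
            return params                   # store the parameters of this round
        sample = rng.choice(dataset)        # select the next training sample


def train_parameter_groups(initial_params: float, dataset, iteration_counts) -> List[float]:
    """Step 1142: one independent round per n_i, each starting from the same model."""
    lr = 0.05
    loss_fn = lambda w, s: (w * s[0] - s[1]) ** 2
    update_fn = lambda w, s: w - lr * 2.0 * (w * s[0] - s[1]) * s[0]
    converged = lambda loss: loss < 1e-6
    return [
        secondary_training(initial_params, dataset, n_i, loss_fn, update_fn,
                           converged, random.Random(n_i))
        for n_i in iteration_counts
    ]
```

Calling `train_parameter_groups(0.0, [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)], [1, 3, 5])` returns three parameter values, one per iteration count, mirroring the M groups of structure prediction model parameters.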
step 12, training an antibody structure scoring model;
here, before step 12 is explained in detail, the antibody structure scoring model of the embodiment of the present invention is described; the model is shown in fig. 2, which is a schematic structural diagram of the antibody structure scoring model according to the first embodiment of the present invention, and includes: a SEGNNs network and a weighted pooling module; the SEGNNs network comprises a dual one-hot encoding module, a spherical harmonic encoding module, a radial encoding module, a one-hot encoding module, a steerable multi-layer perceptron module, a steerable group convolution module and an equivariant nonlinear function module; the dual one-hot encoding module is connected with the steerable multi-layer perceptron module; the spherical harmonic encoding module, the radial encoding module and the one-hot encoding module are each connected with the steerable group convolution module; the steerable group convolution module is also connected with the equivariant nonlinear function module;
The SEGNNs network of the antibody structure scoring model of the embodiment of the invention is implemented based on the steerable E(3) equivariant graph neural networks (Steerable E(3) Equivariant Graph Neural Networks, SEGNNs) proposed in the paper Geometric and Physical Quantities Improve E(3) Equivariant Message Passing;
the E(3) equivariant graphs mentioned below in the embodiment of the invention consist of a plurality of first nodes and a plurality of first edges; each first node corresponds to one atom and has a group of node attributes, including a first node identifier, first node coordinates, a first node element type and a first node atom type; each first edge corresponds to one atom connecting bond and has a group of edge attributes, including a first edge identifier, a first edge head node identifier, a first edge tail node identifier and a first edge connecting bond type;
the dual one-hot encoding module of the SEGNNs network is used for dual one-hot encoding of two attribute features, the first node element type and the first node atom type, of all first nodes in the input E(3) equivariant graph; the spherical harmonic encoding module is used for encoding the length feature of each first edge in the input E(3) equivariant graph based on a preset spherical harmonic function; the radial encoding module is used for encoding the radial feature of each first edge in the input E(3) equivariant graph based on a preset radial basis function; the one-hot encoding module is used for encoding the connecting bond feature of each first edge in the input E(3) equivariant graph; the steerable multi-layer perceptron module is implemented with reference to the section on the Clebsch-Gordan tensor product and steerable MLPs (CLEBSCH-GORDAN PRODUCT AND STEERABLE MLPS) of the paper Geometric and Physical Quantities Improve E(3) Equivariant Message Passing; the steerable group convolution module is implemented with reference to the section on steerable group convolutions (STEERABLE GROUP CONVOLUTIONS) of the same paper; the equivariant nonlinear function module is implemented with reference to the content on equivariant nonlinear message passing functions in the same paper; the input and output of the SEGNNs network of the embodiment of the invention are both feature tensors with E(3) equivariant graph structure, and the number of atoms carrying features is the same in the input and output E(3) equivariant graphs; the SEGNNs network of the embodiment of the invention takes the input E(3) equivariant graph as an original structure, performs a further stable-structure prediction on the original structure, and outputs an E(3) equivariant graph that reflects the atoms and the atom connecting bond relations of the predicted stable structure;
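The node encoding step can be illustrated as follows. This sketch interprets the encoding of the two node attributes as the concatenation of two one-hot vectors, one for the element type and one for the atom type; that interpretation, the vocabularies `ELEMENT_TYPES` and `ATOM_TYPES`, and the function names are assumptions for illustration only.

```python
from typing import List, Sequence

# Hypothetical vocabularies; the real ones would cover all types in the data.
ELEMENT_TYPES = ["C", "N", "O", "S", "H"]
ATOM_TYPES = ["CA", "CB", "N", "O", "C"]


def one_hot(value: str, vocabulary: Sequence[str]) -> List[float]:
    """Return a vector with 1.0 at the position of `value` and 0.0 elsewhere."""
    if value not in vocabulary:
        raise ValueError(f"unknown category: {value!r}")
    return [1.0 if v == value else 0.0 for v in vocabulary]


def dual_one_hot(element_type: str, atom_type: str) -> List[float]:
    """Concatenate the one-hot codes of the element type and the atom type."""
    return one_hot(element_type, ELEMENT_TYPES) + one_hot(atom_type, ATOM_TYPES)
```

For example, `dual_one_hot("N", "CA")` yields a 10-dimensional vector with ones at index 1 (element N) and index 5 (atom type CA).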
The weighted pooling module of the antibody structure scoring model of the embodiment of the invention computes, through a weighted pooling operation, the root mean square deviation (Root Mean Square Deviation, RMSD) between the input and output E(3) equivariant graphs of the SEGNNs network to obtain the corresponding RMSD data; the smaller the RMSD data, the smaller the error between the original structure corresponding to the input E(3) equivariant graph and the stable structure corresponding to the output E(3) equivariant graph, and the closer the two structures are; conversely, the larger the RMSD data, the larger the error between the original structure and the stable structure, and the less similar the two structures are;
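The RMSD between two structures with the same atoms can be computed as below. The sketch assumes that the two coordinate lists are already aligned atom by atom, and the optional per-atom weights are an illustrative stand-in for the weighted pooling; how the actual module derives its weights is not described in the text.

```python
import math
from typing import Optional, Sequence, Tuple

Coord = Tuple[float, float, float]


def weighted_rmsd(
    coords_in: Sequence[Coord],
    coords_out: Sequence[Coord],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Weighted RMSD: sqrt(sum_i w_i * d_i^2 / sum_i w_i)."""
    if len(coords_in) != len(coords_out) or not coords_in:
        raise ValueError("coordinate lists must be non-empty and equally long")
    if weights is None:
        weights = [1.0] * len(coords_in)  # uniform weights give the ordinary RMSD
    if len(weights) != len(coords_in):
        raise ValueError("one weight per atom is required")
    total_weight = sum(weights)
    if total_weight <= 0.0:
        raise ValueError("weights must sum to a positive value")
    acc = 0.0
    for (x1, y1, z1), (x2, y2, z2), w in zip(coords_in, coords_out, weights):
        acc += w * ((x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2)
    return math.sqrt(acc / total_weight)
```

A value of 0 means the input and output structures coincide; larger values indicate a larger deviation between them.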
the training principle of the antibody structure scoring model of the embodiment of the invention is as follows: the first FV fragment data of a specified number K of first antibody structure data are extracted from the reliable first data set to form a corresponding second data set, and one round of training is performed on the antibody structure scoring model for each first FV fragment data; in each round of training, the first fragment three-dimensional structure label of the current first FV fragment data is taken as the original structure; constrained motion simulation is performed on the original structure based on the constrained dynamics principle, and the simulation is sampled multiple times to obtain a plurality of first FV fragment sampling structures; CDRH3 region and neighborhood structural features are extracted from the original structure to obtain a corresponding first label data set, and from each of the sampling structures to obtain a corresponding plurality of first training data sets; the antibody structure scoring model is then trained once with each first training data set, the reference label of each such training being the first label data set; an antibody structure scoring model trained in this manner can be used to evaluate the prediction error of the CDRH3 region in the FV fragment three-dimensional structure output by the antibody structure prediction model;
The following is a detailed description of step 12 in the embodiment of the present invention, where step 12 specifically includes:
step 121, selecting the first FV fragment data of any specified number K of first antibody structure data from the first data set to form a corresponding second data set;
wherein the designated number K is a positive integer greater than or equal to 1;
step 122, traversing each first FV fragment data of the second data set; during the traversal, taking the first FV fragment data currently traversed as the corresponding current FV fragment data; training the antibody structure scoring model according to the current FV fragment data; when that training ends, moving on to the next first FV fragment data and continuing the traversal until training on the last first FV fragment data of the second data set has finished; when the traversal ends, performing model parameter curing, that is, freezing the parameters of the antibody structure scoring model at their latest values;
the training method for the antibody structure scoring model according to the current FV segment data specifically comprises the following steps:
step A1, taking the first fragment light chain residue sequence, the first fragment heavy chain residue sequence and the first fragment three-dimensional structure label of the current FV fragment data as the corresponding current light chain residue sequence, current heavy chain residue sequence and current FV fragment label structure;
The current FV fragment label structure comprises a first fragment atom set and a first fragment atom connection bond set; the first fragment atom set includes a plurality of first fragment atoms; each first fragment atom corresponds to a group of atomic feature data, including an atom identifier, an atom name, atom coordinates, an atomic element type consisting of an element type and an atom type, a residue type identifier and an atomic charge; the first fragment atom connection bond set includes a plurality of first fragment atom connection bonds; each first fragment atom connection bond corresponds to a group of connection bond information, including a connection bond identifier, a bond head atom identifier, a bond tail atom identifier and a connection bond type;
here, the data structure of the current FV fragment label structure is in fact the same as the data structure of the aforementioned first FV fragment data;
step A2, performing CDRH3 residue fragment sequence identification on the current heavy chain residue sequence based on the IMGT database to obtain a corresponding first heavy chain CDRH3 fragment sequence;
here, the IMGT database is a public database that supports querying antibody CDRH3 fragments; submitting the current heavy chain residue sequence to the database locates the CDRH3 region on the heavy chain, that is, the position of the CDRH3 fragment; the lookup returns the starting residue identifier of the CDRH3 fragment on the heavy chain, and once that identifier is known, the first heavy chain CDRH3 fragment sequence can be read directly from the current heavy chain residue sequence;
Step A3, marking the three-dimensional structure region corresponding to the first heavy chain CDRH3 fragment sequence in the current FV fragment label structure as the corresponding CDRH3 label region, and marking the three-dimensional structure region outside the CDRH3 label region as the corresponding non-CDRH3 label region; marking the first fragment atoms located at the edge of the CDRH3 label region as corresponding first edge atoms, and marking all first fragment atoms in the non-CDRH3 label region whose atomic distance to the first edge atoms does not exceed a preset first distance threshold as corresponding first neighborhood atoms; forming a corresponding first label atom set from all first fragment atoms of the CDRH3 label region and all first neighborhood atoms; extracting, from the first fragment atom connection bond set, the first fragment atom connection bonds whose bond head atom identifier or bond tail atom identifier matches a first fragment atom in the first label atom set, to form a corresponding first label connection bond set; and forming a corresponding first label data set from the first label atom set and the first label connection bond set;
wherein the first distance threshold defaults to 10 Å;
the resulting first label data set consists of the first label atom set and the first label connection bond set, where the first label atom set is in fact the set of all atoms in the CDRH3 region and its neighborhood in the FV fragment label structure, and the first label connection bond set is in fact the set of all atom connection bonds in the CDRH3 region and its neighborhood in the FV fragment label structure;
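The extraction in step A3 can be sketched as follows. This is a minimal illustration, not the embodiment's implementation: the function name `extract_region`, the dictionary and tuple layouts, and the reading that a neighborhood atom need only lie within the threshold of at least one edge atom are all assumptions, and the edge atoms are taken as given because the text does not specify how the region edge is determined.

```python
import math

def extract_region(atoms, bonds, cdrh3_ids, edge_ids, threshold=10.0):
    """Collect CDRH3 atoms, their neighborhood atoms, and the matching bonds.

    atoms: dict mapping atom id -> (x, y, z) coordinates
    bonds: list of (bond_id, head_atom_id, tail_atom_id, bond_type) tuples
    cdrh3_ids: ids of atoms inside the CDRH3 region
    edge_ids: ids of CDRH3 atoms on the region edge (assumed to be given)
    """
    edge_coords = [atoms[i] for i in edge_ids]
    # Atoms outside the CDRH3 region within the threshold of some edge atom.
    neighborhood = {
        atom_id
        for atom_id, xyz in atoms.items()
        if atom_id not in cdrh3_ids
        and any(math.dist(xyz, e) <= threshold for e in edge_coords)
    }
    selected = set(cdrh3_ids) | neighborhood
    # A bond is kept when its head atom or its tail atom is in the selected set.
    selected_bonds = [b for b in bonds if b[1] in selected or b[2] in selected]
    return selected, selected_bonds
```

The same routine would apply unchanged to the sampling structures and predicted structures handled in later steps, since they share the atom and bond layout.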
Step A4, performing constrained motion simulation on the current FV segment label structure based on a constrained dynamics principle, and performing FV segment three-dimensional structure sampling processing in a constrained motion simulation process based on a preset constrained motion simulation sampling rule so as to obtain a plurality of first FV segment sampling structures;
the first FV fragment sampling structure comprises a second fragment atom set and a second fragment atom connection bond set; the second fragment atom set includes a plurality of second fragment atoms; each second fragment atom corresponds to a group of atomic feature data, including an atom identifier, an atom name, atom coordinates, an atomic element type consisting of an element type and an atom type, a residue type identifier and an atomic charge; the second fragment atom connection bond set includes a plurality of second fragment atom connection bonds; each second fragment atom connection bond corresponds to a group of connection bond information, including a connection bond identifier, a bond head atom identifier, a bond tail atom identifier and a connection bond type;
the constrained motion simulation of the current FV fragment label structure in step A4 is similar to the data enhancement performed with a dynamics simulator in step 1133: a group of force field constraints is applied to the original structure based on the constrained dynamics principle and its motion is simulated; during the simulation the original structure can rotate, stretch, fold and so on, producing a series of changed structural conformations; sampling these conformations multiple times according to the preset constrained motion simulation sampling rule yields a plurality of newly added structures, namely the plurality of first FV fragment sampling structures; the constrained motion simulation sampling rule can be preset as a rule that samples by time step or by simulation iteration step, or can be set according to the needs of the actual application;
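One possible form of an iteration-step sampling rule is sketched below. The constrained dynamics simulation itself is not shown; `trajectory` is assumed to be any iterable that yields one conformation per simulation step, and the names `sample_conformations`, `stride` and `max_samples` are hypothetical.

```python
def sample_conformations(trajectory, stride, max_samples=None):
    """Keep every `stride`-th conformation produced by a simulation.

    trajectory: iterable yielding one conformation per simulation step
    stride: number of simulation steps between two retained samples
    max_samples: optional cap on the number of retained samples
    """
    if stride < 1:
        raise ValueError("stride must be at least 1")
    samples = []
    for step, conformation in enumerate(trajectory, start=1):
        if step % stride == 0:
            samples.append(conformation)
            if max_samples is not None and len(samples) >= max_samples:
                break
    return samples
```

Each retained conformation would then play the role of one first FV fragment sampling structure.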
Step A5, marking the three-dimensional structure region corresponding to the first heavy chain CDRH3 fragment sequence in each first FV fragment sampling structure as the corresponding CDRH3 region, and marking the three-dimensional structure region outside the CDRH3 region as the corresponding non-CDRH3 region; marking the second fragment atoms located at the edge of the CDRH3 region as corresponding second edge atoms, and marking all second fragment atoms in the non-CDRH3 region whose atomic distance to the second edge atoms does not exceed the first distance threshold as corresponding second neighborhood atoms; forming a corresponding first training atom set from all second fragment atoms of the CDRH3 region and all second neighborhood atoms; extracting, from the second fragment atom connection bond set, the second fragment atom connection bonds whose bond head atom identifier or bond tail atom identifier matches a second fragment atom in the first training atom set, to form a corresponding first training connection bond set; and forming a corresponding first training data set from the first training atom set and the first training connection bond set;
each resulting first training data set consists of one first training atom set and one first training connection bond set, where the first training atom set is in fact the set of all atoms in the CDRH3 region and its neighborhood in the corresponding FV fragment sampling structure, and the first training connection bond set is in fact the set of all atom connection bonds in the CDRH3 region and its neighborhood in the corresponding FV fragment sampling structure;
Step A6, traversing each first training data set; during the traversal, taking the first training data set currently traversed as the corresponding current training data set; performing model training on the antibody structure scoring model according to the current training data set and the first label data set; if the current model training succeeds, moving on to the next first training data set and continuing the traversal until the model training corresponding to the last first training data set succeeds;
wherein performing model training on the antibody structure scoring model according to the current training data set and the first label data set specifically comprises:
step B1, performing E(3) equivariant graph construction according to the current training data set to obtain a corresponding first training E(3) equivariant graph; performing E(3) equivariant graph construction according to the first label data set to obtain a corresponding first label E(3) equivariant graph;
step B2, performing root mean square deviation calculation on the first training E(3) equivariant graph and the first label E(3) equivariant graph to generate corresponding first label RMSD data;
step B3, inputting the first training E(3) equivariant graph into the SEGNNs network of the antibody structure scoring model for equivariant graph prediction to obtain a corresponding second training E(3) equivariant graph; inputting the second training E(3) equivariant graph into the weighted pooling module of the antibody structure scoring model for root mean square deviation calculation to obtain corresponding first training RMSD data;
Step B4, inputting the first label RMSD data and the first training RMSD data into a loss function of the antibody structure scoring model to calculate to obtain a corresponding second loss value;
the loss function of the antibody structure scoring model defaults to an MSE loss function for performing mean square error calculation on the input label RMSD data and training RMSD data;
step B5, evaluating a second loss value based on a preset second convergence loss range; if the second loss value does not meet the second convergence loss range, go to step B6; if the second loss value meets the second convergence loss range, turning to the step B7;
here, the second convergence loss range is a preset loss value convergence range, and can be adjusted according to actual application requirements;
step B6, substituting model parameters of the antibody structure scoring model into a loss function of the antibody structure scoring model to construct a corresponding second objective function; solving model parameters of the antibody structure scoring model towards the direction of enabling the second objective function to reach the minimum value, and taking the solved result as corresponding second updated model parameters; model parameter updating processing is carried out on the antibody structure scoring model based on the second updating model parameters; and when the model parameter updating process is successful, returning to the step B3 to continue training;
Step B7, confirming that the current model training process is successful;
and step 123, if the model parameter curing (freezing) succeeds, taking the current antibody structure scoring model as the trained antibody structure scoring model.
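The control flow of steps B3 to B7 can be sketched with a deliberately tiny stand-in model. Everything about the model here is an assumption for illustration: instead of the SEGNNs network and weighted pooling module, the "model" predicts an RMSD value as `weight * feature`, the parameter update is plain gradient descent, and the convergence range is taken to be "loss not greater than `loss_range`". Only the MSE loss and the loop structure follow the text.

```python
def mse_loss(label_rmsd, training_rmsd):
    """Squared error between the label RMSD and the predicted (training) RMSD."""
    return (label_rmsd - training_rmsd) ** 2

def train_on_sample(weight, feature, label_rmsd,
                    loss_range=1e-6, lr=0.1, max_iters=10_000):
    """Repeat predict -> loss -> check -> update until the loss converges.

    Returns (weight, last_loss, succeeded).
    """
    loss = float("inf")
    for _ in range(max_iters):
        predicted = weight * feature                      # step B3 (toy model)
        loss = mse_loss(label_rmsd, predicted)            # step B4
        if loss <= loss_range:                            # step B5, then B7
            return weight, loss, True
        grad = 2.0 * (predicted - label_rmsd) * feature   # step B6: gradient
        weight -= lr * grad                               # step B6: update
    return weight, loss, False
```

In the embodiment this inner loop would run once per first training data set, and the outer traversal of step 122 would repeat it for each first FV fragment data before the parameters are frozen.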
Step 2, obtaining a residue sequence of an antibody FV fragment as a corresponding first FV fragment sequence;
wherein the first FV fragment sequence comprises a first heavy chain residue sequence, a first light chain residue sequence.
Step 3, traversing the plurality of groups of structure prediction model parameters; during the traversal, taking the group currently traversed as the corresponding current structure prediction model parameters, setting the model parameters of the antibody structure prediction model to the current structure prediction model parameters, and taking the antibody structure prediction model so configured as the corresponding current antibody structure prediction model; inputting the first heavy chain residue sequence and the first light chain residue sequence into the current antibody structure prediction model for FV fragment three-dimensional structure prediction to obtain a corresponding first FV fragment structure;
wherein the first FV fragment structure comprises a third fragment atom set and a third fragment atom connection bond set; the third fragment atom set includes a plurality of third fragment atoms; each third fragment atom corresponds to a group of atomic feature data, including an atom identifier, an atom name, atom coordinates, an atomic element type consisting of an element type and an atom type, a residue type identifier and an atomic charge; the third fragment atom connection bond set includes a plurality of third fragment atom connection bonds; each third fragment atom connection bond corresponds to a group of connection bond information, including a connection bond identifier, a bond head atom identifier, a bond tail atom identifier and a connection bond type.
Here, it can be known from the foregoing that the model parameters of the M-group antibody structure prediction model, that is, the M-group structure prediction model parameters, are obtained through the model training in the step 1; if the M groups of structure prediction model parameters are respectively substituted into the antibody structure prediction model to be set, obtaining M antibody structure prediction models with different parameters; in the current step 3, the first heavy chain residue sequence and the first light chain residue sequence are respectively input into the antibody structure prediction models with different parameters for performing FV fragment three-dimensional structure prediction, and FV fragment three-dimensional structures with M different structural features, namely M first FV fragment structures, are obtained, wherein the data format of the first FV fragment structures is similar to the data format of the first FV fragment data.
Step 4, inputting each of the M obtained first FV fragment structures into the trained antibody structure scoring model for confidence scoring to obtain a corresponding first structure score;
the method specifically comprises the following steps: step 41, performing CDRH3 residue fragment sequence identification, based on the IMGT database, on the first heavy chain residue sequence corresponding to each first FV fragment structure to obtain a corresponding second heavy chain CDRH3 fragment sequence;
here, submitting the first heavy chain residue sequence to the IMGT database locates the CDRH3 region on the heavy chain, that is, the position of the CDRH3 fragment; the lookup returns the starting residue identifier of the CDRH3 fragment on the heavy chain, and once that identifier is known, the second heavy chain CDRH3 fragment sequence can be read directly from the first heavy chain residue sequence;
Step 42, marking the three-dimensional structure region corresponding to the second heavy chain CDRH3 fragment sequence in each first FV fragment structure as the corresponding predicted CDRH3 label region, and marking the three-dimensional structure region outside the predicted CDRH3 label region as the corresponding non-predicted CDRH3 label region; marking the third fragment atoms located at the edge of the predicted CDRH3 label region as corresponding third edge atoms, and marking all third fragment atoms in the non-predicted CDRH3 label region whose atomic distance to the third edge atoms does not exceed the first distance threshold as corresponding third neighborhood atoms; forming a corresponding first prediction atom set from all third fragment atoms of the predicted CDRH3 label region and all third neighborhood atoms; extracting, from the third fragment atom connection bond set, the third fragment atom connection bonds whose bond head atom identifier or bond tail atom identifier matches a third fragment atom in the first prediction atom set, to form a corresponding first prediction connection bond set; and forming a corresponding first prediction data set from the first prediction atom set and the first prediction connection bond set;
the first prediction data set consists of the first prediction atom set and the first prediction connection bond set, where the first prediction atom set is in fact the set of all atoms in the CDRH3 region and its neighborhood in the FV fragment three-dimensional structure predicted by the antibody structure prediction model, namely the first FV fragment structure, and the first prediction connection bond set is in fact the set of all atom connection bonds in the CDRH3 region and its neighborhood in that predicted first FV fragment structure;
Step 43, performing E(3) equivariant graph construction according to each first prediction data set to obtain a corresponding first prediction E(3) equivariant graph; inputting each first prediction E(3) equivariant graph into the SEGNNs network of the antibody structure scoring model for equivariant graph prediction to obtain a corresponding second prediction E(3) equivariant graph, and inputting the second prediction E(3) equivariant graph into the weighted pooling module of the antibody structure scoring model for root mean square deviation calculation to obtain corresponding first prediction RMSD data;
here, as described above, the antibody structure scoring model of the embodiment of the invention evaluates the prediction error of the CDRH3 region in the FV fragment three-dimensional structure predicted by the antibody structure prediction model, namely the first FV fragment structure, and the evaluation result is the first prediction RMSD data;
step 44, querying, according to the first prediction RMSD data corresponding to each first FV fragment structure, a preset first correspondence table that reflects the correspondence between root mean square deviation and confidence, and extracting, as the corresponding first structure score, the first confidence field of the first correspondence record whose first root mean square deviation range field matches the current first prediction RMSD data;
the first correspondence table includes a plurality of first correspondence records; each first correspondence record includes a first root mean square deviation range field and a first confidence field; the smaller the root mean square deviation of the first root mean square deviation range field, the higher the confidence of the corresponding first confidence field.
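The table lookup of step 44 can be sketched as a scan over range records. The record layout `(low, high, confidence)`, the half-open range convention, and every numeric value in `TABLE` are hypothetical; the text only requires that smaller RMSD ranges map to higher confidence.

```python
def lookup_confidence(rmsd_value, table):
    """Return the confidence of the record whose RMSD range contains rmsd_value.

    table: list of (low, high, confidence) records with half-open ranges [low, high)
    """
    for low, high, confidence in table:
        if low <= rmsd_value < high:
            return confidence
    raise ValueError("no record matches the given RMSD value")

# Hypothetical correspondence table: smaller RMSD range -> higher confidence.
TABLE = [
    (0.0, 1.0, 0.9),
    (1.0, 2.0, 0.7),
    (2.0, 4.0, 0.4),
    (4.0, float("inf"), 0.1),
]
```

With this table, an RMSD of 0.5 falls in the first record and an RMSD of exactly 1.0 falls in the second.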
Step 5, selecting, from the M obtained first structure scores, the first FV fragment structure corresponding to the maximum score as the optimal FV fragment structure and outputting it.
In summary, the method for processing the antibody structure prediction according to the embodiment of the present invention firstly completes training of two models through the step 1; and then in the actual prediction process, carrying out M times of antibody structure prediction through the steps 2-5 by using M antibody structure prediction models and an antibody structure scoring model which are obtained through training in the step 1 and outputting the optimal FV segment structure from the M prediction structures through the antibody structure scoring model. It should be noted that, the embodiment of the present invention may also perform visual display of the three-dimensional structure based on the atom-atom connection bond distribution information of the optimal FV fragment structure.
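The prediction flow of steps 3 to 5 can be summarized in a short sketch. The callables `predict` and `score` are placeholders for the antibody structure prediction model and the antibody structure scoring model; their signatures, and the function name `select_best_structure`, are assumptions made for illustration.

```python
def select_best_structure(parameter_sets, predict, score, heavy_seq, light_seq):
    """Predict one structure per parameter set and return the best-scoring one.

    parameter_sets: the M groups of structure prediction model parameters
    predict: callable (params, heavy_seq, light_seq) -> structure
    score: callable (structure) -> confidence score
    """
    if not parameter_sets:
        raise ValueError("at least one group of model parameters is required")
    best_structure, best_score = None, float("-inf")
    for params in parameter_sets:
        structure = predict(params, heavy_seq, light_seq)   # step 3
        current = score(structure)                          # step 4
        if current > best_score:                            # step 5
            best_structure, best_score = structure, current
    return best_structure, best_score
```

When several structures share the maximum score, this sketch keeps the first one encountered; the text does not say how ties are resolved.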
Step B1 (within step A6 of step 122 of step 12 of step 1) and step 43 of step 4 both refer to E(3) equivariant graph construction, and the procedure is described in detail here; the E(3) equivariant graph construction of the embodiment of the invention specifically comprises the following steps:
step C1, taking the currently input current training data set, first label data set or first prediction data set as the corresponding first data set; and taking the first training atom set and first training connection bond set of the current training data set, or the first label atom set and first label connection bond set of the first label data set, or the first prediction atom set and first prediction connection bond set of the first prediction data set, whichever was input, as the corresponding first atom set and first connection bond set;
Wherein the first atom set includes a plurality of first atoms; each first atom corresponds to a group of atomic feature data, including an atom identifier, an atom name, atom coordinates, an atomic element type consisting of an element type and an atom type, a residue type identifier and an atomic charge; the first connection bond set includes a plurality of first connection bonds; each first connection bond corresponds to a group of connection bond information, including a connection bond identifier, a bond head atom identifier, a bond tail atom identifier and a connection bond type;
step C2, constructing an E(3) equivariant graph node as a corresponding first node for each first atom of the first atom set; setting the first node identifier, first node coordinates, first node element type and first node atom type of that first node from the atom identifier, atom coordinates, element type and atom type of the first atom;
step C3, constructing an E(3) equivariant graph edge as a corresponding first edge for each first connection bond of the first connection bond set; setting the first edge identifier, first edge head node identifier, first edge tail node identifier and first edge connection bond type of that first edge from the connection bond identifier, bond head atom identifier, bond tail atom identifier and connection bond type of the first connection bond;
Step C4, forming a corresponding current E(3) equivariant graph from all the obtained first nodes and all the obtained first edges;
step C5, if the first data set obtained this time is the current training data set, outputting the current E(3) equivariant graph as the corresponding first training E(3) equivariant graph; if the first data set obtained this time is the first label data set, outputting the current E(3) equivariant graph as the corresponding first label E(3) equivariant graph; and if the first data set obtained this time is the first prediction data set, outputting the current E(3) equivariant graph as the corresponding first prediction E(3) equivariant graph.
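Steps C2 to C4 amount to a direct mapping from atoms to nodes and from connection bonds to edges, which can be sketched as follows. The class names `Node` and `Edge`, the dictionary keys of the input records, and the dictionary used to hold the graph are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    node_id: int
    coord: tuple
    element_type: str
    atom_type: str

@dataclass(frozen=True)
class Edge:
    edge_id: int
    head_id: int
    tail_id: int
    bond_type: str

def build_graph(atoms, bonds):
    """Map each atom record to a node and each bond record to an edge.

    atoms: list of dicts with keys "id", "coord", "element", "atom_type"
    bonds: list of dicts with keys "id", "head", "tail", "type"
    """
    nodes = [
        Node(a["id"], tuple(a["coord"]), a["element"], a["atom_type"])
        for a in atoms
    ]
    edges = [Edge(b["id"], b["head"], b["tail"], b["type"]) for b in bonds]
    return {"nodes": nodes, "edges": edges}
```

The same function serves the training, label and prediction data sets; only the name under which the caller stores the result differs, as in step C5.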
Fig. 3 is a block diagram of a processing apparatus for predicting an antibody structure according to a second embodiment of the present invention, where the apparatus is a terminal device or a server for implementing the foregoing method embodiment, or may be an apparatus capable of enabling the foregoing terminal device or the server to implement the foregoing method embodiment, and for example, the apparatus may be an apparatus or a chip system of the foregoing terminal device or the server. As shown in fig. 3, the apparatus includes: a model training module 201, a residue sequence acquisition module 202, a FV fragment structure prediction module 203, a predicted structure scoring module 204, and an optimal FV fragment structure processing module 205.
The model training module 201 is configured to train the antibody structure prediction model based on each preset iteration count n_i to obtain a corresponding plurality of groups of structure prediction model parameters, and to train the antibody structure scoring model; 1 ≤ i ≤ M, where M is a positive integer greater than or equal to 1.
The residue sequence acquisition module 202 is configured to acquire a residue sequence of an FV fragment of an antibody as a corresponding first FV fragment sequence; the first FV fragment sequence comprises a first heavy chain residue sequence and a first light chain residue sequence.
The FV fragment structure prediction module 203 is configured to traverse the plurality of groups of structure prediction model parameters; during the traversal, the group currently traversed is taken as the corresponding current structure prediction model parameters, the model parameters of the antibody structure prediction model are set to the current structure prediction model parameters, and the antibody structure prediction model so configured is taken as the corresponding current antibody structure prediction model; the first heavy chain residue sequence and the first light chain residue sequence are input into the current antibody structure prediction model for FV fragment three-dimensional structure prediction to obtain a corresponding first FV fragment structure.
The predicted structure scoring module 204 is configured to input each of the M obtained first FV fragment structures into the trained antibody structure scoring model for confidence scoring to obtain a corresponding first structure score.
The optimal FV fragment structure processing module 205 is configured to select a first FV fragment structure corresponding to a maximum score from the M obtained first structure scores as an optimal FV fragment structure and output the optimal FV fragment structure.
The processing device for predicting the antibody structure provided by the embodiment of the invention can execute the method steps in the method embodiment, and the implementation principle and the technical effect are similar, and are not repeated here.
It should be understood that the division of the above apparatus into modules is merely a division of logical functions; in an actual implementation the modules may be fully or partially integrated into one physical entity or may be physically separate. The modules may all be implemented as software invoked by a processing element, or all in hardware; alternatively, some modules may be implemented as software invoked by a processing element and the others in hardware. For example, the model training module may be a separately provided processing element, may be integrated in a chip of the above apparatus, or may be stored in a memory of the above apparatus as program code that a processing element of the apparatus calls to execute the functions of the model training module. The other modules are implemented similarly. In addition, all or some of the modules may be integrated together or implemented independently. The processing element described herein may be an integrated circuit with signal processing capability. In implementation, each step of the above method, or each of the above modules, may be completed by an integrated logic circuit of hardware in the processor element or by instructions in the form of software.
For example, the above modules may be one or more integrated circuits configured to implement the above method, such as one or more application-specific integrated circuits (Application Specific Integrated Circuit, ASIC), one or more digital signal processors (Digital Signal Processor, DSP), or one or more field programmable gate arrays (Field Programmable Gate Array, FPGA). For another example, when one of the above modules is implemented as program code scheduled by a processing element, the processing element may be a general purpose processor, such as a central processing unit (Central Processing Unit, CPU) or another processor that can invoke the program code. For another example, the modules may be integrated together and implemented in the form of a system-on-a-chip (SOC).
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in connection with the foregoing method embodiments are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (e.g., coaxial cable, optical fiber, digital subscriber line (Digital Subscriber Line, DSL)) or wireless means (e.g., infrared, wireless, Bluetooth, microwave). The computer-readable storage medium may be any available medium that can be accessed by the computer, or a data storage device such as a server or data center that integrates one or more available media. The available media may be magnetic media (e.g., floppy disk, hard disk, magnetic tape), optical media (e.g., DVD), or semiconductor media (e.g., solid state disk, SSD).
Fig. 4 is a schematic structural diagram of an electronic device according to a third embodiment of the present invention. The electronic device may be a terminal device or server that implements the method of the foregoing embodiment, or may be a terminal device or server that is connected to such a terminal device or server and implements the method of the foregoing embodiment. As shown in fig. 4, the electronic device may include: a processor 301 (e.g., a CPU), a memory 302, and a transceiver 303; the transceiver 303 is coupled to the processor 301, and the processor 301 controls the transceiving actions of the transceiver 303. The memory 302 may store various instructions for performing the various processing functions and implementing the processing steps described in the methods of the previous embodiments. Preferably, the electronic device according to the embodiment of the present invention further includes: a power supply 304, a system bus 305, and a communication port 306. The system bus 305 is used to implement communication connections between the elements. The communication port 306 is used for connection and communication between the electronic device and other peripheral devices.
The system bus 305 referred to in fig. 4 may be a peripheral component interconnect (Peripheral Component Interconnect, PCI) bus, an extended industry standard architecture (Extended Industry Standard Architecture, EISA) bus, or the like. The system bus may be classified into an address bus, a data bus, a control bus, and the like. For ease of illustration, the bus is represented in the figure by only one bold line, but this does not mean that there is only one bus or one type of bus. The communication interface is used to enable communication between the database access apparatus and other devices (e.g., clients, read-write libraries, and read-only libraries). The memory may include random access memory (Random Access Memory, RAM) and may also include non-volatile memory (Non-Volatile Memory), such as at least one disk memory.
The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a network processor (Network Processor, NP), a graphics processor (Graphics Processing Unit, GPU), etc.; but may also be a digital signal processor DSP, an application specific integrated circuit ASIC, a field programmable gate array FPGA or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component.
It should be noted that the embodiments of the present invention also provide a computer-readable storage medium storing instructions that, when run on a computer, cause the computer to perform the methods and processes provided in the above embodiments.
The embodiment of the invention also provides a chip for running the instructions, and the chip is used for executing the processing steps described in the embodiment of the method.
The embodiments of the invention provide a processing method, a processing device, an electronic device and a computer-readable storage medium for antibody structure prediction. An antibody structure prediction model is built in advance based on the model structure of AlphaFold-Multimer, and a corresponding antibody structure scoring model is built based on the SEGNNs network. The antibody structure prediction model is then trained based on M preset iteration counts n_i to obtain M corresponding groups of structure prediction model parameters, and the antibody structure scoring model is trained. The M groups of structure prediction model parameters are each substituted into the antibody structure prediction model to obtain M antibody structure prediction models with different parameters. Then, each time a pair of light and heavy chain residue sequences is obtained, the pair is input into each of the M antibody structure prediction models for structure prediction, the M predicted structures are each given a confidence score by the antibody structure scoring model, and the predicted structure with the highest score is output as the optimal FV segment structure. The invention remedies the technical defect that conventional schemes provide no model targeted at three-dimensional structure prediction of the antibody FV fragment, and also improves the prediction accuracy of the three-dimensional structure of the antibody FV fragment through the combined operation of the antibody structure prediction model and the antibody structure scoring model.
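The predict, score and select flow summarized above can be sketched as follows. This is only a minimal illustration under stated assumptions: `select_best_structure`, `make_toy_predictor` and `toy_score` are hypothetical names, the toy predictors stand in for the M differently parameterized antibody structure prediction models, and the fixed score table stands in for the antibody structure scoring model.

```python
from typing import Callable, Dict, Sequence, Tuple

Structure = Dict[str, int]  # hypothetical placeholder for a predicted FV segment structure


def select_best_structure(
    heavy: str,
    light: str,
    predictors: Sequence[Callable[[str, str], Structure]],
    score: Callable[[Structure], float],
) -> Tuple[Structure, float]:
    """Run every predictor, score each predicted structure, return the top-scoring one."""
    if not predictors:
        raise ValueError("at least one predictor is required")
    candidates = [predict(heavy, light) for predict in predictors]
    scores = [score(candidate) for candidate in candidates]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores[best]


def make_toy_predictor(tag: int) -> Callable[[str, str], Structure]:
    # Toy stand-in: tags its output with the model index so the flow can be traced.
    return lambda heavy, light: {"model": tag, "length": len(heavy) + len(light)}


def toy_score(structure: Structure) -> float:
    # Toy stand-in for the scoring model: a fixed confidence per model index.
    return {0: 0.4, 1: 0.9, 2: 0.7}[structure["model"]]


toy_predictors = [make_toy_predictor(i) for i in range(3)]  # M = 3
best_structure, best_score = select_best_structure("EVQL", "DIQM", toy_predictors, toy_score)
```

In this toy run the second predictor wins because its fixed score is the largest; in the described scheme the score would come from the trained scoring model instead.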
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps are described above generally in terms of function in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The foregoing description of the embodiments has been provided for the purpose of illustrating the general principles of the invention. It describes only particular embodiments of the invention and is not intended to limit the scope of protection of the invention; any modifications, equivalents, improvements, etc. that fall within the spirit and principles of the invention are intended to be included within the scope of protection of the invention.

Claims (12)

1. A method of processing antibody structure prediction, the method comprising:
training the antibody structure prediction model based on a plurality of preset iteration counts n_i to obtain a corresponding plurality of groups of structure prediction model parameters; and training the antibody structure scoring model; 1 ≤ i ≤ M, wherein M is a positive integer greater than or equal to 1;
Obtaining the residue sequence of the antibody FV fragment as a corresponding first FV fragment sequence; the first FV fragment sequence comprising a first heavy chain residue sequence, a first light chain residue sequence;
traversing each group of structure prediction model parameters in the plurality of groups of structure prediction model parameters; during traversal, taking the currently traversed group of structure prediction model parameters as corresponding current structure prediction model parameters, setting the model parameters of the antibody structure prediction model to the current structure prediction model parameters, and taking the antibody structure prediction model so set as a corresponding current antibody structure prediction model; inputting the first heavy chain residue sequence and the first light chain residue sequence into the current antibody structure prediction model to perform FV segment three-dimensional structure prediction processing to obtain a corresponding first FV segment structure;
inputting the obtained M first FV segment structures into the trained antibody structure scoring model respectively for confidence scoring to obtain corresponding first structure scores;
selecting the first FV segment structure corresponding to the maximum score from the M obtained first structure scores as an optimal FV segment structure and outputting the optimal FV segment structure;
The antibody structure prediction model is obtained by migrating the AlphaFold-Multimer model from the machine learning framework JAX to the machine learning framework PyTorch and performing structure optimization; the AlphaFold-Multimer model is a multimer structure prediction model implemented based on the AlphaFold model; the model structures of the antibody structure prediction model and the AlphaFold-Multimer model are consistent with the model structure of the AlphaFold model;
the antibody structure scoring model comprises a SEGNNs network and a weighted pooling module;
the training of the antibody structure prediction model based on the plurality of preset iteration counts n_i to obtain the corresponding plurality of groups of structure prediction model parameters specifically comprises:
training the antibody structure prediction model based on the multimer structure prediction model training mode of the AlphaFold-Multimer model, and taking the trained antibody structure prediction model as a corresponding first training model;
after the first training model is obtained, extracting all antibody structure data in a protein three-dimensional structure database to form a corresponding first data set; the first dataset comprises a plurality of first antibody structure data; the first antibody structural data includes first FV fragment data; the first FV segment data includes a first segment residue sequence and a corresponding first segment three-dimensional structural tag; the first fragment residue sequence comprises a first fragment light chain residue sequence and a first fragment heavy chain residue sequence; the first segment three-dimensional structure tag comprises a first segment atom set and a first segment atom connection bond set; the first segment atom set includes a plurality of first segment atoms; each first fragment atom corresponds to a group of atomic characteristic data, including an atomic identifier, an atomic name, an atomic coordinate, an atomic element type consisting of an element type and an atomic type, a residue type identifier and an atomic electric quantity; the first set of segment atom linkages comprises a plurality of first segment atom linkages; each first segment atom connection key corresponds to a group of connection key information, and comprises a connection key identifier, a key head atom identifier, a key tail atom identifier and a connection key type;
Constructing a secondary training data set according to the first data set to obtain a corresponding first training data set; the first training data set includes a plurality of first training FV fragment data; the first training FV segment data comprises a first training segment light chain residue sequence, a first training segment heavy chain residue sequence and a corresponding first training segment three-dimensional structure tag;
performing secondary training on the first training model according to each iteration count n_i and the first training data set, respectively, to obtain a corresponding group of structure prediction model parameters; the M groups of structure prediction model parameters thus obtained form the corresponding plurality of groups of structure prediction model parameters.
2. The method for processing antibody structure prediction according to claim 1, wherein the constructing of the secondary training data set according to the first data set to obtain a corresponding first training data set specifically includes:
constructing the first training data set initialized to be empty;
adding the first segment light chain residue sequence, the first segment heavy chain residue sequence, and the first segment three-dimensional structure tag of the first FV segment data of each of the first antibody structural data in the first dataset as corresponding first training segment light chain residue sequence, the first training segment heavy chain residue sequence, and the first training segment three-dimensional structure tag to the first training dataset to form corresponding first training FV segment data;
Identifying a preset data enhancement mode; if the data enhancement mode is a first mode, performing motion simulation on the first segment three-dimensional structure labels of the first antibody structure data based on a preset dynamics simulator, and performing FV segment three-dimensional structure sampling processing in a motion simulation process based on a preset motion simulation sampling rule so as to obtain a plurality of sampled FV segment three-dimensional structures; if the data enhancement mode is a second mode, performing diffusion sample data generation processing according to the first segment three-dimensional structure labels of the first antibody structure data based on a preset diffusion model so as to obtain a plurality of sampling FV segment three-dimensional structures; taking each obtained three-dimensional structure of the sampled FV segment as a corresponding first training segment three-dimensional structure label, taking the first segment light chain residue sequence and the first segment heavy chain residue sequence of the first FV segment data corresponding to each three-dimensional structure of the sampled FV segment as the corresponding first training segment light chain residue sequence and the first training segment heavy chain residue sequence, and adding the corresponding first training segment data consisting of the first training segment light chain residue sequence, the first training segment heavy chain residue sequence and the first training segment three-dimensional structure label corresponding to each three-dimensional structure of the sampled FV segment into the first training data set; the data enhancement mode includes a first mode and a second mode; the dynamics simulator comprises a simulator based on the principle of enhanced dynamics and a simulator based on the principle of molecular dynamics.
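The data set construction of claim 2 can be sketched as follows. This is a minimal sketch under stated assumptions: the two samplers are toy stand-ins (a coordinate shift for the dynamics simulator and a coordinate scaling for the diffusion model), the structure tag is reduced to a flat list of numbers, and all function and key names are hypothetical.

```python
from typing import Callable, Dict, List

Coords = List[float]
Record = Dict[str, object]


def toy_dynamics_sampler(label: Coords) -> List[Coords]:
    # Stand-in for motion simulation sampling (first mode): two shifted copies.
    return [[c + shift for c in label] for shift in (0.1, 0.2)]


def toy_diffusion_sampler(label: Coords) -> List[Coords]:
    # Stand-in for diffusion sample generation (second mode): three scaled copies.
    return [[c * scale for c in label] for scale in (0.9, 1.1, 1.2)]


SAMPLERS: Dict[str, Callable[[Coords], List[Coords]]] = {
    "first": toy_dynamics_sampler,
    "second": toy_diffusion_sampler,
}


def build_training_set(first_dataset: List[Record], mode: str) -> List[Record]:
    """Copy every original record, then add sampled structures that reuse its sequences."""
    if mode not in SAMPLERS:
        raise ValueError(f"unknown data enhancement mode: {mode!r}")
    training_set: List[Record] = []
    for record in first_dataset:  # original structure tags are added first
        training_set.append(
            {"light": record["light"], "heavy": record["heavy"], "label": list(record["label"])}
        )
    sampler = SAMPLERS[mode]
    for record in first_dataset:  # then the enhanced samples
        for sampled in sampler(record["label"]):
            training_set.append(
                {"light": record["light"], "heavy": record["heavy"], "label": sampled}
            )
    return training_set


first_dataset = [
    {"light": "DIQM", "heavy": "EVQL", "label": [0.0, 1.0, 2.0]},
    {"light": "EIVL", "heavy": "QVQL", "label": [3.0, 4.0, 5.0]},
]
augmented_first_mode = build_training_set(first_dataset, "first")
augmented_second_mode = build_training_set(first_dataset, "second")
```

Each sampled structure keeps the light and heavy chain residue sequences of the record it was sampled from, so only the structure tag changes.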
3. The method according to claim 1, wherein the performing of secondary training on the first training model according to each iteration count n_i and the first training data set, respectively, to obtain a corresponding group of structure prediction model parameters specifically comprises the following steps:
step 51, initializing the count value of the first training counter to 0; and taking the current iteration count n_i as a corresponding first maximum count value;
step 52 of randomly sampling one of said first training FV fragment data from said first training dataset as corresponding current training FV fragment data;
step 53, using the first training fragment light chain residue sequence, the first training fragment heavy chain residue sequence and the corresponding first training fragment three-dimensional structure tag of the current training FV fragment data as the corresponding current training light chain residue sequence, the current training heavy chain residue sequence and the current fragment three-dimensional structure tag;
step 54, inputting the current training light chain residue sequence and the current training heavy chain residue sequence into the first training model to perform FV fragment three-dimensional structure prediction processing to obtain a corresponding first training FV fragment structure; substituting the first training FV segment structure and the current segment three-dimensional structure label into a model loss function of the first training model to perform loss calculation to obtain a corresponding first loss value;
Step 55, evaluating the first loss value based on a preset first convergence loss range; if the first loss value does not meet the first convergence loss range, go to step 56; if the first loss value meets the first convergence loss range, go to step 57;
step 56, substituting the model parameters of the first training model into the model loss function of the first training model to construct a corresponding first objective function; solving model parameters of the first training model towards the direction of enabling the first objective function to reach the minimum value, and taking a solving result as a corresponding first updated model parameter; model parameter updating processing is carried out on the first training model based on the first updating model parameters; and returning to the step 54 to continue training when the model parameter updating process is successful;
step 57, adding 1 to the count value of the first training counter; identifying whether the count value of the first training counter after adding 1 is larger than the first maximum count value; if the count value of the first training counter is less than or equal to the first maximum count value, selecting the next first training FV fragment data from the first training dataset as new current training FV fragment data, and returning to step 53 to continue training; and if the count value of the first training counter is larger than the first maximum count value, stopping the secondary training of the round and storing the current model parameters of the first training model as a corresponding group of structure prediction model parameters.
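The control flow of steps 51 to 57 can be sketched on a toy problem as follows. This is a sketch under stated assumptions, not the training procedure itself: the "model" is a single scalar weight `w` predicting `w * x`, the loss is a squared error, the update is plain gradient descent, the next sample is drawn at random, and the `max_updates` guard is added only so the example always terminates. All names are hypothetical.

```python
import random
from typing import List, Tuple


def secondary_training(
    w: float,
    dataset: List[Tuple[float, float]],
    n_i: int,
    loss_range: float = 1e-6,
    lr: float = 0.1,
    max_updates: int = 10_000,
    seed: int = 0,
) -> float:
    """Return the parameter obtained after one round of secondary training."""
    rng = random.Random(seed)
    counter = 0                      # step 51: counter starts at 0
    max_count = n_i                  # step 51: n_i is the maximum count value
    x, y = rng.choice(dataset)       # steps 52-53: draw the current training sample
    while True:
        for _ in range(max_updates):
            loss = (w * x - y) ** 2              # step 54: predict and compute the loss
            if loss <= loss_range:               # step 55: inside the convergence range
                break
            w -= lr * 2.0 * (w * x - y) * x      # step 56: update, then back to step 54
        counter += 1                 # step 57: count the finished sample
        if counter > max_count:
            return w                 # stop this round and keep the parameters
        x, y = rng.choice(dataset)   # otherwise continue with another sample


toy_dataset = [(1.0, 2.0), (2.0, 4.0), (0.5, 1.0)]   # samples of y = 2 * x
iteration_counts = [1, 3, 5]                         # M = 3 preset values of n_i
parameter_groups = [secondary_training(0.0, toy_dataset, n) for n in iteration_counts]
```

Running the function once per preset iteration count yields one parameter group per count, which mirrors how the M groups of structure prediction model parameters are collected.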
4. The method for processing antibody structure prediction according to claim 1, wherein the training of the antibody structure scoring model specifically comprises:
selecting any specified number K of the first FV fragment data of the first antibody structural data from the first dataset to form a corresponding second dataset; the designated number K is a positive integer greater than or equal to 1;
traversing each of the first FV fragment data of the second dataset; during traversal, taking the first FV fragment data currently traversed as corresponding current FV fragment data; training the antibody structure scoring model according to the current FV fragment data; when the current training is finished, moving on to the next first FV fragment data to continue the traversal, until the last first FV fragment data of the second dataset has been trained; when the traversal is finished, performing model parameter solidification processing on the antibody structure scoring model based on the latest model parameters of the antibody structure scoring model;
and if the model parameter solidification processing is successful, regarding the current antibody structure scoring model as the trained antibody structure scoring model.
5. The method according to claim 4, wherein training the antibody structure scoring model according to the current FV fragment data comprises:
taking the first fragment light chain residue sequence, the first fragment heavy chain residue sequence and the first fragment three-dimensional structure tag of the current FV fragment data as corresponding current light chain residue sequence, current heavy chain residue sequence and current FV fragment tag structures; the current FV segment tag structure includes the first segment atom set and the first segment atom set of linkages; the first set of fragment atoms includes a plurality of the first fragment atoms; each first fragment atom corresponds to a group of atomic characteristic data, including an atomic identifier, an atomic name, an atomic coordinate, an atomic element type consisting of an element type and an atomic type, a residue type identifier and an atomic electric quantity; the first set of segment atom linkages comprises a plurality of the first segment atom linkages; each first segment atom connection key corresponds to a group of connection key information, and comprises a connection key identifier, a key head atom identifier, a key tail atom identifier and a connection key type;
performing CDRH3 residue fragment sequence identification processing on the current heavy chain residue sequence based on an IMGT database to obtain a corresponding first heavy chain CDRH3 fragment sequence;
marking a three-dimensional structure region corresponding to the first heavy chain CDRH3 fragment sequence in the current FV fragment tag structure as a corresponding CDRH3 tag region, and marking three-dimensional structure regions other than the CDRH3 tag region as corresponding non-CDRH3 tag regions; marking the first fragment atoms located at the region edge in the CDRH3 tag region as corresponding first edge atoms, and marking all the first fragment atoms in the non-CDRH3 tag region whose atomic distance from the first edge atoms does not exceed a preset first distance threshold as corresponding first neighborhood atoms; forming a corresponding first tag atom set from all the first fragment atoms of the CDRH3 tag region and all the first neighborhood atoms; extracting, from the first segment atom connection bond set, the first segment atom connection bonds whose key head atom identifier or key tail atom identifier matches a first segment atom in the first tag atom set to form a corresponding first tag connection key set; the first tag atom set and the first tag connection key set form a corresponding first tag data set; the first distance threshold defaults to 10 Å;
Performing constrained motion simulation on the current FV segment label structure based on a constrained dynamics principle, and performing FV segment three-dimensional structure sampling processing in a constrained motion simulation process based on a preset constrained motion simulation sampling rule so as to obtain a plurality of first FV segment sampling structures; the first FV segment sampling structure includes a second segment atom set and a second segment atom set of linkages; the second set of fragment atoms includes a plurality of the second fragment atoms; each second fragment atom corresponds to a group of atomic characteristic data, including an atomic identifier, an atomic name, an atomic coordinate, an atomic element type consisting of an element type and an atomic type, a residue type identifier and an atomic electric quantity; the second set of fragment atom linkages comprises a plurality of the second fragment atom linkages; each second segment atom connection key corresponds to a group of connection key information, and comprises a connection key identifier, a key head atom identifier, a key tail atom identifier and a connection key type;
marking a three-dimensional structure region corresponding to the first heavy chain CDRH3 fragment sequence in each of the first FV segment sampling structures as a corresponding CDRH3 region, and marking three-dimensional structure regions other than the CDRH3 region as corresponding non-CDRH3 regions; marking the second segment atoms located at the region edge in the CDRH3 region as corresponding second edge atoms, and marking all the second segment atoms in the non-CDRH3 region whose atomic distance from the second edge atoms does not exceed the first distance threshold as corresponding second neighborhood atoms; forming a corresponding first training atom set from all the second fragment atoms of the CDRH3 region and all the second neighborhood atoms; extracting, from the second segment atom connection bond set, the second segment atom connection bonds whose key head atom identifier or key tail atom identifier matches a second segment atom in the first training atom set to form a corresponding first training connection key set; the first training atom set and the first training connection key set form a corresponding first training data set;
traversing each first training data set; during traversal, taking the currently traversed first training data set as a corresponding current training data set; performing model training processing on the antibody structure scoring model according to the current training data set and the first tag data set; if the current model training processing is successful, moving on to the next first training data set to continue the traversal, until the model training processing corresponding to the last first training data set is successful.
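The region selection used in claim 5 (CDRH3 atoms, neighborhood atoms within the distance threshold, and the bonds touching the selected atoms) can be sketched as follows. This is a minimal sketch under stated assumptions: atoms are reduced to an identifier and a coordinate, bonds to an identifier with head and tail atom identifiers, the edge atoms are supplied by the caller because the claim does not say how the region edge is determined, and an atom is treated as a neighbor if it lies within the threshold of at least one edge atom. All names are hypothetical.

```python
import math
from typing import Dict, List, Set, Tuple

Coord = Tuple[float, float, float]
Bond = Tuple[int, int, int]  # (bond identifier, head atom identifier, tail atom identifier)


def select_region(
    atoms: Dict[int, Coord],
    bonds: List[Bond],
    cdrh3_ids: Set[int],
    edge_ids: Set[int],
    threshold: float = 10.0,  # distance threshold in angstroms
) -> Tuple[Set[int], List[Bond]]:
    """Return the selected atom identifiers and the bonds touching any of them."""
    edge_coords = [atoms[atom_id] for atom_id in edge_ids]
    neighborhood = {
        atom_id
        for atom_id, coord in atoms.items()
        if atom_id not in cdrh3_ids
        and any(math.dist(coord, edge) <= threshold for edge in edge_coords)
    }
    selected = set(cdrh3_ids) | neighborhood
    # Keep a bond when its head atom or its tail atom is in the selected set.
    selected_bonds = [b for b in bonds if b[1] in selected or b[2] in selected]
    return selected, selected_bonds


toy_atoms = {
    1: (0.0, 0.0, 0.0),
    2: (1.5, 0.0, 0.0),
    3: (9.0, 0.0, 0.0),
    4: (30.0, 0.0, 0.0),
    5: (31.5, 0.0, 0.0),
}
toy_bonds = [(10, 1, 2), (11, 2, 3), (12, 3, 4), (13, 4, 5)]
selected_atoms, selected_bonds = select_region(toy_atoms, toy_bonds, {1, 2}, {2})
```

Because the matching rule is "head or tail", a bond that leaves the selected region (bond 12 in the toy data) is still kept.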
6. The method according to claim 5, wherein the model training the antibody structure scoring model according to the current training data set and the first tag data set, specifically comprises:
step 81, performing E3 isovariogram construction processing according to the current training data set to obtain a corresponding first training E3 isovariogram; and performing E3 isovariogram construction processing according to the first tag data set to obtain a corresponding first tag E3 isovariogram;
step 82, performing root mean square error calculation on the first training E3 isovariogram and the first tag E3 isovariogram to generate corresponding first tag RMSD data;
step 83, inputting the first training E3 isovariogram into the SEGNNs network of the antibody structure scoring model to perform isovariogram prediction to obtain a corresponding second training E3 isovariogram; inputting the second training E3 isovariogram into the weighted pooling module of the antibody structure scoring model to perform root mean square error calculation to obtain corresponding first training RMSD data;
step 84, inputting the first tag RMSD data and the first training RMSD data into the loss function of the antibody structure scoring model to calculate a corresponding second loss value; the loss function of the antibody structure scoring model defaults to an MSE loss function for performing mean square error calculation on the input label RMSD data and training RMSD data;
step 85, evaluating the second loss value based on a preset second convergence loss range; if the second loss value does not meet the second convergence loss range, go to step 86; if the second loss value meets the second convergence loss range, go to step 87;
step 86, substituting the model parameters of the antibody structure scoring model into the loss function of the antibody structure scoring model to construct a corresponding second objective function; solving model parameters of the antibody structure scoring model towards the direction of enabling the second objective function to reach the minimum value, and taking the solved result as corresponding second updated model parameters; model parameter updating processing is carried out on the antibody structure scoring model based on the second updating model parameters; returning to the step 83 to continue training when the model parameter updating process is successful;
Step 87, confirming that the current model training process is successful.
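The two quantities used in steps 82 and 84 can be sketched as follows. This is a minimal sketch under stated assumptions: the RMSD is computed directly between two coordinate lists whose atoms correspond one to one and in the same order, with no structural superposition (the claim does not specify any), and the MSE loss compares predicted RMSD values with label RMSD values. The function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Coord = Tuple[float, float, float]


def rmsd(coords_a: Sequence[Coord], coords_b: Sequence[Coord]) -> float:
    """Root mean square deviation between two equally long, equally ordered atom lists."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))


def mse_loss(predicted: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between predicted RMSD values and label RMSD values."""
    if len(predicted) != len(target) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)


label_coords: List[Coord] = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
sample_near: List[Coord] = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]   # every atom moved by 1
sample_far: List[Coord] = [(0.0, 3.0, 4.0), (1.0, 3.0, 4.0)]    # every atom moved by 5

label_rmsds = [rmsd(sample_near, label_coords), rmsd(sample_far, label_coords)]
loss_value = mse_loss([1.5, 4.0], label_rmsds)  # the predictions here are made-up numbers
```

In the described training, the label RMSD comes from comparing a sampled structure with the tag structure, while the predicted RMSD comes from the scoring model; the loss then drives the model output toward the label.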
7. The method for processing antibody structure prediction according to claim 6, wherein
the first FV fragment structure comprises a third set of fragment atoms and a third set of fragment atom linkages; the third set of fragment atoms includes a plurality of third fragment atoms; each third fragment atom corresponds to a group of atomic characteristic data, including an atomic identifier, an atomic name, an atomic coordinate, an atomic element type consisting of an element type and an atomic type, a residue type identifier and an atomic electric quantity; the third set of fragment atom linkages comprises a plurality of third fragment atom linkages; and each third segment atom connecting key corresponds to a group of connecting key information, and comprises a connecting key identifier, a key head atom identifier, a key tail atom identifier and a connecting key type.
8. The method according to claim 7, wherein the inputting the obtained M first FV fragment structures into the trained antibody structure scoring model for confidence scoring to obtain corresponding first structure scores comprises:
performing CDRH3 residue fragment sequence identification processing on the first heavy chain residue sequence corresponding to each first FV segment structure based on an IMGT database to obtain a corresponding second heavy chain CDRH3 fragment sequence;
marking a three-dimensional structure region corresponding to the second heavy chain CDRH3 fragment sequence in each of the first FV fragment structures as a corresponding predicted CDRH3 tag region, and marking three-dimensional structure regions other than the predicted CDRH3 tag region as corresponding non-predicted CDRH3 tag regions; marking the third fragment atoms located at the region edge in the predicted CDRH3 tag region as corresponding third edge atoms, and marking all the third fragment atoms in the non-predicted CDRH3 tag region whose atomic distance from the third edge atoms does not exceed the first distance threshold as corresponding third neighborhood atoms; forming a corresponding first prediction atom set from all the third fragment atoms of the predicted CDRH3 tag region and all the third neighborhood atoms; extracting, from the third fragment atom connection bond set, the third fragment atom connection bonds whose key head atom identifier or key tail atom identifier matches a third fragment atom in the first prediction atom set to form a corresponding first prediction connection key set; the first prediction atom set and the first prediction connection key set form a corresponding first prediction data set;
E3 isovariogram construction processing is carried out according to each first prediction data set to obtain a corresponding first prediction E3 isovariogram; inputting each first predicted E3 isovariogram into the SEGNNs network of the antibody structure scoring model for isovariogram prediction to obtain a corresponding second predicted E3 isovariogram, and inputting the second predicted E3 isovariogram into the weighting pooling module of the antibody structure scoring model for root mean square error calculation to obtain corresponding first predicted RMSD data;
inquiring a preset first corresponding relation table reflecting the corresponding relation between root mean square error and confidence coefficient according to the first prediction RMSD data corresponding to each first FV segment structure, and extracting a first confidence coefficient field of a first corresponding relation record in which a first root mean square error range field in the first corresponding relation table is matched with the current first prediction RMSD data as a corresponding first structure score; the first corresponding relation table comprises a plurality of first corresponding relation records; the first correspondence record includes the first root mean square error range field and the first confidence level field; the smaller the root mean square error of the first root mean square error range field is, the higher the confidence of the corresponding first confidence field is.
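The table lookup in the last step of claim 8 can be sketched as follows. This is a minimal sketch under stated assumptions: each record is a half-open RMSD range with a confidence value, and the ranges and confidence values shown are invented for illustration only, since no actual table contents are given. The names are hypothetical.

```python
from typing import List, Tuple

# Each record: (range lower bound, range upper bound, confidence); bounds are [low, high).
CorrespondenceRecord = Tuple[float, float, float]

# Hypothetical table contents: smaller RMSD ranges map to higher confidence.
toy_table: List[CorrespondenceRecord] = [
    (0.0, 1.0, 0.95),
    (1.0, 2.0, 0.80),
    (2.0, 4.0, 0.50),
    (4.0, float("inf"), 0.10),
]


def lookup_confidence(rmsd_value: float, table: List[CorrespondenceRecord]) -> float:
    """Return the confidence of the first record whose range contains the RMSD value."""
    for low, high, confidence in table:
        if low <= rmsd_value < high:
            return confidence
    raise ValueError(f"no record matches RMSD value {rmsd_value}")


predicted_rmsds = [2.7, 0.6, 1.4]  # made-up predicted RMSD values for M = 3 structures
structure_scores = [lookup_confidence(value, toy_table) for value in predicted_rmsds]
best_index = max(range(len(structure_scores)), key=structure_scores.__getitem__)
```

The structure with the smallest predicted RMSD receives the highest score, so selecting the maximum score picks the structure expected to be closest to the true CDRH3 region.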
9. The method for processing antibody structure prediction according to claim 8, wherein the E3 isovariogram construction processing specifically comprises:
taking the current training data set, the first label data set or the first prediction data set which are input at present as a corresponding first data set; the first training atom set and the first training connection key set of the current training data set which are input at the time, or the first label atom set and the first label connection key set of the first label data set, or the first prediction atom set and the first prediction connection key set of the first prediction data set are used as corresponding first atom set and first connection key set; the first set of atoms includes a plurality of first atoms; each first atom corresponds to a group of atomic characteristic data, including an atomic identifier, an atomic name, an atomic coordinate, an atomic element type consisting of an element type and an atomic type, a residue type identifier and an atomic electric quantity; the first set of connection keys includes a plurality of first connection keys; each first connecting key corresponds to a group of connecting key information and comprises a connecting key identifier, a key head atom identifier, a key tail atom identifier and a connecting key type;
Constructing an E3 isograph node as a corresponding first node according to each first atom of the first atom set; setting a first node identifier, a first node coordinate, a first node element type and a first node atom type of the corresponding first node according to the atom identifier, the atom coordinate, the element type and the atom type of each first atom;
constructing an E3 isograph node edge as a corresponding first edge according to each first connection key of the first connection key set; setting a first edge identifier, a first edge head node identifier, a first edge tail node identifier and a first edge connecting key type of the corresponding first edge according to the connecting key identifier, the key head atom identifier, the key tail atom identifier and the connecting key type of each first connecting key;
forming a corresponding current E3 isovariogram by all the obtained first nodes and all the first edges;
if the first data set obtained at this time is the current training data set, outputting the current E3 isovariogram as a corresponding first training E3 isovariogram; if the first data set obtained at this time is the first tag data set, outputting the current E3 isovariogram as a corresponding first tag E3 isovariogram; and if the first data set obtained at this time is the first prediction data set, outputting the current E3 isovariogram as the corresponding first prediction E3 isovariogram.
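The graph construction of claim 9 can be sketched as a plain data-structure transformation. The sketch assumes that the machine-translated "connection key" denotes a chemical bond and that the graph is what is usually called an E(3)-equivariant graph; every class, field and function name below (`Atom`, `Bond`, `Node`, `Edge`, `Graph`, `build_graph`) is hypothetical. It only builds the node and edge records; the equivariance itself would be a property of the network that consumes the graph, which is not shown.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

Coord = Tuple[float, float, float]


@dataclass(frozen=True)
class Atom:
    atom_id: int
    name: str
    coord: Coord
    element_type: str
    atom_type: str
    residue_type_id: int
    charge: float


@dataclass(frozen=True)
class Bond:
    bond_id: int
    head_atom_id: int
    tail_atom_id: int
    bond_type: str


@dataclass(frozen=True)
class Node:
    node_id: int
    coord: Coord
    element_type: str
    atom_type: str


@dataclass(frozen=True)
class Edge:
    edge_id: int
    head_node_id: int
    tail_node_id: int
    bond_type: str


@dataclass(frozen=True)
class Graph:
    nodes: Tuple[Node, ...]
    edges: Tuple[Edge, ...]
    role: str  # "training", "label" or "prediction"


def build_graph(atoms: Sequence[Atom], bonds: Sequence[Bond], role: str) -> Graph:
    """Map each atom to a node and each bond to an edge, then tag the graph by role."""
    if role not in ("training", "label", "prediction"):
        raise ValueError(f"unknown data set role: {role!r}")
    # One node per atom, copying identifier, coordinates, element type and atom type.
    nodes = tuple(Node(a.atom_id, a.coord, a.element_type, a.atom_type) for a in atoms)
    known_ids = {n.node_id for n in nodes}
    edges = []
    for b in bonds:
        # Reject bonds that point at atoms missing from the atom set.
        if b.head_atom_id not in known_ids or b.tail_atom_id not in known_ids:
            raise ValueError(f"bond {b.bond_id} references an unknown atom")
        edges.append(Edge(b.bond_id, b.head_atom_id, b.tail_atom_id, b.bond_type))
    return Graph(nodes, tuple(edges), role)
```

The `role` tag stands in for the final step of the claim, where the same construction yields a training, label or prediction graph depending on which data set was supplied.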
10. An apparatus for performing the method for processing antibody structure prediction according to any one of claims 1-9, characterized in that the apparatus comprises: a model training module, a residue sequence acquisition module, an FV segment structure prediction module, a predicted structure scoring module and an optimal FV segment structure processing module;
the model training module is used for training the antibody structure prediction model based on a plurality of preset iteration counts n_i to obtain a plurality of corresponding groups of structure prediction model parameters, and for training the antibody structure scoring model; 1 ≤ i ≤ M, where M is a positive integer greater than or equal to 1;
the residue sequence acquisition module is used for acquiring the residue sequence of the antibody FV fragment as a corresponding first FV fragment sequence; the first FV fragment sequence comprises a first heavy chain residue sequence and a first light chain residue sequence;
the FV segment structure prediction module is used for traversing the plurality of groups of structure prediction model parameters; taking the currently traversed group as the corresponding current structure prediction model parameters, setting the model parameters of the antibody structure prediction model to the current structure prediction model parameters, and taking the antibody structure prediction model so configured as the corresponding current antibody structure prediction model; and inputting the first heavy chain residue sequence and the first light chain residue sequence into the current antibody structure prediction model to perform FV segment three-dimensional structure prediction processing and obtain a corresponding first FV segment structure;
the predicted structure scoring module is used for inputting each of the obtained M first FV segment structures into the trained antibody structure scoring model for confidence scoring to obtain corresponding first structure scores;
and the optimal FV segment structure processing module is used for selecting, from the obtained M first structure scores, the first FV segment structure corresponding to the maximum score as the optimal FV segment structure, and outputting the optimal FV segment structure.
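The interplay of the prediction, scoring and selection modules can be sketched as a single loop over the M parameter groups. The sketch assumes the prediction and scoring models are available as plain callables; `predict`, `score` and `select_best_fv_structure` are hypothetical names, and the real models are not reproduced here.

```python
from typing import Callable, Optional, Sequence, Tuple, TypeVar

Params = TypeVar("Params")
Structure = TypeVar("Structure")


def select_best_fv_structure(
    heavy_seq: str,
    light_seq: str,
    param_sets: Sequence[Params],
    predict: Callable[[Params, str, str], Structure],
    score: Callable[[Structure], float],
) -> Tuple[Structure, float]:
    """Predict one structure per parameter group and keep the highest-scoring one."""
    if not param_sets:
        raise ValueError("at least one parameter group is required (M >= 1)")
    best_structure: Optional[Structure] = None
    best_score: Optional[float] = None
    for params in param_sets:
        # Configure the prediction model with the current group and predict.
        structure = predict(params, heavy_seq, light_seq)
        # Confidence-score the predicted structure.
        current = score(structure)
        if best_score is None or current > best_score:
            best_structure, best_score = structure, current
    assert best_structure is not None and best_score is not None
    return best_structure, best_score
```

On ties, this sketch keeps the first structure that reached the maximum score; the claim does not say how ties should be resolved.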
11. An electronic device, comprising: memory, processor, and transceiver;
the processor being adapted to couple with the memory, read and execute instructions in the memory to implement the method of any one of claims 1-9;
the transceiver is coupled to the processor and is controlled by the processor to transmit and receive messages.
12. A computer readable storage medium storing computer instructions which, when executed by a computer, cause the computer to perform the method of any one of claims 1-9.
CN202310114453.6A 2023-02-15 2023-02-15 Antibody structure prediction processing method and device Active CN115881220B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310114453.6A CN115881220B (en) 2023-02-15 2023-02-15 Antibody structure prediction processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310114453.6A CN115881220B (en) 2023-02-15 2023-02-15 Antibody structure prediction processing method and device

Publications (2)

Publication Number Publication Date
CN115881220A (en) 2023-03-31
CN115881220B (en) 2023-06-06

Family

ID=85761150

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310114453.6A Active CN115881220B (en) 2023-02-15 2023-02-15 Antibody structure prediction processing method and device

Country Status (1)

Country Link
CN (1) CN115881220B (en)

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3821434A1 (en) * 2018-09-21 2021-05-19 DeepMind Technologies Limited Machine learning for determining protein structures
KR20220011148A (en) * 2019-05-19 2022-01-27 저스트-에보텍 바이오로직스, 아이엔씨. Generation of protein sequences using machine learning techniques
EP3976083A4 (en) * 2019-05-31 2023-07-12 iBio, Inc. Machine learning-based apparatus for engineering meso-scale peptides and methods and system for the same
US20220372068A1 (en) * 2019-12-06 2022-11-24 The Governing Council Of The University Of Toronto System and method for generating a protein sequence
CN112233723B (en) * 2020-10-26 2022-10-25 上海天壤智能科技有限公司 Protein structure prediction method and system based on deep learning
CN114664374A (en) * 2022-01-27 2022-06-24 阿里云计算有限公司 Method and device for determining protein structure

Also Published As

Publication number Publication date
CN115881220A (en) 2023-03-31

Similar Documents

Publication Publication Date Title
CN112784092B (en) Cross-modal image text retrieval method of hybrid fusion model
CN110366734B (en) Optimizing neural network architecture
CN113764037B (en) Method and apparatus for model training, antibody engineering and binding site prediction
Sarkar et al. An algorithm for DNA read alignment on quantum accelerators
Vasicek et al. Trading between quality and non-functional properties of median filter in embedded systems
WO2022100607A1 (en) Method for determining neural network structure and apparatus thereof
CN114564410A (en) Software defect prediction method based on class level source code similarity
CN115881220B (en) Antibody structure prediction processing method and device
CN117316305A (en) Processing method and device of self-assembled short peptide prediction model
CN112270950A (en) Fusion network drug target relation prediction method based on network enhancement and graph regularization
CN115204372B (en) Pre-selection method and system based on term walk graph neural network
Christensen et al. Non-parametric correction of estimated gene trees using TRACTION
US20240006017A1 (en) Protein Structure Prediction
CN112466410B (en) Method and device for predicting binding free energy of protein and ligand molecule
CN115512693A (en) Audio recognition method, acoustic model training method, device and storage medium
Ngo et al. Target-aware variational auto-encoders for ligand generation with multimodal protein representation learning
Xiao et al. Neural PathSim for Inductive Similarity Search in Heterogeneous Information Networks
CN117637029B (en) Antibody developability prediction method and device based on deep learning model
CN117649563B (en) Quantum recognition method, system, electronic device and storage medium for image category
CN116383088B (en) Source code form verification method, device, equipment and storage medium
CN114724646A (en) Molecular attribute prediction method based on mass spectrogram and graph structure
Mumford Crafting neural argumentation networks
Vu et al. Exploring the features of quanvolutional neural networks for improved image classification
Consoli et al. An exact algorithm for the minimum quartet tree cost problem
Bhadwal et al. Nc-vae: normalised conditional diverse variational autoencoder guided de novo molecule generation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant