CN114822696B

CN114822696B - Attention mechanism-based antibody non-sequencing prediction method and device

Info

Publication number: CN114822696B
Application number: CN202210466987.0A
Authority: CN
Inventors: 张林峰; 孙伟杰; 温翰; 许瑞晗
Original assignee: Beijing Shenshi Technology Co ltd
Current assignee: Beijing Shenshi Technology Co ltd
Priority date: 2022-04-29
Filing date: 2022-04-29
Publication date: 2023-04-18
Anticipated expiration: 2042-04-29
Also published as: CN114822696A; WO2023208204A1

Abstract

The invention discloses a method and a device for predicting non-sequencing of an antibody based on an attention mechanism. The method comprises the following steps: acquiring an antibody database, wherein the antibody database is an antibody sequence data set aiming at a specific problem; inputting the antibody database into a non-sequencing neural network model for training until a trained antibody non-sequencing prediction model is obtained, wherein the non-sequencing neural network model is a generalized autoregressive pre-training attention model or a bidirectional generation type pre-training attention model; inputting the information of the antibody to be predicted into the antibody non-sequencing prediction model to obtain the predicted value or probability distribution of all amino acid sequences of the antibody to be predicted. The invention realizes the training of the non-sequencing model for predicting the amino acid sequence information of the antibody, can further apply the trained model to the antibody modification, can use the intermediate result of the model as a characteristic vector to be applied to various antibody property prediction and modification tasks, and has the advantages of wide application range, high prediction precision, flexibility, simplicity and the like.

Description

Attention mechanism-based antibody non-sequencing prediction method and device

Technical Field

The invention relates to the field of biological information and deep learning, in particular to an attention mechanism-based antibody non-sequencing prediction method and device.

Background

An antibody is an immunoglobulin that binds specifically to an antigen. It consists of two identical heavy chains (H chains) and two identical light chains (L chains); the heavy chains are linked to each other by disulfide bonds, and each heavy chain is linked to a light chain by a disulfide bond, forming a symmetrical molecule of paired light and heavy chains. The heavy and light chains are produced from two separate mRNAs and assemble into a full-length immunoglobulin molecule in the endoplasmic reticulum of the B cell. Antigen binding relies primarily on the complementarity determining regions (CDRs) of the antibody: CDR-H1, CDR-H2, and CDR-H3 of the heavy chain and CDR-L1, CDR-L2, and CDR-L3 of the light chain. CDR-H3 is the most diverse of these, while the remaining framework region sequences are usually more conserved. Therefore, once the framework regions have been determined, optimization of antibody affinity usually focuses on the complementarity determining regions, particularly CDR-H3, whose sequence also affects the solubility, expression, and immunogenicity of the antibody. Natural antibody production in humans is very slow, and testing the effectiveness of antibodies by biological experiments is very time-consuming. Antibody prediction techniques make it possible to screen potential antibodies rapidly.

In US2019/0065677A1, a conventional convolutional neural network (CNN) was trained on phage display results to guide the generation of new sequences. Another technical team trained a neural network on data from a small number of lead antibody variants and used it to virtually screen large-scale generated data. However, these prior-art techniques have limitations and give unsatisfactory predictions. They do not use the deep learning frameworks that perform well in natural language processing, and they do not account for the intrinsic correlations within protein sequences; they usually rely on models with a local, fixed order, which degrades the overall performance of the model. In particular, they perform poorly on interactions between residues that are close in space but far apart in sequence.

Disclosure of Invention

The invention mainly provides an attention mechanism-based antibody non-sequencing prediction method and device.

The invention provides a method for predicting non-sequencing of an antibody based on an attention mechanism, which comprises the following steps: obtaining an antibody database, wherein the antibody database is a set of antibody sequence data for a particular question; inputting the antibody database into a non-sequencing neural network model for training, and stopping training until the error is lower than a threshold value or tends to be stable to obtain the trained antibody non-sequencing prediction model, wherein the non-sequencing neural network model is a generalized autoregressive pre-training attention model or a bidirectional generation type pre-training attention model; inputting the information of the antibody to be predicted into the antibody non-sequencing prediction model to obtain the predicted value or probability distribution of all amino acid sequences of the antibody to be predicted.

Optionally, the antibody information to be predicted comprises one or more of the following: the amino acid sequence of a partial site of the antibody to be predicted, the length of the amino acid sequence of the antibody to be predicted, and probability information of amino acid distribution in the amino acid sequences of a plurality of homologous proteins of the antibody to be predicted.

Optionally, the non-sequencing neural network model comprises at least one coding model and at least one decoding model, and the step of inputting the antibody information to be predicted into the antibody non-sequencing prediction model to obtain the antibody prediction result comprises: inputting the information of the antibody to be predicted into the coding model, and obtaining an intermediate result corresponding to the antibody to be predicted through the coding model; and inputting the information of the antibody to be predicted together with the intermediate result into the decoding model, and obtaining the predicted value of all amino acid sequences corresponding to the antibody to be predicted through the decoding model.

Optionally, the input of the attention module is determined by label data with a randomly set prediction order, and the output of the attention module is determined by all of the label data, wherein the label data is either the whole antibody sequence, comprising the framework regions and the complementarity determining regions, or the complementarity determining region sequences alone.

Optionally, for the label data with the randomly set prediction order, a masking method is used to mask the information of part of the sequence, so that the sequence information of the masked part is predicted.

Optionally, the method is applied to the prediction and engineering of antibody solubility and aggregation, comprising: using a model trained by the above attention mechanism-based antibody non-sequencing prediction method on a corresponding antibody or nanobody database; inputting the antibody sequence to be predicted into the model; and taking the intermediate-result feature vector of any layer of the neural network as the input to any machine learning model that performs the prediction task. Further, the attention weights of the intermediate result are used as one basis for selecting the sites to be improved; in combination with other solubility and aggregation improvement methods, the sites to be improved are masked in the sequence to be improved, and the masked sequence is input into the trained model to obtain the optimal distribution of amino acid types at those sites.

Optionally, the method is applied to antibody humanization prediction and engineering, comprising: using a model trained by the above attention mechanism-based antibody non-sequencing prediction method on human sequences only, such as the OAS human data set and the cAb-Rep human data set; inputting the antibody sequence to be predicted into the model; and taking the intermediate-result feature vector of any layer of the neural network as the input to any machine learning model that performs the humanization prediction task. Further, the attention weights of the intermediate result are used as one basis for selecting the sites to be improved; the sites to be improved are masked in the sequence to be improved, and the masked sequence is input into the trained model to obtain the optimal distribution of amino acid types at those sites.

Optionally, the method is applied to the prediction and engineering of antibody expression level, comprising: using a model trained by the above attention mechanism-based antibody non-sequencing prediction method on a corresponding antibody or nanobody database; inputting the antibody sequence to be predicted into the model; and taking the intermediate-result feature vector of any layer of the neural network as the input to any machine learning model that performs the prediction task. Further, the attention weights of the intermediate result are used as one basis for selecting the sites to be improved; in combination with other expression improvement methods, the sites to be improved are masked in the sequence to be improved, and the masked sequence is input into the trained model to obtain the optimal distribution of amino acid types at those sites.

Optionally, the method is applied to antibody-antigen docking and antigen surface site prediction, comprising: using a model trained by the above attention mechanism-based antibody non-sequencing prediction method on a corresponding antibody or nanobody database; inputting the antibody sequence to be predicted into the model; and taking the intermediate-result feature vector of any layer of the neural network as the input to any machine learning model that performs the prediction task, thereby obtaining a prediction of the antibody-antigen binding strength and a prediction of the antigen surface sites.

Optionally, the method is applied to the construction of a display library for artificial antibody engineering, comprising: using a model trained by the above attention mechanism-based antibody non-sequencing prediction method on the required corresponding antibody or nanobody database; masking the sites to be improved in the sequence to be improved and inputting the masked sequence into the trained model to obtain the optimal distribution of amino acid types at those sites; and building a library from that distribution for display and screening on phage, yeast, or mammalian cells.

Optionally, the method is applied to de novo antibody design, comprising: using a model trained by the above attention mechanism-based antibody non-sequencing prediction method on the required corresponding antibody or nanobody database; masking the sites to be designed in the sequence to be designed and inputting the masked sequence into the trained model to obtain the optimal distribution of amino acid types at those sites; and, in combination with a physical model, calculating the binding strength of each candidate design using Rosetta, OpenMM, GROMACS, MM-PB/GBSA, or a deep-learning tool for predicting the effect of the modification, so that the optimal amino acids are selected on a combined basis for the de novo design of the antibody.

In a second aspect, the invention provides an attention mechanism-based antibody non-sequencing prediction device, which comprises: an acquisition module, used for acquiring an antibody database, wherein the antibody database is an antibody sequence data set for a specific problem; a model training module, used for inputting the antibody database into a non-sequencing neural network model for training, and stopping training when the error is lower than a threshold value or tends to be stable, to obtain the trained antibody non-sequencing prediction model, wherein the non-sequencing neural network model is a generalized autoregressive pre-training attention model or a two-way generative pre-training attention model; and an application module, used for inputting the local information of the antibody to be predicted into the antibody non-sequencing prediction model to obtain the predicted value or probability distribution of all amino acid sequences of the antibody to be predicted, or for carrying out the prediction and engineering of antibody solubility and aggregation, antibody humanization prediction and engineering, the prediction and engineering of antibody expression level, antibody-antigen docking and antigen surface site prediction, the construction of an antibody engineering display library, or the de novo design of an antibody.

In conclusion, the technical scheme of the invention realizes the prediction of the amino acid sequence information of the antibody based on the non-sequencing neural network model, and has the following advantages:

(1) Prediction is possible for different sequence lengths, different numbers of known amino acids, different positions of known amino acids, different prediction orders, and so on, so the coverage is wide.

(2) The influence of the CDR-H1 and the CDR-H2 of the heavy chain complementarity determining region on the CDR-H3 is considered, so that the generated sequence can better reflect the integral characteristics of the three CDRs, and the accuracy of the model is improved.

(3) The method overcomes the limitation that an autoregressive model can only work in one fixed order, realizes an autoregressive model in any order, and improves the precision of the model.

(4) The method can not only realize the prediction of the arrangement information of the amino acids in the sequence, but also realize the prediction of the properties of the antibody, and is beneficial to improving the druggability.

(5) Any positions can be fixed while the optimal amino acid distribution is predicted for the remaining positions, so the antibody can be designed flexibly.

(6) The framework regions and the complementarity determining regions are considered separately, which gives better performance across different kinds of tasks.

(7) The model trained by the technical scheme can obtain better high-dimensional spatial embedding for the antibody, so that a better high-dimensional expression vector can be obtained from an intermediate result, and the accuracy of prediction and modification of various antibody properties is improved.

Drawings

For purposes of illustration and not limitation, the present invention will now be described in accordance with its preferred embodiments, particularly with reference to the accompanying drawings, in which:

FIG. 1 is a schematic diagram of the principle of an embodiment of antibody design according to an embodiment of the present invention;

FIG. 2 is a flow chart of a method for non-sequencing prediction of an antibody based on an attention mechanism according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of an embodiment of a modified bacteriophage of an embodiment of the present invention;

FIG. 4 is a schematic illustration of the principles of prediction and engineering of antibody solubility and aggregation according to embodiments of the present invention;

FIG. 5 is a schematic diagram of the prediction and engineering of antibody humanization according to an embodiment of the present invention;

FIG. 6 is a schematic diagram illustrating the prediction and modification of the expression level of an antibody according to an embodiment of the present invention;

FIG. 7 is a schematic diagram of the principles of antibody-antigen docking and antigen surface site prediction according to an embodiment of the present invention;

FIG. 8 is a block diagram of an antibody non-sequencing prediction apparatus based on the attention mechanism according to the embodiment of the present invention.

Detailed Description

As described in the background, conventional antibody design models often use the sequential, local autoregressive generation method that is common in natural language processing (i.e., generating sequentially from front to back). This autoregressive method differs too much from the actual scenario of antibody design, so its predictions leave considerable room for improvement, and in actual antibody sequence design a number of special measures often have to be added to fix the amino acid types at certain positions.

In order to solve this problem, the present invention proposes a method and an apparatus for non-sequencing generation (Random consensus Generation) prediction of antibodies, the principle of which is as follows: once the key sites have been fixed, the remaining sites are filled in a random order. The key point of the invention lies in applying this non-sequencing technique (Random consensus) to antibody design. For example, as shown in FIG. 1, the present invention can predict the amino acids in a sequence in the order "3-2-4-1", i.e., predict the amino acid at the 3rd position first, and then predict the 2nd, 4th, and 1st positions in turn. The technique also introduces a strong attention mechanism and a coding and decoding model to further enhance the expressive power of the model; for example, CDR-H1 and CDR-H2 can be input into the coding model to help the decoding model predict CDR-H3, and each CDR can also be predicted independently. The invention is therefore general across antibody design cases and has broad application potential in downstream scenarios, such as de novo antibody design, construction of broad-spectrum phage display libraries, and construction of task-specific libraries from existing information.
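By way of illustration only, the "3-2-4-1" example can be sketched as follows. The function names `random_prediction_order` and `context_for_each_step` are hypothetical and do not come from the invention; the sketch only shows which positions are already available when each position is predicted.

```python
import random

def random_prediction_order(length, fixed_sites=(), seed=None):
    """Return the 1-based positions to predict, in random order.

    Positions listed in `fixed_sites` are treated as already known and
    are left out of the order.
    """
    rng = random.Random(seed)
    fixed = set(fixed_sites)
    order = [pos for pos in range(1, length + 1) if pos not in fixed]
    rng.shuffle(order)
    return order

def context_for_each_step(order, fixed_sites=()):
    """For each predicted position, list the positions it may condition on."""
    seen = sorted(fixed_sites)
    steps = []
    for pos in order:
        steps.append((pos, tuple(seen)))
        seen = sorted(seen + [pos])
    return steps

# The "3-2-4-1" order from the text: position 3 is predicted with no
# context, position 2 sees 3, position 4 sees 2 and 3, position 1 sees all.
steps = context_for_each_step([3, 2, 4, 1])
```

When some sites are fixed in advance, they appear in the context of every step, which corresponds to retaining key sites and filling the others in a random order.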

As shown in fig. 2, the method for predicting non-sequencing of an antibody based on an attention mechanism according to an embodiment of the present invention mainly includes the following three steps.

Step A: obtaining an antibody database, wherein the antibody database is a set of antibody sequence data for a particular problem. The antibody sequence data set for a specific problem may be one or a combination of the cAb-Rep human database, the OAS human database, and the INDI nanobody sequence database, and may also be another data set oriented to the problem to be solved.

Step B: inputting the antibody database into a non-sequencing neural network model for training, and stopping training when the error is lower than a threshold value or tends to be stable, to obtain a trained antibody non-sequencing prediction model. The non-sequencing neural network model is a generalized autoregressive pre-trained attention model (also called the XLNet model) or a two-way generative pre-trained attention model (also called the two-way GPT model).

Step C: inputting the information of the antibody to be predicted into the antibody non-sequencing prediction model to obtain the predicted value or probability distribution of all amino acid sequences of the antibody to be predicted.
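The stopping condition of Step B (error below a threshold, or error tending to be stable) can be sketched as below. This is a minimal sketch: the function name `should_stop` and the default values of `threshold`, `window`, and `tolerance` are assumptions chosen for illustration, and "stable" is interpreted here as a small spread of the error over the most recent epochs.

```python
def should_stop(losses, threshold=0.05, window=5, tolerance=1e-3):
    """Decide whether training can stop.

    Stops when the latest error is below `threshold`, or when the error
    has varied by less than `tolerance` over the last `window` epochs.
    """
    if not losses:
        return False
    if losses[-1] < threshold:
        return True
    if len(losses) >= window:
        recent = losses[-window:]
        return max(recent) - min(recent) < tolerance
    return False
```

In a training loop, the per-epoch error would be appended to `losses` and the loop would exit as soon as `should_stop(losses)` returns `True`.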

According to the antibody non-sequencing prediction method based on the attention mechanism, through acquiring the information of the antibody to be predicted (such as protein data of the antibody to be predicted and hot spot site information of the antibody combined with the antigen), the predicted value or probability distribution of the amino acid sequence of the antibody to be predicted is obtained through a neural network, so that a plurality of antibody sequences with excellent physicochemical properties and the capability of combining with the antigen can be rapidly designed, and the cost and time of biological experiments are saved.

The method is particularly suited to the scenario in which certain key sites of an existing antibody are retained while the other sites are optimized and modified. For example, as shown in FIG. 3, when only a limited number of sites are to be modified in order to fine-tune activity or improve druggability, the trained neural network model can be used directly for prediction, and the highest-ranked amino acid species can be selected directly for mutation testing. Those skilled in the art can also use the method to predict the sequence at the target sites and select the higher-ranked amino acid species to build a library for display on phage (or yeast, mammalian cells, etc.). This makes the CDR sequences more similar to natural human antibody sequences, which greatly improves druggability and helps to generate synthetic phage display libraries with good overall physicochemical properties. In addition, the method can be combined with other information, such as a structure-based physical energy model, to rationally design a sequence or sequence library with high affinity and developability. Further, the model can be combined with physical-model-based modeling to design antibody sequences de novo by computation, with the residue at each site determined by jointly considering the prediction of the model and the energy calculated by the physical model.
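The library-building step described above (selecting the higher-ranked amino acid species at each target site) can be sketched as follows. The helper names, the toy template sequence, and the probability values are hypothetical and stand in for the output of the trained model.

```python
from itertools import product

def top_amino_acids(distribution, k):
    """Return the k most probable amino acids of one site, best first."""
    ranked = sorted(distribution.items(), key=lambda item: item[1], reverse=True)
    return [aa for aa, _ in ranked[:k]]

def build_library(template, site_distributions, k=2):
    """Enumerate library members by varying only the selected sites.

    `template` is the parent sequence; `site_distributions` maps a 0-based
    site index to a predicted {amino_acid: probability} distribution.
    """
    sites = sorted(site_distributions)
    choices = [top_amino_acids(site_distributions[s], k) for s in sites]
    library = []
    for combo in product(*choices):
        seq = list(template)
        for site, aa in zip(sites, combo):
            seq[site] = aa
        library.append("".join(seq))
    return library

# Hypothetical predicted distributions for two sites of a toy sequence.
predicted = {
    2: {"Y": 0.6, "F": 0.3, "W": 0.1},
    4: {"S": 0.5, "T": 0.4, "A": 0.1},
}
library = build_library("ARDGGY", predicted, k=2)
```

The library size grows as the product of the per-site choices, so `k` and the number of modified sites would in practice be chosen to match the capacity of the display system.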

With this training mode, the method yields a deep learning model that can recommend an optimal amino acid combination at any position to be modified in a wide range of antibody engineering tasks, in particular at given sites to be predicted in the framework regions and the complementarity determining regions (CDRs) of the antibody sequence to be predicted. The advantage of this training approach is that the connection between arbitrary positions is described completely and from scratch, without depending on a fixed order or only on locally adjacent residues. The correlation between the complementarity determining regions and the framework regions is also taken into account, so that modification of the framework regions has less effect on the interaction between the complementarity determining regions and the antigen, and biological activity is retained. Specifically, in the humanization of the framework regions, a model trained only on human sequences can be used, so that the predicted amino acid composition better matches the characteristics of human sequences; in engineering solubility/aggregation, stability, and the like, the model gives the amino acid combination that best matches the characteristics of the antibody itself. The efficiency of engineering is thereby greatly improved.

Training can also be performed using only CDR sequences as model inputs: all the light-chain and heavy-chain CDR sequences (CDR-L1, CDR-L2, CDR-L3, CDR-H1, CDR-H2, CDR-H3) of the cAb-Rep human database, the OAS human database, or the INDI nanobody sequence database are used to train the generalized autoregressive pre-trained attention model (the XLNet model) or the two-way generative pre-trained attention model (the two-way GPT model). Because of the diversity of CDR-H3, training for CDR-H3 is particularly important.

In the attention mechanism-based antibody non-sequencing prediction method, the deep learning model captures the sequence correlations inside the complementarity determining regions well, so that, given part of a complementarity determining region sequence, it can predict the optimal amino acid composition of the remaining positions. Because the training mode does not depend on a specific order and takes global information into account, the model performs better than other methods such as the generative pre-training attention model (GPT) and convolutional neural networks. In antibody affinity and activity engineering tasks for the complementarity determining regions, the model can retain specific sites and give the best combination for the remaining sites. If certain sites are known to be critical for affinity, those sites can be retained and the best combination for the other sites predicted. In addition, in de novo antibody design, several sites can first be determined by a physical energy model and the model then used to generate the best combination for the remaining sites, or the model and the physical model can be combined to design sequences.

Step B is a step of training the model, and the specific content may be: obtaining at least one of protein data of the relevant antibody or data of a homologous protein of the relevant antibody; inputting data into a neural network, and obtaining a predicted value of all amino acid sequences of the related antibody and all amino acid sequences of homologous proteins of the related antibody through the neural network; training the neural network based on the predicted values of the total amino acid sequence of the antibody of interest and the total amino acid sequence of the protein homologous to the antibody of interest, and the true values of the total amino acid sequence of the antibody of interest and the total amino acid sequence of the protein homologous to the antibody of interest. In this implementation, the neural network is trained using the true values of the entire amino acid sequence of the relevant antibody and the entire amino acid sequence of the homologous protein of the relevant antibody as the supervisory information, thereby enabling the neural network to learn the autocorrelation of the internal amino acid sequence of the antibody, and thus enabling the neural network to learn the ability to predict the entire amino acid sequence of the antibody to be predicted.
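The supervision described above compares the predicted values with the true amino acid sequence. One common way to express such an error, offered here as an assumption since the text does not name a loss function, is the mean cross-entropy of the true residues under the predicted per-position distributions. The name `sequence_cross_entropy` and the 20-letter alphabet ordering are illustrative.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def sequence_cross_entropy(predicted_probs, true_sequence, positions=None):
    """Mean negative log-probability of the true residues.

    predicted_probs: array of shape (L, 20), one distribution per position.
    positions: optional indices to score (for example the masked sites);
    all positions are scored when omitted.
    """
    probs = np.asarray(predicted_probs, dtype=float)
    if positions is None:
        positions = range(len(true_sequence))
    positions = list(positions)
    targets = [AA_INDEX[true_sequence[p]] for p in positions]
    picked = probs[positions, targets]
    # A small constant guards against log(0) for zero-probability residues.
    return float(-np.mean(np.log(picked + 1e-12)))
```

A uniform prediction gives an error of log(20), and the error approaches zero as the model places all probability on the true residue at every scored position.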

The antibody information to be predicted may include one or more of the following: the amino acid sequence of a partial site of the antibody to be predicted, the amino acid sequence length of the antibody to be predicted, and probability information of amino acid distribution in the amino acid sequences of a plurality of homologous proteins of the antibody to be predicted. In this implementation, a more accurate prediction can be obtained from easily available protein data of the antibody, without knowledge of a complex three-dimensional crystal structure, so the method is simple, convenient, and practical. If a three-dimensional structure is available from experiment or computational modeling, the antibody sequence can be designed by also taking the energies of a physical model into account: the optimal amino acid types at the sites to be predicted are further ranked by the physical model, which further improves the accuracy of the prediction.
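How the model prediction and the physical-model energy are combined is not specified in the text. The following sketch assumes one simple possibility, a score of log-probability minus a weighted energy, purely for illustration; the function name, the `weight` parameter, and the numbers are hypothetical and are not measured data.

```python
import math

def rerank_with_energy(model_probs, energies, weight=1.0):
    """Rank candidate amino acids for one site.

    model_probs: {amino_acid: probability} from the prediction model.
    energies: {amino_acid: energy} from a physical model, lower is better.
    The combined score is log(p) - weight * energy (higher is better).
    """
    scores = {
        aa: math.log(p) - weight * energies[aa]
        for aa, p in model_probs.items()
        if p > 0 and aa in energies
    }
    return sorted(scores, key=scores.get, reverse=True)

# Toy numbers for one site, not measured data.
ranking = rerank_with_energy(
    {"Y": 0.5, "F": 0.3, "W": 0.2},
    {"Y": -1.0, "F": -2.5, "W": 0.5},
)
```

In this toy case the physical model promotes "F" above the model's first choice "Y"; setting `weight=0` recovers the ranking given by the prediction model alone.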

The non-sequencing neural network model may specifically include an encoding model and a decoding model, and accordingly, the specific process of step C is: inputting the information of the antibody to be predicted into the encoding model, and obtaining an intermediate result corresponding to the antibody to be predicted through the encoding model; and inputting the information of the antibody to be predicted together with the intermediate result into the decoding model, and obtaining the predicted value of all the amino acid sequences corresponding to the antibody to be predicted through the decoding model. In this implementation, the information of CDR-H1 and CDR-H2 is obtained by the encoding model, and the information of CDR-H3 is predicted by the decoding model. Predicting all amino acid sequences of the antibody to be predicted in this way can improve the accuracy of the prediction. The implementation can further take into account the framework sequences outside the complementarity determining regions: the sequence information of the whole framework region is obtained through the encoding model, and the CDR-H3 information is predicted by the decoding model, which can likewise improve the accuracy of the prediction.
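The data flow between the encoding model and the decoding model can be sketched with a single cross-attention step. This is a toy with untrained random weights, intended only to show shapes and flow; the embedding size `DIM`, the random position queries, and the example CDR-H1 and CDR-H2 strings are assumptions, and for brevity the decoder here receives only the intermediate result rather than the antibody information as well.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
rng = np.random.default_rng(0)
DIM = 8
embed = rng.normal(size=(len(AMINO_ACIDS), DIM))   # untrained token embeddings
w_out = rng.normal(size=(DIM, len(AMINO_ACIDS)))   # untrained output projection

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def encode(context):
    """Encoder: map the context residues (e.g. CDR-H1 + CDR-H2) to vectors."""
    return embed[[AMINO_ACIDS.index(aa) for aa in context]]

def decode(intermediate, target_length):
    """Decoder: one query per target position attends over the encoder output."""
    queries = rng.normal(size=(target_length, DIM))  # stand-in position queries
    attention = softmax(queries @ intermediate.T / np.sqrt(DIM))
    hidden = attention @ intermediate
    return softmax(hidden @ w_out), attention

intermediate = encode("GFTFSSYA" + "ISGSGGST")   # toy CDR-H1 and CDR-H2
probs, attention = decode(intermediate, target_length=10)
```

Each row of `probs` is a distribution over the 20 amino acids for one CDR-H3 position, and each row of `attention` shows how strongly that position draws on each encoded context residue.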

The encoding model and the decoding model may each include at least one attention module, the input of the attention module being determined by label data with a randomly set prediction order, and the output of the attention module being determined by all of the label data. For the label data with the randomly set prediction order, a masking method is used to mask the information of part of the sequence, so that the sequence information of the masked part is predicted. Correspondingly, the specific process of step C is as follows: inputting at least one of the data of the antibody to be predicted and the data of the homologous proteins of the antibody to be predicted into a non-sequencing self-attention module in the encoding model, and obtaining an intermediate result corresponding to the antibody to be predicted through that module; and then inputting at least one of the data of the antibody to be predicted and the data of the homologous proteins of the antibody to be predicted, together with the intermediate result, into a non-sequencing self-attention module in the decoding model, and obtaining the predicted value of all the amino acid sequences corresponding to the antibody to be predicted through that module.
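One way to realize an attention module whose input depends on a randomly set prediction order is a permutation attention mask of the kind used in XLNet-style models. The sketch below is an assumption about how such a mask could be built, not the implementation of the invention; `permutation_attention_mask` is a hypothetical name, and `order` together with `known` must cover every position exactly once.

```python
import numpy as np

def permutation_attention_mask(order, known=()):
    """Build a boolean mask for one randomly chosen prediction order.

    order: 0-based positions in the order they are predicted.
    known: 0-based positions whose residues are given as input.
    mask[i, j] is True when the prediction of position i may use position j.
    """
    length = len(order) + len(known)
    # Known positions get step -1 so every predicted position can see them.
    step = {pos: -1 for pos in known}
    step.update({pos: s for s, pos in enumerate(order)})
    mask = np.zeros((length, length), dtype=bool)
    for i in range(length):
        for j in range(length):
            mask[i, j] = step[j] < step[i]
    return mask

# Order "3-2-4-1" from the text, written 0-based as [2, 1, 3, 0].
mask = permutation_attention_mask([2, 1, 3, 0])
```

Row 2 of `mask` is empty because the 3rd position is predicted first, while row 0 allows the 1st position to use the other three. A new mask would be drawn for each training example so that the model sees many different orders.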

The sequence information of the known partial region of the antibody to be predicted is input into the non-sequencing self-attention module to obtain a predicted value for a random position in the not-yet-predicted region of the antibody. This predicted value is then combined with the partial-region sequence information input in the previous round to form the input of a new round, which is again input into the non-sequencing self-attention module, and the process is repeated until predicted values have been obtained for the whole region of the antibody to be predicted. In this implementation, the non-sequencing self-attention module can predict the entire amino acid sequence of the antibody to be predicted in any order, and a more accurate prediction can be obtained in this way.
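The round-by-round procedure above can be sketched as a loop. The `toy_predictor` below is a hypothetical stand-in for the trained non-sequencing self-attention module (it merely favours residues already present) and the `-` mask symbol is an assumption; only the control flow is meant to mirror the text.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MASK = "-"

def toy_predictor(partial, position):
    """Hypothetical stand-in for the trained model: returns a distribution.

    It simply favours the residues most common among the filled positions.
    """
    probs = {aa: 1.0 for aa in AMINO_ACIDS}
    for aa in partial:
        if aa != MASK:
            probs[aa] += 1.0
    total = sum(probs.values())
    return {aa: p / total for aa, p in probs.items()}

def decode_in_random_order(partial, predict=toy_predictor, seed=None):
    """Fill every masked position, one per round, in a random order."""
    rng = random.Random(seed)
    sequence = list(partial)
    pending = [i for i, aa in enumerate(sequence) if aa == MASK]
    rng.shuffle(pending)
    for position in pending:
        # Each round sees the known residues plus all earlier predictions.
        distribution = predict("".join(sequence), position)
        # Greedy choice; sampling from the distribution is equally possible.
        sequence[position] = max(distribution, key=distribution.get)
    return "".join(sequence), pending

filled, order = decode_in_random_order("AR--GY-DY", seed=1)
```

The known residues are never overwritten, so fixed key sites survive decoding unchanged, and `order` records the random order in which the masked positions were filled.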

In order to make the skilled person better understand the method for predicting non-sequencing of an antibody based on an attention mechanism of the present invention, several specific application scenarios are listed below for explanation.

Application scenario 1: prediction and modification of antibody solubility and aggregation

According to the attention mechanism-based antibody non-sequencing prediction method, a model is trained on a corresponding antibody or nanobody database, the antibody sequence to be predicted is input into the model, and the intermediate-result feature vector of any layer of the neural network is taken as the input to any machine learning model that performs the prediction task. Further, the attention weights of the intermediate result are used as one basis for selecting the sites to be improved; in combination with other solubility and aggregation improvement methods, the sites to be improved are masked in the sequence to be improved, and the masked sequence is input into the trained model to obtain the optimal distribution of amino acid types at those sites. The operation can be understood with reference to FIG. 4 of the specification.
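Feeding an intermediate-result feature vector into "any machine learning model" can be sketched as below, with ridge regression chosen arbitrarily as the downstream model. The hidden states and property labels are synthetic stand-ins generated at random, not antibody data, and mean pooling over residues is an assumed way of reducing one layer's output to a single vector.

```python
import numpy as np

def pool_features(hidden_states):
    """Average the per-residue vectors of one layer into one vector."""
    return np.asarray(hidden_states, dtype=float).mean(axis=0)

def fit_ridge(features, labels, alpha=1.0):
    """Closed-form ridge regression as the downstream model."""
    x = np.asarray(features, dtype=float)
    y = np.asarray(labels, dtype=float)
    gram = x.T @ x + alpha * np.eye(x.shape[1])
    return np.linalg.solve(gram, x.T @ y)

def predict(weights, features):
    """Apply the fitted linear model to one or more feature vectors."""
    return np.asarray(features, dtype=float) @ weights

# Synthetic stand-ins: 30 antibodies, 12 residues each, 16-dimensional layer output.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(30, 12, 16))
features = np.stack([pool_features(h) for h in hidden])
true_w = rng.normal(size=16)
labels = features @ true_w             # synthetic property values, noise-free
weights = fit_ridge(features, labels, alpha=1e-6)
```

Any other regressor or classifier could take the place of `fit_ridge`; the point is only that the pooled vector, rather than the raw sequence, is what the downstream property model consumes.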

Application scenario 2: antibody humanization prediction and engineering

According to the attention mechanism-based antibody non-sequencing prediction method, a model is trained on human sequences such as the OAS human data set and the cAb-Rep human data set. The antibody sequence to be predicted is input into the trained model, and the intermediate-result feature vector of any layer of the neural network is taken as the input of any machine learning model to perform the humanization prediction task. Further, the attention weights of the intermediate results are used as one basis for selecting the sites to be improved. The sites to be improved in the sequence to be improved are masked, and the masked sequence is input into the trained model to obtain the optimal amino acid type distribution for those sites. This operation can be understood with reference to fig. 5 of the specification.

Application scenario 3: prediction and engineering of antibody expression levels

According to the attention mechanism-based antibody non-sequencing prediction method, a model is trained on a corresponding antibody or nanobody database. The antibody sequence to be predicted is input into the trained model, and the intermediate-result feature vector of any layer of the neural network is taken as the input of any machine learning model to perform the prediction task. Further, the attention weights of the intermediate results are used as one basis for selecting the sites to be improved, in combination with other expression level improvement methods. The sites to be improved in the sequence to be improved are masked, and the masked sequence is input into the trained model to obtain the optimal amino acid type distribution for those sites. This operation can be understood with reference to fig. 6 of the specification.

Application scenario 4: antibody-antigen docking and antigen surface position prediction

According to the attention mechanism-based antibody non-sequencing prediction method, a model is trained on a corresponding antibody or nanobody database. The antibody sequence to be predicted is input into the trained model, and the intermediate-result feature vector of any layer of the neural network is taken as the input of any machine learning model to perform the prediction task. Finally, the predicted antibody-antigen binding strength and the predicted antigen surface position are obtained. This operation can be understood with reference to fig. 7 of the specification.
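The idea of feeding an intermediate-result feature vector into a downstream machine learning model can be sketched as follows. This is a minimal example under stated assumptions: the per-residue features are averaged into one fixed-length vector (mean pooling is an assumption, not something the specification requires), and a plain linear model with hypothetical weights stands in for "any machine learning model".

```python
def mean_pool(residue_features):
    """Average per-residue feature vectors into one sequence-level vector."""
    n = len(residue_features)
    dim = len(residue_features[0])
    return [sum(row[d] for row in residue_features) / n for d in range(dim)]


def linear_predict(features, weights, bias=0.0):
    """Hypothetical downstream model: a weighted sum of the pooled features."""
    return sum(f * w for f, w in zip(features, weights)) + bias
```

In practice the pooled vector could be passed to any regressor or classifier trained on labeled property data, for example binding strength measurements.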

Application scenario 5: construction of artificial antibody modified display library

According to the attention mechanism-based antibody non-sequencing prediction method, a model is trained on the required corresponding antibody or nanobody database. The sites to be improved in the sequence to be improved are masked, and the masked sequence is input into the trained model to obtain the optimal amino acid type distribution for those sites. A library is then built from these distributions for phage, yeast, or mammalian cell display screening. This operation can be understood with reference to fig. 3 of the specification.
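One simple way to turn per-site amino acid distributions into a display library is to keep the most probable amino acids at each site and enumerate all combinations. The Python sketch below illustrates this; the function name `build_library`, the `top_n` cutoff, and the example distributions are assumptions for illustration, and a real library design would typically also account for codon usage and library size limits.

```python
from itertools import product


def build_library(template, site_distributions, top_n=2):
    """Enumerate variants of `template` from per-site amino acid distributions.

    `site_distributions` maps a site index to a dict {amino_acid: probability}.
    Only the `top_n` most probable amino acids are kept at each site.
    """
    sites = sorted(site_distributions)
    choices = []
    for site in sites:
        dist = site_distributions[site]
        choices.append(sorted(dist, key=dist.get, reverse=True)[:top_n])

    library = []
    for combo in product(*choices):
        chars = list(template)
        for site, aa in zip(sites, combo):
            chars[site] = aa
        library.append("".join(chars))
    return library
```

With two sites and `top_n=2`, the function returns four variants; the library size grows as `top_n` raised to the number of masked sites.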

Application scenario 6: antibody de novo design

According to the attention mechanism-based antibody non-sequencing prediction method, a model is trained on the required corresponding antibody or nanobody database. The sites to be designed in the sequence to be designed are masked, and the masked sequence is input into the trained model to obtain the optimal amino acid type distribution for those sites. At the same time, the binding strength of each candidate design is calculated with physical models using tools such as Rosetta, OpenMM, GROMACS, or MM-PB/GBSA, or the improvement effect is predicted by deep learning, so that the optimal amino acids are selected comprehensively for antibody de novo design. This operation can be understood with reference to fig. 3 of the specification.
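The specification does not state how the model distribution and the physics-based binding strength are combined. As one hedged possibility, the sketch below ranks candidate amino acids at a single site by the log-probability from the model minus a weighted binding energy (lower energy taken as stronger binding). The scoring formula, the `weight` parameter, and the example numbers are all assumptions; the energies would come from an external tool and are simply supplied as a dictionary here.

```python
import math


def rank_candidates(model_probs, binding_energies, weight=1.0):
    """Rank amino acids by log-probability minus weighted binding energy.

    `model_probs` maps amino acid to model probability (must be > 0).
    `binding_energies` maps amino acid to an externally computed energy,
    where more negative means stronger predicted binding (an assumption).
    """
    scores = {
        aa: math.log(model_probs[aa]) - weight * binding_energies[aa]
        for aa in model_probs
    }
    return sorted(scores, key=scores.get, reverse=True)
```

Setting `weight=0.0` reduces the ranking to the model distribution alone, which makes it easy to see how much the physical term changes the selection.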

As shown in fig. 8, the attention mechanism-based antibody non-sequencing prediction device according to the embodiment of the present invention mainly includes an acquisition module 100, a training module 200, and an application module 300. The acquisition module 100 is used to acquire an antibody database, which is an antibody sequence data set for a specific problem. In particular, the antibody database may be one or a combination of the cAb-Rep human database, the OAS human database, and the INDI nanobody sequence database. The training module 200 is configured to input the antibody database into a non-sequencing neural network model for training, and to stop training when the error falls below a threshold or stabilizes, so as to obtain a trained antibody non-sequencing prediction model, where the non-sequencing neural network model is a generalized autoregressive pre-training attention model (also called the XLNet model) or a bidirectional generative pre-training attention model (also called the bidirectional GPT model). The application module 300 is used for inputting the information of the antibody to be predicted into the antibody non-sequencing prediction model to obtain the predicted values or probability distribution of all amino acid sequences of the antibody to be predicted, or for realizing prediction and modification of antibody solubility and aggregation, antibody humanization prediction and modification, antibody expression level prediction and modification, antibody-antigen docking and antigen surface position prediction, construction of an antibody modification display library, or antibody de novo design.
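The stopping rule used by the training module (error below a threshold, or error tending to be stable) can be expressed as a small check on the history of training errors. The sketch below is illustrative only: the default `threshold`, the `patience` window, and the tolerance `tol` used to decide that the error has stabilized are assumed values, not figures from the specification.

```python
def should_stop(errors, threshold=0.05, patience=3, tol=1e-3):
    """Decide whether training should stop, given the error after each epoch.

    Stops when the latest error is below `threshold`, or when the error has
    varied by less than `tol` over the last `patience + 1` recorded epochs.
    """
    if not errors:
        return False
    if errors[-1] < threshold:
        return True
    if len(errors) > patience:
        recent = errors[-(patience + 1):]
        return max(recent) - min(recent) < tol
    return False
```

A training loop would append the error after each epoch and break out of the loop as soon as `should_stop` returns `True`.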

In the application module 300 of the antibody non-sequencing prediction device, the information of the antibody to be predicted may include one or a combination of the following: the amino acid sequence of a partial site of the antibody to be predicted, the amino acid sequence length of the antibody to be predicted, and probability information of the amino acid distribution in the amino acid sequences of a plurality of homologous proteins of the antibody to be predicted.

In the training module 200 of the antibody non-sequencing prediction device, the non-sequencing neural network model may include an encoding model and a decoding model. The encoding model receives the input information of the antibody to be predicted and processes it to obtain an intermediate result corresponding to the antibody to be predicted. The decoding model receives the input information of the antibody to be predicted together with the intermediate result to obtain predicted values of all amino acid sequences of the antibody to be predicted. The encoding model and the decoding model each comprise at least one attention module; the input of the attention module is determined by tag data whose prediction order is set randomly, and the output of the attention module is determined by all of the tag data. The tag data with a randomly set prediction order uses a masking method to mask the information of part of the sequence, so that the sequence information of the masked part can be predicted.
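A randomly set prediction order combined with masking can be pictured as an attention mask derived from a random permutation, in the spirit of permutation language modeling as used by XLNet. The Python sketch below is an assumption-laden illustration rather than the patented module: it draws a random order over the sequence positions and allows each position to attend only to positions that come earlier in that order, so everything later in the order is masked.

```python
import random


def permutation_attention_mask(length, seed=0):
    """Build an attention mask from a randomly drawn prediction order.

    Returns (order, mask), where `order` lists positions in the order they
    are predicted and `mask[i][j]` is True if position i may attend to
    position j, i.e. if j is predicted before i.
    """
    rng = random.Random(seed)
    order = list(range(length))
    rng.shuffle(order)
    rank = {pos: step for step, pos in enumerate(order)}
    mask = [[rank[j] < rank[i] for j in range(length)] for i in range(length)]
    return order, mask
```

The first position in the order sees no context, while the last position sees every other position; drawing a new permutation for each training example exposes the model to many different prediction orders.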

In summary, the method and the device for predicting the non-sequencing of the antibody based on the attention mechanism, provided by the embodiment of the invention, realize the prediction of the amino acid sequence information of the antibody based on the non-sequencing neural network model, and have the following advantages:

(1) The method supports prediction under different sequence lengths, different numbers of known amino acids, different positions of known amino acids, different prediction orders, and the like, and thus has wide coverage.

(5) Individual sites can be fixed while the optimal amino acid composition distribution of the remaining sites is predicted, which facilitates flexible antibody design.

(6) The framework regions and the complementarity determining regions are considered separately, which may give better performance for different kinds of tasks.

The foregoing is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, various modifications and amendments can be made without departing from the principle of the present invention, and these modifications and amendments should also be considered as the protection scope of the present invention.

Claims

1. An attention mechanism-based antibody non-sequencing prediction method is characterized by being applied to prediction and modification of antibody solubility and aggregation, antibody humanization prediction and modification, antibody expression amount prediction and modification, antibody antigen docking and antigen surface site prediction, construction of an antibody modification display library or antibody de novo design, and comprising the following steps of:

obtaining an antibody database, wherein the antibody database is an antibody sequence data set for a specific problem, and the antibody sequence data set for the specific problem is one or a combination of more of a cAb-Rep human source database, an OAS human source database, or an INDI nanobody sequence database;

inputting the antibody database into a non-sequencing neural network model for training, and stopping training until the error is lower than a threshold value or tends to be stable to obtain a trained antibody non-sequencing prediction model, wherein the method specifically comprises the following steps: obtaining at least one of protein data of a related antibody or data of a homologous protein of the related antibody, inputting the data into a neural network, obtaining a predicted value of all amino acid sequences of the related antibody and all amino acid sequences of the homologous protein of the related antibody via the neural network, training the neural network according to the predicted values of all amino acid sequences of the related antibody and all amino acid sequences of the homologous protein of the related antibody and truth values of all amino acid sequences of the related antibody and all amino acid sequences of the homologous protein of the related antibody, wherein the non-sequencing neural network model is a generalized autoregressive pre-training attention model or a two-way generative pre-training attention model, and specifically comprises an encoding model and a decoding model: the encoding model and the decoding model each comprise at least one attention module;

inputting the information of the antibody to be predicted into the antibody non-sequencing prediction model to obtain the predicted value or probability distribution of all amino acid sequences of the antibody to be predicted, wherein the method specifically comprises the following steps: obtaining information of CDR-H1 and CDR-H2 by adopting an encoding model, and predicting information of CDR-H3 by adopting a decoding model; further considering the framework sequences of the non-complementarity determining regions, acquiring the sequence information of the whole framework region through a coding model and predicting CDR-H3 information by adopting a decoding model: inputting at least one of data of an antibody to be predicted and a homologous protein of the antibody to be predicted into a non-sequencing self-attention module in a coding model, obtaining an intermediate result corresponding to the antibody to be predicted through the non-sequencing self-attention module, inputting at least one of the data of the antibody to be predicted and the homologous protein of the antibody to be predicted and the intermediate result into the non-sequencing self-attention module in the coding model, obtaining a predicted value of a random position in an antibody unpredicted region through the non-sequencing self-attention module, inputting sequence information of a partial region of the antibody to be predicted into the non-sequencing self-attention module, obtaining a predicted value of a random position in the antibody unpredicted region, combining the predicted value and the sequence information of the partial region input in the previous round as input of a new round, inputting the non-sequencing self-attention module, repeating the above processes until obtaining the predicted value of the whole region of the antibody to be predicted, wherein the antibody information to be predicted comprises one or a combination of the following matters: the amino acid sequence of a partial site of the antibody to be predicted, the length of the amino acid sequence of the antibody to be predicted, and probability information of amino acid distribution in the amino acid sequences of a plurality of homologous proteins of the antibody to be predicted.

2. The method of claim 1, wherein the input of the attention module is determined by randomly assigning a predicted sequence of tag data and the output of the attention module is determined by the total tag data, wherein the tag data is the entire antibody sequence including the framework region and the CDRs or the sequence of the pure CDRs.

3. The method of claim 2, wherein the tag data randomly set in the prediction order masks information of the partial sequence using a masking method to predict sequence information of the masked partial sequence.

4. An attention mechanism-based antibody non-sequencing prediction device for realizing prediction and modification of antibody solubility and aggregation, antibody humanization prediction and modification, antibody expression amount prediction and modification, antibody-antigen docking and antigen surface site prediction, antibody modification display library construction or antibody de novo design, the device comprising:

an obtaining module, configured to obtain an antibody database, where the antibody database is an antibody sequence data set for a specific problem, and the antibody sequence data set for the specific problem is one or a combination of a cAb-Rep human source database, an OAS human source database, or an INDI nanobody sequence database;

the model training module is used for inputting the antibody database into a non-sequencing neural network model for training, stopping training until the error is lower than a threshold value or tends to be stable, obtaining a trained antibody non-sequencing prediction model, and further used for: obtaining at least one of protein data of a related antibody or data of a homologous protein of the related antibody, inputting the data into a neural network, obtaining a predicted value of all amino acid sequences of the related antibody and all amino acid sequences of the homologous protein of the related antibody via the neural network, training the neural network according to the predicted values of all amino acid sequences of the related antibody and all amino acid sequences of the homologous protein of the related antibody and truth values of all amino acid sequences of the related antibody and all amino acid sequences of the homologous protein of the related antibody, wherein the non-sequencing neural network model is a generalized autoregressive pre-training attention model or a two-way generative pre-training attention model, and specifically comprises an encoding model and a decoding model: the coding model and the decoding model each comprise at least one attention module;

an application module, configured to input local information of the antibody to be predicted into the antibody non-sequencing prediction model to obtain a prediction value or probability distribution of all amino acid sequences of the antibody to be predicted, and further configured to: obtaining information of CDR-H1 and CDR-H2 by adopting an encoding model, and predicting information of CDR-H3 by adopting a decoding model; further considering the framework sequences of the non-complementarity determining regions, acquiring the sequence information of the whole framework region through a coding model and predicting CDR-H3 information by adopting a decoding model: inputting at least one of data of an antibody to be predicted and a homologous protein of the antibody to be predicted into a non-sequencing self-attention module in a coding model, obtaining an intermediate result corresponding to the antibody to be predicted through the non-sequencing self-attention module, then inputting at least one of the data of the antibody to be predicted and the homologous protein of the antibody to be predicted and the intermediate result into the non-sequencing self-attention module in the coding model, obtaining a predicted value of all amino acid sequences corresponding to the antibody to be predicted through the non-sequencing self-attention module, inputting sequence information of a partial region of the antibody to be predicted into the non-sequencing self-attention module, obtaining a predicted value of a random position in the antibody non-predicted region, combining the predicted value and the sequence information of the partial region input in the previous round as input of a new round, inputting the non-sequencing self-attention module, repeating the above processes until the predicted value of all regions of the antibody to be predicted is obtained, wherein the antibody information to be predicted comprises one or more than the following: the amino acid sequence of a partial site of the antibody to be predicted, the length of the amino acid sequence of the antibody to be predicted, and probability information of amino acid distribution in the amino acid sequences of a plurality of homologous proteins of the antibody to be predicted.