WO2023163518A1

WO2023163518A1 - Immunogenic determinant predicting method and immunogenic binding site predicting method

Info

Publication number: WO2023163518A1
Application number: PCT/KR2023/002582
Authority: WO
Inventors: 박민준; 서승우; 박은영; 김진한
Original assignee: 주식회사 스탠다임
Priority date: 2022-02-25
Filing date: 2023-02-23
Publication date: 2023-08-31

Abstract

The present invention relates to an immunogenic determinant predicting method and an immunogenic binding site predicting method, whereby an immunogenic determinant and/or an immunogenic binding site specifically binding thereto can be predicted more reliably in proteins or cells inducing an immune response.

Description

Methods for predicting immunogenic determinants and immunogen binding sites

This application claims priority based on Patent Application No. 2022-0025194, filed with the Korean Intellectual Property Office on February 25, 2022, and Patent Application No. 2023-0023506, filed on February 22, 2023, and all contents disclosed in the specifications of those applications are incorporated into this application.

The present invention relates to a method for predicting an immunogen determinant and a method for predicting an immunogen binding portion capable of more reliably predicting an immunogen determinant and/or an immunogen binding portion that specifically binds thereto in a protein or cell that causes an immune response.

The immune system refers to a defense system that defends against and neutralizes infectious pathogens or harmful proteins in vivo, and an immune response may refer to any reaction of the immune system in which such pathogens or proteins are specifically recognized or bound.

In such an immune response, an amino acid residue (or sequence) within an immunogen (e.g., an antigen), such as the infectious pathogen or harmful protein, that is specifically recognized by an immune protein or the like can be defined as an immunogenic determinant, typified by an antigenic determinant (epitope). In addition, an amino acid residue in an immune protein such as an antibody that specifically recognizes and binds to an immunogenic determinant in an immunogen may be defined as an immunogen binding site, typified by an antigen binding site (paratope).

In addition, antigenic determinants can be divided into linear epitopes and structural epitopes according to their shape and their mode of interaction with an antigen binding site. Of these, a linear epitope consists of a continuous linear amino acid sequence and can bind to the one-dimensional structure of the antigen binding site, whereas a structural epitope reflects the three-dimensional structure of the protein, consists of a discontinuous amino acid sequence, and can bind to the three-dimensional structure of the antigen binding site.

Recently, interest in immune-based therapeutic agents and treatment methods based on immune responses has greatly increased, and related research and development are being actively conducted. Since such immune-based therapeutic agents are developed on the basis of the specific binding and immune response between an immunogenic determinant, such as an antigenic determinant, and an immunogen binding site, such as an antigen binding site, predicting or confirming the immunogenic determinant within an immunogen and the immunogen binding site within an immune protein has emerged as a very important problem in the development of effective immune-based therapeutic agents.

However, since it is very difficult to confirm an immunogenic determinant or immunogen binding site experimentally on the basis of an in vivo immune response, several methods for predicting such immunogenic determinants or immunogen binding sites, or candidate groups thereof, have been proposed, and research on them is ongoing.

However, the methods for predicting immunogenic determinants and immunogen binding sites known to date have not been sufficiently reliable. This is mainly due to the lack of available data on the sequences or structures of immunogens and immune proteins. In particular, reliably predicting a structural epitope and the three-dimensional antigen binding site that binds to it requires a deep understanding of, and a large amount of data on, the three-dimensional folding structure of target proteins such as immunogens and immune proteins. Because such data have not been sufficiently secured, predicting structural epitopes and three-dimensional antigen binding sites is even more difficult.

Due to these problems, there is a continuing demand for methods or related technologies that can more reliably predict the immunogenic determinant and/or immunogen binding site of a protein, in particular even on the basis of insufficient data on the three-dimensional folding structure of the protein.

Accordingly, one embodiment of the present invention provides a method for predicting an immunogenic determinant that can more reliably predict the presence, position, or sequence of an immunogenic determinant that is specifically recognized or bound by an immune protein or immune cell within an immunogen and thereby causes an immune response.

In addition, another embodiment of the invention provides a method for predicting an immunogen binding site that can more reliably predict the presence, position, or sequence of an immunogen binding site that, in an immune protein or immune cell, specifically binds to an immunogen and causes an immune response.

Accordingly, one embodiment of the present invention provides a method for predicting an immunogenic determinant, comprising: pre-learning the function and arrangement of each unit sequence of a protein by performing bidirectional artificial neural network model learning, based on a first database of proteins, while masking tokenized unit sequences included in each protein;

learning information on an immunogenic determinant that causes an immune response for each unit sequence of a polypeptide by performing bidirectional artificial neural network learning, based on a second database of immunogenic and non-immunogenic polypeptides classified according to whether an immune protein or immune cell specifically recognizes or binds to them to cause an immune response, while classifying each tokenized unit sequence included in each polypeptide; and

predicting an immunogenic determinant of a target immunogen or a target protein using the trained artificial neural network.

In the prediction method of this embodiment, the unit sequences of immunogenic and non-immunogenic polypeptides are tokenized, and bidirectional artificial neural network learning is performed on the presence, position, or arrangement of an immunogenic determinant for each tokenized unit sequence. Using the artificial neural network trained in this way, an immunogenic determinant, such as an antigenic determinant (epitope) that specifically binds to an antibody or the like, can be reliably predicted within a target immunogen or target protein (e.g., an antigen protein), even when information on the three-dimensional folding structure of the protein is insufficient.

In the prediction method of this embodiment, the pre-learning step may include dividing the amino acid sequence included in the protein into a plurality of unit sequences having a predetermined length and tokenizing them.

In a more specific example, the preliminary learning step may include masking some of the plurality of tokenized unit sequences; and proceeding with artificial neural network model learning along both directions around the masked unit sequence to pre-learn the function, arrangement or sequence of each unit sequence. In this case, the artificial neural network model learning step may include predicting the masked unit sequence; and verifying the accuracy of the predicted result and providing feedback thereof. Through this process, the artificial neural network can predict the function and arrangement of each unit sequence of the protein with high reliability.

In addition, in the prediction method of one embodiment, the step of learning the immunogen determining unit may include dividing the amino acid sequence included in the polypeptide into a plurality of unit sequences having a predetermined length and tokenizing them.

In a more specific example, the step of learning the immunogenic determinant may include predicting and classifying, in both directions, whether an immunogenic determinant is present for each of the plurality of tokenized unit sequences; and verifying the accuracy of the prediction result and providing feedback. Through this process, the presence or absence of an immunogenic determinant is predicted and classified for each unit sequence, output as data, and verified, and a predictive model of the immunogenic determinant can be derived from the output data.

In the method of one embodiment, sequence or structural information of a target immunogen or target protein is input to the artificial neural network trained by the above-described method, or to the immunogenic determinant prediction model, so that the presence, position, or sequence of the immunogenic determinant can be reliably predicted.

Meanwhile, another embodiment of the present invention provides a method for predicting an immunogen binding site, comprising: pre-learning the function and arrangement of each unit sequence of a protein by performing bidirectional artificial neural network learning, based on a first database of proteins, while masking tokenized unit sequences included in each protein;

learning information on an immunogen binding site that causes an immune response with an immunogenic polypeptide for each unit sequence of a protein by performing bidirectional artificial neural network learning, based on a third database of complexes of mutually bound immunogenic polypeptides and immune proteins, while classifying each tokenized unit sequence included in each complex according to whether an immunogen binding site is present; and

predicting an immunogen binding site in a target immune protein or target immune cell using the trained artificial neural network.

In the prediction method of this other embodiment, the same pre-learning step as in the one embodiment is performed, and a step of learning the immunogen binding site based on the third database of complexes of mutually bound immunogenic polypeptides and immune proteins is additionally performed. In this step, the unit sequences included in each complex are tokenized, and bidirectional artificial neural network learning is performed on the presence, position, or arrangement of an immunogen binding site for each tokenized unit sequence. Using the artificial neural network trained in this way, an immunogen binding site, such as an antigen binding site (paratope) that specifically binds to an immunogenic determinant such as an antigenic determinant, can be reliably predicted within a target immune protein (e.g., an antibody protein), even when information on the three-dimensional folding structure of the protein is insufficient.

In addition, because the prediction method of the other embodiment uses the third database of complexes in which immunogenic polypeptides and immune proteins are mutually bound, it can reliably predict not only the immunogen binding site but also the immunogenic determinant in a target immunogen such as an antigen.

In the prediction method of the other embodiment, the step of learning the immunogen binding site may include dividing the amino acid sequences included in each of the mutually bound immunogenic polypeptide and immune protein into a plurality of unit sequences having a certain length and tokenizing them.

In a more specific example, the step of learning the immunogen binding unit may include predicting and classifying whether or not an immunogen binding unit exists for each of the plurality of tokenized unit sequences in both directions; and verifying the accuracy of the prediction result and providing feedback.

In this case, a unique classification identifier may be assigned to the tokenized unit sequences corresponding to both edges of the immunogenic polypeptide and of the immune protein, an embedding vector may be calculated for each, and the artificial neural network learning may proceed with these embedding vectors input in a state concatenated at the edges. By assigning such unique classification identifiers, the immunogen and the immune protein can be distinguished during artificial neural network learning, and the immunogen binding site and the immunogenic determinant can be reliably predicted.

According to the method of the above-described embodiment, even when information on a protein, in particular data on the three-dimensional folding structure of the protein, is insufficient, the presence, position, or sequence of an immunogenic determinant that is specifically recognized or bound by an immune protein or immune cell within an immunogen and causes an immune response can be more reliably predicted.

In addition, according to another embodiment of the present invention, even when protein-related data are insufficient, the presence, position, or sequence of an immunogen binding site that specifically binds to an immunogen and causes an immune response in an immune protein or immune cell can be reliably predicted. Furthermore, according to the method of the other embodiment, not only the immunogen binding site but also the immunogenic determinant within the immunogen can be reliably predicted.

Therefore, the prediction method according to one or another embodiment of the present invention more reliably predicts an immunogen determinant and/or an immunogen binding portion, and can greatly contribute to the economical development of a more effective immune-based therapeutic agent or treatment method within a short period of time.

FIG. 1 is a schematic diagram showing a method for predicting an immunogenic determinant according to an embodiment of the present invention.

FIG. 2 is a schematic diagram showing the immunogen binding site learning step, which distinguishes the method for predicting an immunogen binding site according to another embodiment of the present invention from the one embodiment.

Hereinafter, a method for predicting an immunogen determining unit and an immunogen binding unit according to embodiments of the present invention will be described with reference to the accompanying drawings. For reference, FIG. 1 schematically illustrates an example of a method for predicting an immunogen determinant according to an embodiment of the present invention, specifically, a pre-learning step and an immunogen determinant learning step included therein.

As shown in FIG. 1, in the prediction method according to one embodiment of the present invention, a predictive model for predicting the presence, position, or sequence of an immunogenic determinant, such as an epitope, within an immunogen is derived through a pre-learning step based on a first database of proteins and an immunogenic determinant learning step based on a second database of immunogenic and non-immunogenic polypeptides.

In the pre-learning step and the immunogenic determinant learning step, the unit sequences of each protein and of each immunogenic or non-immunogenic polypeptide are tokenized based on the sequence information included in the first and second databases, and the artificial neural network model is trained in both directions while some of the tokenized unit sequences are masked or classified.

Through this bidirectional artificial neural network learning process, the order, arrangement, and function of each unit sequence are predicted, and a predictive model can be derived that predicts whether each unit sequence corresponds to an immunogenic determinant specifically recognized by an immune protein, such as an antibody, or by an immune cell.

It was confirmed that the immunogenic determinant within a target immunogen, such as a target antigen, can be reliably predicted by using such a predictive model. In particular, in the prediction method of this embodiment, the immunogenic determinant can be reliably predicted based on the sequence information of the protein and of the immunogenic/non-immunogenic polypeptides. Therefore, according to the method of one embodiment, even when data on the three-dimensional folding structure of a protein such as an immunogen are insufficient, an immunogenic determinant that causes an immune response within an immunogen can be reliably predicted, and as a result, structural epitopes, which were difficult to predict with conventional methods, can also be predicted more reliably.

Meanwhile, in the prediction method of one embodiment, the first database used in the pre-learning step may include information on one or more of sequences, functions, structures, and known mutations for a plurality of proteins. More specifically, the first database may be a protein database containing information on functions, sequences, domain structures, and identified mutations for a plurality of proteins, and examples of such a first database include “Bairoch, A., and R. Apweiler. 1996. “The SWISS-PROT Protein Sequence Data Bank and Its New Supplement TREMBL.” Nucleic Acids Research 24 (1): 21-25” or “Boutet, Emmanuel, Damien Lieberherr, Michael Tognolli, Michel Schneider, and Amos Bairoch. 2007. “UniProtKB/Swiss-Prot.” Methods in Molecular Biology 406: 89-112”.

In the pre-learning step, information on the plurality of proteins included in the first database may be used as training data without separate processing. However, to further increase the efficiency of artificial neural network learning, the data may be pre-processed so that only information on proteins having an amino acid sequence of a certain length or less, for example, 5000 or less, 3000 or less, or 2500 or less, or more specifically, amino acid sequences of a certain length equivalent to each other, is used. Likewise, in the immunogenic determinant learning step described later, pre-processing the information on the plurality of immunogenic and non-immunogenic polypeptides included in the second database into information on polypeptides having amino acid sequences of a certain length or less, or of lengths equivalent to each other, can further increase the efficiency of the artificial neural network learning used to derive the predictive model.
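The length-based pre-processing described above can be illustrated with a minimal sketch. The function name `filter_by_length`, the dictionary layout, and the example sequences are hypothetical; only the example thresholds (5000, 3000, 2500) come from the text.

```python
from typing import Dict

# One of the example upper bounds mentioned in the text (5000, 3000, or 2500).
MAX_LEN = 2500


def filter_by_length(proteins: Dict[str, str], max_len: int = MAX_LEN) -> Dict[str, str]:
    """Keep only entries whose amino acid sequence has at most max_len residues."""
    return {name: seq for name, seq in proteins.items() if len(seq) <= max_len}


# Hypothetical toy data: one short sequence and one that exceeds the threshold.
proteins = {"short": "MKTAYIAK", "long": "A" * 3000}
kept = filter_by_length(proteins)
print(sorted(kept))
```

The same filter could be applied to the polypeptides of the second database before the immunogenic determinant learning step.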

In addition, in the pre-learning step, as shown in FIG. 1, the amino acid sequence included in each protein of the first database may be divided into a plurality of unit sequences having a predetermined length and tokenized. Such tokenization may be performed, for example, for each amino acid unit sequence of a certain length, and through this, a plurality of tokens may be generated from the amino acid sequence of the protein.
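As a rough illustration of this tokenization, the sketch below splits an amino acid sequence into unit sequences of a fixed length and maps each token to an integer index. The unit length of 3, the helper names `tokenize` and `build_vocab`, and the `PAD` token are assumptions made for the example; the text specifies only that the units have a predetermined length.

```python
from typing import Dict, Iterable, List


def tokenize(sequence: str, unit_len: int = 1) -> List[str]:
    """Split a sequence into consecutive unit sequences of unit_len residues.

    The last token may be shorter if the length is not a multiple of unit_len.
    """
    if unit_len < 1:
        raise ValueError("unit_len must be at least 1")
    return [sequence[i:i + unit_len] for i in range(0, len(sequence), unit_len)]


def build_vocab(token_lists: Iterable[List[str]],
                specials=("PAD", "MASK", "CLS", "SEP")) -> Dict[str, int]:
    """Assign an integer index to each special token and each observed token."""
    vocab = {tok: idx for idx, tok in enumerate(specials)}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab


tokens = tokenize("MKTAYIAKQR", unit_len=3)   # hypothetical sequence
vocab = build_vocab([tokens])
ids = [vocab[t] for t in tokens]
print(tokens, ids)
```

With `unit_len=1`, each amino acid residue becomes its own token, which is another plausible reading of "unit sequence".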

In the pre-learning step, some of the plurality of tokenized unit sequences, for example, 5 to 25%, or 10 to 20%, are masked, and bidirectional learning along both directions of the protein around the masked unit sequences can be achieved through the attention mechanism of the artificial neural network model. Here, masking means an operation that covers part of the tokenized unit sequences, and a masked token can be identified as a 'MASK' token.
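The masking operation can be sketched as follows. The 15% rate is an assumed value inside the 10 to 20% range given above, and the function name `mask_tokens` and the random selection strategy are illustrative choices rather than details taken from the specification.

```python
import random
from typing import List, Optional, Tuple


def mask_tokens(tokens: List[str], mask_rate: float = 0.15,
                seed: Optional[int] = None) -> Tuple[List[str], List[int]]:
    """Replace a random fraction of tokens with 'MASK'.

    Returns the masked token list and the sorted positions that were masked.
    """
    if not tokens:
        return [], []
    rng = random.Random(seed)
    # Mask at least one token so that every sequence yields a training signal.
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    for pos in positions:
        masked[pos] = "MASK"
    return masked, positions


tokens = list("MKTAYIAKQRQISFVKSHFS")  # hypothetical sequence, one residue per token
masked, positions = mask_tokens(tokens, mask_rate=0.15, seed=0)
print(masked, positions)
```

The model would then be trained to recover the original tokens at `positions` from the unmasked context on both sides.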

In addition, in the artificial neural network learning process, the unit sequence corresponding to each masked token is predicted, and the accuracy of the unit sequence prediction result is confirmed and verified against a validation dataset and a test dataset separated from the first database, so that feedback can be provided. In this way, by masking some of the tokenized unit sequences in various combinations along both directions of the protein, predicting the masked unit sequences, and confirming, verifying, and feeding back the accuracy of the predicted results, the artificial neural network model can be pre-trained on the function, arrangement, or order of each unit sequence included in a protein.
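A minimal sketch of the data split and of the accuracy check on masked positions is given below. The 80/10/10 split fractions and the names `split_dataset` and `masked_accuracy` are assumptions; the text states only that validation and test datasets are separated from the first database.

```python
import random
from typing import List, Sequence, Tuple


def split_dataset(items: Sequence, val_frac: float = 0.1, test_frac: float = 0.1,
                  seed: int = 0) -> Tuple[list, list, list]:
    """Shuffle the items and split them into training, validation, and test sets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_frac)
    n_test = int(len(shuffled) * test_frac)
    val = shuffled[:n_val]
    test = shuffled[n_val:n_val + n_test]
    train = shuffled[n_val + n_test:]
    return train, val, test


def masked_accuracy(original: List[str], predicted: List[str],
                    positions: List[int]) -> float:
    """Fraction of masked positions where the predicted token equals the original."""
    if not positions:
        return 0.0
    correct = sum(original[p] == predicted[p] for p in positions)
    return correct / len(positions)


train, val, test = split_dataset(range(100))
acc = masked_accuracy(list("MKTAY"), list("MKSAY"), positions=[1, 2])
print(len(train), len(val), len(test), acc)
```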

As a result of such pre-learning, the artificial neural network model can predict the order and arrangement of the unit sequences included in a protein with minimal explicit consideration of biological and/or chemical properties. In the prediction method of one embodiment, because the pre-learning step is performed first, the reliability of the immunogenic determinant predictive model derived through the steps described below may be further improved.

Meanwhile, for the bidirectional artificial neural network learning of the above-described pre-learning step, a prediction model generation device may be used that includes a tokenization module for tokenizing the unit sequences, a masking module for masking some of the tokenized unit sequences, a transformation module and a learning module that perform learning while converting and predicting the masked unit sequences, a prediction accuracy calculation module that checks and calculates the accuracy of the unit sequence prediction result, and a prediction model generation module that generates a prediction model for the arrangement of the unit sequences and the like.

The above-described bidirectional artificial neural network learning method, and the predictive model generating device used to carry it out, may take a form similar to the bidirectional artificial neural network learning methods and predictive model generating devices previously applied to language learning, examples of which are disclosed in Korean Registered Patent Publication No. 2426508 or “Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina. 2019. “BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding.” In Devlin-Etal-2019-Bert, 4171-86. Association for Computational Linguistics”.

Meanwhile, in the prediction method of one embodiment, after the above-described pre-learning step, a step of learning the immunogenic determinant predictive model is performed by carrying out bidirectional artificial neural network learning based on the second database of immunogenic and non-immunogenic polypeptides, which are classified according to whether an immune protein such as an antibody, or an immune cell, specifically recognizes or binds to them to cause an immune response.

In this step of learning the immunogenic determinant, the second database may be divided into immunogenic polypeptides, which contain immunogenic determinants that cause an immune response, and non-immunogenic polypeptides, which do not contain immunogenic determinants, and may contain information on their sequences or structures. More specifically, the second database may include information on immunogenic and non-immunogenic polypeptides that have been confirmed through a plurality of experiments to cause, or not to cause, an immune response with two or more immune proteins or immune cells. Using such a second database of immunogenic/non-immunogenic polypeptides may further increase the reliability of the prediction method of one embodiment.

In a more specific embodiment, the second database may include one or more types of information selected from the group consisting of continuous amino acid sequence information, non-contiguous amino acid sequence information, and three-dimensional structural information of the immunogenic and non-immunogenic polypeptides. For example, according to the prediction method of one embodiment, an immunogenic determinant in the form of a structural epitope as well as a linear epitope can be predicted. To predict a linear epitope, a second database containing continuous amino acid sequence information of the immunogenic and non-immunogenic polypeptides may be used as learning data, whereas to predict a structural epitope, a second database containing non-contiguous amino acid sequence information or three-dimensional structural information may be used. An example of such a second database is “Vita, Randi, Swapnil Mahajan, James A. Overton, Sandeep Kumar Dhanda, Sheridan Martini, Jason R. Cantrell, Daniel K. Wheeler, Alessandro Sette, and Bjoern Peters. 2019. “The Immune Epitope Database (IEDB): 2018 Update.” Nucleic Acids Research 47 (D1): D339-43” or a database known through the Protein Data Bank (PDB).

As already described above, in the step of learning the immunogenic determinant, the reliability of the prediction method can be increased by pre-processing the information on the plurality of immunogenic and non-immunogenic polypeptides included in the second database into information on polypeptides having an amino acid sequence of a certain length.

Meanwhile, in the learning step of the immunogen determining unit, as shown in FIG. 1, the amino acid sequence included in each polypeptide of the second database may be divided into a plurality of unit sequences having a certain length and tokenized. Such tokenization may be performed for each amino acid unit sequence of a certain length corresponding to an immunogenic determinant, such as an antigenic determinant, and through this, a plurality of tokens may be generated from the amino acid sequence of an immunogenic or non-immunogenic polypeptide.

In the immunogenic determinant learning step, artificial neural network model learning may proceed, as in the pre-learning step, along both directions of the polypeptides around the plurality of tokenized unit sequences. In this bidirectional artificial neural network learning process, whether each unit sequence corresponds to an immunogenic determinant, or whether such an immunogenic determinant is present, can be predicted, classified, and output as data (e.g., in “B” of FIG. 1, after the immunogenic determinant is predicted, the predicted data is classified as “0” or “1” and output).
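The per-token “0”/“1” output can be illustrated with a short sketch that converts hypothetical per-token probabilities into binary labels and groups consecutive positive tokens into candidate determinant regions. The 0.5 threshold and the helper names `classify_tokens` and `positive_spans` are assumptions made for the example.

```python
from typing import List, Tuple


def classify_tokens(probabilities: List[float], threshold: float = 0.5) -> List[int]:
    """Label each token 1 (determinant) or 0 (non-determinant) by thresholding."""
    return [1 if p >= threshold else 0 for p in probabilities]


def positive_spans(labels: List[int]) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs, end exclusive, for runs of consecutive 1s."""
    spans = []
    start = None
    for i, label in enumerate(labels):
        if label == 1 and start is None:
            start = i
        elif label == 0 and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(labels)))
    return spans


# Hypothetical model outputs for a five-token polypeptide.
labels = classify_tokens([0.1, 0.8, 0.9, 0.2, 0.7])
spans = positive_spans(labels)
print(labels, spans)
```

A single contiguous span would correspond to a linear epitope candidate, while several separated spans could indicate a discontinuous (structural) one.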

The accuracy of the immunogenic determinant prediction result may be confirmed and verified based on a test dataset separated from the second database, and then fed back. For example, parameters can be adjusted through a loss function such as weighted cross entropy. In this way, through the process of predicting whether each tokenized unit sequence is an immunogenic determinant, and confirming, verifying, and feeding back the accuracy of the predicted result, a predictive model can be derived that reliably predicts information such as the presence, position, and arrangement of the immunogenic determinant within the immunogen.
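One common form of weighted cross entropy for binary per-token labels is sketched below, under the assumption that the weight is applied to the positive (determinant) class to compensate for determinant tokens being rarer than non-determinant tokens. The specification names the loss only as an example and does not state the weighting scheme, so `pos_weight` and the function name are hypothetical.

```python
import math
from typing import List


def weighted_cross_entropy(probs: List[float], labels: List[int],
                           pos_weight: float = 1.0, eps: float = 1e-12) -> float:
    """Mean binary cross entropy with the positive-class term scaled by pos_weight."""
    if len(probs) != len(labels) or not labels:
        raise ValueError("probs and labels must be non-empty and of equal length")
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= pos_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


# With p = 0.5 everywhere, each term is log(2), scaled by the weight for positives.
loss_plain = weighted_cross_entropy([0.5, 0.5], [1, 0], pos_weight=1.0)
loss_weighted = weighted_cross_entropy([0.5, 0.5], [1, 0], pos_weight=3.0)
print(loss_plain, loss_weighted)
```

Raising `pos_weight` makes missed determinant tokens cost more than false positives, which is one way the feedback described above could be biased toward recall.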

By inputting sequence or structural information of an unknown target immunogen or target protein into the trained artificial neural network model or the immunogenic determinant prediction model, the presence, position, or sequence of the immunogenic determinant in the immunogen can be reliably predicted.

For the bidirectional artificial neural network learning of the above-described immunogenic determinant learning step, a predictive model generation device may be used that includes a tokenization module for tokenizing the unit sequences, a classification module for classifying the tokenized unit sequences, a learning module that performs learning while predicting the unit sequences, a prediction accuracy calculation module that checks and calculates the accuracy of the unit sequence prediction result, and a predictive model generation module that generates a predictive model for the arrangement of the unit sequences and the like.

Meanwhile, according to the prediction method of one embodiment described above, an immunogenic determinant of a target immunogen, typified by an antigen protein, or of a target protein can be reliably predicted, and such an immunogenic determinant is typified by an epitope that specifically binds to an antibody within the antigen protein and causes an immune response. In addition, it was confirmed that the prediction method of one embodiment can reliably predict structural epitopes as well as linear epitopes even when three-dimensional structural information on proteins is lacking.

Therefore, the prediction method of one embodiment can more reliably predict an immunogenic determinant such as an antigenic determinant with only basic sequence information in an unknown antigen protein, etc., thereby contributing to more effective development of an immune-based therapeutic agent or treatment method.

Meanwhile, according to another embodiment of the invention, a method is provided for predicting, in addition to the above-described immunogenic determinant, an immunogen binding site (e.g., an antigen binding site; paratope) that causes an immune response with an immunogen such as an antigen in an immune protein such as an antibody or in an immune cell. The prediction method of this other embodiment may include pre-learning the function and arrangement of each unit sequence of a protein by performing bidirectional artificial neural network model learning, based on the first database of proteins, while masking tokenized unit sequences included in each protein;

learning information on an immunogen binding site that causes an immune response with an immunogenic polypeptide for each unit sequence of a protein by performing bidirectional artificial neural network learning, based on the third database of complexes of mutually bound immunogenic polypeptides and immune proteins, while classifying each tokenized unit sequence included in each complex according to whether an immunogen binding site is present; and

predicting an immunogen binding site in a target immune protein or target immune cell using the trained artificial neural network.

For reference, in the method for predicting an immunogen binding portion according to another embodiment, a schematic diagram of an immunogen binding portion learning step distinguished from one embodiment is shown in FIG. 2 . In the prediction method of the other embodiment, the pre-learning step based on the first database may be performed in the same manner as the prediction method of the above-described embodiment, so further description thereof will be omitted.

Meanwhile, in the prediction method of the other embodiment, the step of learning the immunogen binding site uses the third database of complexes of mutually bound immunogenic polypeptides and immune proteins instead of the above-described second database. More specifically, the third database may be a database of the sequences or structures of immunogenic polypeptides such as antigens, immune proteins such as antibodies, and complexes in which they are mutually bound, and may include both information on the immunogenic determinant included in the immunogenic polypeptide and information on the immunogen binding site included in the immune protein.

In a more specific example, the third database may include the sequence or structure of each of the immunogenic polypeptide, the immune protein, and the complex thereof, together with information on the sequence or binding position of the immunogenic determinant included in the immunogenic polypeptide, information on the sequence or binding position of the immunogen binding site included in the immune protein, or both. Examples of such a third database include EpiPred (Krawczyk K, Liu X, Baker T, Shi J, Deane CM. Improving B-Cell Epitope Prediction and its Application to Global Antibody-Antigen Docking. Bioinformatics (2014) 30:2288-94), Docking Benchmarking Dataset (DBD) v5 (Vreven T, Moal IH, Vangone A, Pierce BG, Kastritis PL, Torchala M, et al. Updates to the Integrated Protein-Protein Interaction Benchmarks: Docking Benchmark Version 5 and Affinity Benchmark Version 2. J Mol Biol (2015) 427:3031-41) or “Daberdaku S, Ferrari C (2019) Antibody interface prediction with 3D Zernike descriptors and SVM. Bioinformatics 35: 1870-1876” and the like.

Meanwhile, in the step of learning the immunogen binding site, as shown in FIG. 2, the amino acid sequences included in the immunogenic polypeptides and immune proteins of the third database may be divided into a plurality of unit sequences having a certain length and tokenized. Such tokenization may be performed, for example, for each amino acid unit sequence of a certain length corresponding to an immunogenic determinant, such as an antigenic determinant, of the immunogenic polypeptide and to an immunogen binding site, such as an antigen binding site, of the immune protein, so that a plurality of tokens can be generated from the amino acid sequence of each of the immunogenic polypeptide and the immune protein.

At this time, in the immunogenic polypeptide and/or the immune protein, a unique classification identifier (e.g., “CLS”, “SEP”, etc. in FIG. 2) may be assigned to the tokenized unit sequence corresponding to each edge, and each may be computed as an embedding vector. The embedding vectors of the immunogenic polypeptide and the immune protein are then input in a state concatenated at the edge, so that the subsequent artificial neural network learning can proceed.

The order in which the immunogenic polypeptide and the immune protein are connected in the embedding vector is not particularly limited. For effective artificial neural network learning, however, it is appropriate to input embedding vectors in which the immunogenic polypeptide and the immune protein are connected in a consistent order and to perform learning on them.
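As a hedged illustration of the concatenation described above, the following sketch joins the tokens of an immunogenic polypeptide and an immune protein in one fixed order, with boundary identifiers at the edges. The exact layout of FIG. 2 is not reproduced here; the placement of “CLS” at the start and “SEP” after each sequence, the bracketed spelling of the identifiers, the `segment_ids` list, and the function name `build_input` are all assumptions modeled on common bidirectional language models.

```python
CLS, SEP = "[CLS]", "[SEP]"  # assumed spelling of the identifiers named in FIG. 2


def build_input(antigen_tokens, antibody_tokens):
    """Concatenate antigen and antibody tokens in a fixed order with boundary identifiers."""
    tokens = [CLS] + list(antigen_tokens) + [SEP] + list(antibody_tokens) + [SEP]
    # Segment id 0 covers CLS, the antigen tokens and the first SEP;
    # segment id 1 covers the antibody tokens and the final SEP.
    segment_ids = [0] * (len(antigen_tokens) + 2) + [1] * (len(antibody_tokens) + 1)
    return tokens, segment_ids


tokens, segment_ids = build_input(["A", "C"], ["D", "E"])
```

Keeping the antigen first in every example is one way to satisfy the requirement that the two sequences be connected in a consistent order; the reverse order would serve equally well if applied consistently.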

Because the data on the immunogenic polypeptides and/or immune proteins are input in a state delimited by the unique classification identifiers and the artificial neural network learning proceeds on that input, the prediction method of another embodiment can predict not only the immunogen binding site of the immune protein but also the immunogenic determinant of the immunogen.

In the step of learning the immunogen binding site, artificial neural network learning may proceed in both directions around each of the plurality of tokenized unit sequences. In this bidirectional learning process, whether each unit sequence corresponds to an immunogen binding site (or to an immunogenic determinant bound thereto) is predicted and classified, and the result is output as data (for example, in FIG. 2, after the immunogen binding site is predicted, the predicted data is classified as “0” or “1” and output). At this time, the immunogenic determinant bound to the immunogen binding site may also be distinguished by means of the unique classification identifier or the like, and its predicted result may be output.
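The “0” or “1” output described above may be pictured with the following minimal sketch, which converts per-token scores into binary labels. It does not implement the neural network itself. The probabilities are made-up values standing in for model outputs, and the 0.5 threshold, the bracketed identifier strings, and the function name `classify_tokens` are assumptions.

```python
def classify_tokens(tokens, probabilities, threshold=0.5, special=("[CLS]", "[SEP]")):
    """Map per-token probabilities to 0/1 labels, skipping boundary identifiers."""
    if len(tokens) != len(probabilities):
        raise ValueError("tokens and probabilities must have the same length")
    return [
        (token, int(probability >= threshold))
        for token, probability in zip(tokens, probabilities)
        if token not in special
    ]


tokens = ["[CLS]", "G", "F", "T", "[SEP]"]
probabilities = [0.02, 0.10, 0.85, 0.60, 0.01]  # made-up model outputs
labels = classify_tokens(tokens, probabilities)
```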

The accuracy of the prediction result for the immunogen binding site (and/or the immunogenic determinant) may be confirmed and verified against a test dataset separated from the third database, and the result may be fed back. Through this process of predicting whether each tokenized unit sequence includes an immunogen binding site and of confirming, verifying, and feeding back the accuracy of the prediction, a prediction model can be derived that reliably predicts information such as the presence, position, and sequence of the immunogen binding site in an immune protein such as an antibody.

Based on the artificial neural network trained in this way, or on the immunogen binding site prediction model, the sequence information of an unknown target immune protein or target immune cell can be input, and the presence, position, or sequence of the immunogen binding site in an immune protein such as an antibody can be reliably predicted. Furthermore, in the prediction method of the other embodiment, the artificial neural network learning proceeds on third-database data covering immunogenic polypeptides as well as immune proteins, input in a state delimited by the unique classification identifiers. Consequently, not only the immunogen binding site within the immune protein but also the immunogenic determinant within the immunogen that binds to it can be additionally predicted.

In the immunogen binding site learning step described above, as in the immunogenic determinant learning step of the method of one embodiment already described, the artificial neural network may be trained using a prediction model generation device that includes a tokenization module, a classification module, a learning module, a prediction accuracy calculation module, and a prediction model generation module.

According to the prediction method of another embodiment described above, the immunogen binding site of a target immune protein, typified by an antibody protein, and the immunogenic determinant of a target immunogen, typified by an antigen protein, can be reliably predicted. In this case, the immunogen binding site may be typified by an antigen binding site (paratope) in the antibody protein that specifically binds to an epitope of an antigen and causes an immune response.

Therefore, the prediction method of another embodiment can more reliably predict an immunogen binding site, such as an antigen binding site, and an immunogenic determinant, such as an antigenic determinant, using only basic sequence information of an unknown antibody and/or antigen protein, and can thereby contribute to the effective development of immunity-based therapeutics or treatment methods.

Hereinafter, preferred embodiments of the invention and comparative examples in contrast thereto will be described. However, the following examples are only preferred examples of the invention and the invention is not limited thereto.

Example 1: Evaluation of predictive reliability of the immunogenic determinant

First, as the first database for the pre-learning step, the Swiss-Prot database, known through “Bairoch, A., and R. Apweiler. 1996. ‘The SWISS-PROT Protein Sequence Data Bank and Its New Supplement TREMBL.’ Nucleic Acids Research 24 (1): 21-25”, was used. This database contains information on the functions, domain structures, mutations, and sequences of more than 560,000 proteins. Data such as protein sequences were extracted from this first database and divided into a training dataset, a validation dataset, and a test dataset at a ratio of 6:2:2, and on this basis the pre-learning step described in “A” of FIG. 1 was performed.
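The 6:2:2 division into training, validation, and test datasets can be sketched as below. This is an illustrative sketch only; the shuffling, the fixed seed, and the helper name `split_dataset` are assumptions, since the text does not state how the records were assigned to each dataset.

```python
import random


def split_dataset(records, ratios=(6, 2, 2), seed=0):
    """Shuffle records and split them into parts proportional to ratios."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    total = sum(ratios)
    parts, start = [], 0
    for index, ratio in enumerate(ratios):
        if index == len(ratios) - 1:
            end = len(shuffled)  # the last part takes the remainder
        else:
            end = start + len(shuffled) * ratio // total
        parts.append(shuffled[start:end])
        start = end
    return parts


# Integers stand in for protein records.
train, validation, test = split_dataset(range(100))
```

The same helper covers a two-way split by passing `ratios=(8, 2)`.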

In addition, as the second database for the immunogenic determinant learning step, separate databases were used for linear antigenic determinants and for conformational antigenic determinants. First, as the second database for linear antigenic determinants, the Immune Epitope Database (IEDB), in which antibodies and antigenic determinants are classified, was used (“Vita, Randi, Swapnil Mahajan, James A. Overton, Sandeep Kumar Dhanda, Sheridan Martini, Jason R. Cantrell, Daniel K. Wheeler, Alessandro Sette, and Bjoern Peters. 2019. ‘The Immune Epitope Database (IEDB): 2018 Update.’ Nucleic Acids Research 47 (D1): D339-43”). In addition, as the second database for conformational antigenic determinants, data on some antigens were extracted from the Protein Data Bank (PDB) and used.

The data extracted from these databases were divided into a training dataset and a test dataset at a ratio of 8:2, and cross-validation was performed on the training dataset. Based on the data divided in this way, the immunogenic determinant learning step described in “B” of FIG. 1 was performed. After the pre-learning step and the immunogenic determinant learning step, the antigenic determinants of the target immunogen were predicted (linear antigenic determinants: Table 1; conformational antigenic determinants: Table 2), and the reliability of the prediction results was statistically evaluated; the results are shown in Tables 1 and 2 below, respectively. The statistical parameters used to evaluate reliability were calculated by the known method described at “https://jennainsight.tistory.com/entry/F1-Score-Roc%EA%B3%A1%EC%84%A0-Auc-%EA%B3%84%EC%82%B0%EB%B0%A9%EB%B2%95-scikit-learn-%EC%BD%94%EB%93%9C%EB%A1%9C-%EA%B5%AC%ED%98%84%ED%95%98%EA%B8%B0”. For each calculated statistical parameter, and for the AUC in particular, a higher value indicates a more reliable prediction result.
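To make the statistical parameters concrete, the following sketch computes the AUC and the macro and micro F-1 scores for binary labels in plain Python. It is offered as an assumption-laden illustration rather than the evaluation code used for the tables: the function names and the toy labels and scores are hypothetical, and the pairwise-ranking form of the AUC is one standard formulation among several.

```python
def roc_auc(labels, scores):
    """AUC as the chance that a random positive outranks a random negative (ties count 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def f1_scores(labels, predictions):
    """Return (macro F-1, micro F-1) for binary labels 0/1."""
    if not labels:
        raise ValueError("labels must not be empty")
    per_class, tp_sum, fp_sum, fn_sum = [], 0, 0, 0
    for cls in (0, 1):
        pairs = list(zip(labels, predictions))
        tp = sum(y == cls and p == cls for y, p in pairs)
        fp = sum(y != cls and p == cls for y, p in pairs)
        fn = sum(y == cls and p != cls for y, p in pairs)
        # F-1 of a class that never occurs and is never predicted is taken as 0.
        per_class.append(2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0)
        tp_sum, fp_sum, fn_sum = tp_sum + tp, fp_sum + fp, fn_sum + fn
    macro = sum(per_class) / len(per_class)          # unweighted mean over classes
    micro = 2 * tp_sum / (2 * tp_sum + fp_sum + fn_sum)  # pooled counts
    return macro, micro


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]                 # toy scores, not patent data
predictions = [int(s >= 0.5) for s in scores]  # assumed 0.5 threshold
auc = roc_auc(labels, scores)
macro_f1, micro_f1 = f1_scores(labels, predictions)
```

The macro score averages the per-class F-1 values, so it penalizes poor performance on a rare class, whereas the micro score pools all decisions and equals the accuracy in this two-class setting.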

The reliability evaluation results are shown by comparing Example 1 with Comparative Example 1, in which the pre-learning step of Example 1 was omitted, and with Comparative Examples 2 to 15, to which existing prediction methods were applied. For reference, in Comparative Examples 2 to 15, antigenic determinants were predicted according to the methods described in the sources listed below each of Tables 1 and 2, and the reliability of the prediction results was then evaluated.

Comparative Examples 2 to 15 use learning and prediction models known to be applicable to the prediction of antigenic determinants and the like. As representative examples, the prediction models of Comparative Examples 3 and 5 are as follows:

- Random Forest Classifier (Comparative Example 3): A tree-based model that makes the optimal choice at each stage. It is a typical classification-based prediction model that builds many models through data sampling and makes the final selection by voting, choosing the option with the most votes;

- Gradient Boosting Classifier (Comparative Example 5): A method that adds boosting to a tree-based model, so that data the model predicted incorrectly are reflected in the next model, which then attempts to predict them correctly.
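The sampling-and-voting idea described for Comparative Example 3 can be illustrated with the toy sketch below. It is not the scikit-learn implementation used in the comparative examples: single-threshold decision stumps on one numeric feature stand in for full decision trees, and the function names, the data, and the number of models are hypothetical.

```python
import random
from collections import Counter


def fit_stump(samples):
    """Choose the threshold with the fewest training errors; the stump predicts 1 when x > threshold."""
    best_threshold, best_errors = None, None
    for threshold, _ in samples:
        errors = sum((x > threshold) != bool(y) for x, y in samples)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


def bagging_predict(samples, x, n_models=25, seed=0):
    """Train stumps on bootstrap resamples and return the majority vote for x."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_models):
        # Sample with replacement to build one training set per model.
        bootstrap = [rng.choice(samples) for _ in samples]
        threshold = fit_stump(bootstrap)
        votes[int(x > threshold)] += 1
    return votes.most_common(1)[0][0]


# (feature value, label) pairs; toy data only.
samples = [(0.1, 0), (0.2, 0), (0.3, 0), (0.7, 1), (0.8, 1), (0.9, 1)]
high = bagging_predict(samples, 0.95)
low = bagging_predict(samples, 0.05)
```

Gradient boosting differs in that the models are built one after another, each fitted to the errors of those before it, rather than independently on resampled data.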

Table 1: Reliability evaluation of linear epitope prediction results

Statistical parameter	AUC	Macro F-1	Micro F-1	Precision	Recall
Example 1	0.922	0.853	0.860	0.855	0.852
Comparative Example 1 (pre-learning step of Example 1 omitted)	0.557	0.509	0.509	0.538	0.537
Comparative Example 2 (Bagging Classifier)	0.861	0.752	0.771	0.747	0.767
Comparative Example 3 (Random Forest Classifier)	0.857	0.762	0.780	0.756	0.780
Comparative Example 4 (ExtraTrees Classifier)	0.857	0.763	0.781	0.756	0.780
Comparative Example 5 (GradientBoosting Classifier)	0.736	0.661	0.699	0.659	0.695
Comparative Example 6 (AdaBoost Classifier)	0.590	0.489	0.611	0.535	0.586
Comparative Example 7 (LogisticRegression Classifier)	0.590	0.588	0.611	0.535	0.568
Comparative Example 8 (Linear Discriminant Analysis)	0.590	0.490	0.612	0.536	0.586
Comparative Example 9 (Bernoulli NB)	0.588	0.518	0.609	0.544	0.576
Comparative Example 10 (Gaussian NB)	0.583	0.536	0.536	0.558	0.558
Comparative Example 11 (Quadratic Discriminant Analysis)	0.515	0.506	0.549	0.511	0.512
Comparative Example 12 (SupportVector Classifier)	0.510	0.384	0.601	0.502	0.570

* Prediction methods of Comparative Examples 2 to 12 (sources):

- Comparative Example 2: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html

- Comparative Example 3: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

- Comparative Example 4: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html

- Comparative Example 5: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html

- Comparative Example 6: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html

- Comparative Example 7: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

- Comparative Example 8: https://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html

- Comparative Example 9: https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.BernoulliNB.html

- Comparative Example 10: http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html

- Comparative Example 11: https://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.QuadraticDiscriminantAnalysis.html

- Comparative Example 12: https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

Table 2: Reliability evaluation of conformational epitope prediction results

Statistical parameter	AUC	Macro F-1	Micro F-1	Precision	Recall
Example 1	0.678±0.021	0.478±0.014	0.625±0.037	0.535±0.007	0.615±0.027
Comparative Example 1 (pre-learning step of Example 1 omitted)	0.571±0.019	0.253±0.164	0.331±0.288	0.426±0.194	0.523±0.023
Comparative Example 13 (BepiPred-2.0)	0.62	-	-	-	-
Comparative Example 14 (LBtope)	0.54	-	-	-	-
Comparative Example 15 (NetSurfP)	0.60	-	-	-	-

* Prediction method of Comparative Examples 13 to 15 (source):

- Comparative Example 13: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5570230/

- Comparative Example 14: https://pubmed.ncbi.nlm.nih.gov/23667458/

- Comparative Example 15: https://pubmed.ncbi.nlm.nih.gov/19646261/

Referring to Tables 1 and 2, it was confirmed that the antigenic determinant prediction method of Example 1 can predict linear antigenic determinants and conformational antigenic determinants in immunogens such as antigens with higher reliability than not only Comparative Example 1, which omits the pre-learning step, but also any of the previously known prediction methods of Comparative Examples 2 to 15, for example the prediction methods of Comparative Examples 3 and 5, which correspond to tree-based models used for predicting antigenic determinants and the like.

This appears to be because, unlike existing tree-based models such as those of Comparative Examples 3 and 5, which require a substantial amount of data for reliable prediction through step-by-step classification and selection, the prediction method of the example can further increase prediction reliability even with limited data, by virtue of the pre-learning step and the immunogenic determinant learning step based on bidirectional artificial neural network learning.

Example 2: Evaluation of predictive reliability of the immunogen binding site

First, the pre-learning step was performed in the same way as in Example 1.

In addition, as the main third database for the immunogen binding site learning step, the dataset of “Daberdaku S, Ferrari C (2019) Antibody interface prediction with 3D Zernike descriptors and SVM. Bioinformatics 35: 1870-1876” (dataset source: processed by Daberdaku and Ferrari (2019)) was used. The complex sequences of this database of antigen-antibody complexes were used as input data, and the antigen binding site was predicted by the method described below.

As additional third databases in this example, EpiPred (Krawczyk K, Liu X, Baker T, Shi J, Deane CM. Improving B-Cell Epitope Prediction and its Application to Global Antibody-Antigen Docking. Bioinformatics (2014) 30:2288-94) and Docking Benchmarking Dataset (DBD) v5 (Vreven T, Moal IH, Vangone A, Pierce BG, Kastritis PL, Torchala M, et al. Updates to the Integrated Protein-Protein Interaction Benchmarks: Docking Benchmark Version 5 and Affinity Benchmark Version 2. J Mol Biol (2015) 427:3031-41) were used as input data together with the complex sequences of the above database of antigen-antibody complexes. These were used to predict immunogenic determinants, such as antigenic determinants, as well as immunogen binding sites according to the method of this example.

The data extracted from the third database were divided into a training dataset and a test dataset at a ratio of 8:2, and, as in Example 1, cross-validation was also performed on the training dataset. Based on the data divided in this way, the immunogen binding site learning step described in FIG. 2 was performed. After the pre-learning step and the immunogen binding site learning step, the antigen binding site of the target antibody was predicted, and the reliability of the prediction result was statistically evaluated; the results are shown in Table 3 below. The statistical parameters used to evaluate reliability were calculated by the known method described at “https://jennainsight.tistory.com/entry/F1-Score-Roc%EA%B3%A1%EC%84%A0-Auc-%EA%B3%84%EC%82%B0%EB%B0%A9%EB%B2%95-scikit-learn-%EC%BD%94%EB%93%9C%EB%A1%9C-%EA%B5%AC%ED%98%84%ED%95%98%EA%B8%B0”. For each calculated statistical parameter, and for the AUC-ROC in particular, a higher value indicates a more reliable prediction result.
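Alongside the AUC-ROC, the evaluation reports an area under the precision-recall curve (AUC PR). One common way to summarize that area is average precision, sketched below in plain Python. Whether the reported values were obtained with this summary or with another integration rule is not stated, so the choice here is an assumption, as are the function name and the toy data.

```python
def average_precision(labels, scores):
    """Summarize the precision-recall curve as the mean precision at each positive hit."""
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("at least one positive label is required")
    # Rank items from highest to lowest score; tied scores keep their input order.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, precision_sum = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank  # precision at this recall step
    return precision_sum / total_positives


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]  # toy scores, not patent data
ap = average_precision(labels, scores)
```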

The reliability evaluation results are shown by comparing Example 2 with Comparative Examples 16 and 17, to which existing prediction methods were applied. For reference, in Comparative Examples 16 and 17, the antigen binding site was predicted according to the methods described in the sources listed below Table 3, and the reliability of the prediction results was then evaluated.

Table 3: Reliability evaluation of antigen-binding site prediction results

Statistical parameter	AUC ROC	AUC PR
Example 2	0.969	0.633
Comparative Example 16 (ANTIBODY I-PATCH)	0.840	0.376
Comparative Example 17 (DABERDAKU ET AL.)	0.950	0.658

* Prediction method of Comparative Examples 16-17 (source):

- Comparative Example 16: https://pubmed.ncbi.nlm.nih.gov/24006373/

- Comparative Example 17: https://pubmed.ncbi.nlm.nih.gov/30395191/

Referring to Table 3, it was confirmed that the antigen binding site prediction method of Example 2 predicted the antigen binding site in an immune protein such as an antibody with a higher AUC-ROC (0.969) than the previously known prediction methods of Comparative Examples 16 and 17 (0.840 and 0.950, respectively).

Claims

A step of pre-learning the function and arrangement of each unit sequence of a protein by performing, based on a first protein database, bidirectional artificial neural network model learning while masking the tokenized unit sequences included in each protein;

A step of learning, for each unit sequence of a polypeptide, information on the immunogenic determinant that causes an immune response by performing, based on a second database of immunogenic and non-immunogenic polypeptides classified according to whether immune proteins or immune cells specifically recognize or bind to them and cause an immune response, bidirectional artificial neural network learning while classifying each tokenized unit sequence included in each polypeptide; and

A method for predicting an immunogenic determinant, comprising a step of predicting the immunogenic determinant of a target immunogen or target protein using the trained artificial neural network.
The method of claim 1, wherein the first database includes information on one or more of sequences, functions, structures, and known mutations of a plurality of proteins.
The method of claim 1, wherein the pre-learning step comprises dividing the amino acid sequence included in the protein into a plurality of unit sequences having a predetermined length and tokenizing them.
The method of claim 3, wherein the pre-learning step further comprises: masking some of the plurality of tokenized unit sequences; and

pre-learning the function, arrangement, or sequence of each unit sequence by performing artificial neural network model learning in both directions around the masked unit sequences.
The method of claim 4, wherein the artificial neural network model learning comprises: predicting the masked unit sequences; and

verifying the accuracy of the prediction result and providing feedback.
The method of claim 1, wherein the second database includes information on immunogenic and non-immunogenic polypeptides confirmed, through a plurality of experiments, to cause or not to cause an immune response with two or more immune proteins or immune cells.
The method of claim 6, wherein the second database includes at least one type of information selected from the group consisting of continuous amino acid sequence information, non-contiguous amino acid sequence information, and three-dimensional structural information of the immunogenic and non-immunogenic polypeptides.
The method of claim 1, wherein the step of learning the immunogenic determinant comprises dividing the amino acid sequence included in the polypeptide into a plurality of unit sequences having a predetermined length and tokenizing them.
The method of claim 8, wherein the step of learning the immunogenic determinant comprises: predicting and classifying, in both directions, whether an immunogenic determinant is present for each of the plurality of tokenized unit sequences; and

verifying the accuracy of the prediction result and providing feedback.
The method of claim 9, wherein, in the step of learning the immunogenic determinant, whether an immunogenic determinant is present for each unit sequence is predicted and classified, output as data, and verified, and a prediction model for the immunogenic determinant is derived based on the output data.
The method of claim 1, wherein the first database or the second database is pre-processed so as to relate to proteins or polypeptides having an amino acid sequence of a predetermined length or less.
The method of claim 1 or 10, wherein the step of predicting the immunogenic determinant comprises inputting sequence or structural information of the target immunogen or target protein into the trained artificial neural network or the immunogenic determinant prediction model, and predicting the presence, position, or sequence of the immunogenic determinant.
The method of claim 1, wherein the target immunogen or target protein is an antigen protein, and in the step of predicting the immunogenic determinant, an epitope in the antigen protein that specifically binds to an antibody and causes an immune response is predicted as the immunogenic determinant.
The method of claim 13, wherein, in the step of predicting the immunogenic determinant, a linear epitope or a conformational epitope is predicted.
A step of pre-learning the function and arrangement of each unit sequence of a protein by performing, based on a first protein database, bidirectional artificial neural network model learning while masking the tokenized unit sequences included in each protein;

A step of learning, for each protein unit sequence, information on the immunogen binding site that causes an immune response with the immunogenic polypeptide by performing, based on a third database of complexes of mutually bound immunogenic polypeptides and immune proteins, bidirectional artificial neural network learning while classifying each tokenized unit sequence included in each complex according to whether it is an immunogen binding site; and

A method for predicting an immunogen binding site, comprising a step of predicting the immunogen binding site in a target immune protein or target immune cell using the trained artificial neural network.
The method of claim 15, wherein the third database includes the sequence or structure of each of the immunogenic polypeptide, the immune protein, or a complex thereof, together with information on the sequence or binding position of the immunogenic determinant included in the immunogenic polypeptide and on the sequence or binding position of the immunogen binding site included in the immune protein.
The method of claim 15, wherein the step of learning the immunogen binding site comprises dividing the amino acid sequences included in each of the immunogenic polypeptide and the immune protein into a plurality of unit sequences having a predetermined length and tokenizing them.
The method of claim 17, wherein the step of learning the immunogen binding site comprises: predicting and classifying, in both directions, whether an immunogen binding site is present for each of the plurality of tokenized unit sequences; and

verifying the accuracy of the prediction result and providing feedback.
The method of claim 17, wherein a unique classification identifier is assigned to the tokenized unit sequences corresponding to both edges of the immunogenic polypeptide and the immune protein, each being computed as an embedding vector, and

the embedding vectors of the immunogenic polypeptide and the immune protein are input in a state concatenated at the edge so that artificial neural network learning proceeds.
The method of claim 15, wherein the immunogenic determinant of a target immunogen is additionally predicted.
The method of claim 15, wherein the target immune protein is an antibody protein, and in the step of predicting the immunogen binding site, an antigen binding site (paratope) in the antibody protein that specifically binds to an epitope and causes an immune response is predicted.