WO2023163518A1 - Immunogenic determinant predicting method and immunogenic binding site predicting method - Google Patents

Immunogenic determinant predicting method and immunogenic binding site predicting method

Info

Publication number
WO2023163518A1
Authority
WO
WIPO (PCT)
Prior art keywords
immunogen
protein
learning
unit
predicting
Prior art date
Application number
PCT/KR2023/002582
Other languages
French (fr)
Korean (ko)
Inventor
박민준
서승우
박은영
김진한
Original Assignee
주식회사 스탠다임
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from KR1020230023506A (KR102645477B1)
Application filed by 주식회사 스탠다임
Publication of WO2023163518A1

Classifications

    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01N INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N33/00 Investigating or analysing materials by specific methods not covered by groups G01N1/00 - G01N31/00
    • G01N33/48 Biological material, e.g. blood, urine; Haemocytometers
    • G01N33/50 Chemical analysis of biological material, e.g. blood, urine; Testing involving biospecific ligand binding methods; Immunological testing
    • G01N33/68 Chemical analysis of biological material, e.g. blood, urine; Testing involving biospecific ligand binding methods; Immunological testing involving proteins, peptides or amino acids
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B15/00 ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G16B15/30 Drug targeting using structural data; Docking or binding prediction
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30 Detection of binding sites or motifs
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics

Definitions

  • the present invention relates to a method for predicting an immunogen determinant and a method for predicting an immunogen binding portion capable of more reliably predicting an immunogen determinant and/or an immunogen binding portion that specifically binds thereto in a protein or cell that causes an immune response.
  • the immune system is a defense system that protects the body by neutralizing infectious pathogens or harmful proteins in vivo. An immune response may refer to any reaction of the immune system in which an immunogen is specifically recognized or bound.
  • an amino acid residue (or sequence) specifically recognized by an immune protein or the like can be defined as an immunogenic determinant, typified by an antigenic determinant (epitope).
  • an amino acid residue in an immune protein, such as an antibody, that specifically recognizes and binds to an immunogenic determinant in an immunogen may be defined as an immunogen binding site, typified by an antigen binding site (paratope).
  • the antigenic determinant can be divided into a linear epitope and a structural epitope according to its shape and its mode of interaction with the antigen-binding site.
  • the linear antigenic determinant consists of a continuous linear amino acid sequence and can bind to the one-dimensional structure of the antigen-binding portion.
  • the structural antigenic determinant reflects the three-dimensional structure of the protein and consists of a discontinuous amino acid sequence; it can bind to the three-dimensional structure of the antigen-binding portion.
  • when an immune-based therapeutic agent is developed based on the specific binding and immune response between an immunogen determinant, such as the antigen determinant, and an immunogen binding unit, such as the antigen binding unit, predicting or confirming the immunogen determinant within the immunogen, and predicting or confirming the immunogen binding site within the immune protein, has emerged as a very important problem in the development of effective immune-based therapeutics.
  • one embodiment of the present invention provides a method for predicting an immunogen determinant that can more reliably predict the existence, position, or sequence of an immunogen determinant within an immunogen that is specifically recognized or bound by an immune protein or immune cell and thereby causes an immune response.
  • another embodiment of the invention provides a method for predicting an immunogen-binding portion that can more reliably predict the presence, position, or sequence of an immunogen-binding portion that, in an immune protein or immune cell, specifically binds to an immunogen and causes an immune response.
  • a pre-learning step in which, based on a first database on proteins, bidirectional artificial neural network model learning is performed while masking the tokenized unit sequences included in each protein, so that the function and arrangement of each unit sequence of the protein are pre-learned;
  • a method for predicting an immunogen determinant of a target immunogen or a target protein using a learned artificial neural network is provided.
  • unit sequences of immunogenic and non-immunogenic polypeptides are tokenized, and bidirectional artificial neural network learning is performed on the presence, position, or arrangement of immunogenic determining units for each tokenized unit sequence.
  • the antigen determinant (epitope)
  • target immunogen or target protein (e.g., antigen protein)
  • the pre-learning step may include dividing the amino acid sequence included in the protein into a plurality of unit sequences having a predetermined length and tokenizing them.
  • the preliminary learning step may include masking some of the plurality of tokenized unit sequences; and proceeding with artificial neural network model learning along both directions around the masked unit sequence to pre-learn the function, arrangement or sequence of each unit sequence.
  • the artificial neural network model learning step may include predicting the masked unit sequence; and verifying the accuracy of the predicted result and providing feedback thereof. Through this process, the artificial neural network can predict the function and arrangement of each unit sequence of the protein with high reliability.
  • the step of learning the immunogen determining unit may include dividing the amino acid sequence included in the polypeptide into a plurality of unit sequences having a predetermined length and tokenizing them.
  • the step of learning the immunogen determining unit may include predicting and classifying, in both directions, whether or not an immunogen determining unit exists for each of the plurality of tokenized unit sequences; and verifying the accuracy of the prediction result and providing feedback.
  • whether or not the immunogen determining unit exists for each unit sequence is predicted and classified, the result is output as data and verified, and a predictive model of the immunogen determining unit can be derived based on the output data.
  • when the sequence or structural information of the target immunogen or target protein is input to the artificial neural network trained by the above-described method, or to the immunogen determining unit prediction model, the existence, location, or sequence of the immunogen determining unit can be reliably predicted.
  • a pre-learning step in which, based on a first database on proteins, bidirectional artificial neural network learning is performed while masking the tokenized unit sequences included in each protein, so that the function and arrangement of each unit sequence of the protein are pre-learned;
  • a step in which, based on a third database on complexes of mutually bound immunogenic polypeptides and immune proteins, bidirectional artificial neural network learning is performed while classifying each tokenized unit sequence included in each complex according to whether or not there is an immunogen binding site, so that information on the immunogen binding site that causes an immune reaction with the immunogenic polypeptide is learned for each protein unit sequence;
  • a method for predicting an immunogen binding site in a target immune protein or target immune cell using a trained artificial neural network is provided.
  • the same pre-learning step as in one embodiment is performed, and the step of learning the immunogen binding part based on the third database on the complex of immunogen polypeptides and immune proteins linked to each other is additionally performed.
  • the unit sequences included in the complex are tokenized, and bidirectional artificial neural network learning is performed on the existence, position, or arrangement of immunogen binding parts for each tokenized unit sequence.
  • when the third database on complexes in which immunogenic polypeptides and immune proteins are mutually bound is utilized, not only the immunogen-binding portion but also the immunogenic determinant in the target immunogen, such as an antigen, can be reliably predicted.
  • the step of learning the immunogen binding part may comprise dividing the amino acid sequences included in each of the mutually bound immunogen polypeptide and immune protein into a plurality of unit sequences having a certain length and tokenizing them.
  • the step of learning the immunogen binding unit may include predicting and classifying whether or not an immunogen binding unit exists for each of the plurality of tokenized unit sequences in both directions; and verifying the accuracy of the prediction result and providing feedback.
  • a unique classification identifier is assigned to the tokenized unit sequences corresponding to both edges of the immunogenic polypeptide and the immune protein, an embedding vector can be calculated for each, and artificial neural network learning may proceed with these embedding vectors input in a state concatenated at the edge. Through the assignment of such unique classification identifiers, immunogens and immune proteins can be distinguished during artificial neural network learning, and the immunogen-binding portion and immunogen-determining portion can be reliably predicted.
  • the presence, position, or sequence of an immunogenic determinant within an immunogen that is specifically recognized or bound by an immune protein or immune cell and causes an immune response can be more reliably predicted.
  • the presence, location, or sequence of an immunogen-binding portion that, in an immune protein or immune cell, specifically binds to the immunogen and causes an immune response can be predicted reliably.
  • not only the immunogen-binding portion but also the immunogen-determining portion within the immunogen can be reliably predicted.
  • the prediction method according to one or another embodiment of the present invention more reliably predicts an immunogen determinant and/or an immunogen binding portion, and can greatly contribute to the economical development of a more effective immune-based therapeutic agent or treatment method within a short period of time.
  • FIG. 1 is a schematic diagram showing a method for predicting an immunogen determinant according to an embodiment of the present invention.
  • FIG. 2 is a schematic diagram showing the immunogen binding portion learning step that distinguishes the method for predicting an immunogen binding portion according to another embodiment of the present invention from the one embodiment.
  • FIG. 1 schematically illustrates an example of a method for predicting an immunogen determinant according to an embodiment of the present invention, specifically, a pre-learning step and an immunogen determinant learning step included therein.
  • through a pre-learning step based on a first database on proteins and an immunogen determining unit learning step based on a second database on immunogens and non-immunogen polypeptides, a predictive model for predicting the existence, position, or sequence of an immunogen determinant, such as an epitope, within the immunogen is derived.
  • in the pre-learning step and the immunogen determining unit learning step, based on the sequence information of each protein and each immunogen or non-immunogen polypeptide included in the first and second databases, their unit sequences are tokenized, and the artificial neural network model is trained in both directions while masking or classifying part of the tokenized unit sequences.
  • through this learning, each unit sequence is predicted, and a predictive model can be derived that predicts whether each unit sequence corresponds to an immunogen determinant specifically recognized by an immune protein, such as an antibody, or by immune cells.
  • the immunogenic determinant within the target immunogen can be reliably predicted by using such a predictive model.
  • the immunogenic determinant can be reliably predicted based on the sequence information of the protein and the immunogenic/non-immunogenic polypeptide. Therefore, according to the method of one embodiment, even when data on the three-dimensional folding structure of a protein such as an immunogen is insufficient, it is possible to reliably predict an immunogen determinant that causes an immune response within an immunogen. As a result, the structural epitope, which was difficult to predict with conventional methods, can also be predicted more reliably.
  • the first database used in the pre-learning step may include information on one or more of sequences, functions, structures, and known mutations for a plurality of proteins. More specifically, the first database may be a protein database containing information on functions, sequences, domain structures, and identified mutations for a plurality of proteins, and examples of such a first database include “Bairoch, A., and R. Apweiler. 1996. “The SWISS-PROT Protein Sequence Data Bank and Its New Supplement TREMBL.” Nucleic Acids Research 24 (1): 21-25” or “Boutet, Emmanuel, Damien Lieberherr, Michael Tognolli, Michel Schneider, and Amos Bairoch. 2007. “UniProtKB/Swiss-Prot.” Methods in Molecular Biology 406: 89-112”.
  • information on a plurality of proteins included in the first database may be used as training data without separate processing. However, to further increase the efficiency of artificial neural network learning, it may be pre-processed into information on proteins having an amino acid sequence of a certain length or less, for example 5000 or less, 3000 or less, or 2500 or less, or more specifically, amino acid sequences of a certain length equal to each other.
  • similarly, when information on a plurality of immunogenic and non-immunogenic polypeptides included in the second database is pre-processed into information on polypeptides having amino acid sequences of a certain length or less, or of lengths equal to each other, the efficiency of artificial neural network learning for deriving a predictive model can be further increased.
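The length-based pre-processing described for the first and second databases can be illustrated with a minimal sketch. The function name `filter_by_max_length` and the toy records are hypothetical, and the sketch only covers the "certain length or less" case; how the source equalizes lengths is not stated.

```python
from typing import Dict


def filter_by_max_length(sequences: Dict[str, str], max_len: int) -> Dict[str, str]:
    """Keep only the sequences with at most `max_len` amino acid residues."""
    if max_len < 1:
        raise ValueError("max_len must be positive")
    return {name: seq for name, seq in sequences.items() if len(seq) <= max_len}


# Hypothetical toy records; real entries would come from the protein database.
records = {"P1": "MKTAYIAKQR", "P2": "A" * 6000}
kept = filter_by_max_length(records, max_len=5000)
```

Here `max_len=5000` corresponds to the largest of the example thresholds mentioned in the text; 3000 or 2500 would be used the same way.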
  • the amino acid sequence included in each protein of the first database may be divided into a plurality of unit sequences having a predetermined length and tokenized.
  • Such tokenization may be performed, for example, for each amino acid unit sequence of a certain length, and through this, a plurality of tokens may be generated from the amino acid sequence of the protein.
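As an illustration of dividing an amino acid sequence into unit sequences of a predetermined length, the following minimal sketch uses non-overlapping windows. The function name `tokenize`, the default `unit_len` of 3, and the decision to keep a shorter trailing unit are assumptions; the source does not specify the unit length or how a remainder is handled.

```python
from typing import List


def tokenize(sequence: str, unit_len: int = 3) -> List[str]:
    """Split a sequence into consecutive, non-overlapping unit sequences.

    The last unit may be shorter than `unit_len` (an assumption of this sketch).
    """
    if unit_len < 1:
        raise ValueError("unit_len must be positive")
    return [sequence[i:i + unit_len] for i in range(0, len(sequence), unit_len)]


tokens = tokenize("MKTAYIAKQR", unit_len=3)  # ['MKT', 'AYI', 'AKQ', 'R']
```

With `unit_len=1` each amino acid becomes its own token, which is another plausible reading of "unit sequence".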
  • some of the tokenized plurality of unit sequences, for example 5 to 25%, or 10 to 20%, are masked, and bidirectional learning along both directions of the protein around the masked unit sequence can be achieved through the attention mechanism of the artificial neural network model.
  • masking means an operation to cover a part of the tokenized unit sequence, and the masked token can be distinguished as a 'MASK' token.
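The masking operation can be sketched as follows. This is a minimal illustration, assuming uniform random selection of positions and a default ratio of 15% (inside the 10 to 20% range given in the text); the function name `mask_tokens` and the `seed` parameter are hypothetical, and the source does not say how masked positions are chosen.

```python
import random
from typing import List, Tuple

MASK = "MASK"  # placeholder token, as named in the text


def mask_tokens(tokens: List[str], mask_ratio: float = 0.15,
                seed: int = 0) -> Tuple[List[str], List[int]]:
    """Replace a random fraction of tokens with MASK.

    Returns the masked token list and the sorted masked positions, which
    serve as the prediction targets during pre-learning.
    """
    if not 0.0 < mask_ratio < 1.0:
        raise ValueError("mask_ratio must be between 0 and 1")
    if not tokens:
        return [], []
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_ratio))  # mask at least one token
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    for pos in positions:
        masked[pos] = MASK
    return masked, positions
```

A model trained on such inputs would be asked to recover the original token at each returned position from the unmasked context on both sides.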
  • a unit sequence corresponding to the masked token is predicted, and the accuracy of the unit sequence prediction result is confirmed and verified based on a verification dataset and a test dataset separated from the first database, so that feedback can be provided.
  • through this process, the function, arrangement, or order of each unit sequence included in a protein can be pre-learned by the artificial neural network model.
  • the artificial neural network model can predict the order and arrangement of unit sequences included in a protein with minimal consideration of biological and/or chemical properties.
  • the reliability of the predictive model of the immunogen determining unit derived through the steps described below may be further improved.
  • for this pre-learning, a prediction model generation device may be used that includes a tokenization module for tokenizing the unit sequences, a masking module for masking some of the tokenized unit sequences, a transformation module and a learning module that perform learning while converting and predicting the masked unit sequences, a prediction accuracy operation module that checks and calculates the accuracy of the unit sequence prediction result, and a prediction model generation module that generates a prediction model for an arrangement of the unit sequences, etc.
  • the above-described bidirectional artificial neural network learning method and predictive model generating device for its progress may have a similar form to the bidirectional artificial neural network learning method and predictive model generating device previously applied for language learning, one example of which is registered in Korea.
  • in the prediction method of one embodiment, after the above-described pre-learning step, a step of learning the predictive model of the immunogen determining unit is performed by proceeding with bidirectional artificial neural network learning based on the second database, in which immunogens and non-immunogen polypeptides are classified according to whether they are specifically recognized or bound by immune proteins, such as antibodies, or immune cells to cause an immune response.
  • the second database is divided into immunogenic polypeptides containing immunogenic determinants that cause an immune response and non-immunogen polypeptides not containing immunogenic determinants, and can contain information on their sequences or structures. More specifically, the second database may include information on immunogens and non-immunogen polypeptides that have been confirmed to cause or not cause an immune response with two or more immune proteins or immune cells through a plurality of experiments. Reliability of the prediction method of one embodiment may be further increased by utilizing the second database for immunogenic/non-immunogenic polypeptides.
  • the second database may include one or more types of information selected from the group consisting of continuous amino acid sequence information, non-contiguous amino acid sequence information, and 3-dimensional structural information of the immunogen and non-immunogen polypeptide.
  • an immunogenic determinant in the form of a structural epitope as well as a linear epitope can be predicted.
  • a second database containing continuous amino acid sequence information of the immunogenic and non-immunogenic polypeptides may be used as learning data.
  • a second database containing non-contiguous amino acid sequence information or three-dimensional structure information may be used.
  • as already described above, in the step of learning the immunogen determining unit, information on a plurality of immunogenic and non-immunogenic polypeptides included in the second database may be pre-processed into information on polypeptides having an amino acid sequence of a certain length, thereby increasing the reliability of the prediction method.
  • the amino acid sequence included in each polypeptide of the second database may be divided into a plurality of unit sequences having a certain length and tokenized.
  • Such tokenization may be performed for each amino acid unit sequence of a certain length corresponding to an immunogenic determinant, such as an antigenic determinant, and through this, a plurality of tokens may be generated from the amino acid sequence of an immunogenic or non-immunogenic polypeptide.
  • artificial neural network model learning may proceed as in the pre-learning step along both directions of the polypeptides around the plurality of tokenized unit sequences.
  • in this bidirectional artificial neural network learning process, it is possible to predict and classify whether each unit sequence corresponds to an immunogen determining unit, or whether such an immunogen determining unit exists, and to output the result as data (e.g., in “B” in FIG. 1, after the immunogen determining unit is predicted, the predicted data is classified as “0” or “1” and output).
  • the accuracy of the prediction result of the immunogen determining unit may be confirmed and verified based on the test data set separated from the second database, and then fed back. For example, parameters can be adjusted through a loss function such as weighted cross entropy. In this way, through the process of predicting whether or not the tokenized unit sequence is an immunogen determinant, and confirming, verifying, and feeding back the accuracy of the predicted result, a predictive model can be derived that reliably predicts the existence, location, and arrangement of the immunogen determinant within the immunogen.
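To make the weighted cross entropy mentioned above concrete, here is a minimal sketch of one common binary form, in which errors on positive (label "1") unit sequences are scaled by a weight. The function name `weighted_cross_entropy` and the single `pos_weight` parameter are assumptions; the source does not state which weighting scheme is used.

```python
import math
from typing import Sequence


def weighted_cross_entropy(probs: Sequence[float], labels: Sequence[int],
                           pos_weight: float = 1.0, eps: float = 1e-12) -> float:
    """Mean binary cross entropy with the positive-class term scaled by pos_weight.

    probs  -- predicted probability that each unit sequence is a determinant
    labels -- 1 if the unit sequence is a determinant, else 0
    """
    if len(probs) != len(labels) or not probs:
        raise ValueError("probs and labels must be non-empty and equally long")
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(pos_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)
```

A `pos_weight` above 1 is a typical choice when determinant positions are much rarer than non-determinant positions, so that missed positives cost more than false alarms.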
  • when sequence or structural information of an unknown target immunogen or target protein is input to such a predictive model, the existence, position, or sequence of the immunogen determinant in the immunogen can be reliably predicted.
  • for this learning, a predictive model generation device can be used that includes a tokenization module for tokenizing the unit sequences, a classification module for classifying the tokenized unit sequences, a learning module that carries out the learning, a prediction accuracy calculation module that checks and calculates the accuracy of the unit sequence prediction result, and a predictive model generation module that generates a predictive model for an arrangement of the unit sequences, etc.
  • according to the prediction method of one embodiment described above, it is possible to reliably predict an immunogenic determinant of a target immunogen or target protein, typified by an antigen protein. Such an immunogen determinant may be typified by an epitope that, in the antigen protein, specifically binds to an antibody and induces an immune response.
  • the prediction method of one embodiment can reliably predict structural epitopes as well as linear epitopes even when 3-dimensional structural information of proteins is lacking.
  • the prediction method of one embodiment can more reliably predict an immunogenic determinant such as an antigenic determinant with only basic sequence information in an unknown antigen protein, etc., thereby contributing to more effective development of an immune-based therapeutic agent or treatment method.
  • according to another embodiment, a method is provided for predicting an immunogen binding unit (e.g., antigen binding unit; paratope) that causes an immune reaction with an immunogen, such as an antigen, in an immune protein, such as an antibody, or an immune cell.
  • the prediction method of this other embodiment may include: a step of pre-learning the function and arrangement of each unit sequence of a protein by proceeding with bidirectional artificial neural network model learning, based on the first database on proteins, while masking the tokenized unit sequences included in each protein;
  • a step of learning, for each protein unit sequence, information on an immunogen binding site that causes an immune response with the immunogenic polypeptide, by performing bidirectional artificial neural network learning while classifying each tokenized unit sequence included in each complex according to whether or not there is an immunogen binding site; and
  • a step of predicting an immunogen binding site in a target immune protein or target immune cell using the trained artificial neural network.
  • a schematic diagram of the immunogen binding portion learning step, which is distinguished from the one embodiment, is shown in FIG. 2.
  • the pre-learning step based on the first database may be performed in the same manner as the prediction method of the above-described embodiment, so further description thereof will be omitted.
  • the third database for the complex of immunogen polypeptides and immune proteins bound to each other is used instead of the above-described second database.
  • the third database may be a database of sequences or structures of immunogenic polypeptides such as antigens, immune proteins such as antibodies, and complexes in which they are mutually bound, and may include both information on the immunogen determining unit included in the immunogenic polypeptides and information on the immunogen binding portion included in the immune proteins.
  • the third database may include the sequence or structure of each of the immunogenic polypeptide, the immune protein, and the complex thereof, together with information on the sequence or binding position of the immunogenic determinant included in the immunogenic polypeptide, information on the sequence or binding site of the immunogen binding unit included in the immune protein, or information on both.
  • An example of such a third database is EpiPred (Krawczyk K, Liu X, Baker T, Shi J, Deane CM. Improving B-Cell Epitope Prediction and its Application to Global Antibody-Antigen Docking.
  • the amino acid sequences included in the immunogenic polypeptides and immune proteins of the third database may be divided into a plurality of unit sequences having a certain length and tokenized.
  • Such tokenization may be performed, for example, for each amino acid unit sequence of a certain length corresponding to an immunogen determinant, such as an antigenic determinant of an immunogenic polypeptide, and an immunogen binding portion, such as an antigen binding portion of an immune protein, and through this, a plurality of tokens can be generated from each amino acid sequence of the immunogenic polypeptides and immune proteins.
  • a unique classification identifier (e.g., “CLS”, “SEP”, etc. in FIG. 2) can be assigned to the tokenized unit sequence corresponding to each edge and an embedding vector calculated for each, and the immunogenic polypeptide and immune protein embedding vectors are input in a concatenated state at the edge, so that subsequent artificial neural network learning can proceed.
  • the order in which the immunogenic polypeptide and immune protein are connected in the embedding vector is not particularly limited, but for effective artificial neural network learning it is appropriate to input embedding vectors in which the immunogenic polypeptide and immune protein are connected in a consistent order.
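The concatenation of immunogenic polypeptide and immune protein tokens with classification identifiers can be sketched at the token level as follows. The layout used here, CLS, antigen tokens, SEP, antibody tokens, SEP, with a parallel list of segment ids, is an assumption borrowed from sentence-pair inputs in language models; the source names "CLS" and "SEP" but does not give the exact arrangement, and `build_pair_input` is a hypothetical name.

```python
from typing import List, Sequence, Tuple

CLS, SEP = "CLS", "SEP"  # identifier names taken from FIG. 2 as cited in the text


def build_pair_input(antigen_tokens: Sequence[str],
                     antibody_tokens: Sequence[str]) -> Tuple[List[str], List[int]]:
    """Concatenate antigen and antibody tokens in a fixed order.

    Returns the combined token list and segment ids (0 = antigen side,
    1 = antibody side) so a model can tell the two chains apart.
    """
    tokens = [CLS] + list(antigen_tokens) + [SEP] + list(antibody_tokens) + [SEP]
    segment_ids = [0] * (len(antigen_tokens) + 2) + [1] * (len(antibody_tokens) + 1)
    return tokens, segment_ids


pair_tokens, pair_segments = build_pair_input(["MKT", "AYI"], ["QVQ"])
```

Keeping the antigen always first, as in this sketch, is one way to satisfy the requirement that the two chains be connected in a consistent order.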
  • in this learning process, it is possible to predict and classify whether each unit sequence corresponds to an immunogen binding unit (or an immunogen determining unit combined therewith), or whether such an immunogen binding unit exists, and to output the result as data (e.g., in FIG. 2, after the immunogen binding part is predicted, the predicted data is classified as “0” or “1” and output).
  • the immunogen determining unit combined with the immunogen binding unit may also be distinguished by the unique classification identifier and the like, and the predicted result may be output.
  • the accuracy of the prediction result of the immunogen binding unit may be confirmed and verified based on a test data set separated from a third database and fed back.
  • when sequence information of an unknown target immune protein or target immune cell is input, the existence, position, or sequence of the immunogen-binding portion in an immune protein, such as an antibody, can be predicted reliably. Furthermore, in the prediction method of the other embodiment, since artificial neural network learning proceeds on input data from the third database covering immunogenic polypeptides as well as immune proteins, distinguished by unique classification identifiers, not only the immunogen binding site in the immune protein but also the immunogenic determinant within the immunogen that binds thereto can be predicted.
  • a predictive model generating device including a tokenization module, a classification module, a learning module, a prediction accuracy calculation module, and a predictive model generation module may be used to train the artificial neural network.
  • an immunogen-binding portion of a target immune protein, typified by an antibody protein, and an immunogen-determining portion of a target immunogen, typified by an antigen protein, can be reliably predicted.
  • the immunogen-binding portion may be typified by an antigen-binding portion (paratope) that, in the antibody protein, specifically binds to an epitope of an antigen and causes an immune response.
  • the prediction method of another embodiment more reliably predicts an immunogen-binding portion, such as an antigen-binding portion, and an immunogenic determinant, such as an antigenic determinant, using only basic sequence information of an unknown antibody and/or antigen protein, and can thereby contribute to the effective development of immune-based therapeutics or treatment methods.
  • Example 1 Evaluation of predictive reliability of the immunogen determinant
  • the second database for the learning step of the immunogen determinant those related to linear antigenic determinants or steric antigenic determinants were used separately.
  • antibodies and antigenic determinants are classified “Vita, Randi, Swapnil Mahajan, James A. Overton, Sandeep Kumar Dhanda, Sheridan Martini, Jason R. Cantrell, Daniel K. Wheeler, Alessandro Sette, and Bjoern Peters. 2019. “The Immune Epitope Database (IEDB): 2018 Update.” Nucleic Acids Research 47 (D1): D339-43” was used.
  • IEDB Immune Epitope Database
  • the data extracted from these databases was divided into a training dataset and a test dataset at a ratio of 8:2, and cross-validation was performed on the training dataset.
  • the immunogen determining unit learning step described in “B” of FIG. 1 was performed.
  • the antigenic determinant of the target immunogen (linear antigenic determinant: Table 1; conformational antigenic determinant: Table 2) was predicted, and the reliability of the prediction results was evaluated, as shown in Tables 1 and 2 below, respectively.
  • the statistical parameters used to evaluate reliability were calculated by the known method described at “https://jennainsight.tistory.com/entry/F1-Score-Roc%EA%B3%A1%EC%84%A0-Auc-%EA%B3%84%EC%82%B0%EB%B0%A9%EB%B2%95-scikit-learn-%EC%BD%94%EB%93%9C%EB%A1%9C-%EA%B5%AC%ED%98%84%ED%95%98%EA%B8%B0”; the higher each calculated parameter, in particular the AUC, the higher the reliability of the prediction result.
  • Comparative Example 1: Example 1 with the pre-learning step omitted.
  • Comparative Examples 2 to 15: existing prediction methods applied for comparison.
  • in Comparative Examples 2 to 15, antigenic determinants were predicted according to the methods described in the literature listed below Tables 1 and 2, and the reliability of the prediction results was then evaluated.
  • Comparative Examples 2 to 15 are learning and prediction models known to be applicable to predicting antigenic determinants and the like.
  • the predictive models of Comparative Examples 3 and 5 are as follows:
  • Random Forest Classifier (Comparative Example 3): a tree-based model that makes the optimal choice at each stage; a typical classification-based prediction model that builds many models through data sampling and selects the final result by majority vote;
  • it was confirmed that the antigenic determinant prediction method of Example 1 can predict linear and conformational antigenic determinants in immunogens such as antigens with higher reliability than Comparative Example 1, which omits the pre-learning step, and than any of the previously known prediction methods of Comparative Examples 2 to 15, for example the tree-based models of Comparative Examples 3 and 5 used for predicting antigenic determinants and the like.
  • this appears to be because the pre-learning step and the immunogenic determinant learning step, both based on bidirectional artificial neural network learning, allow the prediction method of the embodiment to achieve higher prediction reliability even when the available data are insufficient.
  • the pre-learning step was performed in the same way as in Example 1.
  • the data extracted from the third database was divided into a training dataset and a test dataset at a ratio of 8:2, and, as in Example 1, cross-validation was performed on the training dataset.
  • the immunogen-binding portion learning step described in FIG. 2 was performed. After the pre-learning step and the immunogen-binding portion learning step, the antigen-binding portion of the target antibody was predicted, and the reliability of the prediction results was statistically evaluated and is shown in Table 3 below.
  • the statistical parameters used to evaluate reliability were calculated by the known method described at “https://jennainsight.tistory.com/entry/F1-Score-Roc%EA%B3%A1%EC%84%A0-Auc-%EA%B3%84%EC%82%B0%EB%B0%A9%EB%B2%95-scikit-learn-%EC%BD%94%EB%93%9C%EB%A1%9C-%EA%B5%AC%ED%98%84%ED%95%98%EA%B8%B0”; the higher each calculated parameter, in particular the AUC-ROC, the higher the reliability of the prediction result.
  • it was confirmed that the method for predicting the antigen-binding portion of Example 3 could predict the antigen-binding portion in an immune protein such as an antibody with higher reliability than the previously known prediction methods of Comparative Examples 16 and 17.
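
The 8:2 split followed by cross-validation on the training portion, as used in the examples above, can be sketched as follows. This is only an illustration under stated assumptions: the function name `split_and_folds`, the fold count of 5, the shuffling seed, and the interleaved fold assignment are hypothetical, because the text gives only the 8:2 ratio and says that cross-validation was applied to the training dataset.

```python
import random


def split_and_folds(samples, train_ratio=0.8, n_folds=5, seed=0):
    """Shuffle samples, split them 8:2, and build cross-validation folds on the training part.

    n_folds and seed are illustrative assumptions; the source does not state them.
    """
    items = list(samples)
    random.Random(seed).shuffle(items)
    cut = round(len(items) * train_ratio)
    train, test = items[:cut], items[cut:]

    folds = []
    for i in range(n_folds):
        # Every n_folds-th training sample (offset i) is held out for validation.
        validation = [x for j, x in enumerate(train) if j % n_folds == i]
        fitting = [x for j, x in enumerate(train) if j % n_folds != i]
        folds.append((fitting, validation))
    return train, test, folds


train, test, folds = split_and_folds(range(10))
print(len(train), len(test), len(folds))  # 8 2 5
```

The test portion is never used when building the folds, so it stays available for the final reliability evaluation.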

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Medical Informatics (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Biotechnology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Chemical & Material Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Urology & Nephrology (AREA)
  • Immunology (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Medicinal Chemistry (AREA)
  • Bioethics (AREA)
  • Biomedical Technology (AREA)
  • Hematology (AREA)
  • Biochemistry (AREA)
  • General Engineering & Computer Science (AREA)
  • Pharmacology & Pharmacy (AREA)
  • Food Science & Technology (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Mathematical Physics (AREA)
  • Cell Biology (AREA)
  • Microbiology (AREA)
  • Computing Systems (AREA)
  • Pathology (AREA)

Abstract

The present invention relates to an immunogenic determinant predicting method and an immunogenic binding site predicting method, whereby an immunogenic determinant and/or an immunogenic binding site specifically binding thereto can be predicted more reliably in proteins or cells inducing an immune response.

Description

Method for predicting immunogenic determinants and method for predicting immunogen binding sites
This application claims priority based on Korean Patent Application No. 2022-0025194, filed with the Korean Intellectual Property Office on February 25, 2022, and Korean Patent Application No. 2023-0023506, filed on February 22, 2023, and all contents disclosed in the specifications of those applications are incorporated herein by reference.
The present invention relates to a method for predicting an immunogenic determinant and a method for predicting an immunogen binding site, which can more reliably predict an immunogenic determinant and/or an immunogen binding site that specifically binds to it in a protein or cell that induces an immune response.
The immune system refers to the defense system that protects the body against, and neutralizes, infectious pathogens, harmful proteins, and the like. An immune response may refer to any immune-system reaction in which an immune protein such as an antibody, or an immune cell such as a T cell or B cell, specifically recognizes or binds to an immunogen such as a harmful protein.
In such an immune response, the amino acid residues (or sequence) within an immunogen (for example, an antigen), such as the infectious pathogen or harmful protein, that are specifically recognized by the immune protein or the like may be defined as the immunogenic determinant, typified by the antigenic determinant (epitope). Likewise, the amino acid residues within an immune protein such as an antibody that specifically recognize and bind to the immunogenic determinant may be defined as the immunogen binding site, typified by the antigen binding site (paratope).
Antigenic determinants can be divided, according to their form and their mode of interaction with the antigen binding site, into linear antigenic determinants (linear epitopes) and conformational antigenic determinants (structural epitopes). A linear antigenic determinant consists of a continuous linear amino acid sequence and can bind to the one-dimensional structure of the antigen binding site. A conformational antigenic determinant reflects the three-dimensional protein structure, consists of a discontinuous amino acid sequence, and can bind to the three-dimensional structure of the antigen binding site.
Recently, interest in immune-based therapeutics and treatment methods has grown considerably, and related research and development is very active. Because such therapeutics are developed on the basis of the specific binding and immune response between an immunogenic determinant, such as an antigenic determinant, and an immunogen binding site, such as an antigen binding site, predicting or identifying the immunogenic determinant within the immunogen and the immunogen binding site within the immune protein has become a very important problem in developing effective immune-based therapeutics.
However, because it is very difficult to identify an immunogenic determinant or immunogen binding site experimentally on the basis of an in vivo immune response, several methods for predicting immunogenic determinants, immunogen binding sites, or candidates for them have been proposed, and research on them continues.
Nevertheless, the prediction methods known to date for immunogenic determinants and immunogen binding sites have not been sufficiently reliable, mainly because too little usable data is available on the sequences or structures of immunogens and immune proteins. In particular, reliably predicting a conformational antigenic determinant (structural epitope) and the three-dimensional antigen binding site that binds to it has required a deep understanding of, and a large amount of data on, the three-dimensional folding structure of target proteins such as immunogens and immune proteins. Such data remain insufficient, which makes prediction of conformational antigenic determinants and three-dimensional antigen binding sites even more difficult.
For these reasons, there is a continuing need for methods and related technologies that can predict the immunogenic determinant and/or immunogen binding site more reliably even when based on insufficient data on proteins, in particular on their three-dimensional folding structure.
Accordingly, one embodiment of the invention provides a method for predicting an immunogenic determinant that can more reliably predict the presence, position, or sequence of the immunogenic determinant, that is, the part of an immunogen that an immune protein or immune cell specifically recognizes or binds to, causing an immune response.
Another embodiment of the invention provides a method for predicting an immunogen binding site that can more reliably predict the presence, position, or sequence of the immunogen binding site, that is, the part of an immune protein or immune cell that specifically binds to an immunogen, causing an immune response.
According to one embodiment of the invention, there is provided a method for predicting an immunogenic determinant, the method comprising: a pre-learning step of learning in advance the function and arrangement of each unit sequence of a protein by training a bidirectional artificial neural network model on a first database of proteins while masking the tokenized unit sequences included in each protein;
an immunogenic determinant learning step of learning, for each unit sequence of a polypeptide, information on the immunogenic determinant that causes an immune response, by training the bidirectional artificial neural network on a second database of immunogenic and non-immunogenic polypeptides, which are distinguished according to whether they elicit an immune response through specific recognition or binding by an immune protein or immune cell, while classifying the tokenized unit sequences included in each polypeptide; and
a step of predicting the immunogenic determinant of a target immunogen or target protein using the trained artificial neural network.
In the prediction method of this embodiment, the unit sequences of immunogenic and non-immunogenic polypeptides are tokenized, and the bidirectional artificial neural network is trained, for each tokenized unit sequence, on the presence, position, or arrangement of the immunogenic determinant. With the network trained in this way, an immunogenic determinant such as an antigenic determinant (epitope) that specifically binds to an antibody or the like can be predicted reliably within the target immunogen or target protein (for example, an antigen protein), even when information such as the protein's three-dimensional folding structure is lacking.
In the prediction method of this embodiment, the pre-learning step may include tokenizing the amino acid sequence of the protein by dividing it into a plurality of unit sequences of a fixed length.
In a more specific example, the pre-learning step may further include masking some of the tokenized unit sequences, and pre-learning the function, arrangement, or order of each unit sequence by training the artificial neural network model in both directions around the masked unit sequences. The training of the artificial neural network model may include predicting the masked unit sequences, and verifying the accuracy of the predictions and feeding the result back. Through this process the artificial neural network becomes able to predict the function, arrangement, and the like of each unit sequence of a protein with high reliability.
Also, in the prediction method of this embodiment, the immunogenic determinant learning step may include tokenizing the amino acid sequence of the polypeptide by dividing it into a plurality of unit sequences of a fixed length.
In a more specific example, the immunogenic determinant learning step may include predicting and classifying, in both directions, whether an immunogenic determinant is present in each of the tokenized unit sequences, and verifying the accuracy of the predictions and feeding the result back. Through this process, whether an immunogenic determinant is present is predicted and classified for each unit sequence, output as data, and verified, and a prediction model for the immunogenic determinant can be derived from the output data.
In the method of this embodiment, the sequence or structure information of the target immunogen or target protein is input to the artificial neural network trained as described above, or to the immunogenic determinant prediction model, so that the presence, position, or sequence of the immunogenic determinant can be predicted reliably.
Meanwhile, according to another embodiment of the invention, there is provided a method for predicting an immunogen binding site, the method comprising: a pre-learning step of learning in advance the function and arrangement of each unit sequence of a protein by training a bidirectional artificial neural network on a first database of proteins while masking the tokenized unit sequences included in each protein;
an immunogen binding site learning step of learning, for each unit sequence of the immune protein, information on the immunogen binding site that causes an immune response with the immunogenic polypeptide, by training the bidirectional artificial neural network on a third database of complexes of mutually bound immunogenic polypeptides and immune proteins while classifying the tokenized unit sequences included in each complex according to whether they belong to an immunogen binding site; and
a step of predicting the immunogen binding site in a target immune protein or target immune cell using the trained artificial neural network.
In the prediction method of this other embodiment, the same pre-learning step as in the first embodiment is performed, followed by an additional immunogen binding site learning step based on the third database of complexes of mutually bound immunogenic polypeptides and immune proteins. The unit sequences included in each complex are tokenized, and the bidirectional artificial neural network is trained, for each tokenized unit sequence, on the presence, position, or arrangement of the immunogen binding site. With the network trained in this way, an immunogen binding site such as an antigen binding site (paratope) that specifically binds to an immunogenic determinant such as an antigenic determinant can be predicted reliably within the target immune protein (for example, an antibody protein), even when information such as the protein's three-dimensional folding structure is lacking.
In addition, because the prediction method of the other embodiment uses the third database of complexes in which immunogenic polypeptides and immune proteins are bound to each other, it can reliably predict not only the immunogen binding site but also the immunogenic determinant within a target immunogen such as an antigen.
In the prediction method of the other embodiment, the immunogen binding site learning step may include tokenizing the amino acid sequences of the mutually bound immunogenic polypeptide and immune protein by dividing each into a plurality of unit sequences of a fixed length.
In a more specific example, the immunogen binding site learning step may include predicting and classifying, in both directions, whether an immunogen binding site is present in each of the tokenized unit sequences, and verifying the accuracy of the predictions and feeding the result back.
Here, the tokenized unit sequences corresponding to both edges of the immunogenic polypeptide and of the immune protein may each be assigned a unique classification identifier and computed as respective embedding vectors, and these embedding vectors may be input concatenated to each other at the edges for training of the artificial neural network. By assigning such unique classification identifiers, the immunogen and the immune protein are distinguished during training, and the immunogen binding site and immunogenic determinant can be predicted reliably.
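One possible reading of this pairing scheme can be sketched at the token level, before any embedding is computed. The identifier strings `[AG]` and `[AB]`, the function name `build_pair_input`, and the accompanying segment labels are hypothetical choices made for illustration; the text states only that unique classification identifiers are assigned at the edges of each chain and that the two chains are concatenated there.

```python
def build_pair_input(antigen_tokens, antibody_tokens,
                     antigen_id="[AG]", antibody_id="[AB]"):
    """Wrap each chain in its own identifier token and concatenate the two chains.

    Returns the combined token list and a parallel list of segment labels
    (0 = immunogenic polypeptide, 1 = immune protein). The identifier strings
    are illustrative assumptions, not values taken from the source.
    """
    antigen_part = [antigen_id, *antigen_tokens, antigen_id]
    antibody_part = [antibody_id, *antibody_tokens, antibody_id]
    tokens = antigen_part + antibody_part
    segments = [0] * len(antigen_part) + [1] * len(antibody_part)
    return tokens, segments


tokens, segments = build_pair_input(["MKT", "AYI"], ["QVQ"])
print(tokens)    # ['[AG]', 'MKT', 'AYI', '[AG]', '[AB]', 'QVQ', '[AB]']
print(segments)  # [0, 0, 0, 0, 1, 1, 1]
```

In a full model each token would then be mapped to an embedding vector; the sketch stops at the token list because the embedding layer is not specified in the text.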
According to the method of the first embodiment described above, the presence, position, or sequence of the immunogenic determinant, which an immune protein or immune cell specifically recognizes or binds to within an immunogen to cause an immune response, can be predicted more reliably even when information on the protein, in particular data on its three-dimensional folding structure, is insufficient.
According to the other embodiment of the invention, the presence, position, or sequence of the immunogen binding site, which specifically binds to the immunogen within an immune protein or immune cell to cause an immune response, can likewise be predicted more reliably even when such protein-related data are insufficient. Furthermore, the method of the other embodiment can reliably predict not only the immunogen binding site but also the immunogenic determinant within the immunogen.
Therefore, the prediction method of either embodiment predicts immunogenic determinants and/or immunogen binding sites more reliably and can contribute greatly to developing more effective immune-based therapeutics or treatment methods quickly and economically.
FIG. 1 is a schematic diagram of a method for predicting an immunogenic determinant according to one embodiment of the invention.
FIG. 2 is a schematic diagram of the immunogen binding site learning step that distinguishes the method for predicting an immunogen binding site according to another embodiment of the invention from the first embodiment.
Hereinafter, methods for predicting an immunogenic determinant and an immunogen binding site according to embodiments of the invention are described with reference to the accompanying drawings. FIG. 1 schematically shows an example of the method for predicting an immunogenic determinant according to one embodiment, specifically the pre-learning step and the immunogenic determinant learning step that it includes.
As shown in FIG. 1, the prediction method according to one embodiment derives a prediction model that predicts the presence, position, or sequence of an immunogenic determinant, such as an antigenic determinant (epitope), within an immunogen through a pre-learning step based on the first database of proteins and an immunogenic determinant learning step based on the second database of immunogenic and non-immunogenic polypeptides.
In the pre-learning and immunogenic determinant learning steps, the unit sequences of each protein and each immunogenic or non-immunogenic polypeptide are tokenized on the basis of the sequence information in the first and second databases, and the artificial neural network model is trained in both directions while some of the tokenized unit sequences are masked or the tokenized unit sequences are classified.
Through this bidirectional training, a prediction model can be derived that predicts the order, arrangement, and function of each unit sequence, and predicts whether each unit sequence corresponds to an immunogenic determinant specifically recognized by an immune protein such as an antibody or by an immune cell.
It was confirmed that using such a prediction model makes it possible to reliably predict the immunogenic determinant within a target immunogen such as a target antigen. In particular, the prediction method of this embodiment predicts the immunogenic determinant reliably from the sequence information of the proteins and of the immunogenic and non-immunogenic polypeptides. Therefore, even when data on the three-dimensional folding structure of proteins such as immunogens are insufficient, the immunogenic determinant that causes an immune response within an immunogen can be predicted reliably, and as a result conformational antigenic determinants (structural epitopes), which were difficult to predict with existing methods, can also be predicted more reliably.
Meanwhile, in the prediction method of one embodiment, the first database used in the pre-learning step may include, for a plurality of proteins, information on one or more of sequence, function, structure, and known variants. More specifically, the first database may be a protein database containing information on the function, sequence, domain structure, and identified variants of many proteins. Examples include the protein databases described in “Bairoch, A., and R. Apweiler. 1996. “The SWISS-PROT Protein Sequence Data Bank and Its New Supplement TREMBL.” Nucleic Acids Research 24 (1): 21-25” and “Boutet, Emmanuel, Damien Lieberherr, Michael Tognolli, Michel Schneider, and Amos Bairoch. 2007. “UniProtKB/Swiss-Prot.” Methods in Molecular Biology 406: 89-112”.
In the pre-learning step, the information on the many proteins in the first database may be used as training data without further processing. To make training of the artificial neural network more efficient, however, it may first be processed into information on proteins whose amino acid sequences are no longer than a certain length, for example 5000 or less, 3000 or less, or 2500 or less, and more specifically into proteins whose amino acid sequences all have the same fixed length. Likewise, in the immunogenic determinant learning step described below, the information on the many immunogenic and non-immunogenic polypeptides in the second database may first be processed into information on polypeptides whose amino acid sequences are no longer than a certain length or all have the same fixed length, which further improves the efficiency of training the network to derive the prediction model.
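The length preprocessing described above could look like the following minimal sketch. The text says only that sequences are brought to at most a certain length, or to the same fixed length; truncating longer sequences, right-padding shorter ones, the padding symbol `-`, and the function name `to_fixed_length` are all assumptions made for illustration.

```python
PAD = "-"  # hypothetical padding symbol; the source does not specify one


def to_fixed_length(sequences, length=2500):
    """Truncate longer sequences and right-pad shorter ones to exactly `length` residues."""
    if length < 1:
        raise ValueError("length must be a positive integer")
    return [seq[:length].ljust(length, PAD) for seq in sequences]


print(to_fixed_length(["ACDE", "AC"], length=3))  # ['ACD', 'AC-']
```

Discarding over-length sequences instead of truncating them would be an equally valid reading of "a certain length or less"; the choice here is only for brevity.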
Also, as shown in FIG. 1, in the pre-learning step the amino acid sequence of each protein in the first database may be tokenized by dividing it into a plurality of unit sequences of a fixed length. Tokenization may be performed, for example, per amino acid unit sequence of a fixed length, so that a plurality of tokens is generated from the amino acid sequence of the protein.
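The tokenization described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the unit length `k`, the use of non-overlapping windows, and the function name `tokenize_sequence` are assumptions, since the text specifies only unit sequences of a fixed length.

```python
def tokenize_sequence(sequence: str, k: int = 3) -> list[str]:
    """Split an amino acid sequence into consecutive unit sequences of length k.

    The final token is shorter than k when len(sequence) is not a multiple of k.
    """
    if k < 1:
        raise ValueError("k must be a positive integer")
    return [sequence[i:i + k] for i in range(0, len(sequence), k)]


tokens = tokenize_sequence("MKTAYIAKQR", k=3)
print(tokens)  # ['MKT', 'AYI', 'AKQ', 'R']
```

With `k=1` each amino acid residue becomes its own token, which is the simplest case of a fixed-length unit sequence.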
In the pre-training step, a portion of the tokenized unit sequences, for example 5 to 25%, or 10 to 20%, is masked, and bidirectional learning is carried out through the attention mechanism of the artificial neural network model along both directions of the protein around each masked unit sequence. Here, masking means hiding some of the tokenized unit sequences; a masked token may be marked as a 'MASK' token.
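The masking operation can be illustrated with the sketch below, which hides a chosen fraction of tokens and remembers the originals as prediction targets. The function name `mask_tokens`, the 15% default (which falls inside the 10 to 20% range mentioned above), and the rule of masking at least one token are assumptions for illustration.

```python
import random
from typing import Dict, List, Optional, Tuple

MASK = "MASK"  # marker named in the text for masked tokens


def mask_tokens(
    tokens: List[str], mask_fraction: float = 0.15, seed: Optional[int] = None
) -> Tuple[List[str], Dict[int, str]]:
    """Replace a random fraction of tokens with MASK.

    Returns the masked token list and a dict mapping each masked position to
    the original token, which serves as the target to be predicted.
    """
    if not 0.0 < mask_fraction <= 1.0:
        raise ValueError("mask_fraction must be in (0, 1]")
    if not tokens:
        return [], {}
    rng = random.Random(seed)
    # Assumption: mask at least one token, and never more than are available.
    n_mask = min(len(tokens), max(1, round(len(tokens) * mask_fraction)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    targets: Dict[int, str] = {}
    for pos in positions:
        targets[pos] = masked[pos]
        masked[pos] = MASK
    return masked, targets
```

Calling the function repeatedly with different seeds yields different mask combinations over the same sequence, which is one way to realize the "various combinations" of masking during training.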
During this neural network training, the unit sequence corresponding to each masked token is predicted, and the accuracy of the prediction is checked and verified against validation and test datasets split off from the first database and then fed back. By repeatedly masking different combinations of the tokenized unit sequences, predicting the masked unit sequences along both directions of the protein, and checking, verifying, and feeding back the accuracy of those predictions, the artificial neural network model can be pre-trained on the function, arrangement, or order of each unit sequence contained in a protein.
As a result of this pre-training, the artificial neural network model becomes able to predict the order and arrangement of the unit sequences in a protein with minimal consideration of biological and/or chemical properties. In the prediction method of one embodiment, performing the pre-training step first can further improve the reliability of the immunogenic determinant prediction model derived through the steps described below.
For the bidirectional neural network training of the pre-training step described above, a prediction model generating device may be used that includes: a tokenization module that tokenizes the unit sequences; a masking module that masks some of the tokenized unit sequences; a transformation module and a learning module that carry out training while transforming and predicting the masked unit sequences; a prediction accuracy calculation module that checks and computes the accuracy of the unit sequence predictions; and a prediction model generation module that generates a prediction model for, among other things, the arrangement of the unit sequences.
The bidirectional neural network training described above, and the prediction model generating device used to carry it out, may take a form similar to the bidirectional neural network training methods and prediction model generating devices previously applied to language learning. Examples are disclosed in Korean Registered Patent Publication No. 2426508 and in “Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. “BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding.” In Devlin-Etal-2019-Bert, 4171-86. Association for Computational Linguistics”.
In the prediction method of one embodiment, after the pre-training step described above, a step of learning a prediction model for immunogenic determinants is carried out by performing bidirectional neural network training based on a second database of immunogenic and non-immunogenic polypeptides, classified according to whether they elicit an immune response in which an immune protein such as an antibody, or an immune cell, specifically recognizes or binds them.
In this immunogenic determinant learning step, the second database may distinguish immunogenic polypeptides, which contain an immunogenic determinant that elicits an immune response, from non-immunogenic polypeptides, which contain no immunogenic determinant, and may include information on their sequences or structures. More specifically, the second database may include information on immunogenic and non-immunogenic polypeptides confirmed through multiple experiments to elicit, or not to elicit, an immune response with two or more immune proteins or immune cells. Using a second database of such validated immunogenic/non-immunogenic polypeptides can further increase the reliability of the prediction method of one embodiment.
In a more specific embodiment, the second database may include one or more types of information selected from the group consisting of contiguous amino acid sequence information, non-contiguous amino acid sequence information, and three-dimensional structure information of the immunogenic and non-immunogenic polypeptides. For example, the prediction method of one embodiment can predict immunogenic determinants in the form of structural epitopes as well as linear epitopes. To predict linear epitopes, a second database containing contiguous amino acid sequence information of the immunogenic and non-immunogenic polypeptides may be used as training data; to predict structural epitopes, a second database containing non-contiguous amino acid sequence information or three-dimensional structure information may be used. Examples of such a second database include the database described in “Vita, Randi, Swapnil Mahajan, James A. Overton, Sandeep Kumar Dhanda, Sheridan Martini, Jason R. Cantrell, Daniel K. Wheeler, Alessandro Sette, and Bjoern Peters. 2019. “The Immune Epitope Database (IEDB): 2018 Update.” Nucleic Acids Research 47 (D1): D339-43” and databases available through the Protein Data Bank (PDB).
As already noted above, in the immunogenic determinant learning step, pre-processing the information on the many immunogenic and non-immunogenic polypeptides in the second database into information on polypeptides with amino acid sequences of a fixed length can further increase the reliability of the prediction method.
In the immunogenic determinant learning step, as also shown in FIG. 1, the amino acid sequence of each polypeptide in the second database may be divided into a plurality of unit sequences of a fixed length and tokenized. Tokenization may be carried out, for example, per amino acid unit sequence of a fixed length corresponding to an immunogenic determinant such as an epitope, so that a plurality of tokens is generated from the amino acid sequence of an immunogenic or non-immunogenic polypeptide.
In the immunogenic determinant learning step, the artificial neural network model may be trained along both directions of the polypeptide around the tokenized unit sequences, as in the pre-training step. During this bidirectional training, the model may predict and classify, for each unit sequence, whether it corresponds to or contains an immunogenic determinant, and output the result as data (for example, after the immunogenic determinant is predicted in “B” of FIG. 1, the prediction is classified as “0” or “1” and output).
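The per-unit “0”/“1” output can be pictured with the short sketch below. It assumes, purely for illustration, that the model emits one probability per token; the 0.5 threshold and the helper names `classify_tokens` and `positive_spans` are not taken from the source.

```python
from typing import List, Tuple


def classify_tokens(probabilities: List[float], threshold: float = 0.5) -> List[int]:
    """Turn per-token probabilities into 0/1 labels (1 = predicted determinant)."""
    return [1 if p >= threshold else 0 for p in probabilities]


def positive_spans(labels: List[int]) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) index pairs for each run of consecutive 1s."""
    spans: List[Tuple[int, int]] = []
    start = None
    for i, lab in enumerate(labels):
        if lab == 1 and start is None:
            start = i  # a new run of positives begins here
        elif lab != 1 and start is not None:
            spans.append((start, i - 1))
            start = None
    if start is not None:
        spans.append((start, len(labels) - 1))
    return spans
```

Grouping adjacent positive tokens into spans is one simple way to report the position of a predicted determinant within the polypeptide.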
The accuracy of these immunogenic determinant predictions may be checked and verified against a test dataset split off from the second database and then fed back. For example, the parameters may be adjusted through a loss function such as weighted cross entropy. Through this process of predicting whether each tokenized unit sequence is an immunogenic determinant and of checking, verifying, and feeding back the accuracy of the predictions, a prediction model can be derived that reliably predicts information such as the presence, position, and arrangement of immunogenic determinants within an immunogen.
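A weighted cross entropy loss for the binary “determinant / not determinant” case might look like the sketch below. The source names the loss but not its exact form, so the single positive-class weight `pos_weight`, the mean reduction, and the function name `weighted_bce` are assumptions.

```python
import math
from typing import List


def weighted_bce(
    probs: List[float], labels: List[int], pos_weight: float = 1.0, eps: float = 1e-12
) -> float:
    """Mean weighted binary cross entropy.

    Each positive example (label 1) contributes -pos_weight * log(p);
    each negative example (label 0) contributes -log(1 - p).
    """
    if len(probs) != len(labels) or not labels:
        raise ValueError("probs and labels must be non-empty and of equal length")
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(pos_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)
```

Setting `pos_weight` above 1 penalizes missed determinant tokens more heavily, which can help when positive tokens are much rarer than negative ones.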
By inputting the sequence or structure information of an unknown target immunogen or target protein into the trained artificial neural network model or the immunogenic determinant prediction model, the presence, position, or sequence of immunogenic determinants within that immunogen can be predicted reliably.
For the bidirectional neural network training of the immunogenic determinant learning step described above, a prediction model generating device may be used that includes: a tokenization module that tokenizes the unit sequences; a classification module that classifies the tokenized unit sequences; a learning module that carries out training while predicting the unit sequences; a prediction accuracy calculation module that checks and computes the accuracy of the unit sequence predictions; and a prediction model generation module that generates a prediction model for, among other things, the arrangement of the unit sequences.
According to the prediction method of one embodiment described above, the immunogenic determinant of a target immunogen or target protein, typified by an antigen protein, can be predicted reliably. Such an immunogenic determinant is typified by an epitope, the part of the antigen protein that binds specifically to an antibody and elicits an immune response. It was also confirmed that the prediction method of one embodiment can reliably predict structural epitopes as well as linear epitopes even when information such as the three-dimensional structure of the protein is lacking.
The prediction method of one embodiment can therefore predict immunogenic determinants such as epitopes more reliably from only the basic sequence information of, for example, an unknown antigen protein, and can thereby contribute to the more effective development of immune-based therapeutics or treatment methods.
According to another embodiment of the invention, a method is provided for predicting not only the immunogenic determinant described above but also an immunogen binding site (for example, an antigen-binding site; paratope) within an immune protein such as an antibody, or within an immune cell, that undergoes an immune reaction with an immunogen such as an antigen. The prediction method of this other embodiment may include: a step of pre-training the function and arrangement of each unit sequence of a protein by training a bidirectional artificial neural network model, based on a first database of proteins, while masking the tokenized unit sequences contained in each protein;
a step of learning, for each unit sequence of the immune protein, information on the immunogen binding site that undergoes an immune reaction with the immunogenic polypeptide, by performing bidirectional neural network training, based on a third database of complexes of mutually bound immunogenic polypeptides and immune proteins, while classifying each tokenized unit sequence contained in each complex according to whether it is an immunogen binding site; and
a step of predicting an immunogen binding site in a target immune protein or target immune cell using the trained artificial neural network.
For reference, FIG. 2 shows a schematic diagram of the immunogen binding site learning step, which is what distinguishes the immunogen binding site prediction method of this other embodiment from the one embodiment. In the prediction method of the other embodiment, the pre-training step based on the first database may be carried out in the same way as in the prediction method of the one embodiment described above, so further description of it is omitted.
In the prediction method of the other embodiment, the immunogen binding site learning step uses, instead of the second database described above, a third database of complexes of mutually bound immunogenic polypeptides and immune proteins. More specifically, the third database may be a database of the sequences or structures of immunogenic polypeptides such as antigens, immune proteins such as antibodies, and the complexes in which they are bound to each other, and may include both information on the immunogenic determinants contained in the immunogenic polypeptides and information on the immunogen binding sites contained in the immune proteins.
In a more specific example, the third database may include, for the immunogenic polypeptides, the immune proteins, or their complexes, information on their respective sequences or structures, on the sequence or binding position of the immunogenic determinant contained in the immunogenic polypeptide, or on the sequence or binding position of the immunogen binding site contained in the immune protein, or may include information on all of these. Examples of such a third database include databases known through EpiPred (Krawczyk K, Liu X, Baker T, Shi J, Deane CM. Improving B-Cell Epitope Prediction and its Application to Global Antibody-Antigen Docking. Bioinformatics (2014) 30:2288-94), Docking Benchmarking Dataset (DBD) v5 (Vreven T, Moal IH, Vangone A, Pierce BG, Kastritis PL, Torchala M, et al. Updates to the Integrated Protein-Protein Interaction Benchmarks: Docking Benchmark Version 5 and Affinity Benchmark Version 2. J Mol Biol (2015) 427:3031-41), or “Daberdaku S, Ferrari C (2019) Antibody interface prediction with 3D Zernike descriptors and SVM. Bioinformatics 35:1870-1876”.
In the immunogen binding site learning step, as shown in FIG. 2, the amino acid sequences of the immunogenic polypeptide and of the immune protein in the third database may each be divided into a plurality of unit sequences of a fixed length and tokenized. Tokenization may be carried out, for example, per amino acid unit sequence of a fixed length corresponding respectively to an immunogenic determinant, such as an epitope of the immunogenic polypeptide, and to an immunogen binding site, such as an antigen-binding site of the immune protein, so that a plurality of tokens is generated from each amino acid sequence of the immunogenic polypeptide and the immune protein.
Here, in the immunogenic polypeptide and/or the immune protein, a unique classification identifier (see, for example, “CLS” and “SEP” in FIG. 2) may be assigned to the tokenized unit sequence at each of the two edges, and each may be computed as an embedding vector. The embedding vectors of the immunogenic polypeptide and the immune protein are then input in a state concatenated at those edges, and the subsequent neural network training proceeds on that input.
The order in which the immunogenic polypeptide and the immune protein are concatenated in the embedding vector is not particularly limited, but for effective neural network training it is appropriate to train on embedding vectors in which the immunogenic polypeptide and the immune protein are always concatenated in the same order.
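One way to picture the concatenated input is the BERT-style layout sketched below. The exact placement of the identifiers is an assumption (a leading “CLS”, a “SEP” after each chain), as are the function name `build_pair_input` and the segment ids; the source states only that identifiers such as “CLS” and “SEP” mark the edges and that the two chains are joined in a consistent order.

```python
from typing import List, Tuple

CLS, SEP = "CLS", "SEP"  # identifier names taken from FIG. 2 as cited in the text


def build_pair_input(
    antigen_tokens: List[str], antibody_tokens: List[str]
) -> Tuple[List[str], List[int]]:
    """Concatenate antigen and antibody tokens into a single model input.

    Assumed layout: CLS + antigen + SEP + antibody + SEP, antigen always first.
    Segment ids (0 = antigen side, 1 = antibody side) let a downstream model
    tell the two chains apart.
    """
    tokens = [CLS] + list(antigen_tokens) + [SEP] + list(antibody_tokens) + [SEP]
    segment_ids = [0] * (len(antigen_tokens) + 2) + [1] * (len(antibody_tokens) + 1)
    return tokens, segment_ids
```

Keeping the antigen first in every example matches the recommendation above that the two chains be joined in a fixed order.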
Because the data on the immunogenic polypeptide and/or the immune protein are input separated by unique classification identifiers during neural network training, the prediction method of the other embodiment can predict the immunogenic determinant of the immunogen together with the immunogen binding site of the immune protein.
In the immunogen binding site learning step, neural network training may proceed along both directions around the tokenized unit sequences. During this bidirectional training, the model may predict and classify, for each unit sequence, whether it corresponds to or contains an immunogen binding site (or an immunogenic determinant that binds to it), and output the result as data (for example, after the immunogen binding site is predicted in FIG. 2, the prediction is classified as “0” or “1” and output). The immunogenic determinant that binds to the immunogen binding site may also be distinguished by the unique classification identifiers, and its prediction result may be output as well.
The accuracy of these immunogen binding site (and/or immunogenic determinant) predictions may be checked and verified against a test dataset split off from the third database and then fed back. Through this process of predicting whether each tokenized unit sequence contains an immunogen binding site and of checking, verifying, and feeding back the accuracy of the predictions, a prediction model can be derived that reliably predicts information such as the presence, position, and arrangement of immunogen binding sites within an immune protein such as an antibody.
By inputting information such as the sequence of an unknown target immune protein or target immune cell into the trained artificial neural network or the immunogen binding site prediction model, the presence, position, or sequence of immunogen binding sites within an immune protein such as an antibody can be predicted reliably. Furthermore, because the neural network in the prediction method of the other embodiment is trained on third-database data in which not only the immune proteins but also the immunogenic polypeptides are input separated by unique classification identifiers, the immunogenic determinant within the immunogen that binds to the immune protein can also be predicted, in addition to the immunogen binding site within the immune protein.
In the immunogen binding site learning step described above, as in the immunogenic determinant learning step of the one embodiment, neural network training may be carried out using a prediction model generating device that includes a tokenization module, a classification module, a learning module, a prediction accuracy calculation module, and a prediction model generation module.
According to the prediction method of the other embodiment described above, the immunogen binding site of a target immune protein, typified by an antibody protein, and additionally the immunogenic determinant of a target immunogen, typified by an antigen protein, can be predicted reliably. The immunogen binding site is typified by a paratope, the antigen-binding site of the antibody protein that binds specifically to the epitope of an antigen and elicits an immune response.
The prediction method of the other embodiment can therefore predict immunogen binding sites such as paratopes, and immunogenic determinants such as epitopes, more reliably from only the basic sequence information of, for example, an unknown antibody and/or antigen protein, and can thereby contribute to the more effective development of immune-based therapeutics or treatment methods.
Preferred examples of the invention and comparative examples are described below. The following examples are merely preferred examples of the invention, and the invention is not limited to them.
Example 1: Evaluation of the prediction reliability for immunogenic determinants
First, as the first database for the pre-training step, the Swiss-Prot database described in “Bairoch, A., and R. Apweiler. 1996. “The SWISS-PROT Protein Sequence Data Bank and Its New Supplement TREMBL.” Nucleic Acids Research 24 (1): 21-25” was used. This database contains information on the function, domain structure, variants, and sequence of more than 560,000 proteins. Data such as protein sequences were extracted from this first database and divided into a training dataset, a validation dataset, and a test dataset at a ratio of 6 : 2 : 2, and the pre-training step shown in “A” of FIG. 1 was carried out on that basis.
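A 6 : 2 : 2 split of this kind can be sketched as follows. The shuffling, the fixed seed, and the function name `split_dataset` are assumptions for illustration; the source does not state how the records were assigned to each subset.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_dataset(
    items: Sequence[T], ratios: Tuple[int, int, int] = (6, 2, 2), seed: int = 0
) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle items and split them into training, validation and test subsets.

    Assumption: a random shuffle with a fixed seed; any remainder left over
    from integer division is assigned to the test subset.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    total = sum(ratios)
    n_train = len(shuffled) * ratios[0] // total
    n_val = len(shuffled) * ratios[1] // total
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test
```

The same helper could produce a two-way 8 : 2 split by passing `ratios=(8, 0, 2)`, in which case the validation subset is simply empty.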
As the second database for the immunogenic determinant learning step, separate databases were used for linear epitopes and for structural epitopes. For linear epitopes, the database disclosed in “Vita, Randi, Swapnil Mahajan, James A. Overton, Sandeep Kumar Dhanda, Sheridan Martini, Jason R. Cantrell, Daniel K. Wheeler, Alessandro Sette, and Bjoern Peters. 2019. “The Immune Epitope Database (IEDB): 2018 Update.” Nucleic Acids Research 47 (D1): D339-43”, in which antibodies and epitopes are distinguished, was used as the second database. For structural epitopes, data on a subset of antigens were extracted from the Protein Data Bank (PDB) and used as the second database.
The data extracted from these databases were divided into a training dataset and a test dataset at a ratio of 8 : 2, and cross-validation was performed on the training dataset. The immunogenic determinant learning step shown in “B” of FIG. 1 was carried out on the data thus divided. After the pre-training step and the immunogenic determinant learning step had been completed in this way, the epitopes of target immunogens were predicted (linear epitopes: Table 1; structural epitopes: Table 2), and the reliability of the predictions was evaluated statistically; the results are shown in Tables 1 and 2 below, respectively. The statistical parameters used to evaluate reliability were calculated by known methods such as those described at “https://jennainsight.tistory.com/entry/F1-Score-Roc%EA%B3%A1%EC%84%A0-Auc-%EA%B3%84%EC%82%B0%EB%B0%A9%EB%B2%95-scikit-learn-%EC%BD%94%EB%93%9C%EB%A1%9C-%EA%B5%AC%ED%98%84%ED%95%98%EA%B8%B0”, and the higher each calculated parameter, in particular the AUC, the more reliable the prediction was judged to be.
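For readers unfamiliar with these statistics, the following dependency-free sketch shows how AUC, macro F-1, and micro F-1 are commonly computed for a binary classifier. It is not the evaluation code used for the tables; the function names are hypothetical, and the AUC is computed by the pairwise-ranking definition rather than with any particular library.

```python
from typing import List


def roc_auc(labels: List[int], scores: List[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            wins += 1.0 if p > n else (0.5 if p == n else 0.0)
    return wins / (len(pos) * len(neg))


def f1_for_class(labels: List[int], preds: List[int], cls: int) -> float:
    """F1 score treating `cls` as the positive class."""
    tp = sum(1 for y, p in zip(labels, preds) if y == cls and p == cls)
    fp = sum(1 for y, p in zip(labels, preds) if y != cls and p == cls)
    fn = sum(1 for y, p in zip(labels, preds) if y == cls and p != cls)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(labels: List[int], preds: List[int]) -> float:
    """Unweighted mean of the per-class F1 scores for classes 0 and 1."""
    return (f1_for_class(labels, preds, 0) + f1_for_class(labels, preds, 1)) / 2


def micro_f1(labels: List[int], preds: List[int]) -> float:
    """Micro-averaged F1; for single-label classification this equals accuracy."""
    return sum(1 for y, p in zip(labels, preds) if y == p) / len(labels)
```

AUC is threshold-free, whereas the F-1 scores depend on the cutoff used to turn scores into “0”/“1” predictions, which is one reason the two kinds of statistic are typically reported side by side.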
The reliability evaluation results are presented as a comparison of Example 1, Comparative Example 1, in which the pre-training step of Example 1 was omitted, and Comparative Examples 2 to 15, to which existing prediction methods were applied. For Comparative Examples 2 to 15, epitopes were predicted according to the methods described in the references listed at the bottom of Tables 1 and 2, and the reliability of those predictions was then evaluated.
참고로, 비교예 2~15는 기존에 항원 결정기 등을 예측하기 위해 적용 가능한 것으로 알려진 학습 및 예측 모델로서, 대표적으로 비교예 3 및 5의 예측 모델은 다음과 같다: For reference, Comparative Examples 2 to 15 are learning and prediction models known to be applicable to predict antigenic determinants, etc. Representatively, the predictive models of Comparative Examples 3 and 5 are as follows:
- Random Forest Classifier (비교예 3): tree-based 모델로 각 단계에서 최적의 선택을 내리는 모델임. 데이터 샘플링을 통해 많은 모델들을 만들며, 최종적으로 투표를 통해 가장 많은 표를 받은 선택을 내리는 전형적인 분류에 의한 예측 모델;- Random Forest Classifier (Comparative Example 3): A tree-based model that makes optimal choices at each stage. A prediction model based on a typical classification that creates many models through data sampling and finally selects the most voted selection through voting;
- Gradient Boosting classifier (비교예 5): 상기 tree-based 모델에 부스팅 기능을 추가한 것으로, 모델이 잘못 예측한 데이터들을 그 다음 모델에 반영하여, 다시 옳게 예측하도록 만드는 방법론. - Gradient Boosting classifier (Comparative Example 5): A methodology in which a boosting function is added to the tree-based model, and data predicted incorrectly by the model are reflected in the next model to correctly predict again.
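The "many models built from resampled data, then a vote" idea behind the random forest can be shown with a toy sketch. It is not the scikit-learn implementation used in the comparative examples and not a full random forest (the trees here are depth-1 stumps on a single feature, with no feature subsampling); the data, the function names, and the choice of 25 models are hypothetical.

```python
import random
from collections import Counter

def fit_stump(samples):
    """Choose the threshold on a 1-D feature that makes the fewest training errors."""
    best = None
    for threshold, _ in samples:
        errors = sum((x >= threshold) != bool(y) for x, y in samples)
        if best is None or errors < best[0]:
            best = (errors, threshold)
    return best[1]

def fit_bagged_stumps(samples, n_models=25, seed=0):
    """Train one stump per bootstrap resample (sampling with replacement)."""
    rng = random.Random(seed)
    return [fit_stump([rng.choice(samples) for _ in samples])
            for _ in range(n_models)]

def predict(models, x):
    """Each stump votes 0 or 1; the majority class is returned."""
    votes = Counter(int(x >= threshold) for threshold in models)
    return votes.most_common(1)[0][0]

# Hypothetical (feature, label) pairs.
data = [(0.1, 0), (0.2, 0), (0.3, 0), (0.4, 0),
        (0.6, 1), (0.7, 1), (0.8, 1), (0.9, 1)]
models = fit_bagged_stumps(data)
print(predict(models, 0.05), predict(models, 0.95))  # 0 1
```

Gradient boosting differs in that the models are trained sequentially, each one fitted to the errors of the ensemble so far, rather than independently on resamples.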
Table 1. Reliability evaluation of linear epitope prediction results

| Statistical parameter | AUC | Macro F-1 | Micro F-1 | Precision | Recall |
|---|---|---|---|---|---|
| Example 1 | 0.922 | 0.853 | 0.860 | 0.855 | 0.852 |
| Comparative Example 1 (Example 1 with the pre-learning step omitted) | 0.557 | 0.509 | 0.509 | 0.538 | 0.537 |
| Comparative Example 2 (Bagging Classifier) | 0.861 | 0.752 | 0.771 | 0.747 | 0.767 |
| Comparative Example 3 (RandomForest Classifier) | 0.857 | 0.762 | 0.780 | 0.756 | 0.780 |
| Comparative Example 4 (ExtraTrees Classifier) | 0.857 | 0.763 | 0.781 | 0.756 | 0.780 |
| Comparative Example 5 (GradientBoosting Classifier) | 0.736 | 0.661 | 0.699 | 0.659 | 0.695 |
| Comparative Example 6 (AdaBoost Classifier) | 0.590 | 0.489 | 0.611 | 0.535 | 0.586 |
| Comparative Example 7 (LogisticRegression Classifier) | 0.590 | 0.588 | 0.611 | 0.535 | 0.568 |
| Comparative Example 8 (Linear Discriminant Analysis) | 0.590 | 0.490 | 0.612 | 0.536 | 0.586 |
| Comparative Example 9 (Bernoulli NB) | 0.588 | 0.518 | 0.609 | 0.544 | 0.576 |
| Comparative Example 10 (Gaussian NB) | 0.583 | 0.536 | 0.536 | 0.558 | 0.558 |
| Comparative Example 11 (Quadratic Discriminant Analysis) | 0.515 | 0.506 | 0.549 | 0.511 | 0.512 |
| Comparative Example 12 (SupportVector Classifier) | 0.510 | 0.384 | 0.601 | 0.502 | 0.570 |
* Prediction methods of Comparative Examples 2 to 12 (sources):
- Comparative Example 2: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html
- Comparative Example 3: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
- Comparative Example 4: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html
- Comparative Example 5: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html
- Comparative Example 6: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html
- Comparative Example 7: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
- Comparative Example 8: https://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html
- Comparative Example 9: https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.BernoulliNB.html
- Comparative Example 10: http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html
- Comparative Example 11: https://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.QuadraticDiscriminantAnalysis.html
- Comparative Example 12: https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html
Table 2. Reliability evaluation of conformational epitope prediction results

| Statistical parameter | AUC | Macro F-1 | Micro F-1 | Precision | Recall |
|---|---|---|---|---|---|
| Example 1 | 0.678±0.021 | 0.478±0.014 | 0.625±0.037 | 0.535±0.007 | 0.615±0.027 |
| Comparative Example 1 (Example 1 with the pre-learning step omitted) | 0.571±0.019 | 0.253±0.164 | 0.331±0.288 | 0.426±0.194 | 0.523±0.023 |
| Comparative Example 13 (Bepipred-2.0) | 0.62 | - | - | - | - |
| Comparative Example 14 (LBtope) | 0.54 | - | - | - | - |
| Comparative Example 15 (NetSurfP) | 0.60 | - | - | - | - |
* Prediction methods of Comparative Examples 13 to 15 (sources):
- Comparative Example 13: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5570230/
- Comparative Example 14: https://pubmed.ncbi.nlm.nih.gov/23667458/
- Comparative Example 15: https://pubmed.ncbi.nlm.nih.gov/19646261/
Referring to Tables 1 and 2, it was confirmed that the epitope prediction method of Example 1 predicts linear and conformational epitopes within immunogens such as antigens more reliably than Comparative Example 1, in which the pre-learning step was omitted, and than any of the previously known prediction methods of Comparative Examples 2 to 15, including the tree-based models of Comparative Examples 3 and 5 that have conventionally been used to predict epitopes and the like.
This appears to be because, unlike existing tree-based models such as those of Comparative Examples 3 and 5, which need a substantial amount of data to predict reliably through stepwise classification and selection, the prediction method of the Example can achieve higher prediction reliability even with limited data, owing to the pre-learning step and the immunogenic determinant learning step based on bidirectional artificial neural network learning.
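As a reading aid for the Macro F-1 and Micro F-1 columns of Tables 1 and 2, the sketch below computes both from hypothetical labels. The source says only that the parameters were calculated by a known method (the cited blog post describes scikit-learn); the function names and toy data here are illustrative assumptions, and in practice scikit-learn's `f1_score` with `average="macro"` or `average="micro"` would be the usual choice.

```python
def per_class_counts(y_true, y_pred, label):
    """Return (true positives, false positives, false negatives) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    return tp, fp, fn

def f1(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1(y_true, y_pred):
    """Average the per-class F1 scores, weighting every class equally."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1(*per_class_counts(y_true, y_pred, l)) for l in labels) / len(labels)

def micro_f1(y_true, y_pred):
    """Pool the counts over all classes, then compute a single F1."""
    labels = sorted(set(y_true) | set(y_pred))
    counts = [per_class_counts(y_true, y_pred, l) for l in labels]
    return f1(*(sum(column) for column in zip(*counts)))

# Hypothetical imbalanced labels: 3 epitope residues (1), 5 non-epitope residues (0).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 1]
print(round(macro_f1(y_true, y_pred), 3), micro_f1(y_true, y_pred))  # 0.564 0.625
```

On imbalanced data the macro average is pulled down by the weaker class, while the micro average equals plain accuracy for single-label classification, which is one reason the two columns can differ.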
Example 2: Evaluation of prediction reliability for the immunogen binding site
First, the pre-learning step was carried out in the same manner as in Example 1.
As the main third database for the immunogen binding site learning step, the complex sequences from the database of antigen-antibody complexes disclosed in “Daberdaku S, Ferrari C (2019) Antibody interface prediction with 3D Zernike descriptors and SVM. Bioinformatics 35:1870-1876” (dataset as processed by Daberdaku and Ferrari (2019)) were used as input data, and the antigen-binding site (paratope) was predicted by the method described below.
In this example, the complex sequences from the antigen-antibody complex databases of EpiPred (Krawczyk K, Liu X, Baker T, Shi J, Deane CM. Improving B-Cell Epitope Prediction and its Application to Global Antibody-Antigen Docking. Bioinformatics (2014) 30:2288-94) and Docking Benchmarking Dataset (DBD) v5 (Vreven T, Moal IH, Vangone A, Pierce BG, Kastritis PL, Torchala M, et al. Updates to the Integrated Protein-Protein Interaction Benchmarks: Docking Benchmark Version 5 and Affinity Benchmark Version 2. J Mol Biol (2015) 427:3031-41) were also used as input data, as additional third databases. These were used so that, in addition to the immunogen binding site, immunogenic determinants such as epitopes could also be predicted according to the method of this example.
The data extracted from the third database were divided into a training dataset and a test dataset at a ratio of 8:2 and, as in Example 1, cross-validation was additionally performed on the training dataset. Based on the data thus divided, the immunogen binding site learning step shown in FIG. 2 was carried out. After the pre-learning step and the immunogen binding site learning step, the antigen-binding site of the target antibody was predicted, and the reliability of the prediction was statistically evaluated; the results are shown in Table 3 below. The statistical parameters used to evaluate reliability were calculated by a known method, for example the one published at “https://jennainsight.tistory.com/entry/F1-Score-Roc%EA%B3%A1%EC%84%A0-Auc-%EA%B3%84%EC%82%B0%EB%B0%A9%EB%B2%95-scikit-learn-%EC%BD%94%EB%93%9C%EB%A1%9C-%EA%B5%AC%ED%98%84%ED%95%98%EA%B8%B0”. The higher each calculated statistical parameter, and the AUC-ROC in particular, the more reliable the prediction was judged to be.
The reliability results are presented by comparing Example 2 with additional Comparative Examples 16 and 17, to which existing prediction methods were applied. For Comparative Examples 16 and 17, the antigen-binding site was predicted according to the methods described in the references listed below Table 3, and the reliability of those predictions was then evaluated.
Table 3. Reliability evaluation of antigen-binding site prediction results

| Statistical parameter | AUC ROC | AUC PR |
|---|---|---|
| Example 2 | 0.969 | 0.633 |
| Comparative Example 16 (ANTIBODY I-PATCH) | 0.840 | 0.376 |
| Comparative Example 17 (DABERDAKU ET AL.) | 0.950 | 0.658 |
* Prediction methods of Comparative Examples 16 and 17 (sources):
- Comparative Example 16: https://pubmed.ncbi.nlm.nih.gov/24006373/
- Comparative Example 17: https://pubmed.ncbi.nlm.nih.gov/30395191/
Referring to Table 3, it was confirmed that the antigen-binding site prediction method of Example 2 predicts the antigen-binding site in immune proteins such as antibodies more reliably, in terms of AUC-ROC, than the previously known prediction methods of Comparative Examples 16 and 17 (0.969 versus 0.840 and 0.950). Its AUC-PR (0.633) was higher than that of Comparative Example 16 (0.376) and slightly lower than that of Comparative Example 17 (0.658).
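The two columns of Table 3 measure different things, which the following sketch illustrates on hypothetical scores. AUC-ROC is computed here as the fraction of (positive, negative) pairs ranked correctly, and AUC-PR is approximated by average precision, one common estimator. The source does not state which AUC-PR estimator was used, so that choice, the function names, and the toy data are assumptions. The average-precision function assumes no tied scores.

```python
def auc_roc(y_true, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(y_true, scores):
    """Mean of the precision values at the rank of each positive (assumes no ties)."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / sum(y_true)

# Hypothetical per-residue labels (1 = binding residue) and model scores.
y_true = [0, 0, 1, 0, 0, 1, 0, 0]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.3, 0.05]
print(auc_roc(y_true, scores), average_precision(y_true, scores))  # 0.75 0.5
```

Because binding residues are a minority of all residues, the precision-recall measure is more sensitive to false positives than AUC-ROC, so the two metrics need not rank methods in the same order.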

Claims (21)

  1. An immunogenic determinant prediction method comprising: pre-learning the function and arrangement of each unit sequence of proteins by training a bidirectional artificial neural network model on a first database of proteins while masking the tokenized unit sequences contained in each protein;
    learning information on the immunogenic determinant that causes an immune response, for each unit sequence of polypeptides, by training the bidirectional artificial neural network on a second database of immunogenic and non-immunogenic polypeptides, which are distinguished according to whether an immune response occurs in which an immune protein or immune cell specifically recognizes or binds them, while classifying each tokenized unit sequence contained in each polypeptide; and
    predicting the immunogenic determinant of a target immunogen or target protein using the trained artificial neural network.
  2. The method of claim 1, wherein the first database includes information on at least one of the sequences, functions, structures, and known variants of a plurality of proteins.
  3. The method of claim 1, wherein the pre-learning step comprises tokenizing the amino acid sequence contained in the protein by dividing it into a plurality of unit sequences of a predetermined length.
  4. The method of claim 3, wherein the pre-learning step further comprises: masking some of the plurality of tokenized unit sequences; and
    pre-learning the function, arrangement, or order of each unit sequence by training the artificial neural network model in both directions around the masked unit sequences.
  5. The method of claim 4, wherein the artificial neural network model training step comprises: predicting the masked unit sequences; and
    verifying the accuracy of the prediction result and feeding it back.
  6. The method of claim 1, wherein the second database includes information on immunogenic and non-immunogenic polypeptides confirmed through a plurality of experiments to cause, or not to cause, an immune response with two or more immune proteins or immune cells.
  7. The method of claim 6, wherein the second database includes at least one type of information selected from the group consisting of contiguous amino acid sequence information, non-contiguous amino acid sequence information, and three-dimensional structure information of the immunogenic and non-immunogenic polypeptides.
  8. The method of claim 1, wherein the immunogenic determinant learning step comprises tokenizing the amino acid sequence contained in the polypeptide by dividing it into a plurality of unit sequences of a predetermined length.
  9. The method of claim 8, wherein the immunogenic determinant learning step comprises: bidirectionally predicting and classifying, for each of the plurality of tokenized unit sequences, whether an immunogenic determinant is present; and
    verifying the accuracy of the prediction result and feeding it back.
  10. The method of claim 9, wherein, in the immunogenic determinant learning step, whether an immunogenic determinant is present is predicted and classified for each unit sequence and is output and verified as data, and a prediction model for the immunogenic determinant is derived from the output data.
  11. The method of claim 1, wherein the first database or the second database is pre-processed so as to relate to proteins or polypeptides having amino acid sequences of a predetermined length or less.
  12. The method of claim 1 or 10, wherein, in the immunogenic determinant prediction step, sequence or structure information of the target immunogen or target protein is input to the trained artificial neural network or the immunogenic determinant prediction model to predict the presence, position, or sequence of the immunogenic determinant.
  13. The method of claim 1, wherein the target immunogen or target protein is an antigen protein, and the immunogenic determinant prediction step predicts an epitope of the antigen protein that specifically binds to an antibody and causes an immune response.
  14. The method of claim 13, wherein the immunogenic determinant prediction step predicts a linear epitope or a conformational epitope (structural epitope).
  15. An immunogen binding site prediction method comprising: pre-learning the function and arrangement of each unit sequence of proteins by training a bidirectional artificial neural network model on a first database of proteins while masking the tokenized unit sequences contained in each protein;
    learning information on the immunogen binding site that causes an immune response with the immunogenic polypeptide, for each unit sequence of the immune protein, by training the bidirectional artificial neural network on a third database of complexes of mutually bound immunogenic polypeptides and immune proteins while classifying each tokenized unit sequence contained in each complex according to whether it is an immunogen binding site; and
    predicting the immunogen binding site in a target immune protein or target immune cell using the trained artificial neural network.
  16. The method of claim 15, wherein the third database includes, with respect to the immunogenic polypeptide, the immune protein, or their complex, information on the sequence or structure of each, the sequence or binding position of the immunogenic determinant contained in the immunogenic polypeptide, and the sequence or binding position of the immunogen binding site contained in the immune protein.
  17. The method of claim 15, wherein the immunogen binding site learning step comprises tokenizing the amino acid sequences contained in the immunogenic polypeptide and in the immune protein, respectively, by dividing each into a plurality of unit sequences of a predetermined length.
  18. The method of claim 17, wherein the immunogen binding site learning step comprises: bidirectionally predicting and classifying, for each of the plurality of tokenized unit sequences, whether an immunogen binding site is present; and
    verifying the accuracy of the prediction result and feeding it back.
  19. The method of claim 17, wherein the tokenized unit sequences corresponding to both edges of the immunogenic polypeptide and of the immune protein are each assigned a unique classification identifier and converted into respective embedding vectors, and
    the embedding vectors of the immunogenic polypeptide and the immune protein are input in a state of being concatenated with each other at the edges, and artificial neural network training proceeds.
  20. The method of claim 15, further comprising predicting the immunogenic determinant of a target immunogen.
  21. The method of claim 15, wherein the target immune protein is an antibody protein, and the immunogen binding site prediction step predicts a paratope of the antibody protein that specifically binds to an epitope and causes an immune response.
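The tokenizing, masking, and sequence-pairing steps recited in claims 3 to 5, 17, and 19 can be illustrated with a short sketch. It is not the applicant's implementation: the special token names, the 15% masking fraction, the unit lengths, and the example sequences are hypothetical (borrowed from common masked-language-model practice), and the pairing is shown at the token level rather than on embedding vectors.

```python
import random

# Hypothetical special tokens; the source only speaks of "classification identifiers".
MASK, CLS, SEP = "[MASK]", "[CLS]", "[SEP]"

def tokenize(sequence, unit_length=1):
    """Divide an amino acid sequence into unit sequences of a fixed length."""
    return [sequence[i:i + unit_length]
            for i in range(0, len(sequence), unit_length)]

def mask_tokens(tokens, mask_fraction=0.15, seed=0):
    """Hide a random subset of tokens; return the masked list and the hidden targets."""
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_fraction))
    masked, targets = list(tokens), {}
    for pos in rng.sample(range(len(tokens)), n_mask):
        targets[pos] = masked[pos]   # what the model must recover from both sides
        masked[pos] = MASK
    return masked, targets

def join_pair(antigen_tokens, antibody_tokens):
    """Concatenate two token lists, marking their edges with identifier tokens."""
    return [CLS] + antigen_tokens + [SEP] + antibody_tokens + [SEP]

tokens = tokenize("MKTAYIAKQRQISFVKSHFS")          # hypothetical 20-residue protein
masked, targets = mask_tokens(tokens)
print(len(tokens), masked.count(MASK))              # 20 3
print(join_pair(tokenize("MKTAYI", 3), tokenize("QVQLVQ", 3)))
# ['[CLS]', 'MKT', 'AYI', '[SEP]', 'QVQ', 'LVQ', '[SEP]']
```

During pre-learning, a bidirectional model would be trained to recover each entry of `targets` from the unmasked tokens on both sides of it, and the prediction error would be fed back to update the model.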
PCT/KR2023/002582 2022-02-25 2023-02-23 Immunogenic determinant predicting method and immunogenic binding site predicting method WO2023163518A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
KR10-2022-0025194 2022-02-25
KR20220025194 2022-02-25
KR1020230023506A KR102645477B1 (en) 2022-02-25 2023-02-22 Prediction method of immunogenic determinant and immunogen binding site
KR10-2023-0023506 2023-02-22

Publications (1)

Publication Number Publication Date
WO2023163518A1 true WO2023163518A1 (en) 2023-08-31

Family

ID=87766443

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2023/002582 WO2023163518A1 (en) 2022-02-25 2023-02-23 Immunogenic determinant predicting method and immunogenic binding site predicting method

Country Status (1)

Country Link
WO (1) WO2023163518A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080172215A1 (en) * 2007-01-12 2008-07-17 Microsoft Corporation T-cell epiotope prediction
WO2013163348A1 (en) * 2012-04-24 2013-10-31 Laboratory Corporation Of America Holdings Methods and systems for identification of a protein binding site
CN109326324A (en) * 2018-09-30 2019-02-12 河北省科学院应用数学研究所 A kind of detection method of epitope, system and terminal device
JP2019179356A (en) * 2018-03-30 2019-10-17 株式会社エムティーアイ Epitope prediction method and epitope prediction system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
EDGAR LIBERIS, VELIČKOVIĆ PETAR, SORMANNI PIETRO, VENDRUSCOLO MICHELE, LIÒ PIETRO: "Parapred: Antibody Paratope Prediction using Convolutional and Recurrent Neural Networks", BIOINFORMATICS, OXFORD UNIVERSITY PRESS, SURREY, GB, 8 September 2017 (2017-09-08), XP055469248, ISSN: 1367-4803, DOI: 10.1093/bioinformatics/bty305 *

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23760392

Country of ref document: EP

Kind code of ref document: A1