WO2024049245A1 - Method for generating antibody sequence using machine learning technology - Google Patents

Method for generating antibody sequence using machine learning technology Download PDF

Info

Publication number
WO2024049245A1
WO2024049245A1 PCT/KR2023/013015
Authority
WO
WIPO (PCT)
Prior art keywords
amino acid
antibody
acid sequence
antibody amino
target antigen
Prior art date
Application number
PCT/KR2023/013015
Other languages
French (fr)
Korean (ko)
Inventor
서승우
박은영
강은지
김채은
강태현
곽민우
Original Assignee
주식회사 스탠다임 (Standigm Inc.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 주식회사 스탠다임 (Standigm Inc.)
Publication of WO2024049245A1 publication Critical patent/WO2024049245A1/en

Links

Images

Classifications

    • C CHEMISTRY; METALLURGY
    • C07 ORGANIC CHEMISTRY
    • C07K PEPTIDES
    • C07K16/00 Immunoglobulins [IGs], e.g. monoclonal or polyclonal antibodies
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01N INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N33/00 Investigating or analysing materials by specific methods not covered by groups G01N1/00 - G01N31/00
    • G01N33/48 Biological material, e.g. blood, urine; Haemocytometers
    • G01N33/50 Chemical analysis of biological material, e.g. blood, urine; Testing involving biospecific ligand binding methods; Immunological testing
    • G01N33/53 Immunoassay; Biospecific binding assay; Materials therefor
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B5/00 ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks

Definitions

  • Therapeutic antibodies have been a focus of the pharmaceutical industry since 1986, when the U.S. Food and Drug Administration approved the first monoclonal antibody (mAb) product.
  • the top 10 bestsellers in 2021 include four therapeutic antibodies, and the number of therapeutic antibodies currently approved by the FDA exceeds 100.
  • a mAb has two chains (heavy chain and light chain) and two regions (variable and constant) for each chain.
  • the variable heavy chain (VH) and variable light chain (VL) are primarily responsible for target specificity.
  • VH consists of 114 amino acids
  • VL consists of 110 amino acids
  • the variable region consists of approximately 220 amino acids.
  • the number of possible amino acid sequences of the variable region is calculated as A^L, where A is the number of amino acid types (20) and L is the protein length (approximately 220 for an antibody variable region), making it impossible to search all possible sequences of the variable region even with the most powerful high-throughput screening method. If site-specific amino acid frequencies in the constant region are taken into account, this number is significantly reduced. However, even for the creation or optimization of heavy chain complementarity determining region 3 (HCDR3) alone, which has an average length of 15 amino acids, approximately 10^8 sequences must be searched. By contrast, deep mutational scanning-based mutagenesis libraries span only approximately 10^4 sequences.
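The arithmetic above can be sketched in a few lines. This is a minimal illustration only: the raw HCDR3 space is 20^15, while the ~10^8 figure in the text already reflects site-specific frequency constraints.

```python
import math

# Search-space sizes assumed from the text: A = 20 amino acid types,
# L = 220 for the variable region, 15 for HCDR3.
def sequence_space(num_amino_acids: int, length: int) -> int:
    """Number of possible sequences of the given length: A**L."""
    return num_amino_acids ** length

variable_region = sequence_space(20, 220)  # ~10**286: beyond any screening method
hcdr3_raw = sequence_space(20, 15)         # ~3.3e19 before frequency constraints

print(f"variable region: ~10^{round(math.log10(variable_region))} sequences")
print(f"raw HCDR3 space: ~10^{round(math.log10(hcdr3_raw))} sequences")
```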
  • HCDR3 heavy chain complementarity determining region 3
  • deep learning is considered one of the most powerful tools.
  • deep learning is showing significant progress in many fields, including images, natural language, and protein structure.
  • deep learning has been used to find binding between HCDR3 and a target antigen; for example, deep learning models are applied to screen in silico optimized HCDR3 sequences from a library.
  • a screening library was generated by mutating every single site in HCDR3 and introducing site-directed mutagenesis.
  • however, such approaches are not only limited to a given template sequence; generating the entire sequence of the variable region also requires a huge amount of data, and the available data are insufficient.
  • Patent Document 1 KR 10-2022-0091497 A
  • a method for manufacturing an antibody drug using the generated antibody amino acid sequence is provided.
  • a computer-readable medium recording a program applied to perform a method for generating an antibody amino acid sequence.
  • a method of generating an antibody amino acid sequence comprising:
  • tagging, by a computing device, one or more first antibody amino acid sequences comprising one or more antibody amino acid sequences with a region or amino acid of each antibody;
  • pretraining, by the computing device, a machine learning model by training the machine learning model on the one or more tagged first antibody amino acid sequences to learn general characteristics of the antibody;
  • the general characteristic is selected from the group consisting of length of HCDR3, length of HCDR2, length of HCDR1, length of LCDR3, length of LCDR2, length of LCDR1, and isoelectric point;
  • generating, using the machine learning model, a fourth antibody amino acid sequence from the third antibody amino acid sequence that does not specifically bind to the target antigen.
  • the term “antibody” is used interchangeably with the term “immunoglobulin (Ig).”
  • the antibody may be, for example, IgA, IgD, IgE, IgG, or IgM.
  • the antibody may be a monoclonal antibody or a polyclonal antibody.
  • the antibody may be an animal-derived antibody, a mouse-human chimeric antibody, a humanized antibody, or a human antibody.
  • a complete antibody has a structure of two full-length light chains and two full-length heavy chains, and each light chain is bound to the heavy chain through a disulfide bond (SS-bond).
  • Each heavy chain consists of a heavy chain variable region (VH) and a heavy chain constant region (consisting of domains CH1, hinge, CH2, and CH3).
  • VH heavy chain variable region
  • CH3 heavy chain constant region
  • Each light chain consists of a light chain variable region (VL) and a light chain constant region (CL).
  • VL variable region
  • CL light chain constant region
  • the VH and VL regions can be further subdivided into hypervariable regions called complementarity determining regions (CDRs) interspersed with framework regions (FRs).
  • CDRs complementarity determining regions
  • Each VH and VL consists of three CDR and four FR fragments arranged from amino-to-carboxy-terminus in the following order: FR1, CDR1, FR2, CDR2, FR3, CDR3, and FR4.
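The amino-to-carboxy ordering above can be sketched as simple concatenation; the fragment strings below are illustrative placeholders, not real antibody fragments.

```python
# Assemble a variable region in the order stated in the text:
# FR1, CDR1, FR2, CDR2, FR3, CDR3, FR4.
def assemble_variable_region(frs: list, cdrs: list) -> str:
    assert len(frs) == 4 and len(cdrs) == 3
    return frs[0] + cdrs[0] + frs[1] + cdrs[1] + frs[2] + cdrs[2] + frs[3]

# Placeholder fragments (not from any real antibody):
vh = assemble_variable_region(["EVQL", "WVRQ", "RFTI", "WGQG"],
                              ["GFTF", "ISGS", "ARDY"])
```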
  • Immunoglobulins can be divided into five types: IgA, IgD, IgE, IgG, and IgM, depending on the heavy chain constant domain amino acid sequence.
  • IgA and IgG can be further divided into isotypes IgA1, IgA2, IgG1, IgG2, IgG3, and IgG4.
  • Antibody light chains from any vertebrate species can be assigned to one of two clearly distinct types, kappa (κ) and lambda (λ), based on the amino acid sequence of their constant domains.
  • CDR complementarity determining region
  • there are three CDRs (HCDR1, HCDR2, HCDR3) in the VH and three CDRs (LCDR1, LCDR2, LCDR3) in the VL.
  • CDRs can be defined according to numbering schemes such as Kabat (Wu et al., (1970) J Exp Med 132(2): 211-250) and Chothia (Chothia et al., (1987) J. Mol. Biol.
  • Framework regions are antibody regions that act as supports for CDRs.
  • the framework region is responsible for supporting the binding of the antigen to the antibody.
  • Framework residues contact the antigen, are part of the binding site of the antibody, and include residues that are close to the CDR in sequence or are located in close proximity to the CDR when folded into a three-dimensional structure. Framework residues may also include residues that do not contact the antigen but indirectly affect binding by contributing to structural support for the CDR.
  • FRs can be numbered using various definitions such as Kabat, Chothia, IMGT, and AbM.
  • FR1, FR2, FR3, and FR4 include FRs defined by any of the methods described above.
  • HCFR refers to heavy chain framework regions FR1, FR2, FR3, or FR4.
  • LCFR refers to light chain framework regions FR1, FR2, FR3, or FR4.
  • Typical 1-letter and 3-letter amino acid codes can be represented in the table below.
  • the antibody includes an antigen-binding fragment.
  • An antigen-binding fragment is a fragment of the entire structure of an immunoglobulin and refers to a portion of a polypeptide containing a portion to which an antigen can bind.
  • the antigen-binding fragment may be scFv, (scFv)2, Fv, Fab, Fab', F(ab')2, a diabody, a triabody, a tetrabody, Bis-scFv, a nanobody, or a combination thereof.
  • Antigen-binding fragments can be conjugated to other antibodies, proteins, antigen-binding fragments, or alternative supports to generate bispecific and multispecific proteins.
  • the antibody amino acid sequence may be an amino acid sequence selected from the group consisting of a full-length antibody, a heavy chain variable region, a light chain variable region, a complementarity determining region (CDR), and a framework region.
  • the antibody amino acid sequence may be data stored in a storage unit of a computing device.
  • the method includes tagging, by a computing device, one or more first antibody amino acid sequences comprising one or more antibody amino acid sequences with a region or amino acid of each antibody.
  • the first antibody amino acid sequence may be obtained from the OAS (Observed Antibody Space) database or the PDB (Protein Data Bank) database.
  • the OAS database may be a database containing a set of human repertoire antibody amino acid sequences.
  • Antibody amino acid sequences stored in the OAS database may have data such as species, vaccination history, and disease.
  • the antibody amino acid sequence stored in the OAS database may not have data on the target antigen.
  • the PDB (Protein Data Bank) database is a database of three-dimensional structural data of biomolecules such as proteins and nucleic acids.
  • the tag refers to a keyword or term assigned to a piece of information (e.g., a database record or Internet bookmark or computer program).
  • the tag may be an unknown tag ([Unk]), a tag indicating a target antigen, a tag indicating a species, or a tag indicating the position of an amino acid.
  • the tag indicating the species may be a tag indicating the species of the heavy chain variable region (VH species) or a tag indicating the species of the light chain variable region (VL species).
  • the tag indicating the species may be a tag indicating a species selected from the group consisting of humans, mice, rats, rabbits, and apes.
  • the tag may be attached to the N-terminus, middle, or C-terminus of the antibody amino acid sequence.
  • one antibody amino acid sequence may be tagged in the following order: [Unk][VH species][VL species]<vh3>...<vl3>...<end> or [target][VH species][VL species]<vh3>...<vl3>...<end>.
  • the first antibody amino acid sequence may further include tagging a species selected from the group consisting of humans, mice, rats, rabbits, and apes.
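A minimal sketch of the tagging scheme described above. The token names ([Unk], species tags, <vh3>, <vl3>, <end>) follow the text; the helper function itself is illustrative, not taken from the patent.

```python
from typing import Optional

def tag_sequence(vh_cdr3: str, vl_cdr3: str,
                 target: Optional[str] = None,
                 vh_species: str = "human",
                 vl_species: str = "human") -> list:
    """Tag an antibody sequence as [Unk|target][VH species][VL species]<vh3>...<vl3>...<end>."""
    head = f"[{target}]" if target else "[Unk]"  # no known antigen -> unknown tag
    return ([head, f"[{vh_species}]", f"[{vl_species}]", "<vh3>"]
            + list(vh_cdr3) + ["<vl3>"] + list(vl_cdr3) + ["<end>"])

tokens = tag_sequence("ARDYW", "QQYNS")                 # untagged antigen -> [Unk]
tagged = tag_sequence("ARDYW", "QQYNS", target="PD-1")  # antigen-labeled sequence
```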
  • the method includes pretraining, by the computing device, a machine learning model by training the machine learning model on the one or more tagged first antibody amino acid sequences to learn general characteristics of the antibody.
  • the general characteristic may be selected from the group consisting of length of HCDR3, length of HCDR2, length of HCDR1, length of LCDR3, length of LCDR2, length of LCDR1, and isoelectric point.
  • the isoelectric point (pI) refers to the specific pH at which the net charge of an amphoteric electrolyte containing both anionic and cationic groups, such as a protein, is 0.
  • the isoelectric point may be that of the full length antibody, heavy chain variable region, light chain variable region, complementarity determining region (CDR), or framework region.
  • the machine learning model can output amino acid sequence information that is identical to the input antibody amino acid sequence information, or that has a sequence identity of about 99% or more, about 98% or more, about 97% or more, about 96% or more, about 95% or more, about 94% or more, about 93% or more, about 92% or more, about 91% or more, about 90% or more, about 85% or more, about 80% or more, about 75% or more, about 70% or more, about 65% or more, about 60% or more, about 55% or more, or about 50% or more.
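A minimal sketch of a percent-identity check of the kind described above, assuming pre-aligned, equal-length sequences; real pipelines would run a sequence alignment first.

```python
def percent_identity(a: str, b: str) -> float:
    """Position-wise percent identity between two pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)

# Illustrative 15-residue sequences differing at one position:
pid = percent_identity("ARDYWGQGTLVTVSS", "ARDYWGQGTLVTVSA")
```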
  • the machine learning model may drive a machine learning algorithm.
  • the machine learning algorithm may be LGBM (Light Gradient Boosting Machine), AdaBoost, Voting Classifier, Random Forest, a logistic algorithm, a neural network, or QDA (Quadratic Discriminant Analysis).
  • the machine learning algorithm may be a deep learning algorithm.
  • the machine learning algorithm may be linear logistic regression, a recurrent neural network (RNN), or a convolutional neural network (CNN).
  • the method includes tagging one or more second antibody amino acid sequences with information about the target antigen by the computing device.
  • the second antibody amino acid sequence may be obtained from a patent database or a phage display library.
  • the patent database includes a dataset in which antibody amino acid sequences obtained from patents are extracted and converted into data.
  • the patent database can be divided into regions by the IMGT method using ANARCI (Antigen receptor Numbering And Receptor ClassIfication).
  • the patent database may have information on target antigens.
  • the method includes a step of analyzing, by the computing device, the first antibody amino acid sequence and the second antibody amino acid sequence in the pre-trained machine learning model to generate, from the first antibody amino acid sequence, a third antibody amino acid sequence having information about the target antigen.
  • language modeling can be used to predict the probability of the next amino acid in a given amino acid sequence and to select the next amino acid accordingly.
  • the language modeling can calculate the conditional probability of x_i, the amino acid at the i-th position, given the amino acids preceding the i-th position, x_{<i}: P(x) = ∏_{i=1}^{l} P(x_i | x_{<i})
  • x 0 refers to the first character given
  • l refers to the total length of the sentence.
  • the neural network function f(x_{<i}) can be converted to a conditional probability with a softmax function: P(x_i = k | x_{<i}) = exp(f_k(x_{<i})) / Σ_{j=1}^{K} exp(f_j(x_{<i}))
  • K is the number of words, and j and k represent the indices of tokens in the word.
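The autoregressive factorization and softmax above can be illustrated with a toy stand-in for the neural network f; the uniform model below is an assumption for demonstration only.

```python
import math

VOCAB = list("ACDEFGHIKLMNPQRSTVWY")  # 20 amino acid tokens, so K = 20

def softmax(logits):
    """Convert K logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sequence_log_prob(seq, f):
    """Sum of log P(x_i | x_<i), with each conditional given by softmax(f(prefix))."""
    total = 0.0
    for i, ch in enumerate(seq):
        probs = softmax(f(seq[:i]))
        total += math.log(probs[VOCAB.index(ch)])
    return total

uniform = lambda prefix: [0.0] * len(VOCAB)  # toy f: every token equally likely
lp = sequence_log_prob("ACD", uniform)       # equals 3 * log(1/20)
```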
  • the language modeling may be an attention-based neural network model.
  • the attention mechanism is self-attention and can actively calculate relationships between words and other words. For example, an attention mechanism can predict the probability of the next amino acid in a given amino acid sequence.
  • the attention mechanism can generate antibody sequences by repeatedly executing to predict and select amino acids with predicted probabilities.
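The repeated predict-and-select loop above can be sketched as follows; the toy model and vocabulary are illustrative assumptions, not the patent's trained network.

```python
import random

# 20 amino acid tokens plus a stop token, as an illustrative vocabulary.
VOCAB = list("ACDEFGHIKLMNPQRSTVWY") + ["<end>"]

def generate(model, max_len=30, seed=0):
    """Repeatedly predict per-token weights for the prefix and sample the next token."""
    rng = random.Random(seed)
    seq = []
    while len(seq) < max_len:
        token = rng.choices(VOCAB, weights=model(seq), k=1)[0]
        if token == "<end>":
            break
        seq.append(token)
    return "".join(seq)

toy_model = lambda prefix: [1.0] * 20 + [5.0]  # toy weights strongly favoring <end>
antibody_fragment = generate(toy_model)
```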
  • hyperparameters and early stopping criteria can be determined by the validation set loss as a result of grid search.
  • the method includes the step of verifying whether there is specific binding to the target antigen in vitro using the generated third antibody amino acid sequence.
  • the target antigen refers to any molecule (e.g., protein, peptide, polysaccharide, glycoprotein, glycolipid, nucleic acid, portion thereof, or combination thereof) capable of mediating an immune response.
  • the immune response may include antibody production and activation of immune cells such as T cells, B cells, or NK cells.
  • the target antigen may be selected from the group consisting of PD-1, PD-L1, CTLA-4, LAG-3, BTLA, CD200, CD276, KIR, TIM-1, TIM-3, TIGIT, VISTA, CD27, CD28, CD40, CD40L, CD70, CD75, CD80, CD86, CD73, CD137, GITR, GITRL, IL15, OX40, OX40L, IDO-1, IDO-2, A2AR, ICOS, ICOSL, 4-1BB, and 4-1BBL.
  • the step of verifying whether there is specific binding to the target antigen may be performed by a method selected from the group consisting of ELISA (Enzyme-Linked Immunosorbent Assay), radial immunodiffusion, immunoprecipitation analysis, RIA (Radioimmunoassay), immunofluorescence analysis, and immunoblotting.
  • the verification step may measure binding affinity to the target antigen, the equilibrium dissociation constant (KD), neutralization of the target antigen, or inhibition of the target antigen.
  • KD equilibrium dissociation constant
  • the step of verifying whether there is specific binding to the target antigen in vitro can be performed using absorbance at a wavelength of 450 nm or EC50 (half-maximal effective concentration) as a parameter in ELISA.
  • absorbance or EC 50 half maximal effective concentration
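As one illustration of the EC50 parameter mentioned above, a crude linear-interpolation estimate from made-up OD450 readings; real analyses typically fit a four-parameter logistic curve instead.

```python
def ec50(concentrations, responses):
    """Estimate the half-maximal effective concentration by linear interpolation."""
    half = (min(responses) + max(responses)) / 2
    for (c0, r0), (c1, r1) in zip(zip(concentrations, responses),
                                  zip(concentrations[1:], responses[1:])):
        if (r0 - half) * (r1 - half) <= 0:  # half-maximum is crossed in this interval
            return c0 + (half - r0) * (c1 - c0) / (r1 - r0)
    raise ValueError("half-maximal response not bracketed by the data")

# Made-up dose-response data (concentration vs. OD450), for illustration only:
conc = [0.1, 1.0, 10.0, 100.0]
od450 = [0.05, 0.20, 0.80, 0.95]
estimate = ec50(conc, od450)
```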
  • the method may be terminated without passing the third antibody amino acid sequence to the machine learning model again.
  • the method includes generating a fourth antibody amino acid sequence using the third antibody amino acid sequence that does not specifically bind to the target antigen using a machine learning model.
  • the third antibody amino acid sequence that does not specifically bind to the target antigen can be transferred back to the machine learning model without tagging the target antigen.
  • the tag can be changed to an unknown tag ([Unk]) and passed back to the machine learning model.
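The retag-and-recycle step described above might be sketched like this; the tag layout follows the text, while the function name and example tokens are hypothetical.

```python
def retag_for_feedback(tagged_seq: list, bound: bool) -> list:
    """Sequences that failed the binding assay lose their target tag ([X] -> [Unk])
    before being passed back to the machine learning model."""
    if bound:
        return tagged_seq                 # binder: keep the target-antigen tag
    return ["[Unk]"] + tagged_seq[1:]     # non-binder: replace tag with unknown

# Hypothetical tagged candidate sequence:
candidate = ["[PD-1]", "[human]", "[human]", "<vh3>", "A", "R", "<end>"]
recycled = retag_for_feedback(candidate, bound=False)
```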
  • Another aspect provides a method of making an antibody drug product comprising constructing an antibody from an antibody amino acid sequence produced according to one aspect in vitro.
  • a third antibody amino acid sequence or a fourth antibody amino acid sequence that specifically binds to the target antigen can be used to generate antibodies in vitro.
  • Antibody pharmaceuticals can be manufactured by loading a polynucleotide encoding the antibody into a vector and transforming the vector into cells.
  • the polynucleotide may additionally include a nucleic acid encoding a signal sequence or leader sequence.
  • signal sequence used herein refers to a signal peptide that directs secretion of a target protein.
  • the signal peptide is cleaved after translation in the host cell.
  • the signal sequence is an amino acid sequence that initiates the movement of proteins through the ER (Endoplasmic reticulum) membrane. After initiation, the signal sequence is cleaved within the lumen of the ER by a cellular enzyme commonly known as signal peptidase.
  • the signal sequence may be a secretion signal sequence of tPa (Tissue Plasminogen Activation), HSV gDs (Signal sequence of Herpes simplex virus glycoprotein D), or growth hormone.
  • tPa tissue Plasminogen Activation
  • HSV gDs signal sequence of Herpes simplex virus glycoprotein D
  • a growth hormone secretion signal sequence used in higher eukaryotic cells, including mammals, can also be used.
  • the signal sequence may be used as the wild-type signal sequence, or may be used after replacing its codons with codons frequently expressed in the host cell.
  • the vector can be introduced into a host cell and recombined and inserted into the host cell genome.
  • the vector is understood as a nucleic acid vehicle containing a polynucleotide sequence capable of spontaneous replication as an episome.
  • the vectors include linear nucleic acids, plasmids, phagemids, cosmids, RNA vectors, viral vectors and analogs thereof.
  • viral vectors include, but are not limited to, retroviruses, adenoviruses, and adeno-associated viruses.
  • the vector may be plasmid DNA, phage DNA, etc.; for example, commercially developed plasmids (pUC18, pBAD, pIDTSAMRT-AMP, etc.), E. coli-derived plasmids (pYG601BR322, pBR325, pUC118, pUC119, etc.), Bacillus subtilis-derived plasmids (pUB110, pTP5, etc.), yeast-derived plasmids (YEp13, YEp24, YCp50, etc.), phage DNA (Charon4A, Charon21A, EMBL3, EMBL4, λ, etc.), animal virus vectors (retrovirus, adenovirus, vaccinia virus, etc.), or insect virus vectors (baculovirus, etc.) can be used.
  • Host cells of the transformed cells may include, but are not limited to, cells of prokaryotic, eukaryotic, mammalian, plant, insect, fungal or cellular origin.
  • An example of the prokaryotic cell may be Escherichia coli.
  • yeast can be used as an example of a eukaryotic cell.
  • CHO cells, F2N cells, COS cells, BHK cells, Bowes melanoma cells, HeLa cells, 911 cells, AT1080 cells, A549 cells, HEK 293 cells, or HEK293T cells can be used as the mammalian cells, but the mammalian cells are not limited thereto, and any cell known to those skilled in the art to be usable as a mammalian host cell can be used.
  • when introducing an expression vector into a host cell, the CaCl2 precipitation method; the Hanahan method, which increases efficiency by using the reducing substance DMSO (dimethyl sulfoxide) in the CaCl2 precipitation method; electroporation; and the calcium phosphate precipitation method can be used.
  • DMSO dimethyl sulfoxide
  • the protoplast fusion method, a stirring method using silicon carbide fiber, Agrobacterium-mediated transformation, transformation using PEG, dextran sulfate, or Lipofectamine, and drying/inhibition-mediated transformation, etc. can also be used.
  • glycosylation-related genes of the host cell can be manipulated using methods known to those skilled in the art to adjust the antibody's sugar chain pattern (e.g., sialylation, fucosylation, glycosylation).
  • the transformed cells can be cultured using methods widely known in the art. Specifically, the culture may be carried out continuously in a batch process, or in a fed-batch or repeated fed-batch process.
  • the antibody pharmaceutical may be a pharmaceutical composition for preventing or treating cancer.
  • the cancer may be any one selected from the group consisting of stomach cancer, liver cancer, lung cancer, colon cancer, breast cancer, prostate cancer, gallbladder cancer, bladder cancer, kidney cancer, esophageal cancer, skin cancer, rectal cancer, osteosarcoma, multiple myeloma, glioma, ovarian cancer, pancreatic cancer, cervical cancer, endometrial cancer, thyroid cancer, laryngeal cancer, testicular cancer, mesothelioma, acute myeloid leukemia, chronic myeloid leukemia, acute lymphoblastic leukemia, chronic lymphoblastic leukemia, brain tumor, neuroblastoma, retinoblastoma, head and neck cancer, salivary gland cancer, and lymphoma.
  • Another aspect provides a computer-readable medium recording a program applied to perform a method of generating an antibody amino acid sequence according to one aspect.
  • the computer-readable medium includes tagging one or more first antibody amino acid sequences comprising one or more antibody amino acid sequences with a region or amino acid of each antibody by a computing device;
  • pretraining, by the computing device, a machine learning model by training the machine learning model on the one or more tagged first antibody amino acid sequences to learn general characteristics of the antibody;
  • the general characteristic is selected from the group consisting of length of HCDR3, length of HCDR2, length of HCDR1, length of LCDR3, length of LCDR2, length of LCDR1, and isoelectric point;
  • a program containing command codes for executing these steps is stored in the storage unit (SSD/HDD).
  • Antibody amino acid sequence information stored in another server DB may be transmitted to the communication unit of the computing device through local network communication (e.g., USB, LAN) or from a remote server over a metropolitan area network.
  • the antibody amino acid sequence information transmitted to the communication unit may be stored in a storage unit or memory.
  • a new antibody amino acid sequence can be generated by processing the data of the antibody amino acid sequence stored in the storage unit in the processing unit (FIG. 5).
  • With the method for generating an antibody amino acid sequence, the method for producing an antibody pharmaceutical using the generated antibody amino acid sequence, and the computer-readable medium recording a program for applying the same, the entire antibody amino acid sequence can be generated while target antigen specificity or species specificity is controlled, and the proportion of antibody amino acid sequences that bind to the target antigen can be increased.
  • Figure 1 is a schematic diagram showing a deep learning method, generation of an antibody sequence, ELISA verification, and repetition thereof according to one aspect.
  • Figure 2a is a schematic diagram of the process of learning an attention-based neural network model
  • Figure 2b is a schematic diagram of the process of converting input values within an attention-based neural network.
  • Figure 3a is a graph showing the length distribution of HCDR3 in the amino acid sequence generated from the trained attention-based neural network and the sequence obtained from the OAS dataset
  • Figure 3b is a graph showing the distribution of theoretical isoelectric points calculated from the amino acid sequences generated from the trained attention-based neural network and the sequences obtained from the OAS dataset.
  • Figure 4 is a graph comparing the learning curves of a pre-trained model (Finetuned) and a model trained from scratch (From scratch).
  • FIG. 5 is a schematic diagram of a computing device according to an aspect.
  • FIG. 6 is a flowchart showing a method for producing an antibody amino acid sequence according to one aspect.
  • Deep learning models are being used to screen in silico optimized HCDR3 sequences from libraries.
  • a screening library is created by mutating every single site in HCDR3 to introduce site-directed mutagenesis.
  • the limitation of existing technology is that it focused only on selecting an optimized HCDR3 according to a given template sequence. This is because the amount of data required by a deep learning model increases exponentially as the region of interest expands to the entire variable region; in machine learning theory, this is called the curse of dimensionality.
  • a method to design antibody sequences was devised by combining deep learning and biological analysis. To design the entire sequence of the variable region, a generative model was used rather than a supervised model that only predicts target specificity. Generative models can predict the characteristics of other antibodies in order to design new therapeutic antibody sequences.
  • Model training, sequence generation, and biological validation provide analysis results to the training model, creating a feedback loop ( Figure 1).
  • This loop of training, query (sequence generation), and verification is called active learning, and is effective when there is very little labeled data.
  • active learning effectively searches for antibody sequences that have regions that differ from the native HCDR3 sequence. As additional iterations are performed, the number of sequences found increases.
  • the Observed Antibody Space (OAS) database is an extensive antibody database created in 2018 in which antibody sequences are cleaned, annotated, and translated. Specifically, the OAS database is a cleaned, ImMunoGeneTics (IMGT)-numbered dataset containing only VH and VL sequences. OAS is rich in metadata such as species, vaccination history, and diseases. Since cancer is not an annotated disease in OAS, an unknown tag ([Unk]) was used for the entire OAS dataset. Each VH and VL sequence can be divided into 7 regions: 3 CDRs and 4 framework regions. The OAS database was already divided into regions using the IMGT numbering method, but the self-produced patent database was not.
  • IMGT ImMunoGeneTics
  • ANARCI Antigen receptor Numbering And Receptor ClassIfication
  • Language modeling is a time-series model that assigns a probability to an entire sequence x of length l, that is, P(x).
  • P(x) can be decomposed into the conditional probability of the character at the i-th position, x_i, given the preceding characters, x_{<i}: P(x) = ∏_{i=1}^{l} P(x_i | x_{<i})
  • x 0 refers to the first character given
  • l refers to the total length of the sentence.
  • the neural network function f(x_{<i}) can be converted to a conditional probability with a softmax function: P(x_i = k | x_{<i}) = exp(f_k(x_{<i})) / Σ_{j=1}^{K} exp(f_j(x_{<i}))
  • K is the number of words, and j and k represent the indices of tokens in the word.
  • Many neural network structures can be configured to estimate conditional probabilities, such as linear logistic regression, recurrent neural networks (RNN), and convolutional neural networks (CNN).
  • The data were converted and the model was trained using an attention-based neural network model.
  • a schematic diagram of the process of learning an attention-based neural network model is shown in Figure 2a. Additionally, a schematic diagram of the process of converting input values within an attention-based neural network is shown in Figure 2b.
  • the attention mechanism actively calculates relationships between words and other words through self-attention.
  • an attention-based neural network model can predict the probability of the next amino acid in a given amino acid sequence.
  • An antibody sequence can be generated by repeatedly running an attention-based neural network model to predict and select amino acids with predicted probabilities.
  • the scaled dot-product self-attention function mapped the input vector to query, key, and value, denoted by Q, K, and V, respectively.
  • the attention weight between query and key was computed from the scaled dot product of Q and K, and the result was multiplied by the values: Attention(Q, K, V) = softmax(QK^T / √d_k) V
  • d_k is the dimension of the key vector.
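The scaled dot-product attention described above, Attention(Q, K, V) = softmax(QK^T/√d_k)V, can be written out in pure Python for tiny matrices; the 2x2 inputs are illustrative only, as real models operate on learned high-dimensional vectors.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of attention scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    scores = [[sum(q * k for q, k in zip(q_row, k_row)) / math.sqrt(d_k)
               for k_row in K] for q_row in Q]
    weights = [softmax(row) for row in scores]   # each row sums to 1
    return [[sum(w * v for w, v in zip(w_row, col)) for col in zip(*V)]
            for w_row in weights]

Q = [[1.0, 0.0], [0.0, 1.0]]
out = attention(Q, Q, [[1.0, 2.0], [3.0, 4.0]])  # each output row: convex mix of V rows
```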
  • the number of layers, embedding dimension, hidden layer dimension, and dropout rate were 12, 252, 1024, and 0.3, respectively.
  • Hyperparameters and early stopping criteria were determined by the validation set loss resulting from a grid search tracked with Weights & Biases (Lukas Biewald, Experiment Tracking with Weights & Biases, software available from wandb.com, 2020).
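Early stopping on validation loss, as mentioned above, can be sketched as follows; the patience value and loss sequence are illustrative assumptions, not the patent's actual criteria.

```python
def early_stop_index(val_losses, patience=3):
    """Return the step at which training stops: `patience` consecutive
    evaluations without improving on the best validation loss so far."""
    best = float("inf")
    bad = 0
    for i, loss in enumerate(val_losses):
        if loss < best:
            best, bad = loss, 0   # new best: reset the patience counter
        else:
            bad += 1
            if bad >= patience:
                return i          # stop: no improvement for `patience` steps
    return len(val_losses) - 1    # never triggered: train to the end

# Illustrative loss curve that bottoms out at 0.7 and then drifts upward:
stop = early_stop_index([1.0, 0.8, 0.7, 0.71, 0.72, 0.73, 0.74])
```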
  • the pre-trained attention-based neural network model learned general features of natural antibodies, such as the length distribution and theoretical isoelectric point (pI) value of HCDR3.
  • The HCDR3 length distribution and theoretical isoelectric point were calculated for sequences generated by the trained attention-based neural network model and for sequences obtained from the OAS dataset, and the results are shown in Figures 3a and 3b.
  • The HCDR3 lengths and calculated pI values of the sequences generated by the trained attention-based neural network model were similar to those of the OAS training sequences.
  • The number of antibody sequences varied widely across targets.
  • the number of antibodies per target ranged from a few to over 1,000.
  • PD-1 programmed cell death protein 1
  • MET Mesenchymal Epithelial Transition
  • less than 10 antibody sequences were identified.
  • Figure 4 shows the validation loss against the number of training steps. As shown in Figure 4, the validation loss on the patent dataset decreased faster and converged to a lower value for the pre-trained model than for the model trained from scratch. Transfer learning thus reduced the number of training steps to convergence by more than a factor of 4 and yielded a better-performing model.
  • Species were annotated with species tags using the Conditional Transformer (CTRL) algorithm. Additionally, to verify the effectiveness of the CTRL method, the OAS pre-trained model was used to generate sequences under three conditions: without a tag, with a human tag, and with a mouse tag.
  • CTRL Conditional Transformer
  • The 100 sequences with the lowest loss were checked for duplication against the training, validation, and test datasets, and then verified experimentally.
  • the scFv designed by AI in Example 1.5 was cleaved with SfiI restriction enzyme (New England Biolabs, USA) and cloned into pCombi3x vector or pCDisplay-4 vector.
  • 100 ng of recombinant scFv plasmid was incubated with ER2738 competent cells for 20 minutes on ice. The DNA and competent cells were then heat-shocked in a heating block at 42°C for 90 seconds and placed at 4°C for 10 minutes.
  • LB medium was added to the sample at four times the volume of the competent cells and incubated at 37°C and 180 rpm for 1 hour. The culture was then spread on an LB plate containing 50 μg/mL carbenicillin and incubated at 37°C overnight.
  • The OmpA (outer membrane protein A) signal peptide was used to secrete the scFv protein into the periplasmic space, and the bacterial outer membrane and peptidoglycan layer were removed to obtain the periplasmic fraction.
  • The scFv protein was fused to an HA tag at the C-terminus, so soluble scFv protein in the periplasmic fraction bound to the target antigen was detected by ELISA using an HRP-conjugated anti-HA antibody.
  • the deep well plate was centrifuged at 4000 rpm for 20 minutes to remove the supernatant.
  • The pellet was resuspended in 400 μL of STE buffer (20% sucrose, 50 mM Tris-Cl pH 8.0, 1 mM EDTA). 100 μL of 10 mg/mL lysozyme was added to each well and incubated on ice at 180 rpm for 10 minutes.
  • 50 μL of 1 M MgCl2 was added and incubated at 4°C and 180 rpm for 10 minutes.
  • The supernatant containing the periplasmic fraction was obtained by centrifugation at 4000 rpm for 20 minutes, yielding the soluble scFv protein.
  • the scFv protein bound to the target antigen was selected by enzyme-linked immunosorbent assay (ELISA).
  • ELISA enzyme-linked immunosorbent assay
  • Human PD-1 protein and human PD-L1 protein were purchased from ACROBiosystems (USA). 80 μL of each periplasmic fraction was added to a 96-well MaxiSorp plate coated with target antigen and incubated at 25°C for 2 hours.
  • Ni-NTA resin and 1-Step™ Ultra TMB-ELISA Substrate Solution were purchased from Thermo Scientific (USA), and disposable columns were purchased from BIO-RAD (USA).
  • the purified positive candidates were further evaluated in a dose-dependent manner in an ELISA assay.
  • Nivolumab (Opdivo®) and pembrolizumab (Keytruda®) were used as anti-PD-1 positive controls, and atezolizumab, avelumab, and durvalumab were used as anti-PD-L1 positive controls.
  • a self-produced non-specific (anti-LPA2) antibody was used as a negative control.
  • Purified antibodies, including positive and negative controls, were serially diluted in duplicate starting at 1000 nM. Two anti-PD-L1 antibodies (clones 162 and 163) failed at larger scale due to poor purification.
  • The 17 purified antibodies were further examined for dose-dependent binding activity to their corresponding antigens.
  • The concentration, yield, measured Kd value, and absorbance after purification of each scFv clone are shown in Table 2.
  • N/A means not analyzed.
  • Anti-PD-1 antibody clone 77 (0.094 nM) showed a lower EC50 (nM) value than the anti-PD-1 positive controls nivolumab (0.93 nM) and pembrolizumab (4.37 nM).
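The dose-dependent ELISA measurements above reduce to fitting a dose-response curve and reading off the EC50. A minimal sketch in Python, assuming a four-parameter logistic model and a simple grid-search fit on synthetic data (the concentrations, plateau values, and EC50 below are illustrative, not measured values from Table 2):

```python
import numpy as np

def four_pl(x, bottom, top, ec50, hill=1.0):
    """Four-parameter logistic (4PL) dose-response curve."""
    return bottom + (top - bottom) / (1.0 + (ec50 / x) ** hill)

def fit_ec50(conc, od):
    """Grid-search the EC50 (nM) that best explains a measured ELISA curve.
    Bottom/top plateaus are taken from the curve extremes (assumes saturation)."""
    bottom, top = od.min(), od.max()
    grid = np.logspace(-3, 3, 20000)  # candidate EC50 values, 0.001-1000 nM
    sse = [((four_pl(conc, bottom, top, c) - od) ** 2).sum() for c in grid]
    return float(grid[int(np.argmin(sse))])

# Synthetic serial 3-fold dilution starting at 1000 nM, true EC50 = 0.94 nM
conc = 1000.0 / 3.0 ** np.arange(12)
od450 = four_pl(conc, 0.05, 2.0, 0.94)
est = fit_ec50(conc, od450)
```

In practice a nonlinear least-squares fit with all four parameters free would replace the grid search; the sketch only illustrates how an EC50 is extracted from absorbance-versus-concentration data.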


Abstract

Provided, according to one aspect, are a method for generating an antibody amino acid sequence, a method for preparing an antibody drug using a generated antibody amino acid sequence, and a computer-readable medium recording a program for performing the same. According to the present invention, a complete antibody amino acid sequence can be generated while controlling target antigen specificity or species specificity, and the proportion of generated antibody amino acid sequences that bind to a target antigen can be increased.

Description

Method for generating antibody sequences using machine learning technology
Provided are a method of generating an antibody sequence using machine learning technology, a method of manufacturing an antibody pharmaceutical using the same, and a computer-readable medium for performing the same.
Therapeutic antibodies have been a focus of the pharmaceutical industry since 1986, when the U.S. FDA approved the first monoclonal antibody (mAb) product. Four therapeutic antibodies were among the top 10 best-selling drugs in 2021, and more than 100 therapeutic antibodies have now received FDA approval. A mAb has two types of chains (heavy and light) and two regions (variable and constant) in each chain. The variable heavy chain (VH) and variable light chain (VL) are primarily responsible for target specificity. VH consists of 114 amino acids and VL of 110, so the variable region comprises approximately 220 amino acids.
The number of possible variable-region amino acid sequences is A^L, where A = 20 is the number of amino acid types and L is the protein length, i.e., 220 for an antibody variable region, so it is impossible to explore all possible variable-region sequences even with the most powerful high-throughput screening methods. Taking site-specific amino acid frequencies in the constant region into account would reduce this number considerably. However, generating or optimizing heavy chain complementarity determining region 3 (HCDR3), which has an average length of 15 amino acids, still requires searching approximately 10^8 sequences. By contrast, deep mutational scanning-based mutagenesis libraries reach only about 10^4 sequences.
Because finding new and effective therapeutic antibodies in a vast sequence space is costly and laborious, deep learning is considered a powerful tool. Deep learning has shown remarkable progress in many fields, including images, natural language, and protein structure. Recent studies have also applied deep learning to drug discovery tasks such as predicting binding between HCDR3 and a target antigen. For example, deep learning models have been applied to screen in silico-optimized HCDR3 sequences from a library, where the screening library was generated by site-directed mutagenesis of every single position in HCDR3. However, such approaches are limited to a given template sequence, and generating the entire variable-region sequence would require an enormous amount of data that is not available.
Therefore, there is a need to generate complete antibody sequences even when data are scarce and to increase the proportion of predicted sequences that bind the antigen.
[Prior art literature]
[Patent document]
(Patent Document 1) KR 10-2022-0091497 A
Provided is a method of generating an antibody amino acid sequence.
Also provided is a method of manufacturing an antibody drug using the generated antibody amino acid sequence.
Also provided is a computer-readable medium recording a program for performing the method of generating an antibody amino acid sequence.
Provided is a method of generating an antibody amino acid sequence, the method comprising:
tagging, by a computing device, one or more first antibody amino acid sequences comprising one or more antibody amino acid sequences with a region or an amino acid of each antibody;
pre-training, by the computing device, a machine learning model by training general characteristics of antibodies on the tagged one or more first antibody amino acid sequences,
wherein the general characteristics are selected from the group consisting of HCDR3 length, HCDR2 length, HCDR1 length, LCDR3 length, LCDR2 length, LCDR1 length, and isoelectric point;
tagging, by the computing device, one or more second antibody amino acid sequences having information about a target antigen with the information about the target antigen;
analyzing, by the computing device, the first antibody amino acid sequence and the second antibody amino acid sequence with the pre-trained machine learning model to generate, from the first antibody amino acid sequence, a third antibody amino acid sequence having information about the target antigen;
verifying in vitro whether the generated third antibody amino acid sequence specifically binds to the target antigen; and
for a third antibody amino acid sequence that does not specifically bind to the target antigen, generating a fourth antibody amino acid sequence using the machine learning model again.
The term "antibody" is used interchangeably with the term "immunoglobulin" (Ig). The antibody may be, for example, IgA, IgD, IgE, IgG, or IgM. The antibody may be a monoclonal antibody or a polyclonal antibody. The antibody may be an animal-derived antibody, a mouse-human chimeric antibody, a humanized antibody, or a human antibody. A complete antibody consists of two full-length light chains and two full-length heavy chains, with each light chain linked to a heavy chain by a disulfide bond (SS-bond). Each heavy chain consists of a heavy chain variable region (VH) and a heavy chain constant region (composed of the CH1, hinge, CH2, and CH3 domains). Each light chain consists of a light chain variable region (VL) and a light chain constant region (CL). The VH and VL regions can be further subdivided into hypervariable regions called complementarity determining regions (CDRs), interspersed with framework regions (FRs).
Each VH and VL consists of three CDRs and four FRs arranged from the amino terminus to the carboxy terminus in the following order: FR1, CDR1, FR2, CDR2, FR3, CDR3, and FR4. Immunoglobulins can be divided into five classes, IgA, IgD, IgE, IgG, and IgM, according to the heavy chain constant domain amino acid sequence. IgA and IgG can be further divided into the isotypes IgA1, IgA2, IgG1, IgG2, IgG3, and IgG4. The antibody light chain of any vertebrate species can be assigned to one of two clearly distinct types, kappa (κ) and lambda (λ), based on the amino acid sequence of its constant domain.
A "complementarity determining region" (CDR) is the region of an antibody that binds antigen. There are three CDRs in the VH (HCDR1, HCDR2, HCDR3) and three CDRs in the VL (LCDR1, LCDR2, LCDR3). CDRs can be numbered by various schemes, such as Kabat (Wu et al., (1970) J Exp Med 132(2):211-250), Chothia (Chothia et al., (1987) J Mol Biol 196(4):901-917), IMGT (Lefranc et al., (2003) Dev Comp Immunol 27(1):55-77), and AbM (Martin and Thornton (1996) J Mol Biol 263(5):800-815).
A "framework region" or "FR" is an antibody region that acts as a scaffold for the CDRs. The framework regions are responsible for supporting the binding of antigen to the antibody. Framework residues include residues that contact the antigen, form part of the antibody's binding site, and lie close to the CDRs in sequence or in close proximity to the CDRs when folded into the three-dimensional structure. Framework residues may also include residues that do not contact the antigen but indirectly affect binding by contributing structural support to the CDRs. FRs can be numbered using various schemes, such as Kabat, Chothia, IMGT, and AbM. FR1, FR2, FR3, and FR4 include FRs defined by any of the methods described above. HCFR refers to heavy chain framework region FR1, FR2, FR3, or FR4. LCFR refers to light chain framework region FR1, FR2, FR3, or FR4.
Typical 1-letter and 3-letter amino acid codes are shown below: A (Ala) alanine; C (Cys) cysteine; D (Asp) aspartic acid; E (Glu) glutamic acid; F (Phe) phenylalanine; G (Gly) glycine; H (His) histidine; I (Ile) isoleucine; K (Lys) lysine; L (Leu) leucine; M (Met) methionine; N (Asn) asparagine; P (Pro) proline; Q (Gln) glutamine; R (Arg) arginine; S (Ser) serine; T (Thr) threonine; V (Val) valine; W (Trp) tryptophan; Y (Tyr) tyrosine.
The antibody includes antigen-binding fragments. An antigen-binding fragment is a fragment of the overall immunoglobulin structure and refers to a portion of the polypeptide that includes the part to which antigen can bind. For example, the antigen-binding fragment may be an scFv, (scFv)2, Fv, Fab, Fab', F(ab')2, diabody, triabody, tetrabody, Bis-scFv, nanobody, or a combination thereof. An antigen-binding fragment can be conjugated to another antibody, protein, antigen-binding fragment, or alternative scaffold to generate bispecific and multispecific proteins.
The antibody amino acid sequence may be the amino acid sequence of one selected from the group consisting of a full-length antibody, a heavy chain variable region, a light chain variable region, a complementarity determining region (CDR), and a framework region.
The antibody amino acid sequence may be data stored in a storage unit of a computing device.
The method includes tagging, by a computing device, one or more first antibody amino acid sequences comprising one or more antibody amino acid sequences with a region or an amino acid of each antibody.
The first antibody amino acid sequence may be obtained from the OAS (Observed Antibody Space) database or the PDB (Protein Data Bank) database. The OAS database may be a database containing sets of human repertoire antibody amino acid sequences. Antibody amino acid sequences stored in the OAS database may carry data such as species, vaccination history, and disease, but may lack data on target antigens. The PDB database holds three-dimensional structural data of biomolecules such as proteins and nucleic acids.
A tag refers to a keyword or term assigned to a piece of information (e.g., a database record, an Internet bookmark, or a computer program). The tag may be an unknown tag ([Unk]), a tag indicating a target antigen, a tag indicating a species, or a tag indicating the position of an amino acid. The species tag may indicate the species of the heavy chain variable region (VH species) or of the light chain variable region (VL species), and may indicate a species selected from the group consisting of human, mouse, rat, rabbit, and ape. The tag may be attached to the N-terminus, middle, or C-terminus of the antibody amino acid sequence. For example, one antibody amino acid sequence may be tagged, from the N-terminus, in the order [Unk][VH species][VL species]<vh3>...<vl3>...<end> or [target][VH species][VL species]<vh3>...<vl3>...<end>.
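The tag layout above amounts to simple string assembly. The helper below is an illustrative sketch only; the token names ([Unk], [target], <vh3>, <vl3>, <end>) follow the order described, but the exact model vocabulary and the CDR sequences used here are hypothetical:

```python
def tag_sequence(vh_cdr3, vl_cdr3, vh_species, vl_species, target=None):
    """Assemble a tagged training string in the order described above:
    [Unk] or [target] first, then VH/VL species tags, then the
    region-delimited sequence ending in <end>."""
    target_tag = f"[{target}]" if target else "[Unk]"
    return (f"{target_tag}[{vh_species}][{vl_species}]"
            f"<vh3>{vh_cdr3}<vl3>{vl_cdr3}<end>")

# Hypothetical CDR3 fragments, for illustration only
tagged = tag_sequence("ARDYW", "QQYNS", "human", "human", target="PD-1")
untagged = tag_sequence("ARDYW", "QQYNS", "mouse", "mouse")
```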
The method may further include tagging the first antibody amino acid sequence with a species selected from the group consisting of human, mouse, rat, rabbit, and ape.
The method includes pre-training, by the computing device, a machine learning model by training general characteristics of antibodies on the tagged one or more first antibody amino acid sequences.
The general characteristics may be selected from the group consisting of HCDR3 length, HCDR2 length, HCDR1 length, LCDR3 length, LCDR2 length, LCDR1 length, and isoelectric point. The isoelectric point (pI) is the pH at which the net charge of an ampholyte containing both anionic and cationic groups, such as a protein, is zero. The isoelectric point may be that of a full-length antibody, a heavy chain variable region, a light chain variable region, a complementarity determining region (CDR), or a framework region.
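A theoretical pI of this kind can be computed by bisecting for the pH at which the Henderson-Hasselbalch net charge of the sequence crosses zero. A minimal sketch, with the caveat that the pKa values below follow one commonly used convention; other pKa tables shift the result slightly, and the document does not specify which set was used:

```python
# Side-chain pKa values (one common convention; assumption, not from the source)
PKA_POS = {"K": 10.8, "R": 12.5, "H": 6.5}           # protonated (+) below pKa
PKA_NEG = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}  # deprotonated (-) above pKa
PKA_NTERM, PKA_CTERM = 8.6, 3.6

def net_charge(seq, ph):
    """Net charge of a peptide at a given pH (Henderson-Hasselbalch)."""
    pos = 1.0 / (1.0 + 10 ** (ph - PKA_NTERM))
    pos += sum(1.0 / (1.0 + 10 ** (ph - PKA_POS[a])) for a in seq if a in PKA_POS)
    neg = 1.0 / (1.0 + 10 ** (PKA_CTERM - ph))
    neg += sum(1.0 / (1.0 + 10 ** (PKA_NEG[a] - ph)) for a in seq if a in PKA_NEG)
    return pos - neg

def isoelectric_point(seq, lo=0.0, hi=14.0, tol=1e-4):
    """Bisect for the pH where the net charge crosses zero
    (net charge decreases monotonically with pH)."""
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if net_charge(seq, mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0
```

An acidic peptide should come out with a low pI and a basic one with a high pI, which is the sanity check used below.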
The machine learning model may receive antibody amino acid sequence information as input and output amino acid sequence information that is identical to the input or that has at least about 99%, about 98%, about 97%, about 96%, about 95%, about 94%, about 93%, about 92%, about 91%, about 90%, about 85%, about 80%, about 75%, about 70%, about 65%, about 60%, about 55%, or about 50% sequence identity to the input.
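For aligned sequences of equal length, the percent identity referred to above can be computed position-wise. A minimal sketch (real antibody comparisons would first align the sequences; the fragments below are hypothetical):

```python
def percent_identity(a, b):
    """Position-wise identity between two equal-length aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)

same = percent_identity("EVQLVESG", "EVQLVESG")   # identical fragments
close = percent_identity("EVQLVESG", "EVQLVDSG")  # one substitution in eight
```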
The machine learning model may run a machine learning algorithm. The machine learning algorithm may be selected from the group consisting of LGBM (Light Gradient Boosting Machine), AdaBoost, Voting Classifier, Random Forest, logistic regression, artificial neural network, and QDA (Quadratic Discriminant Analysis). The machine learning algorithm may be a deep learning algorithm. The machine learning algorithm may be linear logistic regression, a recurrent neural network (RNN), or a convolutional neural network (CNN).
The method includes tagging, by the computing device, one or more second antibody amino acid sequences having information about a target antigen with the information about the target antigen.
The second antibody amino acid sequence may be obtained from a patent database or a phage display library. The patent database includes a dataset in which antibody amino acid sequences extracted from patents have been structured. Regions in the patent database can be delimited according to the IMGT scheme using ANARCI (antigen receptor numbering and receptor classification). The patent database may have information on target antigens.
The method includes analyzing, by the computing device, the first antibody amino acid sequence and the second antibody amino acid sequence with the pre-trained machine learning model to generate, from the first antibody amino acid sequence, a third antibody amino acid sequence having information about the target antigen.
In the step of generating the third antibody amino acid sequence, language modeling can be used to predict and select the probability of the next amino acid in a given amino acid sequence.
The language modeling computes the conditional probability of the amino acid $x_i$ at position $i$ given the preceding amino acids $x_{<i}$:

$P(x) = \prod_{i=1}^{l} p(x_i \mid x_{<i})$    (1)

In Equation (1), $x_0$ denotes the initially given character, and $l$ denotes the total length of the sentence.
For example, the neural network function $f(x_{<i})$ can be converted into a conditional probability with a softmax function:

$p(x_i \mid x_{<i}) = \mathrm{softmax}(f(x_{<i}))$    (2)

where the softmax function is

$\mathrm{softmax}(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}$    (3)

In Equation (3), $K$ is the number of words (the vocabulary size), and $j$ and $k$ denote token indices within the vocabulary.
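Equations (1)-(3) can be sketched directly. The toy model below stands in for the neural network $f(x_{<i})$: its logits depend only on the last amino acid of the prefix, which a real model would not assume. It shows how softmax outputs compose into the sequence probability of Equation (1):

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # K = 20 token vocabulary

rng = np.random.default_rng(0)
# Toy stand-in for f(x_<i): a random logit table indexed by the last residue
W = rng.normal(size=(len(AMINO_ACIDS), len(AMINO_ACIDS)))

def softmax(z):
    z = z - z.max()  # subtract max for numerical stability (Eq. 3)
    e = np.exp(z)
    return e / e.sum()

def next_token_probs(prefix):
    """p(x_i | x_<i) = softmax(f(x_<i))  (Eq. 2)."""
    return softmax(W[AMINO_ACIDS.index(prefix[-1])])

def sequence_log_prob(seq):
    """log P(x) = sum_i log p(x_i | x_<i)  (Eq. 1)."""
    return sum(float(np.log(next_token_probs(seq[:i])[AMINO_ACIDS.index(seq[i])]))
               for i in range(1, len(seq)))

probs = next_token_probs("E")      # distribution over the 20 amino acids
lp = sequence_log_prob("EVQLV")    # log-probability of a short fragment
```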
The language modeling may use an attention-based neural network model. The attention mechanism actively calculates the relationship between each word and the other words through self-attention. For example, the attention mechanism can predict the probability of the next amino acid in a given amino acid sequence. By running it repeatedly, predicting and selecting amino acids according to the predicted probabilities, an antibody sequence can be generated. In the attention mechanism, hyperparameters and early-stopping criteria can be determined by validation-set loss as the result of a grid search.
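The scaled dot-product self-attention at the core of such a model, Attention(Q, K, V) = softmax(QK^T/√d_k)V, can be sketched as follows (a single head on random toy embeddings; the dimensions are illustrative):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # query-key similarities
    scores = scores - scores.max(axis=-1, keepdims=True)  # stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 8))                    # 5 token embeddings, dim 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out, attn = scaled_dot_product_attention(X @ Wq, X @ Wk, X @ Wv)
```

Each row of the attention matrix is a probability distribution over the input positions, which is what lets the model weigh every residue against every other.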
The method includes verifying in vitro whether the generated third antibody amino acid sequence specifically binds to the target antigen.
The target antigen refers to any molecule capable of mediating an immune response (e.g., a protein, peptide, polysaccharide, glycoprotein, glycolipid, nucleic acid, a portion thereof, or a combination thereof). The immune response may include antibody production and activation of immune cells such as T cells, B cells, or NK cells.
The target antigen may be selected from the group consisting of PD-1, PD-L1, CTLA-4, LAG-3, BTLA, CD200, CD276, KIR, TIM-1, TIM-3, TIGIT, VISTA, CD27, CD28, CD40, CD40L, CD70, CD75, CD80, CD86, CD73, CD137, GITR, GITRL, IL15, OX40, OX40L, IDO-1, IDO-2, A2AR, ICOS, ICOSL, 4-1BB, and 4-1BBL.
The step of verifying whether there is specific binding to the target antigen may be performed by a method selected from the group consisting of ELISA (enzyme-linked immunosorbent assay), radial immunodiffusion, immunoprecipitation assay, RIA (radioimmunoassay), immunofluorescence assay, and immunoblotting.
The verifying step may measure binding affinity for the target antigen, the equilibrium dissociation constant (KD), neutralization of the target antigen, or inhibition of the target antigen.
The in vitro verification of specific binding to the target antigen can use the absorbance at a wavelength of 450 nm or the EC50 (half-maximal effective concentration) in ELISA as the determining parameter. If the absorbance at 450 nm in ELISA is 0.1 or more, the third antibody amino acid sequence can be determined to specifically bind the target antigen.
When the third antibody amino acid sequence is determined to bind specifically to the target antigen, the method may terminate without passing the third antibody amino acid sequence back to the machine learning model.
The method includes generating a fourth antibody amino acid sequence by passing a third antibody amino acid sequence that does not bind specifically to the target antigen back through the machine learning model.
A third antibody amino acid sequence that does not bind specifically to the target antigen may be passed back to the machine learning model without its target-antigen tag. For example, when an antibody amino acid sequence carrying a target-antigen tag is generated, the tag may be replaced with the unknown tag ([Unk]) before the sequence is passed back to the machine learning model.
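The retagging step above can be sketched as follows, assuming tags are carried as bracketed tokens prepended to the sequence (an illustrative encoding, not the literal one disclosed here):

```python
def strip_antigen_tag(tagged_sequence, antigen_tags):
    """Replace any target-antigen tag with the unknown tag [Unk]
    before feeding the sequence back for another generation round."""
    return ["[Unk]" if tok in antigen_tags else tok for tok in tagged_sequence]

# Hypothetical tagged sequence: antigen tag followed by residue tokens
tokens = ["[PD-1]", "E", "V", "Q", "L"]
retagged = strip_antigen_tag(tokens, {"[PD-1]", "[PD-L1]"})
```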
Another aspect provides a method of manufacturing an antibody pharmaceutical, comprising producing an antibody in vitro from an antibody amino acid sequence generated according to one aspect.
A third or fourth antibody amino acid sequence that binds specifically to the target antigen can be used to produce an antibody in vitro. An antibody pharmaceutical can be manufactured by loading a polynucleotide encoding the antibody into a vector and transforming cells with the vector.
The polynucleotide may additionally include a nucleic acid encoding a signal sequence or leader sequence. As used herein, the term "signal sequence" refers to a signal peptide that directs secretion of a protein of interest. The signal peptide is cleaved after translation in the host cell. Specifically, the signal sequence is an amino acid sequence that initiates translocation of the protein across the endoplasmic reticulum (ER) membrane. After initiation, the signal sequence is cleaved within the lumen of the ER by a cellular enzyme commonly known as signal peptidase. The signal sequence may be the secretion signal sequence of tPA (tissue plasminogen activator), HSV gDs (signal sequence of herpes simplex virus glycoprotein D), or growth hormone. Preferably, a secretion signal sequence used in higher eukaryotic cells, including mammalian cells, can be used. The signal sequence may be used as the wild-type signal sequence, or with its codons replaced by codons frequently used in the host cell.
The vector may be introduced into a host cell and recombined into and inserted into the host cell genome. Alternatively, the vector is understood as a nucleic acid vehicle containing a polynucleotide sequence capable of autonomous replication as an episome. Such vectors include linear nucleic acids, plasmids, phagemids, cosmids, RNA vectors, viral vectors, and analogs thereof. Examples of viral vectors include, but are not limited to, retroviruses, adenoviruses, and adeno-associated viruses. Specifically, the vector may be plasmid DNA, phage DNA, or the like, for example a commercially developed plasmid (pUC18, pBAD, pIDTSAMRT-AMP, etc.), an E. coli-derived plasmid (pYG601BR322, pBR325, pUC118, pUC119, etc.), a Bacillus subtilis-derived plasmid (pUB110, pTP5, etc.), a yeast-derived plasmid (YEp13, YEp24, YCp50, etc.), phage DNA (Charon4A, Charon21A, EMBL3, EMBL4, λ, etc.), an animal virus vector (retrovirus, adenovirus, vaccinia virus, etc.), or an insect virus vector (baculovirus, etc.). Because the expression level and modification of the protein differ depending on the host cell, it is preferable to select and use the host cell best suited to the purpose.
The host cell of the transformed cell may include, but is not limited to, cells of prokaryotic, eukaryotic, mammalian, plant, insect, fungal, or cellular origin. E. coli may be used as an example of a prokaryotic cell, and yeast as an example of a eukaryotic cell. As the mammalian cell, CHO cells, F2N cells, CSO cells, BHK cells, Bowes melanoma cells, HeLa cells, 911 cells, AT1080 cells, A549 cells, HEK 293 cells, or HEK293T cells may be used, but the mammalian cell is not limited thereto, and any cell known to those skilled in the art to be usable as a mammalian host cell may be used.
When an expression vector is introduced into a host cell, methods such as the CaCl2 precipitation method; the Hanahan method, which improves the efficiency of CaCl2 precipitation by using the reducing agent DMSO (dimethyl sulfoxide); electroporation; calcium phosphate precipitation; protoplast fusion; agitation with silicon carbide fibers; Agrobacterium-mediated transformation; PEG-mediated transformation; dextran sulfate; lipofectamine; and desiccation/inhibition-mediated transformation may be used. To optimize the properties of the antibody pharmaceutical, or for other purposes, the glycosylation-related genes of the host cell may be manipulated by methods known to those skilled in the art to adjust the sugar chain pattern of the antibody (e.g., sialylation, fucosylation, glycosylation).
The transformed cells can be cultured using methods widely known in the art. Specifically, the culture may be carried out continuously in a batch process, a fed-batch process, or a repeated fed-batch process.
The antibody pharmaceutical may be a pharmaceutical composition for preventing or treating cancer. The cancer may be any one selected from the group consisting of stomach cancer, liver cancer, lung cancer, colon cancer, breast cancer, prostate cancer, gallbladder cancer, bladder cancer, kidney cancer, esophageal cancer, skin cancer, rectal cancer, osteosarcoma, multiple myeloma, glioma, ovarian cancer, pancreatic cancer, cervical cancer, endometrial cancer, thyroid cancer, laryngeal cancer, testicular cancer, mesothelioma, acute myeloid leukemia, chronic myeloid leukemia, acute lymphoblastic leukemia, chronic lymphoblastic leukemia, brain tumor, neuroblastoma, retinoblastoma, head and neck cancer, salivary gland cancer, and lymphoma.
Another aspect provides a computer-readable medium recording a program adapted to perform the method of generating an antibody amino acid sequence according to one aspect.
The computer-readable medium stores, in a storage unit (SSD/HDD), a program containing instruction codes that cause a computing device to execute the steps of:

tagging, by the computing device, one or more first antibody amino acid sequences, comprising one or more antibody amino acid sequences, with a region or amino acid of each antibody;

pretraining, by the computing device, a machine learning model by training general characteristics of antibodies on the tagged one or more first antibody amino acid sequences, wherein the general characteristics are selected from the group consisting of the length of HCDR3, the length of HCDR2, the length of HCDR1, the length of LCDR3, the length of LCDR2, the length of LCDR1, and the isoelectric point;

tagging, by the computing device, one or more second antibody amino acid sequences having information about a target antigen with the information about the target antigen; and

analyzing, by the computing device, the first antibody amino acid sequences and the second antibody amino acid sequences in the pretrained machine learning model to generate, from the first antibody amino acid sequences, a third antibody amino acid sequence having information about the target antigen.
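The tagging steps above can be sketched as follows; the bracketed tag tokens and the `tag_sequence` helper are illustrative assumptions, not the literal encoding disclosed here:

```python
def tag_sequence(seq, species="[Unk]", antigen="[Unk]"):
    """Prefix an antibody amino acid sequence with control tags
    (species and target antigen) and split it into per-residue tokens."""
    return [species, antigen] + list(seq)

# A first (untagged-antigen) sequence and a second (antigen-tagged) sequence
first = tag_sequence("QVQLVQSG", species="[Human]")
second = tag_sequence("QVQLVQSG", species="[Human]", antigen="[PD-1]")
```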
Antibody amino acid sequence information stored in an external server DB may be transmitted to the communication unit of the computing device via a local network connection (e.g., USB, LAN) or a remote server on a metropolitan network. The antibody amino acid sequence information transmitted to the communication unit may be stored in the storage unit or memory. The processing unit may process the antibody amino acid sequence data stored in the storage unit to generate a novel antibody amino acid sequence (FIG. 5).
According to the method of generating an antibody amino acid sequence according to one aspect, the method of manufacturing an antibody pharmaceutical using the generated antibody amino acid sequence, and the computer-readable medium recording a program applying the same, the entire antibody amino acid sequence can be generated while controlling target antigen specificity or species specificity, and the proportion of antibody amino acid sequences that bind to the target antigen can be increased.
FIG. 1 is a schematic diagram showing the deep learning method according to one aspect, with antibody sequence generation, ELISA validation, and iteration thereof.
FIG. 2a is a schematic diagram of the process of training the attention-based neural network model, and FIG. 2b is a schematic diagram of the process by which input values are transformed inside the attention-based neural network.
FIG. 3a is a graph showing the length distribution of HCDR3 in amino acid sequences generated by the trained attention-based neural network and in sequences obtained from the OAS dataset, and FIG. 3b is a graph showing the distribution of theoretical isoelectric points calculated from amino acid sequences generated by the trained attention-based neural network and from sequences obtained from the OAS dataset.
FIG. 4 is a graph comparing the learning curves of a pretrained model (Finetuned) and a model trained from scratch (From scratch).
FIG. 5 is a schematic diagram of a computing device according to one aspect.
FIG. 6 is a flowchart showing a method of producing an antibody amino acid sequence according to one aspect.
Hereinafter, the present invention will be described in more detail through examples. However, these examples are for illustrative purposes only, and the scope of the present invention is not limited thereto.
Example 1. Prediction of antibody sequences using a deep learning model
1.1. Method for predicting antibody sequences
Deep learning models have been used to screen in silico-optimized HCDR3 sequences from libraries. A screening library is created by introducing site-directed mutagenesis at every single position of HCDR3. A limitation of this existing approach, however, is that it focuses only on selecting an optimized HCDR3 for a given template sequence. This is because the amount of data required by a deep learning model grows exponentially as the region of interest expands, for example to the entire variable region. In machine learning theory, this is called the curse of dimensionality.
In this example, a method of designing antibody sequences by combining deep learning with biological assays was devised. To design the entire sequence of the variable region, a generative model was used rather than a supervised model that only predicts whether a sequence is target-specific. A generative model can design novel therapeutic antibody sequences by predicting the characteristics of other antibodies.
Generating the entire sequence of the variable region also requires an enormous amount of data because of the curse of dimensionality. To cope with the data shortage, a transfer learning approach was applied: the model was pretrained on a large dataset to learn the general characteristics of antibodies and then fine-tuned with target-specific data. After training the deep learning model and sampling new sequences from it, the generated antibodies were experimentally validated for antigen-antibody binding by ELISA.
Model training, sequence generation, and biological validation form a feedback loop in which the assay results are fed back into the training model (FIG. 1). This loop of training, querying (sequence generation), and validation is called active learning, and it is effective when labeled data are very scarce. As a result, active learning effectively searches for antibody sequences whose HCDR3 and other regions differ from native sequences, and the number of discovered sequences increases as further iterations are performed.
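The training, query, and validation loop described above can be sketched as follows; the function names (`finetune`, `generate`, `elisa_validate`) are placeholders for the steps in FIG. 1, not actual implementations:

```python
def active_learning(model, seed_data, rounds, finetune, generate, elisa_validate):
    """Iterate training, sequence generation, and experimental validation,
    feeding validated binding labels back into the training data."""
    data = list(seed_data)
    hits = []
    for _ in range(rounds):
        model = finetune(model, data)          # update model on current labels
        candidates = generate(model)           # sample new antibody sequences
        labeled = elisa_validate(candidates)   # wet-lab binding labels
        hits += [seq for seq, binds in labeled if binds]
        data += labeled                        # feedback loop closes here
    return model, hits
```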
1.2. Preparation of datasets
The Observed Antibody Space (OAS) database is an extensive antibody database created in 2018 in which antibody sequences are cleaned, annotated, and translated. Specifically, the OAS database is a cleaned dataset containing only VH and VL sequences, numbered according to the IMGT (ImMunoGeneTics) scheme. OAS carries rich metadata such as species, vaccination history, and disease. Because cancer is not a disease annotated in OAS, the unknown tag ([Unk]) was used for the entire OAS dataset. Each VH and VL sequence can be divided into seven regions, comprising three CDRs and four frameworks. The OAS database is already partitioned into regions by IMGT numbering, but the in-house patent database was not, so ANARCI (antigen receptor numbering and receptor classification) was used to partition the in-house patent dataset into regions according to the IMGT scheme. In addition, because ANARCI provides a species prediction for a given VH or VL sequence, the species predicted by ANARCI was used as the species tag for the patent dataset.

For sound training, each dataset was randomly divided into training, validation, and test sets in a 6:2:2 ratio. During training, only the training set was used to update the model; the loss on the validation set was monitored to select hyperparameters and to stop training before the model overfitted. Finally, the performance of the model was confirmed on the test set.
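A minimal sketch of the random 6:2:2 split described above (the fixed seed and the record format are illustrative assumptions):

```python
import random

def split_dataset(records, seed=0, ratios=(0.6, 0.2, 0.2)):
    """Shuffle records and split them into training, validation,
    and test sets in the given 6:2:2 proportions."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)      # reproducible shuffle
    n = len(shuffled)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

train, val, test = split_dataset(list(range(100)))
```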
Because the OAS dataset contains no antigen-binding data, sequences generated by a model trained only on the OAS dataset have no target specificity. To obtain target specificity, another dataset, such as antigen-antibody pairing data, is needed. As no such data have been published, approximately 6,000 antibodies against 15 tumor targets, including antibodies with known target antigens, were collected from patents.
1.3. Language modeling and the attention mechanism
Language modeling is a sequential model that assigns a probability P(x) to an entire sequence x of length l. P(x) can be factorized into the conditional probability of the character at the i-th position, x_i, given the preceding characters x_{<i}:
P(x) = \prod_{i=1}^{l} P(x_i \mid x_{<i})    (1)
In Equation (1), x_0 denotes the initially given character, and l denotes the total length of the sequence.
Rather than modeling the probability of the entire sequence at once, approximating the conditional probability of the next token (a word, character, or amino acid) given the previous tokens is advantageous for model simplification and sampling efficiency. For example, the output of a neural network function f(x_{<i}) can be converted into a conditional probability with the softmax function:
P(x_i \mid x_{<i}) \approx \mathrm{softmax}(f(x_{<i}))    (2)
Here, the softmax function is

\mathrm{softmax}(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}    (3)
In Equation (3), K is the number of words (the vocabulary size), and j and k denote the indices of tokens in the vocabulary. Many neural network architectures, such as linear logistic regression, recurrent neural networks (RNNs), and convolutional neural networks (CNNs), can be configured to estimate such conditional probabilities.
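As a numeric illustration of Equations (2) and (3), the following sketch applies the softmax to illustrative logits f(x_{<i}) over a toy three-amino-acid vocabulary and greedily picks the next amino acid; all values are assumptions for illustration only:

```python
import math

def softmax(z):
    """Softmax over a vector of logits, per Equation (3)."""
    m = max(z)                          # subtract max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

# Toy vocabulary of K = 3 amino acids with illustrative logits f(x_<i)
vocab = ["A", "G", "V"]
logits = [2.0, 0.5, 0.1]
probs = softmax(logits)
next_token = vocab[probs.index(max(probs))]   # greedy choice of next amino acid
```

Sampling from `probs` instead of taking the argmax would generate diverse sequences, as described for the generative model.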
An attention-based neural network model was trained. A schematic diagram of the process of training the attention-based neural network model is shown in FIG. 2a, and a schematic diagram of the process by which input values are transformed inside the attention-based neural network is shown in FIG. 2b.
The model is an attention-based neural network. The attention mechanism actively computes the relationships between a word and other words through self-attention. Specifically, the attention-based neural network model can predict the probability of the next amino acid given an amino acid sequence. By running the model iteratively to predict and select amino acids according to the predicted probabilities, an antibody sequence can be generated.
Specifically, the scaled dot-product self-attention function maps input vectors to queries, keys, and values, denoted Q, K, and V, respectively. The weights between the queries and keys are computed from Q and K and multiplied by the values:

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V    (4)
In Equation (4), d_k is the dimension of the key vectors.
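Equation (4) can be illustrated with the following NumPy sketch; the matrices and their shapes are illustrative only:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, per Equation (4)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # query-key similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # row-wise softmax
    return weights @ V, weights

# Two positions, d_k = 2 (illustrative values only)
Q = np.array([[1.0, 0.0], [0.0, 1.0]])
K = np.array([[1.0, 0.0], [0.0, 1.0]])
V = np.array([[1.0, 2.0], [3.0, 4.0]])
out, attn = scaled_dot_product_attention(Q, K, V)
```

Each row of `attn` is a probability distribution over positions, and `out` is the corresponding weighted mixture of the value vectors.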
1.4. Implementation of the attention-based neural network model
For the attention-based neural network model, the number of layers, the embedding dimension, the hidden layer dimension, and the dropout rate were 12, 252, 1024, and 0.3, respectively. The hyperparameters and the early stopping criterion were determined by the validation set loss as the result of a grid search using Weights & Biases (Lukas Biewald, Experiment Tracking with Weights & Biases, software available from wandb.com, January 2020).
The pretrained attention-based neural network model learned general characteristics of natural antibodies, such as the length distribution of HCDR3 and the theoretical isoelectric point (pI). The length distribution of HCDR3 and the theoretical isoelectric points were calculated for amino acid sequences generated by the trained attention-based neural network model and for sequences obtained from the OAS dataset, and the results are shown in FIGS. 3a and 3b. As shown in FIGS. 3a and 3b, the HCDR3 lengths and calculated pI values of the amino acid sequences generated by the trained model were similar to those of the OAS sequences it was trained on.
1.5. Generation of target- and species-specific antibodies
The number of antibody sequences varied widely across targets, ranging from a few to more than a thousand per target. For example, there were 1,313 anti-PD-1 (programmed cell death protein 1) antibodies, whereas fewer than 10 antibody sequences were identified for MET (mesenchymal-epithelial transition). To learn target specificity, the model was fine-tuned on the target-specific dataset starting from the pretrained model weights. The learning curves of the model pretrained and then applied to the patent data (Finetuned) and the model applied directly to the patent data without pretraining (From scratch) were compared, and the results are shown in FIG. 4 (lr: learning rate, the value that determines the optimization speed of the model). FIG. 4 shows the validation loss against the number of training steps. As shown in FIG. 4, the validation loss on the patent dataset decreased faster and converged to a lower value for the pretrained model than for the model trained from scratch. Transfer learning therefore made training to convergence more than four times faster and gave the model better performance.
Species were annotated with species tags using the Conditional Transformer (CTRL) algorithm. In addition, to verify the effectiveness of the CTRL method, sequences were generated from the OAS-pretrained model under three conditions: without a tag, with the human tag, and with the mouse tag.
The effectiveness of the CTRL method in the controlled-generation framework was compared between human and mouse, and the results are shown in Table 1. The numbers in Table 1 represent the proportion of the target species among all sequences generated from the model.
[Table 1]
As shown in Table 1, without a tag, the proportion of generated sequences with human VH and VL frameworks was nearly zero. By contrast, more than half of the sequences generated by the attention-based neural network model with CTRL applied were classified as human frameworks in the VH-VL regions. A simple change from the human tag to the mouse tag caused the model to generate different frameworks. Because OAS contains twice as many mouse sequences as human sequences, CTRL gave a higher proportion with the mouse tag than with the human tag.
Among the generated sequences, the 100 sequences with the lowest loss were confirmed not to overlap with the training, validation, or test datasets, and experimental validation was performed on them.
Example 2. Validation of predicted antibody sequences
2.1. Antigen binding of predicted antibodies
The scFvs designed by the AI in Example 1.5 were digested with the SfiI restriction enzyme (New England Biolabs, USA) and cloned into the pCombi3x vector or the pCDisplay-4 vector. 100 ng of the recombinant scFv plasmid was incubated with ER2738 bacterial competent cells on ice for 20 minutes. The DNA and competent cells were then incubated in a heat block at 42°C for 90 seconds, followed by incubation at 4°C for 10 minutes. LB medium equal to four times the volume of the competent cells was added to the sample, which was incubated at 37°C and 180 rpm for 1 hour. The cells were plated on LB plates containing 50 μg/mL carbenicillin and incubated overnight at 37°C.
To determine whether the AI-designed scFvs bind their antigens (PD-1 and PD-L1), 218 AI-designed anti-PD-1 clones and 183 AI-designed anti-PD-L1 clones were selected and screened by immunoassay. A bacterial expression system was used first, because it requires less effort, time, and cost than a mammalian expression system while performing well.
The OmpA (outer membrane protein A) signal peptide was used to secrete the scFv protein into the periplasmic space, and the bacterial outer membrane and peptidoglycan layer were removed to obtain the periplasmic fraction. The scFv protein was fused to an HA tag at its C-terminus; therefore, soluble scFv protein in the periplasmic fraction bound to the target antigen was detected by ELISA using an HRP-conjugated anti-HA antibody.
Specifically, 750 μL of SB medium (20 g yeast extract, 30 g tryptone, 10 g MOPS per 1 L of water, pH 7.0) containing carbenicillin was added to each well of a 2.2 mL polypropylene deep-well plate (Axygen). Colonies of scFv transformants were inoculated into each well and incubated at 37°C, 180 rpm, for about 3-4 hours. Then 1 mM IPTG (isopropyl β-D-1-thiogalactopyranoside; Solution, Korea) was added to each well and incubated at 30°C, 180 rpm, for 20 hours to induce scFv protein expression. After 20 hours of induction, the deep-well plate was centrifuged at 4000 rpm for 20 minutes and the supernatant was removed. The pellet was resuspended in 400 μL of STE buffer (20% sucrose, 50 mM Tris-Cl pH 8.0, 1 mM EDTA). 100 μL of 10 mg/mL lysozyme was added to each well and incubated on ice at 180 rpm for 10 minutes. To remove the outer membrane and peptidoglycan layer, 50 μL of 1 M MgCl2 was added and incubated at 4°C, 180 rpm, for 10 minutes. The supernatant containing the periplasmic fraction, i.e., the soluble scFv protein, was obtained by centrifugation at 4000 rpm for 20 minutes.
scFv proteins binding the target antigen were selected by enzyme-linked immunosorbent assay (ELISA). As antigens, human PD-1 and human PD-L1 proteins were purchased from ACROBiosystems (USA). 80 μL of each periplasmic fraction was applied to a 96-well MaxiSorp plate coated with the target antigen and incubated at 25°C for 2 hours. Ni-NTA resin and 1-Step™ Ultra TMB-ELISA Substrate Solution were purchased from Thermo Scientific (USA), and disposable columns were purchased from BIO-RAD (USA).
Rabbit HRP-conjugated anti-HA antibody (Bethyl, USA; 1:3000 dilution), 3,3',5,5'-tetramethylbenzidine solution, and 2.5 N H2SO4 were applied sequentially. Antibodies showing an absorbance of 0.1 or higher at 450 nm were selected as target antigen-binding antibodies. The expression level of each clone was verified by immunoblotting. 4 μL of each periplasmic extract was spotted onto a nitrocellulose (NC) membrane (GE Healthcare Life Science, Germany) and incubated with blocking buffer (5% nonfat dry milk (BD, USA) in 0.05% PBST) at 25°C for 1 hour. The blocked NC membrane was incubated with HRP-conjugated anti-HA antibody (1:3000) at 25°C for 1 hour, and chemiluminescence was detected. As a result, 9 anti-PD-1 antibodies and 10 anti-PD-L1 antibodies showed binding to their corresponding antigens in the initial screening. In parallel, immunoblotting was performed on the same periplasmic fractions to confirm periplasmic expression of each candidate.
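The 0.1 absorbance cutoff described above can be applied programmatically when tabulating plate-reader output. The sketch below assumes OD450 readings have already been exported per clone; the clone names and values are illustrative only, not data from this study.

```python
# Select target-binding clones by the OD450 >= 0.1 ELISA cutoff.
# Clone IDs and absorbance readings below are hypothetical examples.
OD450_CUTOFF = 0.1

def select_binders(od450_by_clone, cutoff=OD450_CUTOFF):
    """Return clone IDs whose 450 nm absorbance meets the cutoff, sorted by name."""
    return sorted(clone for clone, od in od450_by_clone.items() if od >= cutoff)

readings = {"clone_77": 1.25, "clone_162": 0.04, "clone_187": 0.88, "clone_201": 0.02}
print(select_binders(readings))  # -> ['clone_187', 'clone_77']
```

In practice a blank-well reading would typically be subtracted before applying the cutoff.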
Purified positive candidates were further evaluated in a dose-dependent ELISA. Nivolumab (Opdivo®) and pembrolizumab (Keytruda®) were used as anti-PD-1 positive controls; atezolizumab, avelumab, and durvalumab were used as anti-PD-L1 positive controls. An in-house non-specific (anti-LPA2) antibody was used as a negative control. Purified antibodies, including the positive and negative controls, were serially diluted in duplicate starting at 1000 nM. At larger scale, two anti-PD-L1 antibodies (clones 162 and 163) could not be purified and were excluded. The remaining 17 purified antibodies were further examined for dose-dependent binding activity to their corresponding antigens. The post-purification concentration, yield, measured Kd value, and absorbance of each scFv clone are shown in Table 2; in Table 2, N/A means not analyzed.
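Dose-dependent potencies such as the EC50 values reported below for clone 77 are read off a serial-dilution binding curve. As a minimal sketch, the function below estimates EC50 by linear interpolation between the two dilution points bracketing the half-maximal signal; a four-parameter logistic fit is more common in practice, and all concentrations and readings here are made-up illustrations.

```python
def estimate_ec50(concs_nM, signals):
    """Estimate EC50 as the concentration where the signal crosses half its maximum.
    concs_nM must be sorted ascending; signals are the matching ELISA readings."""
    half = max(signals) / 2.0
    pairs = list(zip(concs_nM, signals))
    for (c0, s0), (c1, s1) in zip(pairs, pairs[1:]):
        if s0 <= half <= s1:  # half-max crossing lies between these two dilutions
            frac = (half - s0) / (s1 - s0)
            return c0 + frac * (c1 - c0)
    return None  # curve never reaches half-max within the tested range

# Illustrative serial dilution (nM) and ELISA readings
concs = [0.01, 0.1, 1.0, 10.0, 100.0]
sig = [0.05, 0.40, 1.10, 1.80, 2.00]
print(round(estimate_ec50(concs, sig), 2))  # -> 0.87
```

Interpolating on log-transformed concentrations, as sigmoidal dose-response fits do, would give a slightly different estimate.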
[Table 2: post-purification concentration, yield, measured Kd value, and absorbance of each scFv clone — image PCTKR2023013015-appb-img-000010 not reproduced]
Of the antibodies listed in Table 2, five (four anti-PD-1 and one anti-PD-L1) showed no binding to their target antigen. Thus, 14 AI-designed antibodies were identified as positive clones binding the target antigen. Anti-PD-1 antibody clone 77 (0.094 nM) showed a lower EC50 (nM) than the anti-PD-1 positive controls nivolumab (0.93 nM) and pembrolizumab (4.37 nM). Although none exceeded the binding activity of the PD-L1 control antibodies atezolizumab (0.17 nM), avelumab (0.35 nM), and durvalumab (0.71 nM), anti-PD-L1 antibody clone 187 (0.82 nM) showed PD-L1 binding activity comparable to durvalumab.
Thus, it was confirmed that novel antibodies against PD-1 and PD-L1 can be successfully identified by combining AI-based de novo generation with in vitro experimental studies.

Claims (15)

  1. A method of generating an antibody amino acid sequence, the method comprising:
    tagging, by a computing device, one or more first antibody amino acid sequences, each comprising one or more antibody amino acid sequences, with the region or amino acid of each antibody;
    pretraining, by the computing device, a machine learning model by training the tagged one or more first antibody amino acid sequences on general characteristics of antibodies,
    wherein the general characteristics are selected from the group consisting of the length of HCDR3, the length of HCDR2, the length of HCDR1, the length of LCDR3, the length of LCDR2, the length of LCDR1, and isoelectric point;
    tagging, by the computing device, one or more second antibody amino acid sequences having information about a target antigen with the information about the target antigen;
    analyzing, by the computing device, the first antibody amino acid sequence and the second antibody amino acid sequence in the pretrained machine learning model to generate, from the first antibody amino acid sequence, a third antibody amino acid sequence having information about the target antigen;
    verifying in vitro, using the generated third antibody amino acid sequence, whether there is specific binding to the target antigen; and
    for a third antibody amino acid sequence showing no specific binding to the target antigen, generating a fourth antibody amino acid sequence again using the machine learning model.
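The generate-then-verify loop of claim 1 can be sketched as control flow. In the sketch below, `DummyModel`, the toy sequences, and the `verify` callback are placeholder abstractions, since the claim does not prescribe a concrete model architecture or assay.

```python
# Minimal sketch of the claim-1 loop: generate candidates conditioned on a
# target-antigen tag, keep in vitro binders, and regenerate from non-binders.
class DummyModel:
    """Stand-in for the pretrained sequence model (illustrative only)."""
    def generate(self, seqs, tag):
        # Pretend that conditioning on the antigen tag changes the last residue.
        return [s[:-1] + ("W" if tag else "A") for s in seqs]

def design_antibodies(model, seed_seqs, antigen_tag, verify, rounds=3):
    candidates = model.generate(seed_seqs, tag=antigen_tag)  # "third" sequences
    binders = []
    for _ in range(rounds):
        hits = [s for s in candidates if verify(s)]
        binders.extend(hits)
        misses = [s for s in candidates if not verify(s)]
        if not misses:
            break
        # Non-binders are fed back without the antigen tag (cf. claim 13).
        candidates = model.generate(misses, tag=None)         # "fourth" sequences
    return binders

hits = design_antibodies(DummyModel(), ["EVQLVES", "QVQLQQS"], "PD-1",
                         verify=lambda s: s.endswith("W"))
print(hits)  # -> ['EVQLVEW', 'QVQLQQW']
```

In the claimed method the `verify` step is a wet-lab assay (e.g., the ELISA of claim 10), so each loop iteration corresponds to a design-build-test round rather than a pure computation.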
  2. The method of claim 1, wherein the antibody amino acid sequence is the amino acid sequence of one selected from the group consisting of a full-length antibody, a heavy chain variable region, a light chain variable region, a complementarity determining region (CDR), and a framework region.
  3. The method of claim 1, wherein the first antibody amino acid sequence is obtained from the Observed Antibody Space (OAS) database or the Protein Data Bank (PDB) database.
  4. The method of claim 1, further comprising tagging the first antibody amino acid sequence with a species selected from the group consisting of human, mouse, rat, rabbit, and ape.
  5. The method of claim 1, wherein the second antibody amino acid sequence is obtained from a patent database or a phage display library.
  6. The method of claim 1, wherein generating the third antibody amino acid sequence comprises using language modeling to predict the probability of, and select, the amino acid that follows each amino acid in a given amino acid sequence.
  7. The method of claim 6, wherein the language modeling calculates the conditional probability of xi, the amino acid at the i-th position, given the preceding amino acids x<i.
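The conditional probability of claim 7 is the standard autoregressive factorization P(x) = Π_i P(xi | x<i). A toy version over amino-acid letters, conditioning only on the immediately preceding residue, illustrates the computation; the training sequences and counts below are made up, and the claimed method may condition on the full prefix x<i rather than a bigram context.

```python
from collections import Counter, defaultdict

# Toy autoregressive model: P(x_i | x_<i) approximated by bigram counts,
# i.e., conditioning only on the residue immediately before position i.
def train_bigram(sequences):
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            counts[prev][cur] += 1
    return counts

def next_residue_probs(counts, prefix):
    """Probability of each possible next amino acid given the prefix."""
    c = counts[prefix[-1]]
    total = sum(c.values())
    return {aa: n / total for aa, n in c.items()}

model = train_bigram(["CARDY", "CARGY", "CASDY"])
print(next_residue_probs(model, "CA"))  # R appears after A in 2 of 3 sequences
```

Generation then proceeds by repeatedly sampling (or greedily selecting) the next residue from this conditional distribution and appending it to the prefix, as claim 6 describes.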
  8. The method of claim 6, wherein the language modeling uses an attention-based neural network model.
  9. The method of claim 1, wherein the target antigen is selected from the group consisting of PD-1, PD-L1, CTLA-4, LAG-3, BTLA, CD200, CD276, KIR, TIM-1, TIM-3, TIGIT, VISTA, CD27, CD28, CD40, CD40L, CD70, CD75, CD80, CD86, CD73, CD137, GITR, GITRL, IL15, OX40, OX40L, IDO-1, IDO-2, A2AR, ICOS, ICOSL, 4-1BB, and 4-1BBL.
  10. The method of claim 1, wherein verifying whether there is specific binding to the target antigen comprises performing a method selected from the group consisting of enzyme-linked immunosorbent assay (ELISA), radial immunodiffusion, immunoprecipitation assay, radioimmunoassay (RIA), immunofluorescence assay, and immunoblotting.
  11. The method of claim 10, wherein verifying in vitro whether there is specific binding to the target antigen comprises determining, as a parameter, the absorbance at a wavelength of 450 nm or the half maximal effective concentration (EC50) in an ELISA.
  12. The method of claim 11, wherein the third antibody amino acid sequence is determined to have specific binding to the target antigen when the absorbance at a wavelength of 450 nm in the ELISA is 0.1 or higher.
  13. The method of claim 1, wherein a third antibody amino acid sequence showing no specific binding to the target antigen is passed back to the machine learning model without a tag for the target antigen.
  14. A method of producing an antibody pharmaceutical, comprising producing, in vitro, an antibody from the antibody amino acid sequence generated by the method of claim 1.
  15. A computer-readable medium recording a program adapted to perform the method according to any one of claims 1 to 14.
PCT/KR2023/013015 2022-09-01 2023-08-31 Method for generating antibody sequence using machine learning technology WO2024049245A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
KR10-2022-0110819 2022-09-01
KR1020220110819A KR20240031723A (en) 2022-09-01 2022-09-01 Method for generating antibody sequence using machine learning

Publications (1)

Publication Number Publication Date
WO2024049245A1 true WO2024049245A1 (en) 2024-03-07

Family

ID=90098434

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2023/013015 WO2024049245A1 (en) 2022-09-01 2023-08-31 Method for generating antibody sequence using machine learning technology

Country Status (2)

Country Link
KR (1) KR20240031723A (en)
WO (1) WO2024049245A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20180022537A (en) * 2016-08-23 2018-03-06 주식회사 스탠다임 Method for predicting therapeutic efficacy of combined drug by machine learning ensemble model
US20190065677A1 (en) * 2017-01-13 2019-02-28 Massachusetts Institute Of Technology Machine learning based antibody design
KR20220026869A (en) * 2020-08-26 2022-03-07 이화여자대학교 산학협력단 A novel method for generating an antibody library and the generated library therefrom

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2022551119A (en) 2019-10-03 2022-12-07 ヤンセン バイオテツク,インコーポレーテツド Methods for producing biotherapeutic agents with increased stability by sequence optimization


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
MARK L. CHIU, DENNIS R. GOULET, ALEXEY TEPLYAKOV, GARY L. GILLILAND: "Antibody Structure and Function: The Basis for Engineering Therapeutics", ANTIBODIES, vol. 8, no. 4, 3 December 2019 (2019-12-03), pages 55, XP055702945, DOI: 10.3390/antib8040055 *
SAKA KOICHIRO, KAKUZAKI TARO, METSUGI SHOICHI, KASHIWAGI DAIKI, YOSHIDA KENJI, WADA MANABU, TSUNODA HIROYUKI, TERAMOTO REIJI: "Antibody design using LSTM based deep generative model from phage display library for affinity maturation", SCIENTIFIC REPORTS, vol. 11, no. 1, 1 December 2021 (2021-12-01), XP055876990, DOI: 10.1038/s41598-021-85274-7 *

Also Published As

Publication number Publication date
KR20240031723A (en) 2024-03-08

Similar Documents

Publication Publication Date Title
Perchiacca et al. Engineering aggregation-resistant antibodies
Bradbury et al. Beyond natural antibodies: the power of in vitro display technologies
Fridy et al. A robust pipeline for rapid production of versatile nanobody repertoires
Venkataraman et al. A toolbox of immunoprecipitation-grade monoclonal antibodies to human transcription factors
Ebo et al. An in vivo platform to select and evolve aggregation-resistant proteins
EP2482212A1 (en) Method of acquiring proteins with high affinity by computer aided design
EP2871189A1 (en) High-affinity monoclonal anti-strep-tag antibody
Townsend et al. Augmented binary substitution: single-pass CDR germ-lining and stabilization of therapeutic antibodies
Almagro et al. Characterization of a high‐affinity human antibody with a disulfide bridge in the third complementarity‐determining region of the heavy chain
Finlay et al. Phage display: a powerful technology for the generation of high specificity affinity reagents from alternative immune sources
CN113646330A (en) Engineered CD25 polypeptides and uses thereof
Entzminger et al. De novo design of antibody complementarity determining regions binding a FLAG tetra-peptide
Lee et al. A two-in-one antibody engineered from a humanized interleukin 4 antibody through mutation in heavy chain complementarity-determining regions
US20140030253A1 (en) HUMANIZED FORMS OF MONOCLONAL ANTIBODIES TO HUMAN GnRH RECEPTOR
Finlay et al. Phage display: a powerful technology for the generation of high-specificity affinity reagents from alternative immune sources
Murphy et al. Enhancing recombinant antibody performance by optimally engineering its format
Al Qaraghuli et al. Defining the complementarities between antibodies and haptens to refine our understanding and aid the prediction of a successful binding interaction
WO2024049245A1 (en) Method for generating antibody sequence using machine learning technology
Sakaguchi et al. Rapid and reliable hybridoma screening method that is suitable for production of functional structure-recognizing monoclonal antibody
Muzard et al. Grafting of protein L-binding activity onto recombinant antibody fragments
KR20080033877A (en) Protein, method for immobilizing protein, structure, biosensor, nucleic acid, vector and kit for detecting target substance
Karadag et al. Physicochemical determinants of antibody-protein interactions
Wu et al. A fast and efficient procedure to produce scFvs specific for large macromolecular complexes
Gilodi et al. Selection and modelling of a new single-domain intrabody against TDP-43
Zhang et al. Humanization of the shark VNAR single domain antibody using CDR grafting

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23860920

Country of ref document: EP

Kind code of ref document: A1