WO2024049245A1 - Method for generating antibody sequence using machine learning technology - Google Patents

Method for generating antibody sequence using machine learning technology Download PDF

Info

Publication number
WO2024049245A1
WO2024049245A1 PCT/KR2023/013015
Authority
WO
WIPO (PCT)
Prior art keywords
amino acid
antibody
acid sequence
antibody amino
target antigen
Prior art date
Application number
PCT/KR2023/013015
Other languages
French (fr)
Korean (ko)
Inventor
서승우
박은영
강은지
김채은
강태현
곽민우
Original Assignee
주식회사 스탠다임 (Standigm Inc.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 주식회사 스탠다임 (Standigm Inc.)
Publication of WO2024049245A1 publication Critical patent/WO2024049245A1/en

Links

Images

Classifications

    • C CHEMISTRY; METALLURGY
    • C07 ORGANIC CHEMISTRY
    • C07K PEPTIDES
    • C07K16/00 Immunoglobulins [IGs], e.g. monoclonal or polyclonal antibodies
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01N INVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N33/00 Investigating or analysing materials by specific methods not covered by groups G01N1/00 - G01N31/00
    • G01N33/48 Biological material, e.g. blood, urine; Haemocytometers
    • G01N33/50 Chemical analysis of biological material, e.g. blood, urine; Testing involving biospecific ligand binding methods; Immunological testing
    • G01N33/53 Immunoassay; Biospecific binding assay; Materials therefor
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B5/00 ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks

Definitions

  • Therapeutic antibodies have been a focus of the pharmaceutical industry since 1986, when the U.S. Food and Drug Administration approved the first monoclonal antibody (mAb) product.
  • the top 10 bestsellers in 2021 include four therapeutic antibodies, and the number of therapeutic antibodies currently approved by the FDA exceeds 100.
  • a mAb has two chains (heavy chain and light chain) and two regions (variable and constant) for each chain.
  • the variable heavy chain (VH) and variable light chain (VL) are primarily responsible for target specificity.
  • VH consists of 114 amino acids
  • VL consists of 110 amino acids
  • the variable region consists of approximately 220 amino acids.
  • the number of possible amino acid sequences of the variable region is calculated as A^L, where A is the number of amino acid types (20) and L is the protein length (approximately 220 for an antibody variable region), making it impossible to search all possible sequences of the variable region even with the most powerful high-throughput screening method. If site-specific amino acid frequencies in the constant region are taken into account, this number is significantly reduced. However, even for the creation or optimization of heavy chain complementarity determining region 3 (HCDR3) alone, which has an average length of 15 amino acids, approximately 10^8 sequences must be searched. By contrast, deep mutational scanning-based mutagenesis libraries span only approximately 10^4 sequences.
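The arithmetic above can be sketched in a few lines. This is a minimal illustration only: the raw HCDR3 space is 20^15, while the ~10^8 figure in the text already reflects site-specific frequency constraints.

```python
import math

# Search-space sizes assumed from the text: A = 20 amino acid types,
# L = 220 for the variable region, 15 for HCDR3.
def sequence_space(num_amino_acids: int, length: int) -> int:
    """Number of possible sequences of the given length: A**L."""
    return num_amino_acids ** length

variable_region = sequence_space(20, 220)  # ~10**286: beyond any screening method
hcdr3_raw = sequence_space(20, 15)         # ~3.3e19 before frequency constraints

print(f"variable region: ~10^{round(math.log10(variable_region))} sequences")
print(f"raw HCDR3 space: ~10^{round(math.log10(hcdr3_raw))} sequences")
```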
  • HCDR3 heavy chain complementarity determining region 3
  • deep learning is considered one of the most powerful tools.
  • deep learning is showing significant progress in many fields, including images, natural language, and protein structure.
  • deep learning has been used to find binding between HCDR3 and a target antigen; for example, deep learning models are applied to screen in silico optimized HCDR3 sequences from a library.
  • a screening library was generated by mutating every single site in HCDR3 and introducing site-directed mutagenesis.
  • however, such approaches are not only limited to a given template sequence; generating the entire sequence of the variable region also requires a huge amount of data, and the available data are insufficient.
  • Patent Document 1 KR 10-2022-0091497 A
  • a method for manufacturing an antibody drug using the generated antibody amino acid sequence is provided.
  • a computer-readable medium recording a program applied to perform a method for generating an antibody amino acid sequence.
  • a method of generating an antibody amino acid sequence comprising:
  • tagging, by a computing device, one or more first antibody amino acid sequences comprising one or more antibody amino acid sequences with a region or amino acid of each antibody;
  • pretraining, by the computing device, a machine learning model by training the machine learning model on the one or more tagged first antibody amino acid sequences to learn general characteristics of the antibody;
  • the general characteristic is selected from the group consisting of length of HCDR3, length of HCDR2, length of HCDR1, length of LCDR3, length of LCDR2, length of LCDR1, and isoelectric point;
  • generating, using the machine learning model, a fourth antibody amino acid sequence from the third antibody amino acid sequence that does not specifically bind to the target antigen.
  • the term “antibody” is used interchangeably with the term “immunoglobulin (Ig).”
  • the antibody may be, for example, IgA, IgD, IgE, IgG, or IgM.
  • the antibody may be a monoclonal antibody or a polyclonal antibody.
  • the antibody may be an animal-derived antibody, a mouse-human chimeric antibody, a humanized antibody, or a human antibody.
  • a complete antibody has a structure of two full-length light chains and two full-length heavy chains, and each light chain is bound to the heavy chain through a disulfide bond (SS-bond).
  • Each heavy chain consists of a heavy chain variable region (VH) and a heavy chain constant region (consisting of domains CH1, hinge, CH2, and CH3).
  • VH heavy chain variable region
  • CH3 heavy chain constant region
  • Each light chain consists of a light chain variable region (VL) and a light chain constant region (CL).
  • VL variable region
  • CL light chain constant region
  • the VH and VL regions can be further subdivided into hypervariable regions called complementarity determining regions (CDRs) interspersed with framework regions (FRs).
  • CDRs complementarity determining regions
  • Each VH and VL consists of three CDR and four FR fragments arranged from amino-to-carboxy-terminus in the following order: FR1, CDR1, FR2, CDR2, FR3, CDR3, and FR4.
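The amino-to-carboxy ordering above can be sketched as simple concatenation; the fragment strings below are illustrative placeholders, not real antibody fragments.

```python
# Assemble a variable region in the order stated in the text:
# FR1, CDR1, FR2, CDR2, FR3, CDR3, FR4.
def assemble_variable_region(frs: list, cdrs: list) -> str:
    assert len(frs) == 4 and len(cdrs) == 3
    return frs[0] + cdrs[0] + frs[1] + cdrs[1] + frs[2] + cdrs[2] + frs[3]

# Placeholder fragments (not from any real antibody):
vh = assemble_variable_region(["EVQL", "WVRQ", "RFTI", "WGQG"],
                              ["GFTF", "ISGS", "ARDY"])
```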
  • Immunoglobulins can be divided into five types: IgA, IgD, IgE, IgG, and IgM, depending on the heavy chain constant domain amino acid sequence.
  • IgA and IgG can be further divided into isotypes IgA1, IgA2, IgG1, IgG2, IgG3, and IgG4.
  • Antibody light chains from any vertebrate species can be assigned to one of two clearly distinct types, kappa (κ) and lambda (λ), based on the amino acid sequence of their constant domains.
  • CDR complementarity determining region
  • there are three CDRs (HCDR1, HCDR2, HCDR3) in the VH and three CDRs (LCDR1, LCDR2, LCDR3) in the VL.
  • CDRs can be defined according to numbering schemes such as Kabat (Wu et al., (1970) J Exp Med 132(2): 211-250) and Chothia (Chothia et al., (1987) J. Mol. Biol.
  • Framework regions are antibody regions that act as supports for CDRs.
  • the framework region is responsible for supporting the binding of the antigen to the antibody.
  • Framework residues contact the antigen, are part of the binding site of the antibody, and include residues that are close to the CDR in sequence or are located in close proximity to the CDR when folded into a three-dimensional structure. Framework residues may also include residues that do not contact the antigen but indirectly affect binding by contributing to structural support for the CDR.
  • FRs can be numbered using various definitions such as Kabat, Chothia, IMGT, and AbM.
  • FR1, FR2, FR3, and FR4 include FRs defined by any of the methods described above.
  • HCFR refers to heavy chain framework regions FR1, FR2, FR3, or FR4.
  • LCFR refers to light chain framework regions FR1, FR2, FR3, or FR4.
  • Typical 1-letter and 3-letter amino acid codes can be represented in the table below.
  • the antibody includes an antigen-binding fragment.
  • An antigen-binding fragment is a fragment of the entire structure of an immunoglobulin and refers to a portion of a polypeptide containing a portion to which an antigen can bind.
  • the antigen-binding fragment may be scFv, (scFv)2, Fv, Fab, Fab', F(ab')2, a diabody, a triabody, a tetrabody, Bis-scFv, a nanobody, or a combination thereof.
  • Antigen-binding fragments can be conjugated to other antibodies, proteins, antigen-binding fragments, or alternative supports to generate bispecific and multispecific proteins.
  • the antibody amino acid sequence may be an amino acid sequence selected from the group consisting of a full-length antibody, a heavy chain variable region, a light chain variable region, a complementarity determining region (CDR), and a framework region.
  • the antibody amino acid sequence may be data stored in a storage unit of a computing device.
  • the method includes tagging, by a computing device, one or more first antibody amino acid sequences comprising one or more antibody amino acid sequences with a region or amino acid of each antibody.
  • the first antibody amino acid sequence may be obtained from the OAS (Observed Antibody Space) database or the PDB (Protein Data Bank) database.
  • the OAS database may be a database containing a set of human repertoire antibody amino acid sequences.
  • Antibody amino acid sequences stored in the OAS database may have data such as species, vaccination history, and disease.
  • the antibody amino acid sequence stored in the OAS database may not have data on the target antigen.
  • the PDB (Protein Data Bank) database is a database of three-dimensional structural data of biomolecules such as proteins and nucleic acids.
  • the tag refers to a keyword or term assigned to a piece of information (e.g., a database record or Internet bookmark or computer program).
  • the tag may be an unknown tag ([Unk]), a tag indicating a target antigen, a tag indicating a species, or a tag indicating the position of an amino acid.
  • the tag indicating the species may be a tag indicating the species of the heavy chain variable region (VH species) or a tag indicating the species of the light chain variable region (VL species).
  • the tag indicating the species may be a tag indicating a species selected from the group consisting of humans, mice, rats, rabbits, and apes.
  • the tag may be attached to the N-terminus, middle, or C-terminus of the antibody amino acid sequence.
  • one antibody amino acid sequence may be tagged in the following order: [Unk][VH species][VL species]<vh3>...<vl3>...<end> or [target][VH species][VL species]<vh3>...<vl3>...<end>.
  • the first antibody amino acid sequence may further include tagging a species selected from the group consisting of humans, mice, rats, rabbits, and apes.
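A minimal sketch of the tagging scheme described above. The token names ([Unk], species tags, <vh3>, <vl3>, <end>) follow the text; the helper function itself is illustrative, not taken from the patent.

```python
from typing import Optional

def tag_sequence(vh_cdr3: str, vl_cdr3: str,
                 target: Optional[str] = None,
                 vh_species: str = "human",
                 vl_species: str = "human") -> list:
    """Tag an antibody sequence as [Unk|target][VH species][VL species]<vh3>...<vl3>...<end>."""
    head = f"[{target}]" if target else "[Unk]"  # no known antigen -> unknown tag
    return ([head, f"[{vh_species}]", f"[{vl_species}]", "<vh3>"]
            + list(vh_cdr3) + ["<vl3>"] + list(vl_cdr3) + ["<end>"])

tokens = tag_sequence("ARDYW", "QQYNS")                 # untagged antigen -> [Unk]
tagged = tag_sequence("ARDYW", "QQYNS", target="PD-1")  # antigen-labeled sequence
```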
  • the method includes pretraining, by the computing device, a machine learning model by training the machine learning model on the one or more tagged first antibody amino acid sequences to learn general characteristics of the antibody.
  • the general characteristic may be selected from the group consisting of length of HCDR3, length of HCDR2, length of HCDR1, length of LCDR3, length of LCDR2, length of LCDR1, and isoelectric point.
  • the isoelectric point (pI) refers to the specific pH at which the net charge of an amphoteric electrolyte containing both anionic and cationic groups, such as a protein, is 0.
  • the isoelectric point may be that of the full length antibody, heavy chain variable region, light chain variable region, complementarity determining region (CDR), or framework region.
  • the machine learning model can output amino acid sequence information that is identical to the input antibody amino acid sequence information, or that has a sequence identity of about 99% or more, about 98% or more, about 97% or more, about 96% or more, about 95% or more, about 94% or more, about 93% or more, about 92% or more, about 91% or more, about 90% or more, about 85% or more, about 80% or more, about 75% or more, about 70% or more, about 65% or more, about 60% or more, about 55% or more, or about 50% or more.
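A minimal sketch of a percent-identity check of the kind described above, assuming pre-aligned, equal-length sequences; real pipelines would run a sequence alignment first.

```python
def percent_identity(a: str, b: str) -> float:
    """Position-wise percent identity between two pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)

# Illustrative 15-residue sequences differing at one position:
pid = percent_identity("ARDYWGQGTLVTVSS", "ARDYWGQGTLVTVSA")
```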
  • the machine learning model may drive a machine learning algorithm.
  • the machine learning algorithm may be LGBM (Light Gradient Boosting Machine), AdaBoost, Voting Classifier, Random Forest, a logistic algorithm, a neural network, or QDA (Quadratic Discriminant Analysis).
  • the machine learning algorithm may be a deep learning algorithm.
  • the machine learning algorithm may be linear logistic regression, a recurrent neural network (RNN), or a convolutional neural network (CNN).
  • the method includes tagging one or more second antibody amino acid sequences with information about the target antigen by the computing device.
  • the second antibody amino acid sequence may be obtained from a patent database or a phage display library.
  • the patent database includes a dataset in which antibody amino acid sequences obtained from patents are extracted and converted into data.
  • the patent database can be divided into regions by the IMGT method using ANARCI (Antigen receptor Numbering And Receptor ClassIfication).
  • the patent database may have information on target antigens.
  • the method includes a step of analyzing, by the computing device, the first antibody amino acid sequence and the second antibody amino acid sequence in the pre-trained machine learning model to generate, from the first antibody amino acid sequence, a third antibody amino acid sequence having information about the target antigen.
  • language modeling can be used to predict the probability of the next amino acid in a given amino acid sequence and to select the next amino acid accordingly.
  • the language modeling can calculate the conditional probability of x_i, the amino acid at the i-th position, given the amino acids preceding the i-th position, x_{<i}: P(x) = ∏_{i=1}^{l} P(x_i | x_{<i})
  • x 0 refers to the first character given
  • l refers to the total length of the sentence.
  • the neural network function f(x_{<i}) can be converted to a conditional probability with a softmax function: P(x_i = k | x_{<i}) = exp(f_k(x_{<i})) / Σ_{j=1}^{K} exp(f_j(x_{<i}))
  • K is the number of words, and j and k represent the indices of tokens in the word.
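The autoregressive factorization and softmax above can be illustrated with a toy stand-in for the neural network f; the uniform model below is an assumption for demonstration only.

```python
import math

VOCAB = list("ACDEFGHIKLMNPQRSTVWY")  # 20 amino acid tokens, so K = 20

def softmax(logits):
    """Convert K logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sequence_log_prob(seq, f):
    """Sum of log P(x_i | x_<i), with each conditional given by softmax(f(prefix))."""
    total = 0.0
    for i, ch in enumerate(seq):
        probs = softmax(f(seq[:i]))
        total += math.log(probs[VOCAB.index(ch)])
    return total

uniform = lambda prefix: [0.0] * len(VOCAB)  # toy f: every token equally likely
lp = sequence_log_prob("ACD", uniform)       # equals 3 * log(1/20)
```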
  • the language modeling may be an attention-based neural network model.
  • the attention mechanism is self-attention and can actively calculate relationships between words and other words. For example, an attention mechanism can predict the probability of the next amino acid in a given amino acid sequence.
  • the attention mechanism can generate antibody sequences by repeatedly executing to predict and select amino acids with predicted probabilities.
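The repeated predict-and-select loop above can be sketched as follows; the toy model and vocabulary are illustrative assumptions, not the patent's trained network.

```python
import random

# 20 amino acid tokens plus a stop token, as an illustrative vocabulary.
VOCAB = list("ACDEFGHIKLMNPQRSTVWY") + ["<end>"]

def generate(model, max_len=30, seed=0):
    """Repeatedly predict per-token weights for the prefix and sample the next token."""
    rng = random.Random(seed)
    seq = []
    while len(seq) < max_len:
        token = rng.choices(VOCAB, weights=model(seq), k=1)[0]
        if token == "<end>":
            break
        seq.append(token)
    return "".join(seq)

toy_model = lambda prefix: [1.0] * 20 + [5.0]  # toy weights strongly favoring <end>
antibody_fragment = generate(toy_model)
```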
  • hyperparameters and early stopping criteria can be determined by the validation set loss as a result of grid search.
  • the method includes the step of verifying whether there is specific binding to the target antigen in vitro using the generated third antibody amino acid sequence.
  • the target antigen refers to any molecule (e.g., protein, peptide, polysaccharide, glycoprotein, glycolipid, nucleic acid, portion thereof, or combination thereof) capable of mediating an immune response.
  • the immune response may include antibody production and activation of immune cells such as T cells, B cells, or NK cells.
  • the target antigen may be selected from the group consisting of PD-1, PD-L1, CTLA-4, LAG-3, BTLA, CD200, CD276, KIR, TIM-1, TIM-3, TIGIT, VISTA, CD27, CD28, CD40, CD40L, CD70, CD75, CD80, CD86, CD73, CD137, GITR, GITRL, IL15, OX40, OX40L, IDO-1, IDO-2, A2AR, ICOS, ICOSL, 4-1BB, and 4-1BBL.
  • the step of verifying whether there is specific binding to the target antigen may be performed by a method selected from the group consisting of ELISA (Enzyme-Linked Immunosorbent Assay), radial immunodiffusion, immunoprecipitation analysis, RIA (Radioimmunoassay), immunofluorescence analysis, and immunoblotting.
  • the verification step may measure binding affinity to the target antigen, the equilibrium dissociation constant (KD), neutralization of the target antigen, or inhibition of the target antigen.
  • KD equilibrium dissociation constant
  • the step of verifying whether there is specific binding to the target antigen in vitro can be performed using absorbance at a wavelength of 450 nm or EC50 (half-maximal effective concentration) as a parameter in ELISA.
  • absorbance or EC 50 half maximal effective concentration
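As one illustration of the EC50 parameter mentioned above, a crude linear-interpolation estimate from made-up OD450 readings; real analyses typically fit a four-parameter logistic curve instead.

```python
def ec50(concentrations, responses):
    """Estimate the half-maximal effective concentration by linear interpolation."""
    half = (min(responses) + max(responses)) / 2
    for (c0, r0), (c1, r1) in zip(zip(concentrations, responses),
                                  zip(concentrations[1:], responses[1:])):
        if (r0 - half) * (r1 - half) <= 0:  # half-maximum is crossed in this interval
            return c0 + (half - r0) * (c1 - c0) / (r1 - r0)
    raise ValueError("half-maximal response not bracketed by the data")

# Made-up dose-response data (concentration vs. OD450), for illustration only:
conc = [0.1, 1.0, 10.0, 100.0]
od450 = [0.05, 0.20, 0.80, 0.95]
estimate = ec50(conc, od450)
```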
  • the method may be terminated without passing the third antibody amino acid sequence to the machine learning model again.
  • the method includes generating a fourth antibody amino acid sequence using the third antibody amino acid sequence that does not specifically bind to the target antigen using a machine learning model.
  • the third antibody amino acid sequence that does not specifically bind to the target antigen can be transferred back to the machine learning model without tagging the target antigen.
  • the tag can be changed to an unknown tag ([Unk]) and passed back to the machine learning model.
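The retag-and-recycle step described above might be sketched like this; the tag layout follows the text, while the function name and example tokens are hypothetical.

```python
def retag_for_feedback(tagged_seq: list, bound: bool) -> list:
    """Sequences that failed the binding assay lose their target tag ([X] -> [Unk])
    before being passed back to the machine learning model."""
    if bound:
        return tagged_seq                 # binder: keep the target-antigen tag
    return ["[Unk]"] + tagged_seq[1:]     # non-binder: replace tag with unknown

# Hypothetical tagged candidate sequence:
candidate = ["[PD-1]", "[human]", "[human]", "<vh3>", "A", "R", "<end>"]
recycled = retag_for_feedback(candidate, bound=False)
```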
  • Another aspect provides a method of making an antibody drug product comprising constructing an antibody from an antibody amino acid sequence produced according to one aspect in vitro.
  • a third antibody amino acid sequence or a fourth antibody amino acid sequence that specifically binds to the target antigen can be used to generate antibodies in vitro.
  • Antibody pharmaceuticals can be manufactured by loading a polynucleotide encoding the antibody into a vector and transforming the vector into cells.
  • the polynucleotide may additionally include a nucleic acid encoding a signal sequence or leader sequence.
  • signal sequence used herein refers to a signal peptide that directs secretion of a target protein.
  • the signal peptide is cleaved after translation in the host cell.
  • the signal sequence is an amino acid sequence that initiates the movement of proteins through the ER (Endoplasmic reticulum) membrane. After initiation, the signal sequence is cleaved within the lumen of the ER by a cellular enzyme commonly known as signal peptidase.
  • the signal sequence may be a secretion signal sequence of tPa (Tissue Plasminogen Activation), HSV gDs (Signal sequence of Herpes simplex virus glycoprotein D), or growth hormone.
  • tPa tissue Plasminogen Activation
  • HSV gDs signal sequence of Herpes simplex virus glycoprotein D
  • a growth hormone secretion signal sequence used in higher eukaryotic cells, including mammals, can also be used.
  • the signal sequence may be used as the wild-type signal sequence, or may be used after replacing its codons with codons frequently expressed in the host cell.
  • the vector can be introduced into a host cell and recombined and inserted into the host cell genome.
  • the vector is understood as a nucleic acid vehicle containing a polynucleotide sequence capable of spontaneous replication as an episome.
  • the vectors include linear nucleic acids, plasmids, phagemids, cosmids, RNA vectors, viral vectors and analogs thereof.
  • viral vectors include, but are not limited to, retroviruses, adenoviruses, and adeno-associated viruses.
  • the vector may be plasmid DNA, phage DNA, etc.; for example, commercially developed plasmids (pUC18, pBAD, pIDTSAMRT-AMP, etc.), E. coli-derived plasmids (pYG601BR322, pBR325, pUC118, pUC119, etc.), Bacillus subtilis-derived plasmids (pUB110, pTP5, etc.), yeast-derived plasmids (YEp13, YEp24, YCp50, etc.), phage DNA (Charon4A, Charon21A, EMBL3, EMBL4, λ, etc.), animal virus vectors (retrovirus, adenovirus, vaccinia virus, etc.), or insect virus vectors (baculovirus, etc.) can be used.
  • Host cells of the transformed cells may include, but are not limited to, cells of prokaryotic, eukaryotic, mammalian, plant, insect, fungal or cellular origin.
  • An example of the prokaryotic cell may be Escherichia coli.
  • yeast can be used as an example of a eukaryotic cell.
  • CHO cells, F2N cells, COS cells, BHK cells, Bowes melanoma cells, HeLa cells, 911 cells, AT1080 cells, A549 cells, HEK 293 cells, or HEK293T cells can be used as the mammalian cells, but the mammalian cells are not limited thereto, and any cell known to those skilled in the art to be usable as a mammalian host cell can be used.
  • when introducing an expression vector into a host cell, the CaCl2 precipitation method; the Hanahan method, which increases efficiency by using the reducing substance DMSO (dimethyl sulfoxide) in the CaCl2 precipitation method; electroporation; and the calcium phosphate precipitation method can be used.
  • DMSO dimethyl sulfoxide
  • the protoplast fusion method, a stirring method using silicon carbide fiber, Agrobacterium-mediated transformation, transformation using PEG, dextran sulfate, or Lipofectamine, and drying/inhibition-mediated transformation, etc. can also be used.
  • glycosylation-related genes of the host cell can be manipulated using methods known to those skilled in the art to adjust the antibody's sugar chain pattern (e.g., sialylation, fucosylation, glycosylation).
  • the transformed cells can be cultured using methods widely known in the art. Specifically, the culture may be carried out continuously in a batch process, or in a fed-batch or repeated fed-batch process.
  • the antibody pharmaceutical may be a pharmaceutical composition for preventing or treating cancer.
  • the cancer may be any one selected from the group consisting of stomach cancer, liver cancer, lung cancer, colon cancer, breast cancer, prostate cancer, gallbladder cancer, bladder cancer, kidney cancer, esophageal cancer, skin cancer, rectal cancer, osteosarcoma, multiple myeloma, glioma, ovarian cancer, pancreatic cancer, cervical cancer, endometrial cancer, thyroid cancer, laryngeal cancer, testicular cancer, mesothelioma, acute myeloid leukemia, chronic myeloid leukemia, acute lymphoblastic leukemia, chronic lymphoblastic leukemia, brain tumor, neuroblastoma, retinoblastoma, head and neck cancer, salivary gland cancer, and lymphoma.
  • Another aspect provides a computer-readable medium recording a program applied to perform a method of generating an antibody amino acid sequence according to one aspect.
  • the computer-readable medium includes tagging one or more first antibody amino acid sequences comprising one or more antibody amino acid sequences with a region or amino acid of each antibody by a computing device;
  • pretraining, by the computing device, a machine learning model by training the machine learning model on the one or more tagged first antibody amino acid sequences to learn general characteristics of the antibody;
  • the general characteristic is selected from the group consisting of length of HCDR3, length of HCDR2, length of HCDR1, length of LCDR3, length of LCDR2, length of LCDR1, and isoelectric point;
  • a program containing command codes for executing these steps is stored in the storage unit (SSD/HDD).
  • Antibody amino acid sequence information stored in another server DB may be transmitted to the communication unit of the computing device through local network communication (e.g., USB, LAN) or from a remote server over a metropolitan area network.
  • the antibody amino acid sequence information transmitted to the communication unit may be stored in a storage unit or memory.
  • a new antibody amino acid sequence can be generated by processing the data of the antibody amino acid sequence stored in the storage unit in the processing unit (FIG. 5).
  • With the method for generating an antibody amino acid sequence, the method for producing an antibody pharmaceutical using the generated antibody amino acid sequence, and the computer-readable medium recording a program for applying the same, the entire antibody amino acid sequence can be generated while target antigen specificity or species specificity is controlled, and the proportion of antibody amino acid sequences that bind to the target antigen can be increased.
  • Figure 1 is a schematic diagram showing a deep learning method, generation of an antibody sequence, ELISA verification, and repetition thereof according to one aspect.
  • Figure 2a is a schematic diagram of the process of learning an attention-based neural network model
  • Figure 2b is a schematic diagram of the process of converting input values within an attention-based neural network.
  • Figure 3a is a graph showing the length distribution of HCDR3 in the amino acid sequence generated from the trained attention-based neural network and the sequence obtained from the OAS dataset
  • Figure 3b is a graph showing the distribution of theoretical isoelectric points calculated from the amino acid sequences generated from the trained attention-based neural network and the sequences obtained from the OAS dataset.
  • Figure 4 is a graph comparing the learning curves of a pre-trained model (Finetuned) and a model trained from scratch (From scratch).
  • FIG. 5 is a schematic diagram of a computing device according to an aspect.
  • FIG. 6 is a flowchart showing a method for producing an antibody amino acid sequence according to one aspect.
  • Deep learning models are being used to screen in silico optimized HCDR3 sequences from libraries.
  • a screening library is created by mutating every single site in HCDR3 to introduce site-directed mutagenesis.
  • the limitation of existing technology is that it focused only on selecting an optimized HCDR3 according to a given template sequence. This is because the amount of data required by a deep learning model increases exponentially as the region of interest expands to the entire variable region; in machine learning theory, this is called the curse of dimensionality.
  • a method to design antibody sequences was devised by combining deep learning and biological analysis. To design the entire sequence of the variable region, a generative model was used rather than a supervised model that only predicts target specificity. Generative models can predict the characteristics of other antibodies in order to design new therapeutic antibody sequences.
  • Model training, sequence generation, and biological validation provide analysis results to the training model, creating a feedback loop ( Figure 1).
  • This loop of training, query (sequence generation), and verification is called active learning, and is effective when there is very little labeled data.
  • active learning effectively searches for antibody sequences that have regions that differ from the native HCDR3 sequence. As additional iterations are performed, the number of sequences found increases.
  • the Observed Antibody Space (OAS) database is an extensive antibody database created in 2018 in which antibody sequences are cleaned, annotated, and translated. Specifically, the OAS database is a cleaned, ImMunoGeneTics (IMGT)-numbered dataset containing only VH and VL sequences. OAS is rich in metadata such as species, vaccination history, and diseases. Since cancer is not an annotated disease in OAS, an unknown tag ([Unk]) was used for the entire OAS dataset. Each VH and VL sequence can be divided into 7 regions: 3 CDRs and 4 framework regions. The OAS database was already divided into regions using the IMGT numbering method, but the self-produced patent database was not.
  • IMGT ImMunoGeneTics
  • ANARCI Antigen receptor Numbering And Receptor ClassIfication
  • Language modeling is a time-series model that assigns a probability to an entire sequence x of length l, that is, P(x).
  • P(x) can be decomposed into the conditional probability of the character at the i-th position, x_i, given the preceding characters, x_{<i}: P(x) = ∏_{i=1}^{l} P(x_i | x_{<i})
  • x 0 refers to the first character given
  • l refers to the total length of the sentence.
  • the neural network function f(x_{<i}) can be converted to a conditional probability with a softmax function: P(x_i = k | x_{<i}) = exp(f_k(x_{<i})) / Σ_{j=1}^{K} exp(f_j(x_{<i}))
  • K is the number of words, and j and k represent the indices of tokens in the word.
  • Many neural network structures can be configured to estimate conditional probabilities, such as linear logistic regression, recurrent neural networks (RNN), and convolutional neural networks (CNN).
  • The data were converted and the model was trained using an attention-based neural network model.
  • a schematic diagram of the process of learning an attention-based neural network model is shown in Figure 2a. Additionally, a schematic diagram of the process of converting input values within an attention-based neural network is shown in Figure 2b.
  • the attention mechanism actively calculates relationships between words and other words through self-attention.
  • an attention-based neural network model can predict the probability of the next amino acid in a given amino acid sequence.
  • An antibody sequence can be generated by repeatedly running an attention-based neural network model to predict and select amino acids with predicted probabilities.
  • the scaled dot-product self-attention function mapped the input vector to query, key, and value, denoted by Q, K, and V, respectively.
  • the attention weight between query and key was computed from the scaled dot product of Q and K, and the result was multiplied by the values: Attention(Q, K, V) = softmax(QK^T / √d_k) V
  • d_k is the dimension of the key vector.
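The scaled dot-product attention described above, Attention(Q, K, V) = softmax(QK^T/√d_k)V, can be written out in pure Python for tiny matrices; the 2x2 inputs are illustrative only, as real models operate on learned high-dimensional vectors.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of attention scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    scores = [[sum(q * k for q, k in zip(q_row, k_row)) / math.sqrt(d_k)
               for k_row in K] for q_row in Q]
    weights = [softmax(row) for row in scores]   # each row sums to 1
    return [[sum(w * v for w, v in zip(w_row, col)) for col in zip(*V)]
            for w_row in weights]

Q = [[1.0, 0.0], [0.0, 1.0]]
out = attention(Q, Q, [[1.0, 2.0], [3.0, 4.0]])  # each output row: convex mix of V rows
```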
  • the number of layers, embedding dimension, hidden layer dimension, and dropout rate were 12, 252, 1024, and 0.3, respectively.
  • Hyperparameters and early stopping criteria were determined by the validation set loss resulting from a grid search tracked with Weights & Biases (Lukas Biewald, Experiment Tracking with Weights & Biases, software available from wandb.com, 2020).
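Early stopping on validation loss, as mentioned above, can be sketched as follows; the patience value and loss sequence are illustrative assumptions, not the patent's actual criteria.

```python
def early_stop_index(val_losses, patience=3):
    """Return the step at which training stops: `patience` consecutive
    evaluations without improving on the best validation loss so far."""
    best = float("inf")
    bad = 0
    for i, loss in enumerate(val_losses):
        if loss < best:
            best, bad = loss, 0   # new best: reset the patience counter
        else:
            bad += 1
            if bad >= patience:
                return i          # stop: no improvement for `patience` steps
    return len(val_losses) - 1    # never triggered: train to the end

# Illustrative loss curve that bottoms out at 0.7 and then drifts upward:
stop = early_stop_index([1.0, 0.8, 0.7, 0.71, 0.72, 0.73, 0.74])
```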
  • the pre-trained attention-based neural network model learned general features of natural antibodies, such as the length distribution and theoretical isoelectric point (pI) value of HCDR3.
  • The HCDR3 length distribution and theoretical isoelectric point were calculated for sequences generated by the trained attention-based neural network model and for sequences obtained from the OAS dataset, and the results are shown in Figures 3a and 3b.
  • The HCDR3 lengths and calculated pI values of the sequences generated by the trained attention-based neural network model were similar to those of the OAS training sequences.
  • The number of antibody sequences varied widely across targets.
  • the number of antibodies per target ranged from a few to over 1,000.
  • PD-1 programmed cell death protein 1
  • MET Mesenchymal Epithelial Transition
  • less than 10 antibody sequences were identified.
  • Figure 4 shows the validation loss against the number of training steps. As shown in Figure 4, the validation loss on the patent dataset decreased faster and converged to a lower value for the pre-trained model than for the model trained from scratch. Transfer learning thus reduced the number of training steps to convergence by more than a factor of 4 and yielded a better-performing model.
  • Species were annotated with species tags using the Conditional Transformer (CTRL) algorithm. Additionally, to verify the effectiveness of the CTRL method, the OAS pre-trained model was used to generate sequences under three conditions: without a tag, with a human tag, and with a mouse tag.
  • CTRL Conditional Transformer
  • The 100 sequences with the lowest loss were checked for duplication against the training, validation, and test datasets, and then verified experimentally.
  • the scFv designed by AI in Example 1.5 was cleaved with SfiI restriction enzyme (New England Biolabs, USA) and cloned into pCombi3x vector or pCDisplay-4 vector.
  • 100 ng of recombinant scFv plasmid was incubated with ER2738 competent cells for 20 minutes on ice. The DNA and competent cells were then heat-shocked in a heating block at 42°C for 90 seconds and placed at 4°C for 10 minutes.
  • LB medium was added to the sample at four times the volume of the competent cells and incubated at 37°C and 180 rpm for 1 hour. The culture was then spread on an LB plate containing 50 μg/mL carbenicillin and incubated at 37°C overnight.
  • The OmpA (outer membrane protein A) signal peptide was used to secrete the scFv protein into the periplasmic space, and the bacterial outer membrane and peptidoglycan layer were removed to obtain the periplasmic fraction.
  • The scFv protein was fused to an HA tag at the C-terminus, so soluble scFv protein in the periplasmic fraction bound to the target antigen was detected by ELISA using an HRP-conjugated anti-HA antibody.
  • the deep well plate was centrifuged at 4000 rpm for 20 minutes to remove the supernatant.
  • The pellet was resuspended in 400 μL of STE buffer (20% sucrose, 50 mM Tris-Cl pH 8.0, 1 mM EDTA). 100 μL of 10 mg/mL lysozyme was added to each well and incubated on ice at 180 rpm for 10 minutes.
  • 50 μL of 1 M MgCl2 was added and incubated at 4°C and 180 rpm for 10 minutes.
  • The supernatant containing the periplasmic fraction was obtained by centrifugation at 4000 rpm for 20 minutes, yielding the soluble scFv protein.
  • the scFv protein bound to the target antigen was selected by enzyme-linked immunosorbent assay (ELISA).
  • ELISA enzyme-linked immunosorbent assay
  • Human PD-1 protein and human PD-L1 protein were purchased from ACROBiosystems (USA). 80 μL of each periplasmic fraction was added to a 96-well MaxiSorp plate coated with target antigen and incubated at 25°C for 2 hours.
  • Ni-NTA resin and 1-Step™ Ultra TMB-ELISA Substrate Solution were purchased from Thermo Scientific (USA), and disposable columns were purchased from BIO-RAD (USA).
  • the purified positive candidates were further evaluated in a dose-dependent manner in an ELISA assay.
  • Nivolumab (Opdivo®) and pembrolizumab (Keytruda®) were used as anti-PD-1 positive controls, and atezolizumab, avelumab, and durvalumab were used as anti-PD-L1 positive controls.
  • a self-produced non-specific (anti-LPA2) antibody was used as a negative control.
  • Purified antibodies, including positive and negative controls, were serially diluted in duplicate starting at 1000 nM. Two anti-PD-L1 antibodies (clones 162 and 163) failed at larger scale due to poor purification.
  • The 17 purified antibodies were further examined for dose-dependent binding activity to their corresponding antigens.
  • The concentration, yield, measured Kd value, and absorbance after purification of each scFv clone are shown in Table 2.
  • N/A means not analyzed.
  • Anti-PD-1 antibody clone 77 (0.094 nM) showed a lower EC50 (nM) value than the anti-PD-1 positive controls nivolumab (0.93 nM) and pembrolizumab (4.37 nM).
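The dose-dependent ELISA measurements above reduce to fitting a dose-response curve and reading off the EC50. A minimal sketch in Python, assuming a four-parameter logistic model and a simple grid-search fit on synthetic data (the concentrations, plateau values, and EC50 below are illustrative, not measured values from Table 2):

```python
import numpy as np

def four_pl(x, bottom, top, ec50, hill=1.0):
    """Four-parameter logistic (4PL) dose-response curve."""
    return bottom + (top - bottom) / (1.0 + (ec50 / x) ** hill)

def fit_ec50(conc, od):
    """Grid-search the EC50 (nM) that best explains a measured ELISA curve.
    Bottom/top plateaus are taken from the curve extremes (assumes saturation)."""
    bottom, top = od.min(), od.max()
    grid = np.logspace(-3, 3, 20000)  # candidate EC50 values, 0.001-1000 nM
    sse = [((four_pl(conc, bottom, top, c) - od) ** 2).sum() for c in grid]
    return float(grid[int(np.argmin(sse))])

# Synthetic serial 3-fold dilution starting at 1000 nM, true EC50 = 0.94 nM
conc = 1000.0 / 3.0 ** np.arange(12)
od450 = four_pl(conc, 0.05, 2.0, 0.94)
est = fit_ec50(conc, od450)
```

In practice a nonlinear least-squares fit with all four parameters free would replace the grid search; the sketch only illustrates how an EC50 is extracted from absorbance-versus-concentration data.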


Abstract

Provided, according to one aspect, are a method for generating an antibody amino acid sequence, a method for preparing an antibody drug using a generated antibody amino acid sequence, and a computer-readable medium recording a program for performing the same. According to the present invention, a complete antibody amino acid sequence can be generated while controlling target antigen specificity or species specificity, and the proportion of generated antibody amino acid sequences that bind to a target antigen can be increased.

Description

Method for generating antibody sequences using machine learning technology
Provided are a method of generating an antibody sequence using machine learning technology, a method of manufacturing an antibody pharmaceutical using the same, and a computer-readable medium for performing the same.
Therapeutic antibodies have been a focus of the pharmaceutical industry since 1986, when the U.S. FDA approved the first monoclonal antibody (mAb) product. Four therapeutic antibodies were among the top 10 best-selling drugs in 2021, and more than 100 therapeutic antibodies have now received FDA approval. A mAb has two types of chains (heavy and light) and two regions (variable and constant) in each chain. The variable heavy chain (VH) and variable light chain (VL) are primarily responsible for target specificity. VH consists of 114 amino acids and VL of 110, so the variable region comprises approximately 220 amino acids.
The number of possible variable-region amino acid sequences is A^L, where A = 20 is the number of amino acid types and L is the protein length, i.e., 220 for an antibody variable region, so it is impossible to explore all possible variable-region sequences even with the most powerful high-throughput screening methods. Taking site-specific amino acid frequencies in the constant region into account would reduce this number considerably. However, generating or optimizing heavy chain complementarity determining region 3 (HCDR3), which has an average length of 15 amino acids, still requires searching approximately 10^8 sequences. By contrast, deep mutational scanning-based mutagenesis libraries reach only about 10^4 sequences.
Because finding new and effective therapeutic antibodies in a vast sequence space is costly and laborious, deep learning is considered a powerful tool. Deep learning has shown remarkable progress in many fields, including images, natural language, and protein structure. Recent studies have also applied deep learning to drug discovery tasks such as predicting binding between HCDR3 and a target antigen. For example, deep learning models have been applied to screen in silico-optimized HCDR3 sequences from a library, where the screening library was generated by site-directed mutagenesis of every single position in HCDR3. However, such approaches are limited to a given template sequence, and generating the entire variable-region sequence would require an enormous amount of data that is not available.
Therefore, there is a need to generate complete antibody sequences even when data are scarce and to increase the proportion of predicted sequences that bind the antigen.
[Prior art literature]
[Patent document]
(Patent Document 1) KR 10-2022-0091497 A
Provided is a method of generating an antibody amino acid sequence.
Also provided is a method of manufacturing an antibody drug using the generated antibody amino acid sequence.
Also provided is a computer-readable medium recording a program for performing the method of generating an antibody amino acid sequence.
Provided is a method of generating an antibody amino acid sequence, the method comprising:
tagging, by a computing device, one or more first antibody amino acid sequences comprising one or more antibody amino acid sequences with a region or an amino acid of each antibody;
pre-training, by the computing device, a machine learning model by training general characteristics of antibodies on the tagged one or more first antibody amino acid sequences,
wherein the general characteristics are selected from the group consisting of HCDR3 length, HCDR2 length, HCDR1 length, LCDR3 length, LCDR2 length, LCDR1 length, and isoelectric point;
tagging, by the computing device, one or more second antibody amino acid sequences having information about a target antigen with the information about the target antigen;
analyzing, by the computing device, the first antibody amino acid sequence and the second antibody amino acid sequence with the pre-trained machine learning model to generate, from the first antibody amino acid sequence, a third antibody amino acid sequence having information about the target antigen;
verifying in vitro whether the generated third antibody amino acid sequence specifically binds to the target antigen; and
for a third antibody amino acid sequence that does not specifically bind to the target antigen, generating a fourth antibody amino acid sequence using the machine learning model again.
The term "antibody" is used interchangeably with the term "immunoglobulin" (Ig). The antibody may be, for example, IgA, IgD, IgE, IgG, or IgM. The antibody may be a monoclonal antibody or a polyclonal antibody. The antibody may be an animal-derived antibody, a mouse-human chimeric antibody, a humanized antibody, or a human antibody. A complete antibody consists of two full-length light chains and two full-length heavy chains, with each light chain linked to a heavy chain by a disulfide bond (SS-bond). Each heavy chain consists of a heavy chain variable region (VH) and a heavy chain constant region (composed of the CH1, hinge, CH2, and CH3 domains). Each light chain consists of a light chain variable region (VL) and a light chain constant region (CL). The VH and VL regions can be further subdivided into hypervariable regions called complementarity determining regions (CDRs), interspersed with framework regions (FRs).
Each VH and VL consists of three CDRs and four FRs arranged from the amino terminus to the carboxy terminus in the following order: FR1, CDR1, FR2, CDR2, FR3, CDR3, and FR4. Immunoglobulins can be divided into five classes, IgA, IgD, IgE, IgG, and IgM, according to the heavy chain constant domain amino acid sequence. IgA and IgG can be further divided into the isotypes IgA1, IgA2, IgG1, IgG2, IgG3, and IgG4. The antibody light chain of any vertebrate species can be assigned to one of two clearly distinct types, kappa (κ) and lambda (λ), based on the amino acid sequence of its constant domain.
A "complementarity determining region" (CDR) is the region of an antibody that binds antigen. There are three CDRs in the VH (HCDR1, HCDR2, HCDR3) and three CDRs in the VL (LCDR1, LCDR2, LCDR3). CDRs can be numbered by various schemes, such as Kabat (Wu et al., (1970) J Exp Med 132(2):211-250), Chothia (Chothia et al., (1987) J Mol Biol 196(4):901-917), IMGT (Lefranc et al., (2003) Dev Comp Immunol 27(1):55-77), and AbM (Martin and Thornton (1996) J Mol Biol 263(5):800-815).
A "framework region" or "FR" is an antibody region that acts as a scaffold for the CDRs. The framework regions are responsible for supporting the binding of antigen to the antibody. Framework residues include residues that contact the antigen, form part of the antibody's binding site, and lie close to the CDRs in sequence or in close proximity to the CDRs when folded into the three-dimensional structure. Framework residues may also include residues that do not contact the antigen but indirectly affect binding by contributing structural support to the CDRs. FRs can be numbered using various schemes, such as Kabat, Chothia, IMGT, and AbM. FR1, FR2, FR3, and FR4 include FRs defined by any of the methods described above. HCFR refers to heavy chain framework region FR1, FR2, FR3, or FR4. LCFR refers to light chain framework region FR1, FR2, FR3, or FR4.
Typical 1-letter and 3-letter amino acid codes are shown below: A (Ala) alanine; C (Cys) cysteine; D (Asp) aspartic acid; E (Glu) glutamic acid; F (Phe) phenylalanine; G (Gly) glycine; H (His) histidine; I (Ile) isoleucine; K (Lys) lysine; L (Leu) leucine; M (Met) methionine; N (Asn) asparagine; P (Pro) proline; Q (Gln) glutamine; R (Arg) arginine; S (Ser) serine; T (Thr) threonine; V (Val) valine; W (Trp) tryptophan; Y (Tyr) tyrosine.
The antibody includes antigen-binding fragments. An antigen-binding fragment is a fragment of the overall immunoglobulin structure and refers to a portion of the polypeptide that includes the part to which antigen can bind. For example, the antigen-binding fragment may be an scFv, (scFv)2, Fv, Fab, Fab', F(ab')2, diabody, triabody, tetrabody, Bis-scFv, nanobody, or a combination thereof. An antigen-binding fragment can be conjugated to another antibody, protein, antigen-binding fragment, or alternative scaffold to generate bispecific and multispecific proteins.
The antibody amino acid sequence may be the amino acid sequence of one selected from the group consisting of a full-length antibody, a heavy chain variable region, a light chain variable region, a complementarity determining region (CDR), and a framework region.
The antibody amino acid sequence may be data stored in a storage unit of a computing device.
The method includes tagging, by a computing device, one or more first antibody amino acid sequences comprising one or more antibody amino acid sequences with a region or an amino acid of each antibody.
The first antibody amino acid sequence may be obtained from the OAS (Observed Antibody Space) database or the PDB (Protein Data Bank) database. The OAS database may be a database containing sets of human repertoire antibody amino acid sequences. Antibody amino acid sequences stored in the OAS database may carry data such as species, vaccination history, and disease, but may lack data on target antigens. The PDB database holds three-dimensional structural data of biomolecules such as proteins and nucleic acids.
A tag refers to a keyword or term assigned to a piece of information (e.g., a database record, an Internet bookmark, or a computer program). The tag may be an unknown tag ([Unk]), a tag indicating a target antigen, a tag indicating a species, or a tag indicating the position of an amino acid. The species tag may indicate the species of the heavy chain variable region (VH species) or of the light chain variable region (VL species), and may indicate a species selected from the group consisting of human, mouse, rat, rabbit, and ape. The tag may be attached to the N-terminus, middle, or C-terminus of the antibody amino acid sequence. For example, one antibody amino acid sequence may be tagged, from the N-terminus, in the order [Unk][VH species][VL species]<vh3>...<vl3>...<end> or [target][VH species][VL species]<vh3>...<vl3>...<end>.
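The tag layout above amounts to simple string assembly. The helper below is an illustrative sketch only; the token names ([Unk], [target], <vh3>, <vl3>, <end>) follow the order described, but the exact model vocabulary and the CDR sequences used here are hypothetical:

```python
def tag_sequence(vh_cdr3, vl_cdr3, vh_species, vl_species, target=None):
    """Assemble a tagged training string in the order described above:
    [Unk] or [target] first, then VH/VL species tags, then the
    region-delimited sequence ending in <end>."""
    target_tag = f"[{target}]" if target else "[Unk]"
    return (f"{target_tag}[{vh_species}][{vl_species}]"
            f"<vh3>{vh_cdr3}<vl3>{vl_cdr3}<end>")

# Hypothetical CDR3 fragments, for illustration only
tagged = tag_sequence("ARDYW", "QQYNS", "human", "human", target="PD-1")
untagged = tag_sequence("ARDYW", "QQYNS", "mouse", "mouse")
```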
The method may further include tagging the first antibody amino acid sequence with a species selected from the group consisting of human, mouse, rat, rabbit, and ape.
The method includes pre-training, by the computing device, a machine learning model by training general characteristics of antibodies on the tagged one or more first antibody amino acid sequences.
The general characteristics may be selected from the group consisting of HCDR3 length, HCDR2 length, HCDR1 length, LCDR3 length, LCDR2 length, LCDR1 length, and isoelectric point. The isoelectric point (pI) is the pH at which the net charge of an ampholyte containing both anionic and cationic groups, such as a protein, is zero. The isoelectric point may be that of a full-length antibody, a heavy chain variable region, a light chain variable region, a complementarity determining region (CDR), or a framework region.
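A theoretical pI of this kind can be computed by bisecting for the pH at which the Henderson-Hasselbalch net charge of the sequence crosses zero. A minimal sketch, with the caveat that the pKa values below follow one commonly used convention; other pKa tables shift the result slightly, and the document does not specify which set was used:

```python
# Side-chain pKa values (one common convention; assumption, not from the source)
PKA_POS = {"K": 10.8, "R": 12.5, "H": 6.5}           # protonated (+) below pKa
PKA_NEG = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}  # deprotonated (-) above pKa
PKA_NTERM, PKA_CTERM = 8.6, 3.6

def net_charge(seq, ph):
    """Net charge of a peptide at a given pH (Henderson-Hasselbalch)."""
    pos = 1.0 / (1.0 + 10 ** (ph - PKA_NTERM))
    pos += sum(1.0 / (1.0 + 10 ** (ph - PKA_POS[a])) for a in seq if a in PKA_POS)
    neg = 1.0 / (1.0 + 10 ** (PKA_CTERM - ph))
    neg += sum(1.0 / (1.0 + 10 ** (PKA_NEG[a] - ph)) for a in seq if a in PKA_NEG)
    return pos - neg

def isoelectric_point(seq, lo=0.0, hi=14.0, tol=1e-4):
    """Bisect for the pH where the net charge crosses zero
    (net charge decreases monotonically with pH)."""
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if net_charge(seq, mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0
```

An acidic peptide should come out with a low pI and a basic one with a high pI, which is the sanity check used below.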
The machine learning model may receive antibody amino acid sequence information as input and output amino acid sequence information that is identical to the input or that has at least about 99%, about 98%, about 97%, about 96%, about 95%, about 94%, about 93%, about 92%, about 91%, about 90%, about 85%, about 80%, about 75%, about 70%, about 65%, about 60%, about 55%, or about 50% sequence identity to the input.
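For aligned sequences of equal length, the percent identity referred to above can be computed position-wise. A minimal sketch (real antibody comparisons would first align the sequences; the fragments below are hypothetical):

```python
def percent_identity(a, b):
    """Position-wise identity between two equal-length aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)

same = percent_identity("EVQLVESG", "EVQLVESG")   # identical fragments
close = percent_identity("EVQLVESG", "EVQLVDSG")  # one substitution in eight
```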
The machine learning model may run a machine learning algorithm. The machine learning algorithm may be selected from the group consisting of LGBM (Light Gradient Boosting Machine), AdaBoost, Voting Classifier, Random Forest, logistic regression, artificial neural network, and QDA (Quadratic Discriminant Analysis). The machine learning algorithm may be a deep learning algorithm. The machine learning algorithm may be linear logistic regression, a recurrent neural network (RNN), or a convolutional neural network (CNN).
The method includes tagging, by the computing device, one or more second antibody amino acid sequences having information about a target antigen with the information about the target antigen.
The second antibody amino acid sequence may be obtained from a patent database or a phage display library. The patent database includes a dataset in which antibody amino acid sequences extracted from patents have been structured. Regions in the patent database can be delimited according to the IMGT scheme using ANARCI (antigen receptor numbering and receptor classification). The patent database may have information on target antigens.
The method includes analyzing, by the computing device, the first antibody amino acid sequence and the second antibody amino acid sequence with the pre-trained machine learning model to generate, from the first antibody amino acid sequence, a third antibody amino acid sequence having information about the target antigen.
In the step of generating the third antibody amino acid sequence, language modeling can be used to predict and select the probability of the next amino acid in a given amino acid sequence.
The language modeling computes the conditional probability of the amino acid $x_i$ at position $i$ given the preceding amino acids $x_{<i}$:

$P(x) = \prod_{i=1}^{l} p(x_i \mid x_{<i})$    (1)

In Equation (1), $x_0$ denotes the initially given character, and $l$ denotes the total length of the sentence.
For example, the neural network function $f(x_{<i})$ can be converted into a conditional probability with a softmax function:

$p(x_i \mid x_{<i}) = \mathrm{softmax}(f(x_{<i}))$    (2)

where the softmax function is

$\mathrm{softmax}(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}$    (3)

In Equation (3), $K$ is the number of words (the vocabulary size), and $j$ and $k$ denote token indices within the vocabulary.
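Equations (1)-(3) can be sketched directly. The toy model below stands in for the neural network $f(x_{<i})$: its logits depend only on the last amino acid of the prefix, which a real model would not assume. It shows how softmax outputs compose into the sequence probability of Equation (1):

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # K = 20 token vocabulary

rng = np.random.default_rng(0)
# Toy stand-in for f(x_<i): a random logit table indexed by the last residue
W = rng.normal(size=(len(AMINO_ACIDS), len(AMINO_ACIDS)))

def softmax(z):
    z = z - z.max()  # subtract max for numerical stability (Eq. 3)
    e = np.exp(z)
    return e / e.sum()

def next_token_probs(prefix):
    """p(x_i | x_<i) = softmax(f(x_<i))  (Eq. 2)."""
    return softmax(W[AMINO_ACIDS.index(prefix[-1])])

def sequence_log_prob(seq):
    """log P(x) = sum_i log p(x_i | x_<i)  (Eq. 1)."""
    return sum(float(np.log(next_token_probs(seq[:i])[AMINO_ACIDS.index(seq[i])]))
               for i in range(1, len(seq)))

probs = next_token_probs("E")      # distribution over the 20 amino acids
lp = sequence_log_prob("EVQLV")    # log-probability of a short fragment
```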
The language modeling may use an attention-based neural network model. The attention mechanism actively calculates the relationship between each word and the other words through self-attention. For example, the attention mechanism can predict the probability of the next amino acid in a given amino acid sequence. By running it repeatedly, predicting and selecting amino acids according to the predicted probabilities, an antibody sequence can be generated. In the attention mechanism, hyperparameters and early-stopping criteria can be determined by validation-set loss as the result of a grid search.
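The scaled dot-product self-attention at the core of such a model, Attention(Q, K, V) = softmax(QK^T/√d_k)V, can be sketched as follows (a single head on random toy embeddings; the dimensions are illustrative):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # query-key similarities
    scores = scores - scores.max(axis=-1, keepdims=True)  # stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 8))                    # 5 token embeddings, dim 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out, attn = scaled_dot_product_attention(X @ Wq, X @ Wk, X @ Wv)
```

Each row of the attention matrix is a probability distribution over the input positions, which is what lets the model weigh every residue against every other.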
The method includes verifying in vitro whether the generated third antibody amino acid sequence specifically binds to the target antigen.
The target antigen refers to any molecule capable of mediating an immune response (e.g., a protein, peptide, polysaccharide, glycoprotein, glycolipid, nucleic acid, a portion thereof, or a combination thereof). The immune response may include antibody production and activation of immune cells such as T cells, B cells, or NK cells.
The target antigen may be selected from the group consisting of PD-1, PD-L1, CTLA-4, LAG-3, BTLA, CD200, CD276, KIR, TIM-1, TIM-3, TIGIT, VISTA, CD27, CD28, CD40, CD40L, CD70, CD75, CD80, CD86, CD73, CD137, GITR, GITRL, IL15, OX40, OX40L, IDO-1, IDO-2, A2AR, ICOS, ICOSL, 4-1BB, and 4-1BBL.
The step of verifying whether there is specific binding to the target antigen may be performed by a method selected from the group consisting of ELISA (enzyme-linked immunosorbent assay), radial immunodiffusion, immunoprecipitation assay, RIA (radioimmunoassay), immunofluorescence assay, and immunoblotting.
The verifying step may measure binding affinity for the target antigen, the equilibrium dissociation constant (KD), neutralization of the target antigen, or inhibition of the target antigen.
The in vitro verification of specific binding to the target antigen can use the absorbance at a wavelength of 450 nm or the EC50 (half-maximal effective concentration) in ELISA as the determining parameter. If the absorbance at 450 nm in ELISA is 0.1 or more, the third antibody amino acid sequence can be determined to specifically bind the target antigen.
When the third antibody amino acid sequence is determined to bind specifically to the target antigen, the method may terminate without passing the third antibody amino acid sequence back to the machine learning model.
The method includes generating a fourth antibody amino acid sequence by passing a third antibody amino acid sequence that does not bind specifically to the target antigen back through the machine learning model.
A third antibody amino acid sequence that does not bind specifically to the target antigen may be passed back to the machine learning model without its target-antigen tag. For example, when an antibody amino acid sequence carrying a target-antigen tag is generated, the tag may be replaced with the unknown tag ([Unk]) before the sequence is passed back to the machine learning model.
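The retagging step above can be sketched as follows, assuming tags are carried as bracketed tokens prepended to the sequence (an illustrative encoding, not the literal one disclosed here):

```python
def strip_antigen_tag(tagged_sequence, antigen_tags):
    """Replace any target-antigen tag with the unknown tag [Unk]
    before feeding the sequence back for another generation round."""
    return ["[Unk]" if tok in antigen_tags else tok for tok in tagged_sequence]

# Hypothetical tagged sequence: antigen tag followed by residue tokens
tokens = ["[PD-1]", "E", "V", "Q", "L"]
retagged = strip_antigen_tag(tokens, {"[PD-1]", "[PD-L1]"})
```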
Another aspect provides a method of manufacturing an antibody pharmaceutical, comprising producing an antibody in vitro from an antibody amino acid sequence generated according to one aspect.
A third or fourth antibody amino acid sequence that binds specifically to the target antigen can be used to produce an antibody in vitro. An antibody pharmaceutical can be manufactured by loading a polynucleotide encoding the antibody into a vector and transforming cells with the vector.
The polynucleotide may additionally include a nucleic acid encoding a signal sequence or leader sequence. As used herein, the term "signal sequence" refers to a signal peptide that directs secretion of a protein of interest. The signal peptide is cleaved after translation in the host cell. Specifically, the signal sequence is an amino acid sequence that initiates translocation of the protein across the endoplasmic reticulum (ER) membrane. After initiation, the signal sequence is cleaved within the lumen of the ER by a cellular enzyme commonly known as signal peptidase. The signal sequence may be the secretion signal sequence of tPA (tissue plasminogen activator), HSV gDs (signal sequence of herpes simplex virus glycoprotein D), or growth hormone. Preferably, a secretion signal sequence used in higher eukaryotic cells, including mammalian cells, can be used. The signal sequence may be used as the wild-type signal sequence, or with its codons replaced by codons frequently used in the host cell.
The vector may be introduced into a host cell and recombined into and inserted into the host cell genome. Alternatively, the vector is understood as a nucleic acid vehicle containing a polynucleotide sequence capable of autonomous replication as an episome. Such vectors include linear nucleic acids, plasmids, phagemids, cosmids, RNA vectors, viral vectors, and analogs thereof. Examples of viral vectors include, but are not limited to, retroviruses, adenoviruses, and adeno-associated viruses. Specifically, the vector may be plasmid DNA, phage DNA, or the like, for example a commercially developed plasmid (pUC18, pBAD, pIDTSAMRT-AMP, etc.), an E. coli-derived plasmid (pYG601BR322, pBR325, pUC118, pUC119, etc.), a Bacillus subtilis-derived plasmid (pUB110, pTP5, etc.), a yeast-derived plasmid (YEp13, YEp24, YCp50, etc.), phage DNA (Charon4A, Charon21A, EMBL3, EMBL4, λ, etc.), an animal virus vector (retrovirus, adenovirus, vaccinia virus, etc.), or an insect virus vector (baculovirus, etc.). Because the expression level and modification of the protein differ depending on the host cell, it is preferable to select and use the host cell best suited to the purpose.
The host cell of the transformed cell may include, but is not limited to, cells of prokaryotic, eukaryotic, mammalian, plant, insect, fungal, or cellular origin. E. coli may be used as an example of a prokaryotic cell, and yeast as an example of a eukaryotic cell. As the mammalian cell, CHO cells, F2N cells, CSO cells, BHK cells, Bowes melanoma cells, HeLa cells, 911 cells, AT1080 cells, A549 cells, HEK 293 cells, or HEK293T cells may be used, but the mammalian cell is not limited thereto, and any cell known to those skilled in the art to be usable as a mammalian host cell may be used.
When an expression vector is introduced into a host cell, methods such as the CaCl2 precipitation method; the Hanahan method, which improves the efficiency of CaCl2 precipitation by using the reducing agent DMSO (dimethyl sulfoxide); electroporation; calcium phosphate precipitation; protoplast fusion; agitation with silicon carbide fibers; Agrobacterium-mediated transformation; PEG-mediated transformation; dextran sulfate; lipofectamine; and desiccation/inhibition-mediated transformation may be used. To optimize the properties of the antibody pharmaceutical, or for other purposes, the glycosylation-related genes of the host cell may be manipulated by methods known to those skilled in the art to adjust the sugar chain pattern of the antibody (e.g., sialylation, fucosylation, glycosylation).
The transformed cells can be cultured using methods widely known in the art. Specifically, the culture may be carried out continuously in a batch process, a fed-batch process, or a repeated fed-batch process.
The antibody pharmaceutical may be a pharmaceutical composition for preventing or treating cancer. The cancer may be any one selected from the group consisting of stomach cancer, liver cancer, lung cancer, colon cancer, breast cancer, prostate cancer, gallbladder cancer, bladder cancer, kidney cancer, esophageal cancer, skin cancer, rectal cancer, osteosarcoma, multiple myeloma, glioma, ovarian cancer, pancreatic cancer, cervical cancer, endometrial cancer, thyroid cancer, laryngeal cancer, testicular cancer, mesothelioma, acute myeloid leukemia, chronic myeloid leukemia, acute lymphoblastic leukemia, chronic lymphoblastic leukemia, brain tumor, neuroblastoma, retinoblastoma, head and neck cancer, salivary gland cancer, and lymphoma.
Another aspect provides a computer-readable medium recording a program adapted to perform the method of generating an antibody amino acid sequence according to one aspect.
The computer-readable medium stores, in a storage unit (SSD/HDD), a program containing instruction codes that cause a computing device to execute the steps of:

tagging, by the computing device, one or more first antibody amino acid sequences, comprising one or more antibody amino acid sequences, with a region or amino acid of each antibody;

pretraining, by the computing device, a machine learning model by training general characteristics of antibodies on the tagged one or more first antibody amino acid sequences, wherein the general characteristics are selected from the group consisting of the length of HCDR3, the length of HCDR2, the length of HCDR1, the length of LCDR3, the length of LCDR2, the length of LCDR1, and the isoelectric point;

tagging, by the computing device, one or more second antibody amino acid sequences having information about a target antigen with the information about the target antigen; and

analyzing, by the computing device, the first antibody amino acid sequences and the second antibody amino acid sequences in the pretrained machine learning model to generate, from the first antibody amino acid sequences, a third antibody amino acid sequence having information about the target antigen.
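The tagging steps above can be sketched as follows; the bracketed tag tokens and the `tag_sequence` helper are illustrative assumptions, not the literal encoding disclosed here:

```python
def tag_sequence(seq, species="[Unk]", antigen="[Unk]"):
    """Prefix an antibody amino acid sequence with control tags
    (species and target antigen) and split it into per-residue tokens."""
    return [species, antigen] + list(seq)

# A first (untagged-antigen) sequence and a second (antigen-tagged) sequence
first = tag_sequence("QVQLVQSG", species="[Human]")
second = tag_sequence("QVQLVQSG", species="[Human]", antigen="[PD-1]")
```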
Antibody amino acid sequence information stored in an external server DB may be transmitted to the communication unit of the computing device via a local network connection (e.g., USB, LAN) or a remote server on a metropolitan network. The antibody amino acid sequence information transmitted to the communication unit may be stored in the storage unit or memory. The processing unit may process the antibody amino acid sequence data stored in the storage unit to generate a novel antibody amino acid sequence (FIG. 5).
According to the method of generating an antibody amino acid sequence according to one aspect, the method of manufacturing an antibody pharmaceutical using the generated antibody amino acid sequence, and the computer-readable medium recording a program applying the same, the entire antibody amino acid sequence can be generated while controlling target antigen specificity or species specificity, and the proportion of antibody amino acid sequences that bind to the target antigen can be increased.
FIG. 1 is a schematic diagram showing the deep learning method according to one aspect, with antibody sequence generation, ELISA validation, and iteration thereof.
FIG. 2a is a schematic diagram of the process of training the attention-based neural network model, and FIG. 2b is a schematic diagram of the process by which input values are transformed inside the attention-based neural network.
FIG. 3a is a graph showing the length distribution of HCDR3 in amino acid sequences generated by the trained attention-based neural network and in sequences obtained from the OAS dataset, and FIG. 3b is a graph showing the distribution of theoretical isoelectric points calculated from amino acid sequences generated by the trained attention-based neural network and from sequences obtained from the OAS dataset.
FIG. 4 is a graph comparing the learning curves of a pretrained model (Finetuned) and a model trained from scratch (From scratch).
FIG. 5 is a schematic diagram of a computing device according to one aspect.
FIG. 6 is a flowchart showing a method of producing an antibody amino acid sequence according to one aspect.
Hereinafter, the present invention will be described in more detail through examples. However, these examples are for illustrative purposes only, and the scope of the present invention is not limited thereto.
Example 1. Prediction of antibody sequences using a deep learning model
1.1. Method for predicting antibody sequences
Deep learning models have been used to screen in silico-optimized HCDR3 sequences from libraries. A screening library is created by introducing site-directed mutagenesis at every single position of HCDR3. A limitation of this existing approach, however, is that it focuses only on selecting an optimized HCDR3 for a given template sequence. This is because the amount of data required by a deep learning model grows exponentially as the region of interest expands, for example to the entire variable region. In machine learning theory, this is called the curse of dimensionality.
In this example, a method of designing antibody sequences by combining deep learning with biological assays was devised. To design the entire sequence of the variable region, a generative model was used rather than a supervised model that only predicts whether a sequence is target-specific. A generative model can design novel therapeutic antibody sequences by predicting the characteristics of other antibodies.
Generating the entire sequence of the variable region also requires an enormous amount of data because of the curse of dimensionality. To cope with the data shortage, a transfer learning approach was applied: the model was pretrained on a large dataset to learn the general characteristics of antibodies and then fine-tuned with target-specific data. After training the deep learning model and sampling new sequences from it, the generated antibodies were experimentally validated for antigen-antibody binding by ELISA.
Model training, sequence generation, and biological validation form a feedback loop in which the assay results are fed back into the training model (FIG. 1). This loop of training, querying (sequence generation), and validation is called active learning, and it is effective when labeled data are very scarce. As a result, active learning effectively searches for antibody sequences whose HCDR3 and other regions differ from native sequences, and the number of discovered sequences increases as further iterations are performed.
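The training, query, and validation loop described above can be sketched as follows; the function names (`finetune`, `generate`, `elisa_validate`) are placeholders for the steps in FIG. 1, not actual implementations:

```python
def active_learning(model, seed_data, rounds, finetune, generate, elisa_validate):
    """Iterate training, sequence generation, and experimental validation,
    feeding validated binding labels back into the training data."""
    data = list(seed_data)
    hits = []
    for _ in range(rounds):
        model = finetune(model, data)          # update model on current labels
        candidates = generate(model)           # sample new antibody sequences
        labeled = elisa_validate(candidates)   # wet-lab binding labels
        hits += [seq for seq, binds in labeled if binds]
        data += labeled                        # feedback loop closes here
    return model, hits
```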
1.2. Preparation of datasets
The Observed Antibody Space (OAS) database is an extensive antibody database created in 2018 in which antibody sequences are cleaned, annotated, and translated. Specifically, the OAS database is a cleaned dataset containing only VH and VL sequences, numbered according to the IMGT (ImMunoGeneTics) scheme. OAS carries rich metadata such as species, vaccination history, and disease. Because cancer is not a disease annotated in OAS, the unknown tag ([Unk]) was used for the entire OAS dataset. Each VH and VL sequence can be divided into seven regions, comprising three CDRs and four frameworks. The OAS database is already partitioned into regions by IMGT numbering, but the in-house patent database was not, so ANARCI (antigen receptor numbering and receptor classification) was used to partition the in-house patent dataset into regions according to the IMGT scheme. In addition, because ANARCI provides a species prediction for a given VH or VL sequence, the species predicted by ANARCI was used as the species tag for the patent dataset.

For sound training, each dataset was randomly divided into training, validation, and test sets in a 6:2:2 ratio. During training, only the training set was used to update the model; the loss on the validation set was monitored to select hyperparameters and to stop training before the model overfitted. Finally, the performance of the model was confirmed on the test set.
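A minimal sketch of the random 6:2:2 split described above (the fixed seed and the record format are illustrative assumptions):

```python
import random

def split_dataset(records, seed=0, ratios=(0.6, 0.2, 0.2)):
    """Shuffle records and split them into training, validation,
    and test sets in the given 6:2:2 proportions."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)      # reproducible shuffle
    n = len(shuffled)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

train, val, test = split_dataset(list(range(100)))
```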
Because the OAS dataset contains no antigen-binding data, sequences generated by a model trained only on the OAS dataset have no target specificity. To obtain target specificity, another dataset, such as antigen-antibody pairing data, is needed. As no such data have been published, approximately 6,000 antibodies against 15 tumor targets, including antibodies with known target antigens, were collected from patents.
1.3. Language modeling and the attention mechanism
Language modeling is a sequential model that assigns a probability P(x) to an entire sequence x of length l. P(x) can be factorized into the conditional probability of the character at the i-th position, x_i, given the preceding characters x_{<i}:
P(x) = \prod_{i=1}^{l} P(x_i \mid x_{<i})    (1)
In Equation (1), x_0 denotes the initially given character, and l denotes the total length of the sequence.
Rather than modeling the probability of the entire sequence at once, approximating the conditional probability of the next token (a word, character, or amino acid) given the previous tokens is advantageous for model simplification and sampling efficiency. For example, the output of a neural network function f(x_{<i}) can be converted into a conditional probability with the softmax function:
P(x_i \mid x_{<i}) \approx \mathrm{softmax}(f(x_{<i}))    (2)
Here, the softmax function is

\mathrm{softmax}(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}    (3)
In Equation (3), K is the number of words (the vocabulary size), and j and k denote the indices of tokens in the vocabulary. Many neural network architectures, such as linear logistic regression, recurrent neural networks (RNNs), and convolutional neural networks (CNNs), can be configured to estimate such conditional probabilities.
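As a numeric illustration of Equations (2) and (3), the following sketch applies the softmax to illustrative logits f(x_{<i}) over a toy three-amino-acid vocabulary and greedily picks the next amino acid; all values are assumptions for illustration only:

```python
import math

def softmax(z):
    """Softmax over a vector of logits, per Equation (3)."""
    m = max(z)                          # subtract max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

# Toy vocabulary of K = 3 amino acids with illustrative logits f(x_<i)
vocab = ["A", "G", "V"]
logits = [2.0, 0.5, 0.1]
probs = softmax(logits)
next_token = vocab[probs.index(max(probs))]   # greedy choice of next amino acid
```

Sampling from `probs` instead of taking the argmax would generate diverse sequences, as described for the generative model.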
An attention-based neural network model was trained. A schematic diagram of the process of training the attention-based neural network model is shown in FIG. 2a, and a schematic diagram of the process by which input values are transformed inside the attention-based neural network is shown in FIG. 2b.
The model is an attention-based neural network. The attention mechanism actively computes the relationships between a word and other words through self-attention. Specifically, the attention-based neural network model can predict the probability of the next amino acid given an amino acid sequence. By running the model iteratively to predict and select amino acids according to the predicted probabilities, an antibody sequence can be generated.
Specifically, the scaled dot-product self-attention function maps input vectors to queries, keys, and values, denoted Q, K, and V, respectively. The weights between the queries and keys are computed from Q and K and multiplied by the values:

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V    (4)
In Equation (4), d_k is the dimension of the key vectors.
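Equation (4) can be illustrated with the following NumPy sketch; the matrices and their shapes are illustrative only:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, per Equation (4)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # query-key similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # row-wise softmax
    return weights @ V, weights

# Two positions, d_k = 2 (illustrative values only)
Q = np.array([[1.0, 0.0], [0.0, 1.0]])
K = np.array([[1.0, 0.0], [0.0, 1.0]])
V = np.array([[1.0, 2.0], [3.0, 4.0]])
out, attn = scaled_dot_product_attention(Q, K, V)
```

Each row of `attn` is a probability distribution over positions, and `out` is the corresponding weighted mixture of the value vectors.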
1.4. Implementation of the attention-based neural network model
For the attention-based neural network model, the number of layers, the embedding dimension, the hidden layer dimension, and the dropout rate were 12, 252, 1024, and 0.3, respectively. The hyperparameters and the early stopping criterion were determined by the validation set loss as the result of a grid search using Weights & Biases (Lukas Biewald, Experiment Tracking with Weights & Biases, software available from wandb.com, January 2020).
The pretrained attention-based neural network model learned general characteristics of natural antibodies, such as the length distribution of HCDR3 and the theoretical isoelectric point (pI). The length distribution of HCDR3 and the theoretical isoelectric points were calculated for amino acid sequences generated by the trained attention-based neural network model and for sequences obtained from the OAS dataset, and the results are shown in FIGS. 3a and 3b. As shown in FIGS. 3a and 3b, the HCDR3 lengths and calculated pI values of the amino acid sequences generated by the trained model were similar to those of the OAS sequences it was trained on.
1.5. Generation of target- and species-specific antibodies
The number of antibody sequences varied widely across targets, ranging from a few to more than a thousand per target. For example, there were 1,313 anti-PD-1 (programmed cell death protein 1) antibodies, whereas fewer than 10 antibody sequences were identified for MET (mesenchymal-epithelial transition). To learn target specificity, the model was fine-tuned on the target-specific dataset starting from the pretrained model weights. The learning curves of the model pretrained and then applied to the patent data (Finetuned) and the model applied directly to the patent data without pretraining (From scratch) were compared, and the results are shown in FIG. 4 (lr: learning rate, the value that determines the optimization speed of the model). FIG. 4 shows the validation loss against the number of training steps. As shown in FIG. 4, the validation loss on the patent dataset decreased faster and converged to a lower value for the pretrained model than for the model trained from scratch. Transfer learning therefore made training to convergence more than four times faster and gave the model better performance.
Species were annotated with species tags using the Conditional Transformer (CTRL) algorithm. In addition, to verify the effectiveness of the CTRL method, sequences were generated from the OAS-pretrained model under three conditions: without a tag, with the human tag, and with the mouse tag.
The effectiveness of the CTRL method in the controlled-generation framework was compared between human and mouse, and the results are shown in Table 1. The numbers in Table 1 represent the proportion of the target species among all sequences generated from the model.
[Table 1]
As shown in Table 1, without a tag, the proportion of generated sequences with human VH and VL frameworks was nearly zero. By contrast, more than half of the sequences generated by the attention-based neural network model with CTRL applied were classified as human frameworks in the VH-VL regions. A simple change from the human tag to the mouse tag caused the model to generate different frameworks. Because OAS contains twice as many mouse sequences as human sequences, CTRL gave a higher proportion with the mouse tag than with the human tag.
Among the generated sequences, the 100 sequences with the lowest loss were confirmed not to overlap with the training, validation, or test datasets, and experimental validation was performed on them.
Example 2. Validation of predicted antibody sequences
2.1. Antigen binding of predicted antibodies
The scFvs designed by the AI in Example 1.5 were digested with the SfiI restriction enzyme (New England Biolabs, USA) and cloned into the pCombi3x vector or the pCDisplay-4 vector. 100 ng of the recombinant scFv plasmid was incubated with ER2738 bacterial competent cells on ice for 20 minutes. The DNA and competent cells were then incubated in a heat block at 42°C for 90 seconds, followed by incubation at 4°C for 10 minutes. LB medium equal to four times the volume of the competent cells was added to the sample, which was incubated at 37°C and 180 rpm for 1 hour. The cells were plated on LB plates containing 50 μg/mL carbenicillin and incubated overnight at 37°C.
To determine whether the AI-designed scFvs bind their antigens (PD-1 and PD-L1), 218 AI-designed anti-PD-1 clones and 183 AI-designed anti-PD-L1 clones were selected and screened by immunoassay. A bacterial expression system was used first, because it requires less effort, time, and cost than a mammalian expression system while performing well.
The OmpA (outer membrane protein A) signal peptide was used to secrete the scFv protein into the periplasmic space, and the bacterial outer membrane and peptidoglycan layer were removed to obtain the periplasmic fraction. The scFv protein was fused to an HA tag at its C-terminus; therefore, soluble scFv protein in the periplasmic fraction bound to the target antigen was detected by ELISA using an HRP-conjugated anti-HA antibody.
Specifically, 750 μL of SB medium (20 g yeast extract, 30 g tryptone, 10 g MOPS per 1 L of water, pH 7.0) containing carbenicillin was added to each well of a 2.2 mL polypropylene deep-well plate (Axygen). Colonies of scFv transformants were inoculated into each well and incubated at 37°C, 180 rpm, for about 3-4 hours. Then 1 mM IPTG (isopropyl β-D-1-thiogalactopyranoside; Solution, Korea) was added to each well and incubated at 30°C, 180 rpm, for 20 hours to induce scFv protein expression. After 20 hours of induction, the deep-well plate was centrifuged at 4000 rpm for 20 minutes and the supernatant was removed. The pellet was resuspended in 400 μL of STE buffer (20% sucrose, 50 mM Tris-Cl pH 8.0, 1 mM EDTA). 100 μL of 10 mg/mL lysozyme was added to each well and incubated on ice at 180 rpm for 10 minutes. To remove the outer membrane and peptidoglycan layer, 50 μL of 1 M MgCl2 was added and incubated at 4°C, 180 rpm, for 10 minutes. The supernatant containing the periplasmic fraction, i.e., the soluble scFv protein, was obtained by centrifugation at 4000 rpm for 20 minutes.
scFv proteins binding the target antigen were selected by enzyme-linked immunosorbent assay (ELISA). As antigens, human PD-1 and human PD-L1 proteins were purchased from ACROBiosystems (USA). 80 μL of each periplasmic fraction was applied to a 96-well MaxiSorp plate coated with the target antigen and incubated at 25°C for 2 hours. Ni-NTA resin and 1-Step™ Ultra TMB-ELISA Substrate Solution were purchased from Thermo Scientific (USA), and disposable columns were purchased from BIO-RAD (USA).
Rabbit HRP-conjugated anti-HA antibody (Bethyl, USA; 1:3000 dilution), 3,3',5,5'-tetramethylbenzidine solution, and 2.5 N H2SO4 were applied sequentially. Antibodies showing an absorbance of 0.1 or higher at 450 nm were selected as target antigen-binding antibodies. The expression level of each clone was verified by immunoblotting. 4 μL of each periplasmic extract was spotted onto a nitrocellulose (NC) membrane (GE Healthcare Life Science, Germany) and incubated with blocking buffer (5% nonfat dry milk (BD, USA) in 0.05% PBST) at 25°C for 1 hour. The blocked NC membrane was incubated with HRP-conjugated anti-HA antibody (1:3000) at 25°C for 1 hour, and chemiluminescence was detected. As a result, 9 anti-PD-1 antibodies and 10 anti-PD-L1 antibodies showed binding to their corresponding antigens in the initial screening. In parallel, immunoblotting was performed on the same periplasmic fractions to confirm periplasmic expression of each candidate.
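The 0.1 absorbance cutoff described above can be applied programmatically when tabulating plate-reader output. The sketch below assumes OD450 readings have already been exported per clone; the clone names and values are illustrative only, not data from this study.

```python
# Select target-binding clones by the OD450 >= 0.1 ELISA cutoff.
# Clone IDs and absorbance readings below are hypothetical examples.
OD450_CUTOFF = 0.1

def select_binders(od450_by_clone, cutoff=OD450_CUTOFF):
    """Return clone IDs whose 450 nm absorbance meets the cutoff, sorted by name."""
    return sorted(clone for clone, od in od450_by_clone.items() if od >= cutoff)

readings = {"clone_77": 1.25, "clone_162": 0.04, "clone_187": 0.88, "clone_201": 0.02}
print(select_binders(readings))  # -> ['clone_187', 'clone_77']
```

In practice a blank-well reading would typically be subtracted before applying the cutoff.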
Purified positive candidates were further evaluated in a dose-dependent ELISA. Nivolumab (Opdivo®) and pembrolizumab (Keytruda®) were used as anti-PD-1 positive controls; atezolizumab, avelumab, and durvalumab were used as anti-PD-L1 positive controls. An in-house non-specific (anti-LPA2) antibody was used as a negative control. Purified antibodies, including the positive and negative controls, were serially diluted in duplicate starting at 1000 nM. At larger scale, two anti-PD-L1 antibodies (clones 162 and 163) could not be purified and were excluded. The remaining 17 purified antibodies were further examined for dose-dependent binding activity to their corresponding antigens. The post-purification concentration, yield, measured Kd value, and absorbance of each scFv clone are shown in Table 2; in Table 2, N/A means not analyzed.
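Dose-dependent potencies such as the EC50 values reported below for clone 77 are read off a serial-dilution binding curve. As a minimal sketch, the function below estimates EC50 by linear interpolation between the two dilution points bracketing the half-maximal signal; a four-parameter logistic fit is more common in practice, and all concentrations and readings here are made-up illustrations.

```python
def estimate_ec50(concs_nM, signals):
    """Estimate EC50 as the concentration where the signal crosses half its maximum.
    concs_nM must be sorted ascending; signals are the matching ELISA readings."""
    half = max(signals) / 2.0
    pairs = list(zip(concs_nM, signals))
    for (c0, s0), (c1, s1) in zip(pairs, pairs[1:]):
        if s0 <= half <= s1:  # half-max crossing lies between these two dilutions
            frac = (half - s0) / (s1 - s0)
            return c0 + frac * (c1 - c0)
    return None  # curve never reaches half-max within the tested range

# Illustrative serial dilution (nM) and ELISA readings
concs = [0.01, 0.1, 1.0, 10.0, 100.0]
sig = [0.05, 0.40, 1.10, 1.80, 2.00]
print(round(estimate_ec50(concs, sig), 2))  # -> 0.87
```

Interpolating on log-transformed concentrations, as sigmoidal dose-response fits do, would give a slightly different estimate.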
[Table 2: post-purification concentration, yield, measured Kd value, and absorbance of each scFv clone — image PCTKR2023013015-appb-img-000010 not reproduced]
Of the antibodies listed in Table 2, five (four anti-PD-1 and one anti-PD-L1) showed no binding to their target antigen. Thus, 14 AI-designed antibodies were identified as positive clones binding the target antigen. Anti-PD-1 antibody clone 77 (0.094 nM) showed a lower EC50 (nM) than the anti-PD-1 positive controls nivolumab (0.93 nM) and pembrolizumab (4.37 nM). Although none exceeded the binding activity of the PD-L1 control antibodies atezolizumab (0.17 nM), avelumab (0.35 nM), and durvalumab (0.71 nM), anti-PD-L1 antibody clone 187 (0.82 nM) showed PD-L1 binding activity comparable to durvalumab.
Thus, it was confirmed that novel antibodies against PD-1 and PD-L1 can be successfully identified by combining AI-based de novo generation with in vitro experimental studies.

Claims (15)

  1. A method of generating an antibody amino acid sequence, the method comprising:
    tagging, by a computing device, one or more first antibody amino acid sequences, each comprising one or more antibody amino acid sequences, with the region or amino acid of each antibody;
    pretraining, by the computing device, a machine learning model by training the tagged one or more first antibody amino acid sequences on general characteristics of antibodies,
    wherein the general characteristics are selected from the group consisting of the length of HCDR3, the length of HCDR2, the length of HCDR1, the length of LCDR3, the length of LCDR2, the length of LCDR1, and isoelectric point;
    tagging, by the computing device, one or more second antibody amino acid sequences having information about a target antigen with the information about the target antigen;
    analyzing, by the computing device, the first antibody amino acid sequence and the second antibody amino acid sequence in the pretrained machine learning model to generate, from the first antibody amino acid sequence, a third antibody amino acid sequence having information about the target antigen;
    verifying in vitro, using the generated third antibody amino acid sequence, whether there is specific binding to the target antigen; and
    for a third antibody amino acid sequence showing no specific binding to the target antigen, generating a fourth antibody amino acid sequence again using the machine learning model.
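The generate-then-verify loop of claim 1 can be sketched as control flow. In the sketch below, `DummyModel`, the toy sequences, and the `verify` callback are placeholder abstractions, since the claim does not prescribe a concrete model architecture or assay.

```python
# Minimal sketch of the claim-1 loop: generate candidates conditioned on a
# target-antigen tag, keep in vitro binders, and regenerate from non-binders.
class DummyModel:
    """Stand-in for the pretrained sequence model (illustrative only)."""
    def generate(self, seqs, tag):
        # Pretend that conditioning on the antigen tag changes the last residue.
        return [s[:-1] + ("W" if tag else "A") for s in seqs]

def design_antibodies(model, seed_seqs, antigen_tag, verify, rounds=3):
    candidates = model.generate(seed_seqs, tag=antigen_tag)  # "third" sequences
    binders = []
    for _ in range(rounds):
        hits = [s for s in candidates if verify(s)]
        binders.extend(hits)
        misses = [s for s in candidates if not verify(s)]
        if not misses:
            break
        # Non-binders are fed back without the antigen tag (cf. claim 13).
        candidates = model.generate(misses, tag=None)         # "fourth" sequences
    return binders

hits = design_antibodies(DummyModel(), ["EVQLVES", "QVQLQQS"], "PD-1",
                         verify=lambda s: s.endswith("W"))
print(hits)  # -> ['EVQLVEW', 'QVQLQQW']
```

In the claimed method the `verify` step is a wet-lab assay (e.g., the ELISA of claim 10), so each loop iteration corresponds to a design-build-test round rather than a pure computation.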
  2. The method of claim 1, wherein the antibody amino acid sequence is the amino acid sequence of one selected from the group consisting of a full-length antibody, a heavy chain variable region, a light chain variable region, a complementarity determining region (CDR), and a framework region.
  3. The method of claim 1, wherein the first antibody amino acid sequence is obtained from the Observed Antibody Space (OAS) database or the Protein Data Bank (PDB) database.
  4. The method of claim 1, further comprising tagging the first antibody amino acid sequence with a species selected from the group consisting of human, mouse, rat, rabbit, and ape.
  5. The method of claim 1, wherein the second antibody amino acid sequence is obtained from a patent database or a phage display library.
  6. The method of claim 1, wherein generating the third antibody amino acid sequence comprises using language modeling to predict the probability of, and select, the amino acid that follows each amino acid in a given amino acid sequence.
  7. The method of claim 6, wherein the language modeling calculates the conditional probability of xi, the amino acid at the i-th position, given the preceding amino acids x<i.
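The conditional probability of claim 7 is the standard autoregressive factorization P(x) = Π_i P(xi | x<i). A toy version over amino-acid letters, conditioning only on the immediately preceding residue, illustrates the computation; the training sequences and counts below are made up, and the claimed method may condition on the full prefix x<i rather than a bigram context.

```python
from collections import Counter, defaultdict

# Toy autoregressive model: P(x_i | x_<i) approximated by bigram counts,
# i.e., conditioning only on the residue immediately before position i.
def train_bigram(sequences):
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            counts[prev][cur] += 1
    return counts

def next_residue_probs(counts, prefix):
    """Probability of each possible next amino acid given the prefix."""
    c = counts[prefix[-1]]
    total = sum(c.values())
    return {aa: n / total for aa, n in c.items()}

model = train_bigram(["CARDY", "CARGY", "CASDY"])
print(next_residue_probs(model, "CA"))  # R appears after A in 2 of 3 sequences
```

Generation then proceeds by repeatedly sampling (or greedily selecting) the next residue from this conditional distribution and appending it to the prefix, as claim 6 describes.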
  8. The method of claim 6, wherein the language modeling uses an attention-based neural network model.
  9. The method of claim 1, wherein the target antigen is selected from the group consisting of PD-1, PD-L1, CTLA-4, LAG-3, BTLA, CD200, CD276, KIR, TIM-1, TIM-3, TIGIT, VISTA, CD27, CD28, CD40, CD40L, CD70, CD75, CD80, CD86, CD73, CD137, GITR, GITRL, IL15, OX40, OX40L, IDO-1, IDO-2, A2AR, ICOS, ICOSL, 4-1BB, and 4-1BBL.
  10. The method of claim 1, wherein verifying whether there is specific binding to the target antigen comprises performing a method selected from the group consisting of enzyme-linked immunosorbent assay (ELISA), radial immunodiffusion, immunoprecipitation assay, radioimmunoassay (RIA), immunofluorescence assay, and immunoblotting.
  11. The method of claim 10, wherein verifying in vitro whether there is specific binding to the target antigen comprises determining, as a parameter, the absorbance at a wavelength of 450 nm or the half maximal effective concentration (EC50) in an ELISA.
  12. The method of claim 11, wherein the third antibody amino acid sequence is determined to have specific binding to the target antigen when the absorbance at a wavelength of 450 nm in the ELISA is 0.1 or higher.
  13. The method of claim 1, wherein a third antibody amino acid sequence showing no specific binding to the target antigen is passed back to the machine learning model without a tag for the target antigen.
  14. A method of producing an antibody pharmaceutical, comprising producing, in vitro, an antibody from the antibody amino acid sequence generated by the method of claim 1.
  15. A computer-readable medium recording a program adapted to perform the method according to any one of claims 1 to 14.
PCT/KR2023/013015 2022-09-01 2023-08-31 Method for generating antibody sequence using machine learning technology WO2024049245A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
KR10-2022-0110819 2022-09-01
KR1020220110819A KR20240031723A (en) 2022-09-01 2022-09-01 Method for generating antibody sequence using machine learning

Publications (1)

Publication Number Publication Date
WO2024049245A1 true WO2024049245A1 (en) 2024-03-07

Family

ID=90098434

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2023/013015 WO2024049245A1 (en) 2022-09-01 2023-08-31 Method for generating antibody sequence using machine learning technology

Country Status (2)

Country Link
KR (1) KR20240031723A (en)
WO (1) WO2024049245A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20180022537A (en) * 2016-08-23 2018-03-06 주식회사 스탠다임 Method for predicting therapeutic efficacy of combined drug by machine learning ensemble model
US20190065677A1 (en) * 2017-01-13 2019-02-28 Massachusetts Institute Of Technology Machine learning based antibody design
KR20220026869A (en) * 2020-08-26 2022-03-07 이화여자대학교 산학협력단 A novel method for generating an antibody library and the generated library therefrom

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2022551119A (en) 2019-10-03 2022-12-07 ヤンセン バイオテツク,インコーポレーテツド Methods for producing biotherapeutic agents with increased stability by sequence optimization


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
MARK L. CHIU, DENNIS R. GOULET, ALEXEY TEPLYAKOV, GARY L. GILLILAND: "Antibody Structure and Function: The Basis for Engineering Therapeutics", ANTIBODIES, vol. 8, no. 4, 3 December 2019 (2019-12-03), pages 55, XP055702945, DOI: 10.3390/antib8040055 *
SAKA KOICHIRO, KAKUZAKI TARO, METSUGI SHOICHI, KASHIWAGI DAIKI, YOSHIDA KENJI, WADA MANABU, TSUNODA HIROYUKI, TERAMOTO REIJI: "Antibody design using LSTM based deep generative model from phage display library for affinity maturation", SCIENTIFIC REPORTS, vol. 11, no. 1, 1 December 2021 (2021-12-01), XP055876990, DOI: 10.1038/s41598-021-85274-7 *

Also Published As

Publication number Publication date
KR20240031723A (en) 2024-03-08

Similar Documents

Publication Publication Date Title
Perchiacca et al. Engineering aggregation-resistant antibodies
Bradbury et al. Beyond natural antibodies: the power of in vitro display technologies
Fridy et al. A robust pipeline for rapid production of versatile nanobody repertoires
Venkataraman et al. A toolbox of immunoprecipitation-grade monoclonal antibodies to human transcription factors
Ebo et al. An in vivo platform to select and evolve aggregation-resistant proteins
EP2482212A1 (en) Method of acquiring proteins with high affinity by computer aided design
EP2871189A1 (en) High-affinity monoclonal anti-strep-tag antibody
Townsend et al. Augmented binary substitution: single-pass CDR germ-lining and stabilization of therapeutic antibodies
Almagro et al. Characterization of a high‐affinity human antibody with a disulfide bridge in the third complementarity‐determining region of the heavy chain
Finlay et al. Phage display: a powerful technology for the generation of high specificity affinity reagents from alternative immune sources
CN113646330A (en) Engineered CD25 polypeptides and uses thereof
Entzminger et al. De novo design of antibody complementarity determining regions binding a FLAG tetra-peptide
Lee et al. A two-in-one antibody engineered from a humanized interleukin 4 antibody through mutation in heavy chain complementarity-determining regions
US20140030253A1 (en) HUMANIZED FORMS OF MONOCLONAL ANTIBODIES TO HUMAN GnRH RECEPTOR
Finlay et al. Phage display: a powerful technology for the generation of high-specificity affinity reagents from alternative immune sources
Murphy et al. Enhancing recombinant antibody performance by optimally engineering its format
Al Qaraghuli et al. Defining the complementarities between antibodies and haptens to refine our understanding and aid the prediction of a successful binding interaction
WO2024049245A1 (en) Method for generating antibody sequence using machine learning technology
Sakaguchi et al. Rapid and reliable hybridoma screening method that is suitable for production of functional structure-recognizing monoclonal antibody
Muzard et al. Grafting of protein L-binding activity onto recombinant antibody fragments
KR20080033877A (en) Protein, method for immobilizing protein, structure, biosensor, nucleic acid, vector and kit for detecting target substance
Karadag et al. Physicochemical determinants of antibody-protein interactions
Wu et al. A fast and efficient procedure to produce scFvs specific for large macromolecular complexes
Gilodi et al. Selection and modelling of a new single-domain intrabody against TDP-43
Zhang et al. Humanization of the shark VNAR single domain antibody using CDR grafting

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23860920

Country of ref document: EP

Kind code of ref document: A1