WO2024080783A1

WO2024080783A1 - Apparatus and method for generating TCR information corresponding to pMHC using artificial intelligence technology

Info

Publication number: WO2024080783A1
Application number: PCT/KR2023/015730
Authority: WO
Inventors: 송성재; 함박눈; 서정한; 임채열
Original assignee: 주식회사 네오젠티씨
Priority date: 2022-10-14
Filing date: 2023-10-12
Publication date: 2024-04-18
Also published as: KR102547977B1; KR20240052630A

Abstract

Disclosed is a method for training an artificial intelligence-based prediction model, executed by a computing device. The method may comprise the steps of: acquiring first input data corresponding to the major histocompatibility complex (MHC), second input data corresponding to peptides, and third input data corresponding to the CDR3 of a T Cell Receptor (TCR) corresponding to both the MHC and the peptides - where the first input data include amino acids of a first sequence corresponding to the MHC, the second input data include amino acids of a second sequence corresponding to the peptides, and the third input data include amino acids of a third sequence corresponding to the CDR3; performing preprocessing, which includes grouping and segmenting processes for the first, second, and third input data to generate a training data set; and using the training data set to train the artificial intelligence-based prediction model to generate a prediction result that includes the third input data from the first and second input data. FIG. 3 may serve as a representative figure.

Description

Method and device for generating TCR information corresponding to pMHC using artificial intelligence technology

This disclosure relates to artificial intelligence technology, and more specifically, to analyzing the relationship between peptide-major histocompatibility complex (pMHC) and T Cell Receptor (TCR) using artificial intelligence technology.

The major histocompatibility complex is a locus that encodes ‘MHC molecules’ that function in the immune system. MHC molecules can be of type 1 (class I) and type 2 (class II).

An immunopeptidome refers to a set of peptides expressed on the surface of a cell. For example, an immunopeptidome may refer to a combination of peptides associated with MHC.

Human Leukocyte Antigen (HLA) is a glycoprotein molecule produced by the human Major Histocompatibility Complex (MHC) gene. HLA is not present in mature red blood cells, but is expressed in immature erythroblasts and is expressed on the surface of all tissue cells in the human body, including blood cells such as white blood cells and/or platelets. MHC genes exist in all vertebrates, and the human MHC gene is called an HLA gene, and the product expressed therefrom is called HLA.

MHC genes are involved in recognition of self and non-self, immune response to antigen stimulation, regulation of cellular and humoral immunity, and susceptibility to disease. HLA, a product of the MHC gene, is the second most important antigen after the ABO blood group in the survival of the transplanted organ in solid organ transplantation. HLA is known to play the most important role in the success or failure of bone marrow transplantation. Therefore, immunological recognition of HLA differences can be considered the first step in rejection action against transplanted tissue. Additionally, in transfusion therapy, HLA and antibodies play an important role in the occurrence of various side effects such as platelet transfusion refractoriness, febrile non-hemolytic transfusion side effects, acute lung injury, and post-transfusion graft-versus-host disease.

HLA, like MHC, can be broadly classified into Class I and Class II. Class I is classified into HLA-A, HLA-B, and HLA-C and is expressed in most nucleated cells and platelets. It is essential for antigen recognition when cytotoxic T cells recognize and eliminate virus-infected cells or tumor cells. HLA Class II is classified into HLA-DR, HLA-DQ, and HLA-DP and is expressed in B cells, monocytes, dendritic cells, and activated T cells. It interacts with the antigen receptor of helper T cells and is known to be essential for inducing cellular and humoral immune responses and for recognizing antigens expressed on antigen-presenting cells. HLA shows the greatest polymorphism among the genes possessed by humans, and differences in frequency also exist among races and ethnicities.

When a peptide derived from an infectious microorganism-derived protein or a cancer cell-specific protein binds to MHC and is presented on the cell surface, T cells recognize it and trigger an immune response to eliminate the infected cell or cancer cell. In this way, T cells are key regulators (players) that determine specific immune responses to foreign substances that do not exist in the normal human body. Therefore, prediction of TCR (T Cell Receptor) binding to pMHC can be used in the development of personalized vaccines to prevent infectious diseases or cancer.

In relation to this, Republic of Korea Patent No. 10-2322832 has been issued.

The present disclosure has been made in response to the above-described background technology, and is intended to predict or identify TCRs capable of binding pMHC in a more efficient manner and/or more accurately.

The technical problems of the present disclosure are not limited to the technical problems mentioned above, and other technical problems not mentioned will be clearly understood by those skilled in the art from the description below.

According to one embodiment of the present disclosure, a method for training an artificial intelligence-based prediction model, performed by a computing device, is disclosed. The method may include: obtaining first input data corresponding to a Major Histocompatibility Complex (MHC), second input data corresponding to a peptide, and third input data corresponding to the CDR3 of a T Cell Receptor (TCR) corresponding to the MHC and the peptide - the first input data includes amino acids of a first sequence corresponding to the MHC, the second input data includes amino acids of a second sequence corresponding to the peptide, and the third input data includes amino acids of a third sequence corresponding to the CDR3; generating a training data set by performing preprocessing including a grouping process and a segmenting process on the first input data, the second input data, and the third input data; and training, using the training data set, the artificial intelligence-based prediction model to generate a prediction result including the third input data from the first input data and the second input data.

In one embodiment, the prediction result may further include type V and type J of the TCR corresponding to the first input data and the second input data.

In one embodiment, the grouping process may generate tokens with a length corresponding to each of the different length amino acid sequences by analyzing the frequency of occurrence for each of the different length amino acid sequences.

In one embodiment, the grouping process may include: generating a first token group containing a first set of amino acid sequences obtained by analyzing the frequency of occurrence of amino acid sequences having a first length; and generating a second token group containing a second set of amino acid sequences obtained by analyzing, with the first set of amino acid sequences included in the first token group excluded, the frequency of occurrence of amino acid sequences having a second length shorter than the first length. Tokens included in the first token group may have the first length, and tokens included in the second token group may have the second length.

In one embodiment, the frequency of occurrence may include: a value that quantitatively represents the probability of a specific amino acid sequence being found within one CDR3 sequence; or a value that quantitatively represents the ratio of the number of times a specific amino acid sequence is found in all CDR3s to the total number of CDR3s.
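As a non-limiting illustration, the two frequency measures described above could be computed as in the following Python sketch. The function name occurrence_frequencies, the sliding-window counting, and the reading of the first measure as the fraction of CDR3 sequences that contain the amino acid sequence at least once are assumptions made for this example and are not specified by the disclosure.

```python
from collections import Counter

def occurrence_frequencies(cdr3_list, length):
    """Return two dicts keyed by amino acid sequence of the given length.

    presence[s]: fraction of CDR3s that contain s at least once
                 (one assumed reading of "probability of being found in one CDR3").
    ratio[s]:    total number of times s is found in all CDR3s,
                 divided by the total number of CDR3s.
    """
    containing = Counter()   # CDR3s containing the sequence at least once
    occurrences = Counter()  # every occurrence, counted with repeats
    for cdr3 in cdr3_list:
        kmers = [cdr3[i:i + length] for i in range(len(cdr3) - length + 1)]
        occurrences.update(kmers)
        containing.update(set(kmers))
    n = len(cdr3_list)
    presence = {s: c / n for s, c in containing.items()}
    ratio = {s: c / n for s, c in occurrences.items()}
    return presence, ratio

# Hypothetical toy CDR3 list, not taken from any database.
presence, ratio = occurrence_frequencies(["CASSL", "CASSF", "CSSSS"], 2)
```

In this toy list, "SS" is present in every CDR3 but is found five times in total, so the two measures differ whenever a sequence repeats within a single CDR3.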

In one embodiment, tokenization may be performed on each of the first input data, the second input data, and the third input data based on the tokens generated by the grouping process.
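For illustration only, tokenization of an amino acid sequence against tokens produced by the grouping process might look like the following Python sketch. The greedy longest-match strategy, the fallback to single amino acids, and the function name tokenize are assumptions of this example; the disclosure does not prescribe a particular matching strategy.

```python
def tokenize(sequence, vocabulary):
    """Split an amino acid sequence into tokens from the vocabulary.

    At each position the longest matching token is taken (assumed strategy);
    a residue that matches no token becomes a single-residue token.
    """
    max_len = max((len(t) for t in vocabulary), default=1)
    tokens, i = [], 0
    while i < len(sequence):
        for size in range(min(max_len, len(sequence) - i), 0, -1):
            piece = sequence[i:i + size]
            if size == 1 or piece in vocabulary:
                tokens.append(piece)
                i += size
                break
    return tokens

# Hypothetical token set of mixed lengths.
tokens = tokenize("CASSLGQ", {"CASS", "SL", "GQ"})
```

The same function could be applied to the MHC sequence, the peptide sequence, and the CDR3 sequence, so that all three inputs share one token vocabulary.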

In one embodiment, the segmenting process may generate a first training data set including first tokens corresponding to the first input data and third tokens corresponding to the third input data with a separator token in between, and a second training data set including second tokens corresponding to the second input data and third tokens corresponding to the third input data with the separator token interposed therebetween. Here, the third tokens are used as correct answer data in the training process of the prediction model, and the first training data set and the second training data set may be learned by different artificial intelligence networks within the prediction model.

In one embodiment, the segmenting process may generate a third training data set including: first tokens corresponding to the first input data; second tokens corresponding to the second input data; a first delimiter token located between the first tokens and the second tokens; third tokens corresponding to the third input data; and a second delimiter token located between the second tokens and the third tokens. Here, the third tokens can be used as correct answer data in the training process of the prediction model.
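The two segmenting layouts described above can be sketched as follows. This is a minimal Python illustration; the separator (delimiter) token string "[SEP]", the dictionary layout, and the function names are hypothetical and are not taken from the disclosure.

```python
SEP = "[SEP]"  # hypothetical name for the separator (delimiter) token

def build_pairwise_sets(mhc_tokens, peptide_tokens, cdr3_tokens):
    """First and second training data sets: MHC|CDR3 and peptide|CDR3."""
    first = {"tokens": mhc_tokens + [SEP] + cdr3_tokens, "target": cdr3_tokens}
    second = {"tokens": peptide_tokens + [SEP] + cdr3_tokens, "target": cdr3_tokens}
    return first, second

def build_joint_set(mhc_tokens, peptide_tokens, cdr3_tokens):
    """Third training data set: MHC|peptide|CDR3 in a single sequence."""
    tokens = mhc_tokens + [SEP] + peptide_tokens + [SEP] + cdr3_tokens
    return {"tokens": tokens, "target": cdr3_tokens}

# Hypothetical pre-tokenized inputs.
first, second = build_pairwise_sets(["M1", "M2"], ["P1"], ["C1", "C2"])
joint = build_joint_set(["M1", "M2"], ["P1"], ["C1", "C2"])
```

In both layouts the CDR3 tokens are carried as the target, reflecting their use as correct answer data during training.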

In one embodiment, the grouping process may include: obtaining a CDR3 list from an external database or a database included in the computing device; for each set of amino acids whose amino acid sequence length is K, determining a first frequency of occurrence in the CDR3 list; among the sets of amino acids whose amino acid sequence length is K, determining a first set of amino acids whose first frequency of occurrence is equal to or greater than a first predetermined threshold; for each set of amino acids whose amino acid sequence length is K-1, determining a second frequency of occurrence in the CDR3 list; among the sets of amino acids whose amino acid sequence length is K-1, determining a second set of amino acids whose second frequency of occurrence is equal to or greater than a second predetermined threshold; and performing tokenization with a length of K for the first set of amino acids and performing tokenization with a length of K-1 for the second set of amino acids. Here, K is a natural number of 2 or more, and the first threshold and the second threshold may have the same or different values.

In one embodiment, the step of determining the second frequency of occurrence may include determining, for each of the sets of amino acids whose amino acid sequence length is K-1, the second frequency of occurrence within the range of the CDR3 list from which the first set of amino acids has been excluded.
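The two-level grouping with lengths K and K-1 might be sketched in Python as follows. This is a simplified illustration under stated assumptions: raw occurrence counts stand in for the frequency of occurrence, "excluding" the first set is interpreted as removing every position covered by a first-set sequence before counting, and all function names are hypothetical.

```python
from collections import Counter

def count_kmers(sequences, k):
    """Count every length-k amino acid sequence over all input sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(seq[i:i + k] for i in range(len(seq) - k + 1))
    return counts

def exclude_motifs(sequences, motifs, k):
    """Drop positions covered by any selected length-k motif; keep the rest."""
    remaining = []
    for seq in sequences:
        covered = [False] * len(seq)
        for i in range(len(seq) - k + 1):
            if seq[i:i + k] in motifs:
                covered[i:i + k] = [True] * k
        fragment = ""
        for residue, hit in zip(seq, covered):
            if hit:
                if fragment:
                    remaining.append(fragment)
                fragment = ""
            else:
                fragment += residue
        if fragment:
            remaining.append(fragment)
    return remaining

def two_level_groups(cdr3_list, k, first_threshold, second_threshold):
    """Return (first set with length k, second set with length k-1)."""
    first = {s for s, c in count_kmers(cdr3_list, k).items() if c >= first_threshold}
    rest = exclude_motifs(cdr3_list, first, k)
    second = {s for s, c in count_kmers(rest, k - 1).items() if c >= second_threshold}
    return first, second

# Hypothetical toy CDR3 list.
first, second = two_level_groups(["CASSLGQ", "CASSLGA", "CASTLGQ"], 3, 3, 2)
```

Counting the shorter sequences only on what remains after exclusion keeps a frequent long sequence from also inflating the counts of the shorter sequences it contains.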

In one embodiment, the grouping process may include: obtaining a CDR3 list from an external database or a database included in the computing device; constructing a token list by including in the token list, among amino acid sets whose amino acid sequence length is N-M, an (M+1)th set of amino acids whose frequency of occurrence in the CDR3 list is greater than or equal to a predetermined threshold, and by removing the (M+1)th set of amino acids from the CDR3 list - where M is an integer greater than 0, and N-M is a natural number greater than 2; after the step of constructing the token list is performed, increasing the value of M by 1 and determining whether a termination condition is satisfied; if the termination condition is not satisfied, re-performing the step of constructing the token list; and if the termination condition is satisfied, performing tokenization by generating a token corresponding to each of the amino acid sets included in the token list, each token having an amino acid sequence length corresponding to its amino acid set.

In one embodiment, the termination condition may include a first termination condition corresponding to N-M≤1.

In one embodiment, the termination condition may include a second termination condition corresponding to N-M ≤ 2, and the step of performing the tokenization may further include additionally generating a number of tokens corresponding to the number of types of amino acids whose amino acid sequence length is 1.

In one embodiment, the predetermined threshold may be varied to have a negative correlation with the magnitude of the value of N-M.
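The iterative construction of the token list, together with the termination conditions and the length-dependent threshold, can be sketched in Python as below. This is an illustrative sketch only: the loop starts at M = 0 so that the first pass handles length N, raw counts stand in for the frequency of occurrence, the 20 standard amino acids are assumed for the single-residue tokens, and the names build_token_list, threshold_for, stop_at, and add_single_residues are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed single-residue alphabet

def split_out(fragments, motifs, k):
    """Remove every position covered by a selected motif; return what is left."""
    remaining = []
    for seq in fragments:
        covered = [False] * len(seq)
        for i in range(len(seq) - k + 1):
            if seq[i:i + k] in motifs:
                covered[i:i + k] = [True] * k
        current = ""
        for residue, hit in zip(seq, covered):
            if hit:
                if current:
                    remaining.append(current)
                current = ""
            else:
                current += residue
        if current:
            remaining.append(current)
    return remaining

def build_token_list(cdr3_list, n, threshold_for, stop_at=1, add_single_residues=False):
    fragments, token_list, m = list(cdr3_list), [], 0
    while True:
        length = n - m
        counts = Counter(
            seq[i:i + length] for seq in fragments for i in range(len(seq) - length + 1)
        )
        selected = {s for s, c in counts.items() if c >= threshold_for(length)}
        token_list.extend(sorted(selected))                 # (M+1)th set joins the list
        fragments = split_out(fragments, selected, length)  # and leaves the CDR3 list
        m += 1
        if n - m <= stop_at:                                # N-M <= 1 (or N-M <= 2)
            break
    if add_single_residues:
        token_list.extend(AMINO_ACIDS)                      # one token per amino acid type
    return token_list

data = ["CASSLGQ", "CASSLGA", "CASTLGQ", "CASTLGA"]  # hypothetical toy list
# Threshold falls as the length grows (negative correlation with N-M).
tokens = build_token_list(data, 3, lambda length: 6 - length)
```

With stop_at=2 and add_single_residues=True the sketch follows the second termination condition instead, stopping before length 2 and appending one token per amino acid type.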

In one embodiment, the artificial intelligence-based prediction model may include a Recurrent Neural Network (RNN), a Long Short Term Memory (LSTM) network, or Bidirectional Encoder Representations from Transformers (BERT).

In one embodiment, the step of training the artificial intelligence-based prediction model may include applying a mask to some amino acids among the amino acid sequences and then performing semi-supervised learning to match the masked amino acids.

In one embodiment, the step of training the artificial intelligence-based prediction model may include applying a mask to amino acids included in the third sequence corresponding to the CDR3 among the amino acid sequences, and then performing semi-supervised learning to match the masked amino acids.

In one embodiment, the step of training the artificial intelligence-based prediction model may include, when the prediction model is trained over a plurality of epochs, applying the mask to different positions on an epoch basis or applying masks of different sizes on an epoch basis.
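As one possible illustration of epoch-dependent masking restricted to the CDR3 portion, consider the following Python sketch. The mask token string "[MASK]", the interpretation of mask "size" as the number of masked tokens, the per-epoch seeding scheme, and the function name mask_cdr3 are assumptions of this example.

```python
import random

MASK = "[MASK]"  # hypothetical mask token name

def mask_cdr3(tokens, cdr3_start, epoch, max_masks=2, seed=0):
    """Mask between 1 and max_masks tokens at or after cdr3_start.

    The random generator is seeded from the epoch, so the mask positions
    and the number of masks can change from one epoch to the next while
    staying reproducible. Returns (masked_tokens, labels), where labels
    maps each masked position to the original token (correct answer data).
    """
    positions = list(range(cdr3_start, len(tokens)))
    if not positions:
        raise ValueError("no CDR3 tokens to mask")
    rng = random.Random(seed * 100003 + epoch)
    n_masks = rng.randint(1, min(max_masks, len(positions)))
    chosen = set(rng.sample(positions, n_masks))
    masked = [MASK if i in chosen else t for i, t in enumerate(tokens)]
    labels = {i: tokens[i] for i in sorted(chosen)}
    return masked, labels

# Hypothetical token sequence: MHC, separator, peptide, separator, CDR3 tokens.
example = ["HLA", "[SEP]", "PEP", "[SEP]", "CAS", "SL", "GQ"]
masked, labels = mask_cdr3(example, cdr3_start=4, epoch=0)
```

Only positions at or after cdr3_start can be masked, so the MHC and peptide tokens always remain visible to the model.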

In one embodiment, the step of training the artificial intelligence-based prediction model may include, when a plurality of masks exist for one piece of training data, applying an average value of amino acids, or amino acid X representing all amino acids, to the other masks while the artificial intelligence-based prediction model performs prediction for one mask among the plurality of masks.
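The handling of several masks in one piece of training data can be illustrated with the following Python sketch, which uses the amino acid X option (the averaging option is not shown). The token strings and the function name per_mask_inputs are hypothetical.

```python
MASK = "[MASK]"  # hypothetical mask token name
UNKNOWN = "X"    # amino acid X, standing for any amino acid

def per_mask_inputs(masked_tokens):
    """Build one model input per mask.

    In each input, exactly one mask stays as the prediction target and
    every other mask is replaced by amino acid X.
    """
    positions = [i for i, t in enumerate(masked_tokens) if t == MASK]
    views = []
    for target in positions:
        view = [
            UNKNOWN if (t == MASK and i != target) else t
            for i, t in enumerate(masked_tokens)
        ]
        views.append((target, view))
    return views

views = per_mask_inputs(["C", MASK, "S", MASK])
```

Each returned pair gives the position being predicted and the input in which the remaining masks carry the neutral placeholder.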

In one embodiment, the step of training the artificial intelligence-based prediction model may include training the artificial intelligence-based prediction model by using, as the correct answer data, third input data experimentally determined to have no immunogenicity with respect to the first input data and the second input data. Here, in the training process of the artificial intelligence-based prediction model, the third input data may not be randomly generated.

In one embodiment, a computer program stored on a computer-readable storage medium is disclosed. The computer program, when executed by a computing device, causes the computing device to perform operations for training an artificial intelligence-based prediction model, the operations comprising: an operation of acquiring first input data corresponding to a major histocompatibility complex (MHC), second input data corresponding to a peptide, and third input data corresponding to the CDR3 of a TCR corresponding to the MHC and the peptide - the first input data includes amino acids of a first sequence corresponding to the MHC, the second input data includes amino acids of a second sequence corresponding to the peptide, and the third input data includes amino acids of a third sequence corresponding to the CDR3; an operation of generating a training data set by performing preprocessing including a grouping process and a segmenting process on the first input data, the second input data, and the third input data; and an operation of training, using the training data set, the artificial intelligence-based prediction model to generate a prediction result including the third input data from the first input data and the second input data.

A computing device according to one embodiment is disclosed. The computing device includes at least one processor; and memory. The at least one processor may perform: an operation of obtaining first input data corresponding to a major histocompatibility complex (MHC), second input data corresponding to a peptide, and third input data corresponding to the CDR3 of a TCR corresponding to the MHC and the peptide - the first input data includes amino acids of a first sequence corresponding to the MHC, the second input data includes amino acids of a second sequence corresponding to the peptide, and the third input data includes amino acids of a third sequence corresponding to the CDR3; an operation of generating a training data set by performing preprocessing including a grouping process and a segmenting process on the first input data, the second input data, and the third input data; and an operation of training, using the training data set, the artificial intelligence-based prediction model to generate a prediction result including the third input data from the first input data and the second input data.

Methods and devices according to an embodiment of the present disclosure can predict or identify TCRs capable of binding pMHC in a more efficient manner and/or more accurately.

Figure 1 schematically shows a block diagram of a computing device according to an embodiment of the present disclosure.

Figure 2 shows an example structure of an artificial intelligence-based model according to an embodiment of the present disclosure.

Figure 3 exemplarily shows a method of learning a prediction model that generates a prediction result including the CDR3 sequence of a TCR according to an embodiment of the present disclosure.

Figure 4 exemplarily shows a method for generating a prediction result including the CDR3 sequence of a TCR according to an embodiment of the present disclosure.

Figure 5 exemplarily illustrates a grouping process according to one embodiment of the present disclosure.

Figure 6 illustratively shows a segmentation process according to one embodiment of the present disclosure.

Figure 7 is a schematic diagram of a computing environment according to one embodiment of the present disclosure.

Various embodiments are described with reference to the drawings. In this specification, various descriptions are presented to provide an understanding of the disclosure. Before describing specific details for implementing the present disclosure, it should be noted that configurations that are not directly related to the technical gist of the present disclosure have been omitted to the extent that they do not distract from the technical gist of the present invention. In addition, the terms or words used in this specification and claims should be interpreted as having meanings and concepts consistent with the technical idea of the present invention, based on the principle that the inventor can appropriately define the concepts of terms in order to explain his or her invention in the best way.

As used herein, the terms "component", "module", "system", "part", etc. refer to a computer-related entity, hardware, firmware, software, a combination of software and hardware, or an implementation of software, and may be used interchangeably. For example, a component may be, but is not limited to, a process running on a processor, a processor, an object, a thread of execution, a program, and/or a computer. For example, both an application running on a computing device and the computing device can be a component. One or more components may reside within a processor and/or thread of execution. A component may be localized within one computer. A component may be distributed between two or more computers. Additionally, these components can execute from various computer-readable media having various data structures stored thereon. Components may communicate through local and/or remote processes, for example in accordance with a signal having one or more data packets (e.g., data from one component interacting with another component in a local system or a distributed system, and/or data transmitted to other systems over a network such as the Internet by means of the signal).

Additionally, the term “or” is intended to mean an inclusive “or” and not an exclusive “or.” That is, unless otherwise specified or clear from context, “X uses A or B” is intended to mean any of the natural inclusive permutations. That is, if X uses A, X uses B, or X uses both A and B, then “X uses A or B” applies to any of these cases. Additionally, the term “and/or” as used herein should be understood to refer to and include all possible combinations of one or more of the related listed items.

Additionally, the terms “comprise” and/or “comprising” should be understood to mean that the corresponding feature and/or element is present. However, the terms “comprise” and/or “comprising” should be understood as not excluding the presence or addition of one or more other features, elements and/or groups thereof. Additionally, unless otherwise specified or the context is clear to indicate a singular form, the singular terms herein and in the claims should generally be construed to mean “one or more.”

And, the term “at least one of A or B” or “at least one of A and B” should be interpreted to mean “the case containing only A,” “the case containing only B,” or “the case of a combination of A and B.”

Those skilled in the art will additionally recognize that the various illustrative logical components, blocks, modules, circuits, means, logics, and algorithms described in connection with the embodiments disclosed herein may be implemented using electronic hardware, computer software, or a combination of both. To clearly illustrate the interchangeability of hardware and software, various illustrative components, blocks, means, logic, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented in hardware or software will depend on the specific application and design constraints imposed on the overall system. A skilled technician can implement the described functionality in a variety of ways for each specific application. However, such implementation decisions should not be construed as causing a departure from the scope of the present disclosure.

The description of the presented embodiments is provided to enable anyone skilled in the art to use or practice the present invention. Various modifications to these embodiments will be apparent to those skilled in the art. The general principles defined herein may be applied to other embodiments without departing from the scope of the disclosure. Therefore, the present invention is not limited to the embodiments presented herein. The present invention is to be interpreted in the broadest scope consistent with the principles and novel features presented herein.

In the present disclosure, terms represented by N, such as first, second, or third, are used to distinguish at least one entity. For example, the entities expressed as first and second may be the same or different from each other.

In this disclosure, for convenience of explanation, human leukocyte antigen (HLA) is used as an example of MHC. Therefore, the descriptions of HLA or MHC below are examples, the scope of rights of the present disclosure is to be determined based on the content stated in the claims, and the scope of rights should not be interpreted as limited to HLA because HLA is used in the examples. As such, HLA and MHC in the present disclosure may be used interchangeably.

The term “human leukocyte antigen (HLA)” used in this disclosure refers to a glycoprotein molecule produced by the human MHC gene, which shows the greatest polymorphism among the genes possessed by humans. HLA typing, which determines the HLA type, can be actively used in various fields such as organ transplantation, immunotherapy, disease-related research, paternity testing, forensic use, and genetic research.

HLA types in the present disclosure may include, for example, HLA-A type, HLA-B type, and/or HLA-C type.

In one embodiment, the complex of MHC and peptide may refer to a complex in which a peptide antigen, after being processed by the proteasome in an antigen presenting cell (APC), is presented by an MHC class I molecule. The proteasome is composed of two monomers, LMP-2 and LMP-7 (low molecular weight polypeptide). These two proteasome monomers are located near the TAP-1 and TAP-2 genes within the MHC gene. Proteasome monomers are particularly important for the degradation of peptides that bind to MHC I molecules. When cells are treated with the cytokine interferon gamma (IFN-γ), the expression of LMP-2 and LMP-7 can be induced. Expression of LMP-2 and LMP-7 changes the substrate specificity of the proteasome, increasing its ability to degrade proteins into peptides. The expression of proteins related to antigen presentation, such as MHC I and MECL-1, as well as LMP proteins, is increased by IFN-γ, which can increase antigen presentation in antigen-presenting cells.

MHC I molecules bind not only to peptide antigens degraded by the proteasome, but also to peptides produced by proteolytic enzymes present in the endoplasmic reticulum.

Tapasin acts as a bridge between TAP-1 (transporter associated with antigen processing) and MHC I, which has a stable tertiary structure within the endoplasmic reticulum (ER), and when a peptide enters, the MHC I complex leaves the tapasin and TAP proteins and becomes a complete peptide-MHC class I complex.

There are two types of T cell receptor (TCR), mainly consisting of TCRα and TCRβ. The TCRα chain is located at an independent locus on chromosome 14, and the TCRβ chain is located on chromosome 7. TCRβ is composed of the V gene segment, D gene segment, J gene segment, and C gene segment, similar to the immunoglobulin (Ig) heavy chain, whereas the TCRα chain, similar to the Ig light chain, has no D gene segment and is composed of V, J, and C gene segments.

The TCR recombination process is similar to the B cell receptor recombination process. The TCRα chain is encoded by V, J, and C gene segments, similar to the light chain of immunoglobulin, and the TCRβ chain is formed through a genetic recombination process in which D-J joining is followed by V joining, and C is then joined to V-D-J. The joining of V-(D)-J is mediated by RAG-1 and RAG-2 (recombination-activating genes).

Cleavage of the hairpin structure is carried out by an endonuclease, and complementary P-nucleotides are added to the short single strand of DNA formed at this time. Additionally, 1 to 20 nucleotides, called N-nucleotides, can be randomly added to the cut end, and this process is mediated by TdT (terminal deoxynucleotidyl transferase). P- and N-nucleotides increase the diversity of T cell receptors.

T cells only recognize external antigens presented on the cell surface, and the activation of mature peripheral T cells begins through the interaction between the TCR and the antigen peptide presented in the peptide-binding groove of MHC molecules.

In particular, most immune profiling focuses on the analysis of the CDR3 region within the TCR sequence. The CDR3 region is an important region involved in the interaction between antigen and receptor, and it is the region in which the most mutations are identified.

FIG. 1 schematically shows a block diagram of a computing device 100 according to an embodiment of the present disclosure.

Computing device 100 according to an embodiment of the present disclosure may include a processor 110 and a memory 130.

The configuration of the computing device 100 shown in FIG. 1 is only a simplified example. In one embodiment of the present disclosure, the computing device 100 may include different components for performing the computing environment of the computing device 100, and only some of the disclosed components may configure the computing device 100.

The computing device 100 in the present disclosure may refer to any type of node constituting a system for implementing embodiments of the present disclosure. Computing device 100 may refer to any type of user terminal or any type of server. The components of the computing device 100 described above are exemplary and some may be excluded or additional components may be included. For example, when the above-described computing device 100 includes a user terminal, an output unit (not shown) and an input unit (not shown) may be included within the scope of the computing device 100.

The computing device 100 in the present disclosure may perform technical features according to embodiments of the present disclosure, which will be described later. For example, the computing device 100 may use an artificial intelligence-based prediction model that uses input data corresponding to MHC and peptides to generate a prediction result including the CDR3 sequence of the TCR corresponding to the input data. For example, the computing device 100 may acquire information about MHC and peptides from a sample obtained from a subject, and generate a prediction result corresponding to CDR3 of the TCR based on the information about MHC and peptides.

In one embodiment of the present disclosure, the computing device 100 may obtain a result of performing base sequence analysis (e.g., Next Generation Sequencing) from a server or an external entity. In another embodiment, the computing device 100 may perform base sequence analysis on genetic data (e.g., DNA or RNA) obtained from a biological sample derived from a subject. As used in the present disclosure, base sequencing may be performed by any type of technique capable of analyzing the sequence of bases, and may include, for example, but is not limited to, whole genome sequencing, whole exome sequencing, or whole transcriptome sequencing.

As used in the present disclosure, the term "subject" may refer to a subject or individual for obtaining a biological sample containing a major histocompatibility complex (MHC), a peptide, and/or a complex thereof.

The term "sample" used in the present disclosure can be used without limitation as long as the sample is obtained from an individual or subject whose MHC type is to be determined, and may be, for example, cells or tissues obtained through biopsy, blood, whole blood, serum, plasma, saliva, cerebrospinal fluid, various secretions, urine, and/or feces. Preferably, the sample may be selected from the group consisting of blood, plasma, serum, saliva, nasal fluid, sputum, ascites, vaginal secretions, and/or urine, and more preferably may be blood, plasma, or serum. The sample may be pretreated prior to use for detection or diagnosis. For example, pretreatment methods may include homogenization, filtration, distillation, extraction, concentration, inactivation of interfering components, and/or addition of reagents. In the present disclosure, the biological sample may be, but is not limited to, tissue, cells, whole blood, and/or blood.

In one embodiment, the processor 110 may consist of at least one core, and may include a processor for data analysis and/or processing, such as a central processing unit (CPU), a general purpose graphics processing unit (GPGPU), or a tensor processing unit (TPU) of the computing device 100.

The processor 110 may read the computer program stored in the memory 130 and generate a prediction result including the CDR3 sequence of the TCR, according to an embodiment of the present disclosure.

According to an embodiment of the present disclosure, the processor 110 may perform an operation for learning a neural network. The processor 110 may perform calculations for training a neural network, such as processing input data for learning in deep learning (DL), extracting features from the input data, calculating errors, and updating the weights of the neural network using backpropagation. At least one of the CPU, GPGPU, and TPU of the processor 110 may process learning of the network function. For example, the CPU and the GPGPU can work together to process learning of network functions and data classification using network functions. Additionally, in one embodiment of the present disclosure, processors of a plurality of computing devices may be used together to process learning of a network function and data classification using a network function. Additionally, a computer program executed in a computing device according to an embodiment of the present disclosure may be a CPU, GPGPU, or TPU executable program.
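The calculations named above (forward computation, error calculation, and weight update from the propagated gradient) can be illustrated with a deliberately small Python sketch. A single linear unit trained by gradient descent on a mean squared error stands in for the neural network here; this is not the prediction model of the disclosure, and the function name, learning rate, and toy data are assumptions.

```python
def train_linear_unit(samples, epochs=500, lr=0.1):
    """Train y = w.x + b on (features, target) pairs with gradient descent.

    Returns (weights, bias, loss_history). Each epoch performs a forward
    pass, computes the mean squared error, and updates the parameters
    using the gradient of that error.
    """
    n_features = len(samples[0][0])
    weights, bias = [0.0] * n_features, 0.0
    history = []
    for _ in range(epochs):
        grad_w, grad_b, loss = [0.0] * n_features, 0.0, 0.0
        for x, y in samples:
            pred = sum(w * xi for w, xi in zip(weights, x)) + bias  # forward pass
            err = pred - y                                          # error
            loss += err * err / len(samples)
            for j, xj in enumerate(x):                              # gradient of the error
                grad_w[j] += 2 * err * xj / len(samples)
            grad_b += 2 * err / len(samples)
        weights = [w - lr * g for w, g in zip(weights, grad_w)]     # weight update
        bias -= lr * grad_b
        history.append(loss)
    return weights, bias, history

# Toy data following y = 2x + 1.
weights, bias, history = train_linear_unit([([0.0], 1.0), ([1.0], 3.0), ([2.0], 5.0)])
```

In a multi-layer network the same gradient would be propagated backward through each layer, which is the step typically offloaded to a GPGPU or TPU.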

Additionally, the processor 110 may typically handle overall operations of the computing device 100. For example, the processor 110 may provide or process appropriate information or functions for the user by processing data, information, or signals input or output through the components included in the computing device 100, or by running an application program stored in the storage.

According to one embodiment of the present disclosure, the memory 130 may store any type of information generated or determined by the processor 110 and any type of information received by the computing device 100. According to an embodiment of the present disclosure, the memory 130 may be a storage medium that stores computer software that allows the processor 110 to perform operations according to embodiments of the present disclosure. Accordingly, the memory 130 may refer to computer-readable media for storing software codes required to perform embodiments of the present disclosure, data to be executed by the codes, and execution results of the codes.

According to one embodiment of the present disclosure, the memory 130 may refer to any type of storage medium. For example, the memory 130 may include at least one type of storage medium among a flash memory type, a hard disk type, a multimedia card micro type, a card type memory (e.g., SD or XD memory), RAM (Random Access Memory), SRAM (Static Random Access Memory), ROM (Read-Only Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), PROM (Programmable Read-Only Memory), magnetic memory, a magnetic disk, and an optical disk. The computing device 100 may operate in connection with web storage that performs the storage function of the memory 130 on the Internet. The description of the memory above is only an example, and the memory 130 used in the present disclosure is not limited to the examples described above.

The communication unit (not shown) in the present disclosure can be configured regardless of the communication mode, such as wired or wireless, and can be configured with various communication networks such as a personal area network (PAN) and a wide area network (WAN). In addition, the network unit 150 can operate based on the well-known World Wide Web (WWW), and can also use wireless transmission technology used for short-distance communication, such as Infrared Data Association (IrDA) or Bluetooth.

Computing device 100 in the present disclosure may include any type of user terminal and/or any type of server. Accordingly, embodiments of the present disclosure may be performed by a server and/or a user terminal.

A user terminal may include any type of terminal capable of interacting with a server or other computing device. User terminals may include, for example, mobile phones, smart phones, laptop computers, personal digital assistants (PDAs), slate PCs, tablet PCs, and ultrabooks.

Servers may include any type of computing system or computing device, such as, for example, microprocessors, mainframe computers, digital processors, portable devices, and device controllers.

In an additional embodiment, the above-described server may mean an entity that stores and manages TCR information, immunopeptidome information, peptide sequence information, base sequence information, or genetic information. The server may include a storage unit (not shown) for storing immunopeptidome information, peptide sequence information, amino acid identifier information for each position, nucleotide sequence information, genetic information, or database reliability information (e.g., McPas, TCR3F, Huarc, VDJdb, IMGT). The storage unit may be included in the server or may exist under the management of the server. As another example, the storage unit may be implemented in a form that exists outside the server and can communicate with the server. In this case, the storage unit may be managed and controlled by an external server that is different from the server.

Throughout this specification, prediction model, artificial intelligence-based prediction model, artificial intelligence model, artificial intelligence-based model, computational model, neural network, and network function may be used with the same meaning.

A neural network can generally consist of a set of interconnected computational units, which can be referred to as nodes. These nodes may also be referred to as neurons. A neural network consists of at least one node. Nodes (or neurons) that make up neural networks may be interconnected by one or more links.

Within a neural network, one or more nodes connected through a link may form a relative input node and output node relationship. The concepts of input node and output node are relative, and any node in an output node relationship with one node may be in an input node relationship with another node, and vice versa. As described above, input node to output node relationships can be created around links. One or more output nodes can be connected to one input node through a link, and vice versa.

In a relationship between an input node and an output node connected through one link, the value of the data of the output node may be determined based on the data input to the input node. Here, the link connecting the input node and the output node may have a weight. Weights may be variable and may be varied by the user or an algorithm in order for the neural network to perform the desired function. For example, when one or more input nodes are connected to one output node by respective links, the output node value can be determined based on the values input to the input nodes connected to the output node and the weights set on the links corresponding to each input node.
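The relationship between input values, link weights, and the output node value can be illustrated with a minimal sketch. The function name `output_node_value` is hypothetical, and the plain weighted sum is an assumption made for illustration; the disclosure does not fix a particular formula, and an activation function is omitted here.

```python
def output_node_value(inputs, weights):
    """Weighted sum of input-node values, one weight per link (illustrative assumption)."""
    if len(inputs) != len(weights):
        raise ValueError("each input node needs exactly one link weight")
    return sum(x * w for x, w in zip(inputs, weights))


# Three input nodes connected to one output node by three weighted links.
value = output_node_value([1.0, 2.0, 3.0], [0.5, -1.0, 0.25])
print(value)
```

Changing any one weight changes the output value for the same inputs, which is why two networks with identical structure but different weights behave differently.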

As described above, in a neural network, one or more nodes are interconnected through one or more links to form an input node and output node relationship within the neural network. The characteristics of the neural network can be determined according to the number of nodes and links within the neural network, the correlation between the nodes and links, and the value of the weight assigned to each link. For example, if there are two neural networks with the same number of nodes and links and different weight values of the links, the two neural networks may be recognized as different from each other.

A neural network may consist of a set of one or more nodes. A subset of the nodes that make up a neural network can form a layer. Some of the nodes constituting the neural network may form one layer based on their distances from the initial input node. For example, a set of nodes at a distance n from the initial input node may constitute the n-th layer. The distance from the initial input node can be defined by the minimum number of links that must be passed to reach the node from the initial input node. However, this definition of a layer is arbitrary and given for explanation purposes, and the order of a layer within a neural network may be defined in a different way than described above. For example, a layer of nodes may be defined by distance from the final output node.
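The distance-based definition of a layer can be sketched with a breadth-first search over the links. The function name `layers_by_distance` and the node labels are hypothetical and chosen only for this illustration.

```python
from collections import deque


def layers_by_distance(links, input_nodes):
    """Group nodes by the minimum number of links from any initial input node."""
    adjacency = {}
    for src, dst in links:
        adjacency.setdefault(src, []).append(dst)

    distance = {node: 0 for node in input_nodes}
    queue = deque(input_nodes)
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, []):
            if nxt not in distance:  # first visit gives the minimum distance
                distance[nxt] = distance[node] + 1
                queue.append(nxt)

    layers = {}
    for node, d in distance.items():
        layers.setdefault(d, []).append(node)
    return {d: sorted(nodes) for d, nodes in sorted(layers.items())}


links = [("x1", "h1"), ("x1", "h2"), ("x2", "h1"), ("x2", "h2"), ("h1", "y"), ("h2", "y")]
print(layers_by_distance(links, ["x1", "x2"]))
```

In this toy network the input nodes fall at distance 0, the hidden nodes at distance 1, and the output node at distance 2.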

In one embodiment of the present disclosure, a set of neurons or nodes may be referred to as a layer.

The initial input node may refer to one or more nodes in the neural network into which data is directly input without going through links in relationships with other nodes. Alternatively, in the relationship between nodes based on links within a neural network, it may refer to nodes that do not have other input nodes connected by links. Similarly, the final output node may refer to one or more nodes that do not have an output node in their relationship with other nodes among the nodes in the neural network. Additionally, hidden nodes may refer to nodes constituting a neural network other than the initial input node and the final output node.

The neural network according to an embodiment of the present disclosure may be a neural network in which the number of nodes in the input layer is the same as the number of nodes in the output layer, and the number of nodes decreases and then increases again as it progresses from the input layer to the hidden layer. In addition, the neural network according to another embodiment of the present disclosure may be a neural network in which the number of nodes in the input layer is less than the number of nodes in the output layer, and the number of nodes decreases as it progresses from the input layer to the hidden layer. In addition, the neural network according to another embodiment of the present disclosure may be a neural network in which the number of nodes in the input layer is greater than the number of nodes in the output layer, and the number of nodes increases as it progresses from the input layer to the hidden layer. A neural network according to another embodiment of the present disclosure may be a neural network that is a combination of the above-described neural networks.

A deep neural network (DNN) may refer to a neural network that includes multiple hidden layers in addition to the input layer and output layer. Deep neural networks can be used to identify latent structures in data. That is, the latent structure of photos, text, video, voice, protein sequence structure, gene sequence structure, peptide sequence structure, music, and/or the binding affinity between a peptide and MHC can be identified (e.g., which object is in a photo, what the content and emotion of a text are, what the content and emotion of a voice are, etc.). Deep neural networks may include convolutional neural networks (CNNs), recurrent neural networks (RNNs), auto encoders, generative adversarial networks (GANs), restricted Boltzmann machines (RBMs), deep belief networks (DBNs), Q networks, U networks, Siamese networks, etc. The description of the deep neural network above is only an example and the present disclosure is not limited thereto.

For example, the artificial intelligence-based prediction model of the present disclosure may include a Recurrent Neural Network (RNN), a Long Short Term Memory (LSTM) network, or a Bidirectional Encoder Representations from Transformers (BERT) model.

The artificial intelligence-based prediction model of the present disclosure can be expressed by a network structure of any of the structures described above, including an input layer, a hidden layer, and an output layer.

Neural networks that can be used in the artificial intelligence-based model of the present disclosure may be trained by at least one of supervised learning, unsupervised learning, semi-supervised learning, or reinforcement learning. Training a neural network may be a process of applying, to the neural network, knowledge that enables it to perform a specific operation. For example, a prediction model according to an embodiment of the present disclosure may be trained by a semi-supervised learning method in which a mask is applied to some of the amino acids in the amino acid sequences and the model learns to predict the masked amino acids.
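The masking step can be sketched as follows. This is a minimal illustration under stated assumptions: the `[MASK]` token, the 15% masking ratio, the function name `mask_sequence`, and the example sequence are all hypothetical and are not taken from the disclosure.

```python
import random

MASK = "[MASK]"  # hypothetical mask token


def mask_sequence(sequence, mask_ratio=0.15, seed=0):
    """Replace a random subset of residues with MASK and keep the originals as targets."""
    rng = random.Random(seed)
    residues = list(sequence)
    # Mask at least one residue, but never more than the sequence contains.
    n_masked = min(len(residues), max(1, round(len(residues) * mask_ratio)))
    positions = sorted(rng.sample(range(len(residues)), n_masked))
    targets = {pos: residues[pos] for pos in positions}
    for pos in positions:
        residues[pos] = MASK
    return residues, targets


masked, targets = mask_sequence("CASSLGQAYEQYF")  # illustrative sequence only
print(masked)
print(targets)  # position -> amino acid the model would be trained to recover
```

During training, the model would receive `masked` as input, and its predictions at the masked positions would be compared against `targets`.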

Neural networks can be trained to minimize output errors. Training a neural network is a process of repeatedly inputting training data into the neural network, calculating the error between the output of the neural network and the target for the training data, and backpropagating the error from the output layer toward the input layer in the direction that reduces the error, thereby updating the weight of each node in the neural network. In the case of supervised learning, training data in which each item is labeled with the correct answer is used (i.e., labeled training data), and in the case of unsupervised learning, the correct answer may not be labeled in each item of training data. For example, in the case of supervised learning for data classification, the training data may be data in which each item is labeled with a category. Labeled training data is input to the neural network, and the error can be calculated by comparing the output (category) of the neural network with the label of the training data. As another example, in the case of unsupervised learning for data classification, the error can be calculated by comparing the input training data with the neural network output. The calculated error is backpropagated in the reverse direction (i.e., from the output layer to the input layer) in the neural network, and the connection weight of each node in each layer of the neural network can be updated according to backpropagation. The amount of change in the updated connection weight of each node may be determined according to the learning rate. One pass of computing the neural network's output for the input data and backpropagating the error can constitute a learning cycle (epoch). The learning rate may be applied differently depending on the number of repetitions of the learning cycle of the neural network. For example, in the early stages of training, a high learning rate can be used so that the neural network quickly reaches a certain level of performance and efficiency increases, and in the later stages of training, a low learning rate can be used to increase accuracy.
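The idea of a learning rate that starts high and is lowered in later learning cycles can be sketched with a toy example. The step-decay schedule, its constants, and the one-weight model y = w * x are assumptions chosen for brevity; they stand in for a full neural network with backpropagation and are not taken from the disclosure.

```python
def learning_rate(epoch, initial_lr=0.1, decay=0.5, step=10):
    """Step decay (assumed schedule): multiply the rate by `decay` every `step` epochs."""
    return initial_lr * (decay ** (epoch // step))


def train(xs, ys, epochs=30):
    """Fit y = w * x by gradient descent on the mean squared error."""
    w = 0.0
    for epoch in range(epochs):
        # Gradient of mean((w * x - y) ** 2) with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        # The size of the weight update is scaled by the current learning rate.
        w -= learning_rate(epoch) * grad
    return w


w = train([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
print(w)  # approaches 2.0, the slope that generated the data
```

Early epochs take large steps toward the solution, while later epochs take smaller steps that refine it.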

In the training of neural networks, the training data can generally be a subset of real data (i.e., the data to be processed using the trained neural network), and thus there may be learning cycles in which the error on the training data decreases while the error on the real data increases. Overfitting is a phenomenon in which errors on real data increase due to excessive learning on the training data. For example, a phenomenon in which a neural network that learned what a cat is by being shown yellow cats fails to recognize a non-yellow cat as a cat may be a type of overfitting. Overfitting can cause errors in machine learning algorithms to increase. To prevent such overfitting, various optimization methods can be used, such as increasing the training data, regularization, dropout that disables some of the network nodes during the learning process, and use of a batch normalization layer.
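Of the methods listed above, dropout is the simplest to sketch. The version below is the common "inverted dropout" variant, which is an assumption here since the disclosure does not specify a variant; the function name and parameters are hypothetical.

```python
import random


def dropout(activations, rate=0.5, training=True, seed=None):
    """Inverted dropout: zero each activation with probability `rate` during training."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        return list(activations)  # all nodes are active at inference time
    rng = random.Random(seed)
    keep = 1.0 - rate
    # Surviving activations are scaled by 1 / keep so the expected value is unchanged.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


print(dropout([1.0, 2.0, 3.0, 4.0], rate=0.5, seed=1))
print(dropout([1.0, 2.0, 3.0, 4.0], training=False))
```

Because a different random subset of nodes is disabled on each pass, the network cannot rely on any single node, which tends to reduce overfitting.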

A computer-readable medium storing a data structure according to an embodiment of the present disclosure is disclosed. The above-described data structure can be stored in a storage unit in the present disclosure, executed by a processor, and transmitted and received by a communication unit.

Data structure can refer to the organization, management, and storage of data to enable efficient access and modification of data. Data structure can refer to the organization of data to solve a specific problem (e.g., retrieving data, storing data, or modifying data in the shortest possible time). A data structure may be defined as a physical or logical relationship between data elements designed to support a specific data processing function. Logical relationships between data elements may include connection relationships between user-defined data elements. Physical relationships between data elements may include actual relationships between data elements that are physically stored in a computer-readable storage medium (e.g., a persistent storage device). A data structure may specifically include a set of data, relationships between data, and functions or instructions applicable to the data. Effectively designed data structures allow computing devices to perform computations while minimally using the computing device's resources. Specifically, computing devices can increase the efficiency of operations, reading, insertion, deletion, comparison, exchange, and search through effectively designed data structures.

Data structures can be divided into linear data structures and non-linear data structures depending on the type of data structure. A linear data structure may be a structure in which only one piece of data is connected after another piece of data. Linear data structures may include a list, stack, queue, and deque. A list can refer to a set of data that has an internal order. The list may include a linked list. A linked list may be a data structure in which each piece of data has a pointer and the data is connected in a single line. In a linked list, a pointer can contain connection information to the next or previous data. A linked list can be expressed as a singly linked list, doubly linked list, or circular linked list depending on its form. A stack may be a data listing structure that allows limited access to data. A stack can be a linear data structure in which data can be processed (for example, inserted or deleted) at only one end of the data structure. Data stored in the stack may follow a last-in, first-out (LIFO) structure, in which the data that enters later comes out sooner. A queue is also a data listing structure that allows limited access to data. Unlike the stack, it can follow a first-in, first-out (FIFO) structure, in which data stored later is released later. A deque can be a data structure that can process data at both ends of the data structure.
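The access rules of the stack, queue, and deque can be shown with Python's built-in `list` and `collections.deque`. The choice of these containers is only for illustration.

```python
from collections import deque

# Stack (LIFO): insert and delete at the same end.
stack = []
stack.append("A")
stack.append("B")
top = stack.pop()  # "B": the item stored last comes out first

# Queue (FIFO): insert at the back, delete from the front.
queue = deque()
queue.append("A")
queue.append("B")
first = queue.popleft()  # "A": the item stored first comes out first

# Deque: insert and delete at both ends.
dq = deque(["B"])
dq.appendleft("A")
dq.append("C")
ends = (dq.popleft(), dq.pop())  # ("A", "C")

print(top, first, ends)
```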

A non-linear data structure may be a structure in which multiple pieces of data are connected behind one piece of data. Nonlinear data structures may include graph data structures. A graph data structure can be defined by vertices and edges, and an edge can include a line connecting two different vertices. Graph data structure may include a tree data structure. A tree data structure may be a data structure in which there is only one path connecting two different vertices among a plurality of vertices included in the tree. In other words, it may be a data structure that does not form a loop in the graph data structure.
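The property that a tree has exactly one path between any two vertices, and therefore no loop, can be checked with a short sketch. The function name `is_tree` and the union-find approach are illustrative choices, not part of the disclosure.

```python
def is_tree(vertices, edges):
    """Return True if the undirected graph is connected and contains no loop."""
    parent = {v: v for v in vertices}

    def find(v):
        # Follow parent pointers to the set representative, compressing the path.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            return False  # a second path between a and b would form a loop
        parent[root_a] = root_b
    # A loop-free graph on n vertices is connected exactly when it has n - 1 edges.
    return len(edges) == len(vertices) - 1


print(is_tree(["A", "B", "C", "D"], [("A", "B"), ("A", "C"), ("C", "D")]))
print(is_tree(["A", "B", "C", "D"], [("A", "B"), ("A", "C"), ("C", "D"), ("B", "D")]))
```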

Throughout this specification, artificial intelligence-based model, computational model, neural network, and network function may be used with the same meaning. Below, they are described in a unified manner as a neural network. Data structures may include neural networks, and the data structure including the neural network may be stored in a computer-readable medium. A data structure including a neural network may also include data preprocessed for processing by the neural network, data input to the neural network, weights of the neural network, hyperparameters of the neural network, data acquired from the neural network, activation functions associated with each node or layer of the neural network, and a loss function for training the neural network. A data structure containing a neural network may include any of the components disclosed above. In other words, the data structure including the neural network may be configured to include all or any combination of data preprocessed for processing by the neural network, data input to the neural network, weights of the neural network, hyperparameters of the neural network, data acquired from the neural network, activation functions associated with each node or layer of the neural network, and a loss function for training the neural network. In addition to the configurations described above, a data structure containing a neural network may include any other information that determines the characteristics of the neural network. Additionally, the data structure may include all types of data used or generated in the computational process of a neural network and is not limited to the above. Computer-readable media may include computer-readable recording media and/or computer-readable transmission media. A neural network can generally consist of a set of interconnected computational units, which can be referred to as nodes. These nodes may also be referred to as neurons. A neural network consists of at least one node.

The data structure may include data input to the neural network. A data structure containing data input to a neural network may be stored in a computer-readable medium. Data input to the neural network may include learning data input during the neural network learning process and/or input data input to the neural network on which training has been completed. Data input to the neural network may include data that has undergone pre-processing and/or data subject to pre-processing. Preprocessing may include a data processing process to input data into a neural network. Therefore, the data structure may include data subject to preprocessing and data generated by preprocessing. The above-described data structure is only an example and the present disclosure is not limited thereto.

The data structure may include the weights of the neural network. (In this specification, weights and parameters may be used with the same meaning.) The data structure including the weights of the neural network may be stored in a computer-readable medium. A neural network may include multiple weights. Weights may be variable and may be varied by the user or an algorithm in order for the neural network to perform the desired function. For example, when one or more input nodes are connected to one output node by respective links, the data value output from the output node can be determined based on the values input to the input nodes connected to the output node and the weights set on the links corresponding to each input node. The above-described data structure is only an example and the present disclosure is not limited thereto.

As an example and not a limitation, the weights may include weights that are changed during the neural network learning process and/or weights for which neural network learning has been completed. Weights that vary during the neural network learning process may include weights at the start of the learning cycle and/or weights that vary during the learning cycle. Weights for which neural network training has been completed may include weights for which a learning cycle has been completed. Therefore, a data structure including weights of a neural network may include weights that change during the neural network learning process and/or weights that have completed neural network learning. Therefore, the above-described weights and/or combinations of each weight are included in the data structure including the weights of the neural network. The above-described data structure is only an example and the present disclosure is not limited thereto.

The data structure including the weights of the neural network may be stored in a computer-readable storage medium (e.g., memory, hard disk) after going through a serialization process. Serialization can be the process of converting a data structure into a form that can be stored on the same or a different computing device and later reconstituted and used. Computing devices can transmit and receive data over a network by serializing data structures. A serialized data structure containing the weights of a neural network can be reconstructed on the same computing device or on a different computing device through deserialization. The data structure including the weights of the neural network is not limited to serialization. Furthermore, the data structure including the weights of the neural network may take the form of a data structure designed to increase computational efficiency while minimizing the use of computing device resources (e.g., among non-linear data structures, a B-Tree, R-Tree, Trie, m-way search tree, AVL tree, or Red-Black Tree). The foregoing is merely an example and the present disclosure is not limited thereto.
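Serialization and deserialization of weights can be sketched with Python's standard `json` module. The JSON format, the layer names, and the weight values are assumptions for illustration only; the disclosure does not specify a serialization format.

```python
import json


def serialize_weights(weights):
    """Convert a weight dictionary into bytes that can be stored or transmitted."""
    return json.dumps(weights, sort_keys=True).encode("utf-8")


def deserialize_weights(payload):
    """Reconstruct the weight dictionary from its serialized bytes."""
    return json.loads(payload.decode("utf-8"))


# Hypothetical weights for a tiny two-layer network.
weights = {"layer1": [[0.1, -0.2], [0.3, 0.4]], "layer2": [[0.5], [-0.6]]}
payload = serialize_weights(weights)      # bytes suitable for a file or a network socket
restored = deserialize_weights(payload)   # may run on the same or a different device
print(restored == weights)
```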

The data structure may include hyperparameters of a neural network. The data structure including the hyperparameters of the neural network can be stored in a computer-readable medium. A hyperparameter may be a variable that can be changed by the user. Hyperparameters may include, for example, the learning rate, the cost function, the number of learning cycle repetitions, weight initialization (e.g., setting the range of weight values subject to weight initialization), and the number of hidden units (e.g., number of hidden layers, number of nodes in hidden layers). The above-described data structure is only an example and the present disclosure is not limited thereto.

A transformer may be considered as a network function for the prediction model according to an embodiment of the present disclosure. As an example, the prediction model may operate based on a transformer. This prediction model may be operated using, for example, a recurrent neural network to which an attention algorithm is applied or a transformer to which an attention algorithm is applied.

In one embodiment, the transformer may be comprised of an encoder that encodes the embedded data and a decoder that decodes the encoded data. A transformer may have a structure that receives a series of data, goes through encoding and decoding steps, and outputs a series of data of different types. In one embodiment, a series of data can be processed into a form that can be operated by a transformer. The process of processing a series of data into a form that can be operated by a transformer may include an embedding process. Expressions such as data token, embedding vector, embedding token, etc. may refer to data embedded in a form that a transformer can process.

In order for the transformer to encode and decode a series of data, the encoders and decoders within the transformer can process the data using an attention algorithm. The attention algorithm may refer to an algorithm that calculates the similarity between a given query and one or more keys, reflects each similarity in the value corresponding to each key, and returns the attention value obtained as the weighted sum of the values in which the similarities are reflected.
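The query, key, and value computation can be sketched in plain Python. The scaled dot product used as the similarity measure follows the Vaswani et al. formulation and is an assumption here, since the passage above does not fix a particular similarity function; the toy vectors are hypothetical.

```python
import math


def softmax(scores):
    """Turn similarity scores into weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    # Similarity of the query to each key.
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    # Attention value: sum of the values weighted by similarity.
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return output, weights


output, weights = attention([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [[10.0, 0.0], [0.0, 10.0]])
print(weights)
print(output)
```

The query is more similar to the first key, so the first value receives the larger weight in the output.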

Depending on how the query, key, and value are set, various types of attention algorithms can be classified. For example, if attention is obtained by setting the query, key, and value all the same, this may mean a self-attention algorithm. When, in order to process a series of input data in parallel, the embedding vector is divided so that its dimension is reduced and attention is obtained by computing an individual attention head for each divided embedding vector, this may refer to a multi-head attention algorithm.

In one embodiment, the transformer may be composed of modules that perform a plurality of multi-head self-attention algorithms or multi-head encoder-decoder algorithms. In one embodiment, the transformer may also include additional components other than the attention algorithm, such as an embedding layer, a normalization layer, and a softmax layer. A method of configuring a transformer using an attention algorithm may include the method disclosed in Vaswani et al., Attention Is All You Need, 2017 NIPS, which is incorporated herein by reference.

Transformers can be applied to various data domains such as embedded natural language, embedded sequence information, segmented image data, and audio waveforms to convert a series of input data into a series of output data. In order to convert data with various data domains into a series of data that can be input to the transformer, the transformer can embed the data. Transformers can process additional data expressing relative positional or phase relationships between a series of input data. Alternatively, a series of input data may be embedded by additionally reflecting vectors expressing relative positional relationships or phase relationships between the input data. In one example, the relative positional relationship between a series of input data may include, but is not limited to, word order within a natural language sentence, the relative positional relationship of each segmented image, the temporal order of segmented audio waveforms, etc. The process of adding information expressing relative positional or phase relationships between a series of input data may be referred to as positional encoding.
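One widely used form of positional encoding is the sinusoidal scheme from Vaswani et al. The sketch below assumes that scheme for illustration; the disclosure does not state which positional encoding is used, and the function name and dimensions are hypothetical.

```python
import math


def positional_encoding(num_positions, dim):
    """Sinusoidal table: sine on even indices, cosine on odd indices."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(dim):
            # Each pair of dimensions (2k, 2k + 1) shares one wavelength.
            angle = pos / (10000 ** ((2 * (i // 2)) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


pe = positional_encoding(num_positions=3, dim=4)
# The encoding for each position is added element-wise to the embedding at that position.
embedding = [0.5, 0.5, 0.5, 0.5]
with_position = [e + p for e, p in zip(embedding, pe[1])]
print(pe[0])
print(with_position)
```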

An example of a method for embedding data and converting it for processing by a transformer is disclosed in Dosovitskiy et al., AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE, which document is incorporated herein by reference.

In one embodiment, the steps shown in FIG. 3 may be performed by computing device 100. In a further embodiment, the steps shown in FIG. 3 may be implemented by multiple entities, such that some of the steps shown in FIG. 3 are performed at a user terminal and others are performed at a server.

In one embodiment, the computing device 100 may obtain first input data corresponding to the major histocompatibility complex (MHC), second input data corresponding to the peptide, and third input data corresponding to the CDR3 of the TCR corresponding to the MHC and the peptide (310).

In one example, the first input data may include amino acids of a first sequence corresponding to the MHC, the second input data may include amino acids of a second sequence corresponding to the peptide, and the third input data may include amino acids of a third sequence corresponding to the CDR3.

As another example, first input data and second input data may be integrated with each other. Accordingly, the integrated first input data and second input data may represent an amino acid sequence representing an MHC-peptide conjugate.

As another example, the first input data, second input data, and/or third input data may include a nucleotide sequence.

In one example, an amino acid sequence may include a group of identifiers that represent amino acids.

In one embodiment, the first input data, second input data and/or third input data may be obtained from public databases and/or experimental result data.

In a further embodiment, the input data may include input data obtained by applying BLOSUM encoding or one-hot encoding to the amino acid sequence.
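One-hot encoding of an amino acid sequence can be sketched as follows. The alphabetical ordering of the 20 standard one-letter codes and the function name are assumptions for illustration; BLOSUM encoding would instead replace each residue with its row from a substitution matrix, which is not reproduced here.

```python
# The 20 standard amino acids as one-letter codes (ordering is an assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence):
    """Encode each residue as a 20-dimensional vector with a single 1."""
    encoded = []
    for aa in sequence:
        row = [0] * len(AMINO_ACIDS)
        row[INDEX[aa]] = 1  # raises KeyError for non-standard residues
        encoded.append(row)
    return encoded


encoded = one_hot_encode("CAS")
print(len(encoded), len(encoded[0]))  # one row per residue, 20 columns per row
```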

In a further embodiment, the input data may additionally include at least one of: a first feature indicating polarity between amino acids, a second feature indicating the size of the amino acid, a third feature indicating whether the amino acid is hydrophobic or hydrophilic, a fourth feature indicating the presence or absence of a charge on the amino acid, or a fifth feature indicating whether the amino acid is aromatic or aliphatic.

In one embodiment, the computing device 100 may generate a training data set by performing preprocessing, including a grouping process and a segmenting process, on the first input data, the second input data, and the third input data (320).

In one embodiment, preprocessing may include a grouping process and a segmenting process. The grouping process may involve grouping amino acid sequences into N units. The segmenting process may refer to a process of combining the resulting groups. A training data set is a data set created based on the grouping process and the segmenting process and can be used to train a prediction model.

In one embodiment, the grouping process may include a process of generating tokens having a length corresponding to each of the different length amino acid sequences by analyzing the frequency of occurrence for each of the different length amino acid sequences. Here, the length of the amino acid sequence may correspond to the number of amino acids included in the amino acid sequence. For example, if the length of the amino acid sequence is 4, the number of amino acids included in the amino acid sequence may be 4.

As an example, the grouping process may include a tokenization process. That is, tokenization may be performed on each of the first input data, second input data, and third input data based on the tokens generated by the grouping process.

In one embodiment, the frequency of occurrence may represent a value that quantitatively represents the probability of a specific amino acid sequence being found within one CDR3 sequence. In one embodiment, the frequency of occurrence may quantitatively represent the ratio of the number of times a specific amino acid sequence is found in all CDR3s to the total number of CDR3s.
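The ratio described above can be computed with a short sketch. Counting each CDR3 at most once (rather than counting repeated occurrences within one CDR3) is an assumption, and the function name and example sequences are hypothetical.

```python
def occurrence_frequency(subsequence, cdr3_sequences):
    """Fraction of CDR3 sequences that contain the given amino acid subsequence."""
    if not cdr3_sequences:
        return 0.0
    # Assumption: a CDR3 contributes one hit no matter how often the subsequence repeats in it.
    hits = sum(1 for cdr3 in cdr3_sequences if subsequence in cdr3)
    return hits / len(cdr3_sequences)


# Illustrative sequences only, not data from the disclosure.
cdr3s = ["CASSLGQAYEQYF", "CASSPGTGYEQYF", "CAWSVGNTIYF"]
print(occurrence_frequency("CASS", cdr3s))
print(occurrence_frequency("NTIY", cdr3s))
```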

In one embodiment, the grouping process may include generating a first token group for a first set of amino acid sequences obtained by analyzing the frequency of occurrence of amino acid sequences having a first length, and generating a second token group for a second set of amino acid sequences obtained by analyzing the frequency of occurrence of amino acid sequences having a second length shorter than the first length, with the first set of amino acid sequences included in the first token group excluded. Here, tokens included in the first token group may have the first length and tokens included in the second token group may have the second length.

The grouping process is described below with reference to FIG. 5.

In one embodiment, the segmenting process may refer to a process of combining the groups produced by the grouping process. In one example, the segmenting process may include combining tokens to create a training data set. The segmenting process may include, for example, segment embedding.

In one embodiment, the segmenting process may generate: a first training data set comprising first tokens corresponding to the first input data and third tokens corresponding to the third input data, with a separator token in between; and a second training data set comprising second tokens corresponding to the second input data and third tokens corresponding to the third input data, with a separator token in between. Here, the third tokens can be used as correct answer data in the training of the prediction model. The prediction model may be trained to output the third input data, corresponding to the CDR3 that corresponds to the first input data (MHC) and the second input data (peptide), based on a combination of a first inference result obtained from the first training data set and a second inference result obtained from the second training data set. Here, the prediction model may include a first prediction model trained on the first training data set and a second prediction model trained on the second training data set. In this embodiment, the third input data may be output, or the prediction model may be trained to output the third input data, based on the output of the first prediction model and the output of the second prediction model.

In one embodiment, the computing device 100 may use the training data set to train the artificial intelligence-based prediction model to generate a prediction result that includes the third input data from the first input data and the second input data (330).

In one embodiment, the prediction result may further include a V type and a J type of TCR as well as CDR3 corresponding to the first input data and the second input data.

In one embodiment, the prediction model may be operated based on any learning method to generate a prediction result including third input data from first and second input data.

As an example, the learning method may include applying a mask to some of the amino acids in the amino acid sequences and then performing semi-supervised learning to predict the masked amino acids.

For example, the learning method may include applying a mask to amino acids of the third sequence corresponding to CDR3 among the amino acid sequences, and then performing semi-supervised learning to predict the masked amino acids. In this example, the learning method according to an embodiment of the present disclosure may apply a mask to the amino acids corresponding to CDR3 while applying no mask to the amino acids of the first sequence corresponding to the MHC or the amino acids of the second sequence corresponding to the peptide.

In a further embodiment, when the prediction model is trained over a plurality of epochs, the learning method for training the prediction model may include applying a mask to a different position in each epoch or applying a mask of a different size in each epoch. For example, in a first epoch, a mask of a first size may be applied to at least some of the amino acids of the CDR3 at a first position, and in a second epoch different from the first epoch, a mask of a second size may be applied to at least some of the amino acids of the CDR3 at a second position different from the first position.
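The per-epoch masking described above can be sketched as follows. This is only an illustrative sketch: the mask symbol `[MASK]`, the function name `mask_cdr3_for_epoch`, the contiguous mask span, and the seeding scheme are all assumptions and are not specified in the disclosure. The only properties taken from the text are that the mask is restricted to the CDR3 tokens and that its position and size may differ from epoch to epoch.

```python
import random
from typing import List, Tuple

MASK = "[MASK]"  # hypothetical mask symbol


def mask_cdr3_for_epoch(
    tokens: List[str],
    cdr3_start: int,
    epoch: int,
    max_mask_size: int = 3,
    seed: int = 0,
) -> Tuple[List[str], List[int]]:
    """Return a masked copy of `tokens` and the list of masked positions.

    Only positions at or after `cdr3_start` (the CDR3 tokens) can be masked;
    the MHC and peptide tokens before it are left untouched.
    """
    cdr3_len = len(tokens) - cdr3_start
    if cdr3_len <= 0:
        raise ValueError("sequence contains no CDR3 tokens")
    # The random generator is seeded per epoch, so the mask position and size
    # can differ between epochs while remaining reproducible.
    rng = random.Random(seed * 100003 + epoch)
    size = rng.randint(1, min(max_mask_size, cdr3_len))
    start = rng.randint(cdr3_start, len(tokens) - size)
    positions = list(range(start, start + size))
    masked = list(tokens)
    for pos in positions:
        masked[pos] = MASK
    return masked, positions


tokens = ["[CLS]", "m1", "m2", "p1", "p2", "[SEP]", "c1", "c2", "c3", "c4"]
for epoch in range(3):
    masked, positions = mask_cdr3_for_epoch(tokens, cdr3_start=6, epoch=epoch)
    print(epoch, positions, masked)
```

During training, the original tokens at `positions` would serve as the targets that the prediction model learns to recover.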

In a further embodiment, the prediction model may be trained to output output data corresponding to CDR3 based on input data representing the combination of MHC and peptide.

In one embodiment, MHC and peptides capable of binding, or each of MHC and peptides, may be obtained through public databases (e.g., VDJdb, McPas, TCR3D, Huarc). In an additional embodiment, whether an MHC and a peptide can bind may be determined by a separate artificial intelligence-based model that takes the MHC and the peptide as inputs and outputs whether or not they can bind.

Figure 4 exemplarily shows a method for generating a prediction result including the CDR3 sequence of a TCR according to an embodiment of the present disclosure. For example, the steps shown in FIG. 4 may be performed by computing device 100.

In one embodiment, after training of the prediction model 470 is completed, the prediction model 470 may receive first input data 410a corresponding to the MHC and second input data 420a corresponding to the peptide, and output third input data 430a corresponding to CDR3. In an additional embodiment, after training of the prediction model 470 is completed, the prediction model 470 may receive the first input data 410a corresponding to the MHC and the second input data 420a corresponding to the peptide, and output third input data 430a corresponding to CDR3, V type, and J type.

In one embodiment, the prediction model 470 may be trained to output CDR3 and/or V type/J type corresponding to the peptide and MHC based on the training data 460.

In one embodiment, the first input data 410a may include an amino acid sequence corresponding to MHC. The second input data 420a may include an amino acid sequence corresponding to the peptide. The third input data 430a may include amino acid sequences corresponding to CDR3, V type, and/or J type. In the training of the prediction model 470, the third input data 430a may be a CDR3 (and, additionally, V type and J type) that can bind to the MHC corresponding to the first input data 410a and the peptide corresponding to the second input data 420a. That is, data on the CDR3 corresponding to the MHC and peptide can be acquired in advance, and the prediction model 470 can be trained based on these pre-acquired data.

In one embodiment, grouping process 440 may generate tokens 410b, 420b, and 430b corresponding to input data 410a, 420a, and 430a. In one embodiment, grouping process 440 may generate a first token 410b corresponding to first input data 410a, a second token 420b corresponding to second input data 420a, and a third token 430b corresponding to third input data 430a.

In one embodiment, grouping process 440 includes a process for splitting input data into smaller units or sizes, where the split units may correspond to tokens.

In one embodiment, the segmentation process 450 may include a process of creating a training data set 460 based on the tokens.

The segmenting process 450 according to an embodiment of the present disclosure may generate a training data set 460 including first input data 410a corresponding to MHC and third input data 430a corresponding to CDR3. In this training data set 460, the third input data 430a corresponding to the first input data 410a may be used as the correct answer data. The segmenting process 450 may also generate a training data set 460 including second input data 420a corresponding to the peptide and third input data 430a corresponding to the CDR3. In this training data set 460, the third input data 430a corresponding to the second input data 420a can be used as the correct answer data. In this embodiment, the segmenting process 450 may generate two types of training data sets 460 and apply them to the prediction model 470 independently. In this case, the prediction model 470 may be one integrated model or may be composed of a plurality of separate models for receiving and processing each training data set 460.

The segmenting process 450 according to an embodiment of the present disclosure may generate a training data set 460 including first input data 410a corresponding to MHC, second input data 420a corresponding to the peptide, and third input data 430a corresponding to CDR3. In this training data set 460, the third input data 430a corresponding to the first input data 410a and the second input data 420a may be used as correct answer data. In this embodiment, the segmenting process 450 may generate input data in the form of concatenated first input data 410a and second input data 420a, and may create the training data set 460 by using the third input data 430a as the correct answer data corresponding to that input data. In this case, through the segmenting process, input data in an integrated form can be generated in such a way that, for example, amino acid sequences corresponding to MHC and amino acid sequences corresponding to peptides are concatenated with each other.

In one embodiment, concatenated first and second input data may be used in the inference process of the prediction model 470, and the CDR3 corresponding to this concatenated input data may be output. In one embodiment, in the inference process of the prediction model 470, the CDR3 for the first input data and the CDR3 for the second input data may be combined or input to another model, so that the CDR3 corresponding to both the first input data and the second input data is output.

In one embodiment, the grouping process may obtain the CDR3 list 510 from an external database or a database included in the computing device. For example, the CDR3 list 510 may include amino acid sequences corresponding to existing CDR3s.

In one embodiment, the amino acid sequences included in the CDR3 list 510 may consist of various numbers (K) of amino acids, such as 7, 8, 9, 10, or 11 amino acids. Amino acid sequences of a specific length may be referred to as amino acid sets. For example, assuming that there are 20 types of amino acids, there are 20 to the power of 4 possible amino acid sequences of length 4. Each of these combinations of amino acids may be referred to as an amino acid set. In this situation, the number of amino acid sets of length 4 is 20 to the power of 4 in total.
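The count of amino acid sets of a given length can be checked with a short enumeration. The sketch below assumes the 20 standard amino acids in one-letter code; the names `AMINO_ACIDS` and `amino_acid_sets` are chosen for illustration only.

```python
from itertools import product

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_sets(length: int):
    """Yield every possible amino acid sequence of the given length."""
    for combo in product(AMINO_ACIDS, repeat=length):
        yield "".join(combo)


num_sets_len4 = sum(1 for _ in amino_acid_sets(4))
print(num_sets_len4)  # 160000, i.e. 20 ** 4
```

Because the number of sets grows as 20 to the power of the length, an implementation would likely count only the sets actually observed in the CDR3 list rather than enumerate all of them for longer lengths.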

In one embodiment, for each set of amino acids whose amino acid sequence length is K, the frequency of occurrence in the CDR3 list 510 can be determined. The process of determining how often amino acid sets of a specific length appear within the amino acid sequences included in the CDR3 list 510 may be referred to as the frequency of occurrence analysis 500 in FIG. 5.

In one embodiment, specific amino acid sequences among the amino acid sequences included in the CDR3 list 510 may be tokenized through the frequency of occurrence analysis 500. The computing device 100 may include the tokenized amino acid sequences in the token list 520 during the grouping process. The token list 520 may include tokens generated through the grouping process.

In one embodiment, the computing device 100 may determine, among sets of amino acids whose amino acid sequences have a length of K, a first set of amino acids whose first frequency of occurrence is greater than or equal to a predetermined first threshold. With the first set of amino acids excluded from the CDR3 list 510, the computing device 100 may then perform the frequency of occurrence analysis 500 by reducing the length from K by a specific unit (e.g., 1) and determining the frequency of occurrence for each of the amino acid sets of the reduced length. In this example, after the first set of amino acids is determined, the computing device 100 may determine, for each of the sets of amino acids whose amino acid sequences have a length of K-1, a second frequency of occurrence within the CDR3 list 510, and may determine, among those sets, a second set of amino acids whose second frequency of occurrence is greater than or equal to a predetermined second threshold. The computing device 100 may perform tokenization of length K on the first sets of amino acids determined by the frequency of occurrence analysis 500 for amino acid sets of length K, and tokenization of length K-1 on the second sets of amino acids determined by the frequency of occurrence analysis 500 for amino acid sets of length K-1. Tokens generated by the tokenization may be stored in the token list 520. Here, K is a natural number of 1 or more or a natural number of 2 or more, and the first threshold and the second threshold may have the same or different values.

In one embodiment, the grouping process may obtain the CDR3 list 510 from an external database or a database included in the computing device. The grouping process may construct the token list 520 by including in the token list 520, among sets of amino acids whose amino acid sequences are N-M in length, the (M+1)-th set of amino acids whose frequency of occurrence in the CDR3 list 510 is greater than or equal to a predetermined threshold, and by removing the (M+1)-th set of amino acids from the CDR3 list 510. After constructing the token list 520, the grouping process may increment the value of M by 1 and determine whether a termination condition has been satisfied. If the termination condition is not satisfied, the grouping process may re-perform the step of constructing the token list; if the termination condition is satisfied, it may perform tokenization by generating, for each of the amino acid sets included in the token list 520, a token corresponding to the amino acid sequence length of that set. Here, M is an integer greater than 0, and N-M can be a natural number greater than 2.

The termination condition according to an embodiment of the present disclosure may include conditions for determining whether to continue the frequency of occurrence analysis 500 while reducing the length of the amino acid sequence or to end the frequency of occurrence analysis 500.

For example, the termination condition may include a first termination condition corresponding to N-M≤1. If the above termination condition is satisfied based on the current values of N and M, the computing device 100 may terminate the frequency of occurrence analysis 500 without increasing the value of M.

As another example, the termination condition may include a second termination condition corresponding to N-M≤2. Additionally, the grouping process may perform tokenization by additionally generating a number of tokens corresponding to the types of amino acids whose amino acid sequence length is 1. For example, amino acids with an amino acid sequence length of 1 may include 21 types, namely the 20 amino acids plus an X amino acid. As another example, amino acids whose amino acid sequence length is 1 may include any type of amino acid that can represent one amino acid.

In one embodiment, the threshold used in the frequency of occurrence analysis 500 may be varied to have a negative correlation with the magnitude of the value of N-M. For example, when the value of N-M is large, the size of the threshold may be relatively low.
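One way to realize such a negative correlation is a linearly decreasing threshold schedule. The sketch below is an assumption made for illustration: the function name `threshold_for_length` and the numeric values of `base`, `step`, and `floor` are hypothetical, and the disclosure does not specify any particular schedule or values.

```python
def threshold_for_length(
    length: int,
    base: float = 0.10,   # hypothetical threshold for length 1
    step: float = 0.02,   # hypothetical decrease per additional amino acid
    floor: float = 0.01,  # hypothetical lower bound
) -> float:
    """Linearly decreasing threshold: longer amino acid sets get a lower threshold."""
    if length < 1:
        raise ValueError("length must be at least 1")
    return max(floor, base - step * (length - 1))


for n in (4, 3, 2):
    print(n, threshold_for_length(n))
```

With this schedule, amino acid sets of length 4 need a lower frequency of occurrence to be selected than sets of length 2, which favors keeping longer tokens.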

As shown in FIG. 5, the frequency of occurrence analysis 500 may extract sets of amino acids with a length of 4 whose frequency of occurrence is x% or more with respect to sets of amino acids with a length of n=4 (530). In FIG. 5, n=4 is illustrated for convenience of explanation, but it will be clear to those skilled in the art that amino acid sequences of various lengths may be used depending on the embodiment.

In one embodiment, computing device 100 may include the extracted sets of amino acids of length 4 in token list 520 and remove them from CDR3 list 510 (540). For example, step 550 may be performed with the amino acid sets of length 4 extracted from the amino acid sequences included in the CDR3 list 510 excluded.

In one embodiment, the computing device 100 may subtract 1 from n to extract sets of amino acids with a length of 3 whose frequency of occurrence is x% or more for sets of amino acids with n=3 (550). Here, the threshold for the frequency of appearance is exemplified as x%, the same as in step 530, but it is also possible to use different thresholds for each stage depending on the implementation aspect. For example, the threshold may be set to have a negative correlation (e.g., inverse proportion, linear decrease, etc.) with the size of n. Through this negative correlation between the threshold and the amino acid length, tokenization of long amino acid sequences can be sufficiently secured.

In one embodiment, computing device 100 may include the extracted sets of amino acids of length 3 in token list 520 and remove them from CDR3 list 510 (560). For example, step 570 may be performed with the amino acid sets of length 3 extracted from the amino acid sequences included in the CDR3 list 510 excluded. That is, in step 570, sets of amino acids of length 2 (n=2) whose frequency of occurrence is x% or more can be extracted within the range of the CDR3 list 510 excluding the extracted amino acid sets of length 4 and the extracted amino acid sets of length 3. For example, assuming that there are a total of 20 types of amino acids, the total number of amino acid sets of length 2 may be 20² = 400. Among the 400 types of amino acid sets, amino acid sets with a frequency of occurrence of x% or more in the CDR3 list 510 from which the previously extracted amino acid sets have been removed may be extracted in step 570.

In one embodiment, computing device 100 may include the extracted sets of amino acids of length 2 in token list 520 and remove them from CDR3 list 510 (580). For example, step 590 may be performed with the amino acid sets of length 2 extracted from the amino acid sequences included in the CDR3 list 510 excluded.

In one embodiment, computing device 100 may include each of the single amino acids of length n=1 in token list 520 (590). That is, for amino acids with a length of n=1, all types of amino acids with a length of 1 can be included in the token list 520.

Because the grouping or tokenization is achieved in the manner described above, the technique according to an embodiment of the present disclosure can use universal and consistent tokens when performing more accurate predictions of the CDR3 corresponding to the MHC and the peptide containing amino acid sequences.

In one embodiment, tokens generated based on the grouping process according to the embodiment of FIG. 5 may have a length corresponding to the length of the amino acid sequence (i.e., the value of n). In one embodiment, by using tokens generated based on the grouping process according to the embodiment of FIG. 5, tokenization may be performed on the first input data corresponding to the MHC, the second input data corresponding to the peptide, and the third input data corresponding to CDR3. That is, the technique according to an embodiment of the present disclosure generates tokens through the frequency of occurrence analysis 500 of the amino acid sequences included in the CDR3 list 510, and the resulting tokenization can be applied to all of the MHC, the peptide, and the CDR3.
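Applying a token list to an MHC, peptide, or CDR3 sequence can be sketched as follows. The disclosure does not state how a sequence is matched against the token list, so greedy longest-match from left to right is an assumption here, as are the function name `tokenize`, the fallback to the X symbol for unrecognized residues, and the small example token list.

```python
from typing import List, Sequence


def tokenize(sequence: str, token_list: Sequence[str], unknown: str = "X") -> List[str]:
    """Split `sequence` into tokens by greedy longest match against `token_list`."""
    vocab = set(token_list)
    max_len = max((len(t) for t in vocab), default=1)
    tokens: List[str] = []
    i = 0
    while i < len(sequence):
        for size in range(min(max_len, len(sequence) - i), 0, -1):
            piece = sequence[i:i + size]
            if piece in vocab:
                tokens.append(piece)
                i += size
                break
        else:
            # No token matches even a single residue: fall back to the X symbol.
            tokens.append(unknown)
            i += 1
    return tokens


# Hypothetical token list: three multi-residue tokens, 20 single residues, and X.
example_token_list = ["CASS", "GQ", "GT"] + list("ACDEFGHIKLMNPQRSTVWY") + ["X"]
print(tokenize("CASSLGQ", example_token_list))  # ['CASS', 'L', 'GQ']
```

The same function can be applied unchanged to MHC and peptide sequences, which is the sense in which one token list serves all three inputs.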

Reference number 610 represents a learning data set with MHC and peptide as questions and CDR3 as the answer according to an embodiment of the present disclosure.

Reference number 620 represents a first learning data set 630 with MHC as the question and CDR3 as the correct answer, and a second learning data set 640 with the peptide as the question and CDR3 as the correct answer, according to another embodiment of the present disclosure.

According to one embodiment, a learning data set 610 is disclosed in which the first input data (t1, t2, t3, ... tn) corresponding to MHC and the second input data (t'1, t'2, t'3, ... t'n) corresponding to the peptide are set as the question (0), and the third input data (t''1, t''2, t''3, ... t''n) corresponding to CDR3 is set as the answer (1). Here, each of CLS, t1, t2, t3, t'1, t'2, t'3, SEP, t''1, t''2, and t''3 may correspond to one token.

According to one embodiment, a first learning data set 630 is disclosed in which the first input data (t1, t2, t3, ... tn) corresponding to the MHC is set as the question (0) and the third input data (t''1, t''2, t''3, ... t''n) corresponding to the CDR3 is set as the correct answer (1). A second learning data set 640 is also disclosed in which the second input data (t'1, t'2, t'3, ... t'n) is set as the question (0) and the third input data (t''1, t''2, t''3, ... t''n) corresponding to CDR3 is set as the correct answer (1). Here, each of CLS, t1, t2, t3, t'1, t'2, t'3, SEP, t''1, t''2, and t''3 may correspond to one token. In this embodiment, the first learning data set 630 and the second learning data set 640 may be learned through different models. Through post-processing of the results learned through the different models, the final CDR3 sequence corresponding to the peptide and MHC can be generated.

The CLS token in the present disclosure represents a token indicating the beginning of a sentence, and the SEP may represent a separator token inserted between tokens to distinguish the tokens.

In this way, the segmenting process of the present disclosure can generate a learning data set for learning a prediction model by combining the generated tokens with a delimiter token and a start token.
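The layout of the learning data sets 610, 630, and 640 can be sketched as follows. The token strings `[CLS]` and `[SEP]`, the function name `build_example`, and the choice to assign the CLS and SEP tokens to segment 0 (a common BERT convention) are assumptions made for illustration. The question (0) and answer (1) labels follow the description above.

```python
from typing import Dict, List

CLS, SEP = "[CLS]", "[SEP]"  # hypothetical spellings of the start and separator tokens


def build_example(question_tokens: List[str], answer_tokens: List[str]) -> Dict[str, List]:
    """Combine tokens as: CLS, question tokens, SEP, answer tokens.

    Segment id 0 marks the question part (including CLS and SEP) and
    segment id 1 marks the answer part (the CDR3 tokens).
    """
    tokens = [CLS] + question_tokens + [SEP] + answer_tokens
    segment_ids = [0] * (len(question_tokens) + 2) + [1] * len(answer_tokens)
    return {"tokens": tokens, "segment_ids": segment_ids}


mhc = ["t1", "t2", "t3"]
peptide = ["t'1", "t'2", "t'3"]
cdr3 = ["t''1", "t''2", "t''3"]

example_610 = build_example(mhc + peptide, cdr3)  # MHC and peptide as question, CDR3 as answer
example_630 = build_example(mhc, cdr3)            # MHC as question, CDR3 as answer
example_640 = build_example(peptide, cdr3)        # peptide as question, CDR3 as answer
print(example_610["tokens"])
print(example_610["segment_ids"])
```

The segment ids make it straightforward to restrict masking to the answer part, since only positions with segment id 1 hold CDR3 tokens.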

The prediction model in this disclosure may mean an artificial intelligence-based generative model. As an example, the prediction model may include a Recurrent Neural Network (RNN), a Long Short Term Memory (LSTM) network, or Bidirectional Encoder Representations from Transformers (BERT).

In one embodiment, Next Sentence Prediction (NSP) in a language model such as BERT usually generates the question (0) and the correct answer (1) randomly. In the learning process of the prediction model according to an embodiment of the present disclosure, for the data corresponding to the question (0), CDR3 data experimentally determined to lack immunogenicity (false) and CDR3 data experimentally determined to have immunogenicity (true) can be used as the correct answer (1). In other words, during the learning process of the artificial intelligence-based prediction model, CDR3 data is not randomly generated. Because most randomly generated amino acid sequences would have false values, effective learning would be difficult with random generation, and the technique according to an embodiment of the present disclosure allows a more efficient prediction model to be trained.

In one embodiment of the present disclosure, a Masked Learning method may be used in the learning process of a prediction model. After applying a mask to some amino acids among the amino acid sequences included in the learning data sets 610 and 620, semi-supervised learning may be performed to match the masked amino acids. The target to which the mask is applied here may be limited to amino acid sequences corresponding to the correct answer (1). Accordingly, the accuracy of the prediction model for predicting the corresponding CDR3 by receiving peptides and MHC can be further increased.

In one embodiment of the present disclosure, multiple masks may be applied to one training data set. In this case, the Masked Learning method can be implemented by applying X amino acids, which represent all amino acids or the average of amino acids, to the other mask in the process of matching one of the plurality of masks. Accordingly, since the feature called

The technique according to an example of the present disclosure can be performed by using a model pre-trained through semi-supervised learning and then applying fine-tuning by supervised learning. A technique according to an example of the present disclosure may be implemented by a prediction model including one or more encoders and/or decoders. A technique according to an example of the present disclosure may use a Masked Learning method, which masks some tokens and has the prediction model predict them, and a Next Sentence Prediction method, which predicts the context or the next specific token when two token sets or two tokens are given.

For example, in one embodiment of the present disclosure, an artificial intelligence-based model that takes as input amino acid sequences corresponding to the peptide and/or MHC, or tokens corresponding to the peptide and/or MHC, and outputs the length of the corresponding TCR may be additionally utilized. As another example, an artificial intelligence-based model that takes the same input and outputs the length of the corresponding CDR3 may be additionally utilized. As another example, an artificial intelligence-based model that takes the same input and outputs the corresponding V type and J type may be additionally utilized. As another example, an artificial intelligence-based model that takes the same input and outputs the length of the random insertion sequence of the CDR3 having the corresponding V type and J type may be additionally utilized. The technique according to an embodiment of the present disclosure can use this additional artificial intelligence model to obtain information about the corresponding CDR3 and/or V type and J type from the peptide and/or MHC, and can perform masked learning for the prediction model by using the information about the CDR3 and/or V type and J type as input. In this case, the prediction model may be trained by applying a mask to at least some of the tokens or sequences associated with CDR3 so that it predicts the tokens or sequences corresponding to the applied mask. This artificial intelligence-based model can also be utilized in the examples described with reference to the preceding drawings. Accordingly, the artificial intelligence-based model receives the amino acid sequence and/or tokens corresponding to the peptide and/or MHC and outputs information related to CDR3, and the prediction model that receives the information related to CDR3 can ultimately be trained to output the sequence of CDR3 with the amino acid sequence and/or tokens corresponding to the peptide and/or MHC as input. In additional embodiments of the present disclosure, the prediction model may be designed to include the additional artificial intelligence-based models described above. Accordingly, two models may exist within the prediction model: the first model can obtain information about the corresponding CDR3 and/or V type and J type from the peptide and/or MHC, and the second model can output the amino acid sequence of CDR3 by using the information about CDR3 and/or V type and J type as input.

The Next Sentence Prediction method in the present disclosure may include, for example, a process of learning to predict CDR3 from a peptide by recognizing the amino acid sequences corresponding to the peptide as one sentence and the amino acid sequences corresponding to CDR3 as one sentence. As another example, the Next Sentence Prediction method may include a process of learning to predict CDR3 from MHC by recognizing the amino acid sequences corresponding to MHC as one sentence and the amino acid sequences corresponding to CDR3 as one sentence. As another example, the Next Sentence Prediction method may include a process of learning to predict CDR3 from pMHC by recognizing the amino acid sequences corresponding to pMHC as one sentence and the amino acid sequences corresponding to CDR3 as one sentence. As another example, the Next Sentence Prediction method may include a process of learning to predict the CDR3 corresponding to the peptide and MHC by recognizing the amino acid sequences corresponding to MHC as one sentence, the amino acid sequences corresponding to the peptide as one sentence, and the amino acid sequences corresponding to CDR3 as one sentence. According to this embodiment, as described above, the masking target in the masking process may be limited to amino acid sequences and/or tokens corresponding to CDR3.

A component, module, or unit in the present disclosure includes routines, procedures, programs, components, data structures, etc. that perform a specific task or implement a specific abstract data type. Additionally, those of ordinary skill in the art will fully appreciate that the methods presented in this disclosure may be implemented with other computer system configurations, including uni-processor or multiprocessor computing devices, minicomputers, and mainframe computers, as well as personal computers, handheld computing devices, microprocessor-based or programmable consumer electronics, and the like (each of which may operate in conjunction with one or more associated devices).

Embodiments described in this disclosure can also be practiced in distributed computing environments where certain tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote memory storage devices.

Computing devices typically include a variety of computer-readable media. Computer-readable media can be any medium that can be accessed by a computer, and such computer-readable media include volatile and non-volatile media, transitory and non-transitory media, and removable and non-removable media. By way of example, and not limitation, computer-readable media may include computer-readable storage media and computer-readable transmission media.

Computer-readable storage media include volatile and non-volatile media, transitory and non-transitory media, and removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data. Computer-readable storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital video disk (DVD) or other optical disk storage, magnetic cassette, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be accessed by a computer and used to store desired information.

A computer-readable transmission medium typically embodies computer-readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave or other transport mechanism, and includes all information delivery media. The term modulated data signal refers to a signal in which one or more of the characteristics of the signal have been set or changed to encode information within the signal. By way of example, and not limitation, computer-readable transmission media include wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared, and other wireless media. Combinations of any of the above are also intended to be included within the scope of computer-readable transmission media.

An example environment implementing various aspects of the invention includes a computer 2002, which includes a processing unit 2004, a system memory 2006, and a system bus 2008. Computer 2002 herein may be used interchangeably with computing device. System bus 2008 couples system components, including but not limited to system memory 2006, to processing unit 2004. Processing unit 2004 may be any of a variety of commercially available processors. Dual processors and other multiprocessor architectures may also be used as processing unit 2004.

System bus 2008 may be any of several types of bus structures that may further be interconnected to a memory bus, a peripheral bus, and a local bus using any of a variety of commercially available bus architectures. System memory 2006 includes read-only memory (ROM) 2010 and random access memory (RAM) 2012. A basic input/output system (BIOS) is stored in non-volatile memory 2010, such as ROM, EPROM, or EEPROM, and contains the basic routines that help transfer information between components within computer 2002, such as during startup. RAM 2012 may also include high-speed RAM, such as static RAM, for caching data.

Computer 2002 also includes an internal hard disk drive (HDD) 2014 (e.g., EIDE, SATA), a magnetic floppy disk drive (FDD) 2016 (e.g., for reading from or writing to a removable diskette 2018), an SSD, and an optical disk drive 2020 (e.g., for reading a CD-ROM disk 2022, or for reading from or writing to other high-capacity optical media such as a DVD). The hard disk drive 2014, magnetic disk drive 2016, and optical disk drive 2020 can be connected to the system bus 2008 by a hard disk drive interface 2024, a magnetic disk drive interface 2026, and an optical drive interface 2028, respectively. The interface 2024 for implementing an external drive includes, for example, at least one or both of Universal Serial Bus (USB) and IEEE 1394 interface technologies.

These drives and their associated computer-readable media provide non-volatile storage of data, data structures, computer-executable instructions, and the like. For computer 2002, the drives and media accommodate the storage of any data in a suitable digital format. Although the description of computer-readable storage media above refers to an HDD, a removable magnetic disk, and removable optical media such as a CD or DVD, those skilled in the art will appreciate that other types of computer-readable storage media, such as zip drives, magnetic cassettes, flash memory cards, and cartridges, may also be used in the exemplary operating environment, and that any such media may contain computer-executable instructions for performing the methods of the invention.

A number of program modules may be stored in the drive and RAM 2012, including an operating system 2030, one or more application programs 2032, other program modules 2034, and program data 2036. All or portions of the operating system, applications, modules and/or data may also be cached in RAM 2012. It will be appreciated that the invention may be implemented on various commercially available operating systems or combinations of operating systems.

A user may input commands and information into computer 2002 through one or more wired/wireless input devices, such as a keyboard 2038 and a pointing device such as a mouse 2040. Other input devices (not shown) may include microphones, IR remote controls, joysticks, game pads, stylus pens, touch screens, and the like. These and other input devices are often connected to processing unit 2004 through an input device interface 2042 that is coupled to system bus 2008, but may also be connected by other interfaces, such as a parallel port, an IEEE 1394 serial port, a game port, a USB port, or an IR interface.

A monitor 2044 or other type of display device is also connected to system bus 2008 through an interface, such as a video adapter 2046. In addition to the monitor 2044, computers typically include other peripheral output devices (not shown) such as speakers, printers, etc.

Computer 2002 may operate in a networked environment using logical connections to one or more remote computers, such as remote computer(s) 2048, via wired and/or wireless communications. Remote computer(s) 2048 may be a workstation, server computer, router, personal computer, portable computer, microprocessor-based entertainment device, peer device, or other common network node, and generally include many or all of the components described for computer 2002, although, for simplicity, only memory storage device 2050 is shown. The logical connections depicted include wired/wireless connections to a local area network (LAN) 2052 and/or a larger network, such as a wide area network (WAN) 2054. Such LAN and WAN networking environments are common in offices and companies and facilitate enterprise-wide computer networks, such as intranets, all of which can be connected to a worldwide computer network, such as the Internet.

When used in a LAN networking environment, computer 2002 is connected to local network 2052 through a wired and/or wireless communication network interface or adapter 2056. Adapter 2056 may facilitate wired or wireless communication to LAN 2052, which may also include a wireless access point installed thereon for communicating with wireless adapter 2056. When used in a WAN networking environment, computer 2002 may include a modem 2058, may be connected to a communication server on WAN 2054, or may have other means of establishing communication over WAN 2054, such as via the Internet. Modem 2058, which may be an internal or external device and a wired or wireless device, is coupled to system bus 2008 via serial port interface 2042. In a networked environment, program modules described for computer 2002, or portions thereof, may be stored in remote memory/storage device 2050. It will be appreciated that the network connections shown are exemplary and that other means of establishing a communications link between computers may be used.

Computer 2002 is operable to communicate with any wireless device or entity deployed and operating in wireless communication, such as a printer, scanner, desktop and/or portable computer, portable data assistant (PDA), communications satellite, any equipment or location associated with a wirelessly detectable tag, and telephone. This includes at least Wi-Fi and Bluetooth wireless technologies. Accordingly, the communication may have a predefined structure, as in a conventional network, or may simply be ad hoc communication between at least two devices.

It is to be understood that the specific order or hierarchy of steps in the processes presented is an example of illustrative approaches. It is to be understood that the specific order or hierarchy of steps in processes may be rearranged within the scope of the present disclosure, based on design priorities. The method claims of this disclosure provide elements of the various steps in a sample order but are not meant to be limited to the particular order or hierarchy presented.

As described above, the relevant content has been described in the best form for carrying out the invention.

The present disclosure can be used in computing devices, systems, and the like that generate TCR information corresponding to pMHC.

Claims

A method of training an artificial intelligence-based prediction model, performed by a computing device, the method comprising:

acquiring first input data corresponding to a major histocompatibility complex (MHC), second input data corresponding to a peptide, and third input data corresponding to a CDR3 of a T cell receptor (TCR) corresponding to the MHC and the peptide, wherein the first input data includes amino acids of a first sequence corresponding to the MHC, the second input data includes amino acids of a second sequence corresponding to the peptide, and the third input data includes amino acids of a third sequence corresponding to the CDR3;

generating a training data set by performing preprocessing, including a grouping process and a segmenting process, on the first input data, the second input data, and the third input data; and

training the artificial intelligence-based prediction model, using the training data set, so that the artificial intelligence-based prediction model generates a prediction result including the third input data from the first input data and the second input data.

The method of claim 1, wherein the prediction result further includes the V type and the J type of the TCR corresponding to the first input data and the second input data.

The method of claim 1, wherein the grouping process includes generating tokens, each having a length corresponding to one of a plurality of amino acid sequences of different lengths, by analyzing the frequency of occurrence of each of the amino acid sequences of different lengths.

The method of claim 3, wherein the grouping process includes:

generating a first token group for a first set of amino acid sequences obtained by analyzing the frequency of occurrence of amino acid sequences having a first length; and

generating a second token group comprising a second set of amino acid sequences obtained by analyzing, with the first set of amino acid sequences included in the first token group excluded, the frequency of occurrence of amino acid sequences having a second length shorter than the first length,

wherein tokens included in the first token group have the first length and tokens included in the second token group have the second length.

The method of claim 3, wherein the frequency of occurrence includes:

a value quantitatively representing the probability of finding a specific amino acid sequence within one CDR3 sequence; or

a value quantitatively representing the ratio of the number of times a specific amino acid sequence is found in all CDR3s to the total number of CDR3s.
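
The two frequency definitions recited above can be illustrated with a minimal Python sketch. The function names are hypothetical, and reading the first definition as "fraction of sliding windows in one CDR3 that equal the sequence" is an assumption made here for illustration only; the claims do not fix a formula.

```python
def count_occurrences(motif: str, sequence: str) -> int:
    """Count occurrences (overlaps allowed) of `motif` inside `sequence`."""
    if not motif:
        return 0
    width = len(motif)
    return sum(
        1
        for i in range(len(sequence) - width + 1)
        if sequence[i:i + width] == motif
    )


def within_sequence_frequency(motif: str, cdr3: str) -> float:
    """Assumed reading of the first definition: the share of windows of
    len(motif) in a single CDR3 sequence that equal the motif."""
    windows = len(cdr3) - len(motif) + 1
    if windows <= 0:
        return 0.0
    return count_occurrences(motif, cdr3) / windows


def across_list_frequency(motif: str, cdr3_list: list[str]) -> float:
    """Second definition: occurrences of the motif in all CDR3s divided by
    the total number of CDR3s."""
    if not cdr3_list:
        return 0.0
    total = sum(count_occurrences(motif, cdr3) for cdr3 in cdr3_list)
    return total / len(cdr3_list)
```

Either value could then be compared with a threshold when deciding which amino acid sequences become tokens.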

The method of claim 3, wherein tokenization is performed on each of the first input data, the second input data, and the third input data based on the tokens generated by the grouping process.
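
As a hedged illustration of tokenizing a sequence with variable-length tokens, the sketch below applies a greedy longest-match rule and falls back to single amino acids when no longer token matches. The greedy rule, the fallback, and the function name `tokenize` are assumptions for illustration; the claims only state that tokenization is based on the generated tokens.

```python
def tokenize(sequence: str, token_list: list[str]) -> list[str]:
    """Split `sequence` into tokens, preferring the longest known token at
    each position and falling back to a single amino acid otherwise."""
    vocab = set(token_list)
    max_len = max((len(token) for token in vocab), default=1)
    tokens = []
    i = 0
    while i < len(sequence):
        longest = min(max_len, len(sequence) - i)
        # Try candidate lengths from longest to shortest; length 1 always matches.
        for length in range(longest, 0, -1):
            piece = sequence[i:i + length]
            if length == 1 or piece in vocab:
                tokens.append(piece)
                i += length
                break
    return tokens
```

For example, with the hypothetical token list `["CAS", "QG"]`, the sequence `CASSQG` is split into `CAS`, `S`, and `QG`.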

The method of claim 6, wherein the segmenting process includes:

generating a first training data set including first tokens corresponding to the first input data and third tokens corresponding to the third input data with a separator token between them, and a second training data set including second tokens corresponding to the second input data and third tokens corresponding to the third input data with the separator token between them,

wherein the third tokens are used as correct answer data in the training of the prediction model, and

the first training data set and the second training data set are learned by different artificial intelligence networks within the prediction model.

The method of claim 6, wherein the segmenting process includes:

generating a third training data set including first tokens corresponding to the first input data, second tokens corresponding to the second input data, a first separator token located between the first tokens and the second tokens, third tokens corresponding to the third input data, and a second separator token located between the second tokens and the third tokens,

wherein the third tokens are used as correct answer data in the training of the prediction model.
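
The two segmenting layouts recited above (pairwise MHC/CDR3 and peptide/CDR3 examples, or one joint MHC/peptide/CDR3 example) can be sketched as follows. The separator string `[SEP]`, the dictionary layout, and the function names are assumptions chosen for illustration and are not taken from the disclosure.

```python
SEP = "[SEP]"  # assumed separator token string


def segment_pairwise(mhc_tokens, peptide_tokens, cdr3_tokens):
    """Build two examples: MHC + SEP + CDR3 and peptide + SEP + CDR3.
    The CDR3 tokens serve as the correct answer data in both."""
    first = {
        "input": list(mhc_tokens) + [SEP] + list(cdr3_tokens),
        "target": list(cdr3_tokens),
    }
    second = {
        "input": list(peptide_tokens) + [SEP] + list(cdr3_tokens),
        "target": list(cdr3_tokens),
    }
    return first, second


def segment_joint(mhc_tokens, peptide_tokens, cdr3_tokens):
    """Build one example: MHC + SEP + peptide + SEP + CDR3."""
    return {
        "input": (
            list(mhc_tokens) + [SEP] + list(peptide_tokens) + [SEP] + list(cdr3_tokens)
        ),
        "target": list(cdr3_tokens),
    }
```

In the pairwise layout, each of the two examples would be fed to a different network within the prediction model; in the joint layout, a single network sees all three token groups.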

The method of claim 1, wherein the grouping process includes:

obtaining a CDR3 list from an external database or a database included in the computing device;

determining, for each set of amino acids whose amino acid sequence length is K, a first frequency of occurrence in the CDR3 list;

determining, among the sets of amino acids whose amino acid sequence length is K, a first set of amino acids whose first frequency of occurrence is equal to or greater than a first predetermined threshold;

determining, for each set of amino acids whose amino acid sequence length is K-1, a second frequency of occurrence in the CDR3 list;

determining, among the sets of amino acids whose amino acid sequence length is K-1, a second set of amino acids whose second frequency of occurrence is equal to or greater than a second predetermined threshold; and

performing tokenization with a length of K for the first set of amino acids and tokenization with a length of K-1 for the second set of amino acids,

wherein K is a natural number of 2 or more, and the first threshold and the second threshold have the same or different values.

The method of claim 9, wherein determining the second frequency of occurrence includes:

determining, for each of the sets of amino acids whose amino acid sequence length is K-1, the second frequency of occurrence within a range obtained by excluding the first set of amino acids from the CDR3 list.

The method of claim 1, wherein the grouping process includes:

obtaining a CDR3 list from an external database or a database included in the computing device;

constructing a token list by including in the token list an (M+1)-th set of amino acids, which is a set, among sets of amino acids whose amino acid sequence length is N-M, whose frequency of occurrence in the CDR3 list is greater than or equal to a predetermined threshold, and by removing the (M+1)-th set of amino acids from the CDR3 list, wherein M is an integer greater than 0 and N-M is a natural number greater than 2;

after constructing the token list, increasing the value of M by 1 and determining whether a termination condition is satisfied;

if the termination condition is not satisfied, performing the constructing of the token list again; and

if the termination condition is satisfied, performing tokenization by generating a token corresponding to each of the sets of amino acids included in the token list, each token having the amino acid sequence length of the corresponding set of amino acids.

The method of claim 11, wherein the termination condition includes a first termination condition corresponding to N-M≤1.

The method of claim 11, wherein the termination condition includes a second termination condition corresponding to N-M≤2, and

performing the tokenization further includes additionally generating a number of tokens corresponding to the number of types of amino acids whose amino acid sequence length is 1.

The method of claim 11, wherein the predetermined threshold is varied so as to have a negative correlation with the magnitude of the value of N-M.
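
The iterative token-list construction recited above (select frequent sequences of length N-M, remove them from the CDR3 list, increase M, check the termination condition) can be sketched in Python as below. This is an illustrative reading, not the disclosed implementation. It assumes that M starts at 0 so the first pass handles length N, that the frequency is the occurrence count divided by the number of CDR3s, and that "removing" cuts matched sequences out and keeps the remaining fragments. All function and parameter names are hypothetical.

```python
import re
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def kmer_frequencies(fragments, k, n_cdr3):
    """Occurrences of each length-k sequence divided by the number of CDR3s."""
    counts = Counter(
        frag[i:i + k] for frag in fragments for i in range(len(frag) - k + 1)
    )
    return {kmer: c / n_cdr3 for kmer, c in counts.items()} if n_cdr3 else {}


def remove_motifs(fragments, motifs):
    """Cut every occurrence of the selected motifs out of the fragments."""
    if not motifs:
        return list(fragments)
    pattern = re.compile("|".join(re.escape(m) for m in sorted(motifs)))
    return [piece for frag in fragments for piece in pattern.split(frag) if piece]


def build_token_list(cdr3_list, n, threshold_for, stop_at=1, add_single_residues=False):
    """Select frequent sequences from length n downward until n - m <= stop_at.

    threshold_for(k) returns the threshold for length k; to follow the
    negative correlation, it should return smaller values for larger k.
    """
    token_list = []
    fragments = list(cdr3_list)
    m = 0
    while True:
        k = n - m
        freqs = kmer_frequencies(fragments, k, len(cdr3_list))
        selected = {kmer for kmer, f in freqs.items() if f >= threshold_for(k)}
        token_list.extend(sorted(selected))
        fragments = remove_motifs(fragments, selected)
        m += 1
        if n - m <= stop_at:  # termination condition
            break
    if add_single_residues:
        # One token per amino acid type (sequence length 1).
        token_list.extend(AMINO_ACIDS)
    return token_list
```

With `stop_at=1` the loop processes lengths N down to 2. With `stop_at=2` and `add_single_residues=True` it stops one length earlier and adds one token per amino acid type, so that any sequence can still be tokenized.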

The method of claim 1, wherein the artificial intelligence-based prediction model includes a recurrent neural network (RNN), a long short-term memory (LSTM) network, or Bidirectional Encoder Representations from Transformers (BERT).

The method of claim 1, wherein training the artificial intelligence-based prediction model includes:

applying a mask to some of the amino acids in the amino acid sequences, and then performing semi-supervised learning to predict the masked amino acids.

The method of claim 16, wherein training the artificial intelligence-based prediction model includes:

applying a mask to amino acids included in the amino acids of the third sequence corresponding to the CDR3, among the amino acid sequences, and then performing the semi-supervised learning to predict the masked amino acids.

The method of claim 16, wherein training the artificial intelligence-based prediction model includes:

when the prediction model is trained over a plurality of epochs, applying the mask at a different position for each epoch or applying the mask with a different size for each epoch.
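
The masking recited above can be sketched as follows, under the assumptions that masking operates on tokens, that only positions in the CDR3 segment are eligible, and that the mask positions are drawn from a random generator seeded per epoch. The `[MASK]` string, the seeding scheme, and the function name are hypothetical.

```python
import random

MASK = "[MASK]"  # assumed mask token string


def mask_cdr3(tokens, cdr3_start, n_masked, epoch, seed=0):
    """Mask `n_masked` positions at or after `cdr3_start`.

    The generator is seeded from (seed, epoch), so the positions are
    reproducible within an epoch and may change from one epoch to the next.
    Passing a different `n_masked` per epoch varies the mask size.
    Returns the masked token list and a {position: original token} mapping.
    """
    rng = random.Random(seed * 1_000_003 + epoch)
    positions = list(range(cdr3_start, len(tokens)))
    chosen = sorted(rng.sample(positions, min(n_masked, len(positions))))
    masked = list(tokens)
    labels = {}
    for pos in chosen:
        labels[pos] = masked[pos]  # correct answer for the masked position
        masked[pos] = MASK
    return masked, labels
```

The returned mapping holds the tokens the model is expected to recover, which corresponds to using the masked amino acids as correct answer data.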

The method of claim 1, wherein training the artificial intelligence-based prediction model includes:

when a plurality of masks exist for one item of training data and the artificial intelligence-based prediction model predicts one mask among the plurality of masks, applying to the other masks an amino acid X that represents the average value of amino acids or represents all amino acids.

The method of claim 1, wherein training the artificial intelligence-based prediction model includes:

training the artificial intelligence-based prediction model by using, as correct answer data, third input data experimentally determined to have no immunogenicity for the first input data and the second input data,

wherein, in the training of the artificial intelligence-based prediction model, the third input data is not randomly generated.

A computer program stored in a computer-readable storage medium, wherein the computer program, when executed by a computing device, causes the computing device to perform operations for training an artificial intelligence-based prediction model, the operations comprising:

an operation of acquiring first input data corresponding to a major histocompatibility complex (MHC), second input data corresponding to a peptide, and third input data corresponding to a CDR3 of a TCR corresponding to the MHC and the peptide, wherein the first input data includes amino acids of a first sequence corresponding to the MHC, the second input data includes amino acids of a second sequence corresponding to the peptide, and the third input data includes amino acids of a third sequence corresponding to the CDR3;

an operation of generating a training data set by performing preprocessing, including a grouping process and a segmenting process, on the first input data, the second input data, and the third input data; and

an operation of training the artificial intelligence-based prediction model, using the training data set, so that the artificial intelligence-based prediction model generates a prediction result including the third input data from the first input data and the second input data.

A computing device comprising:

at least one processor; and

a memory,

wherein the at least one processor is configured to perform:

an operation of acquiring first input data corresponding to a major histocompatibility complex (MHC), second input data corresponding to a peptide, and third input data corresponding to a CDR3 of a TCR corresponding to the MHC and the peptide, wherein the first input data includes amino acids of a first sequence corresponding to the MHC, the second input data includes amino acids of a second sequence corresponding to the peptide, and the third input data includes amino acids of a third sequence corresponding to the CDR3;

an operation of generating a training data set by performing preprocessing, including a grouping process and a segmenting process, on the first input data, the second input data, and the third input data; and

an operation of training the artificial intelligence-based prediction model, using the training data set, so that the artificial intelligence-based prediction model generates a prediction result including the third input data from the first input data and the second input data.