KR102558550B1

KR102558550B1 - Apparatus and method for generating prediction result for TCR using artificial intelligence technology

Info

Publication number: KR102558550B1
Application number: KR1020230007826A
Authority: KR
Inventors: 송성재; 서정한; 임채열
Original assignee: 주식회사 네오젠티씨
Priority date: 2023-01-19
Filing date: 2023-01-19
Publication date: 2023-07-24
Also published as: KR20240115707A

Abstract

Disclosed is a method for generating a prediction result using artificial intelligence technology, performed by a computing device. The method may comprise the steps of: obtaining first data corresponding to CDR3α of a TCR and second data corresponding to CDR3β of the TCR, wherein the first data includes an amino acid sequence corresponding to the CDR3α, and the second data includes an amino acid sequence corresponding to the CDR3β; receiving the first data and the second data and determining whether the first data including the CDR3α and the second data including the CDR3β correspond to a precedence/subsequence relationship using a first model based on artificial intelligence; and storing, when a result indicating that the first data and the second data correspond to the precedence/subsequence relationship is output, a combination of the first data and the second data in a CDR set candidate list. According to the present invention, the method is capable of more efficiently and accurately predicting, determining, or identifying a complementarity determining region of a TCR.

Description

Method and apparatus for generating prediction results for TCR using artificial intelligence technology

The present disclosure relates to artificial intelligence technology and, more specifically, to deriving a T cell receptor (TCR) configuration that is effective for inducing a cellular response using artificial intelligence technology.

The major histocompatibility complex (MHC) is a genetic locus that encodes the 'MHC molecules' that act in the immune system. MHC molecules exist as class I and class II. The immunopeptidome refers to the set of peptides presented on the cell surface; for example, it may refer to the combination of peptides associated with MHC. Human leukocyte antigen (HLA) is a glycoprotein molecule produced by the genes of the human major histocompatibility complex. HLA is absent from mature erythrocytes but is expressed in immature erythroblasts and on the surface of all tissue cells in the human body, including blood cells such as leukocytes and/or platelets. MHC genes are present in all vertebrates; the human MHC genes are called HLA genes, and the products expressed from them are called HLA. MHC genes are involved in the recognition of self and non-self, immune responses to antigenic stimulation, regulation of cellular and humoral immunity, and susceptibility to disease. HLA, the product of the MHC genes, is the most important antigen after the ABO blood group for the survival of a transplanted organ in solid organ transplantation.

Like MHC, HLA can be broadly classified into Class I and Class II. Class I is divided into HLA-A, HLA-B, and HLA-C, is expressed on most nucleated cells and platelets, and is essential for antigen recognition when cytotoxic T cells recognize and eliminate virus-infected cells or tumor cells. HLA Class II is divided into HLA-DR, HLA-DQ, and HLA-DP and is expressed on B cells, monocytes, dendritic cells, and activated T cells. It interacts with the antigen receptor of helper T cells to induce cellular and humoral immune responses, and is known to be essential for the recognition of antigens presented on antigen-presenting cells. The HLA genes show the greatest polymorphism among human genes, and their frequencies differ among races and ethnic groups.

When a peptide derived from a protein of an infectious microorganism or from a protein specific to cancer cells binds to MHC and is presented on the cell surface, T cells recognize it and trigger an immune response that eliminates the infected cells or cancer cells. T cells are thus key players that determine specific immune responses to foreign substances not present in the normal human body. Accordingly, predicting the T cell receptor (TCR) that binds to a peptide-MHC complex (pMHC) can be used to develop, for example, personalized vaccines for the prevention of infectious diseases or cancer.

The manufacturing process of a T cell receptor-engineered T cell (TCR-T) includes creating a gene sequence for the TCR, introducing the TCR gene sequence into T cells isolated from the patient's cells, and expanding and culturing the T cells into which the TCR has been introduced. Viral vectors are mainly used to introduce the TCR gene into T cells; the process also involves activating the T cells for the introduction and treating them with cytokines such as interleukin-2 for proliferation of the TCR-T cells. To clone the TCR gene, T cells with high affinity for the MHC-tumor antigen are selected using a complex of MHC and an antigen epitope peptide, and the TCR gene of the selected cells is then cloned. The genes for the TCR α chain and β chain each undergo codon optimization, a process known to govern the enhancement of TCR expression on the T cell surface.

Korean Registered Patent No. 10-2322832

The present disclosure has been devised in view of the background art described above, and aims to predict or identify a peptide-MHC complex and a TCR that binds to it more efficiently and/or more accurately.

The technical problems addressed by the present disclosure are not limited to those mentioned above, and other technical problems not mentioned will be clearly understood by those skilled in the art from the description below.

According to one embodiment of the present disclosure, a method performed by a computing device is disclosed. The method may include: obtaining first data corresponding to CDR3α of a T cell receptor (TCR) and second data corresponding to CDR3β of the TCR, wherein the first data includes an amino acid sequence corresponding to the CDR3α and the second data includes an amino acid sequence corresponding to the CDR3β; receiving the first data and the second data and determining, using an artificial intelligence-based first model, whether the first data including the CDR3α and the second data including the CDR3β correspond to a precedence relationship; and, when a result indicating that the first data and the second data correspond to the precedence relationship is output, storing the combination of the first data and the second data in a CDR set candidate list.
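The three steps of this method (obtain the sequences, query the first model, store accepted pairs) can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the function names, the exhaustive pairing of every CDR3α with every CDR3β, the example sequences, and the length-based rule standing in for the trained first model are all hypothetical.

```python
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]  # (CDR3α amino acid sequence, CDR3β amino acid sequence)


def build_cdr_candidate_list(
    alphas: Iterable[str],
    betas: Iterable[str],
    is_precedence_pair: Callable[[str, str], bool],
) -> List[Pair]:
    """Store every (CDR3α, CDR3β) pair that the first model accepts."""
    beta_list = list(betas)
    candidates: List[Pair] = []
    for alpha in alphas:
        for beta in beta_list:
            # The first model decides whether the two sequences
            # correspond to a precedence relationship.
            if is_precedence_pair(alpha, beta):
                candidates.append((alpha, beta))
    return candidates


def toy_first_model(alpha: str, beta: str) -> bool:
    # Hypothetical placeholder rule; a trained AI-based model would go here.
    return abs(len(alpha) - len(beta)) <= 2


# Illustrative sequences only, not data from the disclosure.
candidates = build_cdr_candidate_list(
    ["CAVRDSNYQLIW", "CAAS"], ["CASSLGQAYEQYF"], toy_first_model
)
print(candidates)
```

Passing the model in as a callable keeps the storage logic independent of whichever architecture is used for the first model.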

In one embodiment, the first data corresponding to CDR3α of the TCR is generated by a second model different from the first model, the second model being an artificial intelligence-based model pre-trained to receive a peptide-MHC complex (pMHC) as input and output the first data including the CDR3α corresponding to the pMHC; and the second data corresponding to CDR3β of the TCR is generated by a third model different from the first model, the third model being an artificial intelligence-based model pre-trained to receive the pMHC as input and output the second data including the CDR3β corresponding to the pMHC.

In one embodiment, the first data corresponding to CDR3α of the TCR and the second data corresponding to CDR3β of the TCR may be generated by a fourth model different from the first model.

In one embodiment, the fourth model may include an encoder and a decoder.

In one embodiment, the fourth model may be an artificial intelligence-based model pre-trained such that a first input dataset including amino acid sequences corresponding to a peptide and an MHC is input to the encoder, amino acid sequences corresponding to the CDR3α and CDR3β associated with the peptide and MHC are input to the decoder, and the decoder outputs a prediction result including the first data.
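One way such encoder and decoder inputs might be prepared is sketched below. The text does not specify a tokenization or training scheme, so everything here is an assumption for illustration: the 20-letter amino acid vocabulary, the special tokens, the use of a separator between sequences, the shifted decoder target (conventional teacher forcing), and the example MHC fragment.

```python
from typing import List, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
SPECIALS = ["<pad>", "<bos>", "<eos>", "<sep>"]  # hypothetical special tokens
VOCAB = {tok: i for i, tok in enumerate(SPECIALS + list(AMINO_ACIDS))}


def encode(seq: str) -> List[int]:
    """Map an amino acid sequence to integer token ids."""
    return [VOCAB[aa] for aa in seq]


def make_training_example(
    peptide: str, mhc: str, cdr3a: str, cdr3b: str
) -> Tuple[List[int], List[int], List[int]]:
    # Encoder side: peptide and MHC sequences joined by a separator.
    encoder_input = encode(peptide) + [VOCAB["<sep>"]] + encode(mhc)
    # Decoder side: CDR3α and CDR3β joined by a separator.
    full = (
        [VOCAB["<bos>"]]
        + encode(cdr3a)
        + [VOCAB["<sep>"]]
        + encode(cdr3b)
        + [VOCAB["<eos>"]]
    )
    decoder_input = full[:-1]   # what the decoder sees at each step
    decoder_target = full[1:]   # what it is trained to predict next
    return encoder_input, decoder_input, decoder_target


# Illustrative values; "YFAMYQ" is a made-up stand-in for an MHC sequence.
enc, dec_in, dec_out = make_training_example(
    "GILGFVFTL", "YFAMYQ", "CAVRDSNYQLIW", "CASSLGQAYEQYF"
)
print(len(enc), len(dec_in), len(dec_out))
```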

In one embodiment, the first model may be pre-trained based on: generating a first negative dataset by randomly combining a first dataset corresponding to CDR3α or a second dataset corresponding to CDR3β of the TCR; identifying, in the randomly combined first negative dataset, CDR3α and CDR3β identified as binding to each other; generating a second negative dataset by excluding the combinations of CDR3α and CDR3β identified as binding from the first negative dataset; and including the combinations of CDR3α and CDR3β identified as binding in a first positive dataset.
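The dataset construction described above can be sketched with plain set operations. This is a hedged illustration: the function name, the sampling scheme (one random CDR3α paired with one random CDR3β per draw), the `known_binding` lookup used to identify binding pairs, and the example sequences are assumptions, since the text does not say how binding pairs are identified.

```python
import random
from typing import List, Set, Tuple

Pair = Tuple[str, str]  # (CDR3α sequence, CDR3β sequence)


def build_datasets(
    alphas: List[str],
    betas: List[str],
    known_binding: Set[Pair],
    n_samples: int,
    seed: int = 0,
) -> Tuple[Set[Pair], Set[Pair], Set[Pair]]:
    rng = random.Random(seed)
    # Step 1: first negative dataset from random α/β combinations.
    first_negative = {
        (rng.choice(alphas), rng.choice(betas)) for _ in range(n_samples)
    }
    # Step 2: find random pairs that are in fact identified as binding.
    identified = first_negative & known_binding
    # Step 3: second negative dataset excludes the identified pairs.
    second_negative = first_negative - identified
    # Step 4: the identified pairs go into the first positive dataset.
    first_positive = set(identified)
    return first_negative, second_negative, first_positive


# Illustrative sequences only, not data from the disclosure.
alphas = ["CAVRDSNYQLIW", "CAASGGSYIPTF", "CAVNTGNQFYF"]
betas = ["CASSLGQAYEQYF", "CASSIRSSYEQYF", "CASSPGTGELFF"]
known_binding = {("CAVRDSNYQLIW", "CASSLGQAYEQYF")}
first_neg, second_neg, first_pos = build_datasets(
    alphas, betas, known_binding, n_samples=20
)
print(len(first_neg), len(second_neg), len(first_pos))
```

Removing true binders from the random pairs avoids training the first model on pairs that are labeled negative but actually bind.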

In one embodiment, the combinations of CDR3α and CDR3β included in the second negative dataset may be labeled, in the training process of the first model, with ground-truth data indicating that they do not correspond to a precedence relationship.

In one embodiment, the combinations of CDR3α and CDR3β included in the first positive dataset may be labeled, in the training process of the first model, with ground-truth data indicating that they correspond to a precedence relationship.
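The labeling of the two datasets can be sketched as below. The integer encoding (1 for "corresponds to a precedence relationship", 0 for "does not") and the function name are assumptions chosen for illustration; the text only states that each dataset receives the corresponding ground-truth label.

```python
from typing import List, Set, Tuple

Pair = Tuple[str, str]  # (CDR3α sequence, CDR3β sequence)


def label_pairs(
    first_positive: Set[Pair], second_negative: Set[Pair]
) -> List[Tuple[str, str, int]]:
    """Attach ground-truth labels for training the first model."""
    # Positive pairs: labeled as corresponding to a precedence relationship.
    labeled = [(a, b, 1) for a, b in sorted(first_positive)]
    # Negative pairs: labeled as not corresponding to one.
    labeled += [(a, b, 0) for a, b in sorted(second_negative)]
    return labeled


# Illustrative sequences only.
examples = label_pairs(
    {("CAVRDSNYQLIW", "CASSLGQAYEQYF")},
    {("CAASGGSYIPTF", "CASSPGTGELFF")},
)
print(examples)
```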

In one embodiment, the first model may be pre-trained additionally based on: identifying, among the combinations of CDR3α and CDR3β included in the second negative dataset, at least one similar combination whose similarity to a combination of CDR3α and CDR3β included in the first positive dataset is equal to or greater than a predetermined threshold; and including the identified similar combination in the first positive dataset.
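The similarity-based step can be sketched as follows. The text does not define the similarity measure, so this example assumes one: the ratio from Python's `difflib.SequenceMatcher`, computed per chain and combined by taking the minimum. The function names, the 0.9 threshold, and the sequences are likewise hypothetical.

```python
from difflib import SequenceMatcher
from typing import Set, Tuple

Pair = Tuple[str, str]  # (CDR3α sequence, CDR3β sequence)


def pair_similarity(p: Pair, q: Pair) -> float:
    # Assumed measure: a pair is only as similar as its less similar chain.
    sim_alpha = SequenceMatcher(None, p[0], q[0]).ratio()
    sim_beta = SequenceMatcher(None, p[1], q[1]).ratio()
    return min(sim_alpha, sim_beta)


def promote_similar(
    first_positive: Set[Pair], second_negative: Set[Pair], threshold: float
) -> Tuple[Set[Pair], Set[Pair]]:
    """Return the extended positive set and the similar pairs that were added."""
    similar = {
        neg
        for neg in second_negative
        if any(pair_similarity(neg, pos) >= threshold for pos in first_positive)
    }
    return first_positive | similar, similar


# Illustrative sequences only.
positives = {("CAVRDSNYQLIW", "CASSLGQAYEQYF")}
negatives = {
    ("CAVRDSNYQLIW", "CASSLGQGYEQYF"),  # one substitution in the β chain
    ("CAAS", "CSVGTGGTNEKLFF"),         # unrelated pair
}
extended, similar = promote_similar(positives, negatives, threshold=0.9)
print(similar)
```

The sketch leaves the second negative dataset untouched; whether promoted pairs should also be removed from it is not stated in the text and is left to the implementer.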

In one embodiment, the first model may include a Recurrent Neural Network (RNN), a Long Short Term Memory (LSTM) network, a Bidirectional Long Short Term Memory (BiLSTM) network, a Generative Pre-trained Transformer (GPT), a diffusion model, Bidirectional Encoder Representations from Transformers (BERT), spanBERT, a Gated Recurrent Unit (GRU), or a Bidirectional Gated Recurrent Unit (BiGRU).

In one embodiment, the CDR set candidate list may refer to combinations of a TCR α chain and *?* included in TCRs capable of binding to pMHC.

In one embodiment, the combination of the first data and the second data stored in the CDR set candidate list may be used to generate a TCR-T.

In one embodiment, a computer program stored on a computer-readable storage medium is disclosed. The computer program, when executed by a computing device, causes the computing device to perform operations for generating a prediction result using artificial intelligence technology. The operations may include: obtaining first data corresponding to CDR3α of a TCR and second data corresponding to CDR3β of the TCR, wherein the first data includes an amino acid sequence corresponding to the CDR3α and the second data includes an amino acid sequence corresponding to the CDR3β; receiving the first data and the second data and determining, using an artificial intelligence-based first model, whether the first data including the CDR3α and the second data including the CDR3β correspond to a precedence relationship; and, when a result indicating that the first data and the second data correspond to the precedence relationship is output, storing the combination of the first data and the second data in a CDR set candidate list.

A computing device according to one embodiment is disclosed. The computing device may include at least one processor and a memory. The at least one processor may perform operations of: obtaining first data corresponding to CDR3α of a TCR and second data corresponding to CDR3β of the TCR, wherein the first data includes an amino acid sequence corresponding to the CDR3α and the second data includes an amino acid sequence corresponding to the CDR3β; receiving the first data and the second data and determining, using an artificial intelligence-based first model, whether the first data including the CDR3α and the second data including the CDR3β correspond to a precedence relationship; and, when a result indicating that the first data and the second data correspond to the precedence relationship is output, storing the combination of the first data and the second data in a CDR set candidate list.

The method and apparatus according to one embodiment of the present disclosure can predict, determine, or identify the complementarity determining region of a TCR more efficiently and/or more accurately.

FIG. 1 schematically illustrates a block diagram of a computing device according to one embodiment of the present disclosure.
FIG. 2 illustrates an exemplary structure of an artificial intelligence-based model according to one embodiment of the present disclosure.
FIG. 3 illustrates an exemplary method for obtaining prediction results related to CDR3α and CDR3β of a TCR from input data using an artificial intelligence-based prediction model, according to one embodiment of the present disclosure.
FIG. 4 illustrates, by way of example, a method for generating a prediction result including combination information for CDR3α of a TCR and CDR3β of the TCR, according to one embodiment of the present disclosure.
FIG. 5 illustrates, by way of example, a method for generating data including information on CDR3α of a TCR or CDR3β of the TCR, according to one embodiment of the present disclosure.
FIG. 6 illustrates, by way of example, first positive data, first negative data, and second negative data including information related to combinations of CDR3α of a TCR and CDR3β of the TCR, according to one embodiment of the present disclosure.
FIG. 7 illustrates, by way of example, the step of storing a combination of CDR3α and CDR3β in a CDR set candidate list, according to one embodiment of the present disclosure.
FIG. 8 is a schematic diagram of a computing environment according to one embodiment of the present disclosure.

Various embodiments are described with reference to the drawings. In this specification, various descriptions are presented to provide an understanding of the present disclosure. Before the specific details for carrying out the present disclosure are described, it should be noted that configurations not directly related to the technical gist of the present disclosure have been omitted to the extent that doing so does not obscure the technical gist of the present invention. In addition, the terms and words used in this specification and the claims should be interpreted with meanings and concepts consistent with the technical idea of the present invention, based on the principle that an inventor may define appropriate terms in order to describe the invention in the best way.

As used herein, the terms "component," "module," "system," "unit," and the like refer to a computer-related entity, hardware, firmware, software, a combination of software and hardware, or software in execution, and may be used interchangeably. For example, a component may be, but is not limited to, a procedure running on a processor, a processor, an object, a thread of execution, a program, and/or a computer. For example, both an application running on a computing device and the computing device itself may be components. One or more components may reside within a processor and/or a thread of execution. A component may be localized on one computer or distributed between two or more computers. Such components may also execute from various computer-readable media having various data structures stored thereon. Components may communicate by way of local and/or remote processes, for example in accordance with a signal having one or more data packets (for example, data from one component interacting with another component in a local system or a distributed system, and/or data transmitted by way of the signal to other systems across a network such as the Internet).

In addition, the term "or" is intended to mean an inclusive "or" rather than an exclusive "or." That is, unless otherwise specified or clear from the context, "X uses A or B" is intended to mean any of the natural inclusive permutations. In other words, if X uses A, X uses B, or X uses both A and B, then "X uses A or B" applies in any of these cases. Also, the term "and/or" as used herein should be understood to refer to and include all possible combinations of one or more of the listed related items.

Also, the terms "comprises" and/or "comprising" should be understood to mean that the corresponding features and/or components are present. However, the terms "comprises" and/or "comprising" should be understood as not excluding the presence or addition of one or more other features, components, and/or groups thereof. Also, unless otherwise specified or unless the context clearly indicates a singular form, the singular in this specification and the claims should generally be construed to mean "one or more."

In addition, the expression "at least one of A or B" or "at least one of A and B" should be interpreted to mean "the case including only A," "the case including only B," or "the case in which A and B are combined."

Those skilled in the art should further appreciate that the various illustrative logical components, blocks, modules, circuits, means, logic, and algorithms described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate the interchangeability of hardware and software, various illustrative components, blocks, means, logic, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or as software depends on the particular application and the design constraints imposed on the overall system. Skilled artisans may implement the described functionality in various ways for each particular application. However, such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.

The description of the presented embodiments is provided to enable any person of ordinary skill in the technical field of the present disclosure to use or practice the present invention. Various modifications to these embodiments will be apparent to those of ordinary skill in the technical field of the present disclosure. The general principles defined herein may be applied to other embodiments without departing from the scope of the present disclosure. Thus, the present invention is not limited to the embodiments presented herein, and is to be accorded the widest scope consistent with the principles and novel features presented herein.

In the present disclosure, terms expressed as Nth, such as first, second, or third, are used to distinguish at least one entity. For example, entities expressed as first and second may be the same as or different from each other.

For convenience of description, the present disclosure uses human leukocyte antigen (HLA) as an example of the major histocompatibility complex (MHC). Accordingly, any description of HLA or MHC below is an example standing for a description of MHC or HLA. The scope of rights of the present disclosure is to be determined by the claims, and the use of HLA as an example should not be construed as limiting that scope to HLA. As such, HLA and MHC may be used interchangeably in the present disclosure.

As used in the present disclosure, the term "human leukocyte antigen (HLA)" refers to a glycoprotein molecule produced by the human MHC genes, which show the greatest polymorphism among human genes. HLA typing, which determines the HLA type, can be actively used in various fields such as organ transplantation, immunotherapy, disease-related research, paternity testing, forensic applications, and genetic research.

An HLA type in the present disclosure may include, for example, an HLA-A type, an HLA-B type, and/or an HLA-C type.

The conjugate of MHC and peptide in the present disclosure may refer to a complex in which a peptide antigen, produced through processing by the proteasome in an antigen-presenting cell, is presented by an MHC class I molecule. The proteasome is composed of two subunits, LMP-2 and LMP-7 (low molecular weight polypeptides). These two proteasome subunits are located near the TAP-1 and TAP-2 genes within the MHC genes. The proteasome subunits are particularly important for the degradation that yields peptides binding to MHC I molecules. Treating cells with the cytokine interferon gamma (IFN-γ) can induce the expression of LMP-2 and LMP-7. Expression of LMP-2 and LMP-7 changes the substrate specificity of the proteasome and increases its ability to degrade proteins into peptides. IFN-γ increases the expression not only of the LMP proteins but also of other proteins involved in antigen presentation, such as MHC I and MECL-1, so that antigen presentation by antigen-presenting cells can be increased.

MHC I molecules bind not only to peptide antigens degraded by the proteasome but also to peptides generated by proteases present in the endoplasmic reticulum.

Tapasin serves as a bridge between TAP-1 (transporter associated with antigen processing) and MHC I that has formed a stable tertiary structure in the endoplasmic reticulum (ER). When a peptide enters, the MHC I complex dissociates from the tapasin and TAP protein to which it was bound and becomes a complete peptide-MHC class I complex.

There are two types of T cell receptor (TCR), and TCRs consist mainly of TCRα and TCRβ. The TCRα chain and the TCRβ chain are located at independent loci on chromosome 14 and chromosome 7, respectively. Like the immunoglobulin (Ig) heavy chain, TCRβ is composed of V, D, J, and C gene segments; like the Ig light chain, the TCRα chain has no D gene segment and is composed of V, J, and C gene segments.

In TCR recombination, the TCRα chain is encoded by V, J, and C gene segments, similar to the immunoglobulin light chain, and the TCRβ chain is formed by a gene recombination process in which D-J joining is followed by V joining, and C is then joined to the resulting V-D-J. The V-(D)-J joining is mediated by RAG-1 and RAG-2 (recombination-activating genes).

The hairpin structure is cleaved by an endonuclease, and P-nucleotides complementary to the resulting short single strand of DNA are added. In addition, 1 to 20 nucleotides may be added at random to the cleaved ends; these are referred to as N-nucleotides, and this process is mediated by TdT (terminal deoxynucleotidyl transferase). P- and N-nucleotides increase the diversity of T cell receptors.

T cells recognize only foreign antigens presented on the cell surface, and activation of mature peripheral T cells begins with the interaction between the TCR and the antigen peptide presented in the peptide-binding cleft of the MHC molecule.

In particular, most immune profiling focuses on analysis of the CDR3 (complementarity determining region 3) region of the TCR sequence. The CDR3 region is a key region involved in the interaction between antigen and receptor, and it is where the greatest variation is observed.

Embodiments of the present disclosure are directed primarily to the immune activity of T cells induced through the interaction between a peptide-MHC complex (pMHC) and a TCR. The embodiments include a step of deriving the sequence of a TCR, including its CDR3, that corresponds to a complex (pMHC) of a peptide, including antigenic peptides such as neoantigens, CMV, EBV, and KRAS, and an MHC molecule that depends on the HLA type and on the race-specific or individual-specific phenotype.

FIG. 1 schematically illustrates a block diagram of a computing device 100 according to one embodiment of the present disclosure.

The computing device 100 according to one embodiment of the present disclosure may include a processor 110 and a memory 130.

The configuration of the computing device 100 shown in FIG. 1 is only a simplified example. In one embodiment of the present disclosure, the computing device 100 may include other components for implementing its computing environment, and only some of the disclosed components may constitute the computing device 100.

The computing device 100 in the present disclosure may refer to any type of node constituting a system for implementing embodiments of the present disclosure. The computing device 100 may refer to any type of user terminal or any type of server. The components of the computing device 100 described above are exemplary; some may be omitted, or additional components may be included. For example, when the computing device 100 includes a user terminal, an output unit (not shown) and an input unit (not shown) may be included within the scope of the computing device 100.

The computing device 100 in the present disclosure may carry out the technical features according to the embodiments of the present disclosure described below. For example, the computing device 100 may use an artificial intelligence-based predictive model that takes input data corresponding to a peptide, an MHC, and a TCR to generate a prediction result that includes information on CDR3α of the TCR α chain and information on CDR3β of the TCR β chain corresponding to the input data. For example, the computing device 100 may generate a prediction result for a TCR (e.g., CDR3, CDR1, and/or CDR2) using a peptide and an MHC from a sample obtained from a subject. For example, the computing device 100 may obtain information on a peptide, an MHC, and a TCR (e.g., CDR3, CDR1, or CDR2) from a sample obtained from a subject and, based on that information, generate a prediction result corresponding to the CDR3α amino acid sequence and the CDR3β amino acid sequence of the TCR.
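As an illustration of how CDR3α and CDR3β amino acid sequences might be turned into model input, the following is a minimal sketch. The disclosure does not specify an encoding, so the integer vocabulary, the padding scheme, the `max_len` value, the function name `encode_cdr3`, and the example sequences are all assumptions introduced here.

```python
# Standard 20 amino acids in one-letter code; index 0 is reserved for padding.
# This vocabulary and the padding scheme are assumptions, not part of the disclosure.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
TOKEN_IDS = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}
PAD_ID = 0


def encode_cdr3(sequence, max_len=24):
    """Map a CDR3 amino acid string to a fixed-length list of integer tokens."""
    seq = sequence.strip().upper()
    unknown = sorted(set(seq) - set(AMINO_ACIDS))
    if unknown:
        raise ValueError(f"unsupported residue(s): {unknown}")
    if len(seq) > max_len:
        raise ValueError(f"sequence longer than max_len={max_len}")
    ids = [TOKEN_IDS[aa] for aa in seq]
    # Right-pad with PAD_ID so every encoded sequence has the same length.
    return ids + [PAD_ID] * (max_len - len(ids))


# Hypothetical example sequences, used only to show the call shape.
cdr3_alpha = encode_cdr3("CAVRDSNYQLIW")
cdr3_beta = encode_cdr3("CASSLGQAYEQYF")
```

A fixed length is used here only so that the two encoded sequences can be batched together; a model that handles variable-length input would not need the padding.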

In one embodiment of the present disclosure, the computing device 100 may obtain the result of sequencing (e.g., next-generation sequencing) from a server, an external entity, or the like. In another embodiment, the computing device 100 may itself perform sequencing analysis on proteomic data and/or genetic data (e.g., DNA or RNA) obtained from a biological sample derived from a subject. As used in the present disclosure, the term "sequencing" covers any type of technique capable of determining a sequence of bases, and may include, for example, whole genome sequencing, whole exome sequencing, or whole transcriptome sequencing, but is not limited thereto.

As used in the present disclosure, the term "subject" may refer to an individual from whom a biological sample containing a peptide, an MHC, a TCR, and/or a complex thereof is obtained.

As used in the present disclosure, the term "sample" may be any sample, without limitation, obtained from an individual or subject whose MHC type and/or TCR is to be determined, and may be, for example, cells or tissue obtained by biopsy or the like, blood, whole blood, serum, plasma, saliva, cerebrospinal fluid, various secretions, urine, and/or feces. Preferably, the sample may be selected from the group consisting of blood, plasma, serum, saliva, nasal fluid, sputum, ascites, vaginal secretions, and/or urine, and more preferably may be blood, plasma, or serum. The sample may be pretreated before being used for detection or diagnosis. For example, pretreatment methods may include homogenization, filtration, distillation, extraction, concentration, inactivation of interfering components, and/or addition of reagents. In the present disclosure, the biological sample may be tissue, cells, whole blood, and/or blood, but is not limited thereto.

In one embodiment, the processor 110 may consist of at least one core and may include a processor for data analysis and/or processing, such as a central processing unit (CPU), a general purpose graphics processing unit (GPGPU), or a tensor processing unit (TPU) of the computing device 100.

The processor 110 may read a computer program stored in the memory 130 and, according to one embodiment of the present disclosure, generate a prediction result that includes the immunogenicity of a pMHC-TCR.

According to one embodiment of the present disclosure, the processor 110 may perform operations for training a neural network. The processor 110 may perform the calculations needed to train a neural network in deep learning (DL), such as processing input data for training, extracting features from the input data, calculating errors, and updating the weights of the neural network using backpropagation. At least one of the CPU, GPGPU, and TPU of the processor 110 may handle training of a network function. For example, the CPU and the GPGPU may together handle training of a network function and data classification using the network function. In addition, in one embodiment of the present disclosure, the processors of a plurality of computing devices may be used together to handle training of a network function and data classification using the network function. Furthermore, a computer program executed on a computing device according to one embodiment of the present disclosure may be a CPU-, GPGPU-, or TPU-executable program.

Additionally, the processor 110 may typically handle the overall operation of the computing device 100. For example, the processor 110 may provide or process information or functions appropriate for the user by processing data, information, or signals that are input or output through the components included in the computing device 100, or by running an application program stored in a storage unit.

According to one embodiment of the present disclosure, the memory 130 may store any type of information generated or determined by the processor 110 and any type of information received by the computing device 100. According to one embodiment of the present disclosure, the memory 130 may be a storage medium that stores computer software that causes the processor 110 to perform operations according to embodiments of the present disclosure. Accordingly, the memory 130 may refer to computer-readable media for storing the software code needed to carry out embodiments of the present disclosure, the data on which the code operates, and the results of executing the code.

According to one embodiment of the present disclosure, the memory 130 may refer to any type of storage medium. For example, the memory 130 may include at least one type of storage medium among a flash memory type, a hard disk type, a multimedia card micro type, a card-type memory (e.g., SD or XD memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, a magnetic disk, and an optical disk. The computing device 100 may also operate in connection with web storage that performs the storage function of the memory 130 over the Internet. The foregoing description of the memory is only an example, and the memory 130 used in the present disclosure is not limited to the examples above.

The communication unit (not shown) in the present disclosure may be configured regardless of communication mode, whether wired or wireless, and may be configured with various communication networks such as a personal area network (PAN) and a wide area network (WAN). In addition, the communication unit may operate based on the well-known World Wide Web (WWW), and may use a wireless transmission technology used for short-range communication, such as Infrared Data Association (IrDA) or Bluetooth.

The computing device 100 in the present disclosure may include any type of user terminal and/or any type of server. Accordingly, embodiments of the present disclosure may be performed by a server and/or a user terminal.

A user terminal may include any type of terminal capable of interacting with a server or another computing device. The user terminal may include, for example, a mobile phone, a smartphone, a laptop computer, a personal digital assistant (PDA), a slate PC, a tablet PC, and an ultrabook. A server may include any type of computing system or computing device, such as a microprocessor, a mainframe computer, a digital processor, a portable device, or a device controller.

In a further embodiment, the server described above may refer to an entity that stores and manages TCR information, immunopeptidome information, peptide sequence information, MHC molecule information, nucleotide sequence information, or genetic information. The server may include a storage unit (not shown) for storing immunopeptidome information, peptide sequence information, information on amino acid identifiers by position, nucleotide sequence information, genetic information, or reliability information for databases (e.g., Expasy, VDJdb, TCR3F, huARdb, IMGT, McPAS). The storage unit may be included in the server or may exist under the management of the server. As another example, the storage unit may exist outside the server and be implemented in a form capable of communicating with the server. In this case, the storage unit may be managed and controlled by an external server different from the server.

FIG. 2 illustrates an exemplary structure of an artificial intelligence-based model according to one embodiment of the present disclosure.

Throughout this specification, the terms predictive model, artificial intelligence-based predictive model, artificial intelligence model, artificial intelligence-based model, computational model, network function, and neural network may be used interchangeably.

A neural network may consist of a set of interconnected computational units, which may generally be referred to as nodes. These nodes may also be referred to as neurons. A neural network includes at least one node. The nodes (or neurons) constituting a neural network may be interconnected by one or more links.

Within a neural network, nodes connected through a link may form a relative relationship of input node and output node. The concepts of input node and output node are relative: a node that is an output node with respect to one node may be an input node with respect to another node, and vice versa. As described above, the input node to output node relationship is defined with respect to a link. One or more output nodes may be connected to a single input node through links, and vice versa.

In an input node and output node relationship connected through a single link, the value of the data at the output node may be determined based on the data input to the input node. Here, the link interconnecting the input node and the output node may have a weight. The weight may be variable and may be changed by a user or an algorithm so that the neural network performs a desired function. For example, when one or more input nodes are each connected to a single output node by a respective link, the output node may determine its value based on the values input to the connected input nodes and on the weights set for the links corresponding to the respective input nodes.
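The relationship between input values, link weights, and the output node value can be sketched as follows. The disclosure does not fix a particular formula, so the weighted sum with an optional bias used here is an assumption (it is the most common choice), and the name `output_node_value` is hypothetical.

```python
def output_node_value(input_values, link_weights, bias=0.0):
    """Combine input node values with the weights of their links.

    A plain weighted sum is assumed here; real networks usually also
    apply a nonlinear activation function to this sum.
    """
    if len(input_values) != len(link_weights):
        raise ValueError("each input node needs exactly one link weight")
    return bias + sum(x * w for x, w in zip(input_values, link_weights))


# Two input nodes feeding one output node through links weighted 0.5 and -1.0.
value = output_node_value([1.0, 2.0], [0.5, -1.0])
```

Changing either link weight changes `value` even though the inputs stay the same, which is the sense in which the weights determine the function the network performs.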

As described above, in a neural network one or more nodes are interconnected through one or more links to form input node and output node relationships within the network. The characteristics of the neural network may be determined by the number of nodes and links, the connections between the nodes and links, and the weight value assigned to each link. For example, when two neural networks have the same number of nodes and links but different link weight values, the two neural networks may be regarded as different from each other.

A neural network may consist of a set of one or more nodes. A subset of the nodes constituting the neural network may constitute a layer. Some of the nodes constituting the neural network may constitute one layer based on their distance from the initial input node. For example, the set of nodes at distance n from the initial input node may constitute layer n. The distance from the initial input node may be defined as the minimum number of links that must be traversed to reach the node from the initial input node. However, this definition of a layer is arbitrary and given for explanation only, and the order of a layer within a neural network may be defined in a different way. For example, a layer of nodes may be defined by distance from the final output node.
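The distance-based layer definition above amounts to a breadth-first search over the links. The following minimal sketch assumes the network is given as a list of directed `(input_node, output_node)` links; the function name `layer_indices` and the example node names are hypothetical.

```python
from collections import deque


def layer_indices(links, initial_input_nodes):
    """Return {node: n}, where n is the minimum number of links that must be
    traversed to reach the node from any initial input node."""
    successors = {}
    for src, dst in links:
        successors.setdefault(src, []).append(dst)

    layer = {node: 0 for node in initial_input_nodes}
    queue = deque(initial_input_nodes)
    while queue:
        node = queue.popleft()
        for nxt in successors.get(node, []):
            if nxt not in layer:  # first visit in BFS order is the shortest path
                layer[nxt] = layer[node] + 1
                queue.append(nxt)
    return layer


# Small example with a skip link from x1 directly to y.
links = [("x1", "h1"), ("x2", "h1"), ("x2", "h2"),
         ("h1", "y"), ("h2", "y"), ("x1", "y")]
layers = layer_indices(links, ["x1", "x2"])
```

Because the distance is a minimum, the skip link places `y` in layer 1 alongside `h1` and `h2`. This is one reason the text notes that other layer definitions, such as distance from the final output node, may be used instead.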

In one embodiment of the present disclosure, a set of neurons or nodes may be referred to as a layer.

An initial input node may refer to one or more nodes in the neural network to which data is input directly, without passing through a link from another node. Alternatively, in terms of link-based relationships between nodes within the neural network, it may refer to nodes that have no other input nodes connected to them by a link. Similarly, a final output node may refer to one or more nodes in the neural network that have no output node in relation to the other nodes. A hidden node may refer to any node constituting the neural network other than the initial input nodes and the final output nodes.
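These three node roles can be derived directly from the links. The sketch below again assumes a list of directed `(input_node, output_node)` links; `classify_nodes` is a hypothetical helper name.

```python
def classify_nodes(links):
    """Split nodes into initial input, final output, and hidden nodes.

    links: iterable of (input_node, output_node) pairs.
    """
    sources = {src for src, _ in links}
    targets = {dst for _, dst in links}
    nodes = sources | targets

    initial_inputs = nodes - targets   # no incoming link
    final_outputs = nodes - sources    # no outgoing link
    hidden = nodes - initial_inputs - final_outputs
    return initial_inputs, final_outputs, hidden


initial_inputs, final_outputs, hidden = classify_nodes(
    [("x1", "h1"), ("x2", "h1"), ("h1", "y")]
)
```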

A neural network according to one embodiment of the present disclosure may have the same number of nodes in the input layer as in the output layer, and may be a neural network in which the number of nodes decreases and then increases again in progressing from the input layer through the hidden layers. A neural network according to another embodiment of the present disclosure may have fewer nodes in the input layer than in the output layer, and may be a neural network in which the number of nodes decreases in progressing from the input layer to the hidden layers. A neural network according to yet another embodiment of the present disclosure may have more nodes in the input layer than in the output layer, and may be a neural network in which the number of nodes increases in progressing from the input layer to the hidden layers. A neural network according to still another embodiment of the present disclosure may be a combination of the neural networks described above.

A deep neural network (DNN) may refer to a neural network that includes a plurality of hidden layers in addition to an input layer and an output layer. A deep neural network can be used to identify latent structures in data. That is, it can identify the latent structure of photographs, text, video, audio, protein sequences, gene sequences, peptide sequences, amino acid sequences, and music (e.g., which objects appear in a photograph, what the content and sentiment of a text are, what the content and sentiment of speech are), as well as the binding affinity between a TCR and a pMHC and/or the binding affinity between a peptide and an MHC. Deep neural networks may include a convolutional neural network (CNN), a recurrent neural network (RNN), an autoencoder, a generative adversarial network (GAN), a restricted Boltzmann machine (RBM), a deep belief network (DBN), a Q network, a U network, a Siamese network, and the like. The foregoing description of deep neural networks is only an example, and the present disclosure is not limited thereto.

As an example, the artificial intelligence-based predictive model of the present disclosure may include a recurrent neural network (RNN), a long short-term memory (LSTM) network, a bidirectional long short-term memory (BiLSTM) network, BERT (Bidirectional Encoder Representations from Transformers), SpanBERT, a gated recurrent unit (GRU), or a bidirectional gated recurrent unit (BiGRU).

The artificial intelligence-based predictive model of the present disclosure may be represented by a network having any of the structures described above, including an input layer, a hidden layer, and an output layer.

A neural network that can be used in the artificial intelligence-based model of the present disclosure may be trained by at least one of supervised learning, unsupervised learning, semi-supervised learning, or reinforcement learning. Training a neural network may be a process of applying to the neural network the knowledge it needs to perform a specific operation. As an example, the predictive model according to one embodiment of the present disclosure may be trained by a semi-supervised learning method in which a mask is applied to at least some of the frames or units that make up the amino acid sequences and the model then predicts the masked frames or units.
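The masking step of such a training scheme might look like the sketch below, which treats each amino acid residue as one unit. The mask symbol, the 15% masking fraction (borrowed from common masked-language-model practice), and the function name `mask_sequence` are assumptions; the disclosure does not state these details.

```python
import random

MASK = "#"  # assumed mask symbol; it is not an amino acid one-letter code


def mask_sequence(sequence, mask_fraction=0.15, seed=None):
    """Hide a random subset of residues and return (masked_sequence, targets).

    targets maps each masked position to the residue the model should predict.
    """
    if not sequence:
        raise ValueError("sequence must not be empty")
    rng = random.Random(seed)
    n_mask = max(1, round(len(sequence) * mask_fraction))
    positions = sorted(rng.sample(range(len(sequence)), n_mask))

    masked = list(sequence)
    targets = {}
    for pos in positions:
        targets[pos] = masked[pos]
        masked[pos] = MASK
    return "".join(masked), targets


# Hypothetical CDR3 sequence, used only to show the call shape.
masked_seq, targets = mask_sequence("CASSLGQAYEQYF", seed=0)
```

During training, the model would receive `masked_seq` as input and be scored on how well it recovers the residues in `targets`, so no manually assigned labels are needed for this step.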

A neural network may be trained so as to minimize the error of its output. Training a neural network is a process of repeatedly inputting training data into the neural network, calculating the error between the neural network's output for the training data and the target, and backpropagating that error from the output layer toward the input layer so as to reduce it, thereby updating the weight of each node of the neural network. Supervised learning uses training data in which each item is labeled with the correct answer (i.e., labeled training data), whereas in unsupervised learning the training data may not be labeled with the correct answer. For example, the training data for supervised learning of data classification may be data in which each item is labeled with a category. The labeled training data is input to the neural network, and the error may be calculated by comparing the output (category) of the neural network with the label of the training data. As another example, in unsupervised learning of data classification, the error may be calculated by comparing the input training data with the output of the neural network. The calculated error is backpropagated through the neural network in the reverse direction (i.e., from the output layer toward the input layer), and the connection weights of the nodes in each layer of the neural network may be updated according to the backpropagation. The amount by which each connection weight changes may be determined by a learning rate. One computation of the neural network on the input data together with the backpropagation of the error may constitute a learning cycle (epoch). The learning rate may be applied differently depending on the number of learning cycles completed. For example, a high learning rate may be used early in training so that the neural network quickly reaches a certain level of performance, improving efficiency, and a low learning rate may be used late in training to improve accuracy.
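The interplay between error, weight update, and a learning rate that shrinks over the learning cycles can be sketched with a deliberately tiny model. The example below fits a single weight by gradient descent, so it illustrates only the update rule and not backpropagation through multiple layers. The step-decay schedule, its constants, and the function names are assumptions.

```python
def learning_rate(epoch, initial=0.1, decay=0.5, step=10):
    """Step decay: start high, then halve the rate every `step` epochs."""
    return initial * decay ** (epoch // step)


def train(samples, epochs=30):
    """Fit output = w * x to (x, target) pairs by gradient descent."""
    w = 0.0
    for epoch in range(epochs):
        lr = learning_rate(epoch)
        for x, target in samples:
            output = w * x               # forward computation
            error = output - target      # difference between output and target
            gradient = 2.0 * error * x   # derivative of error**2 with respect to w
            w -= lr * gradient           # update scaled by the learning rate
    return w


# Labeled training data generated from target = 2 * x.
samples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w = train(samples)
```

With these settings the early epochs move `w` quickly toward 2, and the later, smaller learning rates make only fine adjustments, mirroring the high-then-low schedule described above.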

In training a neural network, the training data may generally be a subset of the real data (i.e., the data to be processed using the trained neural network), and so there may be learning cycles in which the error on the training data decreases while the error on the real data increases. Overfitting is the phenomenon in which the network fits the training data too closely and the error on the real data increases as a result. For example, a neural network that has learned what a cat is from yellow cats only and then fails to recognize a cat of another color as a cat exhibits a kind of overfitting. Overfitting can increase the error of a machine learning algorithm. Various optimization methods may be used to prevent overfitting, such as increasing the amount of training data, regularization, dropout (which deactivates some of the network's nodes during training), and the use of batch normalization layers.
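Of the methods listed, dropout is the simplest to illustrate. The sketch below applies it to a list of node outputs. The rescaling of surviving values (often called inverted dropout) is an assumption about the variant used, and the function name and parameters are hypothetical.

```python
import random


def dropout(values, drop_prob=0.5, training=True, rng=None):
    """Randomly deactivate node outputs during training.

    Surviving values are divided by the keep probability so that their
    expected sum is unchanged; at inference time values pass through as-is.
    """
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    if not training or drop_prob == 0.0:
        return list(values)
    rng = rng or random.Random()
    keep_prob = 1.0 - drop_prob
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]


# During training roughly half of the 1000 node outputs are zeroed.
dropped = dropout([1.0] * 1000, drop_prob=0.5, rng=random.Random(0))
# At inference time nothing is dropped.
unchanged = dropout([1.0, 2.0], training=False)
```

Because a different random subset of nodes is silenced on each call, the network cannot rely on any single node, which tends to reduce overfitting to the training data.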

본 개시의 일 실시예에 따른 데이터 구조를 저장한 컴퓨터 판독가능 매체가 개시된다. 전술한 데이터 구조는 본 개시내용에서의 저장부에 저장될 수 있으며, 프로세서에 의해 실행될 수 있으며 그리고 통신부에 의해 송수신될 수 있다.A computer readable medium storing a data structure according to an embodiment of the present disclosure is disclosed. The above-described data structure may be stored in a storage unit in the present disclosure, executed by a processor, and transmitted and received by a communication unit.

데이터 구조는 데이터에 효율적인 접근 및 수정을 가능하게 하는 데이터의 조직, 관리, 저장을 의미할 수 있다. 데이터 구조는 특정 문제(예를 들어, 최단 시간으로 데이터 검색, 데이터 저장, 데이터 수정) 해결을 위한 데이터의 조직을 의미할 수 있다. 데이터 구조는 특정한 데이터 처리 기능을 지원하도록 설계된, 데이터 요소들 간의 물리적이거나 논리적인 관계로 정의될 수도 있다. 데이터 요소들 간의 논리적인 관계는 사용자 정의 데이터 요소들 간의 연결관계를 포함할 수 있다. 데이터 요소들 간의 물리적인 관계는 컴퓨터 판독가능 저장매체(예를 들어, 영구 저장 장치)에 물리적으로 저장되어 있는 데이터 요소들 간의 실제 관계를 포함할 수 있다. 데이터 구조는 구체적으로 데이터의 집합, 데이터 간의 관계, 데이터에 적용할 수 있는 함수 또는 명령어를 포함할 수 있다. 효과적으로 설계된 데이터 구조를 통해 컴퓨팅 장치는 컴퓨팅 장치의 자원을 최소한으로 사용하면서 연산을 수행할 수 있다. 구체적으로 컴퓨팅 장치는 효과적으로 설계된 데이터 구조를 통해 연산, 읽기, 삽입, 삭제, 비교, 교환, 검색의 효율성을 높일 수 있다.Data structure can refer to the organization, management, and storage of data that enables efficient access and modification of data. Data structure may refer to the organization of data to solve a specific problem (eg, data retrieval, data storage, data modification in the shortest time). A data structure may be defined as a physical or logical relationship between data elements designed to support a specific data processing function. A logical relationship between data elements may include a connection relationship between user-defined data elements. A physical relationship between data elements may include an actual relationship between data elements physically stored in a computer-readable storage medium (eg, a persistent storage device). The data structure may specifically include a set of data, a relationship between data, and a function or command applicable to the data. Through an effectively designed data structure, a computing device can perform calculations while using minimal resources of the computing device. Specifically, the computing device can increase the efficiency of operation, reading, insertion, deletion, comparison, exchange, and search through an effectively designed data structure.

Data structures can be divided into linear data structures and non-linear data structures according to their form. A linear data structure may be a structure in which only one data item is connected after each data item. Linear data structures may include lists, stacks, queues, and deques. A list may refer to a collection of data items that has an internal order. Lists may include linked lists. A linked list may be a data structure in which each data item holds a pointer and the items are connected in a single line. In a linked list, a pointer may contain link information to the next or previous data item. Depending on its form, a linked list may be a singly linked list, a doubly linked list, or a circular linked list. A stack may be a sequential data structure with restricted access to its data. A stack may be a linear data structure in which data can be processed (e.g., inserted or deleted) at only one end. Data stored in a stack follows a last-in, first-out (LIFO) order: the later an item goes in, the sooner it comes out. A queue is also a sequential data structure with restricted access to its data, but unlike a stack it follows a first-in, first-out (FIFO) order: the later an item is stored, the later it comes out. A deque may be a data structure in which data can be processed at both ends.
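The difference between stack, queue, and deque access can be illustrated with a short sketch. It uses Python's built-in `list` and the standard-library `collections.deque`; the variable names and the stored items are illustrative only.

```python
from collections import deque

# Stack: insert and delete at one end only (last in, first out).
stack = []
for item in ("a", "b", "c"):
    stack.append(item)
last_in = stack.pop()

# Queue: insert at the back, remove from the front (first in, first out).
queue = deque()
for item in ("a", "b", "c"):
    queue.append(item)
first_in = queue.popleft()

# Deque: insert and delete at both ends.
double_ended = deque(["b"])
double_ended.appendleft("a")
double_ended.append("c")
front = double_ended.popleft()
back = double_ended.pop()
```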

A non-linear data structure may be a structure in which a plurality of data items are connected after one data item. Non-linear data structures may include graph data structures. A graph data structure may be defined by vertices and edges, and an edge may include a line connecting two different vertices. Graph data structures may include tree data structures. A tree data structure may be a data structure in which exactly one path connects any two different vertices among the plurality of vertices included in the tree. That is, it may be a graph data structure that does not form a loop.
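The property that a tree is a graph with exactly one path between any two vertices can be checked with a small sketch. It relies on the standard fact that an undirected graph with n vertices is a tree when it is connected and has n - 1 edges; the function name `is_tree` and the integer vertex labels are assumptions made for illustration.

```python
def is_tree(num_vertices, edges):
    """Return True if the undirected graph on vertices 0..num_vertices-1 is a tree."""
    if num_vertices == 0 or len(edges) != num_vertices - 1:
        return False
    adjacency = {vertex: [] for vertex in range(num_vertices)}
    for a, b in edges:
        adjacency[a].append(b)
        adjacency[b].append(a)
    # Depth-first traversal from vertex 0 to test connectivity.
    seen = {0}
    pending = [0]
    while pending:
        vertex = pending.pop()
        for neighbor in adjacency[vertex]:
            if neighbor not in seen:
                seen.add(neighbor)
                pending.append(neighbor)
    return len(seen) == num_vertices
```

A triangle of three vertices fails the check because its three edges form a loop, while a star with one center and three leaves passes.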

Throughout this specification, the terms predictive model, artificial intelligence-based model, computational model, neural network, and network function may be used interchangeably. Hereinafter, the term neural network is used uniformly. The data structure may include a neural network, and the data structure including the neural network may be stored in a computer-readable medium. The data structure including the neural network may also include data preprocessed for processing by the neural network, data input to the neural network, weights of the neural network, hyperparameters of the neural network, data obtained from the neural network, an activation function associated with each node or layer of the neural network, a loss function for training the neural network, and the like. The data structure including the neural network may include all of these components or any combination of them. In addition to the foregoing, the data structure including the neural network may include any other information that determines the characteristics of the neural network. The data structure may also include any type of data used or generated in the computation of the neural network, and is not limited to the foregoing. A computer-readable medium may include a computer-readable recording medium and/or a computer-readable transmission medium. A neural network may consist of a set of interconnected computational units, which may generally be referred to as nodes. These nodes may also be referred to as neurons. A neural network includes at least one node.

데이터 구조는 신경망에 입력되는 데이터를 포함할 수 있다. 신경망에 입력되는 데이터를 포함하는 데이터 구조는 컴퓨터 판독가능 매체에 저장될 수 있다. 신경망에 입력되는 데이터는 신경망 학습 과정에서 입력되는 학습 데이터 및/또는 학습이 완료된 신경망에 입력되는 입력 데이터를 포함할 수 있다. 신경망에 입력되는 데이터는 전처리를 거친 데이터 및/또는 전처리 대상이 되는 데이터를 포함할 수 있다. 전처리는 데이터를 신경망에 입력시키기 위한 데이터 처리 과정을 포함할 수 있다. 따라서 데이터 구조는 전처리 대상이 되는 데이터 및 전처리로 발생되는 데이터를 포함할 수 있다. 전술한 데이터 구조는 예시일 뿐 본 개시는 이에 제한되지 않는다.The data structure may include data input to the neural network. A data structure including data input to the neural network may be stored in a computer readable medium. Data input to the neural network may include training data input during a neural network learning process and/or input data input to a neural network that has been trained. Data input to the neural network may include preprocessed data and/or data to be preprocessed. Pre-processing may include a data processing process for inputting data to a neural network. Accordingly, the data structure may include data subject to pre-processing and data generated by pre-processing. The foregoing data structure is only an example, and the present disclosure is not limited thereto.

데이터 구조는 신경망의 가중치를 포함할 수 있다(본 명세서에서 가중치, 파라미터는 동일한 의미로 사용될 수 있다). 그리고 신경망의 가중치를 포함한 데이터 구조는 컴퓨터 판독가능 매체에 저장될 수 있다. 신경망은 복수개의 가중치를 포함할 수 있다. 가중치는 가변적일 수 있으며, 신경망이 원하는 기능을 수행하기 위해, 사용자 또는 알고리즘에 의해 가변 될 수 있다. 예를 들어, 하나의 출력 노드에 하나 이상의 입력 노드가 각각의 링크에 의해 상호 연결된 경우, 출력 노드는 상기 출력 노드와 연결된 입력 노드들에 입력된 값들 및 각각의 입력 노드들에 대응하는 링크에 설정된 가중치에 기초하여 출력 노드에서 출력되는 데이터 값을 결정할 수 있다. 전술한 데이터 구조는 예시일 뿐 본 개시는 이에 제한되지 않는다.The data structure may include weights of the neural network (weights and parameters may be used interchangeably in this specification). And the data structure including the weight of the neural network may be stored in a computer readable medium. A neural network may include a plurality of weights. The weight may be variable, and may be changed by a user or an algorithm in order to perform a function desired by the neural network. For example, when one or more input nodes are connected to one output node by respective links, the output node may determine a data value output from the output node based on values input to input nodes connected to the output node and a weight set for a link corresponding to each input node. The foregoing data structure is only an example, and the present disclosure is not limited thereto.
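How an output node combines the values of its connected input nodes with the weights set on the links can be sketched as a weighted sum. This is a minimal illustration: the function name is hypothetical, and any bias term or activation function that a real network would apply is deliberately left out.

```python
def output_node_value(input_values, link_weights):
    """Weighted sum of the input-node values, one weight per link (no bias, no activation)."""
    if len(input_values) != len(link_weights):
        raise ValueError("each input node needs exactly one link weight")
    return sum(value * weight for value, weight in zip(input_values, link_weights))


# Changing a link weight changes the output for the same inputs.
before = output_node_value([1.0, 2.0], [0.5, -1.0])
after = output_node_value([1.0, 2.0], [0.5, 1.0])
```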

제한이 아닌 예로서, 가중치는 신경망 학습 과정에서 가변되는 가중치 및/또는 신경망 학습이 완료된 가중치를 포함할 수 있다. 신경망 학습 과정에서 가변되는 가중치는 학습 사이클이 시작되는 시점의 가중치 및/또는 학습 사이클 동안 가변되는 가중치를 포함할 수 있다. 신경망 학습이 완료된 가중치는 학습 사이클이 완료된 가중치를 포함할 수 있다. 따라서 신경망의 가중치를 포함한 데이터 구조는 신경망 학습 과정에서 가변되는 가중치 및/또는 신경망 학습이 완료된 가중치를 포함한 데이터 구조를 포함할 수 있다. 그러므로 상술한 가중치 및/또는 각 가중치의 조합은 신경망의 가중치를 포함한 데이터 구조에 포함되는 것으로 한다. 전술한 데이터 구조는 예시일 뿐 본 개시는 이에 제한되지 않는다.As a non-limiting example, the weights may include weights that are varied during neural network training and/or weights for which neural network training has been completed. The variable weight in the neural network learning process may include a weight at the time the learning cycle starts and/or a variable weight during the learning cycle. The weights for which neural network learning has been completed may include weights for which learning cycles have been completed. Accordingly, the data structure including the weights of the neural network may include a data structure including weights that are variable during the neural network learning process and/or weights for which neural network learning is completed. Therefore, it is assumed that the above-described weights and/or combinations of weights are included in the data structure including the weights of the neural network. The foregoing data structure is only an example, and the present disclosure is not limited thereto.

신경망의 가중치를 포함한 데이터 구조는 직렬화(serialization) 과정을 거친 후 컴퓨터 판독가능 저장 매체(예를 들어, 메모리, 하드 디스크)에 저장될 수 있다. 직렬화는 데이터 구조를 동일하거나 다른 컴퓨팅 장치에 저장하고 나중에 다시 재구성하여 사용할 수 있는 형태로 변환하는 과정일 수 있다. 컴퓨팅 장치는 데이터 구조를 직렬화하여 네트워크를 통해 데이터를 송수신할 수 있다. 직렬화된 신경망의 가중치를 포함한 데이터 구조는 역직렬화(deserialization)를 통해 동일한 컴퓨팅 장치 또는 다른 컴퓨팅 장치에서 재구성될 수 있다. 신경망의 가중치를 포함한 데이터 구조는 직렬화에 한정되는 것은 아니다. 나아가 신경망의 가중치를 포함한 데이터 구조는 컴퓨팅 장치의 자원을 최소한으로 사용하면서 연산의 효율을 높이기 위한 데이터 구조(예를 들어, 비선형 데이터 구조에서 B-Tree, R-Tree, Trie, m-way search tree, AVL tree, Red-Black Tree)를 포함할 수 있다. 전술한 사항은 예시일 뿐 본 개시는 이에 제한되지 않는다.The data structure including the weights of the neural network may be stored in a computer readable storage medium (eg, a memory or a hard disk) after going through a serialization process. Serialization can be the process of converting a data structure into a form that can be stored on the same or another computing device and later reconstructed and used. A computing device may serialize data structures to transmit and receive data over a network. The data structure including the weights of the serialized neural network may be reconstructed on the same computing device or another computing device through deserialization. The data structure including the weights of the neural network is not limited to serialization. Furthermore, the data structure including the weights of the neural network may include a data structure (e.g., B-Tree, R-Tree, Trie, m-way search tree, AVL tree, Red-Black Tree in a nonlinear data structure) to increase computational efficiency while minimizing resource use of a computing device. The foregoing is only an example, and the present disclosure is not limited thereto.
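Serialization and deserialization of a weight data structure can be sketched as a round trip through bytes. The example below assumes, purely for illustration, that the weights are held as nested lists keyed by a layer name and that JSON is the serialized form; the present disclosure does not prescribe a particular format.

```python
import json


def serialize_weights(weights):
    """Convert a dict of nested weight lists into bytes that can be stored or transmitted."""
    return json.dumps(weights).encode("utf-8")


def deserialize_weights(payload):
    """Reconstruct the weight data structure from its serialized bytes."""
    return json.loads(payload.decode("utf-8"))


# Hypothetical weights for a two-layer network.
weights = {"layer1": [[0.1, -0.2], [0.3, 0.4]], "layer2": [[0.5], [-0.6]]}
payload = serialize_weights(weights)
restored = deserialize_weights(payload)
```

The bytes in `payload` could be written to a disk or sent over a network and reconstructed on the same or a different computing device.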

The data structure may include hyperparameters of the neural network. The data structure including the hyperparameters of the neural network may be stored in a computer-readable medium. A hyperparameter may be a variable that can be changed by a user. Hyperparameters may include, for example, a learning rate, a cost function, the number of learning-cycle iterations, weight initialization (e.g., setting the range of values from which weights are initialized), and the number of hidden units (e.g., the number of hidden layers and the number of nodes in each hidden layer). The foregoing data structure is only an example, and the present disclosure is not limited thereto.

본 개시의 일 실시예에 따른 예측 모델에 대한 네트워크 함수로서 트랜스포머(transformer)가 고려될 수도 있다. 일례로, 예측 모델은 트랜스포머 기반으로 동작될 수 있다. 이러한 예측 모델은 예를 들어 어텐션 알고리즘이 적용된 순환 신경망 또는 어텐션 알고리즘이 적용된 트랜스포머를 사용하여 동작될 수 있다.A transformer may be considered as a network function for a predictive model according to an embodiment of the present disclosure. As an example, the predictive model may be operated based on a transformer. Such a predictive model may be operated using, for example, a recurrent neural network to which an attention algorithm is applied or a transformer to which an attention algorithm is applied.

일 실시예에서, 트랜스포머는 임베딩된 데이터들을 인코딩하는 인코더 및 인코딩된 데이터들을 디코딩하는 디코더로 구성될 수 있다. 트랜스포머는 일련의 데이터(a series of data)들을 수신하여, 인코딩 및 디코딩 단계를 거처 상이한 타입의 일련의 데이터들을 출력하는 구조를 지닐 수 있다. 일 실시예에서, 일련의 데이터들은 트랜스포머가 연산가능한 형태로 가공될 수 있다. 일련의 데이터들을 트랜스포머가 연산가능한 형태로 가공하는 과정은 임베딩 과정을 포함할 수 있다. 데이터 토큰, 임베딩 벡터, 임베딩 토큰 등과 같은 표현들은, 트랜스포머가 처리할 수 있는 형태로 임베딩된 데이터들을 지칭하는 것일 수 있다.In one embodiment, a transformer may consist of an encoder that encodes the embedded data and a decoder that decodes the encoded data. The transformer may have a structure that receives a series of data and outputs a series of data of different types through encoding and decoding steps. In one embodiment, the series of data can be processed into a form operable by a transformer. A process of processing a series of data into a form in which a transformer can operate may include an embedding process. Expressions such as data token, embedding vector, and embedding token may refer to embedded data in a form that can be processed by a transformer.

To encode and decode a series of data, the encoders and decoders within the transformer may use an attention algorithm. An attention algorithm may refer to an algorithm that, for a given query, computes a similarity to each of one or more keys, applies each similarity to the value corresponding to that key, and then computes the attention value as the weighted sum of the values.
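The query, key, and value computation described above can be sketched in plain Python. The dot product as the similarity measure, the scaling by the square root of the query dimension, and the softmax normalization follow the commonly used scaled dot-product formulation; they are assumptions here, since this passage does not fix a particular similarity function.

```python
import math


def softmax(scores):
    """Normalize scores into non-negative weights that sum to 1."""
    highest = max(scores)
    exponentials = [math.exp(score - highest) for score in scores]
    total = sum(exponentials)
    return [value / total for value in exponentials]


def attention(query, keys, values):
    """Weighted sum of the values, weighted by the similarity of the query to each key."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dimension = len(values[0])
    return [sum(w * value[i] for w, value in zip(weights, values)) for i in range(dimension)]
```

When two keys are equally similar to the query, their values contribute equally; when one key is far more similar, its value dominates the result.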

Various types of attention algorithms can be distinguished according to how the query, key, and value are set. For example, when attention is computed with the query, key, and value all set to the same input, this may be referred to as a self-attention algorithm. When, in order to process a series of input data in parallel, the embedding vector is split into lower-dimensional parts and attention is computed with a separate attention head for each part, this may be referred to as a multi-head attention algorithm.

일 실시예에서, 트랜스포머는 복수의 멀티-헤드 셀프 어텐션 알고리즘 또는 멀티-헤드 인코더-디코더 알고리즘을 수행하는 모듈들로 구성될 수 있다. 일 실시예에서, 트랜스포머는 임베딩 레이어, 정규화 레이어, 소프트맥스(softmax) 층 등 어텐션 알고리즘이 아닌 부가적인 구성요소들 또한 포함할 수 있다. 어텐션 알고리즘을 이용하여 트랜스포머를 구성하는 방법은 Vaswani et al., Attention Is All You Need, 2017 NIPS에 개시된 방법을 포함할 수 있으며, 이는 여기에 참조로서 통합된다.In one embodiment, a transformer may consist of modules that perform a plurality of multi-head self-attention algorithms or multi-head encoder-decoder algorithms. In one embodiment, the transformer may also include additional elements other than the attention algorithm, such as an embedding layer, a normalization layer, and a softmax layer. Methods for constructing transformers using the attention algorithm may include methods disclosed in Vaswani et al., Attention Is All You Need, 2017 NIPS, which is incorporated herein by reference.

트랜스포머는 임베딩된 자연어, 임베딩된 시퀀스 정보, 분할된 이미지 데이터, 오디오 파형 등 다양한 데이터 도메인에 적용하여, 일련의 입력 데이터를 일련의 출력 데이터로 변환할 수 있다. 다양한 데이터 도메인을 가진 데이터들을 트랜스포머에 입력가능한 일련의 데이터들로 변환하기 위해, 트랜스포머는 데이터들을 임베딩할 수 있다. 트랜스포머는 일련의 입력 데이터 사이의 상대적 위치관계 또는 위상관계를 표현하는 추가적인 데이터를 처리할 수 있다. 또는 일련의 입력 데이터에 입력 데이터들 사이의 상대적인 위치관계 또는 위상관계를 표현하는 벡터들이 추가적으로 반영되어 일련의 입력 데이터가 임베딩될 수 있다. 일 예에서, 일련의 입력 데이터 사이의 상대적 위치관계는, 자연어 문장 내에서의 어순, 각각의 분할된 이미지의 상대적 위치 관계, 분할된 오디오 파형의 시간 순서 등을 포함할 수 있으나, 이에 제한되지 않는다. 일련의 입력 데이터들 사이의 상대적인 위치관계 또는 위상관계를 표현하는 정보를 추가하는 과정은 위치 인코딩(positional encoding)으로 지칭될 수 있다.Transformers can be applied to various data domains such as embedded natural language, embedded sequence information, segmented image data, and audio waveforms to convert a series of input data into a series of output data. In order to convert data having various data domains into a series of data that can be input to the transformer, the transformer can embed the data. Transformers can process additional data representing relative positional or phase relationships between a set of input data. Alternatively, a series of input data may be embedded by additionally reflecting vectors representing a relative positional relationship or phase relationship between input data to the series of input data. In one example, the relative positional relationship between a series of input data may include, but is not limited to, word order in a natural language sentence, relative positional relationship of each segmented image, temporal sequence of segmented audio waveforms, and the like. A process of adding information representing a relative positional relationship or phase relationship between a series of input data may be referred to as positional encoding.
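Positional encoding can be sketched as a table of position-dependent vectors that is added to the embedded inputs. The sinusoidal form below is one widely used choice and is an assumption here; the passage only requires that positional or topological information be reflected in the embeddings. The function names are hypothetical.

```python
import math


def positional_encoding(sequence_length, model_dim):
    """Sinusoidal table: even indices use sine, odd indices use cosine."""
    table = []
    for position in range(sequence_length):
        row = []
        for index in range(model_dim):
            angle = position / (10000 ** ((2 * (index // 2)) / model_dim))
            row.append(math.sin(angle) if index % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def add_positions(embeddings):
    """Add the positional vector for each position to the embedding at that position."""
    encoding = positional_encoding(len(embeddings), len(embeddings[0]))
    return [
        [value + offset for value, offset in zip(embedding_row, encoding_row)]
        for embedding_row, encoding_row in zip(embeddings, encoding)
    ]
```

Because each position receives a different vector, two identical tokens at different positions (for example, the same amino acid at two places in a sequence) yield different inputs to the transformer.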

One example of a method for embedding data and converting it into a form that a transformer can process is disclosed in Dosovitskiy, et al., AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE, which is incorporated herein by reference.

FIG. 3 illustrates an exemplary method for obtaining prediction results related to CDR3α and CDR3β of a TCR from input data using an artificial intelligence-based predictive model according to an embodiment of the present disclosure.

In the present disclosure, "whether CDR3α and CDR3β correspond to a precedence relationship" is used as an example of "whether a TCR α chain and a TCR β chain can be combined." Accordingly, the descriptions below of whether a precedence relationship holds are only one way of expressing whether a combination is possible, and the scope of rights should not be construed as limited by the example "whether CDR3α and CDR3β correspond to a precedence relationship." As such, in the present disclosure, "whether combinable" for a TCR α chain and a TCR β chain and "whether a precedence relationship holds" for CDR3α and CDR3β may be used interchangeably.

FIG. 3 illustrates an exemplary method for obtaining prediction results related to CDR3α and CDR3β of a TCR from an input data set using an artificial intelligence-based predictive model according to an embodiment of the present disclosure. In the present disclosure, the step of generating a prediction result of the predictive model may cover both training the predictive model and actually obtaining an inference result (e.g., a prediction including information on CDR3α and CDR3β) through the trained predictive model.

본 개시내용에서 예측 모델을 학습시키는 과정 또는 예측 모델을 이용하여 추론을 수행하는 과정(예컨대, 예측 결과를 생성하는 과정) 중 하나의 과정으로 예측 모델의 동작이 설명될 수 있다. 이는 설명의 편의를 위하여 작성된 것으로, 학습을 수행하는 과정 또는 추론을 수행하는 과정 중 하나의 과정으로 기재되었다고 하더라도, 학습 또는 추론 중 다른 하나의 과정을 포괄하는 의도로 해석되어야 한다.In the present disclosure, the operation of the predictive model may be described as one of a process of learning a predictive model or a process of performing inference using the predictive model (eg, a process of generating a prediction result). This is written for the convenience of description, and even if it is described as one of the processes of performing learning or inference, it should be interpreted as intended to cover the other process of learning or inference.

In one embodiment, the steps shown in FIG. 3 may be performed by the computing device 100. In a further embodiment, the steps in FIG. 3 may be implemented by a plurality of entities; for example, some of the steps may be performed in a user terminal and others in a server.

본 개시내용의 일 실시예에서, 컴퓨팅 장치(100)는 TCR의 CDR3α에 대응되는 제 1 데이터 및 TCR의 CDR3β에 대응되는 제 2 데이터를 획득할 수 있다(S310).In one embodiment of the present disclosure, the computing device 100 may obtain first data corresponding to CDR3α of TCR and second data corresponding to CDR3β of TCR (S310).

In a further embodiment, the first data may include an amino acid sequence corresponding to the CDR3α of the TCR, and the second data may include an amino acid sequence corresponding to the CDR3β of the TCR.
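One way to hold the first data and the second data is a small record type that carries the chain label and the amino acid sequence and rejects characters outside the twenty standard one-letter codes. The class name `Cdr3Record`, the helper `to_identifiers`, the integer identifier scheme, and the example sequences are all hypothetical; the sequences are made-up placeholders, not measured TCR data.

```python
from dataclasses import dataclass

STANDARD_AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes


@dataclass(frozen=True)
class Cdr3Record:
    chain: str      # "alpha" for CDR3α (first data) or "beta" for CDR3β (second data)
    sequence: str   # amino acid sequence in one-letter codes

    def __post_init__(self):
        if self.chain not in ("alpha", "beta"):
            raise ValueError("chain must be 'alpha' or 'beta'")
        if not self.sequence or any(r not in STANDARD_AMINO_ACIDS for r in self.sequence):
            raise ValueError("sequence must contain only standard amino acid codes")


def to_identifiers(record):
    """Map each residue to an integer identifier (its index in STANDARD_AMINO_ACIDS)."""
    return [STANDARD_AMINO_ACIDS.index(residue) for residue in record.sequence]


first_data = Cdr3Record("alpha", "CAVRDF")   # placeholder sequence
second_data = Cdr3Record("beta", "CASSLF")   # placeholder sequence
```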

본 개시내용의 일 실시예에서, 제 1 데이터에 포함되는 CDR3α에 대한 정보 및 제 2 데이터에 포함되는 CDR3β에 대한 정보는 제 1 모델과는 상이한, 인공지능 기반의 또는 룰 기반의 생성 모델을 통해 출력될 수 있다. 일례로, 상기 CDR3α에 대한 정보는 pMHC를 입력으로 하고, 그에 대응하는 CDR3α를 출력으로 하는 제 2 모델을 이용하여 생성될 수 있다. 다른 예시로, 상기 CDR3β에 대한 정보는 pMHC를 입력으로 하고, 그에 대응하는 CDR3β를 출력으로 하는 제 3 모델을 이용하여 생성될 수 있다.In one embodiment of the present disclosure, information on CDR3α included in the first data and information on CDR3β included in the second data may be output through an artificial intelligence-based or rule-based generation model different from the first model. For example, the information on CDR3α may be generated using a second model having pMHC as an input and CDR3α corresponding to the pMHC as an output. As another example, the information on CDR3β may be generated using a third model having pMHC as an input and CDR3β corresponding to the pMHC as an output.

일 실시예에서, 펩타이드의 아미노산 서열 정보, MHC의 아미노산 서열 정보, 및/또는 TCR의 CDR3, CDR1, CDR2의 아미노산 서열 정보는 피검체로부터 얻은 단백체 시료의 시퀀싱을 통해 획득할 수 있다.In one embodiment, amino acid sequence information of peptides, amino acid sequence information of MHC, and/or amino acid sequence information of CDR3, CDR1, and CDR2 of TCR may be obtained through sequencing of a proteomic sample obtained from a subject.

일 실시예에서, 펩타이드의 아미노산 서열 정보, MHC의 아미노산 서열 정보, 및/또는 TCR의 CDR3, CDR1, CDR2의 아미노산 서열 정보는 공공 DB(예컨대, VDJdb, Expasy 등)로부터 획득할 수 있다.In one embodiment, amino acid sequence information of peptides, amino acid sequence information of MHC, and/or amino acid sequence information of CDR3, CDR1, and CDR2 of TCR may be obtained from a public DB (eg, VDJdb, Expasy, etc.).

일 실시예에서, 컴퓨팅 장치(100)는 TCR, 펩타이드 또는 MHC의 아미노산 서열 분석에 질량 분석(Mass Spectrometry)을 수행하는 질량 분석 기기(예컨대, Mass Spectrometer, LC-MS/MS)를 이용할 수 있다. 일 실시예에서, 컴퓨팅 장치(100)는 펩타이드, MHC 또는 TCR의 아미노산 서열 분석에 생어법(Sanger sequencing method), 에드만 분해법(Edman Degradation), 또는 PMF 분석(Peptide Mass Fingerprinting) 등을 이용할 수 있다.In one embodiment, the computing device 100 may use a mass spectrometer (eg, mass spectrometer, LC-MS/MS) that performs mass spectrometry in analyzing the amino acid sequence of TCR, peptide, or MHC. In one embodiment, the computing device 100 may use a Sanger sequencing method, Edman Degradation, or PMF analysis (Peptide Mass Fingerprinting) to analyze the amino acid sequence of a peptide, MHC or TCR.

본 개시내용의 일 실시예에서, 컴퓨팅 장치(100)는 인공지능 기반의 제 1 모델을 사용하여, 제 1 데이터 및 제 2 데이터를 입력받아, CDR3α를 포함하는 제 1 데이터와 CDR3β를 포함하는 제 2 데이터의 선후행 관계 해당 여부를 결정할 수 있다(S320).In one embodiment of the present disclosure, the computing device 100 receives the first data and the second data using the artificial intelligence-based first model, and determines whether the first data including the CDR3α and the second data including the CDR3β correspond to a precedence relationship (S320).
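Step S320, followed by the storing of matching pairs in a CDR set candidate list as summarized in the abstract, can be sketched as a loop over candidate pairs. The first model is represented by an arbitrary callable returning a score; `toy_first_model`, the score threshold of 0.5, and the exhaustive pairing of every α with every β are assumptions made for illustration and do not reflect the trained model of the disclosure.

```python
from itertools import product


def build_candidate_list(alpha_sequences, beta_sequences, first_model, threshold=0.5):
    """Store each (CDR3α, CDR3β) pair that the model judges to be in a precedence relationship."""
    candidates = []
    for alpha, beta in product(alpha_sequences, beta_sequences):
        score = first_model(alpha, beta)
        if score >= threshold:
            candidates.append((alpha, beta))
    return candidates


def toy_first_model(alpha, beta):
    """Hypothetical stand-in, NOT a trained model: scores 1.0 when the lengths match."""
    return 1.0 if len(alpha) == len(beta) else 0.0


# Placeholder sequences, not measured data.
candidate_list = build_candidate_list(["CAVRDF", "CAV"], ["CASSLF"], toy_first_model)
```

In practice the callable would wrap the trained first model; keeping it as a parameter lets the same loop serve both testing and inference.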

In one embodiment, the first data and the second data may be the CDR3α and the CDR3β of a TCR that interacts with a specific peptide-MHC complex (pMHC). Data identified, as a result of input to the first model, as a CDR3α and a CDR3β that combine may be regarded as True Positive data. For example, when a CDR3α and a CDR3β identified as combining are also determined by molecular biological experiment to actually bind the pMHC, the result output by the first model for the first data including the CDR3α and the second data including the CDR3β, namely that the CDR3α and the CDR3β correspond to a precedence relationship, may be stored in a database as a True Positive.

In an additional embodiment, to secure True Negative data, the first data and the second data may be a CDR3α and a CDR3β of TCRs that do not combine with each other. Specifically, when a CDR3α and a CDR3β identified as not combining are also determined by molecular biological experiment not to combine, the result output when the first data or the second data is input to the decoder, namely that the CDR3α and the CDR3β do not correspond to a precedence relationship, may be stored in the database as a True Negative. Such True Positive, False Positive, True Negative, or False Negative data may be stored and/or used as a dataset for training the predictive model.

In a further embodiment, whether the molecular biological experimental result and the prediction output by the first model agree regarding the combination of a CDR3α and a CDR3β does not depend on the logical or temporal order in which the two were obtained. For example, if a CDR3α and a CDR3β are first determined by molecular biological experiment not to combine and the first model then predicts that they do not combine, the pair may be stored as True Negative data. As another example, if the first model first predicts that a CDR3α and a CDR3β combine and a molecular biological experiment then determines that they do not, the pair may be stored as False Positive data.
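The labeling of a pair from the model prediction and the experimental result can be written as a function of two booleans, which makes explicit that the order in which the two were obtained plays no role. The function name and the string labels are illustrative choices.

```python
def label_outcome(predicted_pairing, experimental_pairing):
    """Classify a (CDR3α, CDR3β) pair from the model prediction and the experimental result."""
    if predicted_pairing and experimental_pairing:
        return "True Positive"
    if predicted_pairing and not experimental_pairing:
        return "False Positive"
    if not predicted_pairing and experimental_pairing:
        return "False Negative"
    return "True Negative"
```

For instance, a pair predicted to combine but found experimentally not to combine is labeled a False Positive, whichever result was produced first.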

추가적인 실시예에서, 컴퓨팅 장치(100)는 CDR3α와 CDR3β의 조합 여부에 대한 제 1 모델을 통한 예측 결과와 무관하게 CDR3α와 CDR3β의 조합 여부에 대한 분자생물학적 실험 결과를 True(예컨대, 진양성 또는 진음성) 데이터로 사용 및/또는 저장할 수 있다.In an additional embodiment, the computing device 100 may use and/or store as true (e.g., true positive or true negative) data the results of molecular biological experiments on whether CDR3α and CDR3β are combined, regardless of the prediction result through the first model on whether or not CDR3α and CDR3β are combined.

일 실시예에서, 컴퓨팅 장치(100)는 예측 모델과 분자생물학적 실험 결과에 따라 결정된 CDR3α와 CDR3β의 조합 여부에 대한 True Positive, False Positive, True Negative 또는 False Negative 데이터들을 이용하여 예측 모델의 성능 평가를 할 수 있다. 구체적으로, 입력 데이터의 수가 많아질수록 예측 모델의 성능 평가 정확도가 향상될 수 있으며, 예를 들어, 다양한 학습 데이터셋으로 반복적으로 학습된 예측 모델의 성능이 향상된다면 True Positive, True Negative 결과에 비해 False Positive, False Negative 결과가 출력되는 비율이 작아질 수 있다.In one embodiment, the computing device 100 evaluates the performance of the predictive model using True Positive, False Positive, True Negative, or False Negative data on whether or not the combination of CDR3α and CDR3β is determined according to the prediction model and molecular biological experiment results. Specifically, as the number of input data increases, the performance evaluation accuracy of the predictive model may improve. For example, if the performance of a predictive model that is repeatedly trained with various training datasets improves, the ratio of outputting false positive and false negative results may decrease compared to true positive and true negative results.
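A performance evaluation over the four outcome categories can be sketched by counting labels and deriving ratios. The specific metrics (accuracy, error rate, precision, recall) and the function name `evaluate` are assumptions chosen for illustration; the disclosure states only that the share of False Positive and False Negative results may shrink as the model improves.

```python
from collections import Counter


def evaluate(outcomes):
    """Summarize a list of outcome labels into basic performance ratios."""
    counts = Counter(outcomes)
    tp = counts["True Positive"]
    fp = counts["False Positive"]
    tn = counts["True Negative"]
    fn = counts["False Negative"]
    total = tp + fp + tn + fn
    if total == 0:
        raise ValueError("at least one labeled outcome is required")
    return {
        "accuracy": (tp + tn) / total,
        "error_rate": (fp + fn) / total,  # share of false positive and false negative results
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


report = evaluate(["True Positive", "True Positive", "True Negative", "False Positive"])
```

A falling `error_rate` across successive training rounds would correspond to the decreasing proportion of false results described above.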

In one embodiment, the molecular-biology experiment on whether a CDR3α and a CDR3β combine may be performed through, without limitation, IFN-γ detection after exposure of a pMHC to the TCR, FACS, flow cytometry, LC-MS/MS analysis, and the like.

In one embodiment, the computing device 100 may preprocess the identifiers of the amino acids constituting the sequence included in the first data related to CDR3α, the identifiers of the amino acids included in the second data related to CDR3β, the identifiers of the amino acids constituting the sequence corresponding to a peptide, and/or the identifiers of the amino acids constituting the sequence corresponding to an MHC. In a further embodiment, the first data or the second data related to the TCR may include amino acid sequences constituting at least one of the complementarity determining regions (e.g., CDR3α, CDR3β) of the T cell receptor. In a further example, preprocessing for training or inference of a prediction model (e.g., the first model, the second model, or the third model of the present disclosure) may be performed using guide information generated by an external prediction model that operates on a peptide and an MHC (e.g., HLA, mouse MHC) as input data. Specifically, the guide information generated by the external prediction model may include the HLA type (e.g., HLA-B14:02, HLA-A02:03), amino acid sequence length information of the TCR, V(D)J type information of the TCR, and the like. For example, the amino acid sequence length information of the TCR, together with the V/J type information, may cause the prediction model to allow a wider diversity of predicted amino acids in its output data for the portion of CDR3 where amino acid diversity is high.

In a further embodiment, the TCR information corresponding to the first data and/or the second data may include a V/D/J type obtained from a public database for T cell receptors. In a further embodiment, the TCR information corresponding to the first data and/or the second data may include a known TCR amino acid sequence (e.g., CDR1, CDR2, CDR3) extracted from a public database (DB) for T cell receptors. In a further embodiment, the TCR information corresponding to the first data and/or the second data may include a proteomic sequence obtained from a subject derived from an organism and sequenced experimentally.

In one embodiment, at least one of the first data or the second data may be obtained from data output by a separate external prediction model different from the first model.

In a further embodiment, the separate external prediction model different from the first model may be a rule-based or artificial-intelligence-based model that takes a pMHC as input data and outputs the CDR3 of a TCR that binds to the pMHC. In one embodiment of the present disclosure, the peptide included in the pMHC is an antigenic peptide and may be a peptide found in a specific tissue (e.g., lesion tissue, tumor tissue). In a further embodiment, the peptide included in the pMHC may be a peptide presented on the MHC molecules of somatic cells or antigen-presenting cells. In a further embodiment, the peptide included in the pMHC may be a peptide extracted from a public database. In a further embodiment, the peptide included in the pMHC may be, without limitation, a peptide having a sequence of 6 to 11 amino acids that is assumed to be an arbitrary antigenic peptide.

In a further embodiment, the MHC included in the pMHC may include an HLA type obtained from a public DB for the major histocompatibility complex. In a further embodiment, the MHC included in the pMHC may include a known amino acid sequence extracted from a public DB for the major histocompatibility complex. In a further embodiment, the MHC included in the pMHC may include a proteomic sequence obtained from a subject derived from an organism and sequenced experimentally.

In one embodiment, the first data or the second data may include input data obtained by applying BLOSUM encoding or one-hot encoding to an amino acid sequence.
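As an illustration of the one-hot option, the sketch below maps each residue to a 20-dimensional indicator vector. The residue ordering and the name `one_hot` are assumptions made for this example; a BLOSUM encoding would instead replace each indicator row with the corresponding row of a BLOSUM substitution matrix, which is not shown here.

```python
# The 20 standard amino acids, ordered alphabetically by one-letter code (an arbitrary choice).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot(sequence):
    """Return one 20-element indicator row per residue of the sequence."""
    rows = []
    for aa in sequence:
        row = [0] * len(AMINO_ACIDS)
        row[INDEX[aa]] = 1  # raises KeyError for non-standard residue codes
        rows.append(row)
    return rows

encoded = one_hot("CASS")  # first four residues of a CDR3 such as CASSPGTGGALAEQFFGPG
```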

In one embodiment, the first data or the second data may further include at least one of: a first feature indicating polarity between amino acids, a second feature indicating the size of an amino acid, a third feature indicating whether an amino acid is hydrophobic or hydrophilic, a fourth feature indicating whether an amino acid carries a charge, or a fifth feature indicating whether an amino acid is aromatic or aliphatic.

In one embodiment, the data related to the MHC or the TCR may include a multiple sequence alignment (MSA) based on evolutionary biological genome data.

In one embodiment, ground-truth data on whether a CDR3α and a CDR3β combine may be determined by a molecular-biology experiment. In a further example, the molecular-biology experiment may react a TCR in which the constituent CDR3α and CDR3β are combined with a pMHC (e.g., co-culture of a peptide with lymphocytes, co-culture of a peptide with TCR-T (TCR-engineered T) cells, or introduction of a peptide proteome into somatic cells followed by exposure to lymphocytes), and then determine whether the CDR3α and the CDR3β constituting the TCR combine, based on whether IFN-γ detection shows that the TCR that interacted with the pMHC is immunogenic.

In a further embodiment, ground-truth data on whether the CDR3α of a TCR and the CDR3β of a TCR combine (e.g., whether they pair to form a TCR) may be obtained from a public DB (e.g., VDJdb, Expasy).

The method by which the first model predicts whether a CDR3α and a CDR3β combine (e.g., whether they correspond to a precedence/subsequence relationship) is described further below with reference to FIG. 4.

In one embodiment of the present disclosure, when a result is output indicating that the first data and the second data correspond to the precedence/subsequence relationship, the computing device 100 may store the combination of the first data and the second data in a CDR set candidate list (S330).

In one embodiment, when the first model predicts that the CDRα included in the first data and the CDRβ included in the second data correspond to the precedence/subsequence relationship, the computing device 100 may determine that the CDRα and the CDRβ can be combined with each other to form a TCR.

Training and inference of the first model, together with the negative dataset containing random combinations of CDRα and CDRβ and the positive dataset containing actually occurring combinations of CDRα and CDRβ that are used to train the first model, are described further below with reference to FIGS. 4 to 7.
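One way such positive and negative datasets could be assembled is sketched below. It is only an assumption-laden illustration: the function `build_datasets`, the 1:1 negative-to-positive ratio, and the placeholder identifiers (`alpha_A`, `beta_A`, and so on) are invented for this example and do not come from the disclosure. Note also that a randomly drawn mismatch is not guaranteed to be a true non-pair; it is merely not a known pair.

```python
import random

def build_datasets(positive_pairs, seed=0):
    """Return (alpha, beta, label) triples: label 1 for known pairs, 0 for random mismatches."""
    positives = set(positive_pairs)
    alphas = sorted({a for a, _ in positives})
    betas = sorted({b for _, b in positives})
    # Every cross-combination that is not a known pair is a candidate negative.
    candidates = [(a, b) for a in alphas for b in betas if (a, b) not in positives]
    rng = random.Random(seed)
    rng.shuffle(candidates)
    negatives = candidates[: len(positives)]  # assumed 1:1 ratio
    return [(a, b, 1) for a, b in sorted(positives)] + [(a, b, 0) for a, b in negatives]

# Placeholder identifiers standing in for real CDR3α/CDR3β sequences.
POSITIVE_PAIRS = [("alpha_A", "beta_A"), ("alpha_B", "beta_B"), ("alpha_C", "beta_C")]
labeled = build_datasets(POSITIVE_PAIRS)
```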

FIG. 4 exemplarily illustrates a method of generating a prediction result that includes combination information for the CDR3α of a TCR and the CDR3β of a TCR, according to an embodiment of the present disclosure.

FIG. 4 exemplarily illustrates how a prediction model that generates a prediction result, including whether a CDR3α and a CDR3β correspond to the precedence/subsequence relationship, is trained or performs inference using a preprocessed input data set, according to an embodiment of the present disclosure.

In one embodiment, FIG. 4 shows an input data set 420 composed of TCR-related data obtained by preprocessing raw data 41a, 41b that include the amino acid sequences of the CDR3α and the CDR3β of a TCR, a prediction model 4300 driven using the input data set 420, and its prediction result 440 (e.g., the CDR3α and the CDR3β can be combined, or the CDR3α and the CDR3β cannot be combined). For example, the model shown in FIG. 4 may correspond to the prediction model 4300.

In one embodiment, the computing device 100 may output a precedence/subsequence prediction result 440 for the TCR α chain and the TCR β chain through the artificial-intelligence-based prediction model 4300.

During training of the prediction model 4300 in FIG. 4, reference numeral 440 may be used as label data that include a first label indicating that the CDR3α and the CDR3β can be combined or a second label indicating that the CDR3α and the CDR3β cannot be combined.

According to one embodiment of the present disclosure, whether the TCR α chain and the TCR β chain can be combined, as inferred by the prediction model 4300, may be generated based on the information 41a corresponding to the CDR3α of the TCR and/or the information 41b corresponding to the CDR3β of the TCR.

In one embodiment, while the prediction model 4300 receives the information needed to infer the immunogenicity of a pMHC-TCR, the computing device 100 may input the information 41a corresponding to the CDR3α of the TCR and/or the information 41b corresponding to the CDR3β of the TCR in the form of a series of consecutive data 410. In one embodiment, during training and/or inference, the first data corresponding to the CDR3α of the TCR and the second data corresponding to the CDR3β of the TCR may be preprocessed and used as the input data set 420.

In a further example, preprocessing for training or inference of the prediction model may be performed using guide information generated by an external prediction model that operates on a peptide and an MHC (e.g., HLA, mouse MHC) as input data. Specifically, the guide information generated by the external prediction model may include the HLA type (e.g., HLA-B14:02, HLA-A02:03), amino acid sequence length information of the TCR, V(D)J type information of the TCR, and the like. For example, the amino acid sequence length information of the TCR, together with the V/J type information, may cause the prediction model to allow a wider diversity of predicted amino acids in its output data for the portion of CDR3 where amino acid diversity is high.

In one embodiment, the consecutive data 410 of the information 41a corresponding to the CDR3α of the TCR and/or the information 41b corresponding to the CDR3β of the TCR may include a unique classification token [cls] in front of the information 41a corresponding to the CDR3α of the TCR. In one embodiment, as shown in FIG. 4, the input data set 420 obtained by preprocessing the series of consecutive data 410 may include a separator token [sep] between the information 41a corresponding to the CDR3α of the TCR and the information 41b corresponding to the CDR3β of the TCR.
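The token layout described above can be sketched as follows, assuming for simplicity that each residue is its own token. The function name `build_input` is hypothetical; the two sequences are those used later in this description as an example of a CDR3α and a CDR3β.

```python
CLS, SEP = "[cls]", "[sep]"

def build_input(cdr3_alpha_tokens, cdr3_beta_tokens):
    """[cls] + CDR3α tokens + [sep] + CDR3β tokens, in that order."""
    return [CLS, *cdr3_alpha_tokens, SEP, *cdr3_beta_tokens]

# Per-residue tokenization is an assumption made for this sketch.
tokens = build_input(list("CASSATGSQNTLYFGPG"), list("CASTDTSQNTLYFGAG"))
```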

In one embodiment, the first data 41a corresponding to the CDR3α of the TCR may be preprocessed into E_1 to E_N, as shown in FIG. 4. In a further embodiment, the second data 41b corresponding to the CDR3β of the TCR may be preprocessed into E_1' to E_N', as shown in FIG. 4. In a further embodiment, the first data or the second data related to the TCR may include amino acid sequences constituting at least one of the complementarity determining regions (e.g., CDR3α, CDR3β) of the T cell receptor.

In one embodiment of the preprocessing, the CDR amino acid sequences 41a, 41b of the TCR may be segmented, for example, based on guide information (e.g., information from an external model) or randomly. For example, when the second data related to the CDR3β of the TCR include the amino acid sequence CASSPGTGGALAEQFFGPG, that CDR3β may be segmented as CAS/SPG/T/GGA/LA/EQ/FFGPG or as CASS/PGT/GG/A/LAEQ/FFGPG.
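A random segmentation of this kind can be sketched as below. The function `random_segments`, the maximum segment length of 5, and the uniform choice of segment lengths are assumptions made for illustration; the disclosure does not specify how random cut points are drawn.

```python
import random

def random_segments(sequence, max_len=5, seed=None):
    """Cut a sequence into consecutive segments of random length between 1 and max_len."""
    rng = random.Random(seed)
    segments, i = [], 0
    while i < len(sequence):
        step = rng.randint(1, min(max_len, len(sequence) - i))
        segments.append(sequence[i:i + step])
        i += step
    return segments

cdr3_beta = "CASSPGTGGALAEQFFGPG"
segments = random_segments(cdr3_beta, seed=0)

# The two segmentations given in the text are both valid cuts of the same sequence.
example_a = "CAS/SPG/T/GGA/LA/EQ/FFGPG".split("/")
example_b = "CASS/PGT/GG/A/LAEQ/FFGPG".split("/")
```

Whatever the cut points, concatenating the segments must reproduce the original sequence, which is the invariant a guide-based segmenter would also have to satisfy.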

In one embodiment, the computing device 100 may preprocess the raw data to generate the units (E_1 to E_N) constituting the CDR3α 41a and the units (E_1' to E_N') constituting the CDR3β 41b. In a further embodiment, the amino acid sequences included in the first data and the second data may be obtained, without limitation, based on a pMHC 40 formed by binding of an MHC and a peptide.

For example, when the CDR3α included in the first data 41a and the CDR3β included in the second data 41b are the paired α chain and β chain of a TCR that binds to a specific pMHC 40, the first model 4300 receiving such first data and second data must determine that the CDR3α and the CDR3β combine in order to be regarded as having achieved the prediction objective of the present disclosure.

As another example, when the CDR3α included in the first data and the CDR3β included in the second data are an α chain and β chain pair of a TCR that does not bind to a specific pMHC 40, the first model 4300 receiving such first data 41a and second data 41b must determine that the CDR3α and the CDR3β do not combine in order to be regarded as having achieved the prediction objective of the present disclosure.

In one embodiment of the present disclosure, the information 402 on the CDR3α included in the first data 41a and the information 403 on the CDR3β included in the second data 41b may be output by an artificial-intelligence-based or rule-based generative model different from the first model (e.g., a second model 4020, a third model 4030). In one embodiment, the information 402 on the CDR3α may be generated using the second model 4020, which takes the pMHC 40 as input and outputs the corresponding CDR3α. In one embodiment, the information 403 on the CDR3β may be generated using the third model 4030, which takes the pMHC 40 as input and outputs the corresponding CDR3β.

In one embodiment, the raw data of the first data 41a and the second data 41b input to the first model may be obtained by an external model, different from the first model, that generates a CDR3α together with the CDR3β corresponding to that CDR3α, or a CDR3β together with the CDR3α corresponding to that CDR3β. The external model different from the first model is described further below with reference to FIG. 5.

In one embodiment, the computing device 100 may embed the preprocessed input data set 420 for the CDR3α and the CDR3β. In a further embodiment, the computing device 100 may apply positional embedding to the position information of the preprocessed input data set 420 for the CDR3α and the CDR3β, the [cls] token, and the [sep] token.

In one embodiment, training or inference of the prediction model may include token embedding and position embedding. The training method of the first model 4300 may include embedding the amino acid identifiers obtained by tokenizing or segmenting the input data set 420 of the CDR3α and the CDR3β, which is generated by preprocessing the raw data 410. In one embodiment according to the present disclosure, the training method of the first model 4300, the second model 4020, or the third model 4030 may include embedding position data that reflect the order, within the amino acid sequence, of the units constituting the CDR3α and CDR3β sequences of the TCR. For example, if the CDR3α sequence of the TCR has 9 amino acids and the CDR3β sequence of the TCR has 13 amino acids, the first model 4300 may be trained with 22 position embedding vectors (i.e., the sum of 9 and 13).
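The 9 + 13 = 22 example can be made concrete with the sketch below. It assumes per-residue tokens, randomly initialised tables, an embedding width of 8, and element-wise addition of token and position embeddings (the convention used in BERT-style models); none of these choices is stated in the disclosure. The two sequences are truncated from sequences appearing elsewhere in this description purely to obtain lengths 9 and 13.

```python
import random

RESIDUES = "ACDEFGHIKLMNPQRSTVWY"
TOKEN_ID = {aa: i for i, aa in enumerate(RESIDUES)}

def make_table(n_rows, dim, rng):
    """Randomly initialised embedding table; a trained model would learn these values."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(n_rows)]

def embed(sequence, token_table, position_table):
    """Add the token embedding and the position embedding at every position."""
    return [
        [t + p for t, p in zip(token_table[TOKEN_ID[aa]], position_table[pos])]
        for pos, aa in enumerate(sequence)
    ]

rng = random.Random(0)
dim = 8                        # assumed embedding width
cdr3_alpha = "CASSATGSQ"       # 9 residues
cdr3_beta = "CASTDTSQNTLYF"    # 13 residues
sequence = cdr3_alpha + cdr3_beta

token_table = make_table(len(RESIDUES), dim, rng)
position_table = make_table(len(sequence), dim, rng)  # 9 + 13 = 22 position vectors
embedded = embed(sequence, token_table, position_table)
```

If the [cls] and [sep] tokens are also given positions, the position table would need two more rows; the count of 22 in the text covers the residues only.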

In one embodiment according to the present disclosure, positional embedding may refer to an operation for reflecting position information about the amino acids constituting the pMHC and/or the TCR. For example, positional embedding may be performed in a separate embedding layer for reflecting the order information of the amino acids.

As a further example, E_1 to E_N and E_1' to E_N' may each represent a token that includes one or more amino acids. For example, a token constituting part of a TCR sequence may be a sequence of several amino acids or a single amino acid, such as CASS, CAS, ASS, CA, AS, SS, C, A, or S.

In one embodiment, whether a CDR3α and a CDR3β are in a precedence/subsequence relationship may be predicted based on the NSP task 450 of natural language processing. For example, the first model 4300 may use the process of Next Sentence Prediction (NSP), which in natural language processing identifies the relationship between two sentences by predicting whether the two sentences were originally consecutive. In one embodiment, the NSP task 450 may use a process that determines, when one sentence is "Several airports are connected by Korean Air(KE), United(UA), Air Berlin(AB), Delta Airlines(DL), and Cathay Pacific Airways(CX).", that the sentence "Those airports including ICN, JFK, FRA, CDG, BER, MAN and HKG are located in Incheon, New York, Frankfurt, Paris, Berlin, Manchester and Hongkong." can follow it as the next sentence (IsNext).

In a further embodiment, the NSP task 450 may use a process that determines, when one sentence is "Several airports are connected by Korean Air(KE), United(UA), Air Berlin(AB), Delta Airlines(DL), and Cathay Pacific Airways(CX).", that the sentence "Those airports including NRT, KIX, PEK, PVG and PUS are located in Tokyo, Osaka, Beijing, Shanghai and Busan." cannot follow it as the next sentence (NotNext).

In one embodiment according to the present disclosure, NSP may be used as a method of identifying whether sentences are in a precedence/subsequence relationship, and is not limited to techniques used in the natural language processing of a language model. By way of example, this means that the first model of the present disclosure is not limited to language-model tokens, which carry sentence components, word order, context, relationships between words, and the like; amino acid species, electrochemical and biochemical characteristics, relationships between neighboring amino acid species, three-dimensional compound structure, and the like may instead borrow the machine learning approach used for language-model tokens.

In one embodiment, through the NSP task 450 using the process described above, the first model 4300 may determine that a CDR3β having the amino acid sequence CASTDTSQNTLYFGAG can follow a CDR3α having the amino acid sequence CASSATGSQNTLYFGPG (IsNext). For example, the first model 4300 may receive these CDR3α and CDR3β data and output a prediction result indicating that the CDR3α having the amino acid sequence CASSATGSQNTLYFGPG and the CDR3β having the amino acid sequence CASTDTSQNTLYFGAG combine with each other.

In a further embodiment, through the NSP task 450 using the process described above, the first model 4300 may determine that a CDR3β having the amino acid sequence CASTDTSQNTLYFGAG cannot follow a CDR3α having the amino acid sequence CASSIRSSYEQYFEG (NotNext). For example, the first model 4300 may receive these CDR3α and CDR3β data and output a prediction result indicating that the CDR3α having the amino acid sequence CASSIRSSYEQYFEG and the CDR3β having the amino acid sequence CASTDTSQNTLYFGAG cannot be combined with each other.

In such embodiments, by outputting various CDR3 sequences, the computing device 100 may generate information on pMHC antigen-specific TCRs and thereby more effectively obtain amino acid sequence information for TCR α chains and β chains (e.g., CDR3α and CDR3β) that can be combined in silico, in vivo, or in vitro.

FIG. 5 exemplarily illustrates a method of generating data that include CDR3α information of a TCR or CDR3β information of a TCR, according to an embodiment of the present disclosure.

In one embodiment according to the present disclosure, to obtain the raw data of the first data 41a and the second data 41b input to the first model 4300, an external model different from the first model (e.g., the second model 4020, the third model 4030) may be used that generates a CDR3α together with the CDR3β corresponding to that CDR3α, or a CDR3β together with the CDR3α corresponding to that CDR3β. In a further embodiment, the information on the CDR3α and the CDR3β that may be generated by external models such as the second model 4020 and the third model 4030 may also be obtained by a prediction model 500 as shown in FIG. 5.

In one embodiment, FIG. 5 shows a first input data set 501 composed of peptide-related and MHC-related data obtained by preprocessing raw data 511 that include the amino acid sequences of the peptide and the MHC constituting a pMHC, a prediction model 500 that includes an encoder driven using the first input data set 501, and second output data 55a (e.g., the amino acid sequence of a CDR3α that combines with an input or output CDR3β). For example, the model shown in FIG. 5 may correspond to the prediction model 500.

In a further embodiment, the prediction model may include a decoder driven using a second input data set 502 that includes data related to the CDR3α and data related to the CDR3β. In one embodiment, the CDR3α sequence prediction result 55a of the prediction model including the encoder and the decoder may be output ahead of the CDR3β result, for example as CVRYLCAIENTF[eos].

In a further embodiment, the EOS token of the exemplary output CVRYLCAIENTF[eos] may include information on the CDR3 and information on the peptide and the MHC constituting the pMHC. In addition, as shown in FIG. 5, the second input data set 502 including the CDR3α and the CDR3β that is input to the decoder may be preprocessed to include an [EOS] token between the CDR3α data and the CDR3β data. In a further embodiment, such an EOS token may be subject to training by the prediction model 500. Specifically, the CDR3 information and the information on the peptide and the MHC constituting the pMHC carried by the EOS token may be updated through training and/or inference.

According to one embodiment of the present disclosure, the computing device 100 may output, through the artificial-intelligence-based prediction model 500, a CDR3α/CDR3β prediction result 550 that is a combination of a TCR α chain and a TCR β chain.

According to one embodiment of the present disclosure, the CDR3α/CDR3β prediction result inferred by the prediction model 500, which is a combination of a TCR α chain and a TCR β chain, may be generated based on an embedding layer 510, a first set of hidden layers 520, a second set of hidden layers 530, and a projection layer 540.

In one embodiment, the computing device 100 may process data obtained by preprocessing the first input data set 501, which is based on raw pMHC data 511, and the second input data set 502, which is based on raw CDR3 data, both of which are input to the embedding layer 510.

In one embodiment, the computing device 100 may process data passed from the embedding layer 510 to the first set of hidden layers 520. In a further example, as shown in FIG. 5, in the first set of hidden layers 520 of the prediction model 500, the sub hidden layers that receive the first input data 501 and the sub hidden layers that receive the CDR3 data 502 may be connected by residual connections. In a further example, in the second set of hidden layers 530 of the prediction model, the sub hidden layers that receive the first input data 501 and the sub hidden layers that receive the CDR3 data 502 may likewise be connected to each other by residual connections.

According to an embodiment of the present disclosure, the prediction model may include an encoder and a decoder. In one embodiment, the prediction model 500 may be pre-trained such that the first input data set 501 including pMHC data is input to the encoder, the second input data set 502 including CDR3α data and CDR3β data is input to the decoder, and the decoder outputs a prediction result including amino acid sequences associated with the CDR3α data and the CDR3β data.

In a further embodiment, the CDR3-related second input data set 502 that is input to the decoder may include an EOS token located between the data related to CDR3α and the data related to CDR3β. In a further embodiment, the CDR3-related second input data set 502 that is input to the decoder may include an EOS token at the end of the data related to CDR3β.
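The token layout described above can be sketched as follows. This is a minimal illustration only: the function name `build_decoder_input`, the per-residue tokenization, and the two short sequences are assumptions made for the example and are not taken from the disclosure.

```python
from typing import List

EOS = "[EOS]"  # token spelling follows the text; its encoding is an assumption


def build_decoder_input(cdr3_alpha: str, cdr3_beta: str) -> List[str]:
    """Lay out one token per residue, with an EOS token between the
    CDR3-alpha and CDR3-beta data and a second EOS token after CDR3-beta."""
    return list(cdr3_alpha) + [EOS] + list(cdr3_beta) + [EOS]


# Hypothetical sequences, used only to show the layout.
tokens = build_decoder_input("CAVR", "CASSL")
print(tokens)
```

In this layout the first EOS token sits where the disclosure places the token carrying pMHC and CDR3α information, and the trailing EOS token sits where it places the token carrying pMHC and CDR3β information.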

In an embodiment according to the present disclosure, the first EOS token, located between the CDR3α data and the CDR3β data in the CDR3-related second input data set input to the decoder, may include information on the pMHC and CDR3α. In another embodiment, the second EOS token, located at the end of the CDR3β data in the CDR3-related second input data set 502 input to the decoder, may include information on the pMHC and CDR3β.

In a further embodiment, the second EOS token including information on the pMHC and CDR3β may be used to obtain second output data 55a including amino acid sequence information of the CDR3α corresponding to the pMHC and/or the CDR3β.

In one embodiment, the V vector 541 used for processing in the projection layer 540 of the prediction model may be used to compute the output probabilities of amino acid identifiers or the EOS token, for example by passing the last hidden state of an LSTM through a dense layer to convert it into a vector of probability values between 0 and 1. As an example, after applying the V vector 541 in the projection layer 540, the prediction model may determine one class (e.g., amino acid R, H, K, D, E, S, T, N, Q, C, U, G, P, A, I, L, M, F, W, Y, or V, or [eos]). In a further example, when the prediction model, with the V vector 541 applied to the input data in the projection layer 540, predicts for the 8th amino acid of the second output data 55a for CDR3α a probability of 0.50 for Y (tyrosine), 0.35 for W (tryptophan), and 0.15 for F (phenylalanine), it may generate a prediction result in which the 8th amino acid of the CDR3α is Y.
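The class selection in the Y/W/F example can be sketched as below. The disclosure only states that the dense layer yields values between 0 and 1; the use of a softmax to obtain them, and the helper names `softmax` and `pick_class`, are assumptions made for this illustration.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1
    (assumed normalization; the disclosure does not name one)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pick_class(probabilities):
    """Return the class (amino acid letter or EOS) with the highest probability."""
    return max(probabilities, key=probabilities.get)


# Probabilities for the 8th CDR3-alpha position, as given in the text.
position_8 = {"Y": 0.50, "W": 0.35, "F": 0.15}
predicted = pick_class(position_8)
print(predicted)  # Y
```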

In a further example, when the prediction model 500, with the V vector 541 applied to the input data in the projection layer 540, predicts for the 9th amino acid of the second output data 55a for CDR3α a probability of 0.61 for S (serine), 0.20 for N (asparagine), and 0.19 for T (threonine), it may generate a prediction result in which the 9th amino acid of the CDR3α is T.

In one embodiment, the prediction model 500 may output data at positions corresponding to each of the positions in the amino acid sequence constituting the CDR3 included in the second input data set 502. In a further embodiment, when an amino acid is present at a first position of the second input data set 502 and an EOS token is derived as the first-ranked prediction at a second position of the output data corresponding to the first position, the prediction model 500 may output, as the prediction result, the amino acid derived as the lower-ranked second-ranked prediction instead of the first-ranked EOS token.

As an example, when an amino acid is present at a first position (e.g., the 4th position from the N-terminus) of the second input data set 502 and [eos] is derived as the first-ranked prediction (e.g., probability 0.52) at the corresponding second position (e.g., the 4th position from the N-terminus) of the output data, the prediction model 500 may exclude [eos] and use as output data the amino acid (e.g., Q: glutamine) derived as the second-ranked prediction (e.g., probability 0.48). In this way, the computing device 100 can prevent an end-of-sequence result from being generated before a biologically plausible CDR3 length is reached, and can thereby generate more accurate TCR information.

In one embodiment, when an EOS token is derived as the first-ranked prediction at a position before a predetermined threshold length (e.g., a length of 5 amino acids) is exceeded, the prediction model 500 may output, as the prediction result, the amino acid derived as the lower-ranked second-ranked prediction instead of the first-ranked EOS token.

As an example, when [eos] is derived as the first-ranked prediction (e.g., probability 0.54) before the predetermined threshold length of 5 is exceeded, for instance at the 5th position from the N-terminus of CDR3β, the prediction model 500 may exclude [eos] and use as output data the amino acid (e.g., I: isoleucine) derived as the second-ranked prediction (e.g., probability 0.41). In this way, the computing device 100 can prevent an end-of-sequence result from being generated before a biologically plausible CDR3 length is reached, and can thereby generate more accurate TCR information.
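The fallback from a top-ranked EOS token to the next-ranked amino acid can be sketched as follows. The function name `select_token`, the 1-based position convention, and the third entry in the example distribution are assumptions made for illustration; the threshold of 5 and the 0.54/0.41 probabilities come from the example above.

```python
EOS = "[eos]"


def select_token(probs, position, min_length=5):
    """Pick the output token for one position (1-based, from the N-terminus).

    If EOS is ranked first at a position that does not yet exceed
    min_length, fall back to the highest-ranked non-EOS token.
    """
    ranked = sorted(probs, key=probs.get, reverse=True)
    if ranked[0] == EOS and position <= min_length:
        # Skip the premature end-of-sequence prediction.
        return next((tok for tok in ranked if tok != EOS), EOS)
    return ranked[0]


# EOS 0.54 and I 0.41 are from the text; "L" is a filler entry.
fifth_position = {EOS: 0.54, "I": 0.41, "L": 0.05}
print(select_token(fifth_position, position=5))  # I
print(select_token(fifth_position, position=6))  # [eos]
```

The same function covers the earlier 4th-position example (EOS 0.52, Q 0.48), since position 4 also lies within the threshold.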

In one embodiment, a bindable MHC and peptide pair, or each of the MHC and the peptide, may be obtained from a public database (e.g., IEDB, HLA Ligand Atlas, SysteMHC Atlas). In a further embodiment, a bindable MHC and peptide (i.e., pMHC) may instead be determined by a separate artificial intelligence or rule-based model that takes an MHC and a peptide as input and outputs whether they can bind.

In a further embodiment, information on a TCR that can bind the pMHC of the first input data set (e.g., length information, V/J type, etc.) may be determined by a separate artificial intelligence or rule-based model that takes the pMHC as input and generates information on the TCR corresponding to the pMHC.

In a further embodiment, the TCR CDR3 prediction results of the sub-prediction models referenced at each of the multiple epochs run across the plurality of sub-prediction models included in the prediction model 500 enable accurate performance evaluation, and may improve the training and/or inference efficiency and/or accuracy of the prediction model.

According to an embodiment of the present disclosure, the computing device 100 may generate a CDR3β candidate list included in first output data generated using the first EOS token. In a further embodiment, the computing device 100 may generate a CDR3α candidate list included in second output data generated using the second EOS token. In a further embodiment, a CDR3 set candidate list containing the results generated by a model of the present disclosure that outputs combinations of CDR3α and CDR3β that bind to each other (e.g., the first model 4300 or the prediction model 500) may refer to, for example, a list of candidate TCR-constituting CDR3s (CDR3α and CDR3β) usable for TCR-T. As an example, a candidate list of CDR3α or CDR3β may be generated for subject 1, and a TCR including a constituent CDR3 selected from the candidate list may be used in treatment of subject 1.

In a further embodiment, specifically, the computing device 100 may combine CDR3α and CDR3β from the respective CDR3α and CDR3β candidate lists included in the first output data and the second output data to derive the complementarity determining region of a TCR corresponding to a specific peptide and a specific MHC. Accordingly, once a target antigen (e.g., an antigen found specifically in a solid tumor) is identified, the prediction model can derive the most effective TCR corresponding to that target antigen.

Through the output data 55a illustrated in FIG. 5, the computing device 100 may generate antigen-specific TCR information by outputting the amino acid sequence information of the CDR3α of the TCR α chain (TCRα-chain) that pairs with the TCR β chain (TCRβ-chain) in silico, in vivo, or in vitro.

In one embodiment, when the peptide and MHC included in the first input data set 501 bind to each other, and information on a CDR3α and a CDR3β that pair with each other is included in the second input data set 502, the prediction model 500 that receives the first input data set 501 can be regarded as achieving the prediction objective of the present disclosure only if it generates information 55a including the amino acid sequence of the CDR3α that can pair with the CDR3β. In another example, under the same conditions, the prediction model 500 can be regarded as achieving the prediction objective of the present disclosure only if it generates information 55a having an amino acid sequence that is similar, at or above a threshold (e.g., 5 of 6 amino acids), to the amino acid sequence of the CDR3α that can pair with the CDR3β.
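The threshold check in the second example can be sketched as follows. The disclosure does not define the similarity measure, so this sketch assumes position-wise identity over the longer of the two sequences; the function names and the six-residue sequences are hypothetical.

```python
from fractions import Fraction


def positional_identity(predicted: str, reference: str) -> Fraction:
    """Fraction of positions with identical residues (assumed similarity
    measure), relative to the longer of the two sequences."""
    length = max(len(predicted), len(reference))
    if length == 0:
        return Fraction(0)
    matches = sum(1 for p, r in zip(predicted, reference) if p == r)
    return Fraction(matches, length)


def meets_objective(predicted: str, reference: str,
                    threshold: Fraction = Fraction(5, 6)) -> bool:
    """True when the predicted CDR3-alpha reaches the similarity threshold."""
    return positional_identity(predicted, reference) >= threshold


# Hypothetical six-residue sequences differing at one and at two positions.
print(meets_objective("CAVRDS", "CAVRDT"))  # True: 5 of 6 positions match
print(meets_objective("CAVKDS", "CAVRDT"))  # False: 4 of 6 positions match
```

Exact fractions are used so that a 5-of-6 match compares equal to the 5/6 threshold without floating-point rounding.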

FIG. 6 illustrates, by way of example, first positive data, first negative data, and second negative data containing information on combinations of the CDR3α and CDR3β of a TCR, according to an embodiment of the present disclosure.

In an embodiment of the present disclosure, as shown in FIG. 6, the computing device 100 may randomly combine a first dataset corresponding to CDR3α and a second dataset corresponding to the CDR3β of a TCR to generate a first negative dataset.

In one embodiment, the negative data may include combinations of CDR3α and CDR3β whose binding is unknown. Specifically, for example, if 46 CDR3α amino acid sequences and 61 CDR3β amino acid sequences are stored in a storage medium, the computing device 100 may store in a database (the first negative dataset) the random combinations of the 46 CDR3α amino acid sequences and the 61 CDR3β amino acid sequences (e.g., 46 × 61 = 2806 combinations) as possible unknown combinations of a TCR α chain and a TCR β chain.

In one embodiment, the computing device 100 may check the randomly combined first negative dataset for CDR3α and CDR3β pairs that have been identified as binding to each other. In a further embodiment, whether the CDR3α and the CDR3β bind to each other may be confirmed through molecular biology experiments. For example, based on experimental analysis of the amino acid sequence of a TCR obtained from a subject, the computing device 100 may identify a CDR3α and CDR3β pair (e.g., combination) that bind to each other. In another embodiment, whether the CDR3α and CDR3β bind to each other may be confirmed from public database data.

In a further embodiment, the computing device 100 may generate a second negative dataset by excluding from the first negative dataset the CDR3α and CDR3β pairs identified as binding to each other through the molecular biology experiments or through a public database.
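The construction of the two negative datasets can be sketched as follows. The function name `build_negative_datasets`, the placeholder identifiers standing in for amino acid sequences, and the two "known binding" pairs are assumptions made for illustration; only the 46 and 61 counts come from the text.

```python
from itertools import product


def build_negative_datasets(alphas, betas, known_binding_pairs):
    """Return (first_negative, second_negative).

    first_negative: every alpha/beta combination, binding status unknown.
    second_negative: first_negative minus pairs known to bind.
    """
    first_negative = set(product(alphas, betas))
    second_negative = first_negative - set(known_binding_pairs)
    return first_negative, second_negative


# Placeholder identifiers instead of real CDR3 sequences.
alphas = [f"alpha_{i}" for i in range(46)]
betas = [f"beta_{i}" for i in range(61)]
known = {("alpha_0", "beta_0"), ("alpha_1", "beta_5")}  # hypothetical pairs

first_negative, second_negative = build_negative_datasets(alphas, betas, known)
print(len(first_negative))   # 2806
print(len(second_negative))  # 2804
```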

In one embodiment, the computing device 100 may include in a first positive dataset the combinations of CDR3α and CDR3β identified as binding to each other through molecular biology experiments or through a public database. Specifically, the combinations of CDR3α and CDR3β included in the first positive dataset may consist of combinations of a TCR α chain and a TCR β chain that are known to bind to each other. As an example, combinations of CDR3α and CDR3β published in a public database may be included in the first positive dataset. As another example, combinations of CDR3α and CDR3β found experimentally in TCRs obtained from individual subjects may be included in the first positive dataset.

In a further embodiment, when a CDR3α and CDR3β pair classified as negative data with unknown binding is confirmed to bind to each other, the computing device 100 may move that combination of CDR3α and CDR3β to the first positive dataset. In a further embodiment, when a CDR3α and CDR3β pair classified as negative data with unknown binding is confirmed not to bind to each other, the computing device 100 may store that combination of CDR3α and CDR3β in the second negative dataset.
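The relabeling step above can be sketched as follows. The function name `record_experiment`, the use of Python sets as datasets, and the choice to also drop a confirmed binder from the second negative dataset are assumptions made for this illustration.

```python
def record_experiment(pair, binds, first_negative, first_positive, second_negative):
    """Move a pair out of the unknown pool once its binding is confirmed."""
    first_negative.discard(pair)
    if binds:
        # Confirmed binder: it belongs with the known positive pairs.
        second_negative.discard(pair)
        first_positive.add(pair)
    else:
        # Confirmed non-binder: keep it as a verified negative.
        second_negative.add(pair)


# Hypothetical identifiers for two pairs with unknown binding status.
first_negative = {("alpha_1", "beta_1"), ("alpha_1", "beta_2")}
second_negative = set(first_negative)
first_positive = set()

record_experiment(("alpha_1", "beta_1"), True,
                  first_negative, first_positive, second_negative)
record_experiment(("alpha_1", "beta_2"), False,
                  first_negative, first_positive, second_negative)
print(first_positive, second_negative)
```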

In an embodiment according to the present disclosure, the first model 4300 may be trained based on the first positive dataset, the first negative dataset, and/or the second negative dataset generated as described above. In one embodiment, the first model 4300 trained with the CDR3α and CDR3β combinations in the first positive dataset as training data may learn CDR3α and CDR3β sets that are in a precedence/subsequence relationship. In a further embodiment, the first model 4300 trained with the CDR3α and CDR3β combinations in the second negative dataset as training data may learn CDR3α and CDR3β sets that are not in a precedence/subsequence relationship.

In a further embodiment, when the prediction of the first model 4300 for a CDR3α and CDR3β combination in the first negative dataset agrees with the result of a molecular biology experiment, the computing device 100 may move that combination to the second negative dataset or the first positive dataset. For example, when a combination of CDR3α and CDR3β that the first model 4300 predicted to be in a precedence/subsequence relationship is identified as binding in an actual molecular biology experiment, the computing device 100 may move that combination to the first positive dataset. As another example, when a combination of CDR3α and CDR3β that the first model 4300 predicted to be in a precedence/subsequence relationship is identified as not binding in a molecular biology experiment, the computing device 100 may move that incorrectly predicted combination to the second negative dataset.

In an embodiment according to the present disclosure, a negative dataset or positive dataset that is updated in the manner described above, as newly known information is added to the supervised/semi-supervised training data, may improve the efficiency and/or accuracy of training the prediction model for combinations of a TCR α chain and a TCR β chain.

FIG. 7 illustrates, by way of example, storing a combination of CDR3α and CDR3β in a CDR set candidate list, according to an embodiment of the present disclosure.

In one embodiment, the computing device 100 may obtain a list 702 of CDR3α sequences generated by the second model 4020. In one embodiment, the computing device 100 may obtain a list 703 of CDR3β sequences generated by the third model 4030.

In a further embodiment, the CDR3α/CDR3β combination prediction model 710 (e.g., the first model 4300) may predict whether an input CDR3α and CDR3β are in a precedence/subsequence relationship. For example, when the CDR3α/CDR3β combination prediction model 710 determines that the input CDR3α and CDR3β are not in a precedence/subsequence relationship, the combination of the CDR3α and the CDR3β may be output as not binding. Such a CDR3α and CDR3β combination that is not in a precedence/subsequence relationship may be classified as NotNext output data 724, as shown in FIG. 7.

As another example, when the CDR3α/CDR3β combination prediction model 710 determines that the input CDR3α and CDR3β are in a precedence/subsequence relationship, the combination of the CDR3α and the CDR3β may be output as binding. Such a CDR3α and CDR3β combination that is in a precedence/subsequence relationship may be classified as IsNext output data 722, as shown in FIG. 7.

In one embodiment, the combinations of CDR3α and CDR3β included in the IsNext output data 722 predicted by the CDR3α/CDR3β combination prediction model 710 may be stored 720 in a CDR set candidate list. In a further embodiment, the multiple combinations of CDR3α and CDR3β finally stored 720 in the CDR set candidate list may be used as information on a TCR and/or a TCR expressed in TCR-T. In a further embodiment, the multiple combinations of CDR3α and CDR3β finally stored 720 in the CDR set candidate list may be used as information on a target antigen-specific TCR.
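The flow of FIG. 7 can be sketched as a filter over the two generated lists. The disclosure does not give the classifier's interface here, so `is_next` is a stand-in callable; the toy lookup used as the classifier, the function names, and the short sequences are all hypothetical.

```python
def collect_candidates(alphas, betas, is_next):
    """Keep only alpha/beta pairs that the classifier labels IsNext."""
    candidates = []
    for alpha in alphas:
        for beta in betas:
            if is_next(alpha, beta):
                candidates.append((alpha, beta))  # store in candidate list
            # NotNext pairs are simply not stored.
    return candidates


# Toy stand-in for the combination prediction model: a fixed lookup.
toy_isnext_pairs = {("CAVR", "CASSL")}


def toy_is_next(alpha, beta):
    return (alpha, beta) in toy_isnext_pairs


candidate_list = collect_candidates(["CAVR", "CAGT"], ["CASSL", "CASRP"], toy_is_next)
print(candidate_list)  # [('CAVR', 'CASSL')]
```

Passing the classifier in as a callable keeps the filtering step independent of how the IsNext/NotNext decision is actually produced.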

In one embodiment, the generated results stored 720 in the CDR set candidate list from a model of the present disclosure that outputs combinations of CDR3α and CDR3β that bind to each other (e.g., the first model 4300 or the prediction model 500) may refer to, for example, a list of candidate TCR-constituting CDR3s (CDR3α and CDR3β) usable for TCR-T. As an example, a candidate list of CDR3α or CDR3β may be generated for subject 1, and a TCR including a CDR3α and CDR3β selected from the candidate list may be used in treatment of subject 1.

In a further embodiment, the computing device 100 may derive, from the combinations of CDR3α and CDR3β stored 720 in the CDR set candidate list, the complementarity determining region of a TCR corresponding to a specific peptide and a specific MHC. By storing 720 in the CDR set candidate list only those combinations in which the CDR3α and CDR3β are in a precedence/subsequence relationship, the computing device 100 can select from the list the combinations of CDR3α and CDR3β suited to the purpose of treating a target (e.g., a lesion, a test specimen, etc.). As an example, once a target antigen (e.g., an antigen found specifically in a solid tumor) is identified, the computing device 100 may derive a TCR including the most effective CDR3α and CDR3β corresponding to that target antigen.

FIG. 8 is a schematic diagram of a computing environment according to an embodiment of the present disclosure.

A component, module, or unit in the present disclosure includes routines, procedures, programs, components, data structures, and the like that perform particular tasks or implement particular abstract data types. Those skilled in the art will also appreciate that the methods presented in the present disclosure may be practiced with other computer system configurations, including single-processor or multiprocessor computing devices, minicomputers, and mainframe computers, as well as personal computers, handheld computing devices, microprocessor-based or programmable consumer electronics, and the like, each of which may operate in connection with one or more associated devices.

The embodiments described in the present disclosure may also be practiced in distributed computing environments in which certain tasks are performed by remote processing devices linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote memory storage devices.

A computing device typically includes a variety of computer-readable media. Any medium accessible by a computer may be a computer-readable medium, and such computer-readable media include volatile and non-volatile media, transitory and non-transitory media, and removable and non-removable media. By way of example, and not limitation, computer-readable media may include computer-readable storage media and computer-readable transmission media.

Computer-readable storage media include volatile and non-volatile media, transitory and non-transitory media, and removable and non-removable media implemented in any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data. Computer-readable storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital video disk (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be accessed by a computer and used to store the desired information.

Computer-readable transmission media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and include any information delivery media. The term modulated data signal means a signal that has one or more of its characteristics set or changed so as to encode information in the signal. By way of example, and not limitation, computer-readable transmission media include wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared, and other wireless media. Combinations of any of the above media are also included within the scope of computer-readable transmission media.

An exemplary environment 2000 for implementing various aspects of the present invention is shown, including a computer 2002, which includes a processing unit 2004, a system memory 2006, and a system bus 2008. Computer 2002 herein may be used interchangeably with a computing device. System bus 2008 couples system components, including but not limited to system memory 2006, to processing unit 2004. Processing unit 2004 may be any of a variety of commercially available processors. Dual-processor and other multiprocessor architectures may also be used as processing unit 2004.

System bus 2008 may be any of several types of bus structures that may further interconnect to a memory bus, a peripheral bus, and a local bus using any of a variety of commercially available bus architectures. System memory 2006 includes read-only memory (ROM) 2010 and random access memory (RAM) 2012. A basic input/output system (BIOS) is stored in non-volatile memory 2010, such as ROM, EPROM, or EEPROM, and contains the basic routines that help transfer information between components within computer 2002, such as during startup. RAM 2012 may also include high-speed RAM, such as static RAM, for caching data.

Computer 2002 also includes an internal hard disk drive (HDD) 2014 (e.g., EIDE, SATA), a magnetic floppy disk drive (FDD) 2016 (e.g., for reading from or writing to a removable diskette 2018), an SSD, and an optical disk drive 2020 (e.g., for reading a CD-ROM disk 2022, or for reading from or writing to other high-capacity optical media such as a DVD). Hard disk drive 2014, magnetic disk drive 2016, and optical disk drive 2020 may be connected to system bus 2008 by a hard disk drive interface 2024, a magnetic disk drive interface 2026, and an optical drive interface 2028, respectively. The interface 2024 for external drive implementations includes, for example, at least one or both of Universal Serial Bus (USB) and IEEE 1394 interface technologies.

These drives and their associated computer-readable media provide non-volatile storage of data, data structures, computer-executable instructions, and the like. For computer 2002, the drives and media accommodate the storage of any data in a suitable digital format. Although the above description of computer-readable storage media refers to an HDD, a removable magnetic disk, and removable optical media such as a CD or DVD, those skilled in the art will appreciate that other types of computer-readable storage media, such as zip drives, magnetic cassettes, flash memory cards, cartridges, and the like, may also be used in the exemplary operating environment, and that any such media may contain computer-executable instructions for performing the methods of the present invention.

A number of program modules, including an operating system 2030, one or more application programs 2032, other program modules 2034, and program data 2036, may be stored in the drives and RAM 2012. All or portions of the operating system, applications, modules, and/or data may also be cached in RAM 2012. It will be appreciated that the present invention may be implemented with various commercially available operating systems or combinations of operating systems.

A user may enter commands and information into computer 2002 through one or more wired/wireless input devices, for example, a keyboard 2038 and a pointing device such as a mouse 2040. Other input devices (not shown) may include a microphone, an IR remote control, a joystick, a game pad, a stylus pen, a touch screen, and the like. These and other input devices are often connected to processing unit 2004 through an input device interface 2042 that is coupled to system bus 2008, but may also be connected by other interfaces such as a parallel port, an IEEE 1394 serial port, a game port, a USB port, an IR interface, and the like.

A monitor 2044 or other type of display device is also connected to system bus 2008 through an interface such as a video adapter 2046. In addition to monitor 2044, a computer typically includes other peripheral output devices (not shown) such as speakers, printers, and the like.

Computer 2002 may operate in a networked environment using logical connections, via wired and/or wireless communications, to one or more remote computers, such as remote computer(s) 2048. Remote computer(s) 2048 may be a workstation, a server computer, a router, a personal computer, a portable computer, a microprocessor-based entertainment appliance, a peer device, or other common network node, and typically includes many or all of the components described for computer 2002, although, for brevity, only a memory storage device 2050 is shown. The logical connections shown include wired/wireless connectivity to a local area network (LAN) 2052 and/or a larger network, for example, a wide area network (WAN) 2054. Such LAN and WAN networking environments are commonplace in offices and companies and facilitate enterprise-wide computer networks, such as intranets, all of which may connect to a worldwide computer network, for example, the Internet.

When used in a LAN networking environment, computer 2002 is connected to local network 2052 through a wired and/or wireless communication network interface or adapter 2056. Adapter 2056 may facilitate wired or wireless communication to LAN 2052, which may also include a wireless access point installed thereon for communicating with wireless adapter 2056. When used in a WAN networking environment, computer 2002 may include a modem 2058, may be connected to a communication server on WAN 2054, or may have other means for establishing communications over WAN 2054, such as by way of the Internet. Modem 2058, which may be an internal or external and a wired or wireless device, is connected to system bus 2008 through serial port interface 2042. In a networked environment, the program modules described for computer 2002, or portions thereof, may be stored in remote memory/storage device 2050. It will be appreciated that the network connections shown are exemplary and that other means of establishing a communication link between the computers may be used.

Computer 2002 is operable to communicate with any wireless device or entity operatively disposed in wireless communication, for example, a printer, a scanner, a desktop and/or portable computer, a portable data assistant (PDA), a communications satellite, any piece of equipment or location associated with a wirelessly detectable tag, and a telephone. This includes at least Wi-Fi and Bluetooth wireless technologies. Thus, the communication may be a predefined structure, as with a conventional network, or simply an ad hoc communication between at least two devices.

It is to be understood that the specific order or hierarchy of steps in the processes presented is one example of exemplary approaches. Based on design priorities, it is to be understood that the specific order or hierarchy of steps in the processes may be rearranged within the scope of the present disclosure. The method claims of the present disclosure present elements of the various steps in a sample order and are not meant to be limited to the specific order or hierarchy presented.

Claims

A method for generating a prediction result using artificial intelligence technology, performed by a computing device, the method comprising:
obtaining first data corresponding to CDR3α of a T cell receptor (TCR) and second data corresponding to CDR3β of the TCR, wherein the first data includes an amino acid sequence corresponding to the CDR3α, and the second data includes an amino acid sequence corresponding to the CDR3β;
receiving the first data and the second data and determining, using an artificial intelligence-based first model, whether the first data including the CDR3α and the second data including the CDR3β correspond to a precedence relationship; and
storing a combination of the first data and the second data in a CDR set candidate list when a result indicating that the first data and the second data correspond to the precedence relationship is output,
wherein the first model corresponds to a model pretrained based on:
generating a first negative dataset by randomly combining a first dataset corresponding to CDR3α or a second dataset corresponding to CDR3β of TCR;
identifying, in the randomly combined first negative dataset, CDR3α and CDR3β identified as binding to each other;
generating a second negative dataset by excluding the combinations of CDR3α and CDR3β identified as binding from the first negative dataset; and
including the combinations of CDR3α and CDR3β identified as binding in a first positive dataset.
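By way of non-limiting illustration, the dataset construction recited in claim 1 may be sketched in Python as follows. The function name `build_training_sets`, the `known_binding` lookup set, the `n_samples` and `seed` parameters, and any sequences passed to the function are assumptions introduced for illustration only; the claims do not specify how binding pairs are identified or how many random combinations are drawn.

```python
import random
from itertools import product


def build_training_sets(alphas, betas, known_binding, n_samples, seed=0):
    """Return (second_negative, first_positive) lists of (CDR3a, CDR3b) pairs.

    known_binding is a hypothetical set of pairs already identified as
    binding to each other; how it is obtained is outside this sketch.
    """
    rng = random.Random(seed)

    # First negative dataset: random combinations of CDR3a and CDR3b.
    all_pairs = list(product(alphas, betas))
    first_negative = rng.sample(all_pairs, min(n_samples, len(all_pairs)))

    # Pairs in the random set that are identified as binding to each other.
    binding = [pair for pair in first_negative if pair in known_binding]

    # Second negative dataset: the first negative dataset minus binding pairs.
    second_negative = [pair for pair in first_negative if pair not in known_binding]

    # First positive dataset: the binding pairs found above.
    first_positive = list(binding)
    return second_negative, first_positive
```

In this sketch a randomly drawn pair ends up in exactly one of the two returned lists, so a pair known to bind is never used as a negative example.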

The method according to claim 1, wherein
the first data corresponding to CDR3α of the TCR is generated by a second model different from the first model, the second model being an artificial intelligence-based or rule-based model pretrained to receive a peptide-MHC complex (pMHC) as an input and output the first data including the CDR3α corresponding to the pMHC, and
the second data corresponding to CDR3β of the TCR is generated by a third model different from the first model, the third model being an artificial intelligence-based or rule-based model pretrained to receive the pMHC as an input and output the second data including the CDR3β corresponding to the pMHC.

The method according to claim 1, wherein
the first data corresponding to CDR3α of the TCR and the second data corresponding to CDR3β of the TCR are generated by a fourth model different from the first model,
the fourth model includes an encoder and a decoder, and
in the fourth model, a first input dataset including amino acid sequences corresponding to a peptide and an MHC is input to the encoder, amino acid sequences corresponding to CDR3α and CDR3β related to the peptide and the MHC are input to the decoder, and a prediction result including the first data and the second data is output from the decoder.

(Deleted)

The method according to claim 1, wherein
the combinations of CDR3α and CDR3β included in the second negative dataset are labeled, in the training process of the first model, as ground-truth data that do not correspond to the precedence relationship, and
the combinations of CDR3α and CDR3β included in the first positive dataset are labeled, in the training process of the first model, as ground-truth data that correspond to the precedence relationship.
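The labeling recited above may be illustrated, purely as a sketch, by the following Python function. The name `label_pairs` and the numeric encoding (1 for pairs that correspond to the precedence relationship, 0 for pairs that do not) are assumptions for illustration; the claim requires only that the two datasets receive opposite ground-truth labels.

```python
def label_pairs(second_negative, first_positive):
    """Attach ground-truth labels to (CDR3a, CDR3b) pairs.

    Label 0: pair does not correspond to the precedence relationship.
    Label 1: pair corresponds to the precedence relationship.
    The 0/1 encoding is an assumption made for this sketch.
    """
    labeled = [(alpha, beta, 0) for alpha, beta in second_negative]
    labeled += [(alpha, beta, 1) for alpha, beta in first_positive]
    return labeled
```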

The method according to claim 1, wherein
the first model is further pretrained based on:
identifying, among the combinations of CDR3α and CDR3β included in the second negative dataset, at least one similar combination whose similarity to a combination of CDR3α and CDR3β included in the first positive dataset is equal to or greater than a predetermined threshold; and
including the identified similar combination in the first positive dataset.
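A minimal sketch of the similarity-based step recited above is given below. The claim does not specify the similarity measure, so this example assumes, for illustration only, the mean of the `difflib.SequenceMatcher` ratios of the two chains; the names `pair_similarity` and `promote_similar`, the default threshold of 0.9, and the removal of promoted pairs from the negative list are likewise assumptions.

```python
from difflib import SequenceMatcher


def pair_similarity(pair_a, pair_b):
    """Mean of the alpha-to-alpha and beta-to-beta sequence ratios (assumed metric)."""
    sim_alpha = SequenceMatcher(None, pair_a[0], pair_b[0]).ratio()
    sim_beta = SequenceMatcher(None, pair_a[1], pair_b[1]).ratio()
    return (sim_alpha + sim_beta) / 2


def promote_similar(second_negative, first_positive, threshold=0.9):
    """Move negative pairs that closely resemble a positive pair into the positives.

    Returns (remaining_negative, extended_positive). Dropping promoted pairs
    from the negative list is an assumption; the claim only recites adding
    them to the first positive dataset.
    """
    promoted = [
        neg for neg in second_negative
        if any(pair_similarity(neg, pos) >= threshold for pos in first_positive)
    ]
    remaining = [neg for neg in second_negative if neg not in promoted]
    return remaining, list(first_positive) + promoted
```

Any other sequence similarity measure could be substituted for `pair_similarity` without changing the surrounding logic.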

The method according to claim 1, wherein
the first model includes a language model.

The method according to claim 7, wherein
the language model is
a Recurrent Neural Network (RNN), a Long Short-Term Memory (LSTM) network, a Bidirectional Long Short-Term Memory (BiLSTM) network, a diffusion model, a Generative Pre-trained Transformer (GPT), Bidirectional Encoder Representations from Transformers (BERT), SpanBERT, a Gated Recurrent Unit (GRU), or a Bidirectional Gated Recurrent Unit (BiGRU).

The method according to claim 1, wherein
the CDR set candidate list refers to combinations of a TCR α chain and a TCR β chain included in a TCR capable of binding to a pMHC, and
the combination of the first data and the second data stored in the CDR set candidate list is used for TCR-T generation.
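For illustration only, the way the CDR set candidate list of claims 1 and 9 is populated may be sketched as below. The trained first model is represented by a caller-supplied predicate `is_paired`, which is a hypothetical stand-in: the actual model, its inputs, and its decision rule are not reproduced here.

```python
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]


def collect_candidates(
    alphas: Iterable[str],
    betas: Iterable[str],
    is_paired: Callable[[str, str], bool],
) -> List[Pair]:
    """Store every (CDR3a, CDR3b) combination the predicate accepts.

    is_paired stands in for the first model's output indicating that the
    first data and the second data correspond to the precedence relationship.
    """
    betas = list(betas)  # allow repeated iteration over the beta sequences
    candidates: List[Pair] = []
    for alpha in alphas:
        for beta in betas:
            if is_paired(alpha, beta):
                candidates.append((alpha, beta))
    return candidates
```

Combinations rejected by the predicate are simply not stored, so the returned list contains only pairs eligible for downstream use such as TCR-T generation.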

A computer program stored on a computer-readable storage medium, which, when executed by a computing device, causes the computing device to perform operations for generating a prediction result using artificial intelligence technology, the operations comprising:
obtaining first data corresponding to CDR3α of a TCR and second data corresponding to CDR3β of the TCR, wherein the first data includes an amino acid sequence corresponding to the CDR3α, and the second data includes an amino acid sequence corresponding to the CDR3β;
receiving the first data and the second data and determining, using an artificial intelligence-based first model, whether the first data including the CDR3α and the second data including the CDR3β correspond to a precedence relationship; and
storing a combination of the first data and the second data in a CDR set candidate list when a result indicating that the first data and the second data correspond to the precedence relationship is output,
wherein the first model corresponds to a model pretrained based on:
generating a first negative dataset by randomly combining a first dataset corresponding to CDR3α or a second dataset corresponding to CDR3β of TCR;
identifying, in the randomly combined first negative dataset, CDR3α and CDR3β identified as binding to each other;
generating a second negative dataset by excluding the combinations of CDR3α and CDR3β identified as binding from the first negative dataset; and
including the combinations of CDR3α and CDR3β identified as binding in a first positive dataset.

A computing device comprising:
at least one processor; and
a memory,
wherein the at least one processor is configured to perform:
obtaining first data corresponding to CDR3α of a TCR and second data corresponding to CDR3β of the TCR, wherein the first data includes an amino acid sequence corresponding to the CDR3α, and the second data includes an amino acid sequence corresponding to the CDR3β;
receiving the first data and the second data and determining, using an artificial intelligence-based first model, whether the first data including the CDR3α and the second data including the CDR3β correspond to a precedence relationship; and
storing a combination of the first data and the second data in a CDR set candidate list when a result indicating that the first data and the second data correspond to the precedence relationship is output,
and wherein the first model corresponds to a model pretrained based on:
generating a first negative dataset by randomly combining a first dataset corresponding to CDR3α or a second dataset corresponding to CDR3β of TCR;
identifying, in the randomly combined first negative dataset, CDR3α and CDR3β identified as binding to each other;
generating a second negative dataset by excluding the combinations of CDR3α and CDR3β identified as binding from the first negative dataset; and
including the combinations of CDR3α and CDR3β identified as binding in a first positive dataset.