KR20130004203A

KR20130004203A - Combined biomarkers information processing method for lung cancer diagnosis

Info

Publication number: KR20130004203A
Application number: KR1020120134488A
Authority: KR
Inventors: 김철우; 박필제; 신용성; 김용대; 김정연; 오미애; 강경남
Original assignee: 주식회사 바이오인프라; 김용대; 김철우
Priority date: 2012-11-26
Filing date: 2012-11-26
Publication date: 2013-01-09

Abstract

PURPOSE: A method for using composite biomarker information for diagnosing lung cancer is provided to enhance diagnosis efficiency. CONSTITUTION: A method for using composite biomarker information for diagnosing lung cancer comprises: a step of acquiring information in which expression level of a first biomarker group of IGF-1 or RANTES and a second biomarker group of A1AT, CYFRA21-1, proApoA1, AFP, EGFR, PAI-1, TTR, CEA, CA19-9, or ApoA1 are measured from blood, plasma, serum or other material collected from the body(S41); a step of processing the information and inputting into a preset lung cancer model(S42); and a step of generating lung cancer determining information from the model(S43). [Reference numerals] (S41) Acquiring expression level or rate information for biomarker which forms a biomarker group measured from collected materials separated from the blood, plasma, serum, or other materials of a subject body; (S42) Processing the acquired expression level or rate information with a lung cancer diagnosis predicting module containing a preset lung cancer diagnosis predicting model; (S43) Generating at least one lung cancer diagnosis predicting information from the lung cancer diagnosis predicting module

Description

Combined Biomarkers Information Processing Method for Lung Cancer Diagnosis

The present invention relates to a method of using composite biomarker information for lung cancer diagnosis, and more particularly to a method that improves lung cancer diagnostic ability by using two or more biomarkers specific to lung cancer diagnosis in combination.

Lung cancer is a cancer arising in the lungs. It is a cancer typical of developed countries, with smoking and pollution as its main causes. From the beginning of the 20th century it increased rapidly in Western countries, and more than 1.3 million people worldwide now die of lung cancer every year, the largest share of all cancer deaths. In Korea, about 100,000 new cancer patients and about 50,000 cancer deaths are reported each year. The incidence of cancer has been rising in recent years, and cancer is currently the second leading cause of death among Korean adults. Lung cancer accounts for about 12% of cancers in Korean adults, ranking third in incidence after stomach cancer and liver cancer, and its incidence is increasing every year in both men and women. The incidence of lung cancer is markedly higher in men than in women, and the proportion of relatively young patients under 45 years of age is reported to be high. Moreover, at the time of diagnosis lung cancer has often already metastasized to other organs or, even without metastasis, is locally advanced. As a result, despite treatments such as curative resection, chemotherapy, and radiation therapy, recurrence and metastasis after treatment keep the 5-year survival rate at about 5%. Lung cancer is thus a tumor with a very low cure rate and ranks first in cancer mortality.

Lung cancer is divided into small cell lung cancer and non-small cell lung cancer (NSCLC). NSCLC is the most representative type, accounting for about 80% of lung cancers, and is further divided into adenocarcinoma, squamous cell carcinoma, and large cell carcinoma. Accurate diagnosis is important because the types of lung cancer differ not only in histological characteristics but also in prognosis and treatment. For NSCLC, the 10-year survival rate remains very low, at 10% or less, despite recent advances in cancer treatment. The reason is that most NSCLC is difficult to diagnose until it has reached an advanced stage.

At present, early diagnosis is the best way to increase a patient's chance of survival. Accordingly, various attempts have been made to diagnose lung cancer using biomarkers.

WO 2009/006323 A2; WO 2007/076439 A2; WO 2006/044946 A2

The first technical object of the present invention is to provide a composite biomarker for lung cancer diagnosis.

The second technical object of the present invention is to provide a method of using composite biomarker information for lung cancer diagnosis.

The third technical object of the present invention is to provide a lung cancer diagnostic kit that uses the composite biomarker.

To achieve these technical objects, the present invention provides a composite biomarker for lung cancer diagnosis comprising at least one biomarker selected from a first biomarker group consisting of the individual biomarkers IGF-1 and RANTES, and at least one biomarker selected from a second biomarker group consisting of the individual biomarkers A1AT, CYFRA21-1, proApoA1, AFP, EGFR, PAI-1, TTR, CEA, CA19-9, ApoA1, and ApoA1/proApoA1.

Preferably, the biomarkers selected from the first biomarker group are IGF-1 and RANTES.

Preferably, the selected biomarkers include at least one of A1AT and CYFRA21-1.

Preferably, the biomarker selected from the second biomarker group is at least one of AFP, CA19-9, CYFRA21-1, A1AT, and PAI-1.

Preferably, the biomarkers selected from the second biomarker group are any two or more of A1AT, CYFRA21-1, proApoA1, AFP, EGFR, PAI-1, TTR, CEA, CA19-9, ApoA1/proApoA1, and ApoA1.
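The selection rule described above (at least one marker from the first group plus at least one from the second group) can be illustrated with a short Python sketch. The function names and the optional `max_second` cap are assumptions introduced only for illustration; the source does not describe how panels are enumerated in software.

```python
from itertools import combinations

# Marker names are taken from the text; everything else here is illustrative.
FIRST_GROUP = ["IGF-1", "RANTES"]
SECOND_GROUP = ["A1AT", "CYFRA21-1", "proApoA1", "AFP", "EGFR", "PAI-1",
                "TTR", "CEA", "CA19-9", "ApoA1", "ApoA1/proApoA1"]


def nonempty_subsets(items, max_size=None):
    """Yield every non-empty subset of items, optionally capped in size."""
    limit = len(items) if max_size is None else min(max_size, len(items))
    for size in range(1, limit + 1):
        yield from combinations(items, size)


def candidate_panels(max_second=None):
    """Yield panels with >=1 first-group and >=1 second-group marker."""
    for first in nonempty_subsets(FIRST_GROUP):
        for second in nonempty_subsets(SECOND_GROUP, max_second):
            yield first + second


# Hypothetical usage: restrict the second group to at most two markers.
panels = list(candidate_panels(max_second=2))
print(len(panels), panels[0])
```

Capping the second-group size keeps the enumeration small; without the cap, every non-empty subset of both groups is generated.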

To achieve these technical objects, the present invention also provides a method of using composite biomarker information for lung cancer diagnosis in a lung cancer diagnosis system, in which the lung cancer diagnosis system performs the steps of: (A) obtaining measurement information on the per-biomarker expression level of at least one biomarker selected from a first biomarker group consisting of the individual biomarkers IGF-1 and RANTES, and on the per-biomarker expression level of a second biomarker group consisting of the individual biomarkers A1AT, CYFRA21-1, proApoA1, AFP, EGFR, PAI-1, TTR, CEA, CA19-9, and ApoA1, as measured from blood, plasma, serum, or other collected material separated from the body of a subject undergoing lung cancer diagnosis; (B) processing the per-biomarker expression level information of the first biomarker group and of the second biomarker group and inputting it into a preset lung cancer determination model; and (C) generating lung cancer determination information from the lung cancer determination model.

Preferably, processing the per-biomarker expression level information in step (B) comprises, when expression level information for both ApoA1 and proApoA1 is available in the second biomarker group, generating the ratio of the ApoA1 expression level to the proApoA1 expression level (ApoA1/proApoA1), and at least one of the ApoA1 expression level, the proApoA1 expression level, and this ratio is input into the lung cancer determination model.
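A minimal sketch of this ratio step, assuming expression levels arrive as a plain mapping from marker name to measured value. The function name, the dictionary layout, and the example values are hypothetical and not taken from the source.

```python
def add_apoa1_ratio(levels):
    """Return a copy of levels with the ApoA1/proApoA1 ratio added.

    The ratio is added only when both markers are present and the
    denominator is non-zero; otherwise the input is returned unchanged.
    """
    out = dict(levels)
    apoa1 = out.get("ApoA1")
    pro_apoa1 = out.get("proApoA1")
    if apoa1 is not None and pro_apoa1 is not None and pro_apoa1 != 0:
        out["ApoA1/proApoA1"] = apoa1 / pro_apoa1
    return out


# Hypothetical measurements in arbitrary units.
sample = {"ApoA1": 120.0, "proApoA1": 4.0, "RANTES": 35.0}
print(add_apoa1_ratio(sample))
```

Leaving the ratio out when either value is missing lets the later model input step decide which of the three quantities to use.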

Preferably, processing the per-biomarker expression level information comprises generating transformed per-biomarker expression level information by using the partial dependency plot, or the partial dependency function relationship, of an ensemble method based on decision trees.
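One way to picture a partial dependency transformation is the following sketch. It assumes that the ensemble is available as an arbitrary `predict` callable, that the partial dependency of a marker is the model output averaged over the samples with that marker forced to a grid value, and that a raw value is transformed by linear interpolation on the resulting curve. These choices, the function names, and the toy data are assumptions for illustration, not the procedure stated in the source.

```python
from bisect import bisect_left


def partial_dependence(predict, rows, feature, grid):
    """Average predict() over rows with `feature` forced to each grid value."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = dict(row)
            modified[feature] = value
            total += predict(modified)
        curve.append(total / len(rows))
    return curve


def transform(value, grid, curve):
    """Map a raw value onto the curve by linear interpolation.

    `grid` must be sorted in strictly ascending order; values outside
    the grid are clamped to the nearest end of the curve.
    """
    if value <= grid[0]:
        return curve[0]
    if value >= grid[-1]:
        return curve[-1]
    hi = bisect_left(grid, value)
    lo = hi - 1
    weight = (value - grid[lo]) / (grid[hi] - grid[lo])
    return curve[lo] + weight * (curve[hi] - curve[lo])


# Toy stand-in for a tree ensemble (hypothetical, not a fitted model).
def toy_predict(row):
    return (1.0 if row["RANTES"] > 50 else 0.0) + row["A1AT"]


rows = [{"RANTES": 10, "A1AT": 1.0}, {"RANTES": 90, "A1AT": 3.0}]
grid = [0, 50, 100]
curve = partial_dependence(toy_predict, rows, "RANTES", grid)
print(curve, transform(75, grid, curve))
```

The transformed value can then replace the raw expression level as the model input, which lets a linear model pick up a non-linear relationship learned by the ensemble.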

Preferably, the lung cancer determination model is a logistic regression model.

Preferably, the logistic regression model uses a ridge penalty.
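A ridge-penalized logistic regression can be sketched in a few lines of plain Python. The sketch assumes the objective is the mean log-loss plus (lam / 2) times the squared norm of the weights, with an unpenalized intercept, fitted by batch gradient descent. The optimizer, the hyperparameters `lam`, `lr`, and `epochs`, and the toy data are illustrative assumptions; the source does not specify how the model is fitted.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_ridge_logistic(X, y, lam=0.1, lr=0.1, epochs=2000):
    """Fit weights w and intercept b by batch gradient descent."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * p
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err
            for j in range(p):
                grad_w[j] += err * xi[j]
        # Mean log-loss gradient plus the ridge term lam * w.
        w = [wj - lr * (gj / n + lam * wj) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_proba(w, b, x):
    """Predicted probability of the positive (cancer) class."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))


# Toy one-marker data set (hypothetical values).
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]
w_small, b_small = fit_ridge_logistic(X, y, lam=0.01)
w_large, b_large = fit_ridge_logistic(X, y, lam=1.0)
print(w_small, w_large)
```

A larger `lam` shrinks the coefficients toward zero, which is the usual motivation for a ridge penalty when several correlated markers enter the model together.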

To achieve these technical objects, the present invention further provides a lung cancer diagnostic kit comprising antibodies that specifically bind to at least one protein selected from a first biomarker group consisting of the individual biomarkers IGF-1 and RANTES and to at least one protein selected from a second biomarker group consisting of the individual biomarkers A1AT, CYFRA21-1, proApoA1, AFP, EGFR, PAI-1, TTR, CEA, CA19-9, and ApoA1.

Preferably, the proteins selected from the first biomarker group are IGF-1 and RANTES.

Preferably, the proteins selected from the second biomarker group include at least one of A1AT and CYFRA21-1.

Preferably, the protein selected from the second biomarker group is at least one of AFP, CA19-9, CYFRA21-1, A1AT, and PAI-1. Preferably, the proteins selected from the second biomarker group are any two or more of A1AT, CYFRA21-1, proApoA1, AFP, EGFR, PAI-1, TTR, CEA, CA19-9, and ApoA1.

Preferably, the lung cancer diagnostic kit is also used for lung cancer monitoring and lung cancer screening.

With the present invention, a composite biomarker with higher lung cancer diagnostic ability than a single biomarker can be constructed. This increases the efficiency of lung cancer diagnostic kits and of lung cancer diagnostic methods that use such kits, can improve the survival rate of lung cancer patients through efficient diagnosis, and makes it possible to monitor a patient's response to treatment and change the treatment according to the result. The invention can also be used to identify compounds that modulate the expression of one or more of the biomarkers in vivo or ex vivo in animal models such as mice and rats.

FIG. 1 is a flowchart of one embodiment of a method for selecting composite biomarkers effective for lung cancer diagnosis from candidate lung cancer diagnostic biomarkers.
FIG. 2 is a flowchart of one embodiment of a method for generating a random forest model for composite biomarker candidates for lung cancer diagnosis.
FIG. 3 is a conceptual diagram of one embodiment of a method for generating a decision tree using a plurality of biomarkers.
FIG. 4 illustrates one embodiment of a method for generating an ROC curve as an evaluation metric.
FIG. 5 illustrates one embodiment of a partial dependency plot of RANTES.
FIG. 6 illustrates one embodiment of a boxplot of RANTES for cancer patients and normal subjects.
FIG. 7 illustrates one embodiment of a partial dependency plot of Cyfra21.1.
FIG. 8 illustrates one embodiment of a boxplot of Cyfra21.1 for cancer patients and normal subjects.
FIG. 9 illustrates one embodiment of a partial dependency plot of A1AT.
FIG. 10 illustrates one embodiment of a boxplot of A1AT for cancer patients and normal subjects.
FIG. 11 illustrates one embodiment of the CP (Coefficient Plot) of the present invention.
FIG. 12 illustrates one embodiment of a method for selecting a composite biomarker combination consisting of two or more biomarkers.
FIG. 13 illustrates another embodiment of a method for selecting a composite biomarker combination.
FIG. 14 illustrates one embodiment of the configuration of the lung cancer diagnosis system of the present invention and its connections to other information providers.
FIG. 15 illustrates one embodiment of a method by which the lung cancer diagnosis system of the present invention generates lung cancer diagnostic information.
FIG. 16 illustrates one embodiment of a method by which the partial dependency plot/function relationship generation unit of the transformation module of the lung cancer diagnosis system generates transformed variable values, and of a method by which the lung cancer diagnosis system uses the generated transformed variable values.
FIG. 17 illustrates one embodiment of a method by which the CP information generation unit of the lung cancer diagnosis system generates CP information.
FIG. 18 is a boxplot of normal samples and cancer samples for each biomarker constituting the composite biomarker group of the present invention.

Hereinafter, the invention is described in detail with reference to the drawings.

FIG. 1 is a flowchart of one embodiment of a method for selecting biomarkers effective for lung cancer diagnosis from candidate lung cancer diagnostic biomarkers. The method first generates per-sample variable values for the lung cancer biomarker candidates (S11), then selects from the candidates the biomarker group to be input into the lung cancer prediction model (S12), then generates composite biomarker combinations from the selected biomarker group (S13), and finally selects, from the generated combinations, the composite biomarker combinations with excellent lung cancer diagnostic ability (S14). Each step is described in detail below.
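Step S14 compares candidate combinations by diagnostic ability, and FIG. 4 names the ROC curve as an evaluation metric. The sketch below assumes the area under the ROC curve (AUC) is the comparison score and computes it as the probability that a cancer sample receives a higher model score than a normal sample. Using AUC for the ranking, the function names, and the toy scores are assumptions for illustration.

```python
def roc_auc(scores, labels):
    """AUC via pairwise comparison; label 1 = cancer, 0 = normal.

    Ties between a cancer score and a normal score count as 0.5.
    """
    pos = [s for s, lab in zip(scores, labels) if lab == 1]
    neg = [s for s, lab in zip(scores, labels) if lab == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def rank_panels(panel_scores, labels):
    """Return (panel, AUC) pairs sorted from highest to lowest AUC."""
    aucs = {name: roc_auc(s, labels) for name, s in panel_scores.items()}
    return sorted(aucs.items(), key=lambda item: item[1], reverse=True)


# Hypothetical model scores for two candidate panels on four samples.
labels = [1, 1, 0, 0]
panel_scores = {
    "RANTES+A1AT": [0.9, 0.2, 0.3, 0.1],
    "RANTES+CYFRA21-1": [0.9, 0.8, 0.3, 0.1],
}
ranking = rank_panels(panel_scores, labels)
print(ranking)
```

The pairwise form is quadratic in the number of samples, which is acceptable for cohorts of a few hundred subjects such as the one described in this document.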

To find candidates for the composite biomarker, biomarkers that are effective for lung cancer diagnosis must first be selected. To this end, serum samples were obtained from normal subjects and lung cancer patients, and protein expression levels in both groups were measured with RBM kits, Millipore kits, and kits produced by the group to which the present inventors belong, each according to its own protocol. The measurement results were compiled into a data set. The experiments covered 128 normal subjects (78 men, 50 women) and 121 lung cancer patients (78 men, 43 women). The normal subjects were 41 to 65 years old (mean 50.3, median 48), and the lung cancer patients were 35 to 86 years old (mean 64.7, median 66). By stage, the lung cancer patients comprised 83 at stage 1, 14 at stage 2, 21 at stage 3, and 3 at stage 4. Separately from these subjects, a blind test on 37 normal subjects (16 men, 21 women) and 25 lung cancer patients (10 men, 15 women) was used to validate the classification model.

From each normal subject or lung cancer patient, 5 ml of peripheral blood was collected in a Vacutainer SST II tube (Becton Dickinson), left at room temperature for one hour, and centrifuged at 3000 g for 5 minutes. The supernatant was taken as serum and stored at -80 °C until use.

To analyze 30 proteins, the present inventors purchased kits or antibodies from several manufacturers or had antibodies custom-made. The 30 proteins were A1AT (alpha-1-antitrypsin), A2M (alpha-2 macroglobulin), DD (D-dimer), PAI-1 (total plasminogen activator inhibitor-1), VN (vitronectin), ApoA4 (apolipoprotein-A4), Hemo (hemoglobin), proApoA1 (proapolipoprotein-A1), VDBP (vitamin D-binding protein), ApoA2 (apolipoprotein-A2), ApoC2 (apolipoprotein-C2), ApoC3 (apolipoprotein-C3), sICAM-1 (soluble intercellular adhesion molecule-1), sVCAM-1 (soluble vascular cell adhesion molecule-1), IL-6 (interleukin-6), RANTES (regulated upon activation, normal T cell expressed and secreted), AFP (alpha-fetoprotein), CA125 (cancer antigen 125), CA19-9 (carbohydrate antigen 19-9), CEA (carcinoembryonic antigen), f-PSA (prostate specific antigen, free), PSA (prostate specific antigen, total), CYFRA21-1 (cytokeratin 19 fragment antigen 21-1), EGFR (epidermal growth factor receptor), IGF-1 (insulin-like growth factor-1, free), ApoA1 (apolipoprotein-A1), B2M (beta-2 microglobulin), CRP (C-reactive protein), Hp (haptoglobin), and TTR (transthyretin).
The sources of the antibodies, kits, standards, and reagents are listed in Tables 1 to 3 below.

Table 1

| Biomarker | Standard manufacturer | Corresponding antibody manufacturer 1 | Corresponding antibody manufacturer 2 |
|---|---|---|---|
| A1AT | Sigma | Acris | Biodesign |
| A2M | Calbiochem | R&D | Affinity BioReagents |
| DD | Abcam | Biodesign | Biodesign |
| PAI-1 | Calbiochem | Abcam | USBiological |
| VN | Biodesign | Biodesign | Chemicon |
| ApoA4 | BIOINFRA | Santa Cruz | AB Frontier (custom-made) |
| Hemo | Sigma | Biodesign | Bethyl |
| proApoA1 | BIOINFRA | Biodesign | Biodesign or Genscript (custom-made) |
| VDBP | Biodesign | Abcam | Abcam |

Table 2

| Biomarker | Product name | Manufacturer |
|---|---|---|
| ApoA2 | MILLIPLEX Kit Human Apolipoprotein | Millipore |
| ApoC2 | MILLIPLEX Kit Human Apolipoprotein | Millipore |
| ApoC3 | MILLIPLEX Kit Human Apolipoprotein | Millipore |
| sICAM-1 | MILLIPLEX Kit Human Cardiovascular Disease panel 1 | Millipore |
| sVCAM-1 | MILLIPLEX Kit Human Cardiovascular Disease panel 1 | Millipore |
| IL-6 | MILLIPLEX Kit Human Cytokine/Chemokine 2 | Millipore |
| RANTES | MILLIPLEX Kit Human Cytokine/Chemokine 1 | Millipore |
| AFP | RBM Cancer Antigen Panel 1 | RBM |
| CA125 | RBM Cancer Antigen Panel 1 | RBM |
| CA19-9 | RBM Cancer Antigen Panel 1 | RBM |
| CEA | RBM Cancer Antigen Panel 1 | RBM |
| f-PSA | RBM Cancer Antigen Panel 1 | RBM |
| PSA | RBM Cancer Antigen Panel 1 | RBM |
| CYFRA21-1 | TM-CYFRA21.1 ELISA kit | DRG Diagnostics |
| EGFR | DuoSet IC ELISA | R&D |
| IGF-1 | DuoSet IC ELISA | R&D |

Table 3

| Biomarker | Main reagent | Standard material | Manufacturer |
|---|---|---|---|
| ApoA1 | N Antiserum to human Apolipoprotein | N Apolipoprotein standard SL | Siemens |
| B2M | N Latex beta2-microglobulin | N Protein standard SL | Siemens |
| CRP | CardioPhase hsCRP | N Rheumatology standard SL | Siemens |
| Hp | N Antiserum to human Haptoglobin (SMN 10446304) | N Protein standard SL | Siemens |
| TTR | N Antiserum to human PreAlbumin | N Protein standard SL | Siemens |

The standard proteins were obtained as follows. ApoA2, ApoC2, ApoC3, sICAM-1, sVCAM-1, IL-6, and RANTES were those included in the Millipore kits; AFP, CA125, CA19-9, CEA, f-PSA, and PSA were those included in the RBM kit; CYFRA21-1 was that included in the DRG Diagnostics kit; and EGFR and IGF-1 were those included in the R&D kits. ApoA1, B2M, CRP, Hp, and TTR were purchased from Siemens; A1AT and Hemo from Sigma; A2M and PAI-1 from Calbiochem; DD from Abcam; and VN and VDBP from Biodesign. ApoA4 and proApoA1 were produced by BioInfra (Korea).

Where necessary, antibody-coupled microspheres were prepared as follows. First, the microsphere stock solution (Hitachi, Japan) was vortexed and then suspended for 20 seconds in a sonication bath (Sonicor Instrument Corporation, USA). Then 2 × 10⁶ microspheres were transferred to a microtube, the supernatant was removed by centrifugation, and the microspheres were washed with 100 µl of triple-distilled water and resuspended in 80 µl of 0.1 M sodium phosphate buffer (pH 6.2). Next, 10 µl each of 50 mg/ml N-hydroxysulfosuccinimide (Sulfo-NHS) and 50 mg/ml 1-ethyl-3-(3-dimethylaminopropyl)carbodiimide hydrochloride (Pierce, USA) were added in turn and mixed at room temperature for 20 minutes. The supernatant was removed by centrifugation, and the microspheres were washed twice with 50 mM MES, pH 5.0.

Next, the carboxyl-activated microspheres were resuspended in 400 µl of 50 mM MES, 100 µl of 50 mM MES containing 25 µg of the antibody to be coupled was added, and the mixture was mixed at room temperature for two hours. This reaction was carried out in the dark. After the coupling reaction, the microspheres were washed twice by centrifugation with 500 µl of PBS-TBN (PBS, 1% BSA, 0.02% Tween-20, 0.05% sodium azide) and counted with a hemocytometer. The antibody-coupled microspheres were stored in the dark at 4 °C at a concentration of 1 × 10⁶ microspheres per 500 µl of PBS-TBN.

Next, to measure the antibody coupling efficiency of the microspheres prepared above, the antibody-coupled microspheres were vortexed and sonicated for 20 seconds, and 2,000 microspheres per well were added to a filter-bottom 96-well microplate. A PE (phycoerythrin)-conjugated secondary antibody (anti-antibody antibody-PE conjugate; Jackson Immunoresearch, USA) matched to the species of the antibody coupled to the microspheres was diluted 1/10 in 2% BSA/PBS, added at 50 µl/well, and mixed at room temperature for 30 minutes. The reaction was carried out in the dark, protected from light. After the reaction, the wells were washed twice with PBST and read on a Luminex™ 200 (Luminex, USA), which confirmed that the MFI value was 10,000 or higher.

Biotinylated antibodies were used as detection antibodies. Specifically, the biotinylation reaction was performed with the EZ-Link Sulfo-NHS-Biotinylation kit (Pierce, USA) according to the manufacturer's method, and the degree of biotin coupling was confirmed with the HABA (4'-hydroxyazobenzene-2-carboxylic acid) included in the kit, following the kit manufacturer's instructions. As a result, 8 to 12 biotin molecules were found to be bound per antibody.

Next, the concentration of the detection antibody and the assay reaction time were further optimized for the developed assay, and sensitivity was confirmed from the assay readings of serially diluted biomarkers. Intra-assay variability was assessed by calculating the CV (coefficient of variation) from measurements of serum samples at nine different concentrations, run at 12 wells per plate on two plates at three different times. The CV ranged from 5% to 15%, with a mean of 10%. The developed kit was confirmed to show no cross-reactivity.
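The coefficient of variation used for intra-assay variability is the standard deviation of replicate readings divided by their mean, expressed as a percentage. A minimal sketch follows, assuming the sample standard deviation is used; the source does not say which estimator was applied, and the replicate values are hypothetical.

```python
from statistics import mean, stdev


def cv_percent(values):
    """Coefficient of variation in percent: sample SD / mean * 100."""
    m = mean(values)
    if m == 0:
        raise ValueError("CV is undefined for a zero mean")
    return stdev(values) / m * 100.0


# Hypothetical replicate readings of one serum sample.
replicates = [90.0, 100.0, 110.0]
print(cv_percent(replicates))
```

Computing the CV per concentration level and then averaging across levels gives a summary figure comparable to the mean CV reported in the text.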

Immunoassays for AFP, CA125, CA19-9, CEA, f-PSA, and PSA were performed in 96-well V-bottom microplates according to RBM's protocol. The standard proteins provided by the manufacturer were serially diluted in serum matrix diluent before use. Specifically, 20 µl each of standard protein (in duplicate), control serum (in duplicate), and patient serum was added to the wells, and 10 µl each of the blocking buffer and the bead mixture included in the kit was added, mixed, and reacted at room temperature for one hour. The detection antibody and streptavidin-PE (Jackson Immunoresearch, USA) were then reacted sequentially for one hour and 30 minutes, respectively. The reaction mixture was transferred to a filter-bottom 96-well microplate (Millipore, USA) and washed twice using a vacuum manifold. After 100 µl of the assay buffer included in the kit was added, the reaction mixture was transferred to a 96-well microplate and analyzed on a Luminex™ 200 (Luminex, USA). The results were analyzed by 5-parametric curve fitting using BeadView software (Upstate, USA).
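A five-parameter curve of the kind mentioned here relates the concentration of a standard to the measured signal, and its inverse converts a sample's signal back into a concentration. The sketch below assumes the widely used five-parameter logistic form y = d + (a - d) / (1 + (x / c)^b)^g. The parametrization used by the analysis software is not stated in the source, and all parameter values below are hypothetical.

```python
def five_pl(x, a, b, c, d, g):
    """Five-parameter logistic: signal y for concentration x.

    a: signal at zero concentration, d: signal at infinite concentration,
    c: mid-range concentration scale, b: slope, g: asymmetry.
    """
    return d + (a - d) / (1.0 + (x / c) ** b) ** g


def inverse_five_pl(y, a, b, c, d, g):
    """Concentration x for a signal y strictly between a and d."""
    ratio = (a - d) / (y - d)
    return c * (ratio ** (1.0 / g) - 1.0) ** (1.0 / b)


# Hypothetical fitted parameters for one analyte.
params = dict(a=50.0, b=1.2, c=100.0, d=20000.0, g=0.8)
signal = five_pl(250.0, **params)
print(signal, inverse_five_pl(signal, **params))
```

In practice the five parameters are estimated from the serially diluted standards, and only signals inside the fitted range are converted back to concentrations.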

Immunoassays for ApoA2, ApoC2, ApoC3, sICAM-1, sVCAM-1, IL-6 and RANTES were performed in filter-bottom 96-well microplates (Millipore, USA) according to Millipore's protocol. The filter-bottom 96-well microplate was treated with the assay buffer provided in the kit and blocked for 10 minutes, after which the buffer was removed using a vacuum manifold. The standard proteins provided by the manufacturer were serially diluted with serum matrix diluent before use. Specifically, 25 μl each of standard protein (in duplicate), control serum (in duplicate) and patient serum was added to the wells, 25 μl of bead mixture was added to each well, and the plate was reacted at room temperature for one hour. The reaction plate was washed twice using a vacuum manifold, and the detection antibody and streptavidin-PE were then reacted sequentially for one hour and 30 minutes, respectively. After the reaction, the plate was washed, treated with 100 μl of the assay buffer provided in the kit, and analyzed on the Luminex™ 200. The results were analyzed by 5-parametric curve fitting using Upstate's BeadView software.

Immunoassays for A1AT, A2M, DD, PAI-1, VN, ApoA4, Hemo, proApoA1 and VDBP were performed in filter-bottom 96-well microplates (Millipore, USA) according to BioInfra's protocol. The filter-bottom 96-well microplate was treated with assay buffer (PBS/2% BSA) and blocked for 10 minutes, after which the buffer was removed using a vacuum manifold. The standard proteins provided by the manufacturer were serially diluted with serum matrix diluent before use. Specifically, 25 μl each of standard protein (in duplicate), control serum (in duplicate) and patient serum was added to the wells, 25 μl of bead mixture was added to each well, and the plate was reacted at room temperature for one hour. The reaction plate was washed twice using a vacuum manifold, and the detection antibody and streptavidin-PE were then reacted sequentially for one hour and 30 minutes, respectively. After the reaction, the plate was washed, treated with 100 μl of the assay buffer provided in the kit, and analyzed on the Luminex™ 200. The results were analyzed by 5-parametric curve fitting using Upstate's BeadView software.

ApoA1, B2M, CRP, Hp and TTR were analyzed by an automated method on the Behring Nephelometer II (BNII) System according to the manufacturer's instructions.

Cyfra21-1 was analyzed with a kit from DRG Diagnostics, and EGFR and IGF-1 with the DuoSet IC ELISA kit from R&D, in each case according to the instructions included in the kit.

Table 4 shows an example of the measurement data for each biomarker in each sample; in this way, per-sample variable values are generated for the lung cancer biomarker candidate group (S11). A variable value may be the expression level of a biomarker or the ratio between the expression levels of two or more biomarkers.

| Sample.ID | class | Age | Sex | Stage.S | ApoA2 | Svcam.1 | ... | PAI.1.1 |
|---|---|---|---|---|---|---|---|---|
| LC01 | Lung cancer | 53 | M | 1 | 5.359 | 2.738 | ... | 3.171 |
| LC04 | Lung cancer | 66 | M | 1 | 5.617 | 2.943 | ... | 2.950 |
| LC05 | Lung cancer | 60 | M | 3 | 5.385 | 2.914 | ... | 2.770 |
| LC07 | Lung cancer | 43 | F | 1 | 5.463 | 2.752 | ... | 2.743 |
| ... | ... | ... | ... | | ... | ... | ... | ... |
| KNF140 | Normal | 51 | F | | 5.600 | 2.936 | ... | 3.116 |
| KNM378 | Nor | 56 | M | | 5.443 | 2.923 | ... | 3.116 |
| KNF088 | Nor | 48 | F | | 5.458 | 2.967 | ... | 3.036 |
| KNM151 | Nor | 55 | M | | 5.542 | 3.077 | ... | 2.986 |

Sample.ID is the unique ID assigned to a sample at the time of the experiment and identifies the individual. class is the sample classification: Nor denotes a normal individual and Can a lung cancer patient. Age is the age, Sex is the sex, and Stage.S is the lung cancer stage (normal: blank; cancer: 1 to 4). The subsequent columns list the biomarkers tested, and their cell values are the experimental values of the biomarker candidates. The experimental values of the input data shown in Table 2 are log-transformed experimental values.

The measurement data thus constructed were analyzed using R, a bioinformatics and statistical analysis package (R Development Core Team (2007). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org.). A random forest algorithm was applied to the input data to determine variable importance, a p-value ranking was derived, and a correlation analysis between biomarkers was performed. The random forest ranking was then cross-checked against the p-value ranking, and where a higher-ranked biomarker was highly correlated with a lower-ranked one, the lower-ranked biomarker was excluded. In this way, a group of 13 biomarkers to be fed into the lung cancer prediction model was selected from the lung cancer biomarker candidate group (S12).
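The exclusion rule described above can be sketched as a greedy pass over an importance ranking. This is only an illustration under stated assumptions: the function name, the marker names M1 to M4, the correlation values and the 0.8 threshold are all hypothetical, since the text does not give the threshold or the intermediate rankings.

```python
def filter_correlated(ranked_markers, correlation, threshold=0.8):
    """Walk the ranking from best to worst and keep a marker only if it is
    not highly correlated with any higher-ranked marker already kept.

    ranked_markers: marker names, best first (e.g. by variable importance).
    correlation: dict mapping frozenset({a, b}) to a correlation coefficient;
                 missing pairs are treated as uncorrelated (an assumption).
    threshold: assumed cut-off for "highly correlated" (not from the source).
    """
    kept = []
    for marker in ranked_markers:
        if all(abs(correlation.get(frozenset((marker, k)), 0.0)) < threshold
               for k in kept):
            kept.append(marker)
    return kept

# Hypothetical ranking and pairwise correlations (illustrative only).
ranking = ["M1", "M2", "M3", "M4"]
corr = {
    frozenset(("M1", "M2")): 0.93,  # M2 duplicates information in M1
    frozenset(("M2", "M3")): 0.85,  # irrelevant once M2 has been dropped
    frozenset(("M1", "M4")): 0.20,
}
selected = filter_correlated(ranking, corr)
print(selected)
```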

The selected biomarkers are A1AT, CYFRA21-1, IGF-1, AFP, proApoA1, EGFR, CEA, RANTES, PAI-1, TTR, CA19-9, ApoA1/proApoA1 and ApoA1. Table 7 below shows the 13 selected biomarkers and the evaluation index values for each individual biomarker. ApoA1/proApoA1 is the expression level of ApoA1 divided by the expression level of proApoA1; it is an example of an expression-level ratio and shows that such a ratio can itself serve as a biomarker.

| Biomarker | Accuracy | Sensitivity | Specificity |
|---|---|---|---|
| A1AT | 0.8326 | 0.7731 | 0.8921 |
| CYFRA21-1 | 0.8525 | 0.8511 | 0.8538 |
| IGF-1 | 0.8035 | 0.8515 | 0.7556 |
| RANTES | 0.7644 | 0.7479 | 0.7809 |
| proApoA1 | 0.7575 | 0.6859 | 0.8291 |
| AFP | 0.7347 | 0.8082 | 0.6612 |
| EGFR | 0.7362 | 0.6895 | 0.7829 |
| PAI-1 | 0.7315 | 0.6928 | 0.7703 |
| TTR | 0.7156 | 0.698 | 0.7332 |
| CEA | 0.6869 | 0.7226 | 0.6512 |
| CA19-9 | 0.686 | 0.7705 | 0.6015 |
| ApoA1/proApoA1 | 0.6583 | 0.4849 | 0.8318 |
| ApoA1 | 0.6679 | 0.6364 | 0.6994 |

Accuracy is the proportion of cancer and normal samples classified correctly, sensitivity is the proportion of cancer patients judged to have cancer, and specificity is the proportion of normal individuals judged to be normal. How these evaluation indices (sensitivity, specificity and accuracy) are obtained is explained with an example, assuming cut-off = 0.5. Suppose the data are as shown in Table 6 below.

| Actual value (rows) / Predicted value (columns) | 0 (normal) | 1 (cancer) |
|---|---|---|
| 0 (normal) | 17 | 3 |
| 1 (cancer) | 0 | 20 |

With a test set of 40 samples (20 normal, 20 cancer), the cross table of actual and predicted values is as shown above. Of the samples whose actual value is 0 (normal), 17 were predicted as 0 (normal) and 3 as 1 (cancer). Of the samples whose actual value is 1 (cancer), 0 were predicted as 0 and 20 as 1. Sensitivity is the probability that an actual cancer patient is predicted to be a cancer patient; in the table above all 20 of the 20 patients were predicted as cancer, so sensitivity is 100%. Specificity is the probability that an actually normal individual is predicted to be normal; this is 17 of 20, or 85%. Accuracy is the proportion of cases in which the actual and predicted values agree, that is, the probability of predicting normal as normal and cancer as cancer over the whole set; 37 of the 40 samples were predicted correctly, so accuracy is 92.5%.
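The arithmetic of this worked example can be reproduced with a few lines of code. The sketch below uses the counts from the cross table above; the function and argument names are illustrative and not part of the original text.

```python
def evaluation_metrics(tn, fp, fn, tp):
    """Compute sensitivity, specificity and accuracy from confusion-matrix counts.

    tn: actual normal predicted normal    fp: actual normal predicted cancer
    fn: actual cancer predicted normal    tp: actual cancer predicted cancer
    """
    sensitivity = tp / (tp + fn)          # cancer patients judged as cancer
    specificity = tn / (tn + fp)          # normal individuals judged as normal
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return sensitivity, specificity, accuracy

# Counts taken from the 40-sample example in the text.
sens, spec, acc = evaluation_metrics(tn=17, fp=3, fn=0, tp=20)
print(f"sensitivity={sens:.1%} specificity={spec:.1%} accuracy={acc:.1%}")
```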

Although the present invention uses sensitivity, specificity and accuracy as evaluation indices, various other evaluation indices used in statistics or the social sciences may of course be used; the present invention naturally encompasses the adoption of such indices, and biomarkers may be selected by means of them. Ranking with respect to the selected evaluation indices may be based on any one of the evaluation indices, but may equally use a preset function that takes at least one evaluation index as input, or a preset importance function computed independently of the evaluation indices. Such a preset function or preset importance function is called an evaluation function, and the value it yields is called an evaluation function value.

FIG. 18 shows boxplots of the normal samples and cancer samples for the 13 biomarkers.

Table 7 below briefly summarizes the 13 selected biomarkers together with the expression pattern and characteristics of each. Expression patterns fall into two broad cases: those in which a higher measured expression level indicates a higher likelihood of cancer, and those in which a lower level indicates a higher likelihood of cancer. In Table 5 below, Can (high) corresponds to the former and Can (low) to the latter.

| Biomarker | Pattern | Characteristics |
|---|---|---|
| A1AT | Can (high) | A1AT is a glycoprotein known as an antagonist of serum trypsin. In the body it protects tissue from degradative enzymes (especially elastase) secreted by inflammatory cells, and it increases in the acute inflammatory phase. Its deficiency is associated with a congenital disease that leads to destruction of lung tissue. Hamrita et al. reported increased A1AT in invasive breast cancer. |
| AFP | Can (high) | In adults, most frequently elevated in germ cell tumors and liver cancer, but also elevated in gastric, colon, biliary, pancreatic and lung cancer (in ~20% of patients). |
| CA19-9 | Can (high) | Used clinically because it is elevated in most patients with pancreas, biliary tract, colon, stomach or breast carcinoma. |
| CEA | Can (high) | Elevated in the serum of patients with GI (gastrointestinal), lung, breast, ovary or uterus cancer. |
| CYFRA21.1 | Can (high) | CYFRA 21-1 (a cytokeratin 19 fragment) is known to be associated with non-small cell lung cancer; Lai et al. reported that, particularly in squamous cell carcinoma, it shows not only high blood levels but also an association with stage and prognosis. |
| EGFR | Can (low) | Receptor for EGF; involved in cell growth and differentiation. |
| IGF-1 | Can (high) | IGF-1 expression is increased in adenocarcinomas arising in various organs. Ouban et al. reported that it is well expressed in tissues of endometrial cancer (100%), breast cancer (87.5%), ovarian cancer (100%), gastric cancer (71.1%), pancreatic cancer (57.1%), lung cancer (90.0%) and lung cancer (84.6%), but is expressed less in squamous cell carcinoma of the head and neck. Furstenberger et al. also reported associations between blood IGF-1 concentration and breast cancer, prostate cancer, lung cancer and others. That is, IGF-1 is an important mediator of growth hormone action; when elevated it affects cell differentiation and growth and interferes with apoptosis. |
| PAI-1 | Can (low) | Inhibitor of tissue plasminogen activator (t-PA) and an important enzyme in the fibrinolysis process. An increase in PAI-1 reduces t-PA activity and impairs fibrinolytic function. Increased in deep vein thrombosis, myocardial infarction, normal pregnancy and sepsis. |
| ApoA1 | Can (low) | A component of HDL (high-density lipoprotein); acts as a cofactor of LCAT (lecithin-cholesterol acyltransferase) and takes part in transporting cholesterol from tissues to the liver. |
| proApoA1 | Can (low) | Pro form of apolipoprotein A1. |
| RANTES | Can (low) | Chemotactic factor for T cells, eosinophils and basophils. Recruits leukocytes to sites of inflammation. Associated with asthma and allergic rhinitis. |
| TTR | Can (low) | Thyroid hormone-binding protein. Probably transports thyroxine from the bloodstream to the brain. Defects in TTR are the cause of amyloidosis type 1 (AMYL1), a hereditary generalized amyloidosis due to transthyretin amyloid deposition. |

Next, a method for selecting composite biomarkers for lung cancer diagnosis is described with reference to FIG. 2.

First, composite biomarker combinations are generated for the selected lung cancer biomarker group (S13) by listing every composite biomarker that can be formed from the 13 biomarkers first selected by feature selection. The number of composite biomarker combinations is the sum of 13Cr over 14 > r > 1, a total of 8178. A cancer/normal prediction statistical model is built for each of these combinations, and the 8178 statistical models are compared on the basis of the evaluation indices (accuracy, sensitivity, specificity, etc.) obtained from each model.
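The count of 8178 can be checked directly: it is the number of subsets of the 13 markers containing at least two members, that is, 2^13 minus the empty set and the 13 singletons. The following sketch enumerates the combinations; the variable and function names are illustrative.

```python
from itertools import combinations
from math import comb

# The 13 selected biomarkers named in the text.
MARKERS = [
    "A1AT", "CYFRA21-1", "IGF-1", "AFP", "proApoA1", "EGFR", "CEA",
    "RANTES", "PAI-1", "TTR", "CA19-9", "ApoA1/proApoA1", "ApoA1",
]

def marker_combinations(markers):
    """Yield every combination of 2 or more markers (13Cr for 14 > r > 1)."""
    for r in range(2, len(markers) + 1):
        yield from combinations(markers, r)

# Closed-form count: sum of binomial coefficients for r = 2..13.
total = sum(comb(len(MARKERS), r) for r in range(2, len(MARKERS) + 1))
print(total)
```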

A statistical model gives the best fit to the data used to build it, so if a model is built from a single data set there is no way to verify whether it also works well on data in general. For this reason a training set and a test set are created. For example, with a sample size of 200 (100 cancer, 100 normal), 100 samples (50 cancer, 50 normal) can be drawn at random and used as the training set, and the remaining 100 used as the test set. (For a given sample size, how much is used for training and how much for testing may vary from case to case, but the training set is usually at least as large as the test set.) A model is first built with the training set and then applied to the test set (predicting cancer/normal for the test set), and the actual and predicted values are compared to verify how well the model works. Repeating this cycle of "build the model on the training set, verify it on the test set" many times, rather than performing it once, helps to produce a more robust model (a more global model that depends less on particular data).
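The repeated random split described above can be sketched as follows. This is a minimal illustration assuming a class-balanced (stratified) split with the 200-sample example from the text; the function name, the use of five repetitions and the seeds are hypothetical choices, not details from the original.

```python
import random

def stratified_split(labels, train_fraction=0.5, rng=None):
    """Randomly split sample indices into training and test sets,
    drawing the same fraction from each class."""
    rng = rng or random.Random()
    train, test = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = int(round(len(idx) * train_fraction))
        train.extend(idx[:cut])
        test.extend(idx[cut:])
    return train, test

# 200 samples as in the example: 100 cancer (1) and 100 normal (0).
labels = [1] * 100 + [0] * 100

# Repeat the split several times; each repetition would train on `train`
# and evaluate on `test` (model fitting is omitted in this sketch).
splits = [stratified_split(labels, 0.5, random.Random(seed)) for seed in range(5)]
for train, test in splits:
    print(len(train), len(test), sum(labels[i] for i in train))
```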

Next, the decision tree is described. The decision tree is one of the analysis techniques of data mining and can be described as a method of finding decision rules on the basis of a tree structure. It is a powerful and widely used technique that charts decision rules in order to classify a population of interest into several subgroups or to make predictions. Common decision tree algorithms differ from one another in how the tree is formed, for example in their stopping rules and pruning. The rules used in a decision tree are as follows.

1. Splitting criterion: child nodes are formed by determining which predictor variable, split in which way, best separates the distribution of the target variable. The degree to which the distribution of the target variable is separated is measured using purity or another classification criterion.

2. Stopping criterion: the rule specifying that no further splitting occurs and the current node becomes a terminal node.

3. Pruning: a decision tree with too many nodes is likely to have a very large prediction error when applied to new data. It is therefore desirable to remove inappropriate nodes from the grown tree and to select, as the final model, a decision tree with a sub-tree structure of suitable size.

When the target variable is discrete (for example, cancer/normal), splitting is based on the frequencies falling in each category of the target variable, and the result is a classification tree.

For example, suppose the probability of cancer is very high when the value of the biomarker CYFRA21.1 is greater than 5. If, out of 100 people, the 50 with a CYFRA21.1 value above 5 include 40 actual cancer patients and 10 normal individuals, and the 50 with a CYFRA21.1 value below 5 include 10 cancer patients and 40 normal individuals, this can be summarized as in Table 8 below.

| | Cancer | Normal | Total |
|---|---|---|---|
| CYFRA21.1 > 5 | 40 | 10 | 50 |
| CYFRA21.1 < 5 | 10 | 40 | 50 |

Table 8 covers the case in which only the CYFRA21.1 value is used. Applying additional criteria to the data divided in this way (CEA < 3 versus >= 3, or CEA < 4 versus >= 4) divides the data further, as illustrated in FIG. 3. For example, if person A has a CYFRA21.1 value of 5 and a CEA value of 4.5, then according to the decision tree used as the example this combination of biomarker values falls in Terminal Node 3. Under the majority vote principle, more than half of Terminal Node 3 is "cancer", so person A is judged to be "cancer". If, on the other hand, person B has biomarker values CYFRA21.1 = 7.0 and CEA = 2.0, person B falls in Terminal Node 4 and is judged to be "normal".
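To make the notion of purity in the splitting criterion concrete, the sketch below scores the single CYFRA21.1 split of Table 8 with Gini impurity. Gini impurity is an assumption here: it is a common choice for CART-style trees, but the text only refers to "purity or another classification criterion". The function names are illustrative.

```python
def gini(counts):
    """Gini impurity of a node given per-class counts (0 means pure)."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def weighted_gini(children):
    """Average impurity of the child nodes, weighted by node size."""
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * gini(c) for c in children)

# (cancer, normal) counts from Table 8.
parent = (50, 50)
children = [(40, 10),   # CYFRA21.1 > 5
            (10, 40)]   # CYFRA21.1 < 5

impurity_decrease = gini(parent) - weighted_gini(children)
print(gini(parent), weighted_gini(children), impurity_decrease)
```

A candidate split with a larger impurity decrease separates cancer from normal more cleanly, which is how a tree-growing procedure of this kind would compare alternative thresholds.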

Next, the RF algorithm is described. Random forest (RF; Breiman L, Machine Learning 45(1):5-32, 2001) is a type of bagging algorithm built from a combination of CART decision trees, proposed by Leo Breiman and Adele Cutler. The nodes of each tree are organized so that high-dimensional data can be split into small pieces of lower dimension and classified quickly. The trees reach a final classification through ensembling and voting. Trees generated from random vectors with the same probability distribution are each constructed independently, and as the number of trees goes to infinity the misclassification (generalization) error converges. By using randomness and the out-of-bag (random selection without replacement) technique, RF achieves accuracy comparable to AdaBoost, is robust at decision boundaries and to noise, and converges faster than bagging and boosting.

The RF algorithm itself creates multiple (training data set, test data set) pairs from the given data (for example 50; this number is an option the user can adjust) and grows a decision tree from each, yielding 50 independent decision trees. Once the 50 trees have been grown and a test set is fed in, each test sample receives 50 decisions (cancer/normal), one from each tree, and the final result is taken by majority vote over these 50 decisions. For example, if for person A 45 decision trees decide cancer and 5 decide normal, the average score (the proportion of cancer decisions among all 50 decisions) is 45/50 = 0.9. Assuming a cut-off value of 0.5 for distinguishing cancer from normal, person A's average score of 0.9 is greater than 0.5, so person A is judged to be "cancer".
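The voting step in this example reduces to a short calculation. The sketch below encodes each tree's decision as 1 (cancer) or 0 (normal) and applies the 0.5 cut-off from the text; the function names and the vote encoding are illustrative assumptions.

```python
def average_score(votes):
    """Fraction of trees voting 'cancer' (votes are 1 for cancer, 0 for normal)."""
    return sum(votes) / len(votes)

def classify(votes, cutoff=0.5):
    """Return 'cancer' when the average score exceeds the cut-off, else 'normal'."""
    return "cancer" if average_score(votes) > cutoff else "normal"

# Person A from the text: 45 of 50 trees decide cancer, 5 decide normal.
votes_a = [1] * 45 + [0] * 5
print(average_score(votes_a), classify(votes_a))
```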

A method that combines the decisions of several statistical models (decision trees, in the case of RF) into one final decision is called an ensemble technique, and the present invention is characterized by the use of such a technique. Besides the RF algorithm there is also the boosting algorithm; the two are equivalent in that both use an ensemble technique. A person skilled in the art could of course readily adopt a boosting algorithm to carry out the idea of the present invention, and the practice of the present invention naturally includes boosting algorithms.

The basic idea of boosting is that multiple weak learners combine to form one strong learner. A weak learner here is a classifier better than random guessing, i.e. one whose accuracy is at least 0.5, and may be any statistical classifier such as a decision tree or logistic regression. A strong learner is a classifier whose accuracy is far better than random guessing. The algorithm is as follows.

1. Given N data points, assign every point the same weight Wi = 1/N.

2. Fit weak classifier #1 to the data using the given weights.

3. Increase the weights of the data misclassified by weak classifier #1 and decrease the weights of the correctly classified data.

4. Fit weak classifier #2 to the data using the weights recalculated in step 3.

5. Increase the weights of the data misclassified by weak classifier #2 and decrease the weights of the correctly classified data.

In this way,

Step 1: Generate a weak classifier using the given weights.

Step 2: Recalculate the weights according to whether each data point was misclassified or correctly classified by that weak classifier.

Steps 1 and 2 are repeated until a suitable stopping criterion is satisfied. For example, suppose 10 weak classifiers have been generated; the final result is then derived by combining these 10 weak classifiers.
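The reweighting loop above can be sketched in code. The text does not specify the weight update rule, the weak learner or how the weak classifiers are combined, so this sketch makes three assumptions: it uses the AdaBoost update, one-variable threshold classifiers ("stumps") as weak learners, and a weighted vote for the final decision. Labels are encoded as +1 (cancer) and -1 (normal), and the toy data are hypothetical.

```python
import math

def stump_predict(x, threshold, polarity):
    """Predict +1/-1 by comparing a single value against a threshold."""
    return polarity if x >= threshold else -polarity

def train_stump(xs, ys, weights):
    """Return (weighted error, threshold, polarity) of the best stump."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            err = sum(w for w, x, y in zip(weights, xs, ys)
                      if stump_predict(x, threshold, polarity) != y)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best

def boost(xs, ys, rounds=10):
    """Run `rounds` of AdaBoost-style reweighting; return (ensemble, weights)."""
    n = len(xs)
    weights = [1.0 / n] * n                       # equal initial weights 1/N
    ensemble = []
    for _ in range(rounds):
        err, threshold, polarity = train_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)     # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)   # vote strength of this stump
        ensemble.append((alpha, threshold, polarity))
        # Misclassified points are scaled up, correctly classified ones down.
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
                   for w, x, y in zip(weights, xs, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble, weights

def predict(ensemble, x):
    """Combine the weak classifiers by a weighted vote."""
    score = sum(a * stump_predict(x, t, p) for a, t, p in ensemble)
    return 1 if score >= 0 else -1

# Hypothetical one-marker toy data; no single threshold separates it.
xs = [1, 2, 3, 4, 5, 6]
ys = [-1, -1, 1, -1, 1, 1]
ensemble, final_weights = boost(xs, ys, rounds=10)
```

After the first round on this toy data, the single misclassified point (x = 4) carries half of the total weight, which forces the next weak classifier to concentrate on it.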

In the manner described above, a plurality of cancer/normal prediction statistical models is generated for each of the 8178 composite biomarker combinations, i.e. all possible combinations, and the optimal cancer/normal prediction statistical model is then selected. For a particular composite biomarker combination of complexity n (X1, X2, ... Xn), a portion of the samples in which the n biomarkers were used is set aside as a training set; for the samples in the training set, a plurality of decision trees like that of FIG. 3 is generated, each involving one or more of the n biomarkers; and the decision trees are combined using the ensemble technique to produce a plurality of candidate cancer/normal prediction statistical models. A test set is formed from the samples not included in the training set, and the prediction performance of the candidate models is verified on that test set. Prediction performance may be expressed by the evaluation indices or the like. Because there is a very large number of ways to divide the whole sample into a training set and a test set, there will naturally be many candidate models.

The Avg.Score of person A is the proportion of cancer decisions among the n cancer/normal decisions produced by n decision trees. In the case of a random forest, one prediction model is a collection of multiple decision trees built using the information of a specific marker combination (for example, RANTES+CYFRA21.1).
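As a minimal illustration of this definition, the following Python sketch computes Avg.Score from a list of per-tree decisions. The vote values and the function name `avg_score` are hypothetical and are used only to show the arithmetic.

```python
def avg_score(tree_votes):
    """Fraction of trees voting 'cancer' (1) among all tree votes (0 = normal)."""
    if not tree_votes:
        raise ValueError("at least one tree vote is required")
    return sum(tree_votes) / len(tree_votes)

# Hypothetical votes of 10 trees for one person: 7 say cancer, 3 say normal.
votes_person_a = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]
score_a = avg_score(votes_person_a)
print(score_a)
```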

The candidate cancer/normal prediction statistical models generated in this way may take the form of Equation 8 below. For a prediction model of the form of Equation 8, or for each decision tree, the X values specified for the samples in which the n complex biomarkers (X1, X2, ..., Xn) were used are input. (The input may be the X value itself, for example the expression value of the biomarker RANTES or expression-ratio information such as ApoA1/proApoA1, or a value obtained by special processing of the X value, for example a value processed through a partial dependence plot/function relationship.) When the X values of a sample are input to the candidate models or to each decision tree, each candidate model or decision tree yields a decision value between 0 (normal) and 1 (cancer), and averaging these values produces an average value (Avg.Score) such as that shown in Table 7 below. Because the true cancer/normal label of each sample is known, it is also possible to determine which of the candidate models or decision trees has the best evaluation index.

Many of the candidate cancer/normal prediction statistical models will, of course, be models formed by ensemble-combining the plurality of decision trees used in the ensemble technique.

Next, data such as that shown in Table 9 below is obtained for each model through the respective cancer/normal prediction statistical model.

If Avg.Score exceeds 0.5, the sample is judged to be cancer; otherwise it is judged to be normal. The cut-off of 0.5 is only one example and may be changed to any number between 0 and 1 depending on the situation. Once Avg.Score has been calculated for a complex biomarker and a cancer/normal decision has been made for each complex biomarker, data such as that in Table 9 is obtained, and from this data evaluation index values of diagnostic or predictive performance, such as sensitivity, specificity and accuracy, can be generated for each cancer/normal prediction statistical model.

When a random forest (RF) decides between cancer and normal from the average score, a cut-off point for the average score is required, that is, the score above which a sample is regarded as cancer. In the example above, a sample is judged to be cancer if the average score exceeds 0.5 and normal otherwise, and the cancer/normal decision changes with this cut-off value. As the cut-off increases, the proportion of samples judged to be cancer decreases; as it decreases, that proportion increases. When the decisions change in this way, evaluation index values such as sensitivity and specificity change in turn. The cut-off value can therefore be varied and the corresponding pair of evaluation index values (sensitivity, 1 - specificity) plotted. For example, using cut-off values of 0.01, 0.02, 0.03, 0.04, ..., 0.98, 0.99, 1, the corresponding (sensitivity, 1 - specificity) values can be obtained and drawn on a two-dimensional plane, with 1 - specificity on the x-axis and sensitivity on the y-axis; an example is shown in FIG. 4. In FIG. 4, the blue line (the arc-shaped line annotated with sensitivity (Sn) and 1 - specificity (Sp) values) is the ROC curve. The closer a statistical model is to perfect, the closer this curve comes to the upper-left corner of the box (x = 0.0, y = 1.0), and the closer the area under the curve (AUC) comes to 1. The ROC curve allows the performance of models to be compared in terms of sensitivity and specificity at the same time: the closer the AUC is to 1, the better the statistical model. The AUC can be used as a performance evaluation index value, and the ROC curve can also be used to find a cut-off value.
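The cut-off sweep described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the scores and labels are hypothetical, the end points (0, 0) and (1, 1) are added explicitly so the curve is closed, and the area is computed with the trapezoidal rule, which the text does not specify.

```python
def confusion_rates(scores, labels, cutoff):
    """Sensitivity and specificity when Avg.Score > cutoff is called cancer."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s > cutoff)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s <= cutoff)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s <= cutoff)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s > cutoff)
    return tp / (tp + fn), tn / (tn + fp)

def roc_points(scores, labels, step=0.01):
    """Sorted (1 - specificity, sensitivity) pairs for cut-offs 0, step, ..., 1."""
    points = {(0.0, 0.0), (1.0, 1.0)}  # assumption: close the curve at both ends
    for k in range(round(1 / step) + 1):
        sens, spec = confusion_rates(scores, labels, k * step)
        points.add((1 - spec, sens))
    return sorted(points)

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical Avg.Scores for two cancer (1) and two normal (0) samples.
scores = [0.9, 0.4, 0.6, 0.1]
labels = [1, 1, 0, 0]
area = auc(roc_points(scores, labels))
print(area)
```

In this toy data one normal sample scores higher than one cancer sample, so three of the four cancer/normal pairs are ordered correctly and the area is 0.75.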

Next, from among the generated complex biomarker combinations, those with excellent lung cancer diagnostic ability are selected (S14). An example of a method for judging which complex biomarker combination is more suitable is presented below. For the 13Cr complex biomarker combinations, the importance of each biomarker in each statistical model is calculated. (Every individual combination forms one or more cancer/normal prediction statistical models, and an optimal statistical model can of course be selected from among them.)

Importance indicates how strongly a particular biomarker is associated with the cancer/normal decision in a particular statistical model. In FIG. 3, four terminal nodes (the nodes at the very ends of the tree) were generated using the values of two biomarkers, CYFRA21.1 and CEA; when a sample is input, it follows the tree and is judged to be cancer or normal according to the majority value of the terminal node it finally reaches. The CYFRA21.1 value used at the first split divides a large part of the samples into cancer and normal, which means that the CYFRA21.1 biomarker value is strongly associated with cancer/normal status. To measure the importance of CYFRA21.1, its values are randomly permuted. That is, because the CYFRA21.1 values are shuffled at random and then reassigned to the patients, the correlation between cancer/normal status and CYFRA21.1 almost disappears. The difference between the correct decision ratio at each terminal node when the original CYFRA21.1 data is used (1) and the correct decision ratio at each terminal node when the randomly permuted data is fed into the decision tree (2) is measured, and this value is the importance of CYFRA21.1. If the biomarker values show a clear pattern according to cancer/normal status, the difference between the correct decision ratio obtained using that pattern and the correct decision ratio obtained when the biomarker values are effectively ignored (randomly permuted) becomes large. Conversely, for a biomarker that has little to do with cancer/normal status, the correct decision ratio differs little between the original data and the randomly permuted data.
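The permutation idea can be sketched in Python as below. This is a simplified illustration, not the procedure of the description: it measures the drop in the overall correct decision ratio rather than the ratio at each terminal node, the toy data are hypothetical, and `toy_tree` is a hand-written stand-in for a fitted decision tree with an arbitrary threshold.

```python
import random

def accuracy(predict, rows, labels):
    """Correct decision ratio of `predict` on the given rows."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)

def permutation_importance(predict, rows, labels, column, repeats=20, seed=0):
    """Mean drop in the correct decision ratio after shuffling one column."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    drops = []
    for _ in range(repeats):
        shuffled = [r[column] for r in rows]
        rng.shuffle(shuffled)  # break the link between this marker and the label
        permuted = [r[:column] + (v,) + r[column + 1:]
                    for r, v in zip(rows, shuffled)]
        drops.append(baseline - accuracy(predict, permuted, labels))
    return sum(drops) / repeats

# Hypothetical toy data: (CYFRA21.1, CEA) on an arbitrary scale; 1 = cancer.
rows = [(0.5, 1.0), (0.7, 3.0), (0.9, 2.0), (0.6, 4.0),
        (-2.0, 1.5), (-1.9, 3.5), (-2.0, 2.5), (-1.8, 0.5)]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

def toy_tree(row):
    # Stand-in for a fitted tree: splits on CYFRA21.1 only (illustrative threshold).
    return 1 if row[0] > -1.0 else 0

imp_cyfra = permutation_importance(toy_tree, rows, labels, column=0)
imp_cea = permutation_importance(toy_tree, rows, labels, column=1)
```

Because the stand-in tree never looks at CEA, shuffling CEA leaves every decision unchanged and its importance is zero, whereas shuffling CYFRA21.1 lowers the correct decision ratio.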

Once the importance of every biomarker participating in each statistical model can be calculated as described above, the biomarkers in that model can be given an importance ranking. For example, for a statistical model containing the biomarker combination IGF-1+CYFRA21.1+RANTES, the importance of IGF-1, CYFRA21.1 and RANTES in that model is known, and the importance ranking might be, say, CYFRA21.1 first, IGF-1 second and RANTES third. When the importance value and importance ranking of every participating biomarker are known for all 8178 statistical models, these values and rankings can be used to select superior complex biomarkers. There are various ways of doing so; one exemplary method is presented below.

From each of the 8178 statistical models, the biomarker ranked first in importance and the biomarker ranked second can be extracted, giving a list of 8178 "rank-1 biomarker + rank-2 biomarker" entries in total. The frequency of each distinct "rank-1 biomarker + rank-2 biomarker" pair can then be calculated. In calculating this frequency, a pair may be treated as the same regardless of which biomarker is first and which is second (the combination method), or the two orders may be treated as different (the permutation method). Under the combination method, "IGF-1+CYFRA21.1" and "CYFRA21.1+IGF-1" are the same: in any statistical model, as long as one of IGF-1 and CYFRA21.1 ranks first and the other ranks second, the frequency of "IGF-1+CYFRA21.1" increases by 1. Under the permutation method, a model in which IGF-1 ranks first and CYFRA21.1 second is treated separately from a model in which CYFRA21.1 ranks first and IGF-1 second; that is, "IGF-1+CYFRA21.1" and "CYFRA21.1+IGF-1" are handled as different pairs.
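The two counting conventions can be illustrated with a short Python sketch. The importance values below are made up for three hypothetical models, and the function names are introduced only for this example.

```python
from collections import Counter

def top_two(importances):
    """Return the names of the rank-1 and rank-2 biomarkers of one model."""
    ranked = sorted(importances, key=importances.get, reverse=True)
    return ranked[0], ranked[1]

def pair_frequencies(models, ordered):
    """Count rank-1 + rank-2 pairs; ordered=True is the permutation method."""
    counts = Counter()
    for importances in models:
        first, second = top_two(importances)
        key = (first, second) if ordered else tuple(sorted((first, second)))
        counts[key] += 1
    return counts

# Hypothetical importance values for three models (illustrative numbers only).
models = [
    {"IGF-1": 0.30, "CYFRA21.1": 0.25, "RANTES": 0.10},
    {"IGF-1": 0.20, "CYFRA21.1": 0.35},
    {"CYFRA21.1": 0.40, "RANTES": 0.15, "CEA": 0.05},
]
combo = pair_frequencies(models, ordered=False)  # combination method
perm = pair_frequencies(models, ordered=True)    # permutation method
```

With these values the combination method counts the IGF-1/CYFRA21.1 pair twice, while the permutation method counts each order once.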

Alternatively, ranks up to n, such as the third rank, may be included in the combination method, rather than only the first and second ranks, to find important biomarker combinations on the basis of frequency. The permutation method may likewise be applied to biomarker combinations up to rank n (n > 1), calculating their frequencies to generate important biomarker combinations.

In addition, a weight may be assigned to each rank, so that important biomarker combinations reflecting both frequency and weight are found under the combination method or the permutation method. The weight may, for example, be the importance value itself, or weights may be assigned arbitrarily or on a statistical basis, such as a weight of 1 for rank 1 and 0.5 for rank 2.
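One way to combine frequency and weight, sketched here in Python under the assumption that the weight is the importance value itself and that pairs are unordered (the combination method), is to add up the importance values of the top two biomarkers each time a pair occurs. The scoring rule, the data and the function name are illustrative choices, not a rule stated in the text.

```python
from collections import defaultdict

def weighted_pair_scores(models):
    """Sum the importance values of the top-2 biomarkers per unordered pair."""
    scores = defaultdict(float)
    for importances in models:
        ranked = sorted(importances, key=importances.get, reverse=True)[:2]
        if len(ranked) < 2:
            continue  # a pair needs at least two biomarkers
        key = tuple(sorted(ranked))
        scores[key] += importances[ranked[0]] + importances[ranked[1]]
    return dict(scores)

# Hypothetical importance values for three models (illustrative numbers only).
models = [
    {"IGF-1": 0.30, "CYFRA21.1": 0.25, "RANTES": 0.10},
    {"IGF-1": 0.20, "CYFRA21.1": 0.35},
    {"CYFRA21.1": 0.40, "RANTES": 0.15, "CEA": 0.05},
]
scores = weighted_pair_scores(models)
best_pair = max(scores, key=scores.get)
```

A pair that appears often and with high importance values accumulates the largest score; fixed rank weights such as 1 and 0.5 could be substituted for the importance values in the same loop.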

Through the processing described above, a relative superiority index value can be calculated for each of the 13Cr complex biomarkers. Relative superiority is an index of how superior a particular complex biomarker is compared with the other complex biomarkers.

Meanwhile, performance measures such as sensitivity, specificity and accuracy can be calculated for each of the 13Cr complex biomarker combinations, and the optimal complex biomarker may also be selected on the basis of these performance values. Sensitivity, specificity and accuracy are only examples of the performance of each complex biomarker (each complex biomarker corresponds one-to-one to a statistical model); other performance indices can of course be calculated, the area under the ROC curve being one example.

When a complex biomarker is selected, performance (for example, evaluation indices such as sensitivity, specificity, accuracy and the area under the ROC curve) tends to improve as the number of single biomarkers in the complex biomarker increases. This number is called the complexity: IGF-1+CYFRA21.1 has complexity 2, and IGF-1+CYFRA21.1+RANTES has complexity 3. On the other hand, when applied to an actual product, higher complexity may bring 1) increased manufacturing cost, 2) increased cost and difficulty of information processing such as data collection and analysis, and 3) a greater likelihood of statistical correlation among the measured values. In addition, if a complex biomarker of complexity n, being a combination of specific biomarkers, already gives sufficient and satisfactory performance, the net performance (the performance increment) gained by combining further biomarkers may be small.

Therefore, once the lower acceptable limit of performance has been exceeded, it is reasonable to increase the complexity only after weighing the net performance against the cost of the increment in complexity. In other words, an increase in complexity is justified when the ratio of the change in benefit to the change in cost is large. Which biomarker to add when increasing the complexity can be judged from the performance value. For example, if a combination of about five biomarkers (a 5-marker complex biomarker) gives sufficient performance and combining one or more additional biomarkers makes little difference to performance, a lung cancer diagnostic biomarker product may be manufactured with the combination of about five biomarkers.

Table 10 below is an example showing the change (increase) in each evaluation index as biomarkers are added one at a time to IGF-1+CYFRA21.1. As Table 10 shows, each evaluation index saturates as the number of biomarkers increases. If an accuracy of 93% is sufficient (that is, if the cut-off is an accuracy of 93%), the model using the complex biomarker "IGF-1+CYFRA21.1+A1AT+RANTES+CEA+CA19-9" may suffice, and a model that adds TTR to it may be unnecessary.

M_01   M_02       M_03  M_04    M_05  M_06           Accuracy  Sensitivity
IGF-1  CYFRA21.1                                     0.8629    0.8213
IGF-1  CYFRA21.1  A1AT                               0.8895    0.8708
IGF-1  CYFRA21.1  A1AT  RANTES                       0.9238    0.9226
IGF-1  CYFRA21.1  A1AT  RANTES  CEA                  0.9266    0.919
IGF-1  CYFRA21.1  A1AT  RANTES  CEA   CA19-9         0.9300    0.9207
IGF-1  CYFRA21.1  A1AT  RANTES  CEA   CA19-9  TTR    0.9315    0.9236
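The saturation visible in Table 10 and the 93% accuracy cut-off discussed above can be checked with a short Python sketch. The numbers are copied from Table 10; the function names and the rule of picking the smallest panel that reaches the required accuracy are illustrative assumptions.

```python
# (panel, accuracy, sensitivity) rows copied from Table 10.
TABLE_10 = [
    (("IGF-1", "CYFRA21.1"), 0.8629, 0.8213),
    (("IGF-1", "CYFRA21.1", "A1AT"), 0.8895, 0.8708),
    (("IGF-1", "CYFRA21.1", "A1AT", "RANTES"), 0.9238, 0.9226),
    (("IGF-1", "CYFRA21.1", "A1AT", "RANTES", "CEA"), 0.9266, 0.919),
    (("IGF-1", "CYFRA21.1", "A1AT", "RANTES", "CEA", "CA19-9"), 0.9300, 0.9207),
    (("IGF-1", "CYFRA21.1", "A1AT", "RANTES", "CEA", "CA19-9", "TTR"), 0.9315, 0.9236),
]

def accuracy_gains(table):
    """Accuracy increment obtained by each added biomarker."""
    return [(cur[0][-1], round(cur[1] - prev[1], 4))
            for prev, cur in zip(table, table[1:])]

def smallest_sufficient_panel(table, min_accuracy):
    """First (smallest) panel whose accuracy reaches the required minimum."""
    for markers, acc, _sens in table:
        if acc >= min_accuracy:
            return markers
    return None

gains = accuracy_gains(TABLE_10)
panel = smallest_sufficient_panel(TABLE_10, 0.93)
```

The increments shrink from 0.0266 and 0.0343 for A1AT and RANTES to 0.0015 for TTR, and the six-marker panel ending in CA19-9 is the smallest one that reaches 93% accuracy.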

The cell values of the tested biomarkers in Table 2 above are the experimental values of the biomarker candidates after log transformation. Because these experimental values are measurements, errors can occur and outliers exist; if the values are used as they are, whether or not they are log-transformed, such outliers can become a major factor lowering the evaluation indices of the statistical model. A method for effectively removing the outliers, minimizing their influence or correcting them is therefore needed, and an effective approach is to use decision trees (classification trees). A classification tree model ranks the given data and partitions it recursively, with the aim that each resulting partition contains only, or mostly, a single value of the response variable. Ensemble techniques based on such trees include various classification techniques such as Bagging, Boosting and Random Forest. An ensemble technique builds multiple trees using decision tree nodes and combines them into a more stable and powerful classifier. Boosting builds a highly accurate classification model by creating and combining multiple weak classifiers (classifiers that generally perform only slightly better than random selection). Boosting can also take interaction terms between variables into account, and it provides the importance of each variable. Random forest, instead of building a single best classification tree, builds many classification trees at random and combines them. The advantages of random forest are high classification accuracy, insensitivity to outliers, and fast and simple computation.

This section describes a method of transforming the data using the partial dependence plot of an ensemble technique (Boosting or Random Forest), taking advantage of the strengths of ensemble techniques to minimize the influence of outliers when building a cancer/normal prediction model.

In actual measurement of X (the variables), such as the expression level of each biomarker, outliers arise for a variety of reasons. If outliers are used as they are, those contained in the samples severely distort the prediction model when it is generated; and when the prediction model is applied, an outlier in the measured values of a patient is likely to distort the cancer/normal decision considerably. In particular, when a complex biomarker combination is used and a particular biomarker in the combination has an outlier, that outlier can strongly affect the value of the overall decision model. The influence of directly using such outliers therefore needs to be reduced. Because a decision tree is essentially based on classification, an outlier is not reflected directly; only its relative order, its rank or whether it meets a split criterion is reflected, so the influence of the outlier is greatly reduced.

The logic by which outliers can be removed is now explained in more detail. A partial dependence plot shows the marginal effect of a particular variable's value on the response variable (cancer/normal). In general, the partial dependence function relationship is obtained as follows. A random forest is first applied to a combination of two biomarkers, X = (Xs, Xc). Suppose, for example, that the random forest generates 50 decision trees. Combining the results of the 50 decision trees gives, for each patient's biomarker values X = (Xs, Xc), the value of the following function f(Xs, Xc).

f(Xs, Xc) = f(X) = log(p(X) / (1 - p(X)))

Here, p(X) is the proportion of the 50 decision trees in which the patient with marker values X was classified as cancer, that is, the Avg.Score. The function value f(Xs, Xc) can be calculated in this way for every patient. To obtain the partial dependence value of the first biomarker (for example RANTES, denoted Xs in this example), patients having the same Xs value are grouped and the average of their f(Xs, Xc) values, denoted g(Xs), is calculated.

For example, the f(90, Xc) values of patients whose RANTES value is Xs = 90 are collected and averaged to give g(90), and the f(65, Xc) values of patients whose RANTES value is Xs = 65 are collected and averaged to give g(65). Averaging the f values that share the same Xs value in this way yields pairs such as (Xs = 90, g(90)) and (Xs = 65, g(65)). Plotting these pairs with Xs on the x-axis and g(Xs) on the y-axis gives the marginal effect of Xs on f, and this function is the partial dependence plot.

Because the decision trees used to estimate f(Xs, Xc) from the original data form an algorithm that uses the order of the data rather than its actual values, the result can be less sensitive to outliers.
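The grouping-and-averaging procedure for g(Xs) can be sketched in Python as follows. The RANTES values and Avg.Scores are hypothetical. Clipping p(X) away from 0 and 1 is an added assumption, needed because the log-odds is undefined when every tree agrees, and grouping by exactly equal Xs values follows the wording of the text; continuous measurements would in practice need binning or a similar device.

```python
import math
from collections import defaultdict

def logit(p, eps=1e-6):
    """f = log(p / (1 - p)); p is clipped to [eps, 1 - eps] (assumption)."""
    p = min(max(p, eps), 1 - eps)
    return math.log(p / (1 - p))

def partial_dependence(xs_values, avg_scores):
    """Sorted (Xs, g(Xs)) pairs; g is the mean f over patients sharing Xs."""
    groups = defaultdict(list)
    for xs, p in zip(xs_values, avg_scores):
        groups[xs].append(logit(p))
    return sorted((xs, sum(f) / len(f)) for xs, f in groups.items())

# Hypothetical RANTES values (Xs) and Avg.Scores p(X) for five patients.
xs_values = [90, 90, 65, 65, 65]
avg_scores = [0.2, 0.4, 0.8, 0.6, 0.7]
pairs = partial_dependence(xs_values, avg_scores)
```

In this toy data the patients with the lower RANTES value have the higher average log-odds, so g(65) is greater than g(90).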

A partial dependence plot, or partial dependence function relationship, removes the influence of the remaining variables on one variable. For example, when the input variables Xs and Xc have a joint distribution and the effect of Xs is of interest, the joint distribution is averaged over Xc. A partial dependence function relationship can be generated for each X, and it corresponds to a partial dependence plot. X can be transformed using this function relationship or plot. That is, for two or more samples, the original variable value of each biomarker is obtained for each sample (S51); predetermined processing is performed on the original input variable values of each biomarker to construct a partial dependence plot or partial dependence function relationship for each biomarker (S52); the plot or function relationship for each biomarker is used to generate a transformed variable value for each original variable value (S53); and the transformed variable values are used to generate a predetermined cancer/normal prediction statistical model or to run such a model (S54).

Next, the idea of the present invention using the partial dependence plot or partial dependence function relationship is described in still more detail, taking as an example a statistical model that uses a complex biomarker of complexity 3 consisting of A1AT, CYFRA21.1 and RANTES. First, 100 data points are drawn from the existing data, 50 cancer samples (diagnosed as cancer) and 50 normal samples (diagnosed as normal), with y = 0 assigned to normal samples and y = 1 to cancer samples. Data such as that in Table 11 below can then be prepared.

Sample index  A1AT      CYFRA21.1  RANTES    y
221           3.45308   -2         4.708709  0
223           3.341135  -2         4.958518  0
222           3.568896  -2         4.577357  0
246           3.068592  -2         4.900771  0
207           4.538396  -2         5.014241  0
182           3.674541  -1.94122   4.864592  0
146           3.350815  -2         4.760304  0
197           3.003192  -2         4.741928  0
167           3.36072   -0.5627    4.863431  0
...           ...       ...        ...       ...
120           3.681963  0.072985   4.54931   1
37            3.779961  -2         4.592287  1
6             3.408483  -0.11415   4.698918  1
106           5.341259  0.036621   4.418414  1
121           4.328482  0.550228   5.00923   1
8             3.513981  0.134559   4.732865  1
43            4.122104  -2         4.179332  1
118           5.220087  0.471732   3.972027  1
112           5.117792  0.663135   4.758335  1
...           ...       ...        ...       ...

Equation 1 below denotes a sample of 100 pairs, each consisting of the three-dimensional explanatory variable xi = (A1AT, CYFRA21.1, RANTES) and the categorical response variable yi, which indicates either the specific disease group (lung cancer) or the normal group.

[Equation 1]

Next, using the sample composed of the three biomarkers A1AT, CYFRA21.1 and RANTES, a statistical model is built with a tree-based ensemble method.

Expressed as a formula, the decision tree method is given by Equation 2 below.

[Equation 2]

Here, Rj denotes the mutually disjoint regions of the explanatory-variable space at the terminal nodes, and θ = {Rj, γj} are the parameters to be estimated.
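The image of Equation 2 is not reproduced in this text. As an assumption only, a form consistent with the symbols defined above (disjoint terminal-node regions Rj, a constant γj for each region, and the parameter set θ) is the usual sum-of-indicators representation of a tree. The names T and J (the number of terminal nodes) are introduced here for illustration and do not come from the source.

```latex
T(x;\theta) = \sum_{j=1}^{J} \gamma_j \, I(x \in R_j),
\qquad \theta = \{R_j, \gamma_j\}_{j=1}^{J}
```

In this reading, a sample x that falls in region Rj receives the value γj, and the regions and constants are what the tree-growing procedure estimates.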

Next, it is explained how the partial dependence plot or partial dependence function relationship is obtained for each biomarker. To minimize the influence of outliers when building the lung cancer diagnosis model, the data is transformed using the partial dependence plot or partial dependence function relationship of an ensemble technique (Boosting or Random Forest). A partial dependence plot or function relationship removes the influence of the remaining variables on one variable. Consider a joint distribution of three original input variables, X_A1AT, X_CYFRA21.1 and X_RANTES. To learn the effect of X_RANTES, the joint distribution is averaged over X_A1AT and X_CYFRA21.1. This is the basic idea of the partial dependence plot or partial dependence function relationship. Expressed as a formula, the partial dependence function relationship is given by Equation 3 below.

[수식 3][Equation 3]
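The averaging idea described above can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: `partial_dependence`, `toy_model`, and the demo values are hypothetical, and the toy linear function merely stands in for a fitted Boosting or Random Forest model.

```python
import numpy as np

def partial_dependence(predict, X, feature, grid):
    """Average the model output over the data with one feature fixed at each grid value."""
    X = np.asarray(X, dtype=float)
    values = []
    for v in grid:
        X_fixed = X.copy()
        X_fixed[:, feature] = v                 # hold the feature of interest at v
        values.append(predict(X_fixed).mean())  # average out the remaining features
    return np.array(values)

# Hypothetical stand-in for a fitted ensemble; columns are (A1AT, Cyfra21.1, RANTES).
def toy_model(X):
    return 0.5 * X[:, 0] + 1.0 * X[:, 1] - 2.0 * X[:, 2]

# Made-up sample values, for illustration only.
X_demo = np.array([[3.0, 0.2, 4.9],
                   [3.5, -1.0, 4.4],
                   [4.0, 0.4, 4.6]])
rantes_grid = np.array([4.3, 4.6, 4.9])

# Partial dependence of the toy model on RANTES (column index 2).
pd_rantes = partial_dependence(toy_model, X_demo, feature=2, grid=rantes_grid)
```

Because the toy model decreases with RANTES, `pd_rantes` decreases along the grid, which mirrors the direction described for the RANTES plot in this section.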

This is described with reference to FIGS. 5 to 10. FIG. 5 is the partial dependency plot of RANTES. The value of the function f obtained from the partial dependency plot is shown on the vertical axis, and the value of the explanatory variable on the horizontal axis. FIG. 6 is a boxplot of cancer patients and normal subjects. The boxplot shows that RANTES values are generally lower in the cancer patient group than in the normal group; that is, the smaller the RANTES value, the more likely the sample belongs to the cancer patient group. The partial dependency plot reflects this information. Its y-axis value represents the effect of the RANTES variable, and the y-axis value increases as the horizontal-axis value decreases. A larger y value can be interpreted as a higher likelihood of being classified as diseased. A partial dependence plot can be drawn for each explanatory variable X: the partial dependency plot and boxplot for Cyfra21.1 are FIGS. 7 and 8, and those for A1AT are FIGS. 9 and 10.

Next, it is explained how the explanatory variables transformed with the partial dependence plot or function are applied to a regression such as logistic regression or ridge regression.

Reflecting this property of the partial dependence plot or function, the value Y obtained by transforming the original value X through the partial dependency plot or function is defined as a new input variable, and this new variable becomes the input variable of the next step, the logistic model. In FIG. 9, a sample with an A1AT value of 3.0 takes the transformed value -1.5, and a sample with an A1AT value of 3.5 is transformed to 0.5.
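The transformation of a raw marker value into its partial dependence value can be sketched as a lookup on a tabulated curve. In the sketch below, only the two points (3.0, -1.5) and (3.5, 0.5) come from the text; the other grid points, the names `a1at_grid`, `a1at_pd`, and `transform`, and the use of linear interpolation are assumptions made for illustration.

```python
import numpy as np

# Hypothetical tabulated partial dependence curve for A1AT.
# Only (3.0, -1.5) and (3.5, 0.5) are taken from the text; the end points are made up.
a1at_grid = np.array([2.5, 3.0, 3.5, 4.0])
a1at_pd = np.array([-1.6, -1.5, 0.5, 1.4])

def transform(x, grid, pd_values):
    """Map raw marker values to partial dependence values by linear interpolation."""
    return np.interp(x, grid, pd_values)

# Two values quoted in the text plus one in-between value.
t_a1at = transform(np.array([3.0, 3.5, 3.25]), a1at_grid, a1at_pd)
```

`np.interp` returns the end value of the curve for inputs outside the grid, so an extreme raw measurement cannot produce an extreme transformed value. That behavior fits the stated purpose of limiting the influence of outliers.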

A regression model is, in general, a method of analyzing the influence of explanatory variables on a response variable, and its results can also be used for predicting a disease diagnosis. There are several regression models, including Lasso regression, Ridge regression, and Logistic regression. The logistic model, one of the classification methods, is used when the response variable is binary; it allows probabilities to be estimated and is easy to interpret. Each regression coefficient can be said to represent the influence (importance) of its variable. When a regression coefficient is greater than 0, the probability that Y equals 1 (the likelihood of being predicted as diseased) increases as X increases; when it is less than 0, the probability that Y equals 1 decreases as X increases. Because estimation of the regression coefficients in the logistic model may fail to converge, the probability is estimated using the ridge function, a regularization method. The regression coefficients using the ridge function are estimated as in Equation 4 below. The ridge estimator is the estimator that minimizes the error subject to a constraint on the size of the regression coefficient estimates.

[Equation 4]

The regression coefficients estimated in this way can be used to obtain the predicted probability of disease.
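The convergence issue and the role of the ridge penalty can be illustrated with a small sketch. Equation 4 is not reproduced in this text, so the code below uses a standard L2-penalized logistic regression fitted by gradient descent, which is the usual penalized counterpart of a constraint on the size of the coefficients. The function names, the penalty weight `lam`, the step size `lr`, and the toy data are all assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_ridge_logistic(X, y, lam=1.0, lr=0.1, n_iter=5000):
    """Minimize mean log-loss + lam * ||beta||^2 by gradient descent (intercept unpenalized)."""
    n, p = X.shape
    beta0, beta = 0.0, np.zeros(p)
    for _ in range(n_iter):
        resid = sigmoid(beta0 + X @ beta) - y        # derivative of log-loss w.r.t. the linear predictor
        beta0 -= lr * resid.mean()
        beta -= lr * (X.T @ resid / n + 2.0 * lam * beta)
    return beta0, beta

# Perfectly separable toy data: without a penalty the coefficient would grow without bound.
X_toy = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y_toy = np.array([0.0, 0.0, 1.0, 1.0])

b0_weak, b_weak = fit_ridge_logistic(X_toy, y_toy, lam=0.01)
b0_strong, b_strong = fit_ridge_logistic(X_toy, y_toy, lam=1.0)

# Predicted probability of Y = 1 for a new sample with x = 1.5.
p_new = sigmoid(b0_strong + np.array([1.5]) @ b_strong)
```

With the penalty in place both fits return finite coefficients, and the larger penalty gives the smaller coefficient.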

Next, the logistic regression model into which the estimated regression coefficients are directly substituted is given by Equation 5 below.

[Equation 5]

When the regression coefficients obtained in an actual example statistical model of the present invention are applied, Equation 5 becomes Equation 6 below.

[Equation 6]

To predict the probability of being classified as diseased (Yi = 1), the logistic regression model, where βj is the regression coefficient for marker j of sample xi, is given by Equation 7 below.

[Equation 7]
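Equation 7 is not reproduced in this text. A standard logistic form that matches the symbols named above (Yi, xi, βj) is sketched below as an assumption. The intercept β0, the number of markers K, and the notation x_ij for the transformed value of marker j in sample i are introduced here for illustration.

```latex
P(Y_i = 1 \mid x_i) = \frac{\exp\left(\beta_0 + \sum_{j=1}^{K} \beta_j x_{ij}\right)}{1 + \exp\left(\beta_0 + \sum_{j=1}^{K} \beta_j x_{ij}\right)}
```

Under this form a positive βj raises the predicted probability as x_ij increases and a negative βj lowers it, which agrees with the interpretation of the regression coefficients given earlier in this section.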

Substituting the regression coefficients estimated in the actual example statistical model into Equation 7 gives Equation 8 below.

[Equation 8]

In this way, the original variable value of each biomarker can be transformed for each sample using the partial dependency plot or function, and the transformed values can be used either to build a preset cancer/normal prediction statistical model or to run one. For every statistical model that uses a composite biomarker combination, the transformed biomarker values can thus be used to obtain a probability function for diagnosing lung cancer such as Equation 8.

Meanwhile, because a composite biomarker uses several biomarkers, it can be difficult to see readily which biomarker has how much influence. A technique is therefore needed that, alongside the predicted disease probability output by the lung cancer diagnosis model, makes the influence of each biomarker in the composite easy to see and to compare with the other biomarkers. For this reason, the coefficient plot (CP), an exploratory data analysis technique, was developed.

FIG. 11 shows one embodiment of the CP. The x-axis lists the biomarkers being compared, and the y-axis shows the degree of influence of each biomarker on the disease. FIG. 11 shows at a glance that Cyfra21.1 is an important variable causing the disease.

The degree of influence of each biomarker on the disease, as used in the CP, is calculated as follows. g(x) uses the new input variables transformed with the partial dependence plot. For a plurality of biomarkers with complexity K, the discriminant function obtained from the logistic model can be expressed as in Equation 9 below. Standardizing the new input variables, multiplying them by the beta coefficients, and plotting the resulting values makes it possible to gauge the degree of influence of each biomarker.

[Equation 9]

The method of generating the CP includes listing the individual biomarkers that make up the composite biomarker on the X axis (S61) and displaying the influence information of each individual biomarker on the Y axis (S62).
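The CP values described above (standardize the transformed inputs, then multiply by the beta coefficients) can be sketched as follows. The function name `coefficient_plot_values` and the coefficients in `beta_demo` are hypothetical, and the use of the population standard deviation is an assumption. The four input rows are the transformed values t(A1AT), t(CYFRA21.1), and t(RANTES) of the first four samples of Table 12 below, so the output is not expected to match Table 13, which is based on all samples and on the fitted coefficients.

```python
import numpy as np

def coefficient_plot_values(T, beta):
    """Standardize each transformed marker column, then scale it by its coefficient."""
    T = np.asarray(T, dtype=float)
    Z = (T - T.mean(axis=0)) / T.std(axis=0)   # column-wise z-scores
    return Z * np.asarray(beta, dtype=float)   # one value per sample and marker

# Transformed values of samples 163, 174, 205, 203 from Table 12.
T_demo = np.array([[1.07, 2.88, -0.63],
                   [-0.95, -1.50, 2.48],
                   [-0.95, 2.88, -0.88],
                   [-0.73, -1.50, -0.88]])
beta_demo = np.array([0.8, 1.0, 0.9])          # hypothetical coefficients, not from the patent

cp_values = coefficient_plot_values(T_demo, beta_demo)
```

Each row of `cp_values` gives the Y-axis values for one sample, with the three markers listed along the X axis as in steps S61 and S62.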

Hereinafter, the idea of the present invention is described in more detail through an example. For a composite biomarker combination consisting of A1AT, CYFRA21.1, and RANTES, Table 12 below shows, for each sample, the measured expression level of each biomarker and the value of that expression level after transformation through the partial dependency plot.

| Sample index | A1AT | CYFRA21.1 | RANTES | t(A1AT) | t(CYFRA21.1) | t(RANTES) |
|---|---|---|---|---|---|---|
| 163 | 3.57 | 0.21 | 4.87 | 1.07 | 2.88 | -0.63 |
| 174 | 2.88 | -1.94 | 4.33 | -0.95 | -1.50 | 2.48 |
| 205 | 2.97 | 0.37 | 4.98 | -0.95 | 2.88 | -0.88 |
| 203 | 3.33 | -2.00 | 4.95 | -0.73 | -1.50 | -0.88 |
| 152 | 3.38 | -0.91 | 4.93 | 0.13 | -1.33 | -0.88 |
| 130 | 3.36 | -2.00 | 4.71 | -0.47 | -1.50 | -0.17 |
| 229 | 3.21 | -2.00 | 4.88 | -0.90 | -1.50 | -0.63 |
| 156 | 3.07 | -1.26 | 4.78 | -0.95 | -1.34 | -0.62 |
| 168 | 3.20 | -1.83 | 5.02 | -0.90 | -1.27 | -0.86 |
| 228 | 3.31 | -2.00 | 5.03 | -0.73 | -1.50 | -0.86 |
| ... | ... | ... | ... | ... | ... | ... |
| 23 | 4.05 | 0.39 | 4.56 | 1.45 | 2.88 | 2.31 |
| 81 | 4.10 | 0.93 | 4.35 | 1.45 | 2.88 | 2.48 |
| 11 | 3.51 | 0.29 | 4.60 | 0.90 | 2.88 | 1.77 |
| 49 | 3.90 | -0.38 | 4.41 | 1.41 | 1.60 | 2.48 |
| 8 | 3.51 | 0.13 | 4.73 | 0.90 | 2.88 | -0.40 |
| 104 | 3.52 | 0.78 | 4.86 | 0.90 | 2.88 | -0.63 |
| 45 | 4.25 | -0.69 | 4.45 | 1.45 | -0.99 | 2.47 |
| 120 | 3.68 | 0.07 | 4.55 | 1.12 | 2.88 | 2.36 |
| 5 | 3.44 | -0.80 | 4.44 | 0.50 | -1.30 | 2.48 |
| 9 | 4.50 | -1.83 | 4.35 | 1.45 | -1.27 | 2.48 |
| 21 | 3.65 | -0.17 | 4.82 | 1.12 | 1.66 | -0.63 |
| ... | ... | ... | ... | ... | ... | ... |
| 74 | 4.34 | -0.27 | 4.70 | 1.45 | 1.66 | -0.01 |

Table 13 below shows, for each sample, the actual Y value (cancer patient or normal subject), the probability prob(Y=1) predicted by the cancer diagnosis model, the class predicted from that probability (cancer or normal), and the coefficient plot value generated for each biomarker of each sample (subject).

| Sample index | true y | Predicted probability | Predicted class | coeff_A1AT | coeff_Cyfra21.1 | coeff_RANTES |
|---|---|---|---|---|---|---|
| 163 | 0 | 0.95 | 1 | 0.59 | 3.09 | -1.29 |
| 174 | 0 | 0.38 | 0 | -1.67 | -2.37 | 1.90 |
| 205 | 0 | 0.69 | 1 | -1.68 | 2.28 | -1.58 |
| 203 | 0 | 0.01 | 0 | -1.07 | -2.35 | -1.59 |
| 152 | 0 | 0.03 | 0 | -0.33 | -2.16 | -1.21 |
| 130 | 0 | 0.03 | 0 | -1.08 | -1.78 | -0.70 |
| 229 | 0 | 0.01 | 0 | -1.23 | -2.35 | -1.28 |
| 156 | 0 | 0.02 | 0 | -1.67 | -2.18 | -0.96 |
| 168 | 0 | 0.01 | 0 | -1.62 | -1.57 | -1.56 |
| 228 | 0 | 0.01 | 0 | -1.07 | -2.35 | -1.57 |
| ... | ... | ... | ... | ... | ... | ... |
| 23 | 1 | 1.00 | 1 | 1.31 | 3.09 | 1.75 |
| 81 | 1 | 1.00 | 1 | 1.31 | 2.28 | 2.59 |
| 11 | 1 | 1.00 | 1 | 0.44 | 3.09 | 1.71 |
| 49 | 1 | 1.00 | 1 | 1.26 | 1.50 | 1.90 |
| 8 | 1 | 0.96 | 1 | 0.62 | 2.28 | -0.99 |
| 104 | 1 | 0.94 | 1 | 0.44 | 3.09 | -1.29 |
| 45 | 1 | 0.92 | 1 | 1.31 | -1.73 | 1.90 |
| 120 | 1 | 1.00 | 1 | 0.90 | 2.28 | 2.44 |
| 5 | 1 | 0.75 | 1 | 0.07 | -2.11 | 2.59 |
| 9 | 1 | 0.89 | 1 | 1.31 | -2.08 | 1.90 |
| 21 | 1 | 0.82 | 1 | 0.90 | 1.15 | -1.28 |
| ... | ... | ... | ... | ... | ... | ... |
| 74 | 1 | 0.93 | 1 | 0.95 | 1.58 | -0.51 |

As Table 13 shows, two samples that are not actually cancer were diagnosed as cancer, and no cancer sample was diagnosed as non-cancer, so the prediction accuracy is very high.
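The counts quoted above can be checked against the rows that Table 13 actually displays. The sketch below uses only the 22 listed rows (the elided rows are not available), so the resulting accuracy, sensitivity, and specificity describe this excerpt and not the full test set. The variable names are hypothetical.

```python
# True labels and predicted classes of the 22 rows displayed in Table 13:
# 10 normal subjects (true y = 0) followed by 12 cancer patients (true y = 1).
true_y = [0] * 10 + [1] * 12
pred_y = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0] + [1] * 12

pairs = list(zip(true_y, pred_y))
tp = sum(t == 1 and p == 1 for t, p in pairs)   # cancer predicted as cancer
tn = sum(t == 0 and p == 0 for t, p in pairs)   # normal predicted as normal
fp = sum(t == 0 and p == 1 for t, p in pairs)   # normal predicted as cancer
fn = sum(t == 1 and p == 0 for t, p in pairs)   # cancer predicted as normal

accuracy = (tp + tn) / len(pairs)
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
```

For these rows the two false positives are samples 163 and 205, and there are no false negatives.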

Next, a method of selecting composite biomarker combinations with high lung cancer diagnostic ability is described.

In the present invention, 8178 composite marker combinations of complexity 2 or higher were generated from 13 biomarkers. A cancer diagnosis model was generated for each biomarker combination and tested on 128 normal subjects (78 men and 50 women) and 121 lung cancer patients (78 men and 43 women); the evaluation indices (accuracy, sensitivity, and specificity) of each cancer diagnosis model from that test are given in Tables 14 to 24 below. Each tested cancer diagnosis model is a separate embodiment from the standpoint of the model, and all 8178 embodiments should in principle be listed. Because such a listing would take up too much space, and because a patent is an expression of an inventive idea, only representative embodiments are presented in table form. Each embodiment presented in a table has a cancer diagnosis model number; the model with that number is associated with the biomarker combination it uses, and the evaluation indices obtained by testing the model on the same 128 normal subjects and 121 lung cancer patients are listed alongside.

First, composite marker combinations of complexity 2 were generated by taking the 13 biomarkers two at a time. The 78 cancer diagnosis models corresponding to these combinations were generated, and evaluation indices were generated for each model. Each model was tested on 128 normal subjects (78 men and 50 women) and 121 lung cancer patients (78 men and 43 women), and some of the resulting evaluation indices (accuracy, sensitivity, and specificity) are given in Table 14 below.
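The model counts quoted in this section follow from counting subsets of the 13 biomarkers. The short check below is a sketch with hypothetical variable names. Reading the figure 8178 as the number of combinations of complexity 2 through 13 is an inference from the arithmetic, not a statement made in the source.

```python
from math import comb

n_markers = 13

# Number of distinct biomarker combinations at each complexity (subset size).
models_by_complexity = {k: comb(n_markers, k) for k in range(2, n_markers + 1)}

pairs = models_by_complexity[2]       # 78 models of complexity 2
triples = models_by_complexity[3]     # 286 models of complexity 3
total_models = sum(models_by_complexity.values())   # 8178 models of complexity 2 to 13
```

The total equals 2^13 minus the empty set and the 13 single-marker sets.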

Table 14 below shows the evaluation indices of the cancer diagnosis models corresponding to the complexity-2 composite biomarker combinations in the top 50% by accuracy.

| Cancer diagnosis model | Biomarker | Biomarker | Accuracy | Sensitivity | Specificity |
|---|---|---|---|---|---|
| 20 | A1AT | PAI-1 | 0.8795 | 0.8505 | 0.9085 |
| 14 | A1AT | CYFRA21-1 | 0.8723 | 0.8541 | 0.8906 |
| 16 | A1AT | RANTES | 0.8702 | 0.8430 | 0.8974 |
| 31 | CYFRA21-1 | PAI-1 | 0.8684 | 0.8469 | 0.8900 |
| 22 | A1AT | CEA | 0.8663 | 0.8308 | 0.9018 |
| 27 | CYFRA21-1 | RANTES | 0.8648 | 0.8708 | 0.8588 |
| 26 | CYFRA21-1 | IGF-1 | 0.8629 | 0.8213 | 0.9044 |
| 32 | CYFRA21-1 | TTR | 0.8626 | 0.8469 | 0.8782 |
| 21 | A1AT | TTR | 0.8620 | 0.8197 | 0.9044 |
| 18 | A1AT | AFP | 0.8618 | 0.8472 | 0.8765 |
| 42 | IGF-1 | TTR | 0.8597 | 0.8626 | 0.8568 |
| 23 | A1AT | CA19-9 | 0.8567 | 0.8216 | 0.8918 |
| 25 | A1AT | ApoA1 | 0.8563 | 0.8289 | 0.8838 |
| 15 | A1AT | IGF-1 | 0.8540 | 0.8384 | 0.8697 |
| 28 | CYFRA21-1 | proApoA1 | 0.8539 | 0.8331 | 0.8747 |
| 24 | A1AT | ApoA1/proApoA1 | 0.8534 | 0.7948 | 0.9121 |
| 35 | CYFRA21-1 | ApoA1/proApoA1 | 0.8533 | 0.8439 | 0.8626 |
| 36 | CYFRA21-1 | ApoA1 | 0.8508 | 0.8413 | 0.8603 |
| 37 | IGF-1 | RANTES | 0.8495 | 0.8590 | 0.8400 |
| 30 | CYFRA21-1 | EGFR | 0.8494 | 0.8390 | 0.8597 |
| 17 | A1AT | proApoA1 | 0.8492 | 0.7866 | 0.9118 |
| 40 | IGF-1 | EGFR | 0.8487 | 0.8397 | 0.8576 |
| 19 | A1AT | EGFR | 0.8484 | 0.8141 | 0.8826 |
| 29 | CYFRA21-1 | AFP | 0.8472 | 0.8338 | 0.8606 |
| 34 | CYFRA21-1 | CA19-9 | 0.8418 | 0.8256 | 0.8579 |
| 33 | CYFRA21-1 | CEA | 0.8380 | 0.8289 | 0.8471 |
| 59 | proApoA1 | TTR | 0.8348 | 0.8125 | 0.8571 |
| 47 | RANTES | proApoA1 | 0.8328 | 0.8197 | 0.8459 |
| 38 | IGF-1 | proApoA1 | 0.8312 | 0.8151 | 0.8474 |
| 66 | AFP | TTR | 0.8289 | 0.8272 | 0.8306 |
| 41 | IGF-1 | PAI-1 | 0.8278 | 0.8315 | 0.8241 |
| 43 | IGF-1 | CEA | 0.8203 | 0.8479 | 0.7926 |
| 46 | IGF-1 | ApoA1 | 0.8178 | 0.8344 | 0.8012 |
| 51 | RANTES | TTR | 0.8168 | 0.8180 | 0.8156 |
| 39 | IGF-1 | AFP | 0.8097 | 0.8459 | 0.7735 |
| 57 | proApoA1 | EGFR | 0.8092 | 0.7698 | 0.8485 |
| 65 | AFP | PAI-1 | 0.8089 | 0.8128 | 0.8050 |
| 54 | RANTES | ApoA1/proApoA1 | 0.8083 | 0.7780 | 0.8385 |
| 52 | RANTES | CEA | 0.8065 | 0.8030 | 0.8100 |

As Table 14 shows, among the 13 biomarkers, IGF-1, RANTES, A1AT, and Cyfra21-1 appear considerably more often than the other biomarkers. Among the complexity-2 models, only a minority have an accuracy above 85%, and none has an accuracy above 90%. The complexity-2 models thus include a number of cancer diagnosis models that could be adopted at the 85% level of the evaluation index.

Next, composite marker combinations of complexity 3 were generated by taking the 13 biomarkers three at a time. The 286 cancer diagnosis models corresponding to these combinations were generated, and evaluation indices were generated for each model.

Table 15 below shows the evaluation indices of the cancer diagnosis models corresponding to the top 30 composite biomarker combinations by accuracy.

| Cancer diagnosis model | Biomarker | Biomarker | Biomarker | Accuracy | Sensitivity | Specificity |
|---|---|---|---|---|---|---|
| 217 | IGF-1 | RANTES | TTR | 0.9034 | 0.9095 | 0.8974 |
| 232 | IGF-1 | AFP | TTR | 0.8959 | 0.9007 | 0.8912 |
| 103 | A1AT | IGF-1 | RANTES | 0.8957 | 0.8944 | 0.8971 |
| 117 | A1AT | RANTES | TTR | 0.8925 | 0.8833 | 0.9018 |
| 131 | A1AT | AFP | PAI-1 | 0.8924 | 0.8603 | 0.9244 |
| 145 | A1AT | PAI-1 | CA19-9 | 0.8899 | 0.8662 | 0.9135 |
| 261 | RANTES | proApoA1 | TTR | 0.8883 | 0.8793 | 0.8974 |
| 132 | A1AT | AFP | TTR | 0.8883 | 0.8636 | 0.9129 |
| 99 | A1AT | CYFRA21-1 | CEA | 0.8879 | 0.8682 | 0.9076 |
| 114 | A1AT | RANTES | AFP | 0.8875 | 0.8780 | 0.8971 |
| 173 | CYFRA21-1 | RANTES | CEA | 0.8872 | 0.8793 | 0.8950 |
| 107 | A1AT | IGF-1 | PAI-1 | 0.8862 | 0.8695 | 0.9029 |
| 116 | A1AT | RANTES | PAI-1 | 0.8857 | 0.8679 | 0.9035 |
| 108 | A1AT | IGF-1 | TTR | 0.8846 | 0.8810 | 0.8882 |
| 137 | A1AT | EGFR | PAI-1 | 0.8819 | 0.8597 | 0.9041 |
| 225 | IGF-1 | proApoA1 | TTR | 0.8811 | 0.8679 | 0.8944 |
| 150 | A1AT | TTR | ApoA1/proApoA1 | 0.8799 | 0.8485 | 0.9112 |
| 130 | A1AT | AFP | EGFR | 0.8796 | 0.8544 | 0.9047 |
| 119 | A1AT | RANTES | CA19-9 | 0.8794 | 0.8689 | 0.8900 |
| 113 | A1AT | RANTES | proApoA1 | 0.8793 | 0.8639 | 0.8947 |
| 146 | A1AT | PAI-1 | ApoA1/proApoA1 | 0.8789 | 0.8423 | 0.9156 |
| 147 | A1AT | PAI-1 | ApoA1 | 0.8782 | 0.8616 | 0.8947 |
| 120 | A1AT | RANTES | ApoA1/proApoA1 | 0.8777 | 0.8561 | 0.8994 |
| 215 | IGF-1 | RANTES | EGFR | 0.8772 | 0.8711 | 0.8832 |
| 110 | A1AT | IGF-1 | CA19-9 | 0.8762 | 0.8607 | 0.8918 |
| 143 | A1AT | PAI-1 | TTR | 0.8756 | 0.8518 | 0.8994 |
| 243 | IGF-1 | PAI-1 | TTR | 0.8747 | 0.8764 | 0.8729 |
| 164 | CYFRA21-1 | IGF-1 | CEA | 0.8738 | 0.8502 | 0.8974 |
| 115 | A1AT | RANTES | EGFR | 0.8733 | 0.8521 | 0.8944 |
| 134 | A1AT | AFP | CA19-9 | 0.8732 | 0.8574 | 0.8891 |

As Table 15 shows, the cancer diagnosis models whose accuracy exceeds 90%, or is extremely close to 90% (equal to 90% when rounded), include IGF-1 and RANTES among the 13 biomarkers.

Meanwhile, among the cancer diagnosis models with the highest evaluation indices, A1AT, Cyfra21-1, and TTR appear considerably more often than the other biomarkers.
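The frequency statements made about the ranked tables amount to counting how often each biomarker occurs among the top models. As a small sketch, the code below counts markers in the five most accurate complexity-3 models of Table 15. The cut-off of five models and the variable names are assumptions chosen to keep the example short; the same counting applies to a top-30 list.

```python
from collections import Counter

# Marker sets of the five most accurate complexity-3 models in Table 15
# (models 217, 232, 103, 117, 131).
top_models = [
    ("IGF-1", "RANTES", "TTR"),
    ("IGF-1", "AFP", "TTR"),
    ("A1AT", "IGF-1", "RANTES"),
    ("A1AT", "RANTES", "TTR"),
    ("A1AT", "AFP", "PAI-1"),
]

# Count how many of these models each biomarker takes part in.
marker_counts = Counter(marker for model in top_models for marker in model)
highest = max(marker_counts.values())
most_frequent = sorted(m for m, c in marker_counts.items() if c == highest)
```

Several markers can share the highest count, as they do here, so the mode is reported as a list.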

Next, composite marker combinations of complexity 4 were generated by taking the 13 biomarkers four at a time. The 715 cancer diagnosis models corresponding to these combinations were generated, and evaluation indices were generated for each model.

Table 16 below shows the evaluation indices of the cancer diagnosis models for the top 30 composite biomarker combinations by accuracy, and Table 17 below shows those for the combinations ranked 31st to 60th by accuracy.

As Table 16 shows, within the top 30 cancer diagnosis models of complexity 4, IGF-1 and RANTES are included 19 and 20 times, respectively, and are the most frequent of the 13 biomarkers. A1AT and TTR are also included many times.

As Table 17 shows, among the complexity-4 cancer diagnosis models ranked 31st to 60th, A1AT, IGF-1, and RANTES are included 19, 15, and 15 times, respectively, and are the most frequent of the 13 biomarkers.

As Tables 16 and 17 show, in cancer diagnosis models of complexity 4, IGF-1 and RANTES are likely to be the most important biomarkers, and A1AT and TTR are also likely to be important biomarkers.

As Tables 16 and 17 also show, the cancer diagnosis models ranked within roughly the top 40 have an accuracy of 90% when rounded.

Next, composite marker combinations of complexity 5 were generated by taking the 13 biomarkers five at a time. Cancer diagnosis models corresponding to these combinations were generated, and evaluation indices were generated for each model. The cancer diagnosis models in Tables 18 to 21 below were selected using an accuracy of 90% as the criterion.

Table 18 below shows the evaluation indices of the cancer diagnosis models for the top 30 composite biomarker combinations by accuracy, Table 19 those for the combinations ranked 31st to 60th, Table 20 those for the combinations ranked 61st to 90th, and Table 21 those for the combinations ranked 91st to 117th.

As Table 18 shows, within the top 30 cancer diagnosis models of complexity 5, IGF-1 and RANTES are included 23 and 27 times, respectively, and are the most frequent of the 13 biomarkers. A1AT and TTR are also included often, 15 and 22 times, respectively.

As Table 19 shows, among the complexity-5 cancer diagnosis models ranked 31st to 60th, IGF-1 and RANTES are included 17 and 27 times, respectively, and are the most frequent of the 13 biomarkers; A1AT and TTR are included 18 and 16 times, respectively.

As Table 20 shows, among the complexity-5 cancer diagnosis models ranked 61st to 90th, IGF-1 and RANTES are included 17 and 22 times, respectively, and are the most frequent of the 13 biomarkers; A1AT is included 16 times.

As Table 21 shows, among the complexity-5 cancer diagnosis models ranked 91st to 117th, A1AT, IGF-1, RANTES, and TTR, among others of the 13 biomarkers, are included many times.

As Tables 18 to 21 show, among the complexity-5 cancer diagnosis models ranked 1st to 117th, IGF-1 and RANTES are included 73 and 88 times, respectively, and are the most frequent of the 13 biomarkers; A1AT is included 65 times and TTR 64 times.

Next, composite marker combinations of complexity 6 were generated by taking the 13 biomarkers six at a time. Cancer diagnosis models corresponding to these combinations were generated, and evaluation indices were generated for each model. Table 22 below shows the evaluation indices of the cancer diagnosis models ranked within the top 30.

As Table 22 shows, among the cancer diagnosis models of complexity 6, RANTES is included in every model, A1AT and IGF-1 are each included 24 times, and Cyfra21-1 and TTR are each included 19 times.

이어, 본 발명에서는 13개의 바이오마커에 대하여 7개씩 쌍으로 복합도 7인 복합 마커 조합을 생성하였다. 생성된 바이오마커 조합에 대응하는 암 진단 모델을 생성하고, 생성된 암 진단 모델별로 평가 지표를 생성하였다. 하기 표 23은 상위 30위 내에 포함되는 암 진단 모델의 평가 지표를 보여 주고 있다.
Subsequently, in the present invention, a compound marker combination of compound degree 7 was generated in pairs of 7 for 13 biomarkers. A cancer diagnostic model corresponding to the generated biomarker combination was generated, and an evaluation index was generated for each generated cancer diagnostic model. Table 23 below shows the evaluation index of the cancer diagnostic model included in the top 30.

As can be seen from Table 23, among the complexity-7 cancer diagnosis models, IGF-1 and RANTES are included in nearly all models (29 and 30 times, respectively), and Cyfra21-1 and TTR, among others, are each included 24 times.
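The inclusion counts reported above can be obtained by tallying how often each biomarker occurs across the top-ranked models. The following is a minimal sketch of that tally; the function name and the three-model list are hypothetical illustrations and do not reproduce the contents of Tables 22 or 23.

```python
from collections import Counter


def marker_frequencies(top_models):
    """Count in how many top-ranked models each biomarker appears."""
    counts = Counter()
    for markers in top_models:
        # set() guards against a marker being listed twice in one model
        counts.update(set(markers))
    return counts


# Hypothetical top-ranked combinations (illustrative only, not patent data)
hypothetical_top_models = [
    ("IGF-1", "RANTES", "A1AT"),
    ("RANTES", "TTR", "CYFRA21-1"),
    ("IGF-1", "RANTES", "TTR"),
]

frequencies = marker_frequencies(hypothetical_top_models)
most_frequent = frequencies.most_common(1)[0]
```

In this toy list RANTES occurs in every model, which mirrors the kind of observation made in the text, but the numbers carry no evidential weight.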

Meanwhile, as Table 23 also shows, the evaluation indices become increasingly saturated as the complexity approaches 6 to 7.

The inventors also generated cancer diagnosis models for composite marker combinations of complexity 8, 9, 10, 11 and 12 (taking the 13 biomarkers eight, nine, ten, eleven and twelve at a time, respectively) and for the combination containing all 13 biomarkers, and generated evaluation indices for each model.
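To make the size of this search concrete, the sketch below enumerates the combinations for a given complexity and counts the models per complexity. It is an illustration only: the 13 marker names are those that appear in the excluded-biomarker column of Table 24, and the function name is hypothetical.

```python
from itertools import combinations
from math import comb

# The 13 biomarkers as named in the excluded-biomarker column of Table 24
BIOMARKERS = [
    "IGF-1", "RANTES", "A1AT", "CYFRA21-1", "proApoA1", "AFP", "EGFR",
    "PAI-1", "TTR", "CEA", "CA19-9", "ApoA1", "ApoA1/proApoA1",
]


def combinations_of_complexity(k):
    """Return every biomarker combination containing exactly k markers."""
    return list(combinations(BIOMARKERS, k))


# Number of candidate models at each complexity from 1 to 13
counts = {k: comb(len(BIOMARKERS), k) for k in range(1, len(BIOMARKERS) + 1)}
total_models = sum(counts.values())
models_up_to_11 = sum(counts[k] for k in range(1, 12))
```

Complexities 6 and 7 each yield 1,716 combinations, and all complexities together yield 8,191. Complexities 1 to 11 account for 8,177 of these, so the model numbers 8178 to 8190 in Table 24 would be consistent with models numbered sequentially by complexity, although that numbering scheme is an inference rather than something the text states.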

Of the results for complexities 8 to 12, part of the results for complexity 12 is shown in Table 24 below. The evaluation indices tend to improve as complexity increases, but it does not necessarily hold that higher complexity always saturates or improves the evaluation indices. Table 24 shows such an example.

Cancer diagnosis model   Biomarker excluded from the 13   Accuracy   Sensitivity   Specificity
8178                     ApoA1                            0.9047     0.9062        0.9032
8179                     ApoA1/proApoA1                   0.9059     0.9023        0.9094
8180                     CA19-9                           0.9026     0.9023        0.9029
8181                     CEA                              0.9        0.897         0.9029
8182                     TTR                              0.9037     0.8997        0.9076
8183                     PAI-1                            0.8976     0.8941        0.9012
8184                     EGFR                             0.9074     0.9059        0.9088
8185                     AFP                              0.9026     0.8987        0.9065
8186                     proApoA1                         0.9055     0.901         0.91
8187                     RANTES                           0.8895     0.8852        0.8938
8188                     IGF-1                            0.8917     0.8928        0.8906
8189                     CYFRA21-1                        0.8991     0.8967        0.9015
8190                     A1AT                             0.9002     0.8931        0.9074

As can be seen from Tables 15 to 24, for each biomarker combination candidate in the biomarker combination candidate group, lung cancer diagnostic ability can be compared against the individual biomarkers constituting the candidate or against the other combination candidates (S22). The comparison can be made using the evaluation indices. Among the candidates, biomarker combinations whose lung cancer diagnostic ability meets or exceeds a preset criterion are selected (S23); the preset criterion may differ depending on which evaluation index is used for the selection. In lung cancer diagnosis, specificity can be an important evaluation index, and the area under the ROC curve can also be an effective one.
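The comparison (S22) and selection (S23) steps can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the confusion-matrix counts and the 0.90 threshold are hypothetical, while the three specificity values are copied from rows 8184, 8187 and 8188 of Table 24.

```python
def evaluation_indices(tp, fn, tn, fp):
    """Compute accuracy, sensitivity and specificity from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "sensitivity": tp / (tp + fn),   # true-positive rate
        "specificity": tn / (tn + fp),   # true-negative rate
    }


def select_candidates(candidates, index, threshold):
    """Keep the candidates whose chosen evaluation index meets the criterion."""
    return [name for name, scores in candidates.items()
            if scores[index] >= threshold]


# Hypothetical counts, used only to show the arithmetic
example = evaluation_indices(tp=90, fn=10, tn=80, fp=20)

# Specificity values taken from Table 24 (models 8184, 8187, 8188)
table_24_subset = {
    8184: {"specificity": 0.9088},
    8187: {"specificity": 0.8938},
    8188: {"specificity": 0.8906},
}
# The 0.90 cut-off is an assumed criterion, not one stated in the text
selected = select_candidates(table_24_subset, "specificity", 0.90)
```

Changing the `index` argument to another evaluation index changes which candidates survive, which is why the preset criterion depends on the index chosen.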

Alternatively, one or more biomarkers may be selected from the first biomarker group (S31) and one or more biomarkers from the second biomarker group (S32); a biomarker combination candidate group containing at least one biomarker combination of two or more biomarkers is then formed (S33), and lung cancer diagnostic ability is compared against the individual biomarkers constituting each candidate or against the other combination candidates (S34).

The present invention provides a kit for lung cancer diagnosis and screening that contains, in combination, two or more antibodies capable of specifically binding to the 13 biomarkers.

In a specific embodiment of the present invention, 13 proteins whose expression levels change significantly in the serum of lung cancer patients were selected as biomarkers for lung cancer diagnosis and screening (see Table 5), and it was confirmed that classification models built from combinations of these 13 biomarkers classify lung cancer with higher accuracy. Accordingly, for use in quantifying a composite biomarker whose expression differs between lung cancer patients and healthy individuals, the kit of the present invention may include an antibody capable of specifically binding to each biomarker constituting the composite biomarker.

The kit distinguishes whether or not a patient has lung cancer, which not only enables practitioners such as physicians to diagnose and screen for lung cancer but also makes it possible to monitor the patient's response to treatment and modify the treatment accordingly. It can also be used to identify compounds that modulate the expression of one or more biomarkers in vivo or ex vivo in a lung cancer model (e.g., an animal model such as a mouse or rat). For this purpose, the biomarkers of the present invention may additionally be included in the kit as reference standards.

Antibodies that can be used in the kit of the present invention include polyclonal antibodies, monoclonal antibodies and fragments capable of binding an epitope.

Polyclonal antibodies can be produced by the conventional method of injecting any one of the 13 proteins into an animal and collecting blood from the animal to obtain serum containing the antibody. Such polyclonal antibodies can be purified by any method known in the art and can be raised in a host of any animal species, such as goat, rabbit, sheep, monkey, horse, pig, cow or dog.

Monoclonal antibodies can be prepared using any technique that provides for the production of antibody molecules by continuous cell lines in culture. Such techniques include, but are not limited to, the hybridoma technique, the human B-cell hybridoma technique and the EBV-hybridoma technique (Kohler G et al., Nature 256:495-497, 1975; Kozbor D et al., J Immunol Methods 81:31-42, 1985; Cote RJ et al., Proc Natl Acad Sci 80:2026-2030, 1983; and Cole SP et al., Mol Cell Biol 62:109-120, 1984).

Antibody fragments containing a specific binding site for any one of the 13 proteins can also be prepared. For example, and without limitation, F(ab')2 fragments can be prepared by digesting the antibody molecule with pepsin, and Fab fragments can be prepared by reducing the disulfide bridges of F(ab')2 fragments. Alternatively, a Fab expression library can be constructed so that monoclonal Fab fragments with the desired specificity can be identified quickly and easily (Huse WD et al., Science 254:1275-1281, 1989).

The antibody may be bound to a solid substrate to facilitate subsequent steps such as washing or separation of the complex. Examples of solid substrates include synthetic resin, nitrocellulose, glass substrates, metal substrates, glass fiber, paramagnetic beads, microspheres and microbeads. Synthetic resins include polyester, polyvinyl chloride, polystyrene, polypropylene, PVDF and nylon. In a specific embodiment of the present invention, to bind an antibody that specifically binds a protein to a solid substrate, microspheres were suspended, transferred to a microtube, centrifuged to remove the supernatant, and resuspended; they were then treated in turn with N-hydroxy-sulfosuccinimide and 1-ethyl-3-(3-dimethylaminopropyl)-carbodiimide hydrochloride, centrifuged to remove the supernatant, washed and stored. In addition, when a sample obtained from a patient is brought into contact with a solid-substrate-bound antibody capable of specifically binding to any one of the 13 proteins of the present invention, the sample may be diluted to an appropriate degree before contact with the antibody.

The kit of the present invention may further include a detection antibody that specifically binds to the biomarker. The detection antibody may be a conjugate labeled with a detectable label such as a chromogenic enzyme, a fluorescent substance, a radioisotope or a colloid, and is preferably a primary antibody capable of specifically binding to the biomarker. For example, the chromogenic enzyme may be a peroxidase (e.g., horseradish peroxidase), alkaline phosphatase or acid phosphatase. Usable fluorescent substances include fluorescein carboxylic acid (FCA), fluorescein isothiocyanate (FITC), fluorescein thiourea (FTH), 7-acetoxycoumarin-3-yl, fluorescein-5-yl, fluorescein-6-yl, 2',7'-dichlorofluorescein-5-yl, 2',7'-dichlorofluorescein-6-yl, dihydrotetramethylrosamine-4-yl, tetramethylrhodamine-5-yl, tetramethylrhodamine-6-yl, 4,4-difluoro-5,7-dimethyl-4-bora-3a,4a-diaza-s-indacene-3-ethyl, 4,4-difluoro-5,7-diphenyl-4-bora-3a,4a-diaza-s-indacene-3-ethyl, Cy3, Cy5, poly-L-lysine-fluorescein isothiocyanate (FITC), rhodamine-B-isothiocyanate (RITC), rhodamine and phycoerythrin (PE).

The kit of the present invention may also further include (1) a detection antibody that specifically binds to the biomarker and (2) a ligand capable of specifically binding to the detection antibody. Such ligands include protein A and secondary antibodies that specifically bind to the detection antibody. The ligand may be a conjugate labeled with a detectable label such as a chromogenic enzyme, a fluorescent substance, a radioisotope or a colloid. For use with the ligand, the detection antibody is preferably a primary antibody that has been biotinylated or treated with digoxigenin, although the treatment of the detection antibody is not limited to these. Streptavidin, avidin or the like is preferably used as the ligand for binding to the detection antibody, but the ligand is not limited to these. In a specific embodiment of the present invention, streptavidin carrying a fluorescent substance as the label was used as the ligand, together with a detection antibody biotinylated for that ligand.

With the diagnostic and screening kit of the present invention, lung cancer can be diagnosed and screened by treating the antibody-biomarker complex with the detection antibody and then determining the amount of detection antibody. Alternatively, the antibody-biomarker complex may be treated sequentially with the detection antibody and the ligand, after which the amount of detection antibody is determined. In a preferred embodiment of the present invention, the amount of the biomarker can be measured by incubating the detection antibody with the washed antibody-biomarker complex, washing, and then measuring the detection antibody. The amount or presence of the detection antibody can be determined by fluorescence, luminescence, chemiluminescence, absorbance, reflection or transmission.

As a method of determining the amount of the detection antibody or ligand, a high-throughput screening (HTS) system is preferably used. Preferred, but non-limiting, methods include: the fluorescence method, in which a fluorescent substance is attached as the label and fluorescence is detected; the radiometric method, in which a radioisotope is attached as the label and radiation is detected; the surface plasmon resonance (SPR) method, which measures changes in surface plasmon resonance in real time without any label; and the surface plasmon resonance imaging (SPRI) method, which images the SPR system.

For example, in the fluorescence method the detection antibody is labeled with a fluorescent substance and spotted, and the signal is read using a fluorescence scanner program; the degree of binding can be determined in this way. The fluorescent substance is preferably, but not necessarily, any one selected from the group consisting of Cy3, Cy5, poly-L-lysine-fluorescein isothiocyanate (FITC), rhodamine-B-isothiocyanate (RITC), rhodamine and phycoerythrin (PE). Unlike the fluorescence method, the SPR system can analyze the degree of antibody binding in real time without labeling the sample with a fluorescent substance, but it has the disadvantage that multiple samples cannot be analyzed simultaneously. SPRI allows simultaneous analysis of multiple samples by using a microarray (micro-alignment) method, but has the disadvantage of low detection intensity.

The diagnostic and screening kit of the present invention may further include a substrate that undergoes a color reaction with the enzyme, and a washing solution or eluent that removes unbound proteins and the like while retaining only the bound biomarker. Samples used for analysis include biological samples in which disease-specific polypeptides distinguishable from the normal state can be identified, such as serum, urine, tears and saliva. Measurement is preferably made on a biological fluid sample, for example blood, serum or plasma, and more preferably on serum. The sample may be prepared so as to increase the detection sensitivity for the biomarker; for example, a serum sample obtained from a patient may be pretreated by methods such as anion exchange chromatography, affinity chromatography, size exclusion chromatography, liquid chromatography, sequential extraction or gel electrophoresis, but pretreatment is not limited to these.

In addition, the present invention provides a biochip for lung cancer diagnosis and screening in which a biomolecule capable of specifically binding to any one of the 13 proteins is immobilized on a solid substrate.

In a specific embodiment of the present invention, 13 proteins whose expression levels change significantly in the serum of lung cancer patients were selected (see Table 5), and it was confirmed that classification models built from combinations of at least two of these 13 proteins classify lung cancer with higher accuracy. Accordingly, for use in measuring one or more of the 13 proteins whose expression differs between lung cancer patients and healthy individuals, the biochip of the present invention may include an antibody capable of specifically binding to any one of the 13 proteins, or a combination of two or more such specific antibodies.

The biomolecule is selected from the group consisting of low-molecular-weight compounds, ligands, aptamers, peptides, polypeptides, specific binding proteins, polymeric substances and antibodies; any substance capable of specifically binding to the protein may be used. Antibodies or aptamers are preferred, but the biomolecule is not limited to these.

The antibody is preferably a polyclonal or monoclonal antibody, and more preferably a monoclonal antibody. Antibodies that specifically bind to the protein may be produced by methods known to those skilled in the art, or commercially available antibodies may be purchased and used. The antibody can be produced by injecting the immunogenic protein into an external host according to conventional methods known to those skilled in the art. External hosts include mammals such as mice, rats, sheep and rabbits. The immunogen is injected intramuscularly, intraperitoneally or subcutaneously, and can generally be administered together with an adjuvant to increase antigenicity. Blood is collected from the external host at regular intervals, and serum showing improved titer and specificity for the antigen is harvested so that the antibody can be isolated.

The solid substrate of the biochip of the present invention may be selected from the group consisting of plastic, glass, metal and silicon; preferably, but without limitation, its surface is chemically treated or carries linker molecules so that the antibody can be attached. With the biochip of the present invention, lung cancer can be diagnosed and screened easily and accurately by collecting total protein from a sample and reacting it with the biochip.

The active group coated on the substrate of the biochip serves to bind the biomolecule and may be selected from the group consisting of amine, aldehyde, carboxyl and thiol groups. Any active group known to those skilled in the art to be capable of binding protein molecules to a substrate may be used, and the active group is not limited to those listed.

Fig. 14 shows an exemplary configuration of the lung cancer diagnosis system.

The lung cancer diagnosis system performs lung cancer diagnosis using the diagnostic kit directly or using information derived or read from the diagnostic kit. The system may include an information acquisition module that obtains expression level information or expression level ratio information for each biomarker constituting the biomarker combination, measured from blood, plasma, serum or other material collected from the subject's body; a lung cancer diagnosis module that processes the obtained expression level information or expression level ratio information with a preset lung cancer diagnosis model; and a lung cancer diagnosis information generation module that generates at least one item of lung cancer diagnosis information from the lung cancer diagnosis module. The lung cancer diagnosis module further includes at least one preset transformation module for the expression level information or expression level ratio information; the transformation module first generates transformed expression level information from the expression level information, or transformed expression level ratio information from the expression level ratio information.

The lung cancer diagnosis model receives the transformed expression level information or transformed expression level ratio information as its input values. The transformation module generates this transformed information using the partial dependence plot, or the partial dependency function relationship, of a tree-based ensemble technique, as described above. The lung cancer diagnosis model is a logistic model, which takes the transformed expression level information or transformed expression level ratio information as input and estimates the probability that the sample is classified as lung cancer.
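The final step, in which a logistic model turns transformed values into a probability, can be sketched as below. This is an illustration under stated assumptions and not the patented model: the coefficient values, the intercept and the function name are hypothetical, and the inputs are assumed to have already been transformed by the transformation module.

```python
import math


def logistic_probability(transformed, coefficients, intercept):
    """Estimate the probability of the lung cancer class with a logistic model.

    transformed  -- already-transformed expression values, keyed by biomarker
    coefficients -- one weight per biomarker (hypothetical values below)
    """
    linear_score = intercept + sum(
        coefficients[name] * value for name, value in transformed.items()
    )
    # Logistic (sigmoid) link maps the linear score into the interval (0, 1)
    return 1.0 / (1.0 + math.exp(-linear_score))


# Hypothetical weights and inputs, for illustration only
hypothetical_coefficients = {"IGF-1": -1.2, "RANTES": 0.8}
hypothetical_intercept = 0.1

neutral = logistic_probability({"IGF-1": 0.0, "RANTES": 0.0}, hypothetical_coefficients, 0.0)
low = logistic_probability({"IGF-1": 1.0, "RANTES": 0.0}, hypothetical_coefficients, hypothetical_intercept)
high = logistic_probability({"IGF-1": 0.0, "RANTES": 1.0}, hypothetical_coefficients, hypothetical_intercept)
```

The sign and size of each weight indicate the direction and strength with which a biomarker moves the estimated probability, which is the kind of information a coefficient plot displays.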

The CP information generation unit (1310) of the lung cancer diagnosis information generation module (1300) additionally generates information on each biomarker's contribution to disease diagnosis. This contribution is produced as a coefficient plot showing the degree to which each biomarker in the biomarker combination influences lung cancer, using a preset discriminant function obtained from the logistic model.

The information acquisition module may obtain the expression level information or expression level ratio information for each biomarker in several ways: the lung cancer diagnosis system may obtain it directly from the diagnostic kit; it may be transmitted from a third system that can read the per-biomarker expression level information of the diagnostic kit and is connected to the lung cancer diagnosis system over a wired or wireless network; or it may be transmitted from the computer of a person who has obtained the per-biomarker expression level information, connected to the lung cancer diagnosis system over a wired or wireless network. If the lung cancer diagnosis system can read the biomarker expression level information of the diagnostic kit directly, it obtains the information directly from the kit. If it cannot, the information can be received from a third system, such as a machine, device or instrument, that reads the expression level information. If the third system and the lung cancer diagnosis system are not connected by a wired or wireless network, or cannot exchange information directly, the expression level information read directly or indirectly can be sent to the lung cancer diagnosis system over a wired or wireless network from the computer of the person who read it.

The lung cancer diagnosis system obtains expression level information or expression level ratio information for each biomarker constituting the biomarker combination, measured from blood, plasma, serum or other material collected from the subject's body (S41); processes the obtained information with a lung cancer diagnosis module containing a preset lung cancer diagnosis model (S42); and generates at least one item of lung cancer diagnosis information from the lung cancer diagnosis module (S43).

The lung cancer diagnosis system may also store multiple lung cancer diagnosis models in its lung cancer diagnosis model unit and provide a diagnosis service to users of many different lung cancer diagnostic biomarker combinations. For example, if hospital A performs lung cancer diagnosis with a kit based on the composite biomarker a+b+c+d and hospital B with a kit based on the composite biomarker a+c+e+f, the biomarker combination differs between kits, so different lung cancer diagnosis models must be used. In this case, the information received by the lung cancer diagnosis system must include the sample ID and the expression level information for each biomarker. The lung cancer diagnosis model selection unit extracts the biomarker combination used in the diagnostic kit from the set of biomarkers for which expression level information was received, and uses the extracted combination to decide which lung cancer diagnosis model to select. That is, diagnosis for hospital A uses the lung cancer diagnosis model associated with the a+b+c+d composite biomarker, and diagnosis for hospital B uses the model associated with the a+c+e+f composite biomarker.
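The model selection described above amounts to a lookup keyed by the set of biomarkers present in the incoming record. A minimal sketch follows; the registry contents, model names, sample values and function name are hypothetical, and the marker labels a to f are the placeholders used in the hospital example.

```python
def select_model(expression_by_marker, model_registry):
    """Pick the diagnosis model matching the biomarkers present in a record."""
    # The combination is inferred from which markers carry expression values
    combination = frozenset(expression_by_marker)
    try:
        return model_registry[combination]
    except KeyError:
        raise LookupError(
            f"no model registered for combination {sorted(combination)}"
        ) from None


# Hypothetical registry: one stored model per biomarker combination
model_registry = {
    frozenset("abcd"): "model_abcd",   # kit used by hospital A
    frozenset("acef"): "model_acef",   # kit used by hospital B
}

# Hypothetical records; a real record would also carry the sample ID
record_a = {"a": 1.3, "b": 0.7, "c": 2.1, "d": 0.4}
record_b = {"a": 0.9, "c": 1.8, "e": 0.2, "f": 3.0}

model_for_a = select_model(record_a, model_registry)
model_for_b = select_model(record_b, model_registry)
```

Using a `frozenset` as the key makes the lookup independent of the order in which the markers arrive, and an unregistered combination fails loudly instead of silently falling back to a mismatched model.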

The present invention can be utilized in the medical industry, the medical information processing industry, and industries related to cancer diagnosis and prevention.

1000: lung cancer diagnosis system
1100: information acquisition module
1200: lung cancer diagnosis module
1210: conversion module
1211: partial dependency plot / functional relationship generation unit
1220: lung cancer diagnosis model generation unit
1221: lung cancer diagnosis model unit
1300: lung cancer diagnosis information generation module
1310: CP information generation unit
1320: lung cancer diagnosis model selection unit
2000: biomarker expression level information provider side
2100: diagnostic kit
2200: third-party system capable of reading the expression level information for each biomarker from the diagnostic kit
2300: computer of a party that obtains the expression level information for each biomarker
3000: wired/wireless network

Claims

1. A method for using combined biomarker information for lung cancer diagnosis in a lung cancer diagnosis system, wherein the lung cancer diagnosis system performs:
(A) obtaining measured expression level information, from blood, plasma, serum or other material taken from the body of a subject undergoing lung cancer diagnosis, for each of at least one first biomarker selected from a first biomarker group consisting of the individual biomarkers IGF-1 and RANTES, and for each second biomarker of a second biomarker group drawn from the individual biomarkers A1AT, CYFRA21-1, proApoA1, AFP, EGFR, PAI-1, TTR, CEA, CA19-9 and ApoA1;
(B) processing the expression level information for each biomarker of the first biomarker group and the expression level information for each biomarker of the second biomarker group, and inputting the processed information into a preset lung cancer determination model; and
(C) generating lung cancer determination information from the lung cancer determination model.

2. The method of claim 1,
wherein, when the second biomarker group includes expression level information for both ApoA1 and proApoA1, the processing of the expression level information for each biomarker in step (B) generates a ratio between the ApoA1 expression level and the proApoA1 expression level, and the ApoA1 expression level, the proApoA1 expression level, and the ratio are input into the lung cancer determination model.
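The ratio feature recited in this claim can be illustrated with a short sketch. The claim does not state which value is the numerator, so dividing ApoA1 by proApoA1 is an assumption here, as are the function name `add_apo_ratio`, the feature key, and the choice to skip the ratio when proApoA1 is zero.

```python
from typing import Dict


def add_apo_ratio(expression: Dict[str, float]) -> Dict[str, float]:
    """Return a copy of the inputs with an ApoA1/proApoA1 ratio added.

    The ratio is added only when both markers are present. The original
    ApoA1 and proApoA1 values are kept, so the model receives all three.
    """
    features = dict(expression)
    apoa1 = features.get("ApoA1")
    pro_apoa1 = features.get("proApoA1")
    # Assumption: ratio direction is ApoA1 / proApoA1; a zero denominator
    # is treated as "ratio unavailable" rather than raising an error.
    if apoa1 is not None and pro_apoa1 is not None and pro_apoa1 != 0:
        features["ApoA1/proApoA1"] = apoa1 / pro_apoa1
    return features
```

For example, `add_apo_ratio({"ApoA1": 4.0, "proApoA1": 2.0})` yields the two original values plus a ratio of 2.0, while a record without proApoA1 is returned unchanged.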

3. The method of claim 1,
wherein the processing of the expression level information for each biomarker generates converted expression level information for each biomarker by using a partial dependency plot, or a partial dependency functional relationship, of an ensemble method based on decision trees.
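As a rough illustration of a partial dependence based conversion, the sketch below replaces each raw expression level with the ensemble's average prediction when that biomarker is fixed at the observed level and the other biomarkers vary over a reference data set. Everything concrete here is assumed for illustration: the toy ensemble of one-split trees, the thresholds, and the function names are hypothetical and do not reproduce the models of the specification.

```python
from statistics import fmean
from typing import Callable, Dict, List

Row = Dict[str, float]


def stump(feature: str, threshold: float, low: float, high: float) -> Callable[[Row], float]:
    """One-split decision tree: returns `high` above the threshold, else `low`."""
    return lambda row: high if row[feature] > threshold else low


# Hypothetical toy ensemble standing in for a fitted tree ensemble.
ENSEMBLE = [
    stump("CEA", 5.0, -1.0, 1.0),
    stump("CEA", 10.0, 0.0, 2.0),
    stump("TTR", 200.0, 1.0, -1.0),
]


def ensemble_predict(row: Row) -> float:
    """Average the outputs of all trees in the ensemble."""
    return fmean([tree(row) for tree in ENSEMBLE])


def partial_dependence(feature: str, value: float, data: List[Row]) -> float:
    """Mean prediction over `data` with `feature` forced to `value`."""
    return fmean([ensemble_predict({**row, feature: value}) for row in data])


def convert(row: Row, data: List[Row]) -> Row:
    """Map each raw level to its partial dependence value."""
    return {name: partial_dependence(name, level, data) for name, level in row.items()}
```

With a reference set of two rows, `{"CEA": 3.0, "TTR": 150.0}` and `{"CEA": 12.0, "TTR": 250.0}`, a CEA level of 12.0 converts to a higher value than a CEA level of 3.0, because both CEA splits in the toy ensemble fire at 12.0.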

4. The method of claim 1,
wherein the lung cancer determination model is a logistic regression model.

5. The method of claim 4,
wherein the logistic regression model uses a ridge penalty function.
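For readers unfamiliar with the term, a logistic regression with a ridge penalty minimizes the mean log-loss plus (lam / 2) times the squared norm of the coefficients. The sketch below fits such a model by plain gradient descent. The claims do not specify a fitting procedure, so the optimizer, the unpenalized intercept, the function names, and the default values of `lam`, `lr` and `epochs` are all assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_ridge_logistic(
    X: Sequence[Sequence[float]],
    y: Sequence[int],
    lam: float = 0.1,
    lr: float = 0.1,
    epochs: int = 2000,
) -> Tuple[List[float], float]:
    """Minimize mean log-loss + (lam / 2) * ||w||^2 by gradient descent."""
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * p, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err
            for j in range(p):
                grad_w[j] += err * xi[j]
        # Assumption: the intercept is not penalized.
        b -= lr * grad_b / n
        # The ridge term contributes lam * w to the gradient.
        w = [wj - lr * (gw / n + lam * wj) for wj, gw in zip(w, grad_w)]
    return w, b


def predict_proba(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Predicted probability of the positive class for one sample."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))
```

A larger `lam` shrinks the coefficients toward zero, which is the usual motivation for a ridge penalty when several correlated biomarkers enter the same model.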