KR20200057664A

KR20200057664A - Gene expression marker screening method using neural network based on gene selection algorithm

Info

Publication number: KR20200057664A
Application number: KR1020190147762A
Authority: KR
Inventors: 강근수; 박성수; 신봉근
Original assignee: 단국대학교 천안캠퍼스 산학협력단; 디어젠 주식회사
Priority date: 2018-11-16
Filing date: 2019-11-18
Publication date: 2020-05-26
Also published as: KR102376212B1

Abstract

The present invention relates to a gene expression marker screening method using a neural network based gene selection algorithm. According to the present invention, the method comprises: a step of collecting each biopsy tissue from a plurality of patients and collecting a plurality of gene expression information experimentally measured from each collected biopsy tissue; a step of calculating a discriminative index (DI) for each gene by applying the collected plural gene expression information to a pre-trained neural network based gene selection algorithm; a step of listing the genes according to the calculated discriminative index (DI) and selecting a plurality of specific genes having a large discriminative index (DI) from the listed genes; and a step of predicting whether or not cancer occurs by using expression values of the selected specific genes. According to the present invention, genes are ranked on the basis of the discriminative index (DI), which indicates the classification ability to distinguish the normal group from the cancer patient group, and the optimal gene set among the highest ranking genes through the list of genes ranked by the discriminative index (DI) can be selected.

Description

Gene expression marker screening method using neural network based on gene selection algorithm

The present invention relates to a method for selecting gene expression markers using a neural network-based gene selection algorithm, and more specifically to a method that uses such an algorithm to select, from gene expression information, gene expression markers associated with a plurality of cancers.

Next-generation sequencing (NGS), also called massively parallel sequencing, parallelizes sequencing on a large scale to increase the volume of sequence data produced.

NGS is applied in many research fields because it converts molecular information into numerical values. However, an NGS-based approach requires choosing appropriate genes (or loci) to direct the next step of a given study. For the human genome, for example, choosing reasonable genes (features) from a list of expression levels for more than about 50,000 genes (or up to 190,000 transcripts) has become a major bottleneck.

Many researchers select genes from a list of differentially expressed genes (DEGs) produced by a DEG identification algorithm, using a multiple-testing-adjusted p-value of 0.05 or less as the threshold. However, as the number of samples grows, the number of DEGs can rise into the thousands. This has created a need for a method that automatically recommends an ideal gene set as biomarker candidates.

Embodiments of the present invention aim to select optimal biomarkers using a neural network-based gene selection algorithm.

The background art of the present invention is disclosed in Korean Registered Patent Publication No. 10-1489536 (published February 4, 2015).

The technical problem to be solved by the present invention is to provide a gene expression marker selection method that uses a neural network-based gene selection algorithm to select, from gene expression information, gene expression markers associated with 12 types of cancer.

According to an embodiment of the present invention for solving this technical problem, a gene expression marker selection method using a neural network-based gene selection algorithm includes: collecting biopsy tissue from each of a plurality of patients and collecting a plurality of pieces of gene expression information experimentally measured from each collected biopsy tissue; calculating a discriminative index (DI) for each gene by applying the collected gene expression information to a pre-trained neural network-based gene selection algorithm; ranking the genes by the calculated DI and selecting a plurality of specific genes having a large DI from the ranked genes; and predicting whether cancer occurs using the expression values of the selected specific genes.

The method may further include constructing and training the neural network-based gene selection algorithm. This step may include: receiving gene expression information for a plurality of cancer types from The Cancer Genome Atlas (TCGA) program; grouping the received gene expression information into a cancer patient group and a normal group, and randomly sampling the gene information obtained from each group to form a data set; and using the formed data set to construct a gene selection algorithm that extracts a plurality of specific genes for classifying samples into the normal group and the cancer patient group.

The plurality of cancer types may include bladder urothelial carcinoma (BLCA), breast invasive carcinoma (BRCA), colon adenocarcinoma (COAD), head and neck squamous cell carcinoma (HNSC), kidney chromophobe (KICH), kidney renal clear cell carcinoma (KIRC), kidney renal papillary cell carcinoma (KIRP), liver hepatocellular carcinoma (LIHC), lung adenocarcinoma (LUAD), lung squamous cell carcinoma (LUSC), prostate adenocarcinoma (PRAD), and thyroid carcinoma (THCA).

The gene selection algorithm may calculate a discriminative index (DI) value for every expressed gene included in the received plurality of cancer types, and may extract a plurality of top-ranked specific genes using the ranking of the calculated DI values.

The discriminative index (DI) value may be calculated through the following equation.

Here, the equation uses the sum of the gene expression values of cancer tissue corresponding to the j-th gene, the sum of the gene expression values of normal tissue corresponding to the j-th gene, and a weight W.

In the data set generation step, expressed-gene information may be sampled at random, regardless of the ratio of cancer samples to normal samples, which differs for each of the plurality of cancer types, to generate the data set.

In the data set generation step, a training data set, a validation data set, and a test data set may be generated from the entire cancer gene expression data at a preset ratio, and each of the three data sets may be formed with the same ratio of cancer samples to normal samples.

The plurality of specific genes may include FN1, ALB, EEF1A1, SFTPC, GAPDH, P4HB, DCN, A2M, MGP, UMOD, GPX3, FTL, ACPP, and CTSD.

As described above, according to the present invention, genes are ranked on the basis of the discriminative index (DI), which indicates the ability to distinguish the normal group from the cancer patient group, and an optimal gene set can be selected from the highest-ranking genes in the DI-ranked gene list.

FIG. 1 is a diagram schematically showing a gene expression marker selection device according to an embodiment of the present invention.
FIG. 2 is a flowchart schematically showing a method of selecting specific genes using a neural network-based gene selection algorithm according to an embodiment of the present invention.
FIG. 3 is a diagram for explaining step S210 shown in FIG. 2.
FIG. 4 is a graph showing classification accuracy as a function of the number of genes given in step S213 shown in FIG. 3.
FIG. 5 is a diagram showing the calculation of the discriminative index for each gene in step S230.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings. The thickness of lines and the size of components in the drawings may be exaggerated for clarity and convenience of description.

The terms used below are defined in consideration of their functions in the present invention and may vary according to the intention or practice of a user or operator. Their definitions should therefore be based on the content of this specification as a whole.

Hereinafter, the gene expression marker selection device according to an embodiment of the present invention is described in more detail with reference to FIG. 1.

FIG. 1 is a diagram schematically showing a gene expression marker selection device according to an embodiment of the present invention.

As shown in FIG. 1, the gene expression marker selection device 100 according to an embodiment of the present invention includes a collection unit 110, an algorithm generation unit 120, a discriminative index calculation unit 130, a specific gene selection unit 140, and a prediction unit 150.

First, the collection unit 110 collects gene information extracted from the tissues of a plurality of subjects. Specifically, the collection unit 110 extracts RNA from tissue taken from the subjects and applies the extracted RNA to the nCounter® Analysis System to obtain gene expression data. The obtained gene expression data contains information on approximately 20,000 genes.

The collection unit 110 then passes the obtained gene expression data to the discriminative index calculation unit 130.

The algorithm generation unit 120 generates a gene selection algorithm using received gene expression data. Here, the gene selection algorithm is a model that selects, from expressed-gene information, the specific genes that influence classification into cancer patients and normal individuals.

Specifically, the algorithm generation unit 120 acquires gene expression data for 12 different cancer types published in The Cancer Genome Atlas (TCGA). The acquired gene expression data consists of 6,226 samples in total (5,609 cancer samples and 617 normal samples).

The algorithm generation unit 120 then randomly selects n items from the acquired gene expression data of the 12 cancer types and combines them to generate a plurality of data sets. It divides the generated data sets at a ratio of 7:2:1: the portion corresponding to 7 is used for training, the portion corresponding to 2 for evaluation, and the portion corresponding to 1 for final evaluation. That is, by training and evaluating with these data sets, the algorithm generation unit 120 generates a gene selection algorithm that calculates the discriminative index (DI) for each gene.

The discriminative index calculation unit 130 applies the expressed-gene information obtained from the subjects to the pre-trained gene selection algorithm and obtains a discriminative index (DI) score for each input gene. Here, the DI is a value calculated to evaluate how well a given gene distinguishes the given groups.

The specific gene selection unit 140 ranks all genes by the calculated discriminative index (DI) and selects the top 14 genes in the ranked list as the specific genes.
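The ranking and top-14 selection performed by the specific gene selection unit 140 can be sketched as follows. This is only a minimal illustration: the function name `top_k_genes`, the gene names, and the DI values are hypothetical and do not come from the specification.

```python
from typing import Dict, List


def top_k_genes(di_scores: Dict[str, float], k: int = 14) -> List[str]:
    """Return the k gene names with the largest DI, highest first."""
    # Sort by descending DI; ties are broken alphabetically so the
    # result is deterministic.
    ranked = sorted(di_scores.items(), key=lambda item: (-item[1], item[0]))
    return [gene for gene, _ in ranked[:k]]


# Hypothetical DI values, for illustration only.
example_scores = {"GENE_A": 0.91, "GENE_B": 0.15, "GENE_C": 0.78, "GENE_D": 0.42}
top_two = top_k_genes(example_scores, k=2)
print(top_two)
```

With the default `k=14` the same function would return the 14 highest-ranked genes, or all genes if fewer than 14 are supplied.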

Finally, the prediction unit 150 determines whether cancer has occurred using the information on the selected top 14 specific genes.

Hereinafter, a method of selecting specific genes using the gene expression marker selection device is described in more detail with reference to FIGS. 2 to 5.

FIG. 2 is a flowchart schematically showing a method of selecting specific genes using a neural network-based gene selection algorithm according to an embodiment of the present invention.

As shown in FIG. 2, the algorithm generation unit 120 first receives gene expression information for a plurality of cancers from The Cancer Genome Atlas (TCGA) program. It then constructs the gene selection algorithm using the received gene expression information for the 12 cancer types (S210).

Hereinafter, step S210 is described in more detail with reference to FIGS. 3 and 4.

FIG. 3 is a diagram for explaining step S210 shown in FIG. 2, and FIG. 4 is a graph showing classification accuracy as a function of the number of genes given in step S213 shown in FIG. 3.

As shown in FIG. 3, the algorithm generation unit 120 receives gene expression information for the 12 cancer types published in The Cancer Genome Atlas (TCGA) program (S211).

Table 1 above lists the 12 cancer types received through The Cancer Genome Atlas (TCGA) program, together with the cancer tissue samples and normal tissue samples obtained for each cancer type.

The 12 cancer types are bladder urothelial carcinoma (BLCA), breast invasive carcinoma (BRCA), colon adenocarcinoma (COAD), head and neck squamous cell carcinoma (HNSC), kidney chromophobe (KICH), kidney renal clear cell carcinoma (KIRC), kidney renal papillary cell carcinoma (KIRP), liver hepatocellular carcinoma (LIHC), lung adenocarcinoma (LUAD), lung squamous cell carcinoma (LUSC), prostate adenocarcinoma (PRAD), and thyroid carcinoma (THCA).

Next, the algorithm generation unit 120 groups the samples of the 12 acquired cancer types into cancer tissue and normal tissue, and then randomly samples the gene expression information obtained from each group to form a data set (S212).

Specifically, as shown in Table 1 above, the 12 received cancer types comprise 6,226 samples in total, of which 5,609 are cancer tissue samples and 617 are normal tissue samples.

The algorithm generation unit 120 therefore divides the 6,226 samples at a ratio of 7:2:1 to generate a training data set, an evaluation data set, and a final evaluation data set.

Each cancer type has a different ratio of cancer samples to normal samples. The algorithm generation unit 120 therefore samples expressed-gene information at random, regardless of that ratio, to generate the data sets. The training, evaluation, and final evaluation data sets are, however, generated so that each preserves the ratio of cancer samples to normal samples.
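A 7:2:1 split that keeps the cancer-to-normal ratio in every subset can be sketched as below. The specification does not state how the split is implemented, so this per-class shuffle-and-slice approach, the function name `stratified_split`, and the fixed seed are assumptions; only the sample counts (5,609 cancer, 617 normal) and the 7:2:1 ratio come from the text.

```python
import random
from typing import List, Sequence, Tuple


def stratified_split(
    labels: Sequence[str],
    ratios: Tuple[int, ...] = (7, 2, 1),
    seed: int = 0,
) -> List[List[int]]:
    """Split sample indices into len(ratios) subsets, class by class."""
    rng = random.Random(seed)
    total = sum(ratios)
    subsets: List[List[int]] = [[] for _ in ratios]
    for label in sorted(set(labels)):
        # Shuffle the indices of one class, then cut them at the
        # cumulative ratio boundaries so every subset gets its share.
        indices = [i for i, y in enumerate(labels) if y == label]
        rng.shuffle(indices)
        start, cumulative = 0, 0
        for subset, ratio in zip(subsets, ratios):
            cumulative += ratio
            end = len(indices) * cumulative // total
            subset.extend(indices[start:end])
            start = end
    return subsets


# Sample counts stated in the description.
labels = ["cancer"] * 5609 + ["normal"] * 617
train_idx, eval_idx, final_idx = stratified_split(labels)
for name, idx in (("training", train_idx), ("evaluation", eval_idx),
                  ("final evaluation", final_idx)):
    n_cancer = sum(labels[i] == "cancer" for i in idx)
    print(name, len(idx), round(n_cancer / len(idx), 3))
```

Splitting each class separately is one simple way to satisfy the stated constraint; a library routine with a stratification option would serve equally well.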

Next, the algorithm generation unit 120 trains the gene selection algorithm using the generated training data set (S213).

The gene selection algorithm is based on a neural network method that trains the network weights using the training data set. Because the trained weights depend strongly on the random values assigned at initialization, the results can vary somewhat from run to run. To reduce this variability, the gene selection algorithm is run 10,000 times, and the genes are ranked by the mean of the per-gene discriminative index (DI) over those runs. The gene set with the highest DI is then selected as the specific genes.
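The averaging over repeated runs can be sketched as follows. The real training procedure is not reproduced here: `toy_run` is a hypothetical stand-in that returns a fixed per-gene value plus seed-dependent noise, and the gene names, noise range, and run count of 200 are illustrative assumptions (the description uses 10,000 runs).

```python
import random
from typing import Callable, Dict, List


def mean_di_over_runs(
    run_once: Callable[[int], Dict[str, float]],
    n_runs: int,
    seed: int = 0,
) -> Dict[str, float]:
    """Average the per-gene DI returned by n_runs independent runs."""
    rng = random.Random(seed)
    totals: Dict[str, float] = {}
    for _ in range(n_runs):
        # Each run gets its own seed, mimicking a fresh random
        # initialization of the network weights.
        scores = run_once(rng.randrange(2**32))
        for gene, di in scores.items():
            totals[gene] = totals.get(gene, 0.0) + di
    return {gene: total / n_runs for gene, total in totals.items()}


def toy_run(run_seed: int) -> Dict[str, float]:
    """Hypothetical stand-in for one training run (not the real model)."""
    rng = random.Random(run_seed)
    base = {"GENE_A": 0.9, "GENE_B": 0.5, "GENE_C": 0.1}
    return {gene: value + rng.uniform(-0.2, 0.2) for gene, value in base.items()}


averaged = mean_di_over_runs(toy_run, n_runs=200)
ranking: List[str] = sorted(averaged, key=averaged.get, reverse=True)
print(ranking)
```

Averaging shrinks the run-to-run noise, so the ranking depends less on any single initialization.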

To determine how many of the top-ranked genes by DI score must be taken as specific genes to avoid degrading classification performance, the optimal number of genes is first computed from the gene list sorted by DI score. To this end, the gene count is increased from 1 up to 1,000, each resulting set is treated as one specific gene set, and the mean accuracy of classifying cancer and normal samples in the training data set is calculated for each set.

As a result, as shown in FIG. 4, a specific gene set made up of roughly the top 100 genes gave the highest mean accuracy, and adding more genes did not increase the mean accuracy.
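The gene-count sweep described above can be sketched like this. The evaluator `toy_evaluate` is a hypothetical placeholder whose accuracy rises and then saturates; in practice it would train and score a classifier on the given gene subset. All names and numbers in the example are assumptions for illustration.

```python
from typing import Callable, List, Sequence, Tuple


def sweep_gene_count(
    ranked_genes: Sequence[str],
    evaluate: Callable[[Sequence[str]], float],
    max_k: int = 1000,
) -> List[Tuple[int, float]]:
    """Evaluate the top-k gene set for k = 1 .. max_k (or fewer genes)."""
    upper = min(max_k, len(ranked_genes))
    return [(k, evaluate(ranked_genes[:k])) for k in range(1, upper + 1)]


def smallest_best_k(results: List[Tuple[int, float]]) -> int:
    """Return the smallest k that reaches the maximum accuracy."""
    best = max(acc for _, acc in results)
    return next(k for k, acc in results if acc == best)


def toy_evaluate(genes: Sequence[str]) -> float:
    """Hypothetical accuracy curve: improves with k, then saturates."""
    return min(99, 80 + 5 * len(genes)) / 100


ranked = ["G1", "G2", "G3", "G4", "G5", "G6"]
curve = sweep_gene_count(ranked, toy_evaluate)
best_k = smallest_best_k(curve)
print(curve, best_k)
```

Choosing the smallest k at the plateau mirrors the observation that adding genes beyond a certain count does not raise the mean accuracy.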

When step S213 is complete, the algorithm generation unit 120 performs an intermediate evaluation by inputting the intermediate evaluation data set into the gene selection algorithm (S214).

At this time, the algorithm generation unit 120 calculates the discriminative index (DI) for each gene while varying the weights.

Next, the algorithm generation unit 120 performs a final evaluation by inputting the final evaluation data set into the gene selection algorithm (S215).

As a result of performing steps S213 to S215, the algorithm generation unit 120 selected 14 specific genes.

Here, the 14 specific genes are FN1, ALB, EEF1A1, SFTPC, GAPDH, P4HB, DCN, A2M, MGP, UMOD, GPX3, FTL, ACPP, and CTSD.

With the gene selection algorithm constructed in step S210, the collection unit 110 collects tissue from a plurality of subjects and obtains gene information from the collected tissue (S220).

Here, the plurality of subjects comprises a cancer patient group and a normal group of 50 people each. The collection unit 110 obtains tissue separated from the 100 subjects and extracts RNA from the obtained tissue.

Next, the collection unit 110 analyzes the extracted RNA with the nCounter® Analysis System, in which a digital analyzer captures and counts the color of each molecule contained in the RNA to obtain gene information. The collection unit 110 thereby acquires approximately 20,000 gene expression data for the plurality of subjects.

The collection unit 110 then passes the 20,000 acquired gene expression data to the discriminative index calculation unit 130.

Next, the discriminative index calculation unit 130 inputs the 20,000 received gene expression data into the pre-built gene selection algorithm to calculate the discriminative index for each gene (S230).

FIG. 5 is a diagram showing the calculation of the discriminative index for each gene in step S230.

As shown in FIG. 5, for each of the 20,000 received gene expression data, the discriminative index calculation unit 130 calculates the sum of the gene expression values in cancer tissue and the sum of the gene expression values in normal tissue.
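The two per-gene sums can be sketched as below. This covers only the summation step; how the sums and the weight W are combined into the DI is given by an equation that is not reproduced here, so the sketch deliberately stops before that step. The function name `class_sums`, the label strings, and the toy values are assumptions.

```python
from typing import List, Sequence, Tuple


def class_sums(
    expression: Sequence[Sequence[float]],
    labels: Sequence[str],
) -> Tuple[List[float], List[float]]:
    """Sum expression values per gene, separately for each class.

    expression[i][j] is the expression value of gene j in sample i;
    labels[i] is "cancer" or "normal" for sample i.
    """
    n_genes = len(expression[0])
    cancer = [0.0] * n_genes
    normal = [0.0] * n_genes
    for row, label in zip(expression, labels):
        target = cancer if label == "cancer" else normal
        for j, value in enumerate(row):
            target[j] += value
    return cancer, normal


# Toy data: three samples, two genes (values are illustrative only).
expression = [[1.0, 4.0], [2.0, 0.5], [0.5, 3.0]]
sample_labels = ["cancer", "cancer", "normal"]
cancer_sum, normal_sum = class_sums(expression, sample_labels)
print(cancer_sum, normal_sum)
```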

Next, the discriminative index calculation unit 130 calculates the discriminative index (DI) using the following equation.

Here, the equation gives the discriminative index of the j-th gene in terms of the sum of the gene expression values of cancer tissue corresponding to the j-th gene, the sum of the gene expression values of normal tissue corresponding to the j-th gene, and a weight W.

That is, the influence of a specific gene, namely its discriminative index (DI), is calculated by adding the values of the two different members of a pair in the input data. One member of the pair is the sum of the tumor gene expression samples, and the other is the sum of the normal gene expression samples.

When step S230 is complete, the specific gene selection unit 140 ranks the 20,000 genes in descending order of the calculated discriminative index (DI) and selects the top 14 genes from the ranked list (S240).

Finally, the prediction unit 150 predicts whether cancer occurs by comparing the specific genes obtained through the pre-built gene selection algorithm with the genes selected in step S240 (S250).

Here, the specific genes are FN1, ALB, EEF1A1, SFTPC, GAPDH, P4HB, DCN, A2M, MGP, UMOD, GPX3, FTL, ACPP, and CTSD.

Hereinafter, the classification accuracy of the specific genes extracted by the gene expression marker selection device according to an embodiment of the present invention is described in more detail.

In this embodiment, the 14 specific genes with the highest discriminative index were selected as gene expression markers. As shown in Table 2, the gene expression marker sets in previous studies (Peng et al. or Martinez-Ledesma et al.) consist of 7 or 14 genes. The markers obtained through the gene selection algorithm do not overlap with those of the previous studies. The accuracy with which the markers selected by the gene selection algorithm classify cancer was therefore evaluated and, as shown in Table 3 below, the markers showed high classification accuracy for 5 of the 7 cancer types.

As described above, the gene expression marker selection method according to the present invention ranks genes on the basis of the discriminative index (DI), which indicates the ability to distinguish the normal group from the cancer patient group, and can select an optimal gene set from the highest-ranking genes in the DI-ranked gene list.

The present invention has been described with reference to the embodiments shown in the drawings, but these are merely exemplary, and those of ordinary skill in the art will understand that various modifications and equivalent embodiments are possible. The true technical scope of protection of the present invention should therefore be determined by the technical spirit of the claims below.

100: gene expression marker selection device
110: collection unit
120: algorithm generation unit
130: discriminative index calculation unit
140: specific gene selection unit
150: prediction unit

Claims

A gene expression marker selection method based on a neural network, performed using a gene expression marker selection device, the method comprising:
collecting biopsy tissue from each of a plurality of patients and collecting a plurality of pieces of gene expression information experimentally measured from each collected biopsy tissue;
calculating a discriminative index (DI) for each gene by applying the collected gene expression information to a pre-trained neural network-based gene selection algorithm;
ranking the genes by the calculated discriminative index (DI) and selecting a plurality of specific genes having a large discriminative index (DI) from the ranked genes; and
predicting whether cancer occurs using the expression values of the selected specific genes.

The gene expression marker selection method of claim 1, further comprising constructing and training the neural network-based gene selection algorithm,
wherein constructing and training the neural network-based gene selection algorithm comprises:
receiving gene expression information for a plurality of cancer types from The Cancer Genome Atlas (TCGA) program;
grouping the received gene expression information into a cancer patient group and a normal group, and randomly sampling the gene information obtained from each group to form a data set; and
constructing, using the formed data set, a gene selection algorithm that extracts a plurality of specific genes for classifying samples into a normal group and a cancer patient group.

The gene expression marker selection method of claim 2,
wherein the plurality of cancer types includes
bladder urothelial carcinoma (BLCA), breast invasive carcinoma (BRCA), colon adenocarcinoma (COAD), head and neck squamous cell carcinoma (HNSC), kidney chromophobe (KICH), kidney renal clear cell carcinoma (KIRC), kidney renal papillary cell carcinoma (KIRP), liver hepatocellular carcinoma (LIHC), lung adenocarcinoma (LUAD), lung squamous cell carcinoma (LUSC), prostate adenocarcinoma (PRAD), and thyroid carcinoma (THCA).

The gene expression marker selection method of claim 2,
wherein the gene selection algorithm
calculates a discriminative index (DI) value for every expressed gene included in the received plurality of cancer types, and
extracts a plurality of top-ranked specific genes using the ranking of the calculated discriminative index (DI) values.

The gene expression marker selection method of claim 4,
wherein the discriminative index (DI) value
is calculated through the following equation:

here,

The gene expression marker selection method of claim 2,
wherein generating the data set comprises
sampling expressed-gene information at random, regardless of the ratio of cancer samples to normal samples, which differs for each of the plurality of cancer types, to generate the data set.

The gene expression marker selection method of claim 6,
wherein generating the data set comprises
generating a training data set, a validation data set, and a test data set from the entire cancer gene expression data at a preset ratio, and
wherein the generated training data set, validation data set, and test data set each have the same ratio of cancer samples to normal samples.

The gene expression marker selection method of any one of claims 1 to 7,
wherein the plurality of specific genes
includes FN1, ALB, EEF1A1, SFTPC, GAPDH, P4HB, DCN, A2M, MGP, UMOD, GPX3, FTL, ACPP, and CTSD.