WO2022203093A1 - Method for diagnosing or predicting cancer occurrence - Google Patents

Method for diagnosing or predicting cancer occurrence

Info

Publication number
WO2022203093A1
Authority
WO
WIPO (PCT)
Prior art keywords
cancer
diagnosing
predicting
data
analysis data
Application number
PCT/KR2021/003531
Other languages
French (fr)
Korean (ko)
Inventor
권혁중
이성훈
Original Assignee
이원다이애그노믹스(주)
Application filed by 이원다이애그노믹스(주)
Priority to PCT/KR2021/003531
Publication of WO2022203093A1

Classifications

    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6869: Methods for sequencing
    • C12Q1/6876: Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
    • C12Q1/6883: Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material
    • C12Q1/6886: Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material for cancer
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10: Sequence alignment; Homology search
    • G16H: HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/20: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
    • G16H50/50: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for simulation or modelling of medical disorders
    • G16H50/70: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients

Definitions

  • The present invention relates to a method for diagnosing or predicting cancer and, more particularly, to a method comprising: obtaining first analysis data on cfDNA (cell-free DNA) concentration, second analysis data on copy number variation (CNV), and third analysis data on tumor marker expression level; and diagnosing or predicting whether cancer occurs by having a prediction unit analyze the analysis data using a machine learning model trained to calculate a cancer occurrence probability through operations on the analysis data.
  • Cancer is the second leading cause of death in the world and is a fatal disease for humans.
  • The current standard method for diagnosing cancer is tissue biopsy, in which tumor tissue is collected with an endoscope or a needle.
  • However, because tissue biopsy is invasive, it burdens both doctors and patients and is time-consuming; moreover, even within tumor tissue, biological characteristics may differ depending on the site sampled, so the accuracy of the information can be a problem.
  • Liquid biopsy, devised to overcome the problems of tissue biopsy, diagnoses cancer or various diseases non-invasively using blood, urine, spinal fluid, and the like.
  • In 1869, the Australian physician Thomas Ashworth reported that circulating tumor cells (CTCs) in the blood were associated with cancer metastasis. With techniques for isolating cancer cells from blood completed in the early 2000s, liquid biopsy is attracting attention as a new alternative that can replace conventional invasive diagnosis with blood sampling.
  • Among liquid biopsies, the most actively researched is the blood biopsy, which uses blood to detect mutated genes in cancer.
  • A blood biopsy for EGFR gene mutations is available, and blood biopsy is currently most actively used in lung cancer.
  • Multi-omics analysis means a holistic, integrated analysis of the large volume of data generated at various molecular levels, such as the genome, transcriptome, proteome, metabolome, epigenome, and lipidome.
  • Patent Document 0001 Korean Patent Publication No. 10-2012-0077568
  • Non-Patent Document 0001 Heidi Schwarzenbach et al., Cell-free nucleic acids as biomarkers in cancer patients
  • the present inventors have recognized the problems of the prior art and have tried to develop a method for diagnosing or predicting cancer with high accuracy. As a result, it has been demonstrated that high sensitivity and specificity can be achieved in cancer diagnosis or prediction when using a multiomics analysis method that performs integrated analysis of specific data with a machine learning algorithm, and thus the present invention has been completed.
  • An object of the present invention is to provide a method for diagnosing or predicting cancer with high sensitivity and specificity.
  • The present invention provides a method for diagnosing or predicting cancer, comprising: obtaining first analysis data on cfDNA (cell-free DNA) concentration, second analysis data on copy number variation (CNV), and third analysis data on tumor marker expression level; and diagnosing or predicting whether cancer occurs by having a prediction unit analyze the analysis data using a machine learning model trained to calculate a cancer occurrence probability through operations on the analysis data.
  • The first analysis data may be obtained by a method including: collecting blood from a subject; and measuring the amount of cfDNA detected per 1 ml of the collected blood.
  • The second analysis data may be obtained by a method including: a) inputting sequence information of test nucleotide sequence fragments of cfDNA (cell-free DNA) from blood isolated from a subject; b) aligning the fragments to homologous positions by comparing the sequence information of the test nucleotide sequence fragments with a human reference genome database; c) segmenting a target chromosome among the test nucleotide sequence fragments to a predetermined size and increasing the segment size to generate a normalized two-dimensional matrix in which rows and columns represent segment sizes and positions, respectively; d) forming a two-dimensional matrix of Z-score values by calculating a Z-score value for each cell of the generated two-dimensional matrix; e) selecting the lowest Z-score value from among the Z-score values lower than the Z-score value calculated from the target chromosome of the reference genome sequence fragments; and f) identifying gene copy number variation (CNV) by checking the positions of the segments whose Z-score values gradually increase from the row and column to which the lowest Z-score value belongs.
  • Step a) may include: a-1) separating plasma by centrifuging blood collected from a subject; a-2) extracting cfDNA (cell-free DNA) from the separated plasma; a-3) preparing a library using the extracted cfDNA; and a-4) pooling the library and then reading the nucleotide sequence by next generation sequencing (NGS).
  • Step c) may repeatedly perform a step c)-n that generates the n-th row by segmenting the chromosome into segments of 0.5 Mb * n.
  • The Z-score values for all segment sizes of the two-dimensional matrix in step d) may be calculated using the mean and standard deviation (SD) obtained by normalizing samples of a control group having normal chromosomes.
  • In step e), the cell with the lowest Z-score value is selected from among the values lower than the reference value, and in step f), copy number variation may be identified by checking the positions of the segments whose Z-score values gradually increase from the row and column containing the lowest Z-score value up to the row with the smallest segment size.
  • The Z-score value may be calculated according to Equation 1 using the mean and standard deviation (SD) at position (x, y) of the two-dimensional matrix calculated from the target chromosome of the reference genome sequence fragments.
  • The third analysis data may be obtained by a method including: obtaining a biological sample from the subject; and measuring the concentration of the tumor marker in the biological sample.
  • the tumor marker may be Cyfra 21-1, CA 15-3, AFP, CEA and CA 19-9.
  • the biological sample may be isolated from blood, plasma or serum.
  • The diagnosing or predicting step may include: calculating, by the machine learning model, an output value indicating the probability that the patient under test has cancer by performing, on the analysis data, a plurality of operations to which the weights between the plurality of layers of the machine learning model are applied; and estimating, by the prediction unit, whether cancer occurs according to the output value.
  • Before the obtaining step, the method may further include: preparing, by the learning unit, learning data, which is analysis data of patients whose cancer status is known; setting, by the learning unit, a label on the learning data according to whether or not cancer occurs; calculating, by the machine learning model before learning is completed, an output value indicating the probability of cancer occurrence by performing, on the learning data, a plurality of operations to which the weights between the plurality of layers of the machine learning model are applied; calculating, by the learning unit, a loss that is the difference between the output value and the label; and performing, by the learning unit, an optimization that corrects the weight w of the machine learning model (MLM) using a backpropagation algorithm so that the loss is minimized.
  • In calculating the loss, the learning unit calculates the loss according to a loss function, where L is the loss, i is the index of the training data, x_i is the training data input to the machine learning model, y_i is the label (the expected value) corresponding to the i-th training data, and f(x_i) is the output value calculated by the machine learning model for the i-th training data.
  • the machine learning model may be formed through a multilayer perceptron (MLP).
  • the machine learning model may include an input layer (IL), a plurality of hidden layers (HL1 to HLk), and an output layer (OL).
  • Each of the plurality of layers may include at least one node.
  • Every node of every layer may perform an operation, and the operation may be carried out through an activation function (F).
  • the cancer may be one or more selected from the group consisting of lung cancer, head and neck cancer, breast cancer, ovarian cancer, liver cancer, testicular cancer, colorectal cancer, thyroid cancer, pancreatic cancer, cervical cancer, bladder cancer, digestive cancer and gallbladder cancer.
  • the method for diagnosing or predicting cancer of the present invention can reduce the possibility of false positives and false negatives, and has excellent sensitivity and specificity, so that it can be usefully used for diagnosis or prediction of cancer.
  • FIG. 1 is a diagram for explaining the configuration of a prediction apparatus for predicting cancer according to an embodiment of the present invention.
  • FIG. 2 is a diagram for explaining a schematic configuration of a machine learning model (MLM) executed in a predicting apparatus for predicting cancer according to an embodiment of the present invention.
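  • FIG. 3 is a diagram for explaining a multilayer perceptron (MLP) used as the machine learning model (MLM) according to an embodiment of the present invention.
  • FIG. 4 is a diagram showing the configuration of any one node of the machine learning model (MLM) according to an embodiment of the present invention.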
  • FIG. 5 is a diagram for explaining a learning method of a machine learning model (MLM) according to an embodiment of the present invention.
  • FIG. 6 is a flowchart illustrating a method for predicting cancer using a machine learning model (MLM) according to an embodiment of the present invention.
  • FIG. 7 is a view comparing the sensitivity of cancer diagnosis among the method according to an embodiment of the present invention, the use of CNV data alone, and Grail's liquid biopsy.
  • the present invention provides a method for diagnosing or predicting cancer comprising the following steps.
  • Obtaining first analysis data on cfDNA (cell-free DNA) concentration, second analysis data on copy number variation (CNV), and third analysis data on tumor marker expression level; and
  • Diagnosing or predicting whether cancer occurs by having a prediction unit analyze the analysis data using a machine learning model trained to calculate a cancer occurrence probability through operations on the analysis data.
  • the first analysis data is data obtained by measuring a cell-free DNA (cfDNA) concentration.
  • The first analysis data may be obtained by a method including collecting blood from a subject and measuring the amount of cfDNA detected per 1 ml of the collected blood, but is not limited thereto. It may also be obtained by any known method capable of measuring cfDNA concentration, by a commercially available kit, or by requesting measurement from an institution able to perform it.
  • The result of the first analysis data may classify the cfDNA concentration as high or low relative to a cfDNA concentration of 7.4 ng/ml, but is not limited thereto.
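The high/low call described above can be illustrated with a minimal Python sketch. The function names are hypothetical, and treating a value exactly at the 7.4 ng/ml cutoff as "high" is an assumption; the source only states that the result is classified relative to that concentration.

```python
CFDNA_CUTOFF_NG_PER_ML = 7.4  # cutoff quoted in the text


def cfdna_concentration(total_cfdna_ng: float, blood_volume_ml: float) -> float:
    """Amount of cfDNA detected per 1 ml of collected blood (ng/ml)."""
    if blood_volume_ml <= 0:
        raise ValueError("blood volume must be positive")
    return total_cfdna_ng / blood_volume_ml


def classify_cfdna(concentration_ng_per_ml: float,
                   cutoff: float = CFDNA_CUTOFF_NG_PER_ML) -> str:
    """Return 'high' or 'low' relative to the cutoff.

    Assumption: a value equal to the cutoff counts as 'high'.
    """
    return "high" if concentration_ng_per_ml >= cutoff else "low"
```

For example, 30 ng of cfDNA recovered from 3 ml of blood gives 10 ng/ml, which this sketch labels "high".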
  • the second analysis data is data on (or confirming) copy number variation (CNV).
  • the second analysis data may be obtained by a non-invasive method including the following steps.
  • subject includes any human or non-human animal.
  • non-human animal can be a vertebrate, such as non-human primates, sheep, dogs, and rodents, such as mice, rats, and guinea pigs.
  • the subject may preferably be a human, and specifically may be a human possessing or expected to have tumor cells.
  • the term “subject” may be used interchangeably with “individual” or “patient”.
  • Step a) may include: a-1) separating plasma by centrifuging blood collected from a subject; a-2) extracting cfDNA (cell-free DNA) from the separated plasma; a-3) preparing a library using the extracted cfDNA; and a-4) pooling the library and then reading the nucleotide sequence by next generation sequencing (NGS).
  • The next generation sequencing of step a-4), also called massively parallel sequencing, fragments genomic DNA into numerous pieces, reads each piece simultaneously to obtain nucleotide sequence data, and encompasses all analysis methods that decipher the genome by assembling the nucleotide sequence data of the fragments using various bioinformatics techniques.
  • The nucleotide sequence may be obtained using, for example, Roche 454 (e.g., Roche 454 GS FLX), Applied Biosystems' SOLiD system (e.g., SOLiDv4), Illumina's GAIIx, HiSeq 2500, and MiSeq sequencers, Life Technologies' Ion Torrent semiconductor sequencing platform, Pacific Biosciences' PacBio RS, or the 3730xl Sanger sequencer. Preferably, it may be obtained using Life Technologies' Ion Torrent platform or Illumina's MiSeq, and more preferably Life Technologies' Ion Torrent Personal Genome Machine (Ion Torrent PGM).
  • The chromosome of step c) is an autosome and may be one or more chromosomes selected from chromosomes 1 to 22.
  • Step c) may repeatedly perform a step c)-n that generates the n-th row by segmenting the chromosome into segments of 0.5 Mb * n, where n is 1 to 24.
  • That is, the first row is generated by segmenting the chromosome into 0.5 Mb segments, the second row by segmenting it into 1 Mb (0.5 Mb * 2) segments, and so on, with the segment size increasing by 0.5 Mb per row.
  • In the two-dimensional matrix, rows and columns represent segment sizes and positions, respectively. For example, increasing the segment length for one chromosome from 0.5 Mb to 12 Mb in 0.5 Mb steps yields 24 rows in total, each composed only of segments of a single length. Copy number variation is identified by checking the Z-score step by step in this two-dimensional matrix of segment sizes and positions, which is also referred to herein as a stair-matrix.
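The row-by-row segmentation described above can be sketched as follows. This is an illustrative reading, not the applicant's implementation: it assumes normalized read counts are already available in 0.5 Mb base bins, that segments tile the chromosome without overlap, and that a trailing partial segment is dropped. `build_stair_matrix` is a hypothetical name.

```python
BASE_SEGMENT_MB = 0.5  # size of the smallest segment (row 1)
MAX_N = 24             # n runs from 1 to 24, i.e. 0.5 Mb to 12 Mb


def build_stair_matrix(base_bin_counts, max_n=MAX_N):
    """Return a ragged 2-D list: row n-1 holds segments of size 0.5 Mb * n.

    base_bin_counts: normalized read counts per 0.5 Mb bin along one chromosome.
    Assumption: segments are non-overlapping and incomplete tails are dropped.
    """
    rows = []
    for n in range(1, max_n + 1):
        row = [
            sum(base_bin_counts[start:start + n])
            for start in range(0, len(base_bin_counts) - n + 1, n)
        ]
        rows.append(row)
    return rows


def segment_sizes_mb(max_n=MAX_N):
    """Segment size represented by each row, in Mb."""
    return [BASE_SEGMENT_MB * n for n in range(1, max_n + 1)]
```

With 48 base bins (24 Mb), the first row holds 48 segments of 0.5 Mb and the last row holds 2 segments of 12 Mb.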
  • The Z-score values for all segment sizes of the two-dimensional matrix in step d) are calculated using the mean and standard deviation (SD) obtained by normalizing the samples of a control group having normal chromosomes.
  • The Z-score value is calculated according to Equation 1 using the mean and standard deviation (SD) at position (x, y) of the two-dimensional matrix calculated from the target chromosome of the reference genome sequence fragments.
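Equation 1 is not reproduced in this text, so the sketch below assumes the conventional Z-score, (value - mean) / SD, computed cell by cell against the control group. The use of the sample standard deviation and the handling of a zero SD are also assumptions, and `zscore_matrix` is a hypothetical name.

```python
import statistics


def zscore_matrix(test_matrix, control_matrices):
    """Z-score of every cell (x, y) of the test matrix against controls.

    test_matrix: 2-D list of normalized values for the test sample.
    control_matrices: list of 2-D lists with the same shape, one per
        control sample with normal chromosomes (at least two needed).
    Assumption: the conventional z = (value - mean) / SD is used.
    """
    z = []
    for r, row in enumerate(test_matrix):
        z_row = []
        for c, value in enumerate(row):
            reference = [ctrl[r][c] for ctrl in control_matrices]
            mean = statistics.fmean(reference)
            sd = statistics.stdev(reference)  # sample SD (assumption)
            z_row.append((value - mean) / sd if sd > 0 else 0.0)
        z.append(z_row)
    return z
```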
  • Step e) selects the cell showing the lowest Z-score value among the values lower than the reference value, and step f) identifies copy number variation by checking the positions of the segments whose Z-score values gradually increase from the row and column containing the lowest Z-score value up to the row with the smallest segment size.
  • The Z-score value calculated from the target chromosome of the reference genome sequence fragments may also be defined as the "reference value" and refers to a Z-score value obtained from a sample determined to be normal.
  • The lowest Z-score value in step e) may be located by displaying the chromosome as a heat map, in which the segment with the darkest shade has the lowest Z-score value.
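One possible reading of steps e) and f) is sketched below: find the lowest Z-score below the reference value, then walk from that row toward the smallest-segment row, collecting the segments that overlap the same genomic interval. The overlap rule, the function names, and the row layout (row index r holds segments of 0.5 Mb * (r + 1)) are assumptions made for illustration only.

```python
BASE_MB = 0.5  # segment size of the first row


def lowest_cell(z, reference):
    """Step e): (row, col, value) of the lowest Z-score below reference."""
    best = None
    for r, row in enumerate(z):
        for c, value in enumerate(row):
            if value < reference and (best is None or value < best[2]):
                best = (r, c, value)
    return best


def trace_to_smallest_row(z, reference):
    """Step f), sketched: follow the hit up to the smallest-segment row."""
    hit = lowest_cell(z, reference)
    if hit is None:
        return None  # no value below the reference: no candidate CNV
    r, c, _ = hit
    size = BASE_MB * (r + 1)
    start, end = c * size, (c + 1) * size
    path = []
    for row_idx in range(r, -1, -1):
        seg = BASE_MB * (row_idx + 1)
        cols = [col for col in range(len(z[row_idx]))
                if col * seg < end and (col + 1) * seg > start]
        path.append((row_idx, cols, [z[row_idx][col] for col in cols]))
    return {"interval_mb": (start, end), "path": path}
```

The returned path lists, for each row, the overlapping columns and their Z-scores, so a reader can check whether the values rise toward the smallest segments as the text describes.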
  • the third analysis data is data on the expression level of a tumor marker.
  • The third analysis data in the present invention may be obtained by a method including: obtaining a biological sample from the subject; and measuring the concentration of the tumor marker in the biological sample.
  • The term "tumor marker" refers to a proteinaceous substance produced by cancer cells in response to cancer growth, or by surrounding normal tissue in response to cancer tissue, and is a biomarker detected in blood, urine, or tissue samples.
  • Although cancer is diagnosed through some established tumor markers in clinical practice, most tumor markers can also be elevated in non-tumor conditions, so a specific tumor marker alone is not suitable for diagnosing cancer.
  • The tumor marker may be at least one selected from the group consisting of Cyfra 21-1, CA 15-3 (cancer antigen 15-3), AFP (alpha-fetoprotein), CEA (carcinoembryonic antigen), CA 19-9 (cancer antigen 19-9), B2M (beta-2 microglobulin), CA-125 (cancer antigen 125), calcitonin, chromogranin A (CgA), hCG (human chorionic gonadotropin), monoclonal immunoglobulin, PSA (prostate-specific antigen), and thyroglobulin. Preferably, the tumor marker may be at least one selected from the group consisting of Cyfra 21-1, CA 15-3, AFP, CEA, and CA 19-9. According to an embodiment of the present invention, the tumor marker may be a tumor marker group consisting of Cyfra 21-1, CA 15-3, AFP, CEA, and CA 19-9, but is not limited thereto.
  • The result of the third analysis data may indicate a low-risk or high-risk level according to the following criteria, but is not limited thereto.
  • The biological sample may be a body fluid isolated from the subject's body, and may preferably be blood, plasma, or serum, given that the present invention uses a non-invasive method.
  • The prediction apparatus 10 may be a device that temporarily stores data and processes it according to a predetermined clock, such as a central processing unit (CPU), a graphics processing unit (GPU), or a neural processing unit (NPU).
  • the prediction apparatus 10 includes a preprocessor 100 , a learning unit 200 , and a prediction unit 300 .
  • The preprocessor 100 performs preprocessing on data input to the machine learning model (MLM), that is, analysis data or learning data. These data may include the cfDNA concentration or input amount, copy number variation (CNV), and the expression levels of tumor markers (Cyfra 21-1, CA 15-3, AFP, CEA, and CA 19-9). That is, the preprocessor 100 may remove outliers from data such as the amount of cfDNA, CNV, and tumor markers, and normalize the data.
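The source does not say how outliers are removed or how values are normalized. The sketch below therefore swaps in two common choices as assumptions: Tukey's interquartile-range fences for outlier removal and min-max scaling to the range 0 to 1 for normalization. Both function names are hypothetical.

```python
import statistics


def remove_outliers(values, k=1.5):
    """Drop values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey fences).

    Assumption: the patent does not specify the outlier rule.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


def min_max_normalize(values):
    """Scale values linearly to [0, 1]; constant input maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```

Applied to one feature column (for example, cfDNA amounts across samples), `min_max_normalize(remove_outliers(column))` would yield inputs on a comparable scale for the model.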
  • The learning unit 200 trains the machine learning model (MLM). The operation of the learning unit 200 is described in more detail below.
  • The prediction unit 300 calculates a cancer occurrence probability (Positive Predict Probability: 0.0 to 1.0) using the machine learning model (MLM) generated by the learning unit 200, and predicts cancer occurrence according to the calculated probability.
  • FIG. 2 is a diagram for explaining a schematic configuration of a machine learning model (MLM) executed in a prediction apparatus for predicting cancer according to an embodiment of the present invention.
  • FIG. 3 is a diagram for explaining a multilayer perceptron (MLP).
  • FIG. 4 shows the configuration of any one node of the machine learning model (MLM) according to an embodiment of the present invention.
  • the machine learning model (MLM) may be a multilayer perceptron (MLP).
  • a machine learning model (MLM) includes multiple layers (IL, HL, OL).
  • the plurality of layers includes an input layer (IL), a plurality of hidden layers (HL1 to HLk), and an output layer (OL).
  • each of the plurality of layers IL, HL, and OL includes at least one node.
  • The input layer IL may include n input nodes (i1 to in).
  • The output layer OL may include one output node (o).
  • The number of hidden layers HL may be k (HL1, HL2, ..., HLk).
  • The first hidden layer HL1 may include hidden nodes h11 to h1a (a nodes in total), the second hidden layer HL2 may include hidden nodes h21 to h2b (b nodes in total), and the k-th hidden layer HLk may include hidden nodes hk1 to hkz (z nodes in total).
  • Every node of every layer performs an operation, which is carried out through an activation function (F).
  • Nodes of different layers are connected by channels (indicated by dotted lines), each having a weight (w).
  • Any one node (N) of any one layer of the machine learning model (MLM) applies the weights (w1, w2, ..., wn) to the inputs (x1, x2, ..., xn) received from the nodes of the previous layer, sums the weighted values, and applies the activation function (F) to the sum to produce a result (OUT).
  • These results (OUT) are passed as inputs to the next layer.
  • The parameter b denotes a threshold: if the sum of the values obtained by applying the weights (w1, w2, ..., wn) to the inputs (x1, x2, ..., xn) is not greater than or equal to the threshold, the operation result of the node is deactivated.
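A single node as described above can be sketched in a few lines. The text does not name the activation function F, so a sigmoid is assumed here, and a deactivated node is assumed to output 0.0; `node_output` is a hypothetical name.

```python
import math


def sigmoid(z: float) -> float:
    """Assumed activation function F (the source does not specify one)."""
    return 1.0 / (1.0 + math.exp(-z))


def node_output(inputs, weights, b):
    """Weighted sum of inputs, gated by threshold b, then activation F.

    If the weighted sum is below the threshold b, the node is
    deactivated and outputs 0.0 (assumption about the inactive value).
    """
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    if weighted_sum < b:
        return 0.0
    return sigmoid(weighted_sum)
```

With inputs (1.0, 2.0) and weights (0.5, 0.25), the weighted sum is 1.0, so the node fires for b = 0.5 and stays inactive for b = 2.0.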
  • the machine learning model calculates an output value (0.0 to 1.0) by performing a plurality of operations to which a weight between a plurality of layers (IL, HL, OL) is applied to the input data. That is, a plurality of calculations in which weights are applied from the input layer IL to the plurality of hidden layers HL1 , HL2 , ... HLk and the output layer OL are performed on data to calculate an output value.
  • The value of the output node o of the output layer OL is the output value, which indicates the probability of cancer occurrence (0.0 to 1.0). For example, if the output value of the output node o is 0.45, the probability of cancer occurrence is 45%.
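The layer-by-layer computation from the input layer through the hidden layers to the single output node can be sketched as a plain forward pass. The choice of ReLU for hidden layers and a sigmoid on the output node (so the result lies between 0.0 and 1.0) is an assumption, as are the function names and the way weights are passed in.

```python
import math


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs, weights, biases, activation):
    """One layer: each row of weights feeds one node of the layer."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def predict_probability(features, hidden_layers, output_layer):
    """Forward pass IL -> HL1..HLk -> OL, returning a value in (0, 1).

    hidden_layers: list of (weights, biases) pairs, one per hidden layer.
    output_layer: (weights, bias) for the single output node o.
    """
    values = features
    for weights, biases in hidden_layers:
        values = dense(values, weights, biases, relu)
    out_weights, out_bias = output_layer
    return dense(values, [out_weights], [out_bias], sigmoid)[0]
```

Here `features` would be the preprocessed cfDNA, CNV, and tumor marker values for one patient.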
  • FIG. 5 is a diagram for explaining a learning method of a machine learning model (MLM) according to an embodiment of the present invention.
  • the learning unit 200 prepares learning data in step S110.
  • The learning data is data for which the presence or absence of cancer is known.
  • The learning data consists of data on the cfDNA concentration (or amount), data confirming copy number variation (CNV), and data on the expression levels of tumor markers (Cyfra 21-1, CA 15-3, AFP, CEA, and CA 19-9) for patients whose cancer status is known. That is, the learning data includes an experimental group, which is data of patients with cancer, and a control group, which is data of patients who do not have cancer.
  • The learning unit 200 sets a label on the training data in step S120.
  • The learning unit 200 may set the label to 1 for the experimental group and to 0 for the control group, using one-hot encoding.
  • The learning unit 200 inputs the learning data to the machine learning model (MLM). In step S130, the machine learning model (MLM) performs, on the input learning data, a plurality of operations to which the weights between the plurality of layers are applied, and thereby calculates an output value representing the probability of cancer.
  • In step S140, the learning unit 200 calculates a loss through a loss function as in Equation 2.
  • L means the loss (L2 loss).
  • i is an index corresponding to the training data.
  • x_i is the i-th training data input to the machine learning model (MLM), and y_i is the label, indicating the expected value, corresponding to the i-th training data (x_i).
  • f(x_i) is the output value calculated by the machine learning model (MLM) for the i-th training data (x_i).
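Equation 2 itself is missing from this text. Since the loss is described as an L2 loss over labels y_i and outputs f(x_i), the sketch below assumes the common sum-of-squared-differences form; whether the original sums, averages, or applies a constant factor is not stated. `l2_loss` is a hypothetical name.

```python
def l2_loss(labels, outputs):
    """Sum over i of (y_i - f(x_i))^2.

    Assumption: plain sum of squared differences; the source's
    Equation 2 may use a mean or a constant factor instead.
    """
    if len(labels) != len(outputs):
        raise ValueError("labels and outputs must have the same length")
    return sum((y - fx) ** 2 for y, fx in zip(labels, outputs))
```

For labels (1, 0) and outputs (0.84, 0.45), the loss is 0.16^2 + 0.45^2 = 0.2281.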
  • In step S150, the learning unit 200 performs an optimization that corrects the weight w of the machine learning model (MLM) so that the loss, which is the difference between the output value of the machine learning model (MLM) and the label, is minimized.
  • A back-propagation algorithm can be used for this optimization.
  • Steps S110 to S150 described above are performed repeatedly using a plurality of different learning data. The iteration may continue until the accuracy, computed with an evaluation metric, reaches the desired level.
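Steps S130 to S150 and the stopping rule can be illustrated with a tiny multilayer perceptron trained by backpropagation. Everything specific here is an assumption for illustration: one hidden layer, sigmoid activations, a squared-error loss, per-sample gradient descent, a 0.5 decision threshold for accuracy, and the hyperparameter values. It is not the applicant's model.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(data, labels, hidden=4, lr=0.5, max_epochs=5000,
          target_acc=1.0, seed=0):
    """Train a 1-hidden-layer MLP; return (predict function, accuracy)."""
    rng = random.Random(seed)
    n_in = len(data[0])
    w1 = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(hidden)]
    b1 = [0.0] * hidden
    w2 = [rng.uniform(-1, 1) for _ in range(hidden)]
    b2 = 0.0

    def forward(x):
        h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(w1, b1)]
        out = sigmoid(sum(w * hj for w, hj in zip(w2, h)) + b2)
        return h, out

    acc = 0.0
    for _ in range(max_epochs):
        for x, y in zip(data, labels):
            h, out = forward(x)                      # S130: output value
            # S140: loss (y - out)^2; S150: gradients via backpropagation
            d_out = 2.0 * (out - y) * out * (1.0 - out)
            d_h = [d_out * w2[j] * h[j] * (1.0 - h[j]) for j in range(hidden)]
            for j in range(hidden):
                w2[j] -= lr * d_out * h[j]
                b1[j] -= lr * d_h[j]
                for k in range(n_in):
                    w1[j][k] -= lr * d_h[j] * x[k]
            b2 -= lr * d_out
        # Evaluation metric: training accuracy at a 0.5 threshold
        acc = sum((forward(x)[1] >= 0.5) == bool(y)
                  for x, y in zip(data, labels)) / len(data)
        if acc >= target_acc:
            break
    return (lambda x: forward(x)[1]), acc
```

The returned function maps one preprocessed feature vector to a probability; in practice accuracy would be measured on held-out data rather than on the training set used here.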
  • FIG. 6 is a flowchart illustrating a method for predicting cancer using a machine learning model (MLM) according to an embodiment of the present invention.
  • In step S210, the preprocessor 100 receives the analysis data of a patient under test whose cancer status is unknown, namely analysis data on the cfDNA concentration or amount, copy number variation (CNV) analysis data, and analysis data on the expression levels of tumor markers (Cyfra 21-1, CA 15-3, AFP, CEA, and CA 19-9). It preprocesses these data in step S220 and inputs the preprocessed analysis data into the machine learning model (MLM) in step S230.
  • In step S240, the machine learning model (MLM) calculates an output value indicating the probability that the patient under test has cancer by performing, on the input analysis data, a plurality of operations to which the weights between the plurality of layers are applied. For example, if the output of the output node o of the output layer OL is 0.84, the cancer occurrence probability is 84%.
  • In step S250, the prediction unit 300 may estimate whether cancer occurs according to the output value.
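The final estimation step can be sketched as a simple decision on the output value. The source does not state the decision rule, so the 0.5 threshold below is an assumption, and `estimate_cancer` is a hypothetical name.

```python
def estimate_cancer(output_value: float, threshold: float = 0.5):
    """Turn the model output (0.0 to 1.0) into a decision and a percentage.

    Assumption: cancer is estimated to occur when the output value
    is at or above the threshold; the source gives no cutoff.
    """
    if not 0.0 <= output_value <= 1.0:
        raise ValueError("output value must lie between 0.0 and 1.0")
    return output_value >= threshold, f"{output_value:.0%}"
```

An output of 0.84 is reported as an 84% probability and, under the assumed threshold, as a positive estimate.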
  • The cancer to be diagnosed may be one or more selected from lung cancer, head and neck cancer, breast cancer, ovarian cancer, liver cancer, testicular cancer, colorectal cancer, thyroid cancer, pancreatic cancer, cervical cancer, bladder cancer, digestive cancer, and gallbladder cancer.
  • The cancer diagnosis or prediction method of the present invention has improved sensitivity and accuracy even at a trace cfDNA fraction. According to an experimental example of the present invention, when cancer occurrence was assessed on artificial sequences using both Grail's liquid biopsy and the method of the present invention, the method of the present invention diagnosed or predicted cancer occurrence with higher sensitivity and accuracy than Grail's liquid biopsy.
  • Six ctDNA fractions (2%, 3%, 4%, 6%, 8%, and 10%) were applied to 30 cfDNA samples of the normal group to make a total of 180 samples. Samples were made for the whole of chromosomes 13 and 18 and also for shorter 30M, 20M, and 10M regions.
  • Using these samples, the sensitivity in diagnosing or predicting cancer occurrence was determined, with the second analysis data obtained using the stair-matrix, the first analysis data measured by Leewon Diagnostics Co., Ltd. (Korea), and the third analysis data obtained by request from the Leewon Medical Foundation (Korea). The results are shown in FIG. 7.

Abstract

The present invention relates to a method for diagnosing or predicting cancer and, specifically, to a method for diagnosing or predicting cancer, comprising steps in which: first analysis data on cell-free DNA (cfDNA) concentration, second analysis data on copy number variation (CNV), and third analysis data on tumor marker expression level are acquired; and a prediction unit analyzes the analysis data using a machine learning model trained to calculate a cancer occurrence probability through computation on the analysis data, and thereby diagnoses or predicts cancer occurrence.

Description

Method for diagnosing or predicting the occurrence of cancer
The present invention relates to a method for diagnosing or predicting cancer and, more particularly, to a method for diagnosing or predicting cancer comprising: acquiring first analysis data on cell-free DNA (cfDNA) concentration, second analysis data on copy number variation (CNV), and third analysis data on tumor marker expression level; and diagnosing or predicting whether cancer has occurred by having a prediction unit analyze the analysis data using a machine learning model trained to calculate a cancer occurrence probability through operations on the analysis data.
Cancer is the second leading cause of death worldwide and is a fatal disease for humans. To reduce the harm caused by cancer, early detection and monitoring for recurrence are as important as the development of anticancer drugs.
The current standard method for diagnosing cancer is tissue biopsy, in which tumor tissue is collected with an endoscope, a needle, or the like. However, because tissue biopsy is invasive, it burdens both doctors and patients and is time-consuming; moreover, even within tumor tissue, biological characteristics can differ depending on the sampling site, so the accuracy of the information can be a problem.
Liquid biopsy, devised to compensate for these problems of tissue biopsy, non-invasively diagnoses cancer or various other diseases using blood, urine, spinal fluid, and the like. It became known in 1869, when the Australian physician Thomas Ashworth reported that circulating tumor cells (CTCs) in the blood are associated with cancer metastasis. As techniques for isolating cancer cells from blood were completed in the early 2000s, liquid biopsy has attracted attention as a new alternative that can replace conventional invasive diagnosis with blood sampling and the like.
Among liquid biopsies, the most actively studied is blood biopsy, which uses blood to diagnose cancer-associated mutant genes and the like. In Korea, blood biopsy for EGFR gene mutations is available and is currently most widely used in lung cancer.
However, in liquid biopsy, the degree to which cells are detected in blood differs by cancer type, mutation expression rates differ, and a given biomarker may not be expressed at all in a particular cancer; the resulting uncertainty in diagnostic accuracy and in cancer biomarkers has emerged as a problem.
Meanwhile, multi-omics analysis refers to the holistic, integrated analysis of multiple data sets generated at various molecular levels, such as the genome, transcriptome, proteome, metabolome, epigenome, and lipidome.
Such multi-omics analysis has become possible with advances in high-throughput molecular biological analysis technology and in the information processing capacity brought by the computer industry. Because the integrated analysis of diverse data provides more information than any single analysis, it can contribute greatly to the diagnosis of, and the development of treatments for, diseases with complex causes such as cancer.
[Prior Art Literature]
[Patent Literature]
(Patent Document 0001) Korean Patent Publication No. 10-2012-0077568
[Non-Patent Literature]
(Non-Patent Document 0001) Heidi Schwarzenbach et al., Cell-free nucleic acids as biomarkers in cancer patients
The present inventors recognized the problems of the prior art and sought to develop a highly accurate method for diagnosing or predicting cancer. As a result, they demonstrated that a multi-omics analysis method that integrates specific data with a machine learning algorithm can achieve high sensitivity and specificity in diagnosing or predicting cancer, and thereby completed the present invention.
An object of the present invention is to provide a method for diagnosing or predicting cancer with high sensitivity and specificity.
To this end, the present invention provides a method for diagnosing or predicting cancer, comprising: acquiring first analysis data on cell-free DNA (cfDNA) concentration, second analysis data on copy number variation (CNV), and third analysis data on tumor marker expression level; and diagnosing or predicting whether cancer has occurred by having a prediction unit analyze the analysis data using a machine learning model trained to calculate a cancer occurrence probability through operations on the analysis data.
The first analysis data may be obtained by a method comprising: extracting blood from a subject; and measuring the amount of cfDNA detected per 1 ml of the extracted blood.
The second analysis data may be obtained by a method comprising: a) inputting sequence information of test nucleotide sequence fragments of cell-free DNA (cfDNA) from blood isolated from a subject; b) aligning the sequence information of the test nucleotide sequence fragments to positions of homology by comparison with a human reference genome sequence (reference genome database); c) segmenting a target chromosome among the test nucleotide sequence fragments into segments of a predetermined size and increasing the segment size to generate a normalized two-dimensional matrix whose rows and columns represent segment size and position, respectively; d) calculating Z-score values for the generated two-dimensional matrix to form a two-dimensional matrix of Z-score values; e) selecting the lowest Z-score value from among the Z-score values that are lower than the Z-score value calculated from the target chromosome of the reference genome sequence fragment; and f) identifying the position and size of the copy number variation (CNV) by checking, stepwise from the row and column containing the lowest Z-score value down to the row with the smallest segment size, the positions of segments at which the Z-score value gradually increases.
Step a) may comprise: a-1) separating plasma by centrifuging blood collected from the subject; a-2) extracting cell-free DNA (cfDNA) from the separated plasma; a-3) preparing a library using the extracted cfDNA; and a-4) pooling the library and then reading the nucleotide sequence by next generation sequencing (NGS).
Step c) may comprise repeatedly performing, for n from 1 to 24, a step c)-n of generating an n-th row by segmenting the chromosome into segments of size 0.5 Mb * n.
In step d), the Z-score values for all segment sizes of the two-dimensional matrix may be calculated using the mean and standard deviation (SD) values obtained by normalizing samples of a control group having normal chromosomes.
Step e) may select the portion showing the lowest Z-score value among the Z-score values lower than the reference value, and step f) may identify the copy number variation by checking, stepwise from the row and column containing the lowest Z-score value down to the row with the smallest segment size, the positions of segments at which the Z-score value gradually increases.
The Z-score value may be calculated according to Equation 1 below, using the mean and standard deviation (SD) at position (xy) of the two-dimensional matrix calculated from the target chromosome of the reference genome sequence fragment (reference).
[Equation 1]
Figure PCTKR2021003531-appb-img-000001
(where Z=Z-score, M=matrix, cor.gc=correctable gc percent, xy=location of matrix)
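Equation 1 survives in this text only as an image reference. Based on the legend and on the statement that the Z-score uses the mean and standard deviation at matrix position xy calculated from the reference, one plausible reading, assuming the ordinary z-score form, is given below. The subscripts "test" and "ref" are labels introduced here for illustration and are not the patent's notation.

```latex
Z_{xy} \;=\; \frac{M^{\mathrm{cor.gc}}_{\mathrm{test},\,xy} \;-\; \operatorname{mean}\!\left(M^{\mathrm{cor.gc}}_{\mathrm{ref},\,xy}\right)}{\operatorname{SD}\!\left(M^{\mathrm{cor.gc}}_{\mathrm{ref},\,xy}\right)}
```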
The third analysis data may be obtained by a method comprising: obtaining a biological sample from a subject; and measuring the concentration of a tumor marker in the biological sample.
The tumor markers may be Cyfra 21-1, CA 15-3, AFP, CEA, and CA 19-9.
The biological sample may be isolated from blood, plasma, or serum.
The diagnosing or predicting step may comprise: calculating, by the machine learning model, an output value indicating the probability that the patient under examination has cancer, by performing on the analysis data a plurality of operations to which the weights between the plurality of layers of the machine learning model are applied; and estimating, by the prediction unit, whether cancer has occurred according to the output value.
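The two sub-steps above (layer-by-layer weighted operations producing a probability, then an estimate from that probability) can be sketched roughly as follows. This is only an illustration under stated assumptions: the ReLU hidden activation, the sigmoid output node, the 0.5 decision threshold, the layer sizes, and the random weights are all choices made here, not values given in the specification.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_probability(x, weights, biases):
    """Forward pass through a small multilayer perceptron."""
    a = np.asarray(x, dtype=float)
    # Hidden layers: weighted sum followed by ReLU (assumed activation).
    for W, b in zip(weights[:-1], biases[:-1]):
        a = np.maximum(0.0, W @ a + b)
    # Output layer: one sigmoid node (assumed) yielding a probability.
    z = weights[-1] @ a + biases[-1]
    return float(sigmoid(z)[0])

def estimate_cancer(probability, threshold=0.5):
    """Turn the output value into a yes/no estimate (threshold is hypothetical)."""
    return probability >= threshold

# Untrained, randomly initialized weights purely for demonstration.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 7)), rng.normal(size=(1, 4))]
biases = [np.zeros(4), np.zeros(1)]
x = [0.9, 1.0, 0.2, 0.1, 0.3, 0.4, 0.5]  # placeholder feature values
p = predict_probability(x, weights, biases)
print(p, estimate_cancer(p))
```

In practice the weights would come from the training procedure described in the specification rather than from a random generator.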
The method may further comprise, before the acquiring step: preparing, by a learning unit, training data, which is analysis data of patients whose cancer status is known; setting, by the learning unit, a label on the training data according to whether cancer has occurred; calculating, by the machine learning model before training is complete, an output value indicating the probability of cancer occurrence by performing on the training data a plurality of operations to which the weights between the plurality of layers of the machine learning model are applied; calculating, by the learning unit, a loss that is the difference between the output value and the label; and performing, by the learning unit, an optimization that modifies the weights (w) of the machine learning model (MLM) using a backpropagation algorithm so that the loss is minimized.
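The training procedure above (forward pass, loss against the label, backpropagation, weight update) can be sketched as a minimal NumPy loop. The specific loss used in the specification is not recoverable from this text, so mean squared error is assumed here; the sigmoid activations, layer sizes, learning rate, and the synthetic data are likewise illustrative assumptions and not patient data.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, hidden=4, lr=0.5, epochs=500, seed=0):
    """Train a one-hidden-layer MLP by full-batch gradient descent."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.5, size=(X.shape[1], hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(scale=0.5, size=(hidden, 1))
    b2 = np.zeros(1)
    losses = []
    for _ in range(epochs):
        # Forward pass: output value f(x_i) for every training sample.
        h = sigmoid(X @ W1 + b1)
        out = sigmoid(h @ W2 + b2)[:, 0]
        # Loss: mean squared difference between output and label (assumed form).
        err = out - y
        losses.append(float(np.mean(err ** 2)))
        # Backpropagation: chain rule from the loss back to each weight.
        d_z2 = (2.0 / len(y)) * err * out * (1.0 - out)
        gW2 = h.T @ d_z2[:, None]
        gb2 = np.array([d_z2.sum()])
        d_z1 = (d_z2[:, None] @ W2.T) * h * (1.0 - h)
        gW1 = X.T @ d_z1
        gb1 = d_z1.sum(axis=0)
        # Weight update in the direction that reduces the loss.
        W2 -= lr * gW2
        b2 -= lr * gb2
        W1 -= lr * gW1
        b1 -= lr * gb1
    return (W1, b1, W2, b2), losses

# Synthetic, made-up data: label is 1 when the first two features sum above 0.
data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(40, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(float)
params, losses = train(X, y)
print(losses[0], losses[-1])
```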
In the step of calculating the loss, the learning unit may calculate the loss according to the loss function
Figure PCTKR2021003531-appb-img-000002
where L denotes the loss, i is the index corresponding to the training data, x i is the training data that is input to the machine learning model, y i is the label representing the expected value for the i-th training data, and f(x i) is the output value calculated by the machine learning model for the i-th training data.
In the diagnosing or predicting step, the machine learning model may be formed as a multilayer perceptron (MLP).
The machine learning model may include an input layer (IL), a plurality of hidden layers (HL: HL1 to HLk), and an output layer (OL).
Each of the plurality of layers may include at least one node.
Every node of the plurality of layers may have an operation, and the operation may be performed through an activation function (F).
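As a rough illustration of a single node's operation, the sketch below computes a weighted sum of the node's inputs plus a bias and passes it through an activation function F. The weighted-sum form is the conventional perceptron node and is assumed here; the function names, the bias term, and the two example activations are illustrative choices, since the specification does not name a particular activation function in this passage.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    return max(0.0, z)

def node_output(inputs, weights, bias, activation):
    """Weighted sum of inputs plus bias, passed through activation F."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

# Same node evaluated with two different (assumed) activation functions.
print(node_output([1.0, 2.0], [0.5, -0.25], 0.0, sigmoid))
print(node_output([1.0, 2.0], [1.0, 1.0], -1.0, relu))
```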
The cancer may be one or more selected from the group consisting of lung cancer, head and neck cancer, breast cancer, ovarian cancer, liver cancer, testicular cancer, colorectal cancer, thyroid cancer, pancreatic cancer, cervical cancer, bladder cancer, digestive cancer, and gallbladder cancer.
The method for diagnosing or predicting cancer of the present invention can reduce the likelihood of false positives and false negatives and has excellent sensitivity and specificity, so it can be usefully employed in the diagnosis or prediction of cancer.
FIG. 1 is a diagram for explaining the configuration of a prediction apparatus for predicting cancer according to an embodiment of the present invention.
FIG. 2 is a diagram for explaining the schematic configuration of a machine learning model (MLM) executed in the prediction apparatus for predicting cancer according to an embodiment of the present invention.
FIG. 3 is a diagram for explaining the detailed configuration of a machine learning model (MLM) formed as a multilayer perceptron (MLP) according to an embodiment of the present invention.
FIG. 4 is a diagram for explaining the configuration of any one node of the machine learning model (MLM) according to an embodiment of the present invention.
FIG. 5 is a diagram for explaining a method of training the machine learning model (MLM) according to an embodiment of the present invention.
FIG. 6 is a flowchart for explaining a method for predicting cancer using the machine learning model (MLM) according to an embodiment of the present invention.
FIG. 7 is a diagram showing the difference in cancer diagnosis sensitivity among the method according to an embodiment of the present invention, the use of CNV data alone, and the use of Grail's liquid biopsy.
Prior to the detailed description of the present invention, the terms and words used in this specification and the claims should not be construed as limited to their ordinary or dictionary meanings; they should be interpreted with meanings and concepts consistent with the technical idea of the present invention, based on the principle that an inventor may appropriately define the concepts of terms in order to describe his or her own invention in the best way. Accordingly, the embodiments described in this specification and the configurations shown in the drawings are merely the most preferred embodiments of the present invention and do not represent all of its technical ideas, and it should be understood that various equivalents and modifications that could replace them may exist at the time of filing of the present application.
Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings. Note that the same components are denoted by the same reference numerals in the accompanying drawings wherever possible. Detailed descriptions of well-known functions and configurations that could obscure the gist of the present invention are omitted. For the same reason, some components in the accompanying drawings are exaggerated, omitted, or schematically illustrated, and the size of each component does not entirely reflect its actual size.
The present invention provides a method for diagnosing or predicting cancer comprising the following steps:
acquiring first analysis data on cell-free DNA (cfDNA) concentration, second analysis data on copy number variation (CNV), and third analysis data on tumor marker expression level; and
diagnosing or predicting whether cancer has occurred by having a prediction unit analyze the analysis data using a machine learning model trained to calculate a cancer occurrence probability through operations on the analysis data.
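To make the acquiring step concrete, the sketch below shows one way the three kinds of analysis data could be gathered into a single input vector for the model. The encoding is an assumption made here for illustration: the specification does not state how the data are encoded, so the class name, field names, the boolean CNV flag, and the example values are all hypothetical.

```python
from dataclasses import dataclass

# Marker panel named in the specification's embodiment.
MARKERS = ("Cyfra 21-1", "CA 15-3", "AFP", "CEA", "CA 19-9")

@dataclass
class AnalysisData:
    cfdna_ng_per_ml: float  # first analysis data: cfDNA concentration
    cnv_detected: bool      # second analysis data: CNV result (assumed flag)
    marker_levels: dict     # third analysis data: marker name -> level

def to_feature_vector(data):
    """Flatten the three analysis data into one list of floats."""
    features = [data.cfdna_ng_per_ml, 1.0 if data.cnv_detected else 0.0]
    features += [float(data.marker_levels[name]) for name in MARKERS]
    return features

# Made-up placeholder values, not measurements.
sample = AnalysisData(
    cfdna_ng_per_ml=8.1,
    cnv_detected=True,
    marker_levels={"Cyfra 21-1": 1.0, "CA 15-3": 2.0, "AFP": 3.0,
                   "CEA": 4.0, "CA 19-9": 5.0},
)
vector = to_feature_vector(sample)
print(vector)
```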
In the method for diagnosing or predicting cancer of the present invention, the first analysis data is data obtained by measuring the cell-free DNA (cfDNA) concentration.
In the present invention, "cfDNA (cell-free DNA)" refers to small DNA fragments that, as a result of cell death, circulate outside cells in various body fluids including blood; in a healthy person, about 10 to 30 ng of cfDNA is present in 4 ml of plasma. Among cfDNA, the cfDNA derived from cancer cells is referred to as ctDNA (circulating tumor DNA). In cancer patients, the amount of ctDNA increases as the cancer tissue grows, and the amount of cfDNA also tends to increase.
In the present invention, the first analysis data may be obtained by a method comprising:
extracting blood from a subject; and
measuring the amount of cfDNA detected per 1 ml of the extracted blood;
but is not limited thereto, and may be measured by a known method or commercial kit capable of measuring cfDNA concentration, or obtained by commissioning a relevant institution capable of such measurement.
In the present invention, the result of the first analysis data may indicate whether the cfDNA concentration is high or low relative to a reference cfDNA concentration of 7.4 ng/ml, but is not limited thereto.
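The per-ml measurement and the comparison against the 7.4 ng/ml reference can be sketched as follows. The function names are hypothetical, and treating a concentration exactly equal to the reference as "low" is an assumption, since the text does not say how the boundary case is handled.

```python
CFDNA_REFERENCE_NG_PER_ML = 7.4  # reference value stated in the text

def cfdna_concentration(cfdna_ng, blood_ml):
    """Amount of cfDNA per 1 ml of extracted blood."""
    if blood_ml <= 0:
        raise ValueError("blood volume must be positive")
    return cfdna_ng / blood_ml

def cfdna_level(concentration, reference=CFDNA_REFERENCE_NG_PER_ML):
    """Report 'high' above the reference, otherwise 'low' (boundary assumed low)."""
    return "high" if concentration > reference else "low"

conc = cfdna_concentration(30.0, 4.0)  # illustrative input amounts
print(conc, cfdna_level(conc))
```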
In the method for diagnosing or predicting cancer of the present invention, the second analysis data is data on (or confirming) copy number variation (CNV).
In the present invention, the second analysis data may be obtained by a non-invasive method comprising the following steps:
a) inputting sequence information of test nucleotide sequence fragments of cell-free DNA (cfDNA) from blood isolated from a subject;
b) aligning the sequence information of the test nucleotide sequence fragments to positions of homology by comparison with a human reference genome sequence (reference genome database);
c) segmenting a target chromosome among the test nucleotide sequence fragments into segments of a predetermined size and increasing the segment size to generate a normalized two-dimensional matrix whose rows and columns represent segment size and position, respectively;
d) calculating Z-score values for the generated two-dimensional matrix to form a two-dimensional matrix of Z-score values;
e) selecting the lowest Z-score value from among the Z-score values that are lower than the Z-score value calculated from the target chromosome of the reference genome sequence fragment; and
f) identifying the position and size of the copy number variation (CNV) by checking, stepwise from the row and column containing the lowest Z-score value down to the row with the smallest segment size, the positions of segments at which the Z-score value gradually increases.
In the present invention, "copy number variation (CNV)" is a genetic change corresponding to structural variation of the human genome and refers to a DNA segment whose number of repeated copies differs from that in the standard human reference genome, for example through deletion (0n or 1n state) or amplification (3n or higher state), unlike ordinary sequences, which exist in the 2n form.
In the present invention, "subject" includes any human or non-human animal. The "non-human animal" may be a vertebrate, such as a non-human primate, sheep, or dog, or a rodent such as a mouse, rat, or guinea pig. The subject is preferably a human, and specifically may be a human who has or is expected to have tumor cells. In the present invention, the term "subject" may be used interchangeably with "individual" or "patient".
In the present invention, step a) may comprise:
a-1) separating plasma by centrifuging blood collected from the subject;
a-2) extracting cell-free DNA (cfDNA) from the separated plasma;
a-3) preparing a library using the extracted cfDNA; and
a-4) pooling the library and then reading the nucleotide sequence by next generation sequencing (NGS).
Step a-4) encompasses any analysis method based on massively parallel sequencing, in which genomic DNA is divided into a very large number of fragments, each fragment is read simultaneously to obtain nucleotide sequence data, and the sequence data of the fragments are assembled using various bioinformatics techniques to decode the genome.
The nucleotide sequence may be obtained, for example, on a next-generation sequencing platform selected from Roche 454 (i.e., Roche 454 GS FLX), the SOLiD system from Applied Biosystems (i.e., SOLiDv4), the GAIIx, HiSeq 2500, and MiSeq sequencers from Illumina, the Ion Torrent semiconductor sequencing platform from Life Technologies, the PacBio RS from Pacific Biosciences, and the Sanger 3730xl; preferably on the Ion Torrent platform from Life Technologies or the MiSeq from Illumina; and more preferably on the Ion Torrent Personal Genome Machine (Ion Torrent PGM) from Life Technologies.
In the present invention, the chromosome in step c) is an autosome, and may be one or more chromosomes selected from the autosomes consisting of chromosomes 1 to 22.
In the present invention, step c) may comprise repeatedly performing, for n from 1 to 24, a step c)-n of generating an n-th row by segmenting the chromosome into segments of size 0.5 Mb * n.
Specifically, the chromosome may be segmented into 0.5 Mb segments to generate the first row, then segmented into 1 Mb segments (0.5 Mb * 2) to generate the second row, and so on, adding rows while increasing the segment size by 0.5 Mb each time, to generate a two-dimensional matrix whose rows and columns represent segment size and position. For example, the segment length for one chromosome is increased from 0.5 Mb to 12 Mb in 0.5 Mb increments to generate a two-dimensional matrix containing the segments of each length; one matrix contains segments of only one length, and a total of 24 matrices, each composed of segments of one length, may be generated. (Because copy number variation is identified by checking the Z-scores step by step (stair) in the generated two-dimensional matrix representing segment size and position, this is also referred to herein as a stair-matrix.)
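The matrix construction described above can be sketched as follows, under several assumptions that the text does not settle: read counts are taken to be already binned at 0.5 Mb resolution, row n is taken to hold sums over n consecutive bins starting at each bin position (so that columns line up by start position), cells whose window would run past the chromosome end are left as NaN, and normalization is done by dividing by the total count. The function name and input are hypothetical.

```python
import numpy as np

def build_stair_matrix(bin_counts, max_rows=24):
    """Build a (segment size) x (start position) matrix from 0.5 Mb bin counts.

    Row n-1 holds, for each start bin, the normalized read count summed
    over n consecutive 0.5 Mb bins (segment size 0.5 Mb * n).
    """
    counts = np.asarray(bin_counts, dtype=float)
    n_bins = counts.size
    # Prefix sums let every window sum be computed as a difference.
    csum = np.concatenate(([0.0], np.cumsum(counts)))
    matrix = np.full((max_rows, n_bins), np.nan)
    for n in range(1, max_rows + 1):
        n_valid = n_bins - n + 1  # windows that fit inside the chromosome
        if n_valid <= 0:
            break
        matrix[n - 1, :n_valid] = csum[n:] - csum[:n_valid]
    # Normalize by total reads so samples of different depth are comparable.
    return matrix / counts.sum()

m = build_stair_matrix([1, 2, 3, 4], max_rows=3)  # toy counts
print(m)
```

With real data each chromosome would contribute far more bins than this toy input, and the default of 24 rows corresponds to segment sizes from 0.5 Mb to 12 Mb.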
In the present invention, in step d), the Z-score values for all segment sizes of the two-dimensional matrix are calculated using the mean and standard deviation (SD) values obtained by normalizing samples of a control group having normal chromosomes.
The Z-score value is calculated according to Equation 1 below, using the mean and standard deviation (SD) at position (xy) of the two-dimensional matrix calculated from the target chromosome of the reference genome sequence fragment (reference).
[Equation 1]
Figure PCTKR2021003531-appb-img-000003
(where Z=Z-score, M=matrix, cor.gc=correctable gc percent, xy=location of matrix)
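Since the equation itself is only an image reference here, the sketch below assumes the ordinary z-score form implied by the surrounding text: each cell of the test matrix is compared with the mean and standard deviation of the same cell across control matrices. The use of the sample standard deviation (ddof=1), the function name, and the toy numbers are assumptions; GC correction is taken to have been applied upstream.

```python
import numpy as np

def zscore_matrix(test_matrix, control_matrices):
    """Cell-wise Z-score of a test matrix against a stack of control matrices.

    control_matrices has shape (n_controls, rows, cols); test_matrix has
    shape (rows, cols). Cells with zero control SD come out as NaN or inf.
    """
    controls = np.asarray(control_matrices, dtype=float)
    mean = controls.mean(axis=0)
    sd = controls.std(axis=0, ddof=1)  # sample SD (an assumption)
    with np.errstate(divide="ignore", invalid="ignore"):
        return (np.asarray(test_matrix, dtype=float) - mean) / sd

# Toy example: three control matrices and one test matrix, each 1 x 2.
controls = [[[0.0, 10.0]], [[1.0, 10.0]], [[2.0, 10.0]]]
z = zscore_matrix([[-1.0, 10.0]], controls)
print(z)
```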
In the present invention, step e) selects the portion showing the lowest Z-score value among the Z-score values lower than the reference value, and step f) identifies the copy number variation by checking, stepwise from the row and column containing the lowest Z-score value down to the row with the smallest segment size, the positions of segments at which the Z-score value gradually increases.
In the present invention, the Z-score value calculated from the target chromosome of the reference genome sequence fragment may also be defined as the "reference value", and refers to a Z-score value obtained from a sample determined to be normal.
In the present invention, the lowest Z-score value in step e) may be determined by displaying the chromosome as a heat map, in which the most darkly shaded segment is judged to have the lowest Z-score value.
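Steps e) and f) can be sketched as a search for the minimum followed by a walk down the rows. This is one possible reading, not the patent's confirmed procedure: the reference value of -3.0 is hypothetical, and the rule that a segment of n bins starting at column c is refined by the lower-scoring of its two sub-segments of n-1 bins (starting at c or c+1) is an assumption made for illustration.

```python
import numpy as np

def find_cnv(z, reference_z=-3.0, bin_mb=0.5):
    """Locate the lowest Z-score below the reference and trace the stair."""
    z = np.asarray(z, dtype=float)
    # Step e): keep only cells below the reference value, then take the minimum.
    candidates = np.where(z < reference_z, z, np.nan)
    if np.all(np.isnan(candidates)):
        return None  # nothing below the reference: no CNV call
    row, col = np.unravel_index(np.nanargmin(candidates), z.shape)
    path = [(int(row), int(col), float(z[row, col]))]
    # Step f): step down one row at a time toward the smallest segment size.
    r, c = int(row), int(col)
    while r > 0:
        options = [(float(z[r - 1, k]), k) for k in (c, c + 1)
                   if k < z.shape[1] and not np.isnan(z[r - 1, k])]
        if not options:
            break
        value, c = min(options)
        r -= 1
        path.append((r, c, value))
    return {
        "start_mb": float(col) * bin_mb,       # position of the CNV
        "size_mb": float(row + 1) * bin_mb,    # size of the CNV
        "path": path,
        # True when Z rises (or holds) at every step toward smaller segments.
        "is_stair": all(b[2] >= a[2] for a, b in zip(path, path[1:])),
    }

nan = float("nan")
z_demo = [[-1.0, -4.0, -5.0, -1.0],
          [-2.0, -8.0, -2.0, nan],
          [-3.0, -4.0, nan, nan]]  # toy Z-score matrix
result = find_cnv(z_demo)
print(result)
```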
In the method for diagnosing or predicting cancer of the present invention, the third analysis data is data on the expression level of tumor markers.
In the present invention, the third analysis data may be obtained by a method comprising:
obtaining a biological sample; and
measuring the concentration of a tumor marker in the biological sample;
but is not limited thereto, and may be measured and obtained by a known method or commercial kit capable of measuring tumor marker expression levels, or obtained by commissioning a relevant institution capable of such measurement.
In the present invention, "tumor marker" means a biomarker detected in blood, urine, or tissue specimens, being a proteinaceous substance produced by cancer cells in response to cancer growth or by surrounding normal tissue in response to cancer tissue. Although cancer is diagnosed in clinical practice using some established tumor markers, most tumor markers can also be elevated in non-neoplastic conditions, so a specific tumor marker alone is not suitable for diagnosing cancer.
In the present invention, the tumor marker may be one or more selected from the group consisting of Cyfra 21-1, CA 15-3 (cancer antigen 15-3), AFP (alpha-fetoprotein), CEA (carcinoembryonic antigen), CA 19-9 (cancer antigen 19-9), B2M (beta-2 microglobulin), CA-125 (cancer antigen 125), calcitonin, Chromogranin A (CgA), hCG (human chorionic gonadotropin), monoclonal immunoglobulin, PSA (prostate-specific antigen), and thyroglobulin. Preferably, the tumor marker may be one or more selected from the group consisting of Cyfra 21-1, CA 15-3, AFP, CEA, and CA 19-9. According to an embodiment of the present invention, the tumor marker may be a tumor marker panel consisting of Cyfra 21-1, CA 15-3, AFP, CEA, and CA 19-9, but is not limited thereto.
In the present invention, the result of the third analysis data may indicate a low-risk or high-risk level according to the following criteria, but is not limited thereto.
- Cyfra 21-1: ≤3.30 ng/ml
- CA 15-3: ≤26.40 U/ml
- AFP: ≤7.0 ng/ml
- CEA: non-smokers ≤3.8 ng/ml, smokers ≤5.5 ng/ml
- CA 19-9: ≤34.00 U/ml
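The criteria above can be applied mechanically, as in the sketch below. It assumes that a value at or below the listed cutoff is read as low risk and a value above it as high risk. The text lists the cutoffs but does not spell out this mapping, and the function and variable names are illustrative.

```python
# Cutoffs taken from the criteria listed above.
CUTOFFS = {
    "Cyfra 21-1": 3.30,  # ng/ml
    "CA 15-3": 26.40,    # U/ml
    "AFP": 7.0,          # ng/ml
    "CA 19-9": 34.00,    # U/ml
}
CEA_CUTOFF = {False: 3.8, True: 5.5}  # ng/ml, keyed by smoker status


def marker_risk(values, smoker):
    """Map each measured marker to 'low' or 'high' (assumed reading of the cutoffs)."""
    cutoffs = dict(CUTOFFS, CEA=CEA_CUTOFF[smoker])
    return {
        name: ("low" if values[name] <= cutoffs[name] else "high")
        for name in cutoffs
        if name in values
    }


# Hypothetical measurements for one subject.
measured = {"Cyfra 21-1": 2.1, "AFP": 9.4, "CEA": 4.5}
risk_non_smoker = marker_risk(measured, smoker=False)
risk_smoker = marker_risk(measured, smoker=True)
```

The same CEA value of 4.5 ng/ml falls above the non-smoker cutoff but below the smoker cutoff, which is why smoker status is passed explicitly.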
In the present invention, the biological sample may be a material collected from the subject's body and, given that the present invention uses a non-invasive method, may preferably be blood, plasma, or serum.
Next, the step in which the prediction unit diagnoses or predicts whether cancer has occurred, by analyzing the analysis data using a machine learning model trained to calculate a cancer occurrence probability through operations on the analysis data, will be described.
FIG. 1 is a diagram for explaining the configuration of a prediction device for predicting cancer according to an embodiment of the present invention. Referring to FIG. 1, the prediction device (10) may be a device that temporarily stores data and processes the data according to a predetermined clock, and may be a CPU (central processing unit), a GPU (graphics processing unit), an NPU (neural processing unit), or the like. The prediction device (10) includes a preprocessing unit (100), a learning unit (200), and a prediction unit (300).
The preprocessing unit (100) preprocesses the data input to the machine learning model (MLM), that is, the analysis data or the training data. This data may be data on cfDNA concentration or amount (input amount), copy number variation (CNV), and the expression levels of tumor markers (Cyfra 21-1, CA 15-3, AFP, CEA, and CA 19-9). That is, the preprocessing unit (100) may remove outliers from the cfDNA amount, CNV, tumor marker, and other data, and normalize the data.
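The text does not say which outlier rule or normalization the preprocessing unit (100) uses. The sketch below therefore assumes two common choices, an interquartile-range filter and min-max scaling, purely for illustration. The function names and sample values are hypothetical.

```python
import statistics


def remove_outliers(values, k=1.5):
    """Drop values outside [Q1 - k*IQR, Q3 + k*IQR] (assumed outlier rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if low <= v <= high]


def min_max_normalize(values):
    """Scale values linearly to the range 0.0 to 1.0 (assumed normalization)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # avoid division by zero on constant input
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical cfDNA amounts; the last value is an obvious outlier.
cfdna_amounts = [1.0, 2.0, 3.0, 4.0, 100.0]
cleaned = remove_outliers(cfdna_amounts)
scaled = min_max_normalize(cleaned)
```

The same two steps would be applied per feature, so that cfDNA amount, CNV, and each tumor marker reach the model on a comparable scale.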
The learning unit (200) trains the machine learning model (MLM). The operation of the learning unit (200) is described in more detail below.
The prediction unit (300) calculates a cancer occurrence probability (Positive Predict Probability: 0.0 to 1.0) using the machine learning model (MLM) generated by the learning unit (200), and may predict cancer occurrence according to the calculated probability. The operation of the prediction unit (300) is described in more detail below.
FIG. 2 is a diagram for explaining the schematic configuration of the machine learning model (MLM) executed in the prediction device for predicting cancer according to an embodiment of the present invention, FIG. 3 is a diagram for explaining the detailed configuration of a machine learning model (MLM) formed as a multilayer perceptron (MLP) according to an embodiment of the present invention, and FIG. 4 is a diagram for explaining the configuration of any one node of the machine learning model (MLM) according to an embodiment of the present invention. Referring to FIGS. 2 to 4, the machine learning model (MLM) may be a multilayer perceptron (MLP). The machine learning model (MLM) includes a plurality of layers (IL, HL, OL): an input layer (IL), a plurality of hidden layers (HL: HL1 to HLk), and an output layer (OL).
Each of the plurality of layers (IL, HL, OL) includes at least one node. For example, as illustrated, the input layer (IL) may include n input nodes (i1 to in), and the output layer (OL) may include one output node (o). There may be k hidden layers (HL1, HL2, ..., HLk). Among the hidden layers (HL), the first hidden layer (HL1) may include a hidden nodes (h11 to h1a), the second hidden layer (HL2) may include b hidden nodes (h21 to h2b), and the k-th hidden layer (HLk) may include z hidden nodes (hk1 to hkz).
Every node of every layer performs an operation, which is carried out through an activation function (F). Nodes of different layers are connected by channels (shown as dotted lines) that carry weights (w). In other words, the operation result of one node has a weight applied to it and becomes an input to a node of the next layer.
That is, as shown in FIG. 4, any one node (N) of any one layer of the machine learning model (MLM) receives the inputs (x1, x2, ..., xn) from the nodes of the previous layer with the weights (w1, w2, ..., wn) applied, sums them, and applies the activation function (F) to produce the operation result (OUT). This result (OUT) is passed on as an input to the next layer. The parameter b, not yet described, denotes a threshold: when the sum of the weighted inputs does not reach the threshold (b), the operation result of the node is deactivated.
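The single-node computation of FIG. 4 can be sketched as below. The text does not name the activation function (F), so a sigmoid is assumed here, and the threshold b is modelled by subtracting it from the weighted sum. This is a smooth stand-in for the deactivation described above, not necessarily the form used in the embodiment.

```python
import math


def sigmoid(z):
    """Logistic activation, assumed here for F."""
    return 1.0 / (1.0 + math.exp(-z))


def node_output(inputs, weights, threshold):
    """Compute OUT = F(sum(w_i * x_i) - b) for one node."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return sigmoid(weighted_sum - threshold)


# Hypothetical inputs and weights: the weighted sum equals the threshold here.
at_threshold = node_output([1.0, 2.0], [0.5, 0.25], threshold=1.0)
above_threshold = node_output([1.0, 2.0], [1.0, 1.0], threshold=1.0)
```

With this choice the node output passes its midpoint exactly when the weighted sum reaches b, and falls toward zero as the sum drops further below it.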
Accordingly, the machine learning model (MLM) calculates an output value (0.0 to 1.0) by performing, on the input data, a plurality of operations to which the weights between the plurality of layers (IL, HL, OL) are applied. That is, the weighted operations are performed on the data from the input layer (IL) through the plurality of hidden layers (HL1, HL2, ..., HLk) to the output layer (OL). The value of the output node (o) of the output layer (OL) is the output value, and it represents the probability (0.0 to 1.0) that cancer occurs. For example, an output value of 0.45 at the output node (o) indicates a 45% probability of cancer occurrence.
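A forward pass through such a network can be sketched as follows. The layer sizes, weights, and the sigmoid activation are assumptions chosen for illustration. The only properties taken from the text are the layer-by-layer weighted operations and the single output node whose value lies between 0.0 and 1.0.

```python
import math


def sigmoid(z):
    """Logistic activation (assumed); keeps every node output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, layers):
    """Propagate input x through the layers and return the output node value.

    `layers` is a list of (weights, thresholds) pairs, where weights[j] holds
    the weights feeding node j and thresholds[j] is that node's threshold.
    """
    activations = list(x)
    for weights, thresholds in layers:
        activations = [
            sigmoid(sum(w * a for w, a in zip(node_w, activations)) - b)
            for node_w, b in zip(weights, thresholds)
        ]
    return activations[0]  # the output layer has a single node (o)


# Hypothetical network: 3 inputs -> 2 hidden nodes -> 1 output node.
layers = [
    ([[0.8, -0.4, 0.3], [0.1, 0.9, -0.6]], [0.2, -0.1]),
    ([[1.5, -1.1]], [0.3]),
]
probability = forward([0.7, 0.2, 0.9], layers)
```

Because the final activation is a sigmoid, `probability` can be read directly as the cancer occurrence probability described above.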
Next, the training method of the machine learning model (MLM) described above will be explained. FIG. 5 is a diagram for explaining the training method of the machine learning model (MLM) according to an embodiment of the present invention.
Referring to FIG. 5, the learning unit (200) prepares training data in step S110. Training data means data for which the cancer status is known. Specifically, the training data comprises data on the cfDNA concentration (or amount), data confirming copy number variation (CNV), and data on the expression levels of tumor markers (Cyfra 21-1, CA 15-3, AFP, CEA, and CA 19-9) of patients whose cancer status is known. That is, the training data includes an experimental group, which is data from patients in whom cancer has occurred, and a control group, which is data from patients in whom cancer has not occurred.
Next, the learning unit (200) assigns labels to the training data in step S120. According to an embodiment, the learning unit (200) may use one-hot encoding, setting the label to 1 for the experimental group and to 0 for the control group.
Then, when the learning unit (200) inputs the training data to the machine learning model (MLM), the machine learning model (MLM) in step S130 performs, on the input training data, a plurality of operations to which the weights between the plurality of layers are applied, and calculates an output value representing the probability of cancer.
The learning unit (200) then calculates the loss in step S140 using a loss function such as Equation 2 below.
[Equation 2]
Figure PCTKR2021003531-appb-img-000004
In Equation 2, L denotes the loss (L2 loss) and i is the index of the training data. y_i is the label indicating the expected value for the i-th training data (x_i), and f(x_i) is the output value calculated by the machine learning model (MLM) for the i-th training data (x_i).
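The equation image itself is not reproduced in this text. One common form of an L2 loss that is consistent with the symbols defined above is given below. The summation over N training samples and the absence of a normalizing constant are assumptions.

```latex
L = \sum_{i=1}^{N} \bigl( y_i - f(x_i) \bigr)^{2}
```

With labels of 0 or 1 and outputs between 0.0 and 1.0, each term lies between 0 and 1, and L is zero only when every output equals its label.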
Then, in step S150, the learning unit (200) performs an optimization that modifies the weights (w) of the machine learning model (MLM) so that the loss, which is the difference between the output value of the machine learning model (MLM) and the label, is minimized. A back-propagation algorithm may be used for this optimization.
Steps S110 to S150 described above are performed repeatedly using a plurality of different training data. The repetition may continue, with the accuracy calculated through an evaluation metric, until the desired accuracy is reached.
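Steps S110 to S150 can be sketched end to end as below. This is a toy illustration under several assumptions: one hidden layer, sigmoid activations, a summed squared-error loss, per-sample gradient descent, a fixed number of epochs instead of an accuracy-based stopping rule, and made-up training values that are already normalized. None of these details are specified in the text.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w1, t1, w2, t2):
    """Return (hidden activations, output) for one sample."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) - t) for row, t in zip(w1, t1)]
    out = sigmoid(sum(w * hj for w, hj in zip(w2, h)) - t2)
    return h, out


def train(samples, labels, hidden=3, lr=0.5, epochs=2000, seed=0):
    rng = random.Random(seed)
    n_in = len(samples[0])
    w1 = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(hidden)]
    t1 = [0.0] * hidden                      # hidden-node thresholds
    w2 = [rng.uniform(-1.0, 1.0) for _ in range(hidden)]
    t2 = 0.0                                 # output-node threshold
    losses = []
    for _ in range(epochs):
        epoch_loss = 0.0
        for x, y in zip(samples, labels):
            h, out = forward(x, w1, t1, w2, t2)           # S130: output value
            epoch_loss += (y - out) ** 2                  # S140: squared error
            # S150: back-propagate the error and adjust weights and thresholds.
            g_out = -2.0 * (y - out) * out * (1.0 - out)
            g_hid = [g_out * w2[j] * h[j] * (1.0 - h[j]) for j in range(hidden)]
            for j in range(hidden):
                w2[j] -= lr * g_out * h[j]
                t1[j] += lr * g_hid[j]       # threshold enters with a minus sign
                for k in range(n_in):
                    w1[j][k] -= lr * g_hid[j] * x[k]
            t2 += lr * g_out
        losses.append(epoch_loss)
    return (w1, t1, w2, t2), losses


# S110/S120: hypothetical normalized features with labels (1 = experimental, 0 = control).
samples = [[0.1, 0.2, 0.1], [0.2, 0.1, 0.3], [0.8, 0.9, 0.7], [0.9, 0.7, 0.8]]
labels = [0.0, 0.0, 1.0, 1.0]
params, losses = train(samples, labels)
```

Tracking `losses` per epoch gives a simple stand-in for the evaluation metric mentioned above. A real implementation would evaluate accuracy on held-out data before stopping.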
Next, once training of the machine learning model (MLM) has been completed as described above, cancer can be predicted using the machine learning model (MLM). This method will now be described. FIG. 6 is a flowchart for explaining a method for predicting cancer using the machine learning model (MLM) according to an embodiment of the present invention.
Referring to FIG. 6, in step S210 the preprocessing unit (100) receives the analysis data of a test subject whose cancer status is unknown, that is, analysis data on cfDNA concentration or amount (input amount), analysis data on copy number variation (CNV), and analysis data on the expression levels of tumor markers (Cyfra 21-1, CA 15-3, AFP, CEA, and CA 19-9). It preprocesses this data in step S220 and inputs the preprocessed analysis data to the machine learning model (MLM) in step S230.
Then, in step S240, the machine learning model (MLM) performs, on the input analysis data, a plurality of operations to which the weights between the plurality of layers are applied, and calculates an output value representing the probability that the test subject has cancer. For example, an output of 0.84 at the output node (o) of the output layer (OL) means that the cancer occurrence probability is 84%.
Accordingly, the prediction unit (300) may estimate whether cancer has occurred according to the output value in step S250.
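Step S250 can be sketched as a simple decision on the output value. The text does not state a decision cutoff, so the 0.5 used below is a hypothetical default, and the function name is illustrative.

```python
def estimate_cancer(output_value, cutoff=0.5):
    """Turn the model output (0.0 to 1.0) into a percentage and a decision.

    `cutoff` is a hypothetical decision threshold, not a value from the text.
    """
    if not 0.0 <= output_value <= 1.0:
        raise ValueError("output value must lie between 0.0 and 1.0")
    return {
        "probability": f"{output_value:.0%}",
        "predicted_cancer": output_value >= cutoff,
    }


result = estimate_cancer(0.84)
```

In practice the cutoff would be chosen from validation data according to the desired balance between sensitivity and specificity.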
In the method for diagnosing or predicting cancer of the present invention, the cancer to be diagnosed may be one or more selected from the group consisting of lung cancer, head and neck cancer, breast cancer, ovarian cancer, liver cancer, testicular cancer, colorectal cancer, thyroid cancer, pancreatic cancer, cervical cancer, bladder cancer, gastrointestinal cancer, and gallbladder cancer.
The method for diagnosing or predicting cancer of the present invention has improved sensitivity and accuracy even at a very small cfDNA fraction. According to an experimental example of the present invention, when Grail's liquid biopsy, a conventional method for diagnosing or predicting cancer, and the method of the present invention were each used to determine whether cancer had occurred in artificial sequences, the method of the present invention was confirmed to diagnose or predict the occurrence of cancer with higher sensitivity and accuracy than Grail's liquid biopsy.
Experimental Example. Confirmation of the sensitivity of the method for diagnosing or predicting cancer of the present invention
To confirm the sensitivity of the method of the present invention, each of six ctDNA fractions (2%, 3%, 4%, 6%, 8%, and 10%) was applied to 30 cfDNA samples from actual normal subjects, giving a total of 180 samples (30 × 6). Such samples were prepared for the whole of chromosomes 13 and 18, and also for shorter regions of 30M, 20M, and 10M.
For these artificial samples, the sensitivity in diagnosing or predicting the occurrence of cancer was determined using Grail's liquid biopsy, the CNV single assay alone (using the stair-matrix), and the method for diagnosing or predicting cancer of the present invention. In the latter, the second analysis data was obtained using the stair-matrix, the first analysis data was obtained by measurement at 이원다이애그노믹스(주) (EONE-DIAGNOMICS Co., Ltd., Republic of Korea), and the third analysis data was obtained by commission to 이원의료재단 (Eone Medical Foundation, Republic of Korea). The results are shown in FIG. 7.
As can be seen in FIG. 7, for early-stage (STAGE I) cancer, the method for diagnosing or predicting cancer of the present invention showed higher sensitivity than Grail's liquid biopsy or the CNV single assay alone.

Claims (15)

  1. A method for diagnosing or predicting cancer, comprising: obtaining first analysis data on cfDNA (cell-free DNA) concentration, second analysis data on copy number variation (CNV), and third analysis data on tumor marker expression level; and
    diagnosing or predicting, by a prediction unit, whether cancer has occurred by analyzing the analysis data using a machine learning model trained to calculate a cancer occurrence probability through operations on the analysis data.
  2. The method of claim 1,
    wherein the first analysis data is obtained by a method comprising:
    extracting blood from a subject; and
    measuring the amount of cfDNA detected per 1 ml of the extracted blood.
  3. The method of claim 1,
    wherein the second analysis data is obtained by a method comprising:
    a) inputting sequence information of test sequence fragments of cfDNA (cell-free DNA) from blood isolated from a subject;
    b) aligning the sequence information of the test sequence fragments to homologous positions by comparison with a human reference genome sequence (reference genome database);
    c) segmenting a target chromosome of the test sequence fragments into segments of a predetermined size and increasing the segment size, to generate a normalized two-dimensional matrix whose rows and columns represent segment size and position, respectively;
    d) calculating Z-score values of the generated two-dimensional matrix to form a two-dimensional matrix of Z-score values;
    e) selecting the lowest Z-score value from among those Z-score values that are lower than the Z-score value calculated from the target chromosome of the reference genome sequence fragments; and
    f) identifying the position and size at which gene copy number variation (CNV) has occurred by checking, stepwise from the row and column containing the lowest Z-score value to the row with the smallest segment size, the positions of segments whose Z-score values gradually increase.
  4. The method of claim 1,
    wherein step a) comprises:
    a-1) centrifuging blood collected from the subject to separate plasma;
    a-2) extracting cfDNA (cell-free DNA) from the separated plasma;
    a-3) preparing a library using the extracted cfDNA; and
    a-4) pooling the library and then reading the nucleotide sequences by next generation sequencing (NGS).
  5. The method of claim 1,
    wherein step c) repeatedly performs, for n from 1 to 24, a step c)-n of generating an n-th row by segmenting the chromosome into segments of size 0.5 Mb × n.
  6. The method of claim 1,
    wherein, in step d), the Z-score values for all segment sizes of the two-dimensional matrix are calculated using the mean and standard deviation (SD) values obtained by normalizing samples of a control group having normal chromosomes.
  7. The method of claim 1,
    wherein step e) selects the portion showing the lowest Z-score value among the Z-score values that are lower than the reference value, and step f) confirms copy number variation by checking, stepwise from the row and column containing the lowest Z-score value to the row with the smallest segment size, the positions of segments whose Z-score values gradually increase.
  8. The method of claim 1,
    wherein the Z-score value is calculated according to Equation 1 below, using the mean and standard deviation (SD) for each position (xy) of the two-dimensional matrix calculated from the target chromosome of the reference genome sequence fragments (reference).
    [Equation 1]
    Figure PCTKR2021003531-appb-img-000005
    (where Z = Z-score, M = matrix, cor.gc = correctable GC percent, and xy = location in the matrix)
  9. The method of claim 1,
    wherein the third analysis data is obtained by a method comprising:
    obtaining a biological sample from a subject; and
    measuring the concentration of a tumor marker in the biological sample.
  10. The method of claim 9,
    wherein the tumor markers are Cyfra 21-1, CA 15-3, AFP, CEA, and CA 19-9.
  11. The method of claim 9,
    wherein the biological sample is isolated from blood, plasma, or serum.
  12. The method of claim 1,
    wherein the diagnosing or predicting comprises:
    calculating, by the machine learning model, an output value representing the probability that a test subject has cancer by performing, on the analysis data, a plurality of operations to which weights between a plurality of layers of the machine learning model are applied; and
    estimating, by the prediction unit, whether cancer has occurred according to the output value.
  13. The method of claim 1,
    further comprising, before the obtaining:
    preparing, by a learning unit, training data that is analysis data of patients whose cancer status is known;
    setting, by the learning unit, labels on the training data according to whether cancer has occurred;
    calculating, by the machine learning model before training is complete, an output value representing the probability of cancer occurrence by performing, on the training data, a plurality of operations to which weights between a plurality of layers of the machine learning model are applied;
    calculating, by the learning unit, a loss that is the difference between the output value and the label; and
    performing, by the learning unit, an optimization that modifies the weights (w) of the machine learning model (MLM) using a back-propagation algorithm so that the loss is minimized.
  14. The method of claim 13,
    wherein, in the calculating of the loss,
    the learning unit calculates the loss according to the loss function
    Figure PCTKR2021003531-appb-img-000006
    where L denotes the loss,
    i is the index of the training data,
    x_i is the training data that is input to the machine learning model,
    y_i is the label indicating the expected value for the training data, that is, the label corresponding to the i-th training data, and
    f(x_i) is the output value calculated by the machine learning model for the i-th training data.
  15. The method of claim 1,
    wherein the cancer is one or more selected from the group consisting of lung cancer, head and neck cancer, breast cancer, ovarian cancer, liver cancer, testicular cancer, colorectal cancer, thyroid cancer, pancreatic cancer, cervical cancer, bladder cancer, gastrointestinal cancer, and gallbladder cancer.
PCT/KR2021/003531 2021-03-22 2021-03-22 Method for diagnosing or predicting cancer occurrence WO2022203093A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/KR2021/003531 WO2022203093A1 (en) 2021-03-22 2021-03-22 Method for diagnosing or predicting cancer occurrence

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/KR2021/003531 WO2022203093A1 (en) 2021-03-22 2021-03-22 Method for diagnosing or predicting cancer occurrence

Publications (1)

Publication Number Publication Date
WO2022203093A1 true WO2022203093A1 (en) 2022-09-29

Family

ID=83395811

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2021/003531 WO2022203093A1 (en) 2021-03-22 2021-03-22 Method for diagnosing or predicting cancer occurrence

Country Status (1)

Country Link
WO (1) WO2022203093A1 (en)

Similar Documents

Publication Title
US10354747B1 (en) Deep learning analysis pipeline for next generation sequencing
Hofreiter Drafting human ancestry: What does the Neanderthal genome tell us about hominid evolution? Commentary on Green et al. (2010)
WO2019139363A1 (en) Method for detecting circulating tumor dna in sample including acellular dna and use thereof
Sun et al. A comprehensive comparison of supervised and unsupervised methods for cell type identification in single-cell RNA-seq
WO2022188490A1 (en) Survival time prediction method and system based on imaging genomics
CN111180013B (en) Device for detecting blood disease fusion gene
CN115896242A (en) Intelligent cancer screening model and method based on peripheral blood immune characteristics
CN111814893A (en) Lung full-scan image EGFR mutation prediction method and system based on deep learning
CN114446389A (en) Tumor neoantigen characteristic analysis and immunogenicity prediction tool and application thereof
WO2022203093A1 (en) Method for diagnosing or predicting cancer occurrence
JP7141038B2 (en) Onset prediction device and onset prediction system
WO2016085262A2 (en) Virtual drug screening method, intensive screening library constructing method, and system therefor
Yang et al. A CpGCluster-teaching–learning-based optimization for prediction of CpG islands in the human genome
CN115831351A (en) Clinical auxiliary diagnosis model for hepatocellular carcinoma microvascular invasion based on dictionary learning
KR102534968B1 (en) Method for diagnosing or predicting cancer occurrence
CN114863149A (en) Method, system, device and storage medium for predicting relative survival risk of breast cancer
WO2020235721A1 (en) Method for discovering marker for predicting risk of depression or suicide using multi-omics analysis, marker for predicting risk of depression or suicide, and method for predicting risk of depression or suicide using multi-omics analysis
Paya et al. Deep learning system for classification of ploidy status using time-lapse videos
NZ766350A (en) Sequencing data-based itd mutation ratio detecting apparatus and method
WO2023022444A1 (en) Method and apparatus for providing examination-related guide on basis of tumor content predicted from pathology slide images
US20220246232A1 (en) Method for diagnosing disease risk based on complex biomarker network
CN117079723B (en) Biomarker and diagnostic model related to amyotrophic lateral sclerosis and application of biomarker and diagnostic model
WO2024071796A1 (en) Method and system for machine learning for predicting fracture risk based on spine radiographic image, and method and system for predicting fracture risk using same
US20240021267A1 (en) Dynamically selecting sequencing subregions for cancer classification
TWI719380B (en) Method and system for selecting biomarker via disease trajectories

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application. Ref document number: 21933321; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase. Ref country code: DE
122 EP: PCT application non-entry in European phase. Ref document number: 21933321; Country of ref document: EP; Kind code of ref document: A1