WO2023075402A1 - Method for diagnosing cancer and predicting cancer type using methylated cell-free nucleic acids - Google Patents
- Publication number
- WO2023075402A1 (application PCT/KR2022/016448)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- cancer
- nucleic acid
- value
- acid fragments
- vectorized data
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B5/00—ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/20—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q1/00—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
- C12Q1/68—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
- C12Q1/6876—Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
- C12Q1/6883—Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material
- C12Q1/6886—Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material for cancer
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
- G16B30/10—Sequence alignment; Homology search
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B35/00—ICT specially adapted for in silico combinatorial libraries of nucleic acids, proteins or peptides
- G16B35/10—Design of libraries
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
- G16B40/20—Supervised data analysis
Definitions
- The present invention relates to a method for diagnosing cancer and predicting cancer types using methylated cell-free nucleic acids. More specifically, it relates to a method in which nucleic acids are extracted from a biological sample, sequence information including methylation information is obtained, vectorized data is generated from nucleic acid fragments based on the aligned reads, and the vectorized data is input into a trained artificial intelligence model whose output value is then analyzed.
- Cancer diagnosis in clinical practice is usually confirmed by performing a tissue biopsy after a medical history, physical examination, and clinical evaluation. Cancer diagnosis by clinical tests is possible only when the number of cancer cells is 1 billion or more and the diameter of the cancer is 1 cm or more. In this case, the cancer cells already have the ability to metastasize, and at least half of them have already metastasized.
- Because tissue biopsy is invasive, it causes considerable inconvenience to patients, and it often cannot be performed while cancer patients are being treated.
- In cancer screening, tumor markers are used to monitor substances produced directly or indirectly by cancer, but their accuracy is limited: more than half of tumor marker screening results are normal even when cancer is present, and results are often positive even when there is no cancer.
- Recently, liquid biopsy, which uses a patient's body fluids, has become widely used for cancer diagnosis and follow-up testing.
- Liquid biopsy is a non-invasive diagnostic technique that is attracting attention as an alternative to conventional invasive diagnostic and examination methods.
- An artificial neural network is a calculation model, implemented in software or hardware, that imitates the computational capability of a biological system by using a large number of artificial neurons connected by connection lines.
- Artificial neural networks use artificial neurons that simplify the functions of biological neurons.
- Human cognitive functions or learning processes are emulated by interconnecting these neurons through connection lines, each of which has a connection strength.
- The connection strength is a specific value assigned to a connection line and is also called a connection weight.
- Learning of artificial neural networks can be divided into supervised learning and unsupervised learning.
- Supervised learning is a method of putting input data and corresponding output data together into a neural network and updating the connection strength of connection lines so that output data corresponding to the input data is output.
- Representative learning algorithms include Delta Rule and Back Propagation Learning.
- Unsupervised learning is a method in which an artificial neural network learns connection strength by itself using only input data without a target value.
- Unsupervised learning is a method of updating connection weights by correlation between input patterns.
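The supervised update described above can be illustrated with the delta rule on a single linear neuron. The following is a minimal sketch, not the model of this application: the function name `delta_rule_fit`, the learning rate, the epoch count, and the toy data are all assumptions chosen for illustration.

```python
import numpy as np

def delta_rule_fit(x, t, lr=0.1, epochs=500):
    """Train one linear neuron with the delta rule: w += lr * (t - y) * x."""
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.01, size=x.shape[1])  # connection weights
    b = 0.0                                      # bias term
    for _ in range(epochs):
        for xi, ti in zip(x, t):
            y = xi @ w + b        # neuron output for this input
            err = ti - y          # difference between target and output
            w += lr * err * xi    # strengthen or weaken each connection
            b += lr * err
    return w, b

# Hypothetical toy data: the target is 2*x0 - x1 + 0.5 (exactly linear).
x = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
t = 2 * x[:, 0] - x[:, 1] + 0.5
w, b = delta_rule_fit(x, t)
print(w, b)
```

Because the toy targets are exactly linear in the inputs, the learned weights approach the generating coefficients. Backpropagation extends the same error-driven update to networks with hidden layers.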
- The present inventors made diligent efforts to solve the above problems and to develop a highly sensitive and accurate AI-based cancer diagnosis method. As a result, they confirmed that when vectorized data is generated based on the distance or amount of methylated cell-free nucleic acid fragments and analyzed with a trained artificial intelligence model, cancer diagnosis and cancer type discrimination can be performed with high sensitivity and accuracy, and thereby completed the present invention.
- An object of the present invention is to provide a method for diagnosing cancer and predicting cancer types using methylated cell-free nucleic acids.
- Another object of the present invention is to provide an apparatus for diagnosing cancer and predicting cancer types using methylated cell-free nucleic acids.
- Another object of the present invention is to provide a computer readable storage medium containing instructions configured to be executed by a processor for diagnosing cancer and predicting cancer types by the above method.
- In order to achieve the above objects, the present invention provides a method for diagnosing cancer and predicting cancer types using methylated cell-free nucleic acids, comprising: (a) obtaining sequence information including methylation information by extracting nucleic acids from a biological sample; (b) aligning the obtained sequence information (reads) with a standard chromosome sequence database (reference genome database); (c) generating vectorized data using nucleic acid fragments based on the aligned sequence information (reads); (d) determining the presence or absence of cancer by inputting the generated vectorized data into a trained artificial intelligence model and comparing the resulting output value with a cut-off value; and (e) predicting the type of cancer through comparison of the output values.
- The present invention also provides an apparatus for diagnosing cancer and predicting cancer types, comprising: a decoding unit that extracts nucleic acids from a biological sample and decodes sequence information including methylation information; an alignment unit that aligns the decoded sequence with a standard chromosome sequence database; a data generation unit that generates vectorized data using nucleic acid fragments based on the aligned sequence; a cancer diagnosis unit that inputs the generated vectorized data into a trained artificial intelligence model, analyzes it, and compares the result with a reference value to determine whether cancer is present; and a cancer type prediction unit that analyzes the output value and predicts the type of cancer.
- The present invention also provides a computer-readable storage medium comprising instructions configured to be executed by a processor for diagnosing cancer and predicting cancer types, the instructions performing the steps of: (a) obtaining sequence information including methylation information by extracting nucleic acids from a biological sample; (b) aligning the obtained sequence information (reads) with a standard chromosome sequence database (reference genome database); (c) generating vectorized data using nucleic acid fragments based on the aligned sequence information (reads); (d) determining the presence or absence of cancer by inputting the generated vectorized data into a trained artificial intelligence model, analyzing it, and comparing the result with a cut-off value; and (e) predicting the type of cancer through comparison of the output values.
- The present invention also provides a method for providing information for diagnosing cancer and predicting cancer types, comprising: (a) obtaining sequence information including methylation information by extracting nucleic acids from a biological sample; (b) aligning the obtained sequence information (reads) with a standard chromosome sequence database (reference genome database); (c) generating vectorized data using nucleic acid fragments based on the aligned sequence information (reads); (d) determining the presence or absence of cancer by inputting the generated vectorized data into a trained artificial intelligence model and comparing the resulting output value with a cut-off value; and (e) predicting the type of cancer through comparison of the output values.
- FIG. 1 is an overall flowchart for determining artificial intelligence-based chromosomal abnormalities of the present invention.
- FIG. 2 is an example of a GC plot generated based on methylated cfDNA according to an embodiment of the present invention, wherein the X-axis represents chromosomes for each segment, and the Y-axis represents the number of nucleic acid fragments corresponding to each segment.
- FIG. 3 is a result of confirming the accuracy of neuroblastoma determination for a deep learning model that has learned GC plot image data generated based on the number of nucleic acid fragments using methylated cfDNA according to an embodiment of the present invention.
- FIG. 4 is a probability distribution for each data set of neuroblastoma determination for a deep learning model that has learned GC plot image data generated based on the number of nucleic acid fragments using methylated cfDNA according to an embodiment of the present invention, where (A) is the training set, (B) is the validation set, and (C) is the test set.
- FIG. 5 is an example of a GC plot generated based on cfDNA according to an embodiment of the present invention, in which the X-axis indicates chromosomes for each section and the Y-axis indicates the number of nucleic acid fragments corresponding to each section.
- FIG. 6 is a result of confirming the accuracy of neuroblastoma determination for a deep learning model that has learned GC plot image data generated based on the number of nucleic acid fragments using cfDNA according to an embodiment of the present invention
- FIG. 7 is a result showing the probability distribution for each data set of neuroblastoma determination for a deep learning model that has learned GC plot image data generated based on the number of nucleic acid fragments using cfDNA according to an embodiment of the present invention.
- (A) means the training set
- Terms such as first, second, A, and B may be used to describe various elements, but the elements are not limited by these terms, which are used only to distinguish one element from another. For example, without departing from the scope of the technology described below, a first element may be referred to as a second element, and similarly, the second element may be referred to as a first element.
- The term "and/or" includes any combination of a plurality of related recited items or any one of a plurality of related recited items.
- Each component described below may be combined with another into one component, or one component may be divided into two or more components according to more finely subdivided functions.
- Each component described below may additionally perform some or all of the functions of other components in addition to its main function, and some of the main functions of each component may, of course, be performed exclusively by other components.
- Each process constituting the method may occur in a different order from the specified order unless a specific order is clearly described in context. That is, each process may occur in the same order as specified, may be performed substantially simultaneously, or may be performed in the reverse order.
- In the present invention, it was intended to confirm that cancer can be detected with high sensitivity and accuracy when sequencing data obtained from methylated cell-free nucleic acids extracted from a sample is aligned with a reference genome, vectorized data is generated based on the aligned nucleic acid fragments, a DPI value is calculated by the trained artificial intelligence model, and the DPI value is compared with a reference value.
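One simple way to turn aligned fragments into vectorized data is to count fragments per fixed-width chromosome segment, which matches the axes described for the plots in the figures (segments on the X-axis, fragment counts on the Y-axis). The sketch below is only an illustration under stated assumptions: the function name `fragment_count_vector`, the `(chromosome, start)` input format, and the bin size are hypothetical, and the application may build its vectors differently (for example, from fragment distances).

```python
def fragment_count_vector(reads, chrom_lengths, bin_size):
    """Count aligned fragments per fixed-width segment, chromosome by chromosome.

    reads:         iterable of (chromosome, start_position) pairs (assumed format)
    chrom_lengths: dict mapping chromosome name to length, in plotting order
    bin_size:      segment width in bases (hypothetical parameter)
    """
    offsets = {}
    total_bins = 0
    for chrom, length in chrom_lengths.items():
        offsets[chrom] = total_bins               # first bin index of this chromosome
        total_bins += -(-length // bin_size)      # ceiling division

    counts = [0] * total_bins
    for chrom, start in reads:
        # Skip fragments on chromosomes outside the reference or out of range.
        if chrom in offsets and 0 <= start < chrom_lengths[chrom]:
            counts[offsets[chrom] + start // bin_size] += 1
    return counts

# Toy example with made-up coordinates.
chrom_lengths = {"chr1": 250, "chr2": 100}
reads = [("chr1", 5), ("chr1", 120), ("chr1", 199), ("chr2", 40), ("chrM", 3)]
vector = fragment_count_vector(reads, chrom_lengths, bin_size=100)
print(vector)  # [1, 2, 0, 1]
```

The resulting list can be plotted or passed to a model as a one-dimensional input. Whether to normalize the counts by sequencing depth is a design choice not specified here.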
- That is, a method was developed in which a deep learning model is trained to calculate the DPI value, cancer is determined to be present if the DPI value is greater than the reference value, and the cancer type showing the highest value among a plurality of DPI values is determined to be the actual cancer type (FIG. 1).
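The two-stage decision described above (a cut-off comparison, then selection of the highest-scoring type) can be sketched as follows. This is a hedged illustration: the function name `diagnose`, the cut-off of 0.5, the return labels, and the example scores are assumptions, and the sketch assumes one overall DPI value plus one DPI value per candidate cancer type.

```python
def diagnose(dpi, type_dpis, cutoff=0.5):
    """Apply the cut-off test, then pick the cancer type with the highest DPI.

    dpi:       model output for cancer vs. normal (assumed to be a single score)
    type_dpis: dict mapping cancer type name to its DPI value (assumed layout)
    cutoff:    reference value; 0.5 is a placeholder, not a value from the source
    """
    if dpi <= cutoff:
        return ("normal", None)
    # Cancer is indicated: choose the type showing the highest DPI value.
    predicted_type = max(type_dpis, key=type_dpis.get)
    return ("cancer", predicted_type)

# Made-up scores for illustration only.
scores = {"neuroblastoma": 0.7, "liver cancer": 0.2, "breast cancer": 0.1}
print(diagnose(0.8, scores))  # ('cancer', 'neuroblastoma')
print(diagnose(0.3, scores))  # ('normal', None)
```

In practice the reference value would be chosen on a validation set to balance sensitivity against specificity.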
- It relates to a method for providing information for diagnosing cancer and predicting cancer types, including (e) predicting cancer types through comparison of output result values.
- Any nucleic acid fragment may be used without limitation as long as it is a fragment of nucleic acid extracted from a biological sample; preferably it is a fragment of cell-free nucleic acid or intracellular nucleic acid, but it is not limited thereto.
- The nucleic acid fragment can be obtained by any method known to those skilled in the art, preferably by direct sequencing, next-generation sequencing, sequencing after non-specific whole genome amplification, or probe-based sequencing, but is not limited thereto.
- When next-generation sequencing is used, the nucleic acid fragment may mean a read.
- The cancer may be a solid cancer or a hematological cancer, and may preferably be selected from the group consisting of non-Hodgkin lymphoma, acute myeloid leukemia, acute lymphocytic leukemia, multiple myeloma, head and neck cancer, lung cancer, glioblastoma, colon/rectal cancer, pancreatic cancer, breast cancer, ovarian cancer, melanoma, prostate cancer, liver cancer, thyroid cancer, gastric cancer, gallbladder cancer, bile duct cancer, bladder cancer, small intestine cancer, cervical cancer, cancer of unknown primary site, kidney cancer, esophageal cancer, neuroblastoma, and mesothelioma. More preferably it may be neuroblastoma, but it is not limited thereto.
- the step (a) is
- the step of obtaining sequence information of step (a) may be characterized in that the sequence information of the isolated cell-free DNA is obtained through whole genome sequencing at a depth of 1 million to 100 million reads, but is not limited thereto.
- the biological sample refers to any material, biological fluid, tissue or cell obtained from or derived from an individual, for example, whole blood, leukocytes, peripheral blood mononuclear cells, leukocyte buffy coat, blood (including plasma and serum), sputum, tears, mucus, nasal washes, nasal aspirates, breath, urine, semen, saliva, peritoneal washings, pelvic fluids, cystic fluid, meningeal fluid, amniotic fluid, glandular fluid, pancreatic fluid, lymph fluid, pleural fluid, nipple aspirate, bronchial aspirate, synovial fluid, joint aspirate, organ secretions, cells, cell extract, hair, oral cells, placenta cells, cerebrospinal fluid, and mixtures thereof, but is not limited thereto.
- the term "reference group" refers to a comparison group, analogous to a standard sequence database, consisting of people who do not currently have a specific disease or condition.
- the standard nucleotide sequence in the standard chromosome sequence database of the reference group may be a reference chromosome registered with a public health institution such as NCBI.
- the nucleic acid in step (a) may be cell-free DNA, more preferably circulating tumor DNA (ctDNA), but is not limited thereto.
- the nucleic acid containing the methylation information can be obtained by various known methods, and is preferably obtained by bisulfite conversion, enzymatic conversion, or methylated DNA immunoprecipitation (MeDIP), but is not limited thereto.
- as an additional method for detecting DNA methylation, there is a restriction enzyme-based detection method, in which unmethylated nucleic acid is cleaved using a methylation restriction enzyme (MRE), or a specific sequence (recognition site) is cut, and the result is analyzed in combination with a hybridization method or PCR.
- methods based on bisulfite substitution include Whole-Genome Bisulfite Sequencing (WGBS), Reduced-Representation Bisulfite Sequencing (RRBS), Methylated CpG Tandems Amplification and Sequencing (MCTA-seq), Targeted Bisulfite Sequencing, Methylation Array, and Methylation-specific PCR (MSP), etc.
- methods for enriching and analyzing methylated DNA include Methylated DNA Immunoprecipitation Sequencing (MeDIP-seq), Methyl-CpG Binding Domain Protein Capture Sequencing (MBD-seq), and the like.
- Another method that can analyze methylated DNA in the present invention is 5-hydroxymethylation profiling, examples of which include 5hmC-Seal (hMe-Seal), hmC-CATCH, Hydroxymethylated DNA Immunoprecipitation Sequencing (hMeDIP-seq), Oxidative Bisulfite Conversion, etc.
- next-generation sequencer can be used with any sequencing method known in the art. Sequencing of nucleic acids isolated by selection methods is typically performed using next-generation sequencing (NGS).
- Next-generation sequencing includes any sequencing method that determines the nucleotide sequence of either individual nucleic acid molecules or clonally expanded proxies for individual nucleic acid molecules in a highly parallel manner (e.g., 10^5 or more molecules are sequenced simultaneously).
- the relative abundance of a nucleic acid species in a library can be estimated by counting the relative number of occurrences of its cognate sequence in the data generated by the sequencing experiment. Next-generation sequencing methods are known in the art and are described, for example, in Metzker, M. (2010) Nature Reviews Genetics 11:31-46, incorporated herein by reference.
- next-generation sequencing may determine the nucleotide sequence of individual nucleic acid molecules (e.g., Helicos BioSciences' HeliScope Gene Sequencing system and Pacific Biosciences' PacBio RS system).
- the sequencing may be, e.g., massively parallel short-read sequencing (e.g., the Solexa sequencer, Illumina Inc., San Diego, CA), which yields more bases of sequence per sequencing unit than other sequencing methods that yield fewer but longer reads.
- other methods determine the nucleotide sequence of clonally expanded proxies for individual nucleic acid molecules (e.g., the Solexa sequencer, Illumina Inc., San Diego, CA; 454 Life Sciences (Branford, CT); and Ion Torrent).
- Other methods or machines for next-generation sequencing include, but are not limited to, those provided by 454 Life Sciences (Branford, CT), Applied Biosystems (Foster City, CA; SOLiD sequencers), Helicos BioSciences Corporation (Cambridge, MA), and emulsion and microfluidic sequencing technology nanodroplets (e.g., GnuBio droplets).
- Platforms for next-generation sequencing include, but are not limited to, the Genome Sequencer (GS) FLX system from Roche/454, Illumina/Solexa's Genome Analyzer (GA), Life/APG's Support Oligonucleotide Ligation Detection (SOLiD) system, Polonator's G.007 system, Helicos BioSciences' HeliScope Gene Sequencing system, Oxford Nanopore Technologies' PromethION, GridION, and MinION systems, and Pacific Biosciences' PacBio RS system.
- the sequence alignment in step (b) includes computational methods or approaches, implemented as computer algorithms, that identify where in the genome a read sequence (e.g., a short-read sequence from next-generation sequencing) most likely originated by evaluating the similarity between the read sequence and the reference sequence. A variety of algorithms can be applied to the sequence alignment problem. Some algorithms are relatively slow but allow relatively high specificity. These include, for example, dynamic programming-based algorithms. Dynamic programming is a way of solving complex problems by breaking them down into simpler steps. Other approaches are relatively more efficient but are typically less thorough. These include, for example, heuristic algorithms and probabilistic methods designed for large-scale database searches.
- candidate screening reduces the search space for sequence alignments from the whole genome to a shorter enumeration of possible alignment positions.
- Sequence alignment involves aligning sequences with sequences provided in the candidate screening step. This can be done using a global alignment (eg Needleman-Wunsch alignment) or a local alignment (eg Smith-Waterman alignment).
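
As an illustration of the dynamic programming approach behind local alignment, the following is a minimal sketch of Smith-Waterman scoring. The function name and the scoring parameters (`match`, `mismatch`, `gap`) are assumptions chosen for illustration; they are not taken from this document, and production aligners such as BWA use indexed heuristics rather than a full dynamic programming table.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b.

    Uses a linear gap penalty; the scoring values are illustrative assumptions.
    Only two rows of the dynamic programming table are kept in memory.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Score for aligning a[i-1] with b[j-1]
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("TTACGTT", "GGACGGG"))
```

A global (Needleman-Wunsch) variant would differ mainly in initializing the first row and column with cumulative gap penalties and in not clamping cell values at zero.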
- most alignment algorithms can be characterized as one of three types based on the indexing method: algorithms based on hash tables (e.g., BLAST, ELAND, SOAP), suffix trees (e.g., Bowtie, BWA), and merge sort (e.g., Slider). Short read sequences are typically used for alignment.
- the alignment step of step (b) is not limited thereto, but may be performed using the BWA algorithm and the Hg19 sequence.
- the BWA algorithm may include BWA-ALN, BWA-SW or Bowtie2, but is not limited thereto.
- the length of the sequence information (reads) in step (b) is 5 to 5000 bp, and the number of reads used may be 5,000 to 5 million, but is not limited thereto.
- the vectorized data in step (c) can be used without limitation as long as it can be generated based on the aligned nucleic acid fragments, and may preferably be a Grand Canyon plot (GC plot), but is not limited thereto.
- the vectorized data in the present invention may be characterized as preferably imaged, but not limited thereto.
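- An image is basically composed of pixels. When an image composed of pixels is vectorized, it can be expressed as a 1-dimensional 2D vector (black and white), a 3-dimensional 2D vector (color, RGB), or a 4-dimensional 2D vector (CMYK), depending on the type of image.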
- the vectorized data of the present invention is not limited to images, and can be used, for example, as input data for an artificial intelligence model by stacking several n black-and-white images and using n-dimensional 2D vectors (Multi-dimensional Vector).
- the GC plot is a plot in which a specific section (either bins of a certain size or bins of different sizes) is set as the X-axis and values that can be expressed as nucleic acid fragments, such as the distance or number between nucleic acid fragments, are created as the Y-axis.
- prior to performing step (c), the method may further comprise the step of separately selecting, from the aligned nucleic acid fragments, those that satisfy a mapping quality score criterion.
- the mapping quality score may vary depending on a desired criterion, but may be preferably 15 to 70 points, more preferably 50 to 70 points, and most preferably 60 points.
- the GC plot in step (c) may be characterized in that the distribution of the aligned nucleic acid fragments for each chromosome section is generated as vectorized data by calculating the number of nucleic acid fragments for each section or the distance between the nucleic acid fragments.
- any method of vectorizing the calculated number of nucleic acid fragments or the distance between nucleic acid fragments can be used without limitation as long as it is a known technique for vectorizing calculated values.
- calculating the distribution of the aligned sequence information for each chromosome segment by the number of nucleic acid fragments may be characterized in that it is performed by including the following steps:
- step iv) generating a GC plot with the order of each section as the X-axis value and the normalized value calculated in step iii) as the Y-axis value.
- calculating the distribution of the aligned sequence information by chromosome section as the distance between nucleic acid fragments may be characterized in that it is performed by including the following steps:
- step iv) normalizing by dividing the representative value calculated in step iii) by the representative value of all nucleic acid fragment distance values;
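
The distance-based calculation can be sketched as follows. This is only an illustration under stated assumptions: the distance is taken between consecutive aligned fragments, each distance is assigned to the bin of its first fragment, and the median is used as the representative value. The function name `distance_profile` and these choices are hypothetical; the text allows many other representative values.

```python
from statistics import median


def distance_profile(positions, bin_size):
    """Per-bin representative fragment distance, normalized by the overall value.

    positions: alignment positions of fragments on one chromosome (0-based).
    Assumptions: distances are taken between consecutive fragments, each distance
    is assigned to the bin of its first fragment, and the median is the
    representative value.
    """
    positions = sorted(positions)
    distances = [(p, q - p) for p, q in zip(positions, positions[1:])]
    if not distances:
        return []
    overall = median(d for _, d in distances)
    if overall == 0:
        raise ValueError("overall representative distance is zero")
    n_bins = positions[-1] // bin_size + 1
    per_bin = [[] for _ in range(n_bins)]
    for p, d in distances:
        per_bin[p // bin_size].append(d)
    # Normalize each bin's representative value by the overall representative value
    return [median(ds) / overall if ds else 0.0 for ds in per_bin]


if __name__ == "__main__":
    print(distance_profile([0, 10, 30, 100, 110, 130], bin_size=100))
```

The returned list gives the Y-axis values in bin order, so the index of each entry plays the role of the X-axis value.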
- the GC plot can be used by aligning GC plots from chromosomes 1 to 22 on the Y axis to create one image, or by combining images created from chromosomes 1 to 22 on the z axis.
- the representative value may be at least one selected from the group consisting of the sum, difference, product, mean, median, quantile, minimum, maximum, variance, standard deviation, median absolute deviation, coefficient of variation, and reciprocal of the distances between nucleic acid fragments, and combinations thereof, but is not limited thereto.
- the bin may be characterized in that it is 1Kb to 3Gb, but is not limited thereto.
- a step of grouping nucleic acid fragments may be additionally used, and in this case, the grouping criterion may be performed based on adapter sequences of the aligned nucleic acid fragments.
- the distance between the nucleic acid fragments can be calculated for the selected sequence information by separately dividing it into forward-aligned nucleic acid fragments and reverse-aligned nucleic acid fragments.
- the FD value is defined as the distance between the ith nucleic acid fragment and the reference value of one or more nucleic acid fragments selected from the i+1 to n nucleic acid fragments with respect to the obtained n nucleic acid fragments.
- the FD value is calculated, for the obtained n nucleic acid fragments, as the distance between the first nucleic acid fragment and the reference value of one or more nucleic acid fragments selected from the group consisting of the second to n-th nucleic acid fragments.
- a reciprocal value thereof, calculation results including weights, and statistics may also be used as FD values, but the FD value is not limited thereto.
- the "reference value of nucleic acid fragments" may be a value obtained by adding an arbitrary value to, or subtracting an arbitrary value from, the median value of the nucleic acid fragment.
- the FD value can be defined as follows for the obtained n nucleic acid fragments.
- the Dist function calculates, for the differences in alignment position values of all the nucleic acid fragments included between the two selected nucleic acid fragments Ri and Rj, one or more values selected from the group consisting of the sum, difference, product, mean, log of the product, log of the sum, median, quantile, minimum, maximum, variance, standard deviation, median absolute deviation, and coefficient of variation, and/or one or more reciprocal values thereof, calculation results including weights, and statistics, without being limited thereto.
- FD value Frametic Distance Value
- N the number of selection cases of nucleic acid fragments for distance calculation. That is, when i is 1, i+1 becomes 2, and the distance to one or more nucleic acid fragments selected from the 2nd to nth nucleic acid fragments can be defined.
- the FD value may be characterized by calculating a distance between a specific position inside the i-th nucleic acid fragment and a specific position inside any one or more of the i+1 to n-th nucleic acid fragments.
- nucleic acid fragment is 50 bp in length and is aligned at position 4,183 of chromosome 1
- the genetic position values that can be used to calculate the distance of this nucleic acid fragment are 4,183 to 4,232 of chromosome 1.
- the genetic position value that can be used for calculating the distance of this nucleic acid fragment is 4,232 to 4,281 of chromosome 1
- the FD value between the two nucleic acid fragments can be from 1 to 99.
- the genetic position values that can be used to calculate the distance of this nucleic acid fragment are 4,123 to 4,172 of chromosome 1
- the FD value between the two nucleic acid fragments is 61 to 159
- the FD value with the first example nucleic acid fragment is 12 to 110. One or more values selected from the group consisting of the sum, difference, product, mean, log of the product, log of the sum, median, quantile, minimum, maximum, variance, standard deviation, median absolute deviation, and coefficient of variation of values within the two FD value ranges, and/or one or more reciprocal values thereof, calculation results including weights, and statistics can be used as the FD value, and the FD value is preferably the reciprocal of one value within the two FD value ranges, but is not limited thereto.
- the FD value may be a value obtained by adding or subtracting an arbitrary value from the median value of nucleic acid fragments.
- the median value of FD means a value located at the most center when the calculated FD values are arranged in order of size. For example, when there are three values, such as 1, 2, and 100, 2 is the median because 2 is the most central. If there are an even number of FD values, the median is determined as the average of the two values in the middle. For example, if there are FD values of 1, 10, 90, and 200, the median is 50, which is the average of 10 and 90.
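
The two median examples above can be checked with a short sketch using Python's standard `statistics.median`, which averages the two middle values when the number of values is even. The variable names are illustrative only.

```python
from statistics import median

# FD values from the two examples in the text
odd_count_fd = [1, 2, 100]
even_count_fd = [1, 10, 90, 200]

# With an odd number of values, the middle value is returned
median_odd = median(odd_count_fd)    # 2

# With an even number of values, the two middle values (10 and 90) are averaged
median_even = median(even_count_fd)  # 50.0

if __name__ == "__main__":
    print(median_odd, median_even)
```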
- the arbitrary value can be used without limitation as long as it can indicate the position of the nucleic acid fragment, but is preferably 0 to 5 kbp or 0 to 300% of the length of the nucleic acid fragment, 0 to 3 kbp or 0 to 200% of the length of the nucleic acid fragment, or 0 to 1 kbp or 0 to 100% of the length of the nucleic acid fragment, and more preferably 0 to 500 bp or 0 to 50% of the length of the nucleic acid fragment, but is not limited thereto.
- the FD value may be derived based on positional values of forward and reverse sequence information (reads).
- the position values of this nucleic acid fragment are 4183 to 4349.
- the position value of this nucleic acid fragment is 4349 to 4515.
- the distance between the two nucleic acid fragments may be 0 to 333, and most preferably 166, which is the distance of the median value of each nucleic acid fragment.
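
The preferred value of 166 follows from taking the midpoint of each fragment's position range and subtracting. A minimal sketch, with hypothetical function names and each fragment represented as a `(start, end)` tuple:

```python
def fragment_midpoint(start, end):
    """Central position of a fragment whose ends are known (paired-end case)."""
    return (start + end) / 2


def midpoint_distance(frag_a, frag_b):
    """Distance between the central positions of two fragments given as (start, end)."""
    return abs(fragment_midpoint(*frag_b) - fragment_midpoint(*frag_a))


if __name__ == "__main__":
    # The two example fragments from the text
    first = (4183, 4349)   # midpoint 4266
    second = (4349, 4515)  # midpoint 4432
    print(midpoint_distance(first, second))
```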
- when sequence information is obtained by the paired-end sequencing, the method may further comprise the step of excluding from the calculation process any nucleic acid fragment whose sequence information (reads) has an alignment score less than the reference value.
- the FD value may be derived based on one type of positional value of forward or reverse sequence information (read).
- in the case of the single-end sequencing, an arbitrary value may be added when the position value is derived from sequence information aligned in the forward direction, and an arbitrary value may be subtracted when the position value is derived from sequence information aligned in the reverse direction. The arbitrary value can be used without limitation as long as the FD value clearly indicates the location of the nucleic acid fragment, but is preferably 0 to 5 kbp or 0 to 300% of the nucleic acid fragment length, 0 to 3 kbp or 0 to 200% of the nucleic acid fragment length, or 0 to 1 kbp or 0 to 100% of the nucleic acid fragment length, and more preferably 0 to 500 bp or 0 to 50% of the nucleic acid fragment length, but is not limited thereto.
- Nucleic acids to be analyzed in the present invention may be sequenced and expressed in units called reads.
- This read can be divided into single end sequencing read (SE) and paired end sequencing read (PE) according to the sequencing method.
- SE-type read means that one of the 5' and 3' parts of the nucleic acid molecule is sequenced for a certain length in a random direction
- PE-type read means that both 5' and 3' are sequenced for a certain length. Because of this difference, it is well known to those skilled in the art that one read is generated from one nucleic acid fragment when sequencing in SE mode, and two reads are generated in pairs from one nucleic acid fragment in PE mode.
- the most ideal way to calculate the exact distance between nucleic acid fragments is to sequence the nucleic acid molecules from start to finish, align the reads, and use the median (center) of the aligned values.
- the above method is limited by sequencing technology and cost. Therefore, sequencing is performed using methods such as SE and PE.
- SE sequencing technology
- the exact position (median value) of the nucleic acid fragment can be identified through the combination of these values, but in the SE method, since information on only one end of a nucleic acid fragment can be used, there is a limit to calculating the exact position (median value).
- the 5' end of the forward read has a position value smaller than the central position of the nucleic acid molecule, and the 3' end of the reverse read has a larger value.
- a value close to the central position of a nucleic acid molecule can be estimated by adding an arbitrary value (extended bp) to the position of a forward read and subtracting it from the position of a reverse read.
- the arbitrary value may vary depending on the sample used, and in the case of cell-free nucleic acids, since the average length of the nucleic acids is known to be about 166 bp, it can be set to about 80 bp. If the experiment is conducted through fragmentation equipment (eg, sonication), about half of the target length set in the fragmentation process can be set as extended bp.
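
The single-end correction described above can be sketched as below. The function name, the `"+"`/`"-"` strand encoding, and the default of 80 bp (roughly half of the approximately 166 bp cell-free fragment length mentioned in the text) are assumptions made for illustration.

```python
def estimate_center(position, strand, extended_bp=80):
    """Estimate the central position of a fragment from one single-end read.

    position: position value reported for the read.
    strand: "+" for a forward-aligned read, "-" for a reverse-aligned read.
    extended_bp: the arbitrary value; 80 is an assumed default for cell-free
    nucleic acids, about half of the typical fragment length.
    """
    if strand == "+":
        return position + extended_bp  # forward read lies before the center
    if strand == "-":
        return position - extended_bp  # reverse read lies after the center
    raise ValueError("strand must be '+' or '-'")


if __name__ == "__main__":
    print(estimate_center(1000, "+"), estimate_center(1166, "-"))
```

For samples prepared with fragmentation equipment, `extended_bp` would instead be set to about half of the target fragment length, as the text notes.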
- the representative value is one or more values selected from the group consisting of the sum, difference, product, mean, median, quantile, minimum, maximum, variance, standard deviation, median absolute deviation, and coefficient of variation of the FD values, and/or one or more reciprocal values thereof, and is preferably the median, the mean, or a reciprocal thereof of the FD values, but is not limited thereto.
- the artificial intelligence model in step (d) can be used without limitation as long as it is a model capable of learning to distinguish a normal image from a cancerous image, and is preferably a deep learning model.
- the artificial intelligence model can be used without limitation as long as it is an artificial neural network algorithm capable of analyzing vectorized data, but may preferably be selected from the group consisting of a convolutional neural network (CNN), a deep neural network (DNN), a recurrent neural network (RNN), and an autoencoder, but is not limited thereto.
- the recurrent neural network may be selected from the group consisting of a long short-term memory (LSTM) neural network, a gated recurrent unit (GRU) neural network, a vanilla recurrent neural network, and an attentive recurrent neural network.
- the loss function for performing binary classification may be represented by Equation 1 below, and the loss function for performing multi-class classification may be represented by Equation 2 below.
- the binary classification means that the artificial intelligence model learns to determine whether or not there is cancer
- the multi-class classification means that the artificial intelligence model learns to determine the type of cancer
- when the artificial intelligence model is a CNN, learning may be performed by a method including the following steps:
- the training data is used when learning the CNN model
- the validation data is used for hyper-parameter tuning verification
- the test data is used for performance evaluation after producing the optimal model.
- the hyper-parameter tuning process is a process of optimizing the values of the various parameters (the number of convolution layers, the number of dense layers, the number of convolution filters, etc.) constituting the CNN model, and may be characterized by using Bayesian optimization and grid search techniques.
- the learning process optimizes the internal parameters (weights) of the CNN model using predetermined hyper-parameters, and when the validation loss starts to increase relative to the training loss, the model is determined to be overfitting, and learning may be stopped before that point.
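
The stopping rule can be sketched as a simple check on the per-epoch validation loss. This is a minimal illustration, not the procedure used in the examples: the function name and the `patience` parameter (how many non-improving epochs are tolerated) are assumptions, and the loss values in the usage example are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the index of the epoch whose weights would be kept.

    Training is considered to stop once the validation loss has failed to
    improve for `patience` consecutive epochs (an assumed criterion for
    detecting the onset of overfitting).
    """
    best_epoch, best_loss, waited = 0, float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss, waited = epoch, loss, 0
        else:
            waited += 1
            if waited >= patience:
                break  # stop before the model overfits further
    return best_epoch


if __name__ == "__main__":
    # Hypothetical validation losses, one per epoch
    losses = [0.9, 0.7, 0.6, 0.65, 0.7, 0.8, 0.5]
    print(early_stopping_epoch(losses, patience=3))
```

With a patience of 3, the loop stops after the third consecutive non-improving epoch and never sees the final value, which is the trade-off a small patience makes.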
- the result value that the artificial intelligence model outputs by analyzing the input vectorized data in step (d) can be used without limitation as long as it is a specific score or real number, and is preferably a DPI (Deep Probability Index) value, but is not limited thereto.
- the Deep Probability Index means a value expressed as a probability, obtained by adjusting the output of the artificial intelligence model to a scale of 0 to 1 using, in the last layer of the model, a sigmoid function in the case of binary classification and a softmax function in the case of multi-class classification.
- the sigmoid function is used to learn so that the DPI value becomes 1 in case of cancer. For example, if a neuroblastoma sample and a normal sample are input, the DPI value of the neuroblastoma sample is learned to be close to 1.
- the softmax function is used to output as many DPI values as the number of classes.
- the DPI values, as many as the number of classes, sum to 1, and learning is performed so that the DPI value of the actual cancer type becomes 1. For example, if there are three classes of neuroblastoma, liver cancer, and normal, and a neuroblastoma sample is input, the DPI value of the neuroblastoma class is learned to be close to 1.
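
How the last layer maps raw model outputs to DPI values can be sketched with the standard definitions of the sigmoid and softmax functions. The raw output values (logits) in the usage example are hypothetical, and the class order is an assumption for illustration.

```python
import math


def sigmoid(z):
    """Map one raw output to (0, 1); written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def softmax(logits):
    """Map one raw output per class to probabilities that sum to 1."""
    m = max(logits)  # subtracting the maximum improves numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    # Hypothetical raw outputs for the classes (neuroblastoma, liver cancer, normal)
    dpi_values = softmax([2.0, 0.5, -1.0])
    print(sigmoid(0.0), dpi_values, sum(dpi_values))
```

The binary case yields a single DPI value, whereas the multi-class case yields one DPI value per class.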
- the output result value of step (d) may be characterized in that it is derived for each type of cancer.
- the artificial intelligence model is trained so that the output result is close to 1 if there is cancer and close to 0 if there is no cancer. Accordingly, with 0.5 as the reference value, it was judged that there was cancer if the output was above 0.5 and that there was no cancer if the output was 0.5 or less, and performance measurement was performed (training, validation, and test accuracy).
- the reference value of 0.5 can be changed at any time. For example, to reduce false positives, the reference value can be set higher than 0.5 so that a stricter criterion is applied in judging that cancer is present, and to reduce false negatives, the reference value can be set lower than 0.5 so that a somewhat weaker criterion is applied in judging that cancer is present.
- the reference value can be determined by applying the trained artificial intelligence model to unseen data (data whose answers were not used for training) and checking the resulting DPI probability values.
- the step (e) of predicting the cancer type through comparison of the output result values may be performed by a method comprising determining the cancer type showing the highest value among the output result values as the cancer of the sample.
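
The two decision rules, comparison with a reference value and selection of the highest output, can be sketched as follows. The function names and the DPI values in the usage example are hypothetical and serve only to illustrate the logic.

```python
def has_cancer(dpi, reference_value=0.5):
    """Binary decision: cancer is judged present when the DPI exceeds the reference value."""
    return dpi > reference_value


def predict_cancer_type(dpi_by_type):
    """Multi-class decision: the class with the highest DPI value is selected."""
    return max(dpi_by_type, key=dpi_by_type.get)


if __name__ == "__main__":
    # Hypothetical DPI values for one sample
    print(has_cancer(0.82))                       # default reference value
    print(has_cancer(0.62, reference_value=0.7))  # stricter reference value
    print(predict_cancer_type(
        {"neuroblastoma": 0.82, "liver cancer": 0.11, "normal": 0.07}
    ))
```

Raising `reference_value` trades sensitivity for fewer false positives, which matches the adjustment described in the text.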
- a decoding unit that extracts nucleic acids from a biological sample and decodes sequence information including methylation information
- an alignment unit that aligns the decoded sequence with a standard chromosomal sequence database
- a data generation unit that generates vectorized data using nucleic acid fragments based on the aligned sequence
- a cancer diagnosis unit that analyzes the generated vectorized data by inputting it to the learned artificial intelligence model and compares it with a reference value to determine whether or not there is cancer
- the present invention relates to a cancer diagnosis and cancer type prediction device including a cancer type prediction unit that analyzes the output result values and predicts the cancer type.
- the decoding unit may be implemented in an independent device.
- the decoding unit of the present invention can produce sequence information including methylation information, i.e., reads, in the NGS device.
- the present invention is a computer readable storage medium comprising instructions configured to be executed by a processor for diagnosing cancer and predicting cancer types,
- It relates to a computer-readable storage medium including instructions configured to be executed by a processor for predicting the presence of cancer and the type of cancer through the step of (e) predicting the type of cancer through the comparison of output result values.
- sequence information including methylation information by extracting nucleic acids from a biological sample
- It relates to a method for diagnosing cancer and predicting cancer types, including the step of predicting cancer types through the comparison of the output result values.
- a method according to the present disclosure may be implemented using a computer.
- a computer includes one or more processors coupled to a chip set.
- a memory, a storage device, a keyboard, a graphics adapter, a pointing device, and a network adapter are connected to the chipset.
- the functionality of the chipset is provided by a memory controller hub and an I/O controller hub.
- the memory may be used directly coupled to the processor instead of a chip set.
- a storage device is any device capable of holding data, including a hard drive, compact disk read-only memory (CD-ROM), DVD, or other memory device. The memory holds data and instructions used by the processor.
- the pointing device may be a mouse, track ball or other type of pointing device, and is used in combination with a keyboard to transmit input data to a computer system.
- the graphics adapter presents images and other information on a display.
- the network adapter connects the computer system to a local area network or a wide area network.
- the computer used herein is not limited to the above configuration, may lack some of these components, may include additional components, and may also be part of a storage area network (SAN), and the computer of the present application may be configured to be suitable for executing modules of a program for carrying out the method according to the present invention.
- a module herein may mean a functional and structural combination of hardware for implementing the technical idea according to the present application and software for driving the hardware.
- the module may mean a logical unit of a predetermined code and a hardware resource for executing the predetermined code, and it is apparent to those skilled in the art that it does not necessarily mean a physically connected code or one type of hardware.
- the storage medium includes any medium that stores or transmits data in a form readable by a device such as a computer.
- a computer readable medium may include read-only memory (ROM); random access memory (RAM); magnetic disk storage media; optical storage media; flash memory devices; and other electrical, optical or acoustic signal transmission media.
- Example 1 Next-generation sequencing analysis performed by extracting methylated cfDNA from blood
- the prepared library was sequenced on a NovaSeq 6000 (Illumina) in 150 bp paired-end mode, producing about 30 million reads per sample.
- The reads obtained in Example 1 were aligned to the reference genome using the bwa (version 0.7.17-r1188) alignment tool, PCR duplicate nucleic acid fragments were removed using the biobambam2 bammarkduplicates (version 2.0.87) tool, and nucleic acid fragments having a mapping quality below 60 were removed using sambamba (version 0.6.6).
- the GC plot expresses the alignment of NGS reads from the beginning to the end of the chromosome. All chromosomes except sex chromosomes were divided into non-overlapping 100 kilobase bins, and the number of reads assigned to each bin was counted (read count value). Normalization was performed by dividing the read count value assigned to each bin by the total number of reads in the sample. A GC plot was produced for each chromosome by setting the normalized bin read count value as the Y value and the order of each bin as the X value, and the GC plots produced for chromosome 1 to chromosome 22 were aligned to produce one image (Fig. 2).
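
The binning and normalization described above can be sketched for a single chromosome as follows. The function name, the 0-based position convention, and the toy input values are assumptions for illustration; only the 100 kilobase bin size and the division by the total read count come from the text.

```python
def normalized_bin_counts(read_positions, chrom_length, total_reads, bin_size=100_000):
    """Count reads in non-overlapping bins and normalize by the sample's total reads.

    read_positions: 0-based alignment positions on one chromosome (assumed convention).
    total_reads: total number of reads in the whole sample, not just this chromosome.
    Returns one normalized value per bin; the list index is the bin order (X value).
    """
    n_bins = -(-chrom_length // bin_size)  # ceiling division
    counts = [0] * n_bins
    for pos in read_positions:
        counts[pos // bin_size] += 1
    return [c / total_reads for c in counts]


if __name__ == "__main__":
    # Toy example: five reads on a hypothetical 300 kb chromosome
    y_values = normalized_bin_counts(
        [10, 50_000, 150_000, 250_000, 250_001],
        chrom_length=300_000,
        total_reads=5,
    )
    print(y_values)
```

Concatenating the per-chromosome lists for chromosomes 1 to 22 in order would give the values for one combined image.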
- the GC plots produced in Example 2 were divided into training, validation, and test (performance evaluation) data.
- Training data was used when learning the CNN model
- Validation data was used for hyper-parameter tuning verification, and the test data was used for performance evaluation after production of the optimal model.
- Tensorflow (version 2.4.1) was used to build and train the CNN model.
- the structure of the CNN model is in the order of convolution layer -> pooling layer -> fully connected layer, and a pooling layer is always inserted after each convolution layer.
- the number of convolution layers and the number of fully connected layers were determined through a hyper-parameter tuning process.
- model learning proceeded in the direction of minimizing the loss function, and the loss function is as shown in Equations 1 and 2.
- hyper-parameter tuning was performed using the scikit-optimize (version 0.7.4) python package.
- Hyper-parameter tuning is the process of optimizing the values of the various parameters that constitute the CNN model (number of convolution layers, number of dense layers, number of convolution filters, etc.). After specifying the number of hidden nodes, the activation function, the use of dropout, and the learning rate as hyper-parameters, an optimal model was built using Bayesian optimization. The models obtained during hyper-parameter tuning were compared on the validation data, the best-performing model was selected as the optimal model, and its performance was then evaluated on the test data.
- the sigmoid function was used for the output layer of the model.
- the sigmoid function is as shown in Equation 3 below.
- The sigmoid function outputs one probability value, the DPI.
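Equation 3 is not reproduced in this text. Assuming it is the standard logistic sigmoid, 1 / (1 + exp(-x)), the binary decision can be sketched as follows. The function names and the idea of passing a raw pre-activation value (`logit`) are assumptions; the 0.5 cut-off and the "0.5 or more means cancer" rule are taken from the figure description and the claims.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function 1 / (1 + exp(-x))."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # avoids overflow for large negative x
    return z / (1.0 + z)


def has_cancer(logit, cutoff=0.5):
    """Apply the DPI cut-off: a DPI at or above the cut-off is called cancer."""
    dpi = sigmoid(logit)
    return dpi >= cutoff
```

Because the sigmoid maps 0 to exactly 0.5, a cut-off of 0.5 on the DPI is equivalent to checking the sign of the pre-activation value.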
- For cancer type prediction, the softmax function was used for the output layer of the model, as shown in Equation 4.
- Because Equation 4 outputs as many probability values (DPI) as there are classes, it was used to classify the type of cancer.
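Equation 4 is likewise not reproduced here. Assuming it is the standard softmax, exp(x_i) / sum_j exp(x_j), the multi-class step can be sketched as below, with the class showing the highest output taken as the predicted cancer type (as the claims describe). The function names and the example class labels are hypothetical.

```python
import math


def softmax(logits):
    """Return one probability per class; the probabilities sum to 1."""
    m = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_cancer_type(logits, class_names):
    """Pick the class with the highest softmax output."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return class_names[best], probs
```

For example, `predict_cancer_type([1.0, 3.0, 0.5], ["type_a", "type_b", "type_c"])` selects `"type_b"`, since the second output is the largest.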
- Example 4. Construction and performance evaluation of a neuroblastoma deep learning model using GC plots based on the number of nucleic acid fragments from methylated cfDNA
- ROC: Receiver Operating Characteristic
- Figure 4 shows the probability of having cancer (DPI values) calculated by the artificial intelligence model of the present invention as a boxplot for the normal and neuroblastoma sample groups; the red line represents the DPI cutoff of 0.5.
- Example 5. Construction and performance evaluation of a neuroblastoma deep learning model using GC plots based on the number of nucleic acid fragments from cfDNA
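Since the examples refer to ROC analysis of DPI values for normal and cancer groups, the area under the ROC curve can be computed as sketched below. This is a generic illustration, not the authors' evaluation code: the function name `roc_auc` and the label encoding (1 for cancer, 0 for normal) are assumptions. It uses the rank interpretation of AUC, the probability that a randomly chosen cancer sample scores higher than a randomly chosen normal sample, with ties counted as one half.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties contribute half a win
    return wins / (len(pos) * len(neg))
```

The pairwise form is quadratic in the number of samples, which is adequate for cohort-sized data and keeps the definition explicit.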
- Conventional detection methods determine chromosome amounts from read counts, or use the distance between aligned reads, and treat read-related values as individual standardized values. In contrast, the method for diagnosing cancer and predicting cancer type using methylated cell-free nucleic acid according to the present invention generates vectorized data and analyzes it with an AI algorithm, so it is useful in that a similar effect can be obtained even when read coverage is low.
Claims (17)
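- A method of providing information for cancer diagnosis and cancer type prediction, the method comprising: (a) extracting nucleic acids from a biological sample and obtaining sequence information including methylation information; (b) aligning the obtained sequence information (reads) to a reference genome database; (c) generating vectorized data using nucleic acid fragments based on the aligned sequence information (reads); (d) determining the presence or absence of cancer by comparing a cut-off value with an output value obtained by inputting the generated vectorized data into a trained artificial intelligence model and analyzing it; and (e) predicting the cancer type through comparison of the output values.
- A method for cancer diagnosis and cancer type prediction, the method comprising: (a) extracting nucleic acids from a biological sample and obtaining sequence information including methylation information; (b) aligning the obtained sequence information (reads) to a reference genome database; (c) generating vectorized data using nucleic acid fragments based on the aligned sequence information (reads); (d) determining the presence or absence of cancer by comparing a cut-off value with an output value obtained by inputting the generated vectorized data into a trained artificial intelligence model and analyzing it; and (e) predicting the cancer type through comparison of the output values.
- The method according to claim 1 or 2, wherein step (a) is performed by a method comprising: (a-i) obtaining nucleic acids containing methylation information from a biological sample; (a-ii) removing proteins, fats, and other residues from the collected nucleic acids using a salting-out method, a column chromatography method, or a beads method, to obtain purified nucleic acids; (a-iii) preparing a single-end sequencing or paired-end sequencing library from the purified nucleic acids or from nucleic acids randomly fragmented by enzymatic digestion, pulverization, or a hydroshear method; (a-iv) reacting the prepared library with a next-generation sequencer; and (a-v) obtaining sequence information (reads) of the nucleic acids from the next-generation sequencer.
- The method according to claim 3, wherein the methylation information in step (a-i) is obtained by bisulfite conversion, enzymatic conversion, or methylated DNA immunoprecipitation (MeDIP).
- The method according to claim 1, wherein the vectorized data in step (c) is a Grand Canyon plot (GC plot).
- The method according to claim 5, wherein the GC plot is generated as vectorized data by calculating the distribution of the aligned nucleic acid fragments across chromosome bins as either the number (count) of fragments per bin or the distance between nucleic acid fragments.
- The method according to claim 6, wherein calculating the distribution per chromosome bin as the number of nucleic acid fragments is performed by steps comprising: i) dividing the chromosomes into bins; ii) determining the number of nucleic acid fragments aligned to each bin; iii) normalizing the number of nucleic acid fragments determined for each bin by dividing it by the total number of nucleic acid fragments in the sample; and iv) generating a GC plot with the order of each bin as the X-axis value and the normalized value calculated in step iii) as the Y-axis value.
- The method according to claim 6, wherein calculating the distribution per chromosome bin as the distance between nucleic acid fragments is performed by steps comprising: i) dividing the chromosomes into bins; ii) calculating the distance between nucleic acid fragments (Fragments Distance, FD) aligned to each bin; iii) determining a representative value (RepFD) of the distances for each bin based on the distance values calculated for each bin; iv) normalizing the representative value calculated in step iii) by dividing it by the representative value of the distances between all nucleic acid fragments; and v) generating a GC plot with the order of each bin as the X-axis value and the normalized value calculated in step iv) as the Y-axis value.
- The method according to claim 8, wherein the representative value is one or more selected from the group consisting of the sum, difference, product, mean, median, quantile, minimum, maximum, variance, standard deviation, median absolute deviation, and coefficient of variation of the distances between nucleic acid fragments, reciprocals thereof, and combinations thereof.
- The method according to claim 1, wherein the artificial intelligence model in step (d) is trained to distinguish normal vectorized data from vectorized data with cancer.
- The method according to claim 10, wherein the artificial intelligence model is selected from the group consisting of a convolutional neural network (CNN), a deep neural network (DNN), a recurrent neural network (RNN), and an autoencoder.
- The method according to claim 1, wherein the output value produced by the artificial intelligence model in step (d) by analyzing the input vectorized data is a DPI (Deep Probability Index) value.
- The method according to claim 1, wherein the cut-off value in step (d) is 0.5, and cancer is determined to be present when the output value is 0.5 or more.
- The method according to claim 1, wherein predicting the cancer type through comparison of the output values in step (e) is performed by a method comprising determining the cancer type showing the highest value among the output values to be the cancer of the sample.
- A device for cancer diagnosis and cancer type prediction, comprising: a decoding unit that extracts nucleic acids from a biological sample and decodes sequence information containing methylation information; an alignment unit that aligns the decoded sequences to a reference genome database; a data generation unit that generates vectorized data using nucleic acid fragments based on the aligned sequences; a cancer diagnosis unit that inputs the generated vectorized data into a trained artificial intelligence model, analyzes it, and compares the result with a cut-off value to determine the presence or absence of cancer; and a cancer type prediction unit that analyzes the output values to predict the cancer type.
- A computer-readable storage medium comprising instructions configured to be executed by a processor that performs cancer diagnosis and cancer type prediction, the instructions being configured to be executed by the processor to predict the presence or absence of cancer and the cancer type through steps comprising: (a) extracting nucleic acids from a biological sample and obtaining sequence information containing methylation information; (b) aligning the obtained sequence information (reads) to a reference genome database; (c) generating vectorized data using nucleic acid fragments based on the aligned sequence information (reads); (d) determining the presence or absence of cancer by comparing a cut-off value with an output value obtained by inputting the generated vectorized data into a trained artificial intelligence model and analyzing it; and (e) predicting the cancer type through comparison of the output values.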
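The fragment-distance variant of the GC plot recited in the claims (distance per bin, a representative value per bin, normalization by the representative value over all distances) can be sketched as follows. Several choices here are assumptions made only for illustration: the distance is taken between the start positions of consecutive fragments on the same chromosome, each distance is assigned to the bin of the left fragment, the median is used as the representative value (one of the options the claims list), and bins without any distance are given 0.0. The function name `fd_gc_plot` and the input layout are hypothetical.

```python
from statistics import median


def fd_gc_plot(starts_by_chrom, chrom_lengths, bin_size=100_000):
    """Return normalized per-bin representative fragment distances.

    starts_by_chrom: dict mapping chromosome name to fragment start positions.
    chrom_lengths: dict mapping chromosome name to length in bp; its insertion
    order defines the order of bins in the output.
    """
    per_bin = {}        # (chromosome, bin index) -> list of distances
    all_distances = []  # every distance in the sample, for the global value
    for chrom, starts in starts_by_chrom.items():
        ordered = sorted(starts)
        for left, right in zip(ordered, ordered[1:]):
            distance = right - left
            all_distances.append(distance)
            per_bin.setdefault((chrom, left // bin_size), []).append(distance)

    global_rep = median(all_distances) if all_distances else 0

    vector = []
    for chrom, length in chrom_lengths.items():
        n_bins = -(-length // bin_size)  # ceiling division
        for b in range(n_bins):
            distances = per_bin.get((chrom, b))
            if distances and global_rep:
                vector.append(median(distances) / global_rep)  # normalized RepFD
            else:
                vector.append(0.0)  # assumption: empty bins are set to zero
    return vector
```

Swapping `median` for another statistic (mean, quantile, and so on) changes the representative value without altering the rest of the procedure.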
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP22887598.5A EP4425499A1 (en) | 2021-10-26 | 2022-10-26 | Method for diagnosis of cancer and prediction of cancer type, using methylated acellular nucleic acid |
JP2024524688A JP2024537916A (ja) | 2021-10-26 | 2022-10-26 | Method for cancer type diagnosis and cancer type prediction using methylated cell-free DNA |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR1020210143610A KR20230059423A (ko) | 2021-10-26 | 2021-10-26 | Method for diagnosis of cancer and prediction of cancer type, using methylated acellular nucleic acid |
KR10-2021-0143610 | 2021-10-26 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2023075402A1 (ko) | 2023-05-04 |
Family
ID=86158124
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/KR2022/016448 WO2023075402A1 (ko) | 2021-10-26 | 2022-10-26 | Method for diagnosis of cancer and prediction of cancer type, using methylated acellular nucleic acid |
Country Status (4)
Country | Link |
---|---|
EP (1) | EP4425499A1 (ko) |
JP (1) | JP2024537916A (ko) |
KR (1) | KR20230059423A (ko) |
WO (1) | WO2023075402A1 (ko) |
Also Published As
Publication number | Publication date |
---|---|
KR20230059423A (ko) | 2023-05-03 |
EP4425499A1 (en) | 2024-09-04 |
JP2024537916A (ja) | 2024-10-16 |
Legal Events
Code | Title | Description |
---|---|---|
121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 22887598; Country of ref document: EP; Kind code of ref document: A1 |
ENP | Entry into the national phase | Ref document number: 2024524688; Country of ref document: JP; Kind code of ref document: A |
WWE | WIPO information: entry into national phase | Ref document number: 2022887598; Country of ref document: EP |
NENP | Non-entry into the national phase | Ref country code: DE |
ENP | Entry into the national phase | Ref document number: 2022887598; Country of ref document: EP; Effective date: 20240527 |