CN118240934A - Methylation signal detection method, device and kit
- Publication number
- CN118240934A, CN202310567683.8A
- Authority
- CN
- China
- Prior art keywords
- cancer
- sample
- methylation level
- risk
- level data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The invention provides a methylation signal detection method, device, and kit, and in particular relates to a biomarker combination for detecting the methylation level of a sample to be tested. The biomarker combination comprises any at least 10 of the differentially methylated regions (DMRs) shown in Table 1, where the DMR coordinates in Table 1 refer to the GRCh37/hg19 human reference genome. The invention allows the risk of cancer formation to be evaluated at low cost and with high accuracy.
Description
Technical Field
The invention relates to the technical field of biology, in particular to a methylation signal detection method and a kit.
Background
In 2018, cancer caused a large number of deaths worldwide, and most of these cases were diagnosed at a late stage. To date, intervention before distant metastasis offers the greatest opportunity to improve prognosis, so it is highly desirable to develop sensitive, reliable, and minimally invasive assays that detect cancer before symptoms appear. Among the many types of cancer, liver cancer (hepatocellular carcinoma, HCC) seriously endangers human health: it has a high incidence, an insidious onset, rapid progression, and high recurrence and mortality rates, and is called the "king of cancers". Most liver cancer patients who present at hospital are already at a middle or late stage, and without treatment the natural course of the disease is only 3-6 months. Currently, liver cancer is detected mainly by two types of test: serum marker detection and imaging.
Existing liver cancer serum marker assays include the serum alpha-fetoprotein (AFP) assay and blood enzymology and other tumor marker assays. Of these, the serum AFP assay is relatively specific for diagnosing liver cancer. A serum AFP level persistently at or above 400 μg/L by radioimmunoassay, with pregnancy, active liver disease, and the like excluded, supports a diagnosis of liver cancer. However, about 30% of liver cancer patients are clinically AFP-negative, so the assay misses these cases and its sensitivity is limited. Blood enzymology and other tumor marker tests rely on the fact that gamma-glutamyl transpeptidase and its isozymes, abnormal prothrombin, alkaline phosphatase, and lactate dehydrogenase isozymes may be higher than normal in the serum of liver cancer patients, but these tests also lack specificity.
Imaging examinations typically include ultrasound, computed tomography (CT), magnetic resonance imaging (MRI), selective celiac or hepatic angiography, and liver needle-aspiration cytology. However, imaging is informative only after a tumor has formed and reached a certain size, so it cannot serve the purpose of early detection or early cancer screening.
DNA methylation sequencing is increasingly recognized as a high-resolution, high-throughput technique that is useful in cancer screening, diagnosis, and monitoring. Most regions of the human genome are not active during the development of cancer, and cancer-related variations tend to concentrate in certain specific regions, such as CpG islands, which provides a good opportunity for targeted sequencing. Although a vast number of scientific articles report DNA methylation-based biomarkers and their clinical relevance in cancer, only a few dozen biomarkers have been converted into commercial clinical test products, and products directed to a single cancer (e.g., liver cancer) are scarcer still. Meanwhile, discovering and screening cancer-related differentially methylated regions (DMRs) is challenging: population heterogeneity, including disease, age, and so on, causes non-specific changes in methylation profiles, so signals that are abnormal but not cancer-related must be handled when a cancer assessment model is built. There is therefore an urgent need for methods and biomarker combinations that capture cancer-related DMRs and assess the risk of cancer formation.
Disclosure of Invention
The invention provides a methylation signal detection method, device, and kit, which use DNA or RNA oligonucleotide sequences to capture cancer-associated regions of methylation variation, determine whether tumor-derived components (ctDNA) are present in a sample to be tested, and provide a low-cost, high-accuracy way to assess the correlation with cancer formation risk.
In one aspect, the invention provides a biomarker combination for assessing the correlation of a sample to be tested with the risk of cancer formation, wherein the biomarker combination comprises any at least 10 of the differentially methylated regions (DMRs) shown in Table 1, and wherein the reference genome used for the DMRs in Table 1 is the GRCh37/hg19 human reference genome.
In another aspect, the invention provides a kit comprising reagents for detecting a biomarker combination as described above.
In another aspect, the invention provides the use of a reagent for detecting a biomarker combination as described above in the manufacture of a kit for diagnosing risk of cancer formation.
In another aspect, the invention provides a method of assessing the correlation of a sample to be tested with the risk of cancer formation, comprising: obtaining methylation level data by detecting a biomarker combination in the sample to be tested, wherein the biomarker combination is the biomarker combination described above; based on the methylation level data, quantifying the bias caused by confounding variables corresponding to the sample to be tested, to obtain corrected methylation level data; and, based on the result of comparing the corrected methylation level data with a preset threshold, generating indication information that represents the degree of correlation between the sample to be tested and the risk of cancer formation.
In another aspect, the invention provides an apparatus for assessing the correlation of a sample to be tested with the risk of cancer formation, comprising: an acquisition unit configured to obtain methylation level data by detecting a biomarker combination in the sample to be tested, wherein the biomarker combination is the biomarker combination described above; a correction unit configured to quantify, based on the methylation level data, the bias caused by confounding variables corresponding to the sample to be tested, to obtain corrected methylation level data; and a determining unit configured to generate, based on the result of comparing the corrected methylation level data with a preset threshold, indication information that represents the degree of correlation between the sample to be tested and the risk of cancer formation.
In another aspect, the present invention provides an electronic device, including: one or more processors; and a storage device having one or more programs stored thereon which, when executed by the one or more processors, cause the one or more processors to implement the method described above.
In another aspect, the present invention provides a computer readable storage medium having a computer program stored thereon, wherein the computer program, when executed by one or more processors, implements the method described above.
The biomarker combination, kit, method, use, apparatus, electronic device, and storage medium described above are suitable for cancer risk assessment and have the advantages of low cost and high accuracy.
Specifically, the cancers include one or more of the following: brain cancer, lung cancer, skin cancer, nasopharyngeal cancer, throat cancer, liver cancer, bone cancer, lymphoma, pancreatic cancer, intestinal cancer, rectal cancer, thyroid cancer, bladder cancer, kidney cancer, oral cancer, stomach cancer, solid tumors, ovarian cancer, esophageal cancer, gallbladder cancer, biliary tract cancer, breast cancer, cervical cancer, uterine cancer, prostate cancer, head and neck cancer, sarcoma, malignant tumors of the thorax (other than the lung), melanoma, and testicular cancer.
Drawings
Other features, objects and advantages of the present invention will become more apparent upon reading of the detailed description of non-limiting embodiments made with reference to the following drawings. The drawings are only for purposes of illustrating particular embodiments and are not to be construed as limiting the invention. In the drawings:
Fig. 1 shows an exemplary case where CpG sites cannot be classified into the same DMR.
Fig. 2 shows an exemplary case where CpG sites are partitioned into the same DMR.
Fig. 3 illustrates an exemplary case for explaining the principle of judging whether the DMR is valid or not in the present invention.
Fig. 4 shows the control results of the weight configuration of the confounding variables in the DOC model of the present application.
Fig. 5 shows that the DOC model established by the present invention remains balanced across the age groups.
Fig. 6 shows, for five replicates of 10 randomly selected DMRs according to the invention, the distributions for healthy people and for cancer patients as a whole, and the sensitivities for different cancer patients at 80% specificity.
Detailed Description
I. Definitions
In the present invention, unless otherwise indicated, scientific and technical terms used herein have the meanings commonly understood by one of ordinary skill in the art. Also, protein and nucleic acid chemistry, molecular biology, cell and tissue culture, microbiology, immunology-related terms and laboratory procedures as used herein are terms and conventional procedures that are widely used in the corresponding arts. Meanwhile, in order to better understand the present invention, definitions and explanations of related terms are provided below.
As used herein, the term "differentially methylated region" (DMR) generally refers to a region of DNA that contains one or more differentially methylated sites. For example, a DMR that includes a greater number or frequency of methylated sites under a selected condition of interest, such as a cancer state, may be referred to as a hypermethylated DMR. Conversely, a DMR that includes a smaller number or frequency of methylated sites under a selected condition of interest, such as a cancer state, may be referred to as a hypomethylated DMR.
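The hypermethylated/hypomethylated distinction can be illustrated with a minimal Python sketch. It is not the patent's procedure: it assumes each sample contributes one methylation level between 0 and 1 for the region (a proxy for the "number or frequency" of methylated sites), and the function name `classify_dmr` and the `min_delta` cutoff are hypothetical choices made only for illustration.

```python
from statistics import mean


def classify_dmr(case_levels, control_levels, min_delta=0.1):
    """Label a region by comparing mean methylation levels between two groups.

    case_levels / control_levels: per-sample methylation levels in [0, 1]
    for the condition of interest (e.g. cancer) and the reference group.
    min_delta: hypothetical minimum difference required to call the region
    differential; the source does not specify any such cutoff.
    """
    delta = mean(case_levels) - mean(control_levels)
    if delta >= min_delta:
        return "hypermethylated"
    if delta <= -min_delta:
        return "hypomethylated"
    return "not differential"


if __name__ == "__main__":
    # Toy values chosen for illustration only, not measured data.
    print(classify_dmr([0.8, 0.9], [0.2, 0.3]))
    print(classify_dmr([0.1, 0.2], [0.7, 0.8]))
```

A real analysis would use a statistical test across many samples rather than a fixed difference in means; the fixed cutoff here only makes the definition concrete.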
As used herein, the term "methylation" generally refers to the methylation state of a gene fragment, nucleotide, or base thereof of the present application. For example, a DNA fragment in which a gene of the application is located may have methylation on one or more strands. For example, a DNA fragment in which a gene of the application resides may have methylation at one site or DMR or at multiple sites or DMR.
As used herein, the term "next generation sequencing" (NGS) refers to any sequencing method that determines the nucleotide sequence of an individual nucleic acid molecule (e.g., in single molecule sequencing) or of a clonally amplified surrogate of an individual nucleic acid molecule in a high-throughput mode (e.g., sequencing more than 10^3, 10^4, 10^5, or more molecules simultaneously). Next generation sequencing platforms include, but are not limited to, existing platforms such as Illumina. As sequencing technology continues to develop, one skilled in the art will appreciate that other sequencing methods and devices may also be used for the present method. Next generation sequencing methods include, for example, sequencing by synthesis (Illumina), massively parallel signature sequencing (MPSS), polony sequencing, pyrosequencing (454), ion semiconductor sequencing (Ion Torrent), DNA nanoball sequencing, the DNA nanoarray and combinatorial probe-anchor ligation sequencing of Complete Genomics, single molecule real-time sequencing (Pacific Biosciences), and sequencing by ligation (SOLiD sequencing). Because next generation sequencing enables detailed analysis of the transcriptome and genome of a species, it is also referred to as deep sequencing. The methods of the invention are equally applicable to first generation, second generation, and third generation gene sequencing, or to single molecule sequencing (SMS).
As used herein, the term "human reference genome" generally refers to a human genome that serves as a reference in gene sequencing. Information on the human reference genome is available from UCSC. The human reference genome exists in different versions, for example hg19, hg38, GRCh37, GRCh38, GCA_000001405, GCF_000001405, or Ensembl 75.
As used herein, the terms "polynucleotide," "nucleotide," "nucleic acid," and "oligonucleotide" are used interchangeably. They denote polymeric forms of nucleotides (deoxyribonucleotides or ribonucleotides) of any length, or analogues thereof. Polynucleotides may have any three-dimensional structure and may perform any function, whether known or unknown. The following are non-limiting examples of polynucleotides: coding or non-coding regions of a gene or gene fragment, loci defined from linkage analysis, exons, introns, messenger RNAs (mRNA), transfer RNAs (tRNA), ribosomal RNAs (rRNA), short interfering RNAs (siRNA), short hairpin RNAs (shRNA), microRNAs (miRNA), ribozymes, cDNAs, recombinant polynucleotides, branched polynucleotides, plasmids, vectors, isolated DNA of any sequence, isolated RNA of any sequence, nucleic acid probes, primers, and adaptors. Polynucleotides may include one or more modified nucleotides, such as methylated nucleotides and nucleotide analogs.
As used herein, the term "sample to be tested" generally refers to a sample that is to be tested. For example, the presence or absence of a modification in one or more gene regions on a test sample can be detected. In embodiments of the present invention, the sample to be tested includes, but is not limited to, a tissue sample, a blood sample, saliva, sputum, pleural effusion, pulmonary lavage, peritoneal effusion, peritoneal lavage, and cerebrospinal fluid.
As used herein, the term "Youden index", also known as the correct index, is a measure for evaluating the validity of a screening test; it applies when false negatives (the missed diagnosis rate) and false positives (the misdiagnosis rate) are considered equally harmful. The Youden index is the sum of sensitivity and specificity minus 1, and indicates the overall ability of the screening method to identify true patients and non-patients. The larger the index, the better the screening test performs and the greater its validity. The term "optimal Youden index" refers to the case where the sum of sensitivity and specificity minus 1 is largest.
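As a worked illustration of this definition, the following Python sketch computes the index and scans candidate cutoffs to find the one where it is largest. The function names, the convention that a score at or above the cutoff is called positive, and the toy data are assumptions for illustration; the source does not say how the optimum is searched.

```python
def youden_index(sensitivity, specificity):
    """Youden index J = sensitivity + specificity - 1."""
    return sensitivity + specificity - 1.0


def best_threshold(scores, labels):
    """Return (threshold, J) for the cutoff with the largest Youden index.

    scores: one numeric score per sample.
    labels: 1 for a true patient, 0 for a non-patient.
    Assumption: a sample is called positive when score >= threshold.
    """
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one patient and one non-patient")

    best = None
    for t in sorted(set(scores)):
        true_pos = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t)
        true_neg = sum(1 for s, y in zip(scores, labels) if y == 0 and s < t)
        j = youden_index(true_pos / positives, true_neg / negatives)
        if best is None or j > best[1]:
            best = (t, j)
    return best


if __name__ == "__main__":
    # Toy scores for illustration only, not data from the patent.
    toy_scores = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
    toy_labels = [0, 0, 0, 1, 1, 1]
    print(best_threshold(toy_scores, toy_labels))
```

When several cutoffs tie on J, this sketch keeps the lowest one; a different tie-breaking rule could be substituted without changing the definition.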
II. Detailed description of the preferred embodiments
In one aspect, the invention provides a biomarker combination for assessing the correlation of a sample to be tested with the risk of cancer formation, wherein the biomarker combination comprises any at least 10 of the differentially methylated regions (DMRs) shown in Table 1, and wherein the reference genome used for the DMRs in Table 1 is the GRCh37/hg19 human reference genome.
In some preferred embodiments, the 10 DMRs are any one of the sets of DMRs shown in Table 3.
In some preferred embodiments, the biomarker combinations described above comprise all 100 DMRs shown in table 1.
In another aspect, the invention provides a kit, wherein the kit comprises reagents for detecting the biomarker combination.
In some alternative embodiments, the above-described kits comprise next-generation sequencing reagents.
In some preferred embodiments, the next generation sequencing reagents described above include hybridization capture probes or primers covering any at least 10 of the DMRs in Table 1.
In some more preferred embodiments, the next generation sequencing reagents described above include hybridization capture probes or primers that cover any one of the sets of DMRs shown in Table 3.
In some more preferred embodiments, the next generation sequencing reagents described above include hybridization capture probes or primers that cover all 100 DMRs shown in Table 1.
In some alternative embodiments, the above-described kit is used to assess the correlation of a test sample with the risk of cancer formation.
In some preferred embodiments, the cancer comprises: liver cancer, lung cancer, intestinal cancer, pancreatic cancer, esophageal cancer and ovarian cancer.
In some more preferred embodiments, the cancer is liver cancer.
In another aspect, the invention provides the use of a reagent for detecting a biomarker combination as described above in the manufacture of a kit for diagnosing risk of cancer formation.
In some preferred embodiments, the cancer comprises: liver cancer, lung cancer, intestinal cancer, pancreatic cancer, esophageal cancer and ovarian cancer.
In some more preferred embodiments, the cancer is liver cancer.
In another aspect, the invention provides a method of assessing the correlation of a test sample with the risk of cancer formation, comprising: obtaining methylation level data obtained by detecting a biomarker panel in the sample to be tested, wherein the biomarker panel comprises the biomarker panel described above; quantifying, based on the methylation level data, the bias caused by confounding variables corresponding to the sample to be tested, to obtain corrected methylation level data; and generating, based on the result of comparing the corrected methylation level data with a preset threshold, indication information characterizing the degree of correlation between the sample to be tested and the risk of cancer formation.
In some preferred embodiments, the first indication information indicative of the degree of risk of cancer formation correlation is generated in response to the corrected methylation level data being less than or equal to a preset threshold value, and the second indication information indicative of the degree of risk of cancer formation correlation is generated in response to the corrected methylation level data being greater than the preset threshold value. The indication information may be a prompt information indicating whether there is a risk of cancer formation or indicating a risk of cancer formation of different degrees. For example, the first indication information may be a prompt information for indicating that there is a risk of forming cancer or a risk of forming cancer is high, and the second indication information may be a prompt information for indicating that there is no risk of forming cancer or a risk of forming cancer is low.
In some preferred embodiments, the preset threshold is the mean of the thresholds corresponding to the optimal Youden index in each fold, obtained by ten-fold cross-validation on the training set samples.
In some preferred embodiments, the preset threshold is taken from the range of -0.4 to -1.65.
In some more preferred embodiments, the preset threshold is taken from the range of -1.22 to -1.65.
In some preferred embodiments, the cancer comprises: liver cancer, lung cancer, intestinal cancer, pancreatic cancer, esophageal cancer and ovarian cancer.
In some more preferred embodiments, the cancer is liver cancer.
In some alternative embodiments, the sample to be tested is selected from any one or more of the following: tissue samples, blood samples, saliva, sputum, pleural effusion, lung lavage, peritoneal effusion, peritoneal lavage, enema, and cerebrospinal fluid.
In another aspect, the invention provides an apparatus for assessing the correlation of a test sample with the risk of cancer formation, comprising: an acquisition unit configured to acquire methylation level data obtained by detecting a biomarker combination in a sample to be tested, wherein the biomarker combination comprises the biomarker combination described above; a correction unit configured to quantify, based on the methylation level data, the bias caused by confounding variables corresponding to the sample to be tested, to obtain corrected methylation level data; and a determining unit configured to generate, based on the result of comparing the corrected methylation level data with a preset threshold, indication information characterizing the degree of correlation between the sample to be tested and the risk of cancer formation.
In some preferred embodiments, the above-mentioned determining unit is further configured to generate first indication information for indicating the degree of risk of cancer formation correlation in response to the corrected methylation level data being less than or equal to a preset threshold value, and generate second indication information for indicating the degree of risk of cancer formation correlation in response to the corrected methylation level data being greater than the preset threshold value. The indication information may be a prompt information indicating whether there is a risk of cancer formation or indicating a risk of cancer formation of different degrees. For example, the first indication information may be a prompt information for indicating that there is a risk of forming cancer or a risk of forming cancer is high, and the second indication information may be a prompt information for indicating that there is no risk of forming cancer or a risk of forming cancer is low.
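The flow through the acquisition, correction and determining units can be sketched as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, the corrected methylation level data is assumed to reduce to a single score, and `mean_minus_offset` is only a stand-in for the correction step, which the text does not spell out at this point.

```python
from typing import Callable, Sequence

FIRST_INDICATION = "first indication information"
SECOND_INDICATION = "second indication information"


def assess_sample(
    methylation_levels: Sequence[float],
    correct: Callable[[Sequence[float]], float],
    threshold: float,
) -> str:
    """Acquired data -> corrected score -> comparison with the preset threshold."""
    corrected = correct(methylation_levels)   # correction unit (placeholder)
    if corrected <= threshold:                # determining unit
        return FIRST_INDICATION
    return SECOND_INDICATION


def mean_minus_offset(levels: Sequence[float]) -> float:
    """Hypothetical stand-in for the correction step, for illustration only."""
    return sum(levels) / len(levels) - 1.0


print(assess_sample([0.2, 0.4, 0.6], mean_minus_offset, threshold=-0.4))
print(assess_sample([0.9, 0.9, 0.9], mean_minus_offset, threshold=-0.4))
```

Passing the correction step in as a callable keeps the comparison logic independent of how the bias is quantified.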
In some preferred embodiments, the preset threshold is the mean of the thresholds corresponding to the optimal Youden index in each fold, obtained by ten-fold cross-validation on the training set samples.
In some preferred embodiments, the preset threshold is taken from the range of -0.4 to -1.65.
In some more preferred embodiments, the preset threshold is taken from the range of -1.22 to -1.65.
In some preferred embodiments, the cancer comprises: liver cancer, lung cancer, intestinal cancer, pancreatic cancer, esophageal cancer and ovarian cancer.
In some more preferred embodiments, the cancer is liver cancer.
In another aspect, the present invention provides an electronic device, including: one or more processors; and a storage device having one or more programs stored thereon, which when executed by the one or more processors, cause the one or more processors to implement the method.
The implementation environment of the present invention includes an electronic device, and the method for evaluating the correlation between the sample to be tested and the risk of cancer formation in the embodiment of the present invention may be executed by a terminal device. By way of example, the electronic device may comprise at least one of a terminal device or a server.
The terminal device may be hardware or software. When the terminal device is hardware, it may be a variety of electronic devices having a display screen and supporting information input (e.g., text input and/or voice input, etc.), including but not limited to smart phones, tablet computers, laptop and desktop computers, and the like. When the terminal device is software, it can be installed in the above-listed terminal device. It may be implemented as a plurality of software or software modules (e.g. to provide a correlation service for assessing the risk of developing cancer in a sample to be tested) or as a single software or software module. The present invention is not particularly limited herein.
In another aspect, the present invention provides a computer readable storage medium having a computer program stored thereon, wherein the computer program, when executed by one or more processors, implements the method described above.
The computer readable medium of the present invention may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, fiber optic cables, RF (radio frequency), and the like, or any suitable combination of the foregoing.
The computer readable medium may be contained in the electronic device; or may exist alone without being incorporated into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to implement the method of assessing the correlation of a sample under test with risk of cancer formation shown in the above-described embodiments and alternative embodiments thereof.
Computer program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java, Smalltalk and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The above description is only illustrative of the preferred embodiments of the present invention and of the principles of the technology employed. It will be appreciated by persons skilled in the art that the scope of the disclosure referred to in the present invention is not limited to the specific combinations of technical features described above, but also covers other technical solutions formed by any combination of the technical features described above or their equivalents without departing from the spirit of the disclosure, for example, technical solutions formed by replacing the above features with (but not limited to) technical features having similar functions disclosed in the present invention.
Examples
Example 1: division of DMR regions
1. Hypothesis testing
Obtaining a sample to be tested (for example, a blood sample), wherein the sample to be tested is divided into a liver cancer group (C group) and a normal group (N group), and the bisulfite methylation sequencing of the sample to be tested can comprise the following steps:
S1: cell-free DNA (cfDNA) extraction: for example, the QIAamp Circulating Nucleic Acid Kit (Qiagen, 55114) and its corresponding platform can be used;
S2: bisulfite conversion: for example, the bisulfite conversion (BC) step is performed using a modified protocol based on the EZ-96 DNA Methylation-Lightning MagPrep kit (Zymo, D5047);
S3: pre-library preparation: this comprises a first tailing and ligation step, in which a splint (splinter) adapter carrying a run of randomly synthesized G or A bases may be used; the poly-C/T tail at the 3' end of the single-stranded DNA substrate anneals to the overhang of the adapter, and ligation is completed after hybridization with this first tail. The DNA substrate carrying an adapter at one end is then denatured into single strands and subjected to 5-15 rounds of linear amplification; a second tailing and ligation step, similar to the first, ligates the second adapter to the A tail at the other end of the DNA substrate; and several rounds of PCR amplification complete the preparation of the pre-library (see, for example, Chinese patent publication CN110892097A);
S4: pre-library hybridization: the pre-library is hybridized with hybridization capture probes covering the target DMR regions;
S5: capture and elution: the probes are bound to magnetic beads, non-specific fragments are washed away, the magnetic beads are removed, and the final library is formed by PCR amplification;
S6: sequencing: the final library is sequenced on an NGS sequencer to generate sequencing data containing the target DMR regions.
In this embodiment, a noise-reduction step for genomic methylation-signal CpG sites and noise-region CHH/CHG sites may optionally be included (see, for example, Chinese patent publication CN114974417A).
For each CpG site, a hypothesis test is performed to determine whether the difference between group C and group N is statistically significant, and a P value is calculated for each CpG site. The calculation uses weighted logistic regression (weighted LR): the weight assigned is determined by the coverage depth of each CpG site, the methylation level of each CpG site is the explanatory variable, and the output is a binary result in {0, 1} corresponding to groups C and N.
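A per-site test of this kind can be sketched in plain Python. The sketch below is illustrative rather than the exact procedure of this embodiment: it assumes a single explanatory variable with an intercept, a Newton-Raphson fit, a two-sided Wald test for the slope, and coverage depths rescaled to average 1 before being used as weights. The function name, the encoding (1 for group C, 0 for group N) and the sample values are hypothetical.

```python
import math


def weighted_logistic_pvalue(x, y, depth, iterations=50):
    """Fit P(y=1) = sigmoid(b0 + b1*x) with depth-derived weights.

    Returns (b1, two-sided Wald p-value for b1).
    """
    # Assumption: rescale depths so the weights average 1; deeper coverage
    # then raises influence without inflating the effective sample size.
    scale = len(depth) / sum(depth)
    w = [d * scale for d in depth]
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi, wi in zip(x, y, w):
            eta = max(-30.0, min(30.0, b0 + b1 * xi))  # clamp for numerical safety
            p = 1.0 / (1.0 + math.exp(-eta))
            g0 += wi * (yi - p)            # gradient of weighted log-likelihood
            g1 += wi * (yi - p) * xi
            v = wi * p * (1.0 - p)         # information-matrix contributions
            h00 += v
            h01 += v * xi
            h11 += v * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (information matrix) * delta = gradient.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    # Wald test using the information matrix from the last iteration.
    se_b1 = math.sqrt(h00 / det)
    z = b1 / se_b1
    return b1, math.erfc(abs(z) / math.sqrt(2.0))


# Hypothetical methylation levels at one CpG site: five C samples, five N samples.
levels = [0.8, 0.7, 0.9, 0.4, 0.3, 0.2, 0.3, 0.1, 0.5, 0.6]
groups = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
depths = [500, 400, 450, 200, 300, 500, 350, 420, 250, 300]
slope, p_value = weighted_logistic_pvalue(levels, groups, depths)
print(slope > 0, 0.0 < p_value < 1.0)
```

The fit does not converge when the two groups are perfectly separated by methylation level, so a production implementation would need a safeguard for that case.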
2. Partitioning of DMRs
The similarity of methylation levels at spatially consecutive genomic sites is evaluated according to the following formula, taking the methylation level and sequencing coverage depth of each methylated CpG site as parameters. The deeper the coverage, the larger the value of parameter P in the formula and the higher the similarity of methylation levels between adjacent CpG sites within the same group (liver cancer group or normal group); the DMRs are divided on this basis:
The subscript ij of each parameter represents the j-th site of the i-th sample, the parameter d is used for representing the effective coverage depth of the CpG sites in the liver cancer group, and the parameter M is used for representing the methylation level of the CpG sites in the liver cancer group.
After calculation, the β value is used as the decision index, with β = 0.25 as the preset threshold. When β is less than the preset threshold, the j-th and (j+1)-th sites can be included in the calculation of the region statistic B (the B value characterizes whether a DMR obtained by division is an effective DMR) and may be divided into one DMR; when β is greater than or equal to the preset threshold, the j-th and (j+1)-th sites cannot be included in the calculation of the region statistic B and are not divided into one DMR.
In this embodiment, an exemplary case in which adjacent sites cannot be divided into the same DMR is given (as shown in fig. 1) to explain the principle of DMR division in the present invention.
Wherein the colored dots characterize a methylated CpG site, sample A, sample B, and sample C are from the same sample group (e.g., tumor group or normal group as described above), wherein sample A and sample B each obtain coverage of 500 effective sequences, and sample C obtains coverage of 200 effective sequences. The dots of each column correspond to the same CpG site, with the methylation level of the first CpG site in the region being 0.2 and the methylation level of the second CpG site being 0 in sample A.
For sample A, sample B and sample C above, the coverage depth parameter value P for the first CpG site within the region is calculated to be 0.617. Substituting the above parameters into the above formula gives β_11 = 0.29. Since this exceeds the preset threshold of 0.25, the methylation levels of the first and second CpG sites in the region are judged too different, and the two adjacent CpG sites are not divided into the same DMR.
Another exemplary case of dividing into the same DMR is given in this embodiment (as shown in fig. 2) to explain the principle of dividing the DMR in the present invention.
Wherein the colored dots characterize a methylated CpG site, sample A, sample B and sample D are from the same sample group (e.g., tumor group or normal group) and wherein sample A and sample B each obtain coverage of 500 effective sequences and sample D obtains coverage of 400 effective sequences (the coverage depth of sample D is increased compared to sample C in the previous example, and thus the P value in the present example is also increased accordingly). Also, in sample a, the methylation level of the first CpG site in this region is 0.2 and the methylation level of the second CpG site is 0.
For sample A, sample B and sample D above, the coverage depth parameter value P for the first CpG site within the region is calculated to be 0.962. Substituting the above parameters into the above formula gives β_11 = 0.21. Since this is below the preset threshold of 0.25, the methylation levels of the first and second CpG sites in the region are judged sufficiently similar, and the two adjacent CpG sites are divided into the same DMR.
For the above method, see Chinese patent publication CN115132273A.
Therefore, the coverage depth of CpG sites is introduced in the DMR division process by the method, so that the accuracy of DMR region division can be remarkably improved.
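The thresholding rule applied in the two worked cases (β = 0.29 keeps two sites apart, β = 0.21 merges them) can be sketched as a simple grouping pass. This sketch takes the β values as precomputed inputs, since the formula for β is not reproduced here; the function name and the zero-based site indices are hypothetical.

```python
def partition_sites(betas, threshold=0.25):
    """Group consecutive CpG site indices into candidate regions.

    betas[j] is the precomputed beta value for the adjacent pair (j, j+1),
    so n sites need n-1 beta values.
    """
    regions = [[0]]
    for j, beta in enumerate(betas):
        if beta < threshold:
            regions[-1].append(j + 1)   # similar enough: extend the current region
        else:
            regions.append([j + 1])     # too different: start a new region
    return regions


print(partition_sites([0.29]))  # first worked case: the two sites stay apart
print(partition_sites([0.21]))  # second worked case: the two sites are merged
```

Each resulting group is only a candidate region; whether it counts as an effective DMR is decided separately by the region statistic B.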
3. Calculation of region statistics B value
In some optional embodiments, based on the above calculated β value, a region statistic B value of CpG sites in the region is further calculated according to the following formula to represent whether the DMR obtained by the division is a valid DMR.
The calculation formula of the value B is as follows:
Here the parameter k is the number of CpG sites in the region, and the subscript ij of each parameter denotes the j-th site of the i-th sample. With β = 0.25 as the preset threshold, when β is less than the preset threshold the j-th and (j+1)-th sites can be included in the calculation of the region statistic B and may be divided into one DMR; when β is greater than or equal to the preset threshold, the j-th and (j+1)-th sites cannot be included in the calculation of the region statistic B and are not divided into one DMR. With B = 1 as the preset threshold, when the B value is less than the preset threshold the DMR corresponding to the j-th and (j+1)-th sites can be used as an effective DMR; when the B value is greater than or equal to the preset threshold, the DMR corresponding to the j-th and (j+1)-th sites is not used as an effective DMR.
An exemplary case (as shown in fig. 3) is given in this embodiment to explain the principle of judging whether the DMR is effective in the present invention.
Where the DMRs divided for groups A, B and C each contain 10 CpG sites, the B_ij values of all samples are pooled when calculating the B value of each DMR, and their mean is taken as the score of that DMR.
Wherein the calculation steps of the B value in the DMR shown in the group A are shown in the following table:
The B-value score of the DMR corresponding to group A is less than the preset threshold of 1; therefore, this DMR can be used as an effective DMR.
Similarly, the B-value score of the DMR shown in group B is less than the preset threshold, so it can be used as an effective DMR; the B-value score of the DMR shown in group C is not less than the preset threshold, so the DMR corresponding to group C cannot be used as an effective DMR.
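The scoring rule described above (pool the B_ij values of all samples, average them, and compare the mean with the preset threshold of 1) can be sketched as follows. The B_ij values are hypothetical inputs, because the formula that produces them is not reproduced here, and the function names are illustrative.

```python
from statistics import mean


def dmr_score(b_values_per_sample):
    """Pool the B_ij values of all samples and return their mean."""
    pooled = [b for sample in b_values_per_sample for b in sample]
    return mean(pooled)


def is_effective_dmr(b_values_per_sample, threshold=1.0):
    """A DMR is treated as effective when its score is below the threshold."""
    return dmr_score(b_values_per_sample) < threshold


# Hypothetical B_ij values for two samples at two sites each.
print(dmr_score([[0.5, 0.75], [1.0, 0.25]]), is_effective_dmr([[0.5, 0.75], [1.0, 0.25]]))
print(dmr_score([[1.5, 2.0], [0.5, 1.0]]), is_effective_dmr([[1.5, 2.0], [0.5, 1.0]]))
```

A score exactly equal to the threshold is treated as not effective, matching the "greater than or equal to" branch of the rule.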
Example 2: cancer detection (Detection of Cancer, DOC) model building
The invention quantifies the bias introduced by confounding variables that may affect the accuracy of the classification model, thereby improving the accuracy and generalization ability of the DOC model. In the application scenario of the invention, the ctDNA content in a patient's blood differs greatly between stages of liver cancer development, ctDNA content is readily affected by experimental batch effects, and methylation is related to the age and ethnicity of the sample donor and to whether the donor has other diseases; each of these conditions may constitute a confounding variable in this embodiment.
The parameters involved in the formulas shown in this embodiment are defined in accordance with the definitions known in the art, except for the parameters specifically defined and explained.
To quantify the bias caused by confounding variables, the invention adopts a Salmon model construction method; an exemplary quantification approach in this embodiment is the Hilbert-Schmidt independence criterion (HSIC). After the bias has been quantified, a regularization term is embedded in the model for correction.
Quantification using the Hilbert-Schmidt independence criterion is given by the following formula:
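$$\lVert C_{h(x)h(z)}\rVert^{2} = \left(E[h(x)h(z)] - E[h(x)]\,E[h(z)]\right)^{2} = \left(E[h(x)h(z)]\right)^{2} + \left(E[h(x)]\,E[h(z)]\right)^{2} - 2\,E[h(x)h(z)]\,E[h(x)]\,E[h(z)]$$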
Here L_H (the Hilbert-Schmidt independence criterion), calculated by the formula, characterizes the degree of independence between the variables X and Z. In the present invention, the feature vectors are X = (x_1, …, x_m), where x_i is an n-dimensional vector representing the methylation features of sample i; the classification labels are Y = (y_1, …, y_m), where y_i ∈ {-1, +1} is the classification label of x_i (positive when y_i = +1, negative when y_i = -1); the confounding variables are Z = (z_1, …, z_m), where z_i is the confounding variable of sample i; and m is the number of samples.
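The squared-covariance expression and its expansion can be checked numerically with a small sketch. It assumes scalar values of h(x) and h(z) and replaces the expectations with sample means; the function names and sample values are hypothetical, and a full HSIC estimate with kernel feature maps is beyond this illustration.

```python
def sample_mean(values):
    return sum(values) / len(values)


def squared_covariance(hx, hz):
    """(E[h(x)h(z)] - E[h(x)]E[h(z)])^2 estimated from paired samples."""
    e_xz = sample_mean([a * b for a, b in zip(hx, hz)])
    return (e_xz - sample_mean(hx) * sample_mean(hz)) ** 2


def expanded_form(hx, hz):
    """The same quantity written as the three-term expansion."""
    e_xz = sample_mean([a * b for a, b in zip(hx, hz)])
    e_x_e_z = sample_mean(hx) * sample_mean(hz)
    return e_xz ** 2 + e_x_e_z ** 2 - 2 * e_xz * e_x_e_z


dependent = ([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])    # z follows x
unrelated = ([1.0, 2.0, 1.0, 2.0], [1.0, 1.0, 2.0, 2.0])    # no linear relation
print(squared_covariance(*dependent), expanded_form(*dependent))
print(squared_covariance(*unrelated))
```

A value of zero indicates no linear dependence between the feature and the confounder under these assumptions, which is the condition the regularization term pushes the model toward.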
A support vector machine (SVM) is used as the main classifier for binary classification. To control the confounding variables, a regularization term is added to the objective function solved by the SVM, where the objective function is
$$\text{s.t.}\quad y_i\,(w^{T}x_i + b) \ge 1 - \xi_i,$$
$$\xi_i \ge 0$$
Here ξ_i is the degree to which sample x_i violates the margin constraint, and C and λ are coefficients that control the balance among minimizing the training error, minimizing the correlation between the confounding variables and the explanatory variables, and maximizing the classification margin.
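One objective consistent with this description is sketched below. It is an assumption offered for orientation only, not the objective of the invention: it takes the standard soft-margin SVM form and adds the independence measure L_H as a penalty weighted by λ, so that the three terms correspond to maximizing the margin, minimizing the training error, and minimizing the dependence on the confounding variables.

```latex
\min_{w,\,b,\,\xi}\;\; \frac{1}{2}\lVert w\rVert^{2} \;+\; C\sum_{i=1}^{m}\xi_i \;+\; \lambda\,L_H
\qquad\text{s.t.}\qquad y_i\,(w^{T}x_i + b) \ge 1 - \xi_i,\qquad \xi_i \ge 0 .
```

Under this reading, a larger C penalizes margin violations more strongly, and a larger λ trades classification fit for independence from the confounders.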
In this embodiment, fig. 4 shows the control result of the weight configuration of the DOC model of the present application for the confounding variables.
Each data point represents a blood sample used for DOC model construction; the horizontal axis shows the confounding variable of the corresponding sample, and the vertical axis shows the original uncorrected explanatory variable (left panel) and the corrected explanatory variable (right panel). Comparing the panels before and after correction shows that the weight of the confounding variable is controlled in the DOC model established by the invention.
In this example, fig. 5 shows that the DOC model established in the present invention overcomes a weakness of earlier methylation-based approaches, in which false positives in healthy populations increase with age, and maintains balance across all age groups (the horizontal axis represents age, and the vertical axis represents the liver cancer probability score of the model).
Example 3: detection of liver cancer based on DMR by DOC model
Blood samples from 100 healthy individuals and 30 liver cancer patients were used as the training set. Based on how well each DMR region distinguishes healthy individuals from cancer patients, 100 DMRs with clear discrimination were screened out for constructing the DOC model and determining the threshold. Table 1 below lists the 100 DMRs of the invention used for DOC model detection.
Table 1. 100 DMRs screened by the methylation detection model of the present invention
Ten-fold cross-validation is performed on the training set, and the mean of the thresholds corresponding to the optimal Youden index in each fold is taken as the threshold for classifying test samples as liver cancer negative or positive. Specifically, the healthy samples in the training set are first randomly split into 10 parts, and the liver cancer samples are likewise randomly split into 10 parts. A DOC model is then built from 9/10 of the healthy samples and 9/10 of the liver cancer samples and used to predict the remaining 1/10 of the healthy samples and 1/10 of the liver cancer samples. The "optimal threshold" for this fold is obtained by the optimal Youden index principle. The procedure is repeated until all samples have been traversed, which yields 10 "optimal thresholds" because the cross-validation is ten-fold. Finally, the mean of the 10 "optimal thresholds" is calculated as the threshold of the DOC model (for the DOC model of 100 DMRs, threshold = -0.4). The model and its threshold can then be used to classify the test set samples: a sample scoring below the threshold is regarded as negative, and a sample scoring above the threshold is regarded as positive. Using the DOC model and threshold described above, a test set consisting of another 100 healthy samples and 82 liver cancer samples was evaluated. The overall sensitivity on the test dataset was 86.6% (71/82) and the overall specificity was 90.0% (90/100); the sensitivity at each stage is shown in Table 2:
Table 2. Sensitivity by liver cancer stage
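The threshold selection by cross-validation can be sketched as follows. This is a minimal sketch under stated assumptions: the model scores of each held-out fold are taken as given (building the DOC model itself is outside the sketch), candidate cut-offs are midpoints between consecutive distinct scores, and the function names and example scores are hypothetical. The example uses two folds instead of ten purely for brevity.

```python
from statistics import mean


def best_youden_threshold(scores, labels):
    """Return the cut-off that maximizes sensitivity + specificity - 1.

    A sample is called positive when its score is greater than the cut-off.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    ordered = sorted(set(scores))
    # Candidate cut-offs: midpoints between consecutive distinct scores.
    candidates = [(a + b) / 2 for a, b in zip(ordered, ordered[1:])]
    best_cut, best_j = None, -1.0
    for cut in candidates:
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s > cut)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s <= cut)
        j = tp / positives + tn / negatives - 1
        if j > best_j:
            best_cut, best_j = cut, j
    return best_cut


def cross_validated_threshold(folds):
    """folds: one (scores, labels) pair per held-out fold."""
    return mean(best_youden_threshold(s, y) for s, y in folds)


# Hypothetical held-out scores (label 1 = liver cancer, 0 = healthy).
folds = [
    ([-2.0, -1.0, 0.0, 1.0], [0, 0, 1, 1]),
    ([-3.0, -1.0, 1.0, 3.0], [0, 0, 1, 1]),
]
print(cross_validated_threshold(folds))
```

Averaging the per-fold optima, rather than picking a single cut-off on the whole training set, makes the final threshold less sensitive to any one random split.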
The threshold determination method is not limited to the specific 100 DMRs; considering factors such as cost, it can also be applied to a DOC model built from fewer DMRs. For example, 10 DMRs randomly selected from the 100 DMRs can be used to construct a new DOC model and determine its threshold, see Table 3:
Table 3. Five randomly selected sets of 10 DMRs and their threshold information
The sensitivity and specificity results of these five random replicates are shown in Table 4 (model parameters were adjusted so that the specificity in every random round was controlled at the same level, namely 80%):
Table 4. Sensitivity and specificity in the repeated tests with 10 random DMRs
The sensitivity (including that of each stage) and specificity for the healthy and liver cancer groups in the above five repeated tests are shown in Table 5:
Table 5. Sensitivity and specificity results for each stage in the repeated tests with 10 random DMRs
It can thus be seen that any 10 of the 100 DMRs provided by the invention achieve good specificity and sensitivity at each stage of liver cancer and meet the expected use requirements.
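The random selection of 10-DMR subsets used in the five repeated tests can be sketched as follows. The function name, the integer IDs 1 to 100 standing in for the rows of Table 1, and the fixed seed are hypothetical; the subsets drawn here are illustrative and are not the sets listed in Table 3.

```python
import random


def draw_dmr_subsets(n_total=100, subset_size=10, repeats=5, seed=0):
    """Draw `repeats` random subsets of `subset_size` DMR IDs without replacement."""
    rng = random.Random(seed)                # fixed seed for reproducibility
    dmr_ids = list(range(1, n_total + 1))    # hypothetical IDs for Table 1 rows
    return [sorted(rng.sample(dmr_ids, subset_size)) for _ in range(repeats)]


for subset in draw_dmr_subsets():
    print(subset)
```

Each subset would then be used to rebuild the DOC model and re-derive its threshold in the same way as for the full 100-DMR model.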
Furthermore, although the description provides a DOC model for methylation detection and a threshold determination method based only on the differences between liver cancer patients and healthy individuals, in practice the method is equally applicable to patients with other cancers, including lung cancer, intestinal cancer, pancreatic cancer, esophageal cancer and ovarian cancer.
The foregoing detailed description is provided by way of explanation and example and is not intended to limit the scope of the appended claims. Numerous variations of the presently illustrated embodiments of the application will be apparent to those of ordinary skill in the art and are intended to be within the scope of the appended claims and equivalents thereof.
Claims (10)
1. A biomarker panel for detecting the methylation level of a test sample, wherein the biomarker panel comprises any of at least 10 differentially methylated regions DMR as set forth in table 1, wherein the reference genome employed by the DMR in table 1 is a GRCh37/hg19 human reference genome;
preferably, the 10 DMRs are any one set of DMRs shown in table 3;
preferably, the biomarker combination comprises all 100 DMRs shown in table 1.
2. A kit, wherein the kit comprises reagents for detecting the biomarker combination of claim 1.
3. The kit of claim 2, wherein the kit comprises next generation sequencing reagents;
Preferably, the next generation sequencing reagents comprise hybridization capture probes or primers covering any at least 10 DMR in table 1;
more preferably, the next generation sequencing reagents comprise hybridization capture probes or primers covering any of the sets of DMRs shown in table 3;
More preferably, the next generation sequencing reagents comprise hybridization capture probes or primers that cover all 100 DMRs shown in table 1.
4. A kit according to claim 2 or 3, wherein the kit is for assessing the correlation of a test sample with the risk of formation of cancer;
Preferably, the cancer comprises: liver cancer, lung cancer, intestinal cancer, pancreatic cancer, esophageal cancer and ovarian cancer.
5. Use of a reagent for detecting a biomarker combination according to claim 1, in the manufacture of a kit for diagnosing risk of cancer formation;
Preferably, the cancer comprises: liver cancer, lung cancer, intestinal cancer, pancreatic cancer, esophageal cancer and ovarian cancer.
6. A method of detecting the methylation level of a test sample, comprising:
obtaining methylation level data obtained by detecting a biomarker panel in a sample to be tested, wherein the biomarker panel comprises the biomarker panel of claim 1;
quantifying, based on the methylation level data, the bias caused by confounding variables corresponding to the sample to be tested, to obtain corrected methylation level data;
generating, based on the result of comparing the corrected methylation level data with a preset threshold, indication information characterizing the degree of correlation between the sample to be tested and the risk of cancer formation; preferably, in response to the corrected methylation level data being less than or equal to the preset threshold, first indication information for indicating a degree of cancer formation risk correlation is generated, and in response to the corrected methylation level data being greater than the preset threshold, second indication information for indicating a degree of cancer formation risk correlation is generated; preferably, the preset threshold is the mean of the thresholds corresponding to the optimal Youden index in each fold, obtained by ten-fold cross-validation on the training set samples; preferably, the preset threshold is -0.4 to -1.65; further preferably, the preset threshold is -1.22 to -1.65;
Preferably, the cancer comprises: liver cancer, lung cancer, intestinal cancer, pancreatic cancer, esophageal cancer and ovarian cancer.
7. The method of claim 6, wherein the sample to be tested is selected from any one or more of the following: tissue samples, blood samples, saliva, sputum, pleural effusion, lung lavage, peritoneal effusion, peritoneal lavage, enema, and cerebrospinal fluid.
8. An apparatus for detecting a correlation of a test sample with a risk of cancer formation, comprising:
an acquisition unit configured to acquire methylation level data obtained by detecting a biomarker combination in a sample to be tested, wherein the biomarker combination comprises the biomarker combination according to claim 1;
a correction unit configured to quantify, based on the methylation level data, the bias caused by confounding variables corresponding to the sample to be tested, to obtain corrected methylation level data;
a determining unit configured to generate, based on a comparison of the corrected methylation level data with a preset threshold, indication information characterizing the degree of correlation between the sample to be tested and the risk of cancer formation; preferably, the determining unit is further configured to generate first indication information indicating a degree of correlation with the risk of cancer formation in response to the corrected methylation level data being less than or equal to the preset threshold, and to generate second indication information indicating a degree of correlation with the risk of cancer formation in response to the corrected methylation level data being greater than the preset threshold;
preferably, the preset threshold is the mean of the thresholds corresponding to the optimal Youden index of each fold, obtained through ten-fold cross-validation on the training set samples;
preferably, the cancer comprises: liver cancer, lung cancer, intestinal cancer, pancreatic cancer, esophageal cancer and ovarian cancer.
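The three units of claim 8 can be pictured as a small pipeline, sketched here in Python. This is only an illustration under stated assumptions: the claim does not specify how the confounder bias is quantified, so the sketch assumes a simple linear bias model with hypothetical per-confounder weights, and the class name `Apparatus`, the record keys and the confounder name `age` are all invented for the example.

```python
from dataclasses import dataclass
from typing import Mapping, Tuple


@dataclass
class Apparatus:
    coefficients: Mapping[str, float]  # hypothetical per-confounder bias weights
    threshold: float                   # preset threshold

    def acquire(self, record: Mapping) -> Tuple[float, dict]:
        """Acquisition unit: read the methylation level and confounder values."""
        return float(record["methylation_level"]), dict(record["confounders"])

    def correct(self, level: float, confounders: Mapping[str, float]) -> float:
        """Correction unit: subtract the estimated bias (assumed linear model)."""
        bias = sum(self.coefficients.get(name, 0.0) * value
                   for name, value in confounders.items())
        return level - bias

    def determine(self, corrected: float) -> str:
        """Determining unit: <= threshold gives the first indication, else the second."""
        return "first" if corrected <= self.threshold else "second"

    def run(self, record: Mapping) -> str:
        level, confounders = self.acquire(record)
        return self.determine(self.correct(level, confounders))
```

For example, `Apparatus(coefficients={"age": 0.01}, threshold=-1.22).run(record)` processes one record end to end; the threshold value here is simply picked from the range stated in claim 6, and the weight is arbitrary.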
9. An electronic device, comprising:
one or more processors; and
a storage device having one or more programs stored thereon,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of claim 6 or 7.
10. A computer-readable storage medium having stored thereon a computer program, wherein the computer program, when executed by one or more processors, implements the method of claim 6 or 7.
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2022116582789 | 2022-12-22 |
Publications (1)
Publication Number | Publication Date |
---|---|
CN118240934A (en) | 2024-06-25 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
US20230167507A1 (en) | Cell-free DNA methylation patterns for disease and condition analysis | |
JP6817259B2 (en) | Use of size and number abnormalities in plasma DNA for the detection of cancer | |
Chang et al. | MicroRNA-223 and microRNA-92a in stool and plasma samples act as complementary biomarkers to increase colorectal cancer detection | |
CN113186287B (en) | Biomarker for non-small cell lung cancer typing and application thereof | |
US20230220492A1 (en) | Methods and systems for detecting colorectal cancer via nucleic acid methylation analysis | |
US20240084397A1 (en) | Methods and systems for detecting cancer via nucleic acid methylation analysis | |
US20200109457A1 (en) | Chromosomal assessment to diagnose urogenital malignancy in dogs | |
JP2023524016A (en) | RNA markers and methods for identifying colon cell proliferative disorders | |
CN115820860A (en) | Method for screening non-small cell lung cancer marker based on methylation difference of enhancer, marker and application thereof | |
JP2022501033A (en) | Cell-free DNA hydroxymethylation profile in the assessment of pancreatic lesions | |
Hobbs et al. | Biostatistics and bioinformatics in clinical trials | |
CN118240934A (en) | Methylation signal detection method, device and kit | |
CN113159529A (en) | Risk assessment model and related system for intestinal polyp | |
CN117965725A (en) | Method, device and kit for distinguishing liver cancer from liver non-cancer disease samples | |
KR102161511B1 (en) | Extracting method for biomarker for diagnosis of biliary tract cancer, computing device therefor, biomarker for diagnosis of biliary tract cancer, and biliary tract cancer diagnosis device comprising same | |
Fan et al. | Rapid preliminary purity evaluation of tumor biopsies using deep learning approach | |
WO2024027591A1 (en) | Multi-cancer methylation detection kit and use thereof | |
TWI832443B (en) | Methylation biomarker selection apparatuses and methods | |
Ciniselli | Identification of Circulating Biomarkers for the Early Diagnosis of Colorectal Cancer: Methodological Aspects | |
Sehovic | Analysis of Circulating Biomarkers for Minimally Invasive Early Detection of Breast Cancer | |
CN115667544A (en) | Method for characterizing extrachromosomal DNA | |
CN115472294A (en) | Model for predicting transformation speed of small cell transformation lung adenocarcinoma patient and construction method thereof | |
WO2023183468A2 (en) | TCR/BCR profiling for cell-free nucleic acid detection of cancer | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication |