CN117535404A - Multi-cancer methylation detection kit and application thereof
- Publication number
- CN117535404A (application CN202210914446.XA)
- Authority
- CN
- China
- Prior art keywords
- cancer
- dmr
- tumor
- biomarker combination
- biomarker
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q1/00—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
- C12Q1/68—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
- C12Q1/6876—Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
- C12Q1/6883—Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material
- C12Q1/6886—Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material for cancer
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/20—Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q2600/00—Oligonucleotides characterized by their use
- C12Q2600/154—Methylation markers
Abstract
Provided are a multi-cancer methylation detection kit and applications thereof. Specifically, provided is a biomarker combination for evaluating the association of a sample to be tested with the risk of tumor formation and/or with the tumor tissue of origin, wherein the reference genome version to which the differentially methylated regions (DMRs) refer is hg19. Also provided is the use of a reagent for the biomarker combination in the preparation of a kit for diagnosing the risk of tumor formation and/or assessing the tumor tissue of origin of a sample. The method is suitable for risk assessment and tissue-of-origin tracing for a variety of cancers, and has the advantages of low cost and high accuracy.
Description
Technical Field
The application relates to the biomedical field, in particular to a multi-cancer methylation detection kit and application thereof.
Background
DNA methylation is known to play an important role in the regulation of gene expression. Abnormal DNA methylation signatures have been reported in the course of many diseases, including cancer. DNA methylation sequencing is increasingly recognized as a high-resolution, high-throughput technique that is useful in cancer screening, diagnosis, and monitoring. Whole genome bisulfite sequencing (WGBS) is the gold standard for methylation sequencing, but it is difficult to use clinically because processing severely damages the DNA and sequencing costs are excessive. More importantly, most regions of the human genome are not active during the development of cancer, and cancer-related variations tend to be concentrated in certain specific regions, such as CpG islands, which provides a good opportunity for targeted sequencing.
Nevertheless, discovering and screening cancer-associated differentially methylated regions (DMRs) is challenging, because population heterogeneity, including disease, age and other conditions, causes non-specific changes in methylation profiles. These non-cancerous but abnormal signals therefore need to be handled when building the cancer assessment (DOC) model. Finally, for applications that detect multiple cancer types, establishing a tissue traceability (TOO) model is an important aid for tracing the possible organ of origin of the cancer variation, determining downstream diagnosis and treatment paths, and saving medical costs.
Disclosure of Invention
The application establishes a low-cost, high-precision method that uses DNA or RNA oligonucleotide sequences to capture the methylation variation regions of various cancers and the specific methylation feature regions of various organs, determines whether tumor components (ctDNA) are present in blood cell-free DNA (cfDNA), and evaluates the tissue source of those tumor components.
In one aspect, the present application provides a biomarker combination for assessing the association of a test sample with the risk of tumor formation, wherein the biomarker combination comprises at least any 10 of the differentially methylated regions (DMRs) shown in Table 1A, and wherein the reference genome version to which the DMRs in the table refer is hg19.
In one aspect, the present application provides a biomarker combination for assessing the association of a test sample with a tumor tissue of origin, wherein the biomarker combination comprises at least any 10 of the differentially methylated regions (DMRs) shown in Table 1B, and wherein the reference genome version to which the DMRs in the table refer is hg19.
In one aspect, the present application provides a biomarker combination for assessing the association of a test sample with the risk of tumor formation and/or the tumor tissue of origin, wherein the biomarker combination comprises at least any 10 of the differentially methylated regions (DMRs) shown in Table 1C, and wherein the reference genome version to which the DMRs in the table refer is hg19.
In one aspect, the present application provides a kit comprising a biomarker combination as described herein, and optionally comprising a second generation high throughput sequencing reagent.
In one aspect, the present application provides the use of a reagent for detecting a biomarker combination described herein in the preparation of a kit for diagnosing risk of neoplasia and/or tumour tissue origin.
In one aspect, the present application provides a method of assessing the association of a test sample with the risk of tumor formation and/or the tumor tissue of origin, the method comprising: detecting, in the test sample, the methylation level of a biomarker combination as described herein.
In one aspect, the present application provides a storage medium that records a program that can run the methods described herein.
In one aspect, the present application provides an apparatus comprising a storage medium as described herein, and optionally comprising a processor coupled to the storage medium, the processor being configured to execute the program stored in the storage medium to implement the methods described herein.
The biomarker combination, the kit, the method, the equipment, the storage medium and the application can be suitable for risk assessment and tissue tracing of various cancers, and have the advantages of low cost and high accuracy.
Other aspects and advantages of the present application will become readily apparent to those skilled in the art from the following detailed description. Only exemplary embodiments of the present application are shown and described in the following detailed description. As those skilled in the art will recognize, the present disclosure enables one skilled in the art to make modifications to the disclosed embodiments without departing from the spirit and scope of the invention as described herein. Accordingly, the drawings and descriptions herein are to be regarded as illustrative in nature and not as restrictive.
Drawings
The specific features of the invention related to this application are set forth in the appended claims. The features and advantages of the invention that are related to the present application will be better understood by reference to the exemplary embodiments and the drawings that are described in detail below. The drawings are briefly described as follows:
FIG. 1 shows an exemplary case (a theoretical exemplary presentation, not intended to represent an actual sequencing case).
FIG. 2 shows another exemplary case (a theoretical exemplary presentation, not intended to represent an actual sequencing case).
Figures 3A-3C show another exemplary case (a theoretical exemplary presentation, not intended to represent an actual sequencing case).
FIG. 4 shows that in 5-fold cross-validation, a 98% (95% CI: 96-99%) tissue traceability accuracy can be achieved.
FIG. 5 shows how the weighting configuration of the Salmon-DOC model of the present application controls for confounding-related features.
FIG. 6 shows that the Salmon-DOC model of the present application can efficiently detect six cancer types at different stages in a tumor group.
FIG. 7 shows that the Salmon-DOC model of the present application overcomes the weakness of earlier methylation methods, whose false positives in healthy groups increase with age, and stays balanced across all age groups (horizontal axis: age; vertical axis: model cancer probability score).
Figures 8A-8D show that the traceability accuracy of the Salmon-TOO two-layer model of the present application is superior to that of the single-layer model in both cross-validation and independent validation.
FIG. 9 shows the tissue traceability evaluation results obtained with the 103 TOO-related DMRs.
Detailed Description
Further advantages and effects of the invention will become apparent to those skilled in the art from the following description of specific embodiments.
Definition of terms
In the present application, the term "differential methylation region" (DMR) generally refers to a region of DNA comprising one or more differential methylation sites. For example, a DMR that includes a greater number or frequency of methylation sites under selected conditions of interest, such as a cancer state, may be referred to as a hypermethylated DMR. For example, a DMR that includes a lesser number or frequency of methylation sites under selected conditions of interest, such as a cancer state, may be referred to as a hypomethylated DMR.
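The hypermethylated and hypomethylated labels defined above can be illustrated with a minimal Python sketch. It assumes methylation is expressed as a fraction between 0 and 1 per sample; the function name `classify_dmr` and the `min_delta` threshold of 0.2 are hypothetical choices for illustration and are not taken from the application.

```python
from statistics import mean


def classify_dmr(cancer_levels, control_levels, min_delta=0.2):
    """Label a region by comparing mean methylation fractions (0..1).

    min_delta is an illustrative cutoff, not a value from the application.
    """
    delta = mean(cancer_levels) - mean(control_levels)
    if delta >= min_delta:
        return "hypermethylated"   # more methylation in the condition of interest
    if delta <= -min_delta:
        return "hypomethylated"    # less methylation in the condition of interest
    return "not differential"


# Example with made-up values for one region
print(classify_dmr([0.8, 0.9], [0.1, 0.2]))
```

In practice a statistical test across many samples would normally replace the fixed cutoff; the sketch only shows the direction of the comparison.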
In this application, the terms "second generation gene sequencing (NGS)", "high-throughput sequencing" and "next generation sequencing" generally refer to second generation high-throughput sequencing techniques and to the higher-throughput sequencing methods developed thereafter. Next generation sequencing platforms include, but are not limited to, existing platforms such as Illumina. With the continued development of sequencing technology, one skilled in the art will appreciate that other sequencing methods and devices may also be employed for the present method. For example, second generation gene sequencing may have the advantages of high sensitivity, large throughput, high sequencing depth, or low cost. By development history, influence, sequencing principle and technology, the main methods are: massively parallel signature sequencing (MPSS), polony sequencing, 454 pyrosequencing, Illumina (Solexa) sequencing, ion semiconductor sequencing, DNA nanoball sequencing, the DNA nanoarray and combinatorial probe-anchor ligation sequencing methods of Complete Genomics, and the like. Second generation gene sequencing enables detailed and comprehensive analysis of the transcriptome and genome of a species, and is therefore also referred to as deep sequencing. For example, the methods of the present application can equally be applied with first generation gene sequencing, second generation gene sequencing, third generation gene sequencing, or single molecule sequencing (SMS).
In this application, the term "sample to be tested" generally refers to a sample that is to be tested. For example, the presence or absence of a modification in one or more gene regions on a test sample can be detected.
The terms "polynucleotide", "nucleotide", "nucleic acid" and "oligonucleotide" are used interchangeably herein. They denote polymeric forms of nucleotides (deoxyribonucleotides or ribonucleotides) of any length, or analogues thereof. Polynucleotides may have any three-dimensional structure and may perform any function, whether known or unknown. The following are non-limiting examples of polynucleotides: coding or non-coding regions of a gene or gene fragment, loci defined by linkage analysis, exons, introns, messenger RNAs (mRNA), transfer RNAs (tRNA), ribosomal RNAs (rRNA), short interfering RNAs (siRNA), short-hairpin RNAs (shRNA), microRNAs (miRNA), ribozymes, cDNAs, recombinant polynucleotides, branched polynucleotides, plasmids, vectors, isolated DNA of any sequence, isolated RNA of any sequence, nucleic acid probes, primers and adaptors. Polynucleotides may include one or more modified nucleotides, such as methylated nucleotides and nucleotide analogs.
In the present application, the term "methylation" generally refers to the methylation state of a gene fragment, a nucleotide or a base thereof. For example, the DNA fragment in which a gene of the present application is located may be methylated on one or more strands. For example, the DNA fragment in which a gene of the present application is located may be methylated at one or more sites or DMRs.
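As an illustration of how a methylation state can be turned into a number, the sketch below uses a common convention: the fraction of sequencing reads that support methylation at a CpG site, averaged over the covered sites of a region. The application text here does not state how methylation levels are quantified, so the convention, the function names and the read counts are all assumptions.

```python
def methylation_fraction(methylated_reads, unmethylated_reads):
    """Return the fraction of reads supporting methylation, or None without coverage."""
    total = methylated_reads + unmethylated_reads
    if total == 0:
        return None
    return methylated_reads / total


def region_methylation(site_counts):
    """Average the per-site fractions over the covered CpG sites of one region.

    site_counts is a list of (methylated_reads, unmethylated_reads) pairs.
    """
    fractions = [methylation_fraction(m, u) for m, u in site_counts]
    covered = [f for f in fractions if f is not None]
    if not covered:
        return None
    return sum(covered) / len(covered)


# Made-up counts for three CpG sites; the last site has no coverage
print(region_methylation([(3, 1), (1, 3), (0, 0)]))
```

Sites without coverage are skipped rather than counted as zero, so that missing data does not look like an absence of methylation.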
In this application, the term "human reference genome" generally refers to a human genome that can serve as a reference in gene sequencing. Information on the human reference genome is available from UCSC. The human reference genome may have different versions, for example, hg19, GRCh37 or Ensembl 75.
In this application, the term "machine learning model" generally refers to a collection of system or program instructions and/or data configured to implement an algorithm, process, or mathematical model. In this application, the algorithm, process, or mathematical model may evaluate and provide a desired output based on a given input. In this application, the parameters of the machine learning model may not be explicitly programmed, and in a conventional sense, the machine learning model may not be explicitly designed to follow specific rules in order to provide the desired output for a given input. For example, the use of the machine learning model may mean that the machine learning model and/or the data structure/set of rules as the machine learning model are trained by a machine learning algorithm.
In this application, the term "comprising" is generally intended to include the features specifically recited, but does not exclude other elements.
In this application, the term "about" generally means ranging from 0.5% to 10% above or below the specified value, e.g., ranging from 0.5%, 1%, 1.5%, 2%, 2.5%, 3%, 3.5%, 4%, 4.5%, 5%, 5.5%, 6%, 6.5%, 7%, 7.5%, 8%, 8.5%, 9%, 9.5%, or 10% above or below the specified value.
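The definition of "about" amounts to simple arithmetic, sketched below in Python. The function name `about_bounds` is hypothetical; the only values taken from the text are the 0.5% and 10% limits on the tolerance.

```python
def about_bounds(value, tolerance_pct):
    """Return the (lower, upper) range covered by "about value".

    tolerance_pct must lie between 0.5 and 10, the limits given in the definition.
    """
    if not 0.5 <= tolerance_pct <= 10:
        raise ValueError("tolerance must be between 0.5% and 10%")
    delta = abs(value) * tolerance_pct / 100
    return value - delta, value + delta


# "About 100" with a 10% tolerance
print(about_bounds(100, 10))
```

For a count such as the number of DMRs in a panel, the resulting bounds would still need rounding to whole numbers.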
To detect six cancer types with high incidence and high mortality (lung cancer, intestinal cancer, liver cancer, ovarian cancer, pancreatic cancer and esophageal cancer), public database (TCGA) data were combined with in-house data mining, and a novel algorithm was used to compare methylation variation and genomic position simultaneously. In total, 2536 regions of variation highly related to cancer (differentially methylated regions, DMRs) were screened out.
In one aspect, the present application provides a biomarker combination for assessing the correlation of a test sample with the risk of tumor formation, the biomarker combination comprising any at least 10 of the differentially methylated regions (DMRs) shown in Table 1A, wherein the reference genome version to which the DMRs in the table refer is hg19.
For example, the biomarker combination comprises 94 DMRs in table 1A. For example, the biomarker combinations comprise about 94 DMR, any at least about 90 DMR, any at least about 80 DMR, any at least about 70 DMR, any at least about 60 DMR, any at least about 50 DMR, any at least about 40 DMR, any at least about 30 DMR, any at least about 20 DMR, or any at least about 10 DMR in table 1A.
In one aspect, the present application provides a biomarker combination for assessing the correlation of a test sample with a tumor tissue source, the biomarker combination comprising any at least 10 of the differentially methylated regions (DMRs) shown in Table 1B, wherein the reference genome version to which the DMRs in the table refer is hg19.
For example, the biomarker combination comprises 103 DMR in table 1B. For example, the biomarker combinations comprise about 103 DMR, any at least about 100 DMR, any at least about 90 DMR, any at least about 80 DMR, any at least about 70 DMR, any at least about 60 DMR, any at least about 50 DMR, any at least about 40 DMR, any at least about 30 DMR, any at least about 20 DMR, or any at least about 10 DMR in table 1B.
In one aspect, the present application provides a biomarker combination for assessing the correlation of a test sample with the risk of tumor formation and/or a tumor tissue source, the biomarker combination comprising any at least 10 of the differentially methylated regions (DMRs) shown in Table 1C, wherein the reference genome version to which the DMRs in the table refer is hg19.
For example, the biomarker combination comprises 222 DMRs in Table 1E. For example, the biomarker combination comprises about 222 DMRs, any at least about 220 DMRs, any at least about 210 DMRs, any at least about 200 DMRs, any at least about 150 DMRs, any at least about 100 DMRs, any at least about 90 DMRs, any at least about 80 DMRs, any at least about 70 DMRs, any at least about 60 DMRs, any at least about 50 DMRs, any at least about 40 DMRs, any at least about 30 DMRs, any at least about 20 DMRs, or any at least about 10 DMRs in Table 1E.
For example, the biomarker combination comprises 488 DMRs in table 1D. For example, the biomarker combination comprises about 488 DMRs, any at least about 480 DMRs, any at least about 450 DMRs, any at least about 400 DMRs, any at least about 300 DMRs, any at least about 200 DMRs, any at least about 150 DMRs, any at least about 100 DMRs, any at least about 90 DMRs, any at least about 80 DMRs, any at least about 70 DMRs, any at least about 60 DMRs, any at least about 50 DMRs, any at least about 40 DMRs, any at least about 30 DMRs, any at least about 20 DMRs, or any at least about 10 DMRs in table 1D.
For example, the biomarker combination comprises 860 DMRs in Table 1C. For example, the biomarker combination comprises about 860 DMRs, any at least about 850 DMRs, any at least about 800 DMRs, any at least about 700 DMRs, any at least about 600 DMRs, any at least about 500 DMRs, any at least about 400 DMRs, any at least about 300 DMRs, any at least about 200 DMRs, any at least about 150 DMRs, any at least about 100 DMRs, any at least about 90 DMRs, any at least about 80 DMRs, any at least about 70 DMRs, any at least about 60 DMRs, any at least about 50 DMRs, any at least about 40 DMRs, any at least about 30 DMRs, any at least about 20 DMRs, or any at least about 10 DMRs in Table 1C.
For example, the tumor is from a homogeneous tumor (homogenous tumors), a heterogeneous tumor, a hematological cancer, and/or a solid tumor. For example, the tumor is from one or more of the following groups of cancers: brain cancer, lung cancer, skin cancer, nasopharyngeal cancer, throat cancer, liver cancer, bone cancer, lymphoma, pancreatic cancer, skin cancer, intestinal cancer, rectal cancer, thyroid cancer, bladder cancer, kidney cancer, oral cancer, stomach cancer, solid tumors, ovarian cancer, esophageal cancer, gall bladder cancer, biliary tract cancer, breast cancer, cervical cancer, uterine cancer, prostate cancer, head and neck cancer, sarcoma, malignant tumors of the thorax other than the lung, melanoma, and testicular cancer. For example, the tumor comprises lung cancer, bowel cancer, liver cancer, ovarian cancer, pancreatic cancer, and/or esophageal cancer.
In one aspect, the present application provides a kit comprising a biomarker combination as described herein, and optionally comprising a second generation high throughput sequencing reagent. For example, the kit can be used to assess the correlation of a test sample with the risk of neoplasia and/or the origin of the tumor tissue.
In one aspect, the present application provides the use of a reagent for detecting a biomarker combination described herein in the preparation of a kit for diagnosing risk of neoplasia and/or tumour tissue origin. For example, the tumor is from a homogeneous tumor (homogenous tumors), a heterogeneous tumor, a hematological cancer, and/or a solid tumor. For example, the tumor is from one or more of the following groups of cancers: brain cancer, lung cancer, skin cancer, nasopharyngeal cancer, throat cancer, liver cancer, bone cancer, lymphoma, pancreatic cancer, skin cancer, intestinal cancer, rectal cancer, thyroid cancer, bladder cancer, kidney cancer, oral cancer, stomach cancer, solid tumors, ovarian cancer, esophageal cancer, gall bladder cancer, biliary tract cancer, breast cancer, cervical cancer, uterine cancer, prostate cancer, head and neck cancer, sarcoma, malignant tumors of the thorax other than the lung, melanoma, and testicular cancer. For example, the tumor comprises lung cancer, bowel cancer, liver cancer, ovarian cancer, pancreatic cancer, and/or esophageal cancer.
In one aspect, the present application provides an assessment method for assessing the correlation of a test sample with the risk of tumor formation and/or the tumor tissue source, the method comprising: detecting, in the test sample, the methylation level of a biomarker combination as described herein.
For example, the sample is selected from the group consisting of: tissue samples, blood samples, saliva, sputum, pleural effusion, lung lavage, peritoneal effusion, peritoneal lavage, enema, and cerebrospinal fluid.
In one aspect, the present application provides a storage medium that records a program that can run the methods described herein. For example, the non-volatile computer-readable storage medium may include a floppy disk, a flexible disk, a hard disk, solid-state storage (SSS) (e.g., a solid-state drive (SSD), a solid-state card (SSC), or a solid-state module (SSM)), an enterprise-level flash drive, a tape, or any other non-transitory magnetic medium, etc. The non-volatile computer-readable storage medium may also include punch cards, paper tape, optical mark sheets (or any other physical medium having a hole pattern or other optically recognizable indicia), compact disc read-only memory (CD-ROM), rewritable compact discs (CD-RW), digital versatile discs (DVD), Blu-ray discs (BD), and/or any other non-transitory optical medium.
In one aspect, the present application provides an apparatus comprising a storage medium as described herein, and optionally comprising a processor coupled to the storage medium, the processor configured to execute based on a program stored in the storage medium to implement the methods described herein.
Without intending to be limited by any theory, the following examples are presented merely to illustrate the kits, methods, uses, etc. of the present application and are not intended to limit the scope of the invention of the present application.
Examples
Example 1
In an exemplary procedure, samples are bisulfite-treated and subjected to second-generation sequencing; the resulting sequencing data contain the methylation level and sequencing coverage depth of each CpG methylation site. Optionally, noise removal is performed using the genomic methylation signal at CpG sites and the noise-region CHH/CHG sites. Then, for the "tumor" (C) and "normal" (N) groups, a weighted logistic regression is fitted to obtain a p-value. The explanatory variable of the logistic regression is continuous, namely the methylation level at each CpG site, and the response variable is binary, namely (0, 1), corresponding to C and N. The weighted logistic regression examines the difference between C and N at each CpG site, under the null hypothesis that the difference between C and N at that CpG site is not statistically significant.
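The per-site test described above can be sketched as follows. This is a minimal illustration rather than the procedure of the application: the text does not state what the weights are, so the sketch assumes they are relative coverage depths, and the function name `weighted_logit_pvalue`, the Newton-Raphson fit and the Wald test are all illustrative choices. The data are invented.

```python
import math
import numpy as np

def weighted_logit_pvalue(meth, is_tumor, depth, n_iter=50):
    """Two-sided Wald p-value for the methylation slope at one CpG site."""
    x = np.column_stack([np.ones(len(meth)), np.asarray(meth, dtype=float)])
    y = np.asarray(is_tumor, dtype=float)
    w = np.asarray(depth, dtype=float)
    w = w / w.mean()  # assumption: weights are relative coverage depths
    beta = np.zeros(2)

    def fitted_and_hessian(b):
        p = 1.0 / (1.0 + np.exp(-(x @ b)))
        return p, (x * (w * p * (1.0 - p))[:, None]).T @ x

    for _ in range(n_iter):  # Newton-Raphson on the weighted log-likelihood
        p, hess = fitted_and_hessian(beta)
        step = np.linalg.solve(hess, x.T @ (w * (y - p)))
        beta = beta + step
        if np.max(np.abs(step)) < 1e-10:
            break
    _, hess = fitted_and_hessian(beta)
    se = math.sqrt(np.linalg.inv(hess)[1, 1])
    return math.erfc(abs(beta[1] / se) / math.sqrt(2.0))

# Invented data: five tumor (1) and five normal (0) samples at one CpG site.
labels = [1] * 5 + [0] * 5
depth = [500, 400, 300, 200, 100, 500, 400, 300, 200, 100]
meth_diff = [0.6, 0.7, 0.8, 0.5, 0.3, 0.2, 0.3, 0.1, 0.4, 0.6]
meth_null = [0.3, 0.5, 0.4, 0.6, 0.2, 0.3, 0.5, 0.4, 0.6, 0.2]

p_diff = weighted_logit_pvalue(meth_diff, labels, depth)
p_null = weighted_logit_pvalue(meth_null, labels, depth)
print(p_diff, p_null)
```

With non-integer weights the Wald standard error is only approximate, and sites where the two groups are perfectly separated would need regularization or a different test.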
DMR partitioning
How individual DMR regions are divided is determined from the methylation level and sequencing coverage depth of each CpG methylation site. Specifically, the methylation level and sequencing coverage depth of the CpG sites enter the calculation according to the following formula:
where $d_{ij}$ is the effective coverage depth of the j-th site of the i-th sample in group C, and $M_{ij}$ is the methylation level at the j-th site of the i-th sample in group C. The formula evaluates the similarity in methylation level between spatially consecutive sites of the genome. The deeper the coverage, the greater the value of parameter P, and the closer the methylation levels of adjacent CpG sites within the same group are taken to be.
FIG. 1 shows an exemplary case (a theoretical exemplary presentation, not intended to represent an actual sequencing case).
For the first CpG site in the region, samples A and B each obtained coverage of 500 effective sequences, and sample C obtained coverage of 200 effective sequences. For sample A, the methylation level of this CpG site was 0.2, and the methylation level at the second CpG site was 0. From the three samples, the coverage depth parameter value P for the first CpG site of the group was calculated to be 0.617. In this case, $\beta_{ij} = |0.2 - 0| \cdot e^{(1 - 0.617)} = 0.29$. One requirement for dividing two adjacent sites into the same DMR is that this value be less than 0.25; since 0.29 exceeds the threshold, the first and second CpG sites in this example will not be divided into the same DMR.
FIG. 2 shows another exemplary case (a theoretical exemplary presentation, not intended to represent an actual sequencing case).
Suppose the samples are replaced with A, B and D, where sample D obtained coverage of 400 effective sequences at the first CpG site. As before, for sample A the methylation level of the first CpG site is 0.2 and the methylation level at the second CpG site is 0. However, because of the increased sequencing coverage of sample D in this example, the coverage depth parameter value P calculated from the three samples for the first CpG site of the group is 0.962. In this case, $\beta_{ij} = |0.2 - 0| \cdot e^{(1 - 0.962)} = 0.21$. Since this value is smaller than the threshold of 0.25 for division into the same DMR, the first and second CpG sites in this example satisfy, according to sample A, the precondition for being divided into the same DMR.
Therefore, by introducing the coverage depth of the CpG sites, the method can markedly improve the accuracy of DMR region division.
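The two worked cases can be reproduced with a few lines of Python. The sketch uses only the relation shown in the examples, $\beta_{ij} = |M_{ij} - M_{i,j+1}| \cdot e^{(1-P)}$, and the 0.25 threshold; the function names are illustrative, and P is taken as given because its formula is not reproduced in the text.

```python
import math

def adjacent_site_score(m_site, m_next, p):
    """Depth-adjusted methylation difference between two adjacent CpG sites."""
    return abs(m_site - m_next) * math.exp(1.0 - p)

def may_share_dmr(score, threshold=0.25):
    """One of the requirements for placing two adjacent sites in the same DMR."""
    return score < threshold

score_abc = adjacent_site_score(0.2, 0.0, 0.617)  # samples A, B, C (FIG. 1)
score_abd = adjacent_site_score(0.2, 0.0, 0.962)  # samples A, B, D (FIG. 2)

print(round(score_abc, 2), may_share_dmr(score_abc))  # 0.29 False
print(round(score_abd, 2), may_share_dmr(score_abd))  # 0.21 True
```

The same methylation difference of 0.2 falls on opposite sides of the threshold depending only on P, which is the point of the two figures.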
Further optionally, $B_{ij}$ within a region is calculated as follows:
Figures 3A-3C show another exemplary scenario (a theoretical exemplary presentation, not intended to represent an actual sequencing scenario). When the DMR region contains 10 CpG sites, the $B_{ij}$ values of all samples are combined, and the score of each DMR is calculated by averaging.
The calculation steps for the B values in the DMR region shown in group A are given in the following table:
The B value score is 0.1.
Similarly, the B value score within the DMR shown in group B is 0.7, and the B value score within the DMR shown in group C is 1.233.
The DMR regions screened by this method contain not only the cancer-related variation information of multiple cancer types but also tissue-specific features, and are segmented more cleanly at region boundaries.
FIG. 4 shows that in 5-fold cross-validation, a 98% (95% CI: 96-99%) tissue traceability accuracy can be achieved.
Example 2
Cancer assessment (DOC) model establishment
The ctDNA content in blood varies greatly among cancers and among stages of development, and is susceptible to experimental batch effects. Furthermore, methylation changes are associated with age, disease, race and other factors which, if left uncorrected, may act as confounding variables and affect the accuracy of the classification model. The method first quantifies the bias introduced by the confounding variables (the quantification may use, but is not limited to, the Hilbert-Schmidt independence criterion) and then embeds it in the regularization term of the model as a correction, increasing the accuracy and generalization ability of the model.
Algorithm establishment
Assume m samples with feature vectors $X = (x_1, \dots, x_m)$, classification labels $Y = (y_1, \dots, y_m)$ and confounding variables $Z = (z_1, \dots, z_m)$, where $x_i$ is an n-dimensional vector representing the methylation signature of sample i, $y_i \in \{-1, +1\}$ is the classification label of $x_i$, and $z_i$ is a confounding variable for sample i.
Here $L_H$ is the Hilbert-Schmidt independence criterion (HSIC), which measures the degree of independence between the variables X and Z; $h(x)$ and $h(z)$ are the kernel functions of X and Z; $P_{h(x)h(z)}$ denotes the probability distribution of $h(x)$ and $h(z)$; F and G denote the reproducing kernel Hilbert spaces (RKHS) of X and Z respectively, which can be understood as the domains onto which X and Z are mapped after nonlinear processing; $C_{h(x)h(z)}$ is the correlation coefficient of the two kernel functions; and HS refers to the Hilbert space.
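For orientation, a commonly used empirical estimate of the Hilbert-Schmidt independence criterion is $\mathrm{tr}(KHLH)/(m-1)^2$, where K and L are kernel matrices of X and Z and H is the centering matrix. The sketch below assumes that estimator together with Gaussian kernels of bandwidth `sigma`; the application does not specify which estimator or kernels it uses, and the data are simulated.

```python
import numpy as np

def gaussian_kernel(v, sigma=1.0):
    """Gaussian kernel matrix for m samples given as rows (or as a 1-D array)."""
    v = np.asarray(v, dtype=float).reshape(len(v), -1)
    sq_norm = np.sum(v ** 2, axis=1)
    sq_dist = sq_norm[:, None] + sq_norm[None, :] - 2.0 * (v @ v.T)
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * sigma ** 2))

def hsic(x, z, sigma=1.0):
    """Biased empirical HSIC: tr(K H L H) / (m - 1)^2."""
    m = len(x)
    k = gaussian_kernel(x, sigma)
    l = gaussian_kernel(z, sigma)
    h = np.eye(m) - np.ones((m, m)) / m  # centering matrix
    return float(np.trace(k @ h @ l @ h)) / (m - 1) ** 2

# Simulated data: one confounder that tracks a feature, one that does not.
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 3))                 # stand-in for methylation features
z_dep = x[:, 0] + 0.1 * rng.normal(size=200)  # confounder tied to feature 0
z_ind = rng.normal(size=200)                  # unrelated confounder

hsic_dependent = hsic(x, z_dep)
hsic_independent = hsic(x, z_ind)
print(hsic_dependent, hsic_independent)
```

A larger value indicates stronger dependence between features and confounder, which is the quantity a regularization term would penalize.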
A support vector machine (SVM) is used as the main classifier:
$f(x; w, b) = \operatorname{sgn}(w^{T}x + b)$
$\operatorname{sgn}(a) = 1$ if $a \ge 0$, and $-1$ if $a < 0$
The separating hyperplane is determined by solving the following objective equation,
For non-separable data, a soft-margin support vector machine (soft-margin SVM) introduces a penalty for training errors
where C controls the balance between minimizing training errors and maximizing the classification margin, and $\zeta_i$ is the degree to which sample $x_i$ violates the constraint.
Salmon adds a regularization term to the SVM objective equation to control for confounding factors; the parameter λ controls the balance between the confounding-factor error and the maximized margin width during training. The objective equation is
Here C and λ control the balance among minimizing training errors, minimizing the correlation of the confounding variables with the explanatory variables, and maximizing the classification margin.
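The objective equations themselves are not reproduced in this text. For orientation only, the standard textbook hard-margin and soft-margin SVM objectives are given below in the notation used above; they are assumed, not confirmed, to match the forms used in the application. The third line is a schematic placeholder: $\Omega$ is a hypothetical symbol for a confounder penalty built on $L_H$, whose exact form is not stated here.

```latex
% Hard-margin SVM (textbook form)
\min_{w,\,b}\ \tfrac{1}{2}\lVert w\rVert^{2}
\quad \text{s.t.}\quad y_i\,(w^{T}x_i + b) \ge 1,\qquad i = 1,\dots,m

% Soft-margin SVM (textbook form)
\min_{w,\,b,\,\zeta}\ \tfrac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{m}\zeta_i
\quad \text{s.t.}\quad y_i\,(w^{T}x_i + b) \ge 1 - \zeta_i,\quad \zeta_i \ge 0

% Schematic confounder-regularized objective (Omega is a hypothetical placeholder)
\min_{w,\,b,\,\zeta}\ \tfrac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{m}\zeta_i
 + \lambda\,\Omega(w; X, Z)
\quad \text{s.t.}\quad y_i\,(w^{T}x_i + b) \ge 1 - \zeta_i,\quad \zeta_i \ge 0
```

In this reading, a larger C penalizes training errors more heavily, and a larger λ pushes the classifier away from features that depend on the confounding variables.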
FIG. 5 shows the control results of the weighting configuration of the Salmon-DOC model of the present application for confounding relevant features.
Each data point represents a blood sample used for Salmon-DOC model construction. The horizontal axis is the confounding factor of the corresponding sample, and the vertical axis is the original, uncorrected variable coefficient (panel A) or the corrected variable coefficient (panel B). Comparing the results before and after correction shows that the weights of confounding-related features are controlled in Salmon-DOC.
Retrospective cohort data
The application uses retrospective clinical samples of six cancer types, divided into a training set and a validation set, to evaluate the accuracy of the Salmon binary classifier (cancer vs. non-cancer).
FIG. 6 shows that the Salmon-DOC model of the present application can efficiently detect 6 cancer species in different stages in a tumor group model.
FIG. 7 shows that the Salmon-DOC model of the present application overcomes a weakness of earlier methylation-based methods, whose false positives in healthy groups vary with age, and remains balanced across age groups (horizontal axis: age; vertical axis: cancer probability score of the model).
Example 3
Tissue Traceability (TOO) model building
First layer TOO model construction
The TOO model is essentially a multi-class classification problem. The probability calculation for each class can be reduced to voting over the results of the pairwise binary classifiers and then choosing the class with the most votes. However, for possible clinical applications of the tissue traceability model, producing a single classification result is not enough; only when classification probabilities are produced does it become possible to stack (ensemble) models.
The first step in the Salmon-TOO model of the present application is therefore to quantify the outcome of the classification voting. This quantification can be expressed through probability calculations. Given a data point x and a label y, assume that the pairwise class probabilities $\mu_{ij}$ exist. A model can be obtained from the i-th and j-th classes in the training set, and for any new data point x the calculated $r_{ij}$ can be used as an approximate estimate of $\mu_{ij}$. The problem then reduces to using all $r_{ij}$ to estimate the probability of the i-th class:
$p_i = P(y = i \mid x),\quad i = 1, \dots, k$
Define $r_{ij}$ as the estimate of $\mu_{ij}$, with $\mu_{ij} + \mu_{ji} = 1$. A "voting" scheme is used for the multi-class problem, where
$\mu_{ij} \equiv P(y = i \mid y = i \text{ or } j,\ x)$
Define I as the indicator function: $I_{\{x\}} = 1$ if x is true, and 0 otherwise. The probability calculation can be written as
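As a hedged illustration of this voting scheme, the sketch below assumes the common normalization $p_i = \frac{2}{k(k-1)} \sum_{j \ne i} I_{\{r_{ij} > r_{ji}\}}$, in which each class's share of pairwise wins is used as its probability. The function name and the example matrix are illustrative, and the exact expression used in the application may differ.

```python
import numpy as np

def vote_probabilities(r):
    """Class probabilities from pairwise estimates r[i][j] ~ P(y=i | y=i or j, x).

    Each pair (i, j) casts one vote for the class with the larger estimate;
    votes are normalized by the number of pairs, k(k-1)/2.
    """
    r = np.asarray(r, dtype=float)
    k = r.shape[0]
    wins = r > r.T            # wins[i, j] is True when class i beats class j
    votes = wins.sum(axis=1)  # the diagonal never counts because r[i, i] > r[i, i] is False
    return 2.0 * votes / (k * (k - 1))

# Illustrative pairwise estimates for k = 3 classes, with r[i][j] + r[j][i] = 1.
r = [[0.0, 0.9, 0.6],
     [0.1, 0.0, 0.3],
     [0.4, 0.7, 0.0]]

p = vote_probabilities(r)
print(p, int(np.argmax(p)))
```

Exact ties ($r_{ij} = r_{ji} = 0.5$) award no vote to either class in this sketch, so the probabilities would then sum to less than 1; a production implementation would need a tie rule.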
Second layer TOO model construction
The second layer of the Salmon-TOO model is an MLR fit over the different classes.
Assuming that probabilities must be calculated for a number of tissue sources, quantized pairwise classification values can be obtained from the first layer, with value range $(-\infty, +\infty)$. Because the actual distributions of the individual pairwise classification values are inconsistent, the quantized values can further be used as explanatory variables of a logistic regression, whose response variable takes multiple outputs corresponding to the known tissue sources during modeling.
As shown in the above table, each column represents a feature variable of the logistic regression, i.e., the binary classification probability for a pair of tissue classes; each row represents a response variable $y_1$, i.e., a tissue class.
For the feature variables, i.e., the pairwise binary classification probabilities, assume that there are J discrete response categories in total; the evaluation result is converted into $Y_{i1}, \dots, Y_{iJ}$, and $\beta_j$ denotes the feature weights for each response category.
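A minimal sketch of how such a second layer could turn pairwise scores into class probabilities is shown below. It assumes that MLR denotes multinomial logistic regression with a softmax link; the weight matrix is invented for illustration and is not fitted to any data from the application.

```python
import numpy as np

def mlr_probabilities(features, weights, bias):
    """Softmax class probabilities from first-layer pairwise scores."""
    scores = weights @ features + bias
    scores = scores - scores.max()  # shift for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

# Hypothetical pairwise scores for J = 3 tissue classes: (0 vs 1, 0 vs 2, 1 vs 2).
features = np.array([0.9, 0.6, 0.3])
# Hypothetical weights (one row per class) and zero biases.
weights = np.array([[ 2.0,  2.0,  0.0],
                    [-2.0,  0.0,  2.0],
                    [ 0.0, -2.0, -2.0]])
bias = np.zeros(3)

probs = mlr_probabilities(features, weights, bias)
print(probs, int(np.argmax(probs)))
```

In practice the weights would be estimated from the training set with the known tissue sources as labels, and the ranked probabilities give both the top and the second-ranked traceability result.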
Since the Salmon-DOC model may judge samples of some cancer types negative and samples of others positive, the tissue classes are weight-corrected during traceability modeling on the basis of quasi-maximum likelihood estimation to account for this judgment. Taking binary logistic regression as an example, this can be expressed as:
Retrospective cohort data
All data of the retrospective cohort were randomized 1:1 into a training set and a validation set. First, cross-validation was performed on the training set to obtain traceability evaluation results; model parameters were optimized continuously during this process and finally locked. The locked model was then used to evaluate the traceability results for all data of the validation set. In the traceability model training set, the total sample size of the six cancers is 300, and the numbers for each stage of each cancer are relatively balanced:
Cancer type | Cases | Stage I | Stage II | Stage III | Stage IV |
---|---|---|---|---|---|
Lung cancer | 36 | 4 | 12 | 5 | 15 |
Intestinal cancer | 62 | 8 | 18 | 18 | 18 |
Liver cancer | 74 | 25 | 14 | 22 | 13 |
Ovarian cancer | 48 | 1 | 4 | 38 | 5 |
Pancreatic cancer | 40 | 3 | 6 | 13 | 18 |
Esophageal cancer | 42 | 5 | 10 | 15 | 12 |
The traceability model validation set comprises a total of 224 samples:
Cancer type | Cases | Stage I | Stage II | Stage III | Stage IV |
---|---|---|---|---|---|
Lung cancer | 31 | 4 | 5 | 12 | 10 |
Intestinal cancer | 52 | 7 | 15 | 13 | 17 |
Liver cancer | 55 | 17 | 11 | 20 | 7 |
Ovarian cancer | 27 | 3 | 4 | 8 | 12 |
Pancreatic cancer | 25 | 4 | 6 | 6 | 9 |
Esophageal cancer | 34 | 4 | 7 | 8 | 15 |
FIG. 8 shows that the traceability accuracy of the Salmon-TOO two-layer model of the present application is superior to that of the single-layer model in both cross-validation and independent validation.
Panels A and B show the traceability evaluation results from cross-validation on the six-cancer training set. Panel A is the output when only the first-layer TOO model is built: the traceability accuracy is 0.87 (260/300), or 0.93 (279/300) if the suboptimal (second-ranked) traceability result is included. Panel B is the output when the second-layer MLR model is added on top of the first-layer TOO model: the traceability accuracy improves to 0.90 (270/300), or 0.95 (284/300) if the suboptimal result is included. Similarly, panels C and D show the traceability evaluation results from independent validation on the six-cancer validation set. Panel C is the output when only the first-layer TOO model is built: the traceability accuracy is 0.77 (173/224), or 0.87 (194/224) if the suboptimal result is included. Panel D is the output when the second-layer MLR model is added on top of the first-layer TOO model: the traceability accuracy improves to 0.84 (187/224), or 0.89 (199/224) if the suboptimal result is included.
In conclusion, the Salmon-TOO two-layer traceability model is more accurate than the single-layer model in both cross-validation on the training set and independent validation.
Example 4
DOC cancer detection model
Table 1A shows 94 DMR regions for DOC cancer detection model
Based on the 94 DOC-related DMR regions, 100 healthy human samples and 318 positive samples of the six cancer types in independent validation set 1 were evaluated, giving an overall sensitivity of 80.5% (256/318) and an overall specificity of 95% (95/100). At a specificity of 90%, the sensitivity by cancer type and stage is as follows:
Repeated tests were then performed, each using 50 regions drawn at random from the 94 DOC regions. The sensitivity for positive samples of the six cancer types in five replicates, at a specificity of 90% (90/100), is shown in the following table:
Example 5
TOO tissue traceability model
Table 1B shows 103 DMR regions for the TOO tissue traceability model
Based on the 103 TOO-related DMR regions, traceability evaluation was performed on 473 positive samples of the six cancer types in independent validation set 2. The top-ranked traceability accuracy was 63.0% (298/473); if the suboptimal (second-ranked) traceability result is included, the accuracy rises to 71.5% (338/473).
Fig. 9 shows the tissue traceability evaluation results obtained with the 103 TOO-related DMR regions.
Four rounds of repeated testing were then performed, each using 50 regions drawn at random from the 103 TOO regions. The traceability accuracy in the four rounds of evaluation is shown in the following table:
Example 6
Simultaneous evaluation of DOC and TOO with DMRs:
Table 1C shows 860 DMR regions for DOC and TOO evaluation models
Table 1D shows 488 DMR regions for DOC and TOO assessment models
Table 1E shows 222 DMR regions for DOC and TOO evaluation models
In independent validation set 3, comprising 473 negative samples and 473 positive samples of the six cancer types, sensitivity and traceability accuracy were calculated at a uniform specificity of 95.1% (450/473) while the number of markers was progressively reduced. The tumor detection and tissue traceability results of the evaluation are shown in the following table:
The foregoing detailed description is provided by way of explanation and example and is not intended to limit the scope of the appended claims. Numerous variations of the presently exemplified embodiments of the present application will be apparent to those of ordinary skill in the art and remain within the scope of the appended claims and equivalents thereof.
Claims (22)
1. A biomarker combination for assessing the correlation of a test sample with the risk of tumor formation, the biomarker combination comprising any at least 10 of the differentially methylated regions (DMRs) shown in Table 1A, wherein the reference genome version to which the DMRs in the table refer is hg19.
2. The biomarker combination according to claim 1, comprising any of at least 50 DMR in table 1A.
3. The biomarker combination according to any of claims 1 to 2, comprising 94 DMR in table 1A.
4. A biomarker combination for assessing the correlation of a test sample with a tumor tissue source, the biomarker combination comprising any at least 10 of the differentially methylated regions (DMRs) shown in Table 1B, wherein the reference genome version to which the DMRs in the table refer is hg19.
5. The biomarker combination according to claim 4, comprising any at least 50 DMRs in Table 1B.
6. The biomarker combination according to any of claims 4 to 5, comprising 103 DMR in table 1B.
7. A biomarker combination for assessing the correlation of a test sample with the risk of tumor formation and/or a tumor tissue source, characterized in that the biomarker combination comprises any at least 10 of the differentially methylated regions (DMRs) shown in Table 1C, wherein the reference genome version to which the DMRs in the table refer is hg19.
8. The biomarker combination of claim 7, comprising at least 50 DMR of any of table 1E, table 1D or table 1C.
9. The biomarker combination according to any of claims 7 to 8, comprising 222 DMR in table 1E.
10. The biomarker combination according to any of claims 7 to 9, comprising 488 DMRs in table 1D.
11. The biomarker combination according to any of claims 7 to 10, comprising 860 DMR in table 1C.
12. The biomarker combination according to any of claims 1 to 11, wherein the tumor is derived from a homogeneous tumor (homogenous tumors), a heterogeneous tumor, a hematological cancer and/or a solid tumor; preferably, the tumor is from one or more of the following groups of cancers: brain cancer, lung cancer, skin cancer, nasopharyngeal cancer, throat cancer, liver cancer, bone cancer, lymphoma, pancreatic cancer, skin cancer, intestinal cancer, rectal cancer, thyroid cancer, bladder cancer, kidney cancer, oral cancer, stomach cancer, solid tumors, ovarian cancer, esophageal cancer, gall bladder cancer, biliary tract cancer, breast cancer, cervical cancer, uterine cancer, prostate cancer, head and neck cancer, sarcoma, malignant tumors of the thorax other than the lung, melanoma, and testicular cancer.
13. The biomarker combination according to any of claims 1 to 12, wherein the tumour comprises lung cancer, bowel cancer, liver cancer, ovarian cancer, pancreatic cancer, and/or oesophageal cancer.
14. A kit comprising the biomarker combination of any of claims 1-13, and optionally comprising a second generation high throughput sequencing reagent.
15. The kit of claim 14 for assessing the correlation of a test sample with the risk of tumor formation and/or the origin of tumor tissue.
16. Use of a reagent for detecting a biomarker combination according to any of claims 1 to 13, in the manufacture of a kit for diagnosing the risk of tumour formation and/or tumour tissue origin.
17. The use of claim 16, wherein the tumor is derived from a homogeneous tumor (homogenous tumors), a heterogeneous tumor, a hematological cancer and/or a solid tumor; preferably, the tumor is from one or more of the following groups of cancers: brain cancer, lung cancer, skin cancer, nasopharyngeal cancer, throat cancer, liver cancer, bone cancer, lymphoma, pancreatic cancer, skin cancer, intestinal cancer, rectal cancer, thyroid cancer, bladder cancer, kidney cancer, oral cancer, stomach cancer, solid tumors, ovarian cancer, esophageal cancer, gall bladder cancer, biliary tract cancer, breast cancer, cervical cancer, uterine cancer, prostate cancer, head and neck cancer, sarcoma, malignant tumors of the thorax other than the lung, melanoma, and testicular cancer.
18. The use of any one of claims 16-17, wherein the tumor comprises lung cancer, bowel cancer, liver cancer, ovarian cancer, pancreatic cancer, and/or esophageal cancer.
19. A method of assessing the correlation of a test sample with the risk of tumour formation and/or tumour tissue origin, the method comprising: detecting the methylation level of a biomarker combination comprising a biomarker combination according to any of claims 1 to 13 in a sample to be tested.
20. The assessment method of claim 19, the sample being selected from the group consisting of: tissue samples, blood samples, saliva, sputum, pleural effusion, lung lavage, peritoneal effusion, peritoneal lavage, enema, and cerebrospinal fluid.
21. A storage medium carrying a program operable to perform the method of any one of claims 19 to 20.
22. An apparatus comprising the storage medium of claim 21, and optionally comprising a processor coupled to the storage medium, the processor configured to execute to implement the method of any of claims 19-20 based on a program stored in the storage medium.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210914446.XA CN117535404A (en) | 2022-08-01 | 2022-08-01 | Multi-cancer methylation detection kit and application thereof |
PCT/CN2023/109837 WO2024027591A1 (en) | 2022-08-01 | 2023-07-28 | Multi-cancer methylation detection kit and use thereof |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210914446.XA CN117535404A (en) | 2022-08-01 | 2022-08-01 | Multi-cancer methylation detection kit and application thereof |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117535404A true CN117535404A (en) | 2024-02-09 |
Family
ID=89784781
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210914446.XA Pending CN117535404A (en) | 2022-08-01 | 2022-08-01 | Multi-cancer methylation detection kit and application thereof |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN117535404A (en) |
WO (1) | WO2024027591A1 (en) |
- 2022-08-01: CN application CN202210914446.XA, patent CN117535404A (en), active, Pending
- 2023-07-28: WO application PCT/CN2023/109837, patent WO2024027591A1 (en), status unknown
Also Published As
Publication number | Publication date |
---|---|
WO2024027591A1 (en) | 2024-02-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20240079092A1 (en) | | Systems and methods for deriving and optimizing classifiers from multiple datasets |
CN115132273B (en) | | Method and system for evaluating tumor formation risk and tumor tissue source |
Bi et al. | | Gene expression patterns combined with network analysis identify hub genes associated with bladder cancer |
CN115335533A (en) | | Cancer classification using genomic region modeling |
US20210310075A1 (en) | | Cancer Classification with Synthetic Training Samples |
CA2959670A1 (en) | | Compositions, methods and kits for diagnosis of a gastroenteropancreatic neuroendocrine neoplasm |
US20230160019A1 (en) | | Rna markers and methods for identifying colon cell proliferative disorders |
CN112951327A (en) | | Drug sensitivity prediction method, electronic device and computer-readable storage medium |
CN113574602A (en) | | Sensitive detection of Copy Number Variation (CNV) from circulating cell-free nucleic acids |
US20190073445A1 (en) | | Identifying false positive variants using a significance model |
CN117413072A (en) | | Methods and systems for detecting cancer by nucleic acid methylation analysis |
JP2022501033A (en) | | Cell-free DNA hydroxymethylation profile in the assessment of pancreatic lesions |
CN116312800A (en) | | Lung cancer characteristic identification method, device and storage medium based on circulating RNA whole transcriptome sequencing in blood plasma |
CN111164701A (en) | | Fixed-point noise model for target sequencing |
CN117535404A (en) | | Multi-cancer methylation detection kit and application thereof |
US20140113829A1 (en) | | Systems and methods of selecting combinatorial coordinately dysregulated biomarker subnetworks |
CN113159529A (en) | | Risk assessment model and related system for intestinal polyp |
Anderson et al. | | Predictive modeling of lung cancer recurrence using alternative splicing events versus differential expression data |
BR102015007391B1 (en) | | BIOMARKERS FOR CLASSIFICATION OF ACUTE LEUKEMIA |
US20240209455A1 (en) | | Analysis of fragment ends in dna |
CN118240934A (en) | | Methylation signal detection method, device and kit |
CN117965725A (en) | | Method, device and kit for distinguishing liver cancer from liver non-cancer disease samples |
Garikipati | | Computational genomic algorithms for miRNA-based diagnosis of lung cancer: the potential of machine learning |
CN118314951A (en) | | Glioblastoma prognosis biomarker screening analysis method and system |
Floares et al. | | BLADDER CANCER NON-INVASIVE DIAGNOSIS I-BIOMARKERS BASED ON PLASMA MICRORNA WITH 100% ACCURACY |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||