WO2023130670A1 - Genome canceration information detection system and detection method based on cell-free DNA - Google Patents

Genome canceration information detection system and detection method based on cell-free DNA

Info

Publication number
WO2023130670A1
Authority
WO
WIPO (PCT)
Prior art keywords
sample
baseline
analysis
arm
value
Prior art date
Application number
PCT/CN2022/098450
Other languages
English (en)
French (fr)
Inventor
李宇龙
洪媛媛
韩天澄
吕芳
杨顺莉
聂佩瑶
张琦
陈维之
Original Assignee
无锡臻和生物科技有限公司
臻和(北京)生物科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 无锡臻和生物科技有限公司, 臻和(北京)生物科技有限公司
Priority to US18/052,067 (published as US20240060137A1)
Publication of WO2023130670A1

Classifications

    • CCHEMISTRY; METALLURGY
    • C12BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12QMEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6876Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
    • C12Q1/6883Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material
    • C12Q1/6886Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material for cancer
    • CCHEMISTRY; METALLURGY
    • C12BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12QMEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6813Hybridisation assays
    • C12Q1/6827Hybridisation assays for detection of mutation or polymorphism
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/10Ploidy or copy number detection
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30Detection of binding sites or motifs
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20Supervised data analysis
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16HHEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/30ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for calculating health indices; for individual health risk assessment
    • CCHEMISTRY; METALLURGY
    • C12BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12QMEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q2600/00Oligonucleotides characterized by their use
    • C12Q2600/154Methylation markers
    • CCHEMISTRY; METALLURGY
    • C12BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12QMEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q2600/00Oligonucleotides characterized by their use
    • C12Q2600/156Polymorphic or mutational markers
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16HHEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/20ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems

Definitions

  • the invention relates to the field of genome canceration information detection, in particular to a genome canceration information detection system and detection method based on free DNA.
  • Liquid biopsy technology, especially detection of biomarker signals of tumor-derived circulating tumor DNA (ctDNA) in plasma cell-free DNA (cfDNA), has in recent years become a non-invasive tumor detection method that is widely used in tumor diagnosis, disease tracking, recurrence monitoring, etc. Compared with traditional imaging methods, liquid biopsy technology has higher detection sensitivity for early tumors, can detect multiple cancer types simultaneously, and has the potential to be used as a routine cancer screening method for the general population.
  • ctDNA is derived from necrotic, apoptotic, circulating tumor cells and exosomes secreted by tumor cells, carrying the genetic and epigenetic characteristics of tumor cells.
  • DNA methylation is an important epigenetic modification in eukaryotic cells: under the action of DNA methyltransferases (DNMTs), the cytosine of CpG islands is converted into 5-methylcytosine (5-mC).
  • the change of DNA methylation status is one of the landmark events in the process of tumorigenesis and development, and it occurs widely in the genome in the early stage of tumors.
  • CpG islands in the promoter regions of human genes are often hypermethylated in cancer, which may silence the expression of some tumor suppressor genes; at the same time, cancer genomes often show large-scale demethylation, which may lead to the activation of repetitive sequence regions or to chromosomal rearrangement.
  • Weak ctDNA signals can be detected sensitively by detecting changes in plasma cfDNA methylation status.
  • the human genome is larger than 3 Gb.
  • target region capture sequencing is currently the most commonly used methylation detection method, but its performance is limited by the screening of cancer-specific target regions, which requires that cancer tissues and matched paracancerous tissues first be analyzed by high-depth genome-wide methylation sequencing to select differentially methylated sites. A major bottleneck of this technology path is therefore the acquisition of high-quality tissue samples of various cancer types, and the screening and verification of differentially methylated sites is relatively cumbersome.
  • the fragmentation characteristics of cfDNA in cancer patients also differ from those of healthy people.
  • These fragmentation features ("fragmentomics") have been widely exploited as epigenetic biomarkers of ctDNA for detection of multiple cancer types.
  • copy number variation, a common genetic alteration in various cancers, has also been widely used to detect ctDNA signals.
  • the present invention is based on the following finding of the inventors: the inventors have discovered for the first time that, when 5-methylcytosine (5-mC) in plasma cfDNA (cell-free DNA) is converted into 5-formylcytosine (5-fC) and 5-carboxylcytosine (5-caC) and unmethylated cytosine (C) is converted to uracil (U), the resulting sequencing library can be used for simultaneous genome-wide methylation analysis, fragmentation analysis (for example, along the two dimensions of fragment length coefficient analysis and terminal motif analysis), and chromosome instability analysis (copy number variation), enabling early, sensitive, and accurate screening for multiple cancers at the same time.
  • the present invention provides a low-cost library construction method and an analysis model capable of simultaneous genome-wide methylation, fragmentation, and copy number variation analysis of plasma cfDNA for liquid biopsy screening of cancer.
  • the method is suitable for low-input cfDNA and eliminates the need for target-region capture, simplifying the technical process. Further, the present invention can optionally improve the detection sensitivity and accuracy of cancer screening through integrated analysis of the above-mentioned cancer characteristics in each dimension.
  • In a first aspect, provided herein is a genome canceration information detection system based on cell-free DNA (cfDNA), including:
  • a library construction device, which converts 5-methylcytosine (5-mC) in cell-free DNA in the sample to be tested (for example, cell-free DNA in plasma) into 5-formylcytosine (5-fC) and 5-carboxycytosine (5-caC), and converts unmethylated cytosine (C) to uracil (U), for library construction;
  • An information analysis device which includes one or more of the following modules:
  • the methylation analysis module is used to analyze the methylation information of free DNA
  • Fragment length coefficient analysis module used to analyze the fragmentation information of free DNA
  • the terminal motif analysis module is used to analyze the fragmentation information of free DNA
  • the chromosome instability analysis module is used to analyze the copy number variation information of chromosomes.
  • the information analysis device further includes an integrated classification module, which is used to integrate the information obtained by the methylation analysis module, fragment length coefficient analysis module, terminal motif analysis module and/or chromosome instability analysis module.
  • the methylation analysis module is an MD-KNN analysis module, which divides the human reference genome into intervals (i.e., bins, for example 1 Mb in size) using non-overlapping windows, calculates the proportion of methylated sites among all CpG sites in each interval, that is, the methylation density (MD) value, and calculates the predicted value K of the possibility of canceration through a KNN (K-Nearest Neighbor) model.
  • the fragment length coefficient analysis module is an FSI-SVM analysis module, which divides the human reference genome into intervals (for example, 5 Mb in size) using non-overlapping windows, calculates the ratio of the number of short fragments (for example, 101-167 bp) to the number of long fragments (for example, 170-250 bp) in each interval to obtain the fragment length coefficient FSI (fragment size index) values of each sample, and calculates the predicted value F of the possibility of canceration through an SVM (support vector machine) model.
  • the terminal motif analysis module is a Motif-SVM analysis module, which calculates the proportion of the 5' terminal 4-mer motif sequence of the fragments of the sample, and calculates the predicted value S of the possibility of canceration through the SVM model.
  • the chromosomal instability analysis module is a CIN-PAscore analysis module, which calculates the copy number of all half-arm chromosomes of the sample and calculates the PAscore (plasma aneuploidy score) by integrating the z-scores of the five half-arm chromosomes whose copy numbers change most relative to the corresponding chromosomes of the healthy baseline samples.
  • the integrated classification module is an SVM-integrated classification module, and the above-mentioned predicted values K, F, S and PAscore are integrated using a linear SVM model to obtain the final predicted value Z of a single cancer possibility.
  • said library construction device in said system comprises:
  • Plasma cell-free DNA extraction module used to extract cell-free DNA from plasma samples
  • Enzyme reaction module, which uses enzymes to convert 5-methylcytosine (5-mC) in cell-free DNA to 5-formylcytosine (5-fC) and 5-carboxycytosine (5-caC), and to convert unmethylated cytosine (C) to uracil (U);
  • the PCR reaction module uses PCR to amplify the free DNA after the enzyme reaction.
  • the enzymes used are TET2 enzymes and APOBEC enzymes.
  • the sequencing device is selected from Illumina Novaseq 6000, Illumina Nextseq500, MGI DNBSEQ-T7 or MGI SEQ-2000.
  • the MD value in the MD-KNN analysis module is calculated by the following formula:
  • $$\mathrm{MD}_{n,i} = \frac{\mathrm{Total\_mC}_{n,i}}{\mathrm{Total\_C}_{n,i}}$$
  • $\mathrm{MD}_{n,i}$ is the MD value of the i-th bin of sample n
  • $\mathrm{Total\_mC}_{n,i}$ is the total number of all methylated Cs in the i-th bin
  • $\mathrm{Total\_C}_{n,i}$ is the total number of all Cs in the i-th bin.
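The methylation density defined above can be sketched in a few lines of Python. This is only an illustrative sketch: the input layout (one tuple of chromosome, position, methylated count, and total count per CpG site) and the function name `methylation_density` are assumptions, not part of the source.

```python
from collections import defaultdict

BIN_SIZE = 1_000_000  # 1 Mb bins, the example bin size given in the text


def methylation_density(cpg_calls, bin_size=BIN_SIZE):
    """Return {(chrom, bin_index): MD} from per-CpG counts.

    cpg_calls: iterable of (chrom, pos, methylated_count, total_count).
    This input layout is an assumption made for illustration.
    """
    total_mc = defaultdict(int)
    total_c = defaultdict(int)
    for chrom, pos, meth, total in cpg_calls:
        key = (chrom, pos // bin_size)  # non-overlapping window index
        total_mc[key] += meth
        total_c[key] += total
    # MD = Total_mC / Total_C; bins without any covered C are skipped
    return {k: total_mc[k] / total_c[k] for k in total_c if total_c[k] > 0}


calls = [
    ("chr1", 10_500, 8, 10),
    ("chr1", 900_000, 2, 10),
    ("chr1", 1_200_000, 5, 5),
]
md = methylation_density(calls)
```

Each sample then yields one MD value per bin, and the vector of all bins is what a classifier would consume.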
  • the FSI value in the FSI-SVM analysis module is calculated by the following formula:
  • $$\mathrm{FSI}_{n,i} = \frac{\mathrm{Total\_S}_{n,i}}{\mathrm{Total\_L}_{n,i}}$$
  • $\mathrm{FSI}_{n,i}$ is the FSI value of the i-th bin of sample n
  • $\mathrm{Total\_S}_{n,i}$ is the number of short fragments in the i-th bin
  • $\mathrm{Total\_L}_{n,i}$ is the number of long fragments in the i-th bin.
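A minimal sketch of the short-to-long fragment ratio per bin follows. The length ranges and the 5 Mb bin size are the examples given in the text; the input layout, the assignment of a fragment to a bin by its start coordinate, and the name `fragment_size_index` are assumptions made for illustration.

```python
from collections import Counter

SHORT = (101, 167)     # short-fragment length range in bp (example from the text)
LONG = (170, 250)      # long-fragment length range in bp (example from the text)
BIN_SIZE = 5_000_000   # 5 Mb bins (example from the text)


def fragment_size_index(fragments, bin_size=BIN_SIZE):
    """Return {(chrom, bin_index): FSI}.

    fragments: iterable of (chrom, start, length). Binning by start
    coordinate is an assumption; the source does not specify it.
    """
    short, long_ = Counter(), Counter()
    for chrom, start, length in fragments:
        key = (chrom, start // bin_size)
        if SHORT[0] <= length <= SHORT[1]:
            short[key] += 1
        elif LONG[0] <= length <= LONG[1]:
            long_[key] += 1
    # FSI = Total_S / Total_L; bins with no long fragments are skipped
    return {k: short[k] / long_[k] for k in long_ if long_[k] > 0}


frags = [
    ("chr1", 100, 150),
    ("chr1", 200, 160),
    ("chr1", 300, 200),
    ("chr1", 6_000_000, 180),
    ("chr1", 6_000_100, 168),  # 168 bp falls in neither range
]
fsi = fragment_size_index(frags)
```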
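  • the motif proportion in the Motif-SVM analysis module is calculated by the following formula:
  • $$\mathrm{Fraction}_{n,i} = \frac{M_{i}}{\sum_{j} M_{j}}$$
  • $\mathrm{Fraction}_{n,i}$ is the proportion of the i-th 4-mer motif in sample n
  • $M_{i}$ is the number of the i-th 4-mer motif, and the sum runs over all 4-mer motifs counted in sample n.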
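The motif proportion can be sketched as a simple normalized count. The sketch assumes the 5' terminal 4-mers have already been extracted from each fragment, and it assumes the denominator is the total count of all valid 4-mer motifs in the sample; `motif_fractions` is a hypothetical helper name.

```python
from collections import Counter

VALID_BASES = set("ACGT")


def motif_fractions(end_motifs):
    """Return {motif: fraction} for the 5' terminal 4-mers of one sample.

    end_motifs: iterable of 4-character strings, one per fragment end.
    Motifs containing bases other than A/C/G/T are ignored (an assumption).
    """
    counts = Counter(
        m.upper() for m in end_motifs
        if len(m) == 4 and set(m.upper()) <= VALID_BASES
    )
    total = sum(counts.values())
    return {m: c / total for m, c in counts.items()} if total else {}


fractions = motif_fractions(["CCCA", "CCCA", "CCTG", "NNNN"])
```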
  • the PAscore in the CIN-PAscore analysis module is calculated by the following formulas:
  • $$Z_{n,i} = \frac{\mathrm{ARM}_{n,i} - \mathrm{MEAN\_baseline}_{i}}{\mathrm{SD\_baseline}_{i}}$$
  • $Z_{n,i}$ is the z-score of half-arm chromosome i of sample n relative to the baseline sample
  • $\mathrm{ARM}_{n,i}$ is the number of reads of half-arm chromosome i of sample n
  • $\mathrm{MEAN\_baseline}_{i}$ is the mean number of reads of half-arm chromosome i of the baseline sample
  • $\mathrm{SD\_baseline}_{i}$ is the standard deviation of the number of reads of half-arm chromosome i of the baseline sample;
  • $\mathrm{logP}_{n}$ is the negative value of the logarithmic sum of the P values of the z-scores of the five half-arm chromosomes of sample n in the t-distribution with 3 degrees of freedom;
  • $$\mathrm{PAscore}_{n} = \frac{\mathrm{logP}_{n} - \mathrm{MEAN\_baseline}_{\mathrm{logP}}}{\mathrm{SD\_baseline}_{\mathrm{logP}}}$$
  • $\mathrm{PAscore}_{n}$ is the PAscore of sample n
  • $\mathrm{MEAN\_baseline}_{\mathrm{logP}}$ is the mean value of logP of the baseline sample
  • $\mathrm{SD\_baseline}_{\mathrm{logP}}$ is the standard deviation of logP of the baseline sample.
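The PAscore steps described above can be sketched with the Python standard library. Several details are assumptions because the source does not state them: the P value is taken as two-sided, the logarithm is base 10, the sample standard deviation is used, and the "five half-arm chromosomes" are those with the largest absolute z-score. The closed-form CDF used here is the standard expression for Student's t-distribution with exactly 3 degrees of freedom.

```python
import math
from statistics import mean, stdev


def t3_two_sided_p(z):
    """Two-sided P value of z under Student's t with 3 degrees of freedom."""
    x = abs(z) / math.sqrt(3.0)
    cdf = 0.5 + (x / (1.0 + x * x) + math.atan(x)) / math.pi
    # Floor the value to avoid log(0) for extreme z (numerical guard only)
    return max(2.0 * (1.0 - cdf), 1e-300)


def z_scores(arm_reads, baseline_reads):
    """arm_reads: {arm: count}; baseline_reads: {arm: [counts in baseline samples]}."""
    return {
        arm: (n - mean(baseline_reads[arm])) / stdev(baseline_reads[arm])
        for arm, n in arm_reads.items()
    }


def log_p(zs, top=5):
    """Negative sum of log10 P values over the `top` arms with largest |z|."""
    top_z = sorted(zs.values(), key=abs, reverse=True)[:top]
    return -sum(math.log10(t3_two_sided_p(z)) for z in top_z)


def pa_score(sample_logp, baseline_logps):
    """Standardize a sample's logP against the logP values of baseline samples."""
    return (sample_logp - mean(baseline_logps)) / stdev(baseline_logps)


z = z_scores({"1p": 120}, {"1p": [100, 110, 90]})
```

In practice the read counts would normally be normalized for sequencing depth before this step; that normalization is not described in the source and is omitted here.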
  • the information analysis device includes a data preprocessing module, which converts the raw FASTQ data output by the sequencing device into BAM files usable by each module and builds an index, for example through alignment, deduplication, sorting, marking, filtering, and indexing.
  • In a second aspect, provided herein is a method for detecting genome canceration information based on cell-free DNA, which is performed using the system described in the first aspect above.
  • the method for detecting genome canceration information based on free DNA comprises:
  • Sequencing information analysis which includes one or more of the following analysis steps:
  • Methylation analysis used to analyze the methylation information of free DNA
  • Fragment length coefficient analysis used to analyze the fragmentation information of free DNA
  • Terminal motif analysis for analyzing fragmentation information of cell-free DNA
  • Chromosomal instability analysis used to analyze the copy number variation information of chromosomes.
  • the analysis of sequencing information further includes an integrated classification step for integrating the information obtained from the methylation analysis, fragment length coefficient analysis, terminal motif analysis and/or chromosome instability analysis .
  • the methylation analysis includes dividing the human reference genome into intervals (for example, 1 Mb in size) using non-overlapping windows, calculating the proportion of methylated sites among all CpG sites in each interval, that is, the methylation density (MD) value, and calculating the predicted value K of the possibility of canceration through the KNN model, referred to as MD-KNN analysis.
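As an illustration of how a KNN model can turn per-bin MD values into a predicted value K, the sketch below scores a sample by the fraction of cancer-labeled samples among its k nearest training samples. Using that fraction as K, Euclidean distance, and k = 3 are assumptions; the source does not specify how K is derived from the neighbors.

```python
import math


def knn_score(sample, training, k=3):
    """Fraction of cancer-labeled samples among the k nearest neighbors.

    sample: list of per-bin MD values.
    training: list of (feature_vector, label), label 1 = cancer, 0 = healthy.
    """
    nearest = sorted(training, key=lambda item: math.dist(sample, item[0]))[:k]
    return sum(label for _, label in nearest) / len(nearest)


training = [
    ([0.70, 0.72], 0),
    ([0.71, 0.70], 0),
    ([0.60, 0.61], 1),
    ([0.59, 0.62], 1),
    ([0.61, 0.60], 1),
]
k_cancer_like = knn_score([0.60, 0.60], training)   # all 3 neighbors are cancer
k_healthy_like = knn_score([0.70, 0.71], training)  # 2 healthy, 1 cancer
```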
  • the fragment length coefficient analysis includes dividing the human reference genome into intervals (for example, 5 Mb in size) using non-overlapping windows, calculating the ratio of the number of short fragments (for example, 101-167 bp) to the number of long fragments (for example, 170-250 bp) in each interval to obtain the fragment length coefficient (FSI) values of each sample, and calculating the predicted value F of the possibility of canceration through the SVM model, that is, FSI-SVM analysis.
  • the terminal motif analysis includes calculating the proportion of the 5' terminal 4-mer motif sequence of the fragments of the sample, and calculating the predictive value S of the possibility of canceration through the SVM model, that is, Motif-SVM analysis .
  • the analysis of chromosome instability includes calculating the copy number of all half-arm chromosomes of the sample, by integrating the z-score of the five half-arm chromosomes with the largest variation in the corresponding chromosome copy number of the healthy person baseline sample , to calculate the PAscore value, that is, the CIN-PAscore analysis.
  • the SVM-integrated classification includes integrating the above-mentioned predicted values K, F, S, and PAscore using a linear SVM model to obtain the final predicted value Z of the possibility of single cancer, that is, the SVM-integrated classification .
  • said library construction comprises:
  • Enzymatic reaction step, which uses enzymes to convert 5-methylcytosine (5-mC) in cell-free DNA to 5-formylcytosine (5-fC) and 5-carboxycytosine (5-caC), and to convert unmethylated cytosine (C) to uracil (U); and
  • the enzymes are TET2 enzymes and APOBEC enzymes.
  • said sequencing is performed using: Illumina Novaseq 6000, Illumina Nextseq500, MGI DNBSEQ-T7, or MGI SEQ-2000.
  • the MD value in the MD-KNN analysis is calculated by the following formula:
  • $$\mathrm{MD}_{n,i} = \frac{\mathrm{Total\_mC}_{n,i}}{\mathrm{Total\_C}_{n,i}}$$
  • $\mathrm{MD}_{n,i}$ is the MD value of the i-th bin of sample n
  • $\mathrm{Total\_mC}_{n,i}$ is the total number of all methylated Cs in the i-th bin
  • $\mathrm{Total\_C}_{n,i}$ is the total number of all Cs in the i-th bin.
  • the FSI value in the FSI-SVM analysis is calculated by the following formula:
  • $$\mathrm{FSI}_{n,i} = \frac{\mathrm{Total\_S}_{n,i}}{\mathrm{Total\_L}_{n,i}}$$
  • $\mathrm{FSI}_{n,i}$ is the FSI value of the i-th bin of sample n
  • $\mathrm{Total\_S}_{n,i}$ is the number of short fragments in the i-th bin
  • $\mathrm{Total\_L}_{n,i}$ is the number of long fragments in the i-th bin.
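  • the motif proportion in the Motif-SVM analysis is calculated by the following formula:
  • $$\mathrm{Fraction}_{n,i} = \frac{M_{i}}{\sum_{j} M_{j}}$$
  • $\mathrm{Fraction}_{n,i}$ is the proportion of the i-th 4-mer motif in sample n
  • $M_{i}$ is the number of the i-th 4-mer motif, and the sum runs over all 4-mer motifs counted in sample n.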
  • the PAscore in the CIN-PAscore analysis is calculated by the following formulas:
  • $$Z_{n,i} = \frac{\mathrm{ARM}_{n,i} - \mathrm{MEAN\_baseline}_{i}}{\mathrm{SD\_baseline}_{i}}$$
  • $Z_{n,i}$ is the z-score of half-arm chromosome i of sample n relative to the baseline sample
  • $\mathrm{ARM}_{n,i}$ is the number of reads of half-arm chromosome i of sample n
  • $\mathrm{MEAN\_baseline}_{i}$ is the mean number of reads of half-arm chromosome i of the baseline sample
  • $\mathrm{SD\_baseline}_{i}$ is the standard deviation of the number of reads of half-arm chromosome i of the baseline sample;
  • $\mathrm{logP}_{n}$ is the negative value of the logarithmic sum of the P values of the z-scores of the five half-arm chromosomes of sample n in the t-distribution with 3 degrees of freedom;
  • $$\mathrm{PAscore}_{n} = \frac{\mathrm{logP}_{n} - \mathrm{MEAN\_baseline}_{\mathrm{logP}}}{\mathrm{SD\_baseline}_{\mathrm{logP}}}$$
  • $\mathrm{PAscore}_{n}$ is the PAscore of sample n
  • $\mathrm{MEAN\_baseline}_{\mathrm{logP}}$ is the mean value of logP of the baseline sample
  • $\mathrm{SD\_baseline}_{\mathrm{logP}}$ is the standard deviation of logP of the baseline sample.
  • the information analysis further includes data preprocessing, converting the off-machine FASTQ data obtained by the sequencing device into a Bam file usable by each module, and establishing an index.
  • Fig. 1 Schematic diagram of the workflow for cfDNA-based low-depth whole-genome sequencing and canceration information detection in the present invention.
  • Fig. 2 ROC curves for multi-cancer prediction in the independent validation set using the KNN model of genome-wide methylation density (MD) (MD-KNN analysis module) of the present invention.
  • Fig. 3 ROC curves for multi-cancer prediction in the independent validation set using the SVM model of the whole-genome fragment length index (FSI) (FSI-SVM analysis module) of the present invention.
  • Fig. 4 ROC curves for multi-cancer prediction in the independent validation set using the SVM model of the proportions of characteristic fragment-end motifs (Motif-SVM analysis module) of the present invention.
  • Fig. 5 ROC curves for multi-cancer prediction in the independent validation set using PAscore to measure half-arm chromosome instability (CIN-PAscore analysis module) of the present invention.
  • Fig. 6 ROC curves for multi-cancer prediction in the independent validation set using the final integrated classification module of the present invention.
  • the present invention includes construction and sequencing of a low-depth complete methylome sequencing library, multi-dimensional feature extraction of sequencing data, and construction of a prediction model using machine learning.
  • the invention uses TET2 enzyme and APOBEC enzyme to realize the conversion of unmethylated cytosine (C) into uracil (U).
  • TET2 enzyme is used to catalyze the conversion of 5-methylcytosine (5-mC) into 5-hydroxymethylcytosine (5-hmC), which is further oxidized into 5-formylcytosine (5-fC) and 5-carboxycytosine (5-caC), thereby protecting 5-mC and 5-hmC from being affected in the subsequent APOBEC deamination reaction.
  • APOBEC enzymes deaminate unmethylated cytosine (C) to uracil (U), which is replaced by thymine (T) in subsequent library amplification PCR reactions.
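The read-out logic of the enzymatic conversion described above can be illustrated in silico: unmethylated C is read as T after conversion and PCR, while protected (originally methylated) C is still read as C. The sketch below models only the top strand and assumes error-free conversion and sequencing; the function names are hypothetical.

```python
def convert(seq, methylated_positions):
    """Simulate the sequence read from a top-strand fragment after conversion."""
    out = []
    for i, base in enumerate(seq):
        if base == "C" and i not in methylated_positions:
            out.append("T")   # unmethylated C -> U (APOBEC), amplified as T
        else:
            out.append(base)  # 5-mC oxidized by TET2 is protected and reads as C
    return "".join(out)


def call_methylation(reference, read):
    """Return {position: is_methylated} for every reference C covered by the read."""
    return {
        i: read[i] == "C"
        for i, base in enumerate(reference)
        if base == "C" and read[i] in "CT"
    }


reference = "ACGTCCGA"
read = convert(reference, methylated_positions={1})
calls = call_methylation(reference, read)
```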
  • reaction conditions of enzymatic conversion are mild, which can protect the integrity of DNA molecules to the greatest extent, so it can be used for the analysis of cfDNA fragment characteristics, and can be used for library construction with low input DNA.
  • the methylation state in the process of tumor occurrence and development will have a wide range of abnormalities in the genome.
  • the present invention can simply and sensitively judge whether the plasma methylation level is normal, and thereby infer whether the plasma contains a ctDNA signal.
  • machine learning algorithms such as KNN (K-Nearest Neighbor) can be used for modeling to further improve detection sensitivity.
  • the fragment length of cfDNA derived from tumor cells is more heterogeneous than that of non-tumor cells.
  • the fragment length index (FSI), that is, the profile of the ratio of short to long cfDNA fragments across regions of the genome, is highly consistent among healthy individuals but altered in certain regions in cancer patients, possibly reflecting cancer-associated abnormalities in chromatin structure or other genomic features.
  • the present invention can simply and sensitively identify whether there is tumor-derived ctDNA by comparing the cfDNA fragment length coefficients of the sample to be tested and the baseline of healthy people. Feature recognition through machine learning algorithms can further improve detection sensitivity.
  • the sequences of the 4-mer motifs at the ends of plasma cfDNA fragments are biased, which may be related to the sequence recognition properties of DNA endonucleases such as DNASE1L3.
  • the present invention selects 125 motif sequences with the highest proportion among 256 possible 4-mer motifs, and uses machine learning model training to recognize the plasma terminal motif characteristics of cancer patients to judge the samples to be tested.
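Selecting the 125 most abundant motifs out of the 256 possible 4-mers can be sketched as a ranking step followed by feature extraction. How the ranking proportions are obtained (for example, averaged over training samples) is not stated in the source, so the `mean_fraction` input is an assumption, as are the function names and the alphabetical tie-break.

```python
from itertools import product

# All 4**4 = 256 possible 4-mer motifs
ALL_4MERS = ["".join(p) for p in product("ACGT", repeat=4)]


def top_motifs(mean_fraction, n=125):
    """Return the n motifs with the highest proportion (ties broken alphabetically)."""
    ranked = sorted(ALL_4MERS, key=lambda m: (-mean_fraction.get(m, 0.0), m))
    return ranked[:n]


def feature_vector(sample_fraction, selected):
    """Per-sample feature vector restricted to the selected motifs."""
    return [sample_fraction.get(m, 0.0) for m in selected]


selected = top_motifs({"CCCA": 0.03, "CCTG": 0.02}, n=3)
features = feature_vector({"CCCA": 0.5}, ["CCCA", "CCTG"])
```

The resulting fixed-length vectors are the kind of input an SVM classifier would be trained on.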
  • Copy number variation is one of the most common genetic signature changes in cancer cells and is a common mechanism by which cancer genome instability occurs. Most solid tumors are characterized by chromosomal instability, manifested by copy number changes of entire or partial chromosomes.
  • the present invention can directly identify the chromosomal variation derived from the tumor by calculating the chromosome copy number at the half-arm level and performing statistical analysis with the healthy person's baseline, and provides a highly specific liquid biopsy method.
  • the analysis of the above four dimensions on the WMS data of each sample can comprehensively measure whether the sample to be tested has a tumor signal based on different biological mechanisms.
  • Using the integrated model to integrate the prediction results of each dimension feature to construct a classifier based on multi-omics analysis can further improve the sensitivity and specificity of the model.
  • the present invention also has many other advantages compared with the prior art.
  • the present invention identifies abnormal methylation signals by detecting low-depth genome-wide methylation profiles in plasma. Compared with commonly used target region capture sequencing methods, it does not require cancer tissues or public databases to be used in advance to screen cancer differentially methylated sites, nor subsequent verification in plasma cfDNA, which greatly simplifies the experimental and data analysis process of methylation detection and reduces the cost of detection.
  • the present invention uses enzyme conversion method with mild reaction conditions to perform methylation sequencing, which can minimize the damage to DNA molecules compared with the method of bisulfite conversion.
  • on the one hand, this method is suitable for low-input cfDNA library construction, and the cfDNA extracted from only 10 mL of blood is sufficient to build a library successfully; on the other hand, this method retains the original fragment characteristics of cfDNA molecules, so that the same cfDNA library can be used for integrated analysis of multi-dimensional features such as methylation, fragmentomics, and CNV to improve the sensitivity and specificity of detection.
  • the present invention directly compares the genetic and epigenetic characteristics of the sample to be tested with the healthy baseline across the whole genome, without the need to screen differential sites for each cancer type, and can detect multiple cancer types simultaneously.
  • the cancer types of patients include breast cancer, colorectal cancer, esophageal cancer, gastric cancer, liver cancer, lung cancer, and pancreatic cancer.
  • the training set includes 352 healthy people and 559 cancer patients (45 cases of breast cancer, 105 cases of colorectal cancer, 44 cases of esophageal cancer, 79 cases of gastric cancer, 79 cases of liver cancer, 110 cases of lung cancer, 83 cases of pancreatic cancer, 14 cases of other cancers), of which 34.5% were early stage (stage I or II).
  • the validation set includes 145 healthy people and 236 cancer patients (21 cases of breast cancer, 45 cases of colorectal cancer, 18 cases of esophageal cancer, 35 cases of gastric cancer, 34 cases of liver cancer, 47 cases of lung cancer, 36 cases of pancreatic cancer), of which 31.8% were early stage (stage I or II).
  • 5-Methylcytosine (5-mC) was converted to 5-formylcytosine (5-fC) and 5-carboxycytosine (5-caC), unmethylated cytosine (C) was deaminated to uracil (U) by APOBEC enzyme, and the library was then built by amplification.
  • the input amount of cfDNA sample is 5-30 ng, and no fragmentation (shearing) is required.
  • Reagent / Volume
      TET2 Reaction Buffer (prepared in 2.6.1)    10 µL
      DTT                                          1 µL
      Oxidation Supplement                         1 µL
      Oxidation Enhancer                           1 µL
      TET2                                         4 µL
      Total volume                                17 µL
  • Reagent / Volume
      DNA sample                                  45 µL
      Diluted Fe(II)                               5 µL
      Total volume                                50 µL
  • the constructed library was quantified using Qubit high-sensitivity reagent (Thermo Scientific cat# Q32854), and libraries with a yield greater than 400 ng were used for subsequent sequencing.
  • PhiX DNA (Illumina cat# FC-110-3001) was mixed into the sample loaded on the sequencer, and PE100 sequencing was performed on the NovaSeq 6000 (Illumina) platform.
  • Methylation density (MD) analysis (MD-KNN analysis module)
  • $$\mathrm{MD}_{n,i} = \frac{\mathrm{Total\_mC}_{n,i}}{\mathrm{Total\_C}_{n,i}}$$
  • $\mathrm{MD}_{n,i}$ is the MD value of the i-th bin of sample n
  • $\mathrm{Total\_mC}_{n,i}$ is the total number of all methylated Cs in the i-th bin
  • $\mathrm{Total\_C}_{n,i}$ is the total number of all Cs in the i-th bin.
  • the area under the ROC curve (AUC) of the MD-KNN classifier for the detection of a single cancer type in the test set reached 0.789-0.870, and the AUC for the detection of all seven cancer types reached 0.830, showing good cancer detection performance.
  • Fragment size index (FSI) analysis (FSI-SVM analysis module)
  • $$\mathrm{FSI}_{n,i} = \frac{\mathrm{Total\_S}_{n,i}}{\mathrm{Total\_L}_{n,i}}$$
  • $\mathrm{FSI}_{n,i}$ is the FSI value of the i-th bin of sample n
  • $\mathrm{Total\_S}_{n,i}$ is the number of short fragments in the i-th bin
  • $\mathrm{Total\_L}_{n,i}$ is the number of long fragments in the i-th bin.
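  • $$\mathrm{Fraction}_{n,i} = \frac{M_{i}}{\sum_{j} M_{j}}$$
  • $\mathrm{Fraction}_{n,i}$ is the proportion of the i-th 4-mer motif in sample n
  • $M_{i}$ is the number of the i-th 4-mer motif, and the sum runs over all 4-mer motifs counted in sample n.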
  • the SVM model was trained using the caret package in R; the hyperparameters were selected by grid search with 10-fold cross-validation.
  • the classification prediction of healthy people or cancer patients is performed, and the predicted value S is obtained.
  • the ROC curve area (AUC) of the Motif-SVM classifier for the detection of a single cancer in the test set reached 0.920-0.966, and the AUC performance for the detection of all seven cancers reached 0.943, showing good cancer detection performance.
  • Chromosome instability (CIN) analysis (CIN-PAscore analysis module)
  • The read count of each chromosome arm of the test sample is converted to a z-score using the mean and standard deviation of the read counts of the corresponding arm in the baseline samples.
  • $Z_{n,i}$ is the z-score of chromosome arm i of sample n relative to the baseline samples
  • $ARM_{n,i}$ is the read count of chromosome arm i of sample n
  • $MEAN\_baseline_{i}$ is the mean read count of chromosome arm i in the baseline samples
  • $SD\_baseline_{i}$ is the standard deviation of the read count of chromosome arm i in the baseline samples;
  • $logP_{n}$ is the negative of the sum of the logarithms of the P values of the z-scores of the five chromosome arms of sample n under a t-distribution with 3 degrees of freedom;
  • $PAscore_{n} = |logP_{n} - MEAN\_baseline_{logP}| / SD\_baseline_{logP}$
  • $PAscore_{n}$ is the PAscore of sample n
  • $MEAN\_baseline_{logP}$ is the mean logP of the baseline samples
  • $SD\_baseline_{logP}$ is the standard deviation of logP of the baseline samples.
  • The integrated model classifier of the present invention has an AUC of 0.934-0.971 for detecting individual cancer types in the test set and an AUC of 0.952 for detecting all seven cancer types together. Its performance exceeds that of any single genetic or epigenetic feature classifier, demonstrating the advantage of multi-dimensional integrated analysis of carcinogenesis data over single-omics analysis.
  • At 95% specificity, the detection sensitivity of the integrated model classifier is above 60% for each of the seven cancer types in the test set and can reach 75% for early cancer (stage I or II), showing good detection performance across cancer types and considerable potential for early cancer screening.

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Analytical Chemistry (AREA)
  • General Health & Medical Sciences (AREA)
  • Organic Chemistry (AREA)
  • Biophysics (AREA)
  • Biotechnology (AREA)
  • Medical Informatics (AREA)
  • Genetics & Genomics (AREA)
  • Zoology (AREA)
  • Wood Science & Technology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Molecular Biology (AREA)
  • Theoretical Computer Science (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Pathology (AREA)
  • Evolutionary Biology (AREA)
  • Immunology (AREA)
  • Biochemistry (AREA)
  • General Engineering & Computer Science (AREA)
  • Microbiology (AREA)
  • Public Health (AREA)
  • Hospice & Palliative Care (AREA)
  • Data Mining & Analysis (AREA)
  • Oncology (AREA)
  • Epidemiology (AREA)
  • Databases & Information Systems (AREA)
  • Primary Health Care (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Bioethics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

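A genomic carcinogenesis information detection system and detection method based on cell-free DNA, in particular plasma cell-free DNA. The system comprises: a library construction device, which uses enzymes to convert 5-methylcytosine (5-mC) in the cell-free DNA of a test sample to 5-formylcytosine (5-fC) and 5-carboxycytosine (5-caC), and unmethylated cytosine (C) to uracil (U); a sequencing device; and an information analysis device, which can analyze genome-wide methylation density, fragment length distribution, fragment 5' end motifs and/or chromosome stability.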

Description

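Detection system and detection method of genomic carcinogenesis information based on cell-free DNA
Technical Field
The present invention relates to the field of genomic carcinogenesis information detection, and in particular to a detection system and detection method of genomic carcinogenesis information based on cell-free DNA.
Background
Early screening and early diagnosis of cancer make timely treatment possible and thereby reduce cancer mortality. Traditional tumor diagnosis relies mainly on imaging examinations such as gastroscopy and colonoscopy. As invasive procedures they can injure the patient, and their detection sensitivity is limited by the stage of tumor development: only lesions more than 1 cm in diameter can be found, by which time the disease has usually reached a middle or late stage. Pathological tissue biopsy is the gold standard for cancer diagnosis, but sampling is difficult, tumor heterogeneity often makes complete sampling impossible, which hampers diagnosis and typing, and the procedure readily leads to complications. Liquid biopsy, and in particular the detection of biomarker signals from tumor-derived circulating tumor DNA (ctDNA) within plasma cell-free DNA (cfDNA), has in recent years been widely used as a non-invasive means of tumor diagnosis, disease tracking and recurrence monitoring. Compared with traditional imaging methods, liquid biopsy has higher detection sensitivity for early tumors and can detect multiple cancer types simultaneously, so it has the potential to become a routine cancer screening tool for the general population.
ctDNA originates from necrotic, apoptotic and circulating tumor cells and from exosomes secreted by tumor cells, and carries the genetic and epigenetic features of tumor cells. DNA methylation is an important epigenetic modification in eukaryotic cells, in which DNA methyltransferases (DNMTs) convert cytosine in CpG islands to 5-methylcytosine (5-mC). Changes in DNA methylation status are one of the hallmark events of tumorigenesis and tumor progression, and occur widely across the genome from the early stages of a tumor. CpG islands in human gene promoter regions are frequently hypermethylated in cancer, which may silence the expression of certain tumor suppressor genes; at the same time, cancer genomes often show large-scale demethylation, which may lead to activation of repetitive sequence regions or to chromosomal rearrangement.
Detecting changes in the methylation status of plasma cfDNA allows weak ctDNA signals to be detected sensitively. The human genome is larger than 3 Gb, so for reasons of sequencing cost, targeted capture sequencing is currently the most commonly used methylation detection approach. Its performance, however, is limited by the selection of cancer-type-specific target regions, which requires prior high-depth whole-genome methylation sequencing of cancer tissue and matched adjacent tissue to select differentially methylated sites. A major bottleneck of this technical route is therefore obtaining high-quality tissue samples for each cancer type, and the screening and validation of differentially methylated sites is laborious.
In addition to changes in methylation status, the fragmentation features of cfDNA from cancer patients, including the proportions of fragments of different lengths in each region of the genome and the fragment end sequences, also differ from those of healthy individuals. In recent years these features have been widely developed as another sensitive epigenetic biomarker of ctDNA for the detection of multiple cancer types ("fragmentomics"). Moreover, copy number variation (CNV) is a common genetic alteration in various cancers and is also widely used to detect ctDNA signals.
Traditional methylation sequencing uses bisulfite to deaminate unmethylated cytosine (C) to uracil (U). The high temperature and high pH of this reaction cause severe degradation of DNA molecules, so the original DNA fragment features are lost.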
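Summary of the Invention
There remains a need for systems and methods that can, from a single sequencing library constructed from cell-free DNA, simultaneously analyze features including methylation, fragmentation and copy number variation; that can detect genomic carcinogenesis information more accurately, more sensitively, more cheaply and more simply; and that can be used for early, sensitive and accurate screening of multiple cancers.
The present invention is based on the following finding by the inventors. The inventors found for the first time that enzymatic treatment of plasma cfDNA (cell-free DNA), which converts its 5-methylcytosine (5-mC) to 5-formylcytosine (5-fC) and 5-carboxycytosine (5-caC) and its unmethylated cytosine (C) to uracil (U), yields a sequencing library that can be used simultaneously for genome-wide methylation analysis, fragmentation analysis (for example along the two dimensions of fragment size index analysis and end motif analysis) and chromosome instability analysis (copy number variation), so that multiple cancers can be screened early, sensitively and accurately at the same time.
The present invention provides a low-cost library construction method and analysis models for liquid-biopsy cancer screening that allow simultaneous genome-wide methylation, fragmentation and copy number variation analysis of plasma cfDNA. The method is suitable for low-input cfDNA and requires no target region capture, which simplifies the workflow. Furthermore, the invention can optionally improve the sensitivity and accuracy of cancer screening through integrated analysis of the cancer features in the above dimensions.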
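In one aspect, provided herein is a genomic carcinogenesis information detection system based on cell-free DNA (cfDNA), comprising:
a library construction device, which uses enzymes to convert 5-methylcytosine (5-mC) in the cell-free DNA of a test sample (for example, cell-free DNA in plasma) to 5-formylcytosine (5-fC) and 5-carboxycytosine (5-caC), and unmethylated cytosine (C) to uracil (U), for library construction;
a sequencing device for sequencing the constructed library;
an information analysis device comprising one or more of the following modules:
a methylation analysis module for analyzing the methylation information of the cell-free DNA,
a fragment size index analysis module for analyzing the fragmentation information of the cell-free DNA,
an end motif analysis module for analyzing the fragmentation information of the cell-free DNA,
a chromosome instability analysis module for analyzing chromosomal copy number variation information.
In some embodiments, the information analysis device further comprises an integrated classification module for integrating the information obtained by the methylation analysis module, the fragment size index analysis module, the end motif analysis module and/or the chromosome instability analysis module.
In some embodiments, the methylation analysis module is an MD-KNN analysis module, which divides the human reference genome into intervals (bins, for example 1 Mb in size) by a non-overlapping sliding-window method, calculates for each interval the proportion of methylated sites among all CpG sites, i.e. the methylation density (MD) value, and calculates a predicted value K of the likelihood of carcinogenesis with a KNN (K-Nearest Neighbor) model.
In some specific embodiments, the fragment size index analysis module is an FSI-SVM analysis module, which divides the human reference genome into intervals (for example 5 Mb in size) by a non-overlapping sliding-window method, calculates for each interval the ratio of the number of short fragments (for example 101-167 bp) to the number of long fragments (for example 170-250 bp) to obtain the fragment size index (FSI) values of each sample, and calculates a predicted value F of the likelihood of carcinogenesis with an SVM (support vector machine) model.
In some embodiments, the end motif analysis module is a Motif-SVM analysis module, which calculates the proportions of the 4-mer motif sequences at the 5' ends of the sample's fragments and calculates a predicted value S of the likelihood of carcinogenesis with an SVM model.
In some embodiments, the chromosome instability analysis module is a CIN-PAscore analysis module, which calculates the copy number of every chromosome arm of the sample and calculates the PAscore (plasma aneuploidy score) by integrating the z-scores of the five chromosome arms whose copy number differs most from that of the corresponding arms in the healthy baseline samples.
In some embodiments, the integrated classification module is an SVM-ensemble classification module, which integrates the predicted values K, F, S and PAscore with a linear SVM model to obtain a final single predicted value Z of the likelihood of carcinogenesis.
In some specific embodiments, the library construction device of the system comprises:
a plasma cell-free DNA extraction module for extracting cell-free DNA from a plasma sample;
an enzyme reaction module, which uses enzymes to convert 5-methylcytosine (5-mC) in the cell-free DNA to 5-formylcytosine (5-fC) and 5-carboxycytosine (5-caC), and unmethylated cytosine (C) to uracil (U);
a PCR reaction module, which amplifies the enzyme-treated cell-free DNA by PCR.
In some specific embodiments, the enzymes used are the TET2 enzyme and the APOBEC enzyme.
In some specific embodiments, the sequencing device is selected from Illumina Novaseq 6000, Illumina Nextseq500, MGI DNBSEQ-T7 or MGI SEQ-2000.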
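In some specific embodiments, the MD value in the MD-KNN analysis module is calculated by the following formula:
$MD_{n,i} = Total\_mC_{n,i} / Total\_C_{n,i}$
where $MD_{n,i}$ is the MD value of the i-th bin of sample n, $Total\_mC_{n,i}$ is the total number of methylated Cs in the i-th bin, and $Total\_C_{n,i}$ is the total number of Cs in the i-th bin.
In some specific embodiments, the FSI value in the FSI-SVM analysis module is calculated by the following formula:
$FSI_{n,i} = Total\_S_{n,i} / Total\_L_{n,i}$
where $FSI_{n,i}$ is the FSI value of the i-th bin of sample n, $Total\_S_{n,i}$ is the number of short fragments in the i-th bin, and $Total\_L_{n,i}$ is the number of long fragments in the i-th bin.
In some specific embodiments, the motif proportion in the Motif-SVM analysis module is calculated by the following formula:
Figure PCTCN2022098450-appb-000001
where $Fraction_{n,i}$ is the proportion of the i-th 4-mer motif in sample n, and $M_{i}$ is the count of the i-th 4-mer motif.
In some specific embodiments, the PAscore in the CIN-PAscore analysis module is calculated by the following formulas:
$Z_{n,i} = (ARM_{n,i} - MEAN\_baseline_{i}) / SD\_baseline_{i}$
where $Z_{n,i}$ is the z-score of chromosome arm i of sample n relative to the baseline samples, $ARM_{n,i}$ is the read count of chromosome arm i of sample n, $MEAN\_baseline_{i}$ is the mean read count of chromosome arm i in the baseline samples, and $SD\_baseline_{i}$ is the standard deviation of the read count of chromosome arm i in the baseline samples;
the z-scores of the 5 chromosome arms of test sample n with the largest absolute z-score, together with the z-scores of the corresponding arms in the baseline samples, are taken for the subsequent analysis:
Figure PCTCN2022098450-appb-000002
where $logP_{n}$ is the negative of the sum of the logarithms of the P values of the z-scores of the 5 chromosome arms of sample n under a t-distribution with 3 degrees of freedom;
$PAscore_{n} = |logP_{n} - MEAN\_baseline_{logP}| / SD\_baseline_{logP}$
where $PAscore_{n}$ is the PAscore of sample n, $MEAN\_baseline_{logP}$ is the mean logP of the baseline samples, and $SD\_baseline_{logP}$ is the standard deviation of logP of the baseline samples.
In some specific embodiments, the information analysis device comprises a data preprocessing module, which converts the raw FASTQ data from the sequencing device into Bam files usable by each module and indexes them, for example by alignment, deduplication, sorting and read-group tagging, filtering and indexing.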
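In a second aspect, also provided herein is a detection method of genomic carcinogenesis information based on cell-free DNA, which is carried out using the system of the first aspect above.
The detection method of genomic carcinogenesis information based on cell-free DNA comprises:
library construction, in which enzymes are used to convert 5-methylcytosine (5-mC) in the cell-free DNA of a test sample (for example, cell-free DNA in plasma) to 5-formylcytosine (5-fC) and 5-carboxycytosine (5-caC), and unmethylated cytosine (C) to uracil (U), for library construction;
whole-genome sequencing, in which the constructed library is sequenced;
sequencing information analysis, comprising one or more of the following analysis steps:
methylation analysis, for analyzing the methylation information of the cell-free DNA,
fragment size index analysis, for analyzing the fragmentation information of the cell-free DNA,
end motif analysis, for analyzing the fragmentation information of the cell-free DNA,
chromosome instability analysis, for analyzing chromosomal copy number variation information.
In some specific embodiments, the sequencing information analysis further comprises an integrated classification step for integrating the information obtained by the methylation analysis, fragment size index analysis, end motif analysis and/or chromosome instability analysis.
In some specific embodiments, the methylation analysis comprises dividing the human reference genome into intervals (for example 1 Mb in size) by a non-overlapping sliding-window method, calculating for each interval the proportion of methylated sites among all CpG sites, i.e. the methylation density (MD) value, and calculating a predicted value K of the likelihood of carcinogenesis with a KNN model; this is referred to as MD-KNN analysis.
In some specific embodiments, the fragment size index analysis comprises dividing the human reference genome into intervals (for example 5 Mb in size) by a non-overlapping sliding-window method, calculating for each interval the ratio of the number of short fragments (for example 101-167 bp) to the number of long fragments (for example 170-250 bp) to obtain the fragment size index (FSI) values of each sample, and calculating a predicted value F of the likelihood of carcinogenesis with an SVM model; this is FSI-SVM analysis.
In some specific embodiments, the end motif analysis comprises calculating the proportions of the 4-mer motif sequences at the 5' ends of the sample's fragments and calculating a predicted value S of the likelihood of carcinogenesis with an SVM model; this is Motif-SVM analysis.
In some specific embodiments, the chromosome instability analysis comprises calculating the copy number of every chromosome arm of the sample and calculating the PAscore value by integrating the z-scores of the five chromosome arms whose copy number differs most from that of the corresponding arms in the healthy baseline samples; this is CIN-PAscore analysis.
In some specific embodiments, the integrated classification comprises integrating the predicted values K, F, S and PAscore with a linear SVM model to obtain a final single predicted value Z of the likelihood of carcinogenesis; this is SVM-ensemble classification.
In some specific embodiments, the library construction comprises:
extracting cell-free DNA (cfDNA) from a plasma sample;
an enzyme reaction step, in which enzymes are used to convert 5-methylcytosine (5-mC) in the cell-free DNA to 5-formylcytosine (5-fC) and 5-carboxycytosine (5-caC), and unmethylated cytosine (C) to uracil (U); and
PCR amplification, in which the enzyme-treated cell-free DNA is amplified by PCR.
In some specific embodiments, the enzymes are the TET2 enzyme and the APOBEC enzyme.
In some specific embodiments, the sequencing is performed on Illumina Novaseq 6000, Illumina Nextseq500, MGI DNBSEQ-T7 or MGI SEQ-2000.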
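In some specific embodiments, the MD value in the MD-KNN analysis is calculated by the following formula:
$MD_{n,i} = Total\_mC_{n,i} / Total\_C_{n,i}$
where $MD_{n,i}$ is the MD value of the i-th bin of sample n, $Total\_mC_{n,i}$ is the total number of methylated Cs in the i-th bin, and $Total\_C_{n,i}$ is the total number of Cs in the i-th bin.
In some specific embodiments, the FSI value in the FSI-SVM analysis is calculated by the following formula:
$FSI_{n,i} = Total\_S_{n,i} / Total\_L_{n,i}$
where $FSI_{n,i}$ is the FSI value of the i-th bin of sample n, $Total\_S_{n,i}$ is the number of short fragments in the i-th bin, and $Total\_L_{n,i}$ is the number of long fragments in the i-th bin.
In some specific embodiments, the motif proportion in the Motif-SVM analysis is calculated by the following formula:
Figure PCTCN2022098450-appb-000003
where $Fraction_{n,i}$ is the proportion of the i-th 4-mer motif in sample n, and $M_{i}$ is the count of the i-th 4-mer motif.
In some specific embodiments, the PAscore in the CIN-PAscore analysis is calculated by the following formulas:
$Z_{n,i} = (ARM_{n,i} - MEAN\_baseline_{i}) / SD\_baseline_{i}$
where $Z_{n,i}$ is the z-score of chromosome arm i of sample n relative to the baseline samples, $ARM_{n,i}$ is the read count of chromosome arm i of sample n, $MEAN\_baseline_{i}$ is the mean read count of chromosome arm i in the baseline samples, and $SD\_baseline_{i}$ is the standard deviation of the read count of chromosome arm i in the baseline samples;
the z-scores of the 5 chromosome arms of test sample n with the largest absolute z-score, together with the z-scores of the corresponding arms in the baseline samples, are taken for the following analysis:
Figure PCTCN2022098450-appb-000004
where $logP_{n}$ is the negative of the sum of the logarithms of the P values of the z-scores of the 5 chromosome arms of sample n under a t-distribution with 3 degrees of freedom;
$PAscore_{n} = |logP_{n} - MEAN\_baseline_{logP}| / SD\_baseline_{logP}$
where $PAscore_{n}$ is the PAscore of sample n, $MEAN\_baseline_{logP}$ is the mean logP of the baseline samples, and $SD\_baseline_{logP}$ is the standard deviation of logP of the baseline samples.
In some specific embodiments, the information analysis further comprises data preprocessing, in which the raw FASTQ data from the sequencing device are converted into Bam files usable by each module and indexed.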
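Brief Description of the Drawings
Figure 1. Schematic of the cfDNA-based low-depth whole-genome sequencing and carcinogenesis information detection workflow of the present invention.
Figure 2. ROC curves for predicting multiple cancer types in the independent validation set with the KNN model of genome-wide methylation density (MD) (MD-KNN analysis module).
Figure 3. ROC curves for predicting multiple cancer types in the independent validation set with the SVM model of the genome-wide fragment size index (FSI) (FSI-SVM analysis module).
Figure 4. ROC curves for predicting multiple cancer types in the independent validation set with the SVM model of fragment end motif proportions (Motif-SVM analysis module).
Figure 5. ROC curves for predicting multiple cancer types in the independent validation set using the PAscore as a measure of chromosome-arm instability (CIN-PAscore analysis module).
Figure 6. ROC curves for predicting multiple cancer types in the independent validation set with the final integrated classification module.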
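Detailed Description
As shown in Figure 1, the present invention comprises construction and sequencing of a low-depth whole-methylome sequencing library, multi-dimensional feature extraction from the sequencing data, and construction of prediction models by machine learning.
1. Preparation and sequencing of the cfDNA whole-methylome sequencing library
Principle:
The present invention uses the TET2 and APOBEC enzymes to convert unmethylated cytosine (C) to uracil (U). Specifically, the TET2 enzyme first catalyzes the conversion of 5-methylcytosine (5-mC) to 5-hydroxymethylcytosine (5-hmC) and further oxidizes it to 5-formylcytosine (5-fC) and 5-carboxycytosine (5-caC), thereby protecting 5-mC and 5-hmC from the subsequent APOBEC deamination reaction. The APOBEC enzyme deaminates unmethylated cytosine (C) to uracil (U), which is replaced by thymine (T) in the subsequent library-amplification PCR. Compared with the conventional bisulfite chemical reaction, enzymatic conversion uses mild reaction conditions and preserves the integrity of DNA molecules to the greatest extent, so it can be used to analyze cfDNA fragment features and to construct libraries from low-input DNA.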
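The read-out logic of the conversion described above can be illustrated in silico. The following is a minimal sketch, not the patent's software: the function names `enzymatic_convert` and `call_methylation` are hypothetical, positions are 0-based, only the forward strand is modeled, and sequencing errors are ignored. It assumes a protected (methylated) C is read as C and an unprotected C is read as T after PCR.

```python
def enzymatic_convert(seq, methylated_positions):
    """Simulate the sequenced read-out of one strand after TET2/APOBEC
    treatment and PCR: protected (methylated) C stays C, every other C
    is deaminated to U and is read as T."""
    out = []
    for pos, base in enumerate(seq.upper()):
        if base == "C" and pos not in methylated_positions:
            out.append("T")  # unmethylated C -> U -> read as T
        else:
            out.append(base)
    return "".join(out)


def call_methylation(reference, read):
    """Compare a converted read with the reference at reference C positions.
    Returns {position: True if methylated, False if unmethylated}."""
    calls = {}
    for pos, (ref_base, read_base) in enumerate(zip(reference.upper(), read.upper())):
        if ref_base != "C":
            continue
        if read_base == "C":
            calls[pos] = True
        elif read_base == "T":
            calls[pos] = False
        # any other base is treated as uninformative and skipped
    return calls


reference = "ACGTCCGA"
read = enzymatic_convert(reference, methylated_positions={1})
calls = call_methylation(reference, read)
```

Here only the C at position 1 is treated as methylated, so it is the only C that survives in the simulated read.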
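Protocol:
1) cfDNA is extracted from 4 mL of plasma from a healthy individual or cancer patient, and 5 ng to 30 ng of cfDNA is subjected to TET2- and APOBEC-based enzymatic conversion to prepare the sequencing library.
2) The library is sequenced at low depth (about 20 Gb of sequencing data) with 2 × 100 bp paired-end reads.
2. Methylation density (MD) analysis
Principle:
During tumorigenesis and tumor progression, methylation status becomes abnormal over large parts of the genome. By comparing the similarity of methylation levels in each genomic region between the test sample and the healthy baseline, the present invention can determine simply and sensitively whether plasma methylation levels are normal, and hence infer whether a ctDNA signal is present. Machine learning algorithms can be used for modeling during the analysis to further improve detection sensitivity.
Protocol:
1) The human reference genome is divided by a sliding-window method into intervals of 1 Mb. For each sample, the proportion of methylated sites among all CpG sites in each interval is calculated; this is the methylation density (MD value).
2) A K-Nearest Neighbor (KNN) model is trained on the methylation densities of the healthy baseline and of the samples of each cancer type in the training set, and the KNN model is used to classify each test-set sample as healthy or cancer.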
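3. Fragment size index (FSI) analysis
Principle:
The fragment lengths of cfDNA derived from tumor cells are more heterogeneous than those of cfDNA from non-tumor cells. The fragment size index (FSI), i.e. the profile of the ratio of short to long cfDNA fragments across the regions of the whole genome, is highly consistent among healthy individuals but changes in certain regions in cancer patients, which may reflect cancer-related abnormalities in chromatin structure or other genomic features. By comparing the cfDNA fragment size index of the test sample with that of the healthy baseline, the present invention can identify simply and sensitively whether tumor-derived ctDNA is present. Feature recognition by machine learning algorithms can further improve detection sensitivity.
Protocol:
1) The human reference genome is divided by a sliding-window method into intervals of 5 Mb. For each sample, the ratio of the number of short fragments to the number of long fragments in each interval is calculated to obtain the fragment size index of the sample.
2) Machine learning models are trained on the fragment size indices of the healthy baseline and of the samples of each cancer type in the training set, and the best model, an SVM (support vector machine), is used to classify each test-set sample as healthy or cancer.
4. Fragment 5' end motif analysis
Principle:
The 4-mer motif sequences at the ends of plasma cfDNA fragments show sequence preferences, which may be related to the sequence recognition properties of DNA endonucleases such as DNASE1L3. The relevant DNA endonucleases may be abnormally expressed in cancer patients, which changes the cfDNA end sequence features in their plasma; for example, the proportion of CCCA is significantly reduced in multiple cancer types. The present invention selects the 125 motif sequences with the highest proportions among the 256 possible 4-mer motifs and uses a trained machine learning model to recognize the plasma end motif features of cancer patients and classify the test sample.
Protocol:
1) For each sample, the proportions of the 256 possible 4-mer motif sequences at the 5' ends of cfDNA fragments are calculated. The 125 motifs with the highest proportions in the healthy baseline are selected.
2) Machine learning models are trained on the end motif frequency features of the healthy baseline and of the samples of each cancer type in the training set, and the best model, an SVM, is used to classify each test-set sample as healthy or cancer.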
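5. Chromosome instability (CIN) analysis
Principle:
Copy number variation is one of the most common genetic alterations in cancer cells and a general mechanism of genomic instability in cancer. Most solid tumors are characterized by chromosome instability, which appears as copy number changes of whole chromosomes or parts of chromosomes. By calculating chromosome copy number at the arm level and comparing it statistically with the healthy baseline, the present invention can directly identify tumor-derived chromosomal alterations, providing a highly specific liquid biopsy method.
Protocol:
1) The read count of each chromosome arm is calculated.
2) The read count of each chromosome arm of the test sample is compared with the baseline samples to calculate a z-score. The five chromosome arms with the largest absolute z-score are selected, each z-score is converted to a p-value, and the p-values are combined into the sample's PAscore (plasma aneuploidy score), which measures the degree of chromosomal copy number abnormality of the sample.
6. Construction of the integrated (ensemble) model classifier (SVM-ensemble classification module)
Principle:
Analyzing the WMS data of each sample in the four dimensions above makes it possible to assess comprehensively, on the basis of different biological mechanisms, whether the test sample carries a tumor signal. Using an ensemble model to integrate the predictions from each dimension into a multi-omics classifier can further improve the sensitivity and specificity of the model.
Protocol:
Machine learning models are trained on the predicted values from the four dimensions above for the healthy baseline and for the samples of each cancer type in the training set. The best model (linear SVM) is selected as the final integrated classifier and used to calculate the final single predicted value of the likelihood of carcinogenesis.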
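In addition to the advantages described above, the present invention has many other advantages over the prior art.
For example, the present invention identifies abnormal methylation signals from a low-depth genome-wide methylation profile of plasma. Unlike the commonly used targeted capture sequencing approach, it does not require prior selection of cancer-specific differentially methylated sites from cancer tissue or public databases followed by validation in plasma cfDNA. This greatly simplifies the experimental and data analysis workflow of methylation detection and reduces its cost.
For example, the present invention performs methylation sequencing by enzymatic conversion under mild reaction conditions, which minimizes damage to DNA molecules compared with bisulfite conversion. On the one hand, the method is suitable for library construction from low-input cfDNA: the cfDNA extracted from only 10 mL of blood is sufficient to build a library. On the other hand, the method preserves the original fragment features of cfDNA molecules, so that methylation, fragmentomics, CNV and other features can be analyzed together from the same cfDNA library, improving detection sensitivity and specificity.
As a further example, the present invention directly compares the similarity of genome-wide genetic and epigenetic features between the test sample and the healthy baseline, so differential sites do not have to be selected separately for each cancer type and multiple cancer types can be detected simultaneously.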
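Examples
The solution of the present invention is explained below with reference to examples. Those skilled in the art will understand that the following examples serve only to illustrate the invention and should not be regarded as limiting its scope. Where no specific technique or condition is indicated in an example, the techniques or conditions described in the literature of the field, or the product or instrument instructions, are followed. All reagents or instruments for which no manufacturer is indicated are commercially available.
Clinical cohort sample information:
This study retrospectively selected plasma from 497 healthy individuals with no history of cancer and plasma from 795 patients with various cancer types at different stages, and randomly assigned them to a training set and a validation set. The cancer types were breast, colorectal, esophageal, gastric, liver, lung and pancreatic cancer. The training set included 352 healthy individuals and 559 cancer patients (45 breast cancer, 105 colorectal cancer, 44 esophageal cancer, 79 gastric cancer, 79 liver cancer, 110 lung cancer, 83 pancreatic cancer, 14 other), of whom 34.5% were early stage (stage I or II). The validation set included 145 healthy individuals and 236 cancer patients (21 breast cancer, 45 colorectal cancer, 18 esophageal cancer, 35 gastric cancer, 34 liver cancer, 47 lung cancer, 36 pancreatic cancer), of whom 31.8% were early stage (stage I or II).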
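I. Experimental procedure:
1. Plasma cfDNA extraction
1.1 From each subject, 10 mL of whole blood was stored in a Kangwei EDTA blood collection tube and centrifuged at 1,600 × g for 10 min at 4 °C to separate plasma from blood cells. The upper plasma layer was transferred to a new centrifuge tube and centrifuged again at 12,000 rpm for 15 min at 4 °C, and the supernatant was collected to remove cell debris. About 4 mL of plasma was obtained and stored frozen at -80 °C until use.
1.2 After the plasma sample was thawed, 15 μL of Proteinase K (20 mg/mL, thermoscientific cat#EO0492) and 50 μL of SDS (20%) were added per 1 mL of sample. If the plasma volume was less than 4 mL, it was made up to 4 mL with PBS.
1.3 The sample was mixed by inversion, incubated at 60 °C for 20 min, and then placed on ice for 5 min.
1.4 cfDNA was extracted with the MagMAX Cell-Free DNA Isolation kit (thermoscientific cat#A29319).
1.5 The concentration and quality of the extracted cfDNA were checked with a Bioanalyzer 2100 (Agilent Technologies).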
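2. cfDNA library construction
The methylation library construction kit NEBNext Enzymatic Methyl-seq Kit (NEB, cat#E7120) was used with 5-30 ng of input cfDNA. The TET2 enzyme converts 5-methylcytosine (5-mC) to 5-formylcytosine (5-fC) and 5-carboxycytosine (5-caC), the APOBEC enzyme deaminates unmethylated cytosine (C) to uracil (U), and the library is then amplified.
The library construction procedure is as follows:
2.1 Preparation of spike-in controls
50 μL of CpG fully methylated pUC19 DNA and 50 μL of CpG fully unmethylated Lambda DNA were mixed, transferred to a 100 μL shearing tube, and sheared with an M220 instrument (Covaris). During library construction, 0.001 ng of pUC19 DNA and 0.02 ng of Lambda DNA were added to the cfDNA to be tested.
2.2 Preparation of the cfDNA sample
The input amount of cfDNA sample is 5-30 ng; no shearing is required.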
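2.3 End repair
2.3.1 Mix the following reaction on ice:
Reagent    Volume
cfDNA sample (5-30 ng)    50 μL
NEBNext Ultra II End Prep Reaction Buffer    7 μL
NEBNext Ultra II End Prep Enzyme Mix    3 μL
Total volume    60 μL
2.3.2 Place the reaction in a thermal cycler and run the end repair reaction according to the table below.
Figure PCTCN2022098450-appb-000005
2.4 Adaptor ligation
2.4.1 Working on ice, add the following components to the 60 μL reaction from the previous step:
Reagent    Volume
NEBNext EM-seq Adaptor    2.5 μL
NEBNext Ultra II Ligation Master Mix    30 μL
NEBNext Ligation Enhancer    1 μL
Total volume    93.5 μL
2.4.2 Incubate at 20 °C for 15 min.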
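2.5 Post-ligation purification
2.5.1 After the previous reaction is complete, take out the sample, add 110 μL of NEBNext Sample Purification Beads, and immediately mix by pipetting.
2.5.2 Incubate at room temperature for 5 min.
2.5.3 Place the tube on a magnetic rack for 5 min until the liquid is clear, then discard the supernatant.
2.5.4 Add 200 μL of freshly prepared 80% ethanol, incubate for 30 s, then discard. Repeat the 200 μL 80% ethanol wash once.
2.5.5 Remove residual ethanol from the bottom of the tube with a 10 μL pipette and air-dry at room temperature for 3-5 min until the ethanol has evaporated completely.
2.5.6 Remove the tube from the magnetic rack, add 29 μL of Elution Buffer (NEB), and mix by vortexing. Incubate at room temperature for 1 min.
2.5.7 Centrifuge briefly, place the tube on the magnetic rack for 3 min until the liquid is clear, and transfer 28 μL to a new PCR tube.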
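2.6 Oxidation of 5-methylcytosine and 5-hydroxymethylcytosine
The following reactions were performed with the NEBNext Enzymatic Methyl-seq Kit (NEB, cat#E7120).
2.6.1 Add 400 μL of TET2 Reaction Buffer to the TET2 Reaction Buffer Supplement powder and mix thoroughly.
2.6.2 On ice, add the following components to the 28 μL of adaptor-ligated DNA from the previous step:
Reagent    Volume
TET2 Reaction Buffer (prepared in 2.6.1)    10 μL
DTT    1 μL
Oxidation Supplement    1 μL
Oxidation Enhancer    1 μL
TET2    4 μL
Total volume    17 μL
2.6.3 Dilute the 500 mM Fe(II) solution at a ratio of 1:1250. Add the diluted Fe(II) to the mixture from the previous step.
Reagent    Volume
DNA sample    45 μL
Diluted Fe(II)    5 μL
Total volume    50 μL
Mix thoroughly and incubate at 37 °C for 1 h.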
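The volumes and the Fe(II) dilution in step 2.6 can be checked with a little arithmetic. The sketch below is illustrative only; it assumes that "1:1250" means one part stock in 1,250 parts final volume (this reading is an assumption, not stated in the text), and the variable names are hypothetical.

```python
import math

# Step 2.6.2: 28 uL adaptor-ligated DNA plus the listed oxidation reagents.
ligated_dna_ul = 28.0
oxidation_reagents_ul = [10.0, 1.0, 1.0, 1.0, 4.0]  # buffer, DTT, supplement, enhancer, TET2
dna_sample_ul = ligated_dna_ul + sum(oxidation_reagents_ul)

# Step 2.6.3: dilute the 500 mM Fe(II) stock 1:1250 (assumed: 1 part in 1250 total).
stock_mM = 500.0
dilution_factor = 1250.0
working_uM = stock_mM / dilution_factor * 1000.0  # mM -> uM

# 5 uL of diluted Fe(II) goes into a 50 uL reaction.
fe_volume_ul = 5.0
reaction_volume_ul = dna_sample_ul + fe_volume_ul
final_fe_uM = working_uM * fe_volume_ul / reaction_volume_ul

print(f"DNA sample volume: {dna_sample_ul} uL")
print(f"Diluted Fe(II): {working_uM:.0f} uM; final in reaction: {final_fe_uM:.0f} uM")
```

Under that assumption the diluted Fe(II) is about 400 μM, and the 45 μL and 50 μL totals in the two tables are consistent with the reagent volumes listed.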
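2.6.4 After the reaction is complete, move the tube to ice and add 1 μL of Stop Reagent.
Reagent    Volume
Stop Reagent    1 μL
Total volume    51 μL
Mix thoroughly.
2.6.5 Incubate at 37 °C for 30 min.
Step    Temperature    Time
Stop oxidation reaction    37 °C    30 min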
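2.7 Post-oxidation purification
2.7.1 After the previous reaction is complete, take out the sample, add 90 μL of NEBNext Sample Purification Beads, and immediately mix by pipetting.
2.7.2 Incubate at room temperature for 5 min.
2.7.3 Place the tube on a magnetic rack for 5 min until the liquid is clear, then discard the supernatant.
2.7.4 Add 200 μL of freshly prepared 80% ethanol, incubate for 30 s, then discard. Repeat the 200 μL 80% ethanol wash once.
2.7.5 Remove residual ethanol from the bottom of the tube with a 10 μL pipette and air-dry at room temperature for 3-5 min until the ethanol has evaporated completely.
2.7.6 Remove the tube from the magnetic rack, add 17 μL of Elution Buffer, and mix by vortexing. Incubate at room temperature for 1 min.
2.7.7 Centrifuge briefly, place the tube on the magnetic rack for 3 min until the liquid is clear, and transfer 16 μL to a new PCR tube.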
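2.8 DNA denaturation
2.8.1 Prepare fresh 0.1 N NaOH.
2.8.2 Preheat the thermal cycler to 50 °C.
2.8.3 Add 4 μL of 0.1 N NaOH to the 16 μL of purified product from the previous step and mix thoroughly.
2.8.4 Incubate at 50 °C for 10 min.
2.8.5 Place on ice immediately after the reaction.
2.9 Cytosine deamination
2.9.1 On ice, add the following components to the 20 μL of denatured DNA from the previous step.
Figure PCTCN2022098450-appb-000006
Mix thoroughly.
2.9.2 Incubate in a thermal cycler at 37 °C for 3 h, then switch to 4 °C to stop the reaction.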
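2.10 Post-deamination purification
2.10.1 After the previous reaction is complete, take out the sample, add 100 μL of NEBNext Sample Purification Beads, and immediately mix by pipetting.
2.10.2 Incubate at room temperature for 5 min.
2.10.3 Place the tube on a magnetic rack for 5 min until the liquid is clear, then discard the supernatant.
2.10.4 Add 200 μL of freshly prepared 80% ethanol, incubate for 30 s, then discard. Repeat the 200 μL 80% ethanol wash once.
2.10.5 Remove residual ethanol from the bottom of the tube with a 10 μL pipette and air-dry at room temperature for 3-5 min until the ethanol has evaporated completely.
2.10.6 Remove the tube from the magnetic rack, add 21 μL of Elution Buffer, and mix by vortexing. Incubate at room temperature for 1 min.
2.10.7 Centrifuge briefly, place the tube on the magnetic rack for 3 min until the liquid is clear, and transfer 20 μL to a new PCR tube.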
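2.11 Library PCR amplification
2.11.1 On ice, add the following components to the 20 μL of deaminated DNA from the previous step.
Figure PCTCN2022098450-appb-000007
2.11.2 Mix thoroughly and run the following PCR program in a thermal cycler.
Figure PCTCN2022098450-appb-000008
Figure PCTCN2022098450-appb-000009
2.12 Post-PCR purification
2.12.1 After the previous reaction is complete, take out the sample, add 45 μL of NEBNext Sample Purification Beads, and immediately mix by pipetting.
2.12.2 Incubate at room temperature for 5 min.
2.12.3 Place the tube on a magnetic rack for 5 min until the liquid is clear, then discard the supernatant.
2.12.4 Add 200 μL of freshly prepared 80% ethanol, incubate for 30 s, then discard. Repeat the 200 μL 80% ethanol wash once.
2.12.5 Remove residual ethanol from the bottom of the tube with a 10 μL pipette and air-dry at room temperature for 3-5 min until the ethanol has evaporated completely.
2.12.6 Remove the tube from the magnetic rack, add 21 μL of Elution Buffer, and mix by vortexing. Incubate at room temperature for 1 min.
2.12.7 Centrifuge briefly, place the tube on the magnetic rack for 3 min until the liquid is clear, and transfer 20 μL to a new PCR tube.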
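2.13 Library quantification
The constructed library was quantified with the Qubit high-sensitivity reagent (thermoscientific cat#Q32854); libraries with a yield greater than 400 ng proceeded to sequencing.
3. Library sequencing
100 ng of the library was mixed with 10% PhiX DNA (Illumina cat#FC-110-3001) to form the loading sample, and PE100 sequencing was performed on the Novaseq 6000 (Illumina) platform.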
II. Bioinformatics analysis workflow:
1. Processing the raw FASTQ data into Bam files usable by each module
1.1 Adapter trimming
Trimmomatic-0.36 is called to remove adapter sequences from each pair of FASTQ files, which are processed as paired reads, producing the adapter-trimmed FASTQ files used in the alignment step.
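1.2 Alignment
Bismark-v0.19.0 is called to align each pair of adapter-trimmed FASTQ files, as paired reads, to the hg19 human reference genome sequence and the Lambda DNA reference genome sequence, generating the initial Bam file.
1.3 Deduplication
The deduplicate module of Bismark-v0.19.0 is called to remove duplicates from the initial Bam file, generating the deduplicated Bam file.
1.4 Sorting and read-group tagging
The sort module of SAMtools-1.3 is called to sort the deduplicated Bam file, generating the sorted Bam file. The AddOrReplaceReadGroups module of Picard-2.1.0 is then called to assign read groups to the sorted Bam file.
1.5 Filtering
The clipOverlap module of BamUtil-1.0.14 is called on the read-group-tagged Bam file to remove the overlap between paired reads, generating a Bam file. SAMtools-1.3 view is then called with the parameter "-q 20" to filter this Bam file by mapping quality, generating the final Bam file.
1.6 Indexing
The index module of SAMtools-1.3 is called to index the final Bam file, generating the bai file paired with it.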
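Steps 1.4 to 1.6 can be pictured as a chain of command lines. The following is a rough sketch that only assembles the commands as argument lists and does not run them. The function name, the file naming scheme, the `picard.jar` path and every read-group value are hypothetical placeholders; only the "-q 20" filter comes from the text above.

```python
def build_postprocessing_commands(dedup_bam, sample_id, picard_jar="picard.jar"):
    """Return the command lines (as argument lists) for sorting, read-group
    tagging, overlap clipping, MAPQ filtering and indexing one sample."""
    sorted_bam = f"{sample_id}.sorted.bam"
    rg_bam = f"{sample_id}.rg.bam"
    clipped_bam = f"{sample_id}.clipped.bam"
    final_bam = f"{sample_id}.final.bam"
    return [
        # 1.4 coordinate-sort the deduplicated Bam
        ["samtools", "sort", "-o", sorted_bam, dedup_bam],
        # 1.4 assign read groups (all RG values here are placeholders)
        ["java", "-jar", picard_jar, "AddOrReplaceReadGroups",
         f"I={sorted_bam}", f"O={rg_bam}", f"RGID={sample_id}",
         f"RGLB={sample_id}", "RGPL=ILLUMINA", f"RGPU={sample_id}",
         f"RGSM={sample_id}"],
        # 1.5 clip the overlapping part of read pairs
        ["bam", "clipOverlap", "--in", rg_bam, "--out", clipped_bam],
        # 1.5 keep alignments with mapping quality >= 20
        ["samtools", "view", "-b", "-q", "20", "-o", final_bam, clipped_bam],
        # 1.6 index the final Bam (writes the .bai next to it)
        ["samtools", "index", final_bam],
    ]


commands = build_postprocessing_commands("S1.dedup.bam", "S1")
for cmd in commands:
    print(" ".join(cmd))
```

Each list could be passed to `subprocess.run(cmd, check=True)` once the tools are installed; keeping the commands as data makes the order of the steps easy to inspect first.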
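2. Methylation density (MD) analysis (MD-KNN analysis module)
2.1 The human reference genome is divided by a non-overlapping sliding-window method into intervals (bins) of 1 Mb. After bins with poor mappability are removed, 1,846 bins remain. For each sample, the proportion of methylated sites among all CpG sites in each of these 1,846 bins is calculated; this value is the methylation density (MD) of the sample in that bin. The formula is:
$MD_{n,i} = Total\_mC_{n,i} / Total\_C_{n,i}$
where $MD_{n,i}$ is the MD value of the i-th bin of sample n, $Total\_mC_{n,i}$ is the total number of methylated Cs in the i-th bin, and $Total\_C_{n,i}$ is the total number of Cs in the i-th bin.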
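As an illustration of the MD formula, the sketch below aggregates per-site CpG counts into non-overlapping 1 Mb bins. It is a minimal sketch, not the patent's implementation: the input layout `(chrom, pos, methylated_count, total_count)`, the 0-based positions and the function name are assumptions, and bin filtering by mappability is omitted.

```python
from collections import defaultdict

BIN_SIZE = 1_000_000  # 1 Mb non-overlapping bins


def methylation_density(cpg_calls, bin_size=BIN_SIZE):
    """cpg_calls: iterable of (chrom, pos, methylated_count, total_count),
    one entry per CpG cytosine. Returns {(chrom, bin_index): MD}."""
    total_mc = defaultdict(int)
    total_c = defaultdict(int)
    for chrom, pos, methylated, total in cpg_calls:
        key = (chrom, pos // bin_size)
        total_mc[key] += methylated
        total_c[key] += total
    # MD = Total_mC / Total_C; bins without coverage are left out
    return {key: total_mc[key] / total_c[key] for key in total_c if total_c[key] > 0}


calls = [
    ("chr1", 100, 8, 10),
    ("chr1", 999_999, 2, 10),
    ("chr1", 1_000_000, 5, 5),
]
md = methylation_density(calls)
```

The first two sites fall in bin 0 of chr1 and the third in bin 1, so the example produces one MD value per bin.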
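2.2 The 1,846 MD values of each sample obtained in 2.1 are standardized to z-scores, and the Euclidean distance between samples is calculated with the philentropy package of the R language; each sample is weighted by 1/distance. The parameter K is tuned over 50 rounds of simulation. Each round uses 80% of the training samples, and for each candidate value of K the AUC is calculated from the predictions on the remaining 20% out-of-bag (OOB) samples of each of the 50 rounds; the K with the highest OOB AUC is selected.
2.3 The trained KNN (K-Nearest Neighbor) model is used to classify each sample in the test set as healthy or cancer, yielding the predicted value K. As shown in Figure 2, the area under the ROC curve (AUC) of the MD-KNN classifier reached 0.789-0.870 for detecting individual cancer types in the test set and 0.830 for detecting all seven cancer types together, showing good cancer detection performance.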
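The distance-weighted KNN vote described in 2.2 can be sketched as follows. This is a simplified stand-in for the R workflow, not a reproduction of it: the function names are hypothetical, labels are assumed to be 0 (healthy) and 1 (cancer), features are assumed to be already standardized, and the small `eps` that avoids division by zero is an added assumption.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_cancer_score(query, train_features, train_labels, k, eps=1e-12):
    """Weighted fraction of cancer (label 1) samples among the k nearest
    training samples, each weighted by 1/distance."""
    neighbours = sorted(
        (euclidean(query, feat), label)
        for feat, label in zip(train_features, train_labels)
    )[:k]
    weights = [1.0 / (dist + eps) for dist, _ in neighbours]
    cancer_weight = sum(w for w, (_, label) in zip(weights, neighbours) if label == 1)
    return cancer_weight / sum(weights)


train_x = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
train_y = [0, 0, 1, 1]
score_near_cancer = knn_cancer_score([5.0, 4.0], train_x, train_y, k=3)
score_near_healthy = knn_cancer_score([0.0, 0.5], train_x, train_y, k=2)
```

With real data each feature vector would hold the 1,846 standardized MD values of one sample, and K would be chosen by the out-of-bag AUC procedure described above.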
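3. Fragment size index (FSI) analysis (FSI-SVM analysis module)
3.1 The human reference genome is divided by a non-overlapping sliding-window method into intervals (bins) of 5 Mb. After blacklisted bins with poor mappability are removed, 502 bins remain. For each of these 502 bins, the ratio of the number of short fragments (101-167 bp) to the number of long fragments (170-250 bp) is calculated and GC-corrected with the LOESS algorithm to obtain the fragment size index (FSI) of each sample. The formula is:
$FSI_{n,i} = Total\_S_{n,i} / Total\_L_{n,i}$
where $FSI_{n,i}$ is the FSI value of the i-th bin of sample n, $Total\_S_{n,i}$ is the number of short fragments in the i-th bin, and $Total\_L_{n,i}$ is the number of long fragments in the i-th bin.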
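A short sketch of the FSI calculation follows. The length ranges and the 5 Mb bin size are taken from the text; everything else is an assumption made for illustration: fragments are given as `(chrom, start, length)`, each fragment is assigned to the bin containing its start coordinate, the length ranges are treated as inclusive, and the LOESS GC correction and bin blacklist are left out.

```python
from collections import Counter

BIN_SIZE = 5_000_000       # 5 Mb non-overlapping bins
SHORT_RANGE = (101, 167)   # bp, inclusive (assumed)
LONG_RANGE = (170, 250)    # bp, inclusive (assumed)


def fragment_size_index(fragments, bin_size=BIN_SIZE):
    """fragments: iterable of (chrom, start, length).
    Returns {(chrom, bin_index): short_count / long_count}."""
    short_counts = Counter()
    long_counts = Counter()
    for chrom, start, length in fragments:
        key = (chrom, start // bin_size)
        if SHORT_RANGE[0] <= length <= SHORT_RANGE[1]:
            short_counts[key] += 1
        elif LONG_RANGE[0] <= length <= LONG_RANGE[1]:
            long_counts[key] += 1
        # fragments outside both ranges are ignored
    return {key: short_counts[key] / long_counts[key]
            for key in long_counts if long_counts[key] > 0}


fragments = [
    ("chr1", 10, 150), ("chr1", 20, 160), ("chr1", 30, 200), ("chr1", 40, 168),
    ("chr1", 5_000_001, 120), ("chr1", 5_000_500, 180), ("chr1", 5_100_000, 250),
]
fsi = fragment_size_index(fragments)
```

The 168 bp fragment falls between the two ranges and is not counted, which mirrors the gap between 167 bp and 170 bp in the definitions.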
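3.2 An SVM (support vector machine) model is trained on the 502 FSI values of each sample with the sklearn package of Python; hyperparameters are selected by grid search with 10-fold cross-validation.
3.3 Each sample in the test set is classified as healthy or cancer, yielding the predicted value F. As shown in Figure 3, the area under the ROC curve (AUC) of the FSI-SVM classifier reached 0.874-0.933 for detecting individual cancer types in the test set and 0.904 for detecting all seven cancer types together, showing good cancer detection performance.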
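The grid search with 10-fold cross-validation mentioned in 3.2 follows a generic pattern that can be written without any machine learning library. The sketch below shows only that pattern: the hyperparameter names `C` and `gamma`, the scoring callback and the fold assignment are hypothetical, and the text does not state which kernel or grid was used. In practice the scoring function would train an SVM on the training indices and return a validation metric such as AUC.

```python
import random
from itertools import product


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for k roughly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f_i, fold in enumerate(folds) if f_i != i for j in fold]
        yield train, folds[i]


def grid_search(param_grid, n_samples, score_fn, k=10):
    """Try every parameter combination and return (best_params, mean_score).
    score_fn(params, train_idx, test_idx) must return a number (higher is better)."""
    best_params, best_score = None, float("-inf")
    names = sorted(param_grid)
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        scores = [score_fn(params, tr, te) for tr, te in k_fold_indices(n_samples, k)]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score


# Toy scoring function standing in for "train SVM, return validation AUC".
def toy_score(params, train_idx, test_idx):
    return -abs(params["C"] - 1.0) - abs(params["gamma"] - 0.1)


folds = list(k_fold_indices(25, k=10))
best_params, best_score = grid_search(
    {"C": [0.1, 1.0, 10.0], "gamma": [0.01, 0.1]}, n_samples=25, score_fn=toy_score
)
```

Every sample lands in exactly one test fold, so each hyperparameter combination is scored on all samples once across the 10 folds.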
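4. Fragment end motif analysis (Motif-SVM analysis module)
4.1 For each sample, the proportions of the 256 possible 4-mer motif sequences (all arrangements of the four bases, 4 to the power of 4) at the 5' ends of fragments are calculated. The 125 motifs that have a proportion above 0.0004 and the highest proportions in the healthy baseline are selected, as shown in Table 1 below.
Table 1
Figure PCTCN2022098450-appb-000010
Figure PCTCN2022098450-appb-000011
The motif proportions above are calculated by the following formula:
Figure PCTCN2022098450-appb-000012
where $Fraction_{n,i}$ is the proportion of the i-th 4-mer motif in sample n, and $M_{i}$ is the count of the i-th 4-mer motif.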
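The motif proportion and the selection of baseline motifs can be sketched as below. The formula itself is only available as an image in the source, so this sketch assumes the natural reading of the definitions, namely that the proportion of motif i is its count divided by the total count of all 4-mer end motifs in the sample. The function names and the input (a list of fragment 5' end sequences) are hypothetical.

```python
from collections import Counter
from itertools import product

ALL_MOTIFS = ["".join(p) for p in product("ACGT", repeat=4)]  # 256 motifs


def motif_fractions(fragment_seqs):
    """Proportion of each 4-mer at the 5' end of the given fragment sequences.
    Assumed formula: Fraction_i = M_i / sum of all M_j."""
    counts = Counter()
    for seq in fragment_seqs:
        motif = seq[:4].upper()
        if len(motif) == 4 and set(motif) <= set("ACGT"):
            counts[motif] += 1  # skip short ends and ends containing N
    total = sum(counts.values())
    return {m: (counts[m] / total if total else 0.0) for m in ALL_MOTIFS}


def top_motifs(baseline_fractions, n=125, min_fraction=0.0004):
    """Motifs above min_fraction in the healthy baseline, highest first."""
    eligible = [m for m, f in baseline_fractions.items() if f > min_fraction]
    return sorted(eligible, key=lambda m: baseline_fractions[m], reverse=True)[:n]


fractions = motif_fractions(["CCCAGT", "CCCATT", "ACGTAA", "NNNNAC", "AC"])
selected = top_motifs(fractions, n=1)
```

In a real analysis the baseline fractions would be averaged over the healthy baseline samples before `top_motifs` is applied, and the selected 125 proportions would form the feature vector of each sample.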
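4.2 An SVM model is trained with the caret package of the R language on the proportions of the 125 feature motifs in the healthy baseline and in all cancer samples of the training set; hyperparameters are selected by grid search with 10-fold cross-validation.
4.3 Each sample in the test set is classified as healthy or cancer, yielding the predicted value S. As shown in Figure 4, the area under the ROC curve (AUC) of the Motif-SVM classifier reached 0.920-0.966 for detecting individual cancer types in the test set and 0.943 for detecting all seven cancer types together, showing good cancer detection performance.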
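5. Chromosome instability (CIN) analysis (CIN-PAscore analysis module)
5.1 For each sample, the read count of each chromosome arm is calculated after GC correction with the LOESS algorithm.
5.2 The 352 healthy individuals in the training set serve as baseline samples. The read count of each chromosome arm of the test sample is converted to a z-score using the mean and standard deviation of the read counts of the corresponding arm in the baseline samples.
5.3 For the test sample, the 5 chromosome arms with the largest absolute z-score are selected, and their z-scores, together with the z-scores of the corresponding arms in the baseline samples, are used to calculate the PAscore as described in the literature (Leary et al., 2012, Sci Transl Med). The calculation is as follows.
$Z_{n,i} = (ARM_{n,i} - MEAN\_baseline_{i}) / SD\_baseline_{i}$
where $Z_{n,i}$ is the z-score of chromosome arm i of sample n relative to the baseline samples, $ARM_{n,i}$ is the read count of chromosome arm i of sample n, $MEAN\_baseline_{i}$ is the mean read count of chromosome arm i in the baseline samples, and $SD\_baseline_{i}$ is the standard deviation of the read count of chromosome arm i in the baseline samples;
the z-scores of the 5 chromosome arms of test sample n with the largest absolute z-score, together with the z-scores of the corresponding arms in the baseline samples, are taken for the subsequent analysis:
Figure PCTCN2022098450-appb-000013
where $logP_{n}$ is the negative of the sum of the logarithms of the P values of the z-scores of the 5 chromosome arms of sample n under a t-distribution with 3 degrees of freedom;
$PAscore_{n} = |logP_{n} - MEAN\_baseline_{logP}| / SD\_baseline_{logP}$
where $PAscore_{n}$ is the PAscore of sample n, $MEAN\_baseline_{logP}$ is the mean logP of the baseline samples, and $SD\_baseline_{logP}$ is the standard deviation of logP of the baseline samples.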
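The chain from arm read counts to the PAscore can be sketched in a few functions. Several details are assumptions, because the logP formula is only available as an image in the source: the P value is taken as two-sided, the logarithm as the natural logarithm, and the baseline standard deviation as the sample standard deviation. The closed-form CDF of the t-distribution with 3 degrees of freedom is used so that no external library is needed; all function names are hypothetical.

```python
import math
import sys
from statistics import mean, stdev


def t3_cdf(t):
    """CDF of Student's t-distribution with 3 degrees of freedom (closed form)."""
    x = t / math.sqrt(3.0)
    return 0.5 + (x / (1.0 + x * x) + math.atan(x)) / math.pi


def two_sided_p(z):
    """Two-sided P value of z under t(3); floored to avoid log(0)."""
    return max(2.0 * (1.0 - t3_cdf(abs(z))), sys.float_info.min)


def arm_z_scores(sample_reads, baseline_reads):
    """sample_reads: {arm: read count}; baseline_reads: {arm: [counts in baseline samples]}."""
    return {arm: (sample_reads[arm] - mean(counts)) / stdev(counts)
            for arm, counts in baseline_reads.items()}


def log_p(z_scores, top_n=5):
    """Negative sum of log P values for the top_n arms with the largest |z|."""
    top = sorted(z_scores.values(), key=abs, reverse=True)[:top_n]
    return -sum(math.log(two_sided_p(z)) for z in top)


def pa_score(sample_logp, baseline_logps):
    """PAscore = |logP - mean(baseline logP)| / sd(baseline logP)."""
    return abs(sample_logp - mean(baseline_logps)) / stdev(baseline_logps)


z = arm_z_scores({"1p": 130.0}, {"1p": [100.0, 110.0, 90.0]})
example_score = pa_score(10.0, [1.0, 2.0, 3.0])
```

To obtain the baseline logP values, `log_p` would be applied to each baseline sample's own arm z-scores in the same way as for the test sample.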
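5.4 As shown in Figure 5, the AUC of the CIN-PAscore algorithm reached 0.770-0.854 for detecting individual cancer types in the test set and 0.812 for detecting all seven cancer types together.
6. Construction of the integrated model classifier (SVM-ensemble classification module)
6.1 The MD-KNN, FSI-SVM, Motif-SVM and CIN-PAscore values obtained above for each sample (i.e. the predicted values K, F, S and PAscore) are used as the features for model training.
6.2 A linear SVM model is trained with the caret package of the R language; hyperparameters are selected by grid search with 10-fold cross-validation. The trained model is used to predict each sample in the test set, yielding the single predicted value Z of the likelihood that the sample is cancerous.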
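Once trained, a linear SVM reduces to a weighted sum of its input features plus a bias. The sketch below shows how the four per-sample scores would be combined under that form. The weights and bias are made-up illustrative numbers, not values from the patent, and the mapping from the decision value to the reported predicted value Z is not specified in the text.

```python
import math


def linear_svm_decision(features, weights, bias):
    """Decision value of a linear SVM: w . x + b."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def ensemble_decision(k, f, s, pascore, weights, bias):
    """Combine the MD-KNN (K), FSI-SVM (F), Motif-SVM (S) and CIN-PAscore values."""
    return linear_svm_decision([k, f, s, pascore], weights, bias)


# Hypothetical trained parameters, for illustration only.
weights = [1.0, 1.0, 1.0, 0.1]
bias = -1.5

cancer_like = ensemble_decision(0.9, 0.8, 0.95, 4.0, weights, bias)
healthy_like = ensemble_decision(0.1, 0.2, 0.1, 0.5, weights, bias)
```

A positive decision value places the sample on the cancer side of the separating hyperplane and a negative value on the healthy side; a score cutoff can then be chosen to reach a target specificity.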
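6.3 As shown in Figure 6, the integrated model classifier of the present invention has an AUC of 0.934-0.971 for detecting individual cancer types in the test set and an AUC of 0.952 for detecting all seven cancer types together. Its performance exceeds that of any single genetic or epigenetic feature classifier, demonstrating the advantage of multi-dimensional integrated analysis of carcinogenesis data over single-omics analysis.
6.4 As shown in Table 2, at 95% specificity the detection sensitivity of the integrated model classifier is above 60% for each of the seven cancer types in the test set and can reach 75% for early cancer (stage I or II), showing good detection performance across cancer types and considerable potential for early cancer screening.
Table 2. Detection sensitivity of the integrated classification module of the present invention for each cancer type and each stage in the validation set at 95% specificity.
Figure PCTCN2022098450-appb-000014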

Claims (20)

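  1. A genomic carcinogenesis information detection system based on cell-free DNA, comprising:
    a library construction device, which uses enzymes to convert 5-methylcytosine (5-mC) in the cell-free DNA of a test sample (for example, cell-free DNA in plasma) to 5-formylcytosine (5-fC) and 5-carboxycytosine (5-caC), and unmethylated cytosine (C) to uracil (U), for library construction;
    a sequencing device for sequencing the constructed library; and
    an information analysis device comprising one or more of the following modules:
    a methylation analysis module for analyzing the methylation information of the cell-free DNA,
    a fragment size index analysis module for analyzing the fragmentation information of the cell-free DNA,
    an end motif analysis module for analyzing the fragmentation information of the cell-free DNA, and
    a chromosome instability analysis module for analyzing chromosomal copy number variation information.
  2. The system according to claim 1, wherein the information analysis device further comprises an integrated classification module for integrating the information obtained by the methylation analysis module, the fragment size index analysis module, the end motif analysis module and/or the chromosome instability analysis module.
  3. The system according to claim 2, wherein:
    the methylation analysis module is an MD-KNN analysis module, which divides the human reference genome into intervals (for example 1 Mb in size) by a non-overlapping sliding-window method, calculates for each interval the proportion of methylated sites among all CpG sites, i.e. the methylation density (MD) value, and calculates a predicted value K of the likelihood of carcinogenesis with a KNN model;
    the fragment size index analysis module is an FSI-SVM analysis module, which divides the human reference genome into intervals (for example 5 Mb in size) by a non-overlapping sliding-window method, calculates for each interval the ratio of the number of short fragments (for example 101-167 bp) to the number of long fragments (for example 170-250 bp) to obtain the fragment size index (FSI) values of each sample, and calculates a predicted value F of the likelihood of carcinogenesis with an SVM model;
    the end motif analysis module is a Motif-SVM analysis module, which calculates the proportions of the 4-mer motif sequences at the 5' ends of the sample's fragments and calculates a predicted value S of the likelihood of carcinogenesis with an SVM model;
    the chromosome instability analysis module is a CIN-PAscore analysis module, which calculates the copy number of every chromosome arm of the sample and calculates the PAscore value by integrating the z-scores of the five chromosome arms whose copy number differs most from that of the corresponding arms in the healthy baseline samples;
    the integrated classification module is an SVM-ensemble classification module, which integrates the predicted values K, F, S and PAscore with a linear SVM model to obtain a final single predicted value Z of the likelihood of carcinogenesis.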
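  4. The system according to any one of the preceding claims, wherein the library construction device comprises:
    a plasma cell-free DNA extraction module for extracting cell-free DNA (cfDNA) from a plasma sample;
    an enzyme reaction module, which uses enzymes to convert 5-methylcytosine (5-mC) in the cell-free DNA to 5-formylcytosine (5-fC) and 5-carboxycytosine (5-caC), and unmethylated cytosine (C) to uracil (U);
    a PCR reaction module, which amplifies the enzyme-treated cell-free DNA by PCR.
  5. The system according to any one of the preceding claims, wherein the enzymes are the TET2 enzyme and the APOBEC enzyme.
  6. The system according to any one of the preceding claims, wherein the sequencing device is selected from Illumina Novaseq 6000, Illumina Nextseq500, MGI DNBSEQ-T7 or MGI SEQ-2000.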
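  7. The system according to claim 3, wherein the MD value in the MD-KNN analysis module is calculated by the following formula:
    $MD_{n,i} = Total\_mC_{n,i} / Total\_C_{n,i}$
    where $MD_{n,i}$ is the MD value of the i-th bin of sample n, $Total\_mC_{n,i}$ is the total number of methylated Cs in the i-th bin, and $Total\_C_{n,i}$ is the total number of Cs in the i-th bin.
  8. The system according to claim 3, wherein the FSI value in the FSI-SVM analysis module is calculated by the following formula:
    $FSI_{n,i} = Total\_S_{n,i} / Total\_L_{n,i}$
    where $FSI_{n,i}$ is the FSI value of the i-th bin of sample n, $Total\_S_{n,i}$ is the number of short fragments in the i-th bin, and $Total\_L_{n,i}$ is the number of long fragments in the i-th bin.
  9. The system according to claim 3, wherein the motif proportion in the Motif-SVM analysis module is calculated by the following formula:
    Figure PCTCN2022098450-appb-100001
    where $Fraction_{n,i}$ is the proportion of the i-th 4-mer motif in sample n, and $M_{i}$ is the count of the i-th 4-mer motif.
  10. The system according to claim 3, wherein the PAscore in the CIN-PAscore analysis module is calculated by the following formulas:
    $Z_{n,i} = (ARM_{n,i} - MEAN\_baseline_{i}) / SD\_baseline_{i}$
    where $Z_{n,i}$ is the z-score of chromosome arm i of sample n relative to the baseline samples, $ARM_{n,i}$ is the read count of chromosome arm i of sample n, $MEAN\_baseline_{i}$ is the mean read count of chromosome arm i in the baseline samples, and $SD\_baseline_{i}$ is the standard deviation of the read count of chromosome arm i in the baseline samples;
    the z-scores of the 5 chromosome arms of test sample n with the largest absolute z-score, together with the z-scores of the corresponding arms in the baseline samples, are taken for the following analysis:
    Figure PCTCN2022098450-appb-100002
    where $logP_{n}$ is the negative of the sum of the logarithms of the P values of the z-scores of the 5 chromosome arms of sample n under a t-distribution with 3 degrees of freedom;
    $PAscore_{n} = |logP_{n} - MEAN\_baseline_{logP}| / SD\_baseline_{logP}$
    where $PAscore_{n}$ is the PAscore of sample n, $MEAN\_baseline_{logP}$ is the mean logP of the baseline samples, and $SD\_baseline_{logP}$ is the standard deviation of logP of the baseline samples.
  11. The system according to any one of the preceding claims, wherein the information analysis device comprises a data preprocessing module, which converts the raw FASTQ data from the sequencing device into Bam files usable by each module and indexes them.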
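  12. A detection method of genomic carcinogenesis information based on cell-free DNA, which is carried out using the system of any one of claims 1-11 above.
  13. A detection method of genomic carcinogenesis information based on cell-free DNA, comprising:
    library construction, in which enzymes are used to convert 5-methylcytosine (5-mC) in the cell-free DNA of a test sample (for example, cell-free DNA in plasma) to 5-formylcytosine (5-fC) and 5-carboxycytosine (5-caC), and unmethylated cytosine (C) to uracil (U), for library construction;
    whole-genome sequencing, in which the constructed library is sequenced; and
    sequencing information analysis, comprising one or more of the following analysis steps:
    methylation analysis, for analyzing the methylation information of the cell-free DNA,
    fragment size index analysis, for analyzing the fragmentation information of the cell-free DNA,
    end motif analysis, for analyzing the fragmentation information of the cell-free DNA, and
    chromosome instability analysis, for analyzing chromosomal copy number variation information.
  14. The method according to claim 13, wherein the sequencing information analysis further comprises an integrated classification step for integrating the information obtained by the methylation analysis, fragment size index analysis, end motif analysis and/or chromosome instability analysis.
  15. The method according to claim 14, wherein
    the methylation analysis comprises dividing the human reference genome into intervals (for example 1 Mb in size) by a non-overlapping sliding-window method, calculating for each interval the proportion of methylated sites among all CpG sites, i.e. the methylation density (MD) value, and calculating a predicted value K of the likelihood of carcinogenesis with a KNN model;
    the fragment size index analysis comprises dividing the human reference genome into intervals (for example 5 Mb in size) by a non-overlapping sliding-window method, calculating for each interval the ratio of the number of short fragments (for example 101-167 bp) to the number of long fragments (for example 170-250 bp) to obtain the fragment size index (FSI) values of each sample, and calculating a predicted value F of the likelihood of carcinogenesis with an SVM model;
    the end motif analysis comprises calculating the proportions of the 4-mer motif sequences at the 5' ends of the sample's fragments and calculating a predicted value S of the likelihood of carcinogenesis with an SVM model;
    the chromosome instability analysis comprises calculating the copy number of every chromosome arm of the sample and calculating the PAscore value by integrating the z-scores of the five chromosome arms whose copy number differs most from that of the corresponding arms in the healthy baseline samples;
    the integrated classification comprises integrating the predicted values K, F, S and PAscore with a linear SVM model to obtain a final single predicted value Z of the likelihood of carcinogenesis.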
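  16. The method according to any one of claims 13-15, wherein the library construction comprises:
    extracting cell-free DNA (cfDNA) from a plasma sample;
    an enzyme reaction step, in which enzymes are used to convert 5-methylcytosine (5-mC) in the cell-free DNA to 5-formylcytosine (5-fC) and 5-carboxycytosine (5-caC), and unmethylated cytosine (C) to uracil (U); and
    PCR amplification, in which the enzyme-treated cell-free DNA is amplified by PCR.
  17. The method according to any one of claims 13-16, wherein the enzymes are the TET2 enzyme and the APOBEC enzyme.
  18. The method according to any one of claims 13-17, wherein the sequencing is performed on Illumina Novaseq 6000, Illumina Nextseq500, MGI DNBSEQ-T7 or MGI SEQ-2000.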
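  19. The method according to claim 15, wherein the MD value is calculated by the following formula:
    $MD_{n,i} = Total\_mC_{n,i} / Total\_C_{n,i}$
    where $MD_{n,i}$ is the MD value of the i-th bin of sample n, $Total\_mC_{n,i}$ is the total number of methylated Cs in the i-th bin, and $Total\_C_{n,i}$ is the total number of Cs in the i-th bin;
    the FSI value is calculated by the following formula:
    $FSI_{n,i} = Total\_S_{n,i} / Total\_L_{n,i}$
    where $FSI_{n,i}$ is the FSI value of the i-th bin of sample n, $Total\_S_{n,i}$ is the number of short fragments in the i-th bin, and $Total\_L_{n,i}$ is the number of long fragments in the i-th bin;
    the motif proportion is calculated by the following formula:
    Figure PCTCN2022098450-appb-100003
    where $Fraction_{n,i}$ is the proportion of the i-th 4-mer motif in sample n, and $M_{i}$ is the count of the i-th 4-mer motif;
    the PAscore is calculated by the following formulas:
    $Z_{n,i} = (ARM_{n,i} - MEAN\_baseline_{i}) / SD\_baseline_{i}$
    where $Z_{n,i}$ is the z-score of chromosome arm i of sample n relative to the baseline samples, $ARM_{n,i}$ is the read count of chromosome arm i of sample n, $MEAN\_baseline_{i}$ is the mean read count of chromosome arm i in the baseline samples, and $SD\_baseline_{i}$ is the standard deviation of the read count of chromosome arm i in the baseline samples,
    the z-scores of the 5 chromosome arms of test sample n with the largest absolute z-score, together with the z-scores of the corresponding arms in the baseline samples, are taken for the following analysis:
    Figure PCTCN2022098450-appb-100004
    where $logP_{n}$ is the negative of the sum of the logarithms of the P values of the z-scores of the 5 chromosome arms of sample n under a t-distribution with 3 degrees of freedom,
    $PAscore_{n} = |logP_{n} - MEAN\_baseline_{logP}| / SD\_baseline_{logP}$
    where $PAscore_{n}$ is the PAscore of sample n, $MEAN\_baseline_{logP}$ is the mean logP of the baseline samples, and $SD\_baseline_{logP}$ is the standard deviation of logP of the baseline samples.
  20. The method according to any one of claims 13-19, wherein the information analysis further comprises data preprocessing, in which the raw FASTQ data from the sequencing device are converted into Bam files usable by each module and indexed.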
PCT/CN2022/098450 2022-01-07 2022-06-13 Detection system and detection method of genomic carcinogenesis information based on cell-free DNA WO2023130670A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US18/052,067 US20240060137A1 (en) 2022-01-07 2022-11-02 Detection system and detection method of genomic carcinogenesis information based on cell-free dna

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210023902.1 2022-01-07
CN202210023902.1A CN114045345B (zh) 2022-01-07 2022-01-07 Detection system and detection method of genomic carcinogenesis information based on cell-free DNA

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/052,067 Continuation US20240060137A1 (en) 2022-01-07 2022-11-02 Detection system and detection method of genomic carcinogenesis information based on cell-free dna

Publications (1)

Publication Number Publication Date
WO2023130670A1 (zh) 2023-07-13

Family

ID=80213508

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/098450 WO2023130670A1 (zh) 2022-01-07 2022-06-13 Detection system and detection method of genomic carcinogenesis information based on cell-free DNA

Country Status (3)

Country Link
US (1) US20240060137A1 (zh)
CN (1) CN114045345B (zh)
WO (1) WO2023130670A1 (zh)

Also Published As

Publication number Publication date
CN114045345B (zh) 2022-04-29
CN114045345A (zh) 2022-02-15
US20240060137A1 (en) 2024-02-22

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22918126

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE