CN116779025A - System for cancer screening - Google Patents

Publication number
CN116779025A
Authority
CN
China
Prior art keywords
cfdna
sequencing data
sequencing
data
chromosome
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210228206.4A
Other languages
Chinese (zh)
Inventor
彭勇飞
杨亚东
李永君
王小齐
郭媛媛
田继超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Biochain Beijing Science and Technology Inc
Original Assignee
Biochain Beijing Science and Technology Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Biochain Beijing Science and Technology Inc filed Critical Biochain Beijing Science and Technology Inc
Priority to CN202210228206.4A priority Critical patent/CN116779025A/en
Publication of CN116779025A publication Critical patent/CN116779025A/en
Pending legal-status Critical Current

Landscapes

  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The present invention provides a system for cancer screening comprising: a data acquisition module for acquiring the methylation level of a target region of a subject and cfDNA-related features; and a cancer calculation module that predicts whether the subject suffers from cancer based on the methylation level and cfDNA-related features acquired by the data acquisition module. The system is built by jointly using indicators of methylation level and cfDNA features; it can greatly reduce the cost of cancer screening, improve screening accuracy, and achieve high sensitivity and specificity.

Description

System for cancer screening
Technical Field
The present invention relates to a system for cancer screening.
Background
Lung cancer is one of the cancers with the highest morbidity and mortality worldwide, with a 5-year survival rate below 20%. In China, lung cancer also ranks first in both incidence and mortality. The main reason is that lung cancer is usually diagnosed at a late stage, when treatment is far less effective than at an early stage. Conventional screening means for lung cancer include low-dose helical CT (LDCT) and protein markers such as carcinoembryonic antigen (CEA), squamous cell carcinoma antigen (SCC), and neuron-specific enolase (NSE). However, these conventional means are inconsistent in sensitivity and specificity. DNA methylation has been shown to be tissue-specific and useful for early cancer detection, and the primary tumor site can be traced from the methylation profile of circulating tumor DNA (ctDNA).
Liquid biopsy analyzes cancer-derived components in blood to support clinical applications such as early screening, molecular typing, prognosis, medication guidance, and recurrence monitoring. As a new precision-medicine technology, liquid biopsy can qualitatively and quantitatively detect tumor cells and tumor-related DNA. It is non-invasive, convenient to sample, and suitable for real-time monitoring, and therefore plays an increasingly important role in tumor diagnosis and treatment.
Studies have demonstrated that cfDNA molecules are not random fragments. Cell death in a specific tissue changes the distribution of DNA fragments from the affected tissue, and such tissue-of-origin analysis can trace the primary tumor site for early diagnosis of cancer. In addition, cfDNA fragment profiles can comprehensively reflect genomic and chromatin characteristics, thereby identifying a large number of tumor-derived changes in circulation.
Whole-genome bisulfite sequencing (WGBS) can detect both methylation level information and cfDNA fragment profiles, and thus provides a new strategy for tumor detection that combines methylation level and fragment features from WGBS data.
Disclosure of Invention
In view of the problems of the prior art, it is an object of the present invention to provide a system for cancer screening.
Specifically, the invention provides the following technical solutions:
1. A system for cancer screening, comprising:
a data acquisition module for acquiring the methylation level of a target region of a subject and cfDNA-related features; and
a cancer calculation module that predicts whether the subject suffers from cancer based on the methylation level and cfDNA-related features acquired by the data acquisition module.
2. The system according to item 1, wherein,
the data acquisition module comprises a sequencing module, a methylation level analysis module and a cfDNA related characteristic extraction module,
the sequencing module is used for carrying out whole genome sequencing on cfDNA of a subject,
the methylation level analysis module is used for obtaining sequencing data from the sequencing module to analyze the methylation level of the target region, and the cfDNA related feature extraction module is used for extracting related features of cfDNA sequencing data from the sequencing data obtained from the sequencing module.
3. The system according to item 1, wherein,
the target region includes any one or two or more of the following regions:
position 151445000-151450000 of chromosome 1,
position 191183500-191188500 of chromosome 2,
position 191184000-191189000 of chromosome 2,
position 68566500-68571500 of chromosome 4, or
position 30601500-30606500 of chromosome 11.
4. The system according to item 1, wherein,
the methylation level of the target region is calculated based on the methylation level of each CG site in the target region, where the methylation level of a CG site is the ratio of the number of cytosines detected as methylated to the sum of the numbers of cytosines detected as methylated and as unmethylated, over all sequencing results covering that site.
5. The system according to item 2, wherein,
the cfDNA-related features include:
the average of all fragment sizes obtained in cfDNA sequencing data,
the mode of all fragment sizes obtained in cfDNA sequencing data,
the average coverage in cfDNA sequencing data,
the correlation coefficient between a pre-stored average vector of short-fragment coverage in the corresponding intervals of healthy persons and the short-fragment coverage obtained from cfDNA sequencing data of the subject,
the correlation coefficient between a pre-stored average vector of long-fragment coverage in the corresponding intervals of healthy persons and the long-fragment coverage obtained from cfDNA sequencing data of the subject,
and the correlation coefficient between a pre-stored average vector of short-to-long fragment ratios in the corresponding intervals of healthy persons and the short-to-long fragment ratio obtained from cfDNA sequencing data of the subject.
6. The system according to item 5, wherein,
the pre-stored average vector of short-fragment coverage in the corresponding intervals of healthy persons refers to the average short-fragment coverage of each corresponding interval, calculated from cfDNA sequencing data of known healthy persons and provided to the data acquisition module for calculation;
the pre-stored average vector of long-fragment coverage in the corresponding intervals of healthy persons refers to the average long-fragment coverage of each corresponding interval, calculated from cfDNA sequencing data of known healthy persons and provided to the data acquisition module for calculation;
the pre-stored average vector of short-to-long fragment ratios in the corresponding intervals of healthy persons refers to the average short-to-long fragment ratio of each corresponding interval, calculated from cfDNA sequencing data of known healthy persons and provided to the data acquisition module for calculation.
7. The system according to item 5, wherein,
the average value of all fragment sizes obtained in cfDNA sequencing data refers to the ratio of the sum of all fragment sizes obtained in cfDNA sequencing data of a subject to the number of all fragments;
the average coverage obtained in cfDNA sequencing data refers to the number of all fragments obtained from cfDNA sequencing data of the subject;
the short fragment coverage obtained in cfDNA sequencing data refers to the number of short fragments obtained from cfDNA sequencing data of a subject;
the long fragment coverage obtained in cfDNA sequencing data refers to the number of long fragments obtained from cfDNA sequencing data of a subject.
8. The system according to item 5, wherein,
the length of the long fragment is 201-320 bp, and the length of the short fragment is 150-200 bp.
9. The system according to item 2, wherein,
cfDNA sequencing data is cfDNA sequencing data after removal of low quality sequencing fragments.
10. The system according to item 9, wherein,
cfDNA sequencing data is sequencing data after removal of low quality sequencing fragments, further excluding sequencing data in the low-alignment interval.
11. The system according to item 1, wherein,
in the cancer calculation module, a model fitted on the methylation levels and cfDNA-related feature data of known samples is pre-stored for predicting whether the subject suffers from cancer,
the model is a gradient boosting tree model fitted on the methylation levels and cfDNA-related feature data of known samples and selected by 5-fold cross-validation.
12. The system according to item 1, wherein,
The system further includes a bisulfite treatment module for bisulfite treatment of cfDNA of a subject.
13. A method for cancer screening, comprising:
a data acquisition step of acquiring the methylation level of a target region of a subject and cfDNA-related features; and
a cancer calculation step of predicting whether the subject suffers from cancer based on the methylation level and cfDNA-related features acquired in the data acquisition step.
14. The method according to item 13, wherein,
the data acquisition step comprises a sequencing step, a methylation level analysis step and a cfDNA related characteristic extraction step,
the sequencing step is used to perform whole genome sequencing on cfDNA of a subject,
the methylation level analysis step is for obtaining sequencing data from the sequencing step to analyze the methylation level of the target region, and the cfDNA-related feature extraction step is for extracting the related features of cfDNA sequencing data from the sequencing data obtained from the sequencing step.
15. The method according to item 13, wherein,
the target region includes any one or two or more of the following regions:
position 151445000-151450000 of chromosome 1,
position 191183500-191188500 of chromosome 2,
position 191184000-191189000 of chromosome 2,
position 68566500-68571500 of chromosome 4, or
position 30601500-30606500 of chromosome 11.
16. The method according to item 13, wherein,
the methylation level of the target region is calculated based on the methylation level of each CG site in the target region, where the methylation level of a CG site is the ratio of the number of cytosines detected as methylated to the sum of the numbers of cytosines detected as methylated and as unmethylated, over all sequencing results covering that site.
17. The method of item 14, wherein,
the cfDNA-related features include:
the average of all fragment sizes obtained in cfDNA sequencing data,
the mode of all fragment sizes obtained in cfDNA sequencing data,
the average coverage in cfDNA sequencing data,
the correlation coefficient between a pre-stored average vector of short-fragment coverage in the corresponding intervals of healthy persons and the short-fragment coverage obtained from cfDNA sequencing data of the subject,
the correlation coefficient between a pre-stored average vector of long-fragment coverage in the corresponding intervals of healthy persons and the long-fragment coverage obtained from cfDNA sequencing data of the subject,
and the correlation coefficient between a pre-stored average vector of short-to-long fragment ratios in the corresponding intervals of healthy persons and the short-to-long fragment ratio obtained from cfDNA sequencing data of the subject.
18. The method according to item 17, wherein,
the pre-stored average vector of short-fragment coverage in the corresponding intervals of healthy persons refers to the average short-fragment coverage of each corresponding interval, calculated from cfDNA sequencing data of known healthy persons and provided to the data acquisition step for calculation;
the pre-stored average vector of long-fragment coverage in the corresponding intervals of healthy persons refers to the average long-fragment coverage of each corresponding interval, calculated from cfDNA sequencing data of known healthy persons and provided to the data acquisition step for calculation;
the pre-stored average vector of short-to-long fragment ratios in the corresponding intervals of healthy persons refers to the average short-to-long fragment ratio of each corresponding interval, calculated from cfDNA sequencing data of known healthy persons and provided to the data acquisition step for calculation.
19. The method according to item 17, wherein,
the average value of all fragment sizes obtained in cfDNA sequencing data refers to the ratio of the sum of all fragment sizes obtained in cfDNA sequencing data of a subject to the number of all fragments;
the average coverage obtained in cfDNA sequencing data refers to the number of all fragments obtained from cfDNA sequencing data of the subject;
the short fragment coverage obtained in cfDNA sequencing data refers to the number of short fragments obtained from cfDNA sequencing data of a subject;
the long fragment coverage obtained in cfDNA sequencing data refers to the number of long fragments obtained from cfDNA sequencing data of a subject.
20. The method according to item 17, wherein,
the length of the long fragment is 201-320 bp, and the length of the short fragment is 150-200 bp.
21. The method of item 14, wherein,
cfDNA sequencing data is cfDNA sequencing data after removal of low quality sequencing fragments.
22. The method of item 21, wherein,
cfDNA sequencing data is sequencing data after removal of low quality sequencing fragments, further excluding sequencing data in the low-alignment interval.
23. The method according to item 13, wherein,
in the cancer calculation step, a model fitted on the methylation levels and cfDNA-related feature data of known samples is pre-stored for predicting whether the subject suffers from cancer,
the model is a gradient boosting tree model fitted on the methylation levels and cfDNA-related feature data of known samples and selected by 5-fold cross-validation.
24. The method according to item 13, wherein,
the method further comprises a bisulfite treatment step for bisulfite treatment of cfDNA of a subject.
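The per-site methylation level defined in items 4 and 16 can be illustrated with a minimal Python sketch. The items state only that the region-level value is "calculated based on" the per-site levels; taking the arithmetic mean over covered CG sites is an assumption made here for illustration, and the function names and read counts are hypothetical.

```python
from statistics import mean


def cg_site_methylation(methylated_reads: int, unmethylated_reads: int) -> float:
    """Methylation level of one CG site: methylated / (methylated + unmethylated)."""
    total = methylated_reads + unmethylated_reads
    if total == 0:
        raise ValueError("site has no covering reads")
    return methylated_reads / total


def region_methylation(site_counts):
    """Region-level methylation.

    ASSUMPTION: the region value is the mean of the per-site levels over
    CG sites with at least one covering read. The source does not fix
    this aggregation.
    """
    levels = [cg_site_methylation(m, u) for m, u in site_counts if m + u > 0]
    if not levels:
        raise ValueError("no covered CG sites in region")
    return mean(levels)


# Hypothetical (methylated, unmethylated) read counts for three CG sites.
counts = [(8, 2), (5, 5), (0, 10)]
print(region_methylation(counts))
```

Sites without any covering reads are skipped rather than counted as zero, so that low coverage is not mistaken for hypomethylation.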
Advantageous Effects of Invention
The system of the invention is built by jointly using indicators of methylation level and cfDNA features; it can greatly reduce the cost of cancer screening, improve screening accuracy, and achieve high sensitivity and specificity.
Drawings
Various other advantages and benefits of the present invention will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. It is evident that the figures described below are only some embodiments of the invention, from which other figures can be obtained without inventive effort for a person skilled in the art. Also, like reference numerals are used to designate like parts throughout the figures.
FIG. 1 is the ROC curve on the training set for the model built on the 5 regions;
FIG. 2 is the ROC curve on the test set for the model built on the 5 regions;
FIG. 3 is a graph of the results of the cancer screening model constructed from methylation levels;
FIG. 4 is a graph of the results of the cancer screening model constructed from cfDNA features;
FIG. 5 is a graph of the results of the cancer screening model constructed from methylation levels and cfDNA features.
Detailed Description
Specific embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While specific embodiments of the invention are shown in the drawings, it should be understood that the invention may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
It should be noted that certain terms are used throughout the description and claims to refer to particular components. Those skilled in the art will understand that different names may be used for the same component. The description and claims do not distinguish components by differences in name, but by differences in function. As used throughout the specification and claims, the terms "include" and "comprise" are open-ended and should be interpreted to mean "including, but not limited to". The description hereinafter sets forth preferred embodiments for practicing the invention; it is given to illustrate the general principles of the invention and is not intended to limit its scope. The scope of the invention is defined by the appended claims.
Definitions
Unless specifically defined elsewhere herein, all other technical and scientific terms used herein have the meaning commonly understood by one of ordinary skill in the art to which this application belongs.
Methylation
Methylation is an important modification of proteins and nucleic acids and a major topic in epigenetics. It regulates the expression and silencing of genes and is closely related to many diseases such as cancer, aging, and senile dementia. The most common methylation modifications are DNA methylation and histone methylation.
DNA methylation refers to the methylation of the carbon-5 atom of cytosine in CpG dinucleotides. As a stable modification, it can be passed on to newly synthesized daughter DNA during DNA replication under the action of DNA methyltransferases, and is therefore an important epigenetic mechanism. Methylation of a gene promoter region can silence the transcription of tumor suppressor genes, so DNA methylation is closely related to tumorigenesis. Aberrant methylation includes hypermethylation of tumor suppressor genes and DNA repair genes, hypomethylation of repetitive-sequence DNA, and loss of imprinting of certain genes, all of which are associated with the occurrence of a variety of tumors.
Herein, the ROC curve reflects, to some extent, the classification performance of a classifier. AUC is the area under the ROC curve and intuitively summarizes the classification ability that the ROC curve expresses.
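As an illustration of what AUC measures, the sketch below computes it from scores and labels using the rank-based (Mann-Whitney) formulation: the fraction of positive/negative pairs in which the positive sample receives the higher score, with ties counted as one half. This quantity equals the area under the ROC curve. The function name and the example scores are hypothetical and are not taken from the figures of the invention.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for cancer (positive), 0 for healthy (negative).
    Quadratic in the number of samples; adequate for a small illustration.
    """
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical model scores for two cancer and two healthy samples.
print(auc([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0]))
```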
Whole-genome methylation sequencing
Whole-genome bisulfite sequencing (WGBS) is considered the "gold standard" for methylation sequencing. Bisulfite treatment converts unmethylated C bases into U, which is read as T after PCR amplification, so that these bases can be distinguished from C bases that carried a methylation modification. Combining high-throughput sequencing with alignment to a reference sequence then makes it possible to determine whether each CpG/CHG/CHH site is methylated.
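The bisulfite principle described above can be mimicked in silico. The following is a simplified sketch under stated assumptions: only the forward strand is modeled, conversion is complete, there are no sequencing errors, and the read has the same length and coordinates as the reference. The function names are hypothetical.

```python
def bisulfite_convert(seq: str, methylated_positions) -> str:
    """Simulate bisulfite treatment followed by PCR on the forward strand.

    Unmethylated C is read as T; methylated C stays C.
    """
    methylated = set(methylated_positions)
    converted = []
    for i, base in enumerate(seq.upper()):
        if base == "C" and i not in methylated:
            converted.append("T")
        else:
            converted.append(base)
    return "".join(converted)


def call_methylation(reference: str, read: str) -> dict:
    """For every C in the reference, report whether the aligned read still shows C.

    ASSUMPTION: read and reference are already aligned base-for-base.
    """
    return {
        i: read[i] == "C"
        for i, base in enumerate(reference.upper())
        if base == "C"
    }


reference = "ACGTCGCA"
read = bisulfite_convert(reference, methylated_positions={1})
print(read)
print(call_methylation(reference, read))
```

A real pipeline would instead rely on a bisulfite-aware aligner and handle both strands; the sketch only shows why a retained C indicates methylation.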
Specificity
Specificity refers to the proportion of negative test results among samples from subjects without the specified clinical condition.
Sensitivity
Sensitivity refers to the proportion of positive test results among samples from subjects with the specified clinical condition.
PPV
Positive predictive value: the proportion of truly positive persons among those predicted to be positive.
NPV
Negative predictive value: the proportion of truly negative persons among those predicted to be negative.
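The four indices can be computed from the counts of a confusion matrix. The sketch below uses the standard definitions (sensitivity and specificity are conditioned on the true status, PPV and NPV on the predicted status); the function name and the counts are hypothetical.

```python
def screening_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Sensitivity, specificity, PPV and NPV from confusion-matrix counts.

    tp/fn: cancer samples predicted positive/negative.
    tn/fp: non-cancer samples predicted negative/positive.
    """
    def ratio(numerator: int, denominator: int) -> float:
        if denominator == 0:
            raise ValueError("metric undefined: empty denominator")
        return numerator / denominator

    return {
        "sensitivity": ratio(tp, tp + fn),  # positives found among true cancers
        "specificity": ratio(tn, tn + fp),  # negatives found among true non-cancers
        "ppv": ratio(tp, tp + fp),          # true cancers among predicted positives
        "npv": ratio(tn, tn + fn),          # true non-cancers among predicted negatives
    }


# Hypothetical counts, not results of the invention.
print(screening_metrics(tp=40, fp=10, tn=90, fn=10))
```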
Mode
The mode is the value that occurs most frequently in a set of data. The mode of all fragment sizes obtained in cfDNA sequencing data is the fragment size that occurs most often in the cfDNA sequencing data.
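A minimal sketch of the fragment-size features follows. It uses the length ranges given in items 8 and 20 (short: 150-200 bp, long: 201-320 bp). Treating the short-to-long ratio as the short count divided by the long count is an assumption, and the function name and fragment sizes are hypothetical.

```python
from statistics import mean, mode

SHORT_RANGE = (150, 200)  # bp, inclusive, per items 8 and 20
LONG_RANGE = (201, 320)   # bp, inclusive, per items 8 and 20


def fragment_features(sizes):
    """Mean size, mode size, and short/long fragment counts for one sample."""
    short = sum(SHORT_RANGE[0] <= s <= SHORT_RANGE[1] for s in sizes)
    long_ = sum(LONG_RANGE[0] <= s <= LONG_RANGE[1] for s in sizes)
    return {
        "mean_size": mean(sizes),
        "mode_size": mode(sizes),
        "short_count": short,
        "long_count": long_,
        # ASSUMPTION: ratio = short count / long count.
        "short_long_ratio": short / long_ if long_ else float("nan"),
    }


# Hypothetical fragment lengths in bp.
print(fragment_features([166, 166, 167, 180, 210, 250, 145]))
```

In practice the counts would be taken per genomic interval, giving the coverage vectors that are compared with the healthy reference.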
Correlation coefficient
The correlation coefficient is a quantity representing the degree of correlation between variables. The correlation coefficient used in the present invention is the Pearson correlation coefficient. The Pearson correlation coefficient, also known as the Pearson product-moment correlation coefficient (PPMCC or PCC), measures the linear correlation between two variables X and Y and takes a value between -1 and 1. A coefficient of 1 means that X and Y are perfectly described by a straight-line equation: all data points fall on a straight line, and Y increases as X increases. A coefficient of -1 means that all data points fall on a straight line and Y decreases as X increases. A coefficient of 0 means that there is no linear relationship between the two variables.
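For example, the correlation feature between a pre-stored healthy average vector and a subject's per-interval coverage vector could be computed as in the following sketch. The function name and the two vectors are hypothetical; each vector is assumed to hold one value per genomic interval, in the same interval order.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("vectors must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / sqrt(var_x * var_y)


# Hypothetical short-fragment coverage per interval.
healthy_mean = [100, 120, 90, 110]
subject = [98, 125, 85, 112]
print(pearson(healthy_mean, subject))
```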
Gradient boosting tree
The gradient boosting decision tree (GBDT), also called MART (Multiple Additive Regression Tree), is an iterative decision tree algorithm. Since it was first proposed, it has been regarded as an algorithm with strong generalization ability.
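The iterative idea behind gradient boosting can be shown with a toy sketch. This is not the model of the invention: it assumes squared-error loss, depth-1 trees (stumps), and a fixed learning rate, and all names and data are hypothetical. Each round fits a stump to the current residuals (the negative gradient of the squared loss) and adds a scaled copy of it to the ensemble.

```python
def fit_stump(X, residuals):
    """Find the single-feature threshold split that minimizes squared error."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [r for row, r in zip(X, residuals) if row[j] <= thr]
            right = [r for row, r in zip(X, residuals) if row[j] > thr]
            lmean = sum(left) / len(left)
            rmean = sum(right) / len(right)
            sse = sum((r - lmean) ** 2 for r in left) + sum((r - rmean) ** 2 for r in right)
            if best is None or sse < best[0]:
                best = (sse, j, thr, lmean, rmean)
    if best is None:  # every feature is constant: predict the mean residual
        m = sum(residuals) / len(residuals)
        return (0, float("inf"), m, m)
    return best[1:]


def predict_stump(stump, row):
    j, thr, left_value, right_value = stump
    return left_value if row[j] <= thr else right_value


def fit_gbdt(X, y, n_rounds=20, learning_rate=0.5):
    """Boost stumps on residuals, starting from the mean of y."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, row) for p, row in zip(preds, X)]
    return base, learning_rate, stumps


def predict_gbdt(model, row):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, row) for s in stumps)


# Hypothetical features: [region methylation level, mode fragment size]; label 1 = cancer.
X = [[0.1, 170], [0.2, 168], [0.8, 160], [0.9, 158]]
y = [0, 0, 1, 1]
model = fit_gbdt(X, y)
print(predict_gbdt(model, [0.85, 159]) > 0.5)
```

A production system would more plausibly use an established library implementation with deeper trees and a classification loss; the sketch only illustrates the additive, residual-fitting structure.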
Cross-validation
Cross-validation, sometimes referred to as rotation estimation, is a practical statistical method of partitioning data samples into smaller subsets. The basic idea is to split the original dataset into groups, one part serving as the training set and the other as the validation set (or test set). The classifier is first trained on the training set, and the trained model is then tested on the validation set; the result serves as a performance index for evaluating the classifier.
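A k-fold split, such as the 5-fold cross-validation used for model selection, can be sketched as follows. The function name, the shuffling, and the seed are assumptions for illustration; the source does not state how samples are assigned to folds.

```python
import random


def k_fold_indices(n_samples: int, k: int = 5, seed: int = 0):
    """Yield (train_indices, validation_indices) for each of k folds.

    Every sample appears in exactly one validation fold.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # ASSUMPTION: random assignment to folds
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, validation


for train, validation in k_fold_indices(10, k=5):
    # A model would be fitted on `train` and scored on `validation` here.
    print(len(train), len(validation))
```

Averaging the validation score over the k folds gives the index used to compare candidate models or hyperparameters.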
The invention provides a system for cancer screening comprising a data acquisition module and a cancer calculation module. The data acquisition module acquires the methylation level of a target region of the subject and cfDNA-related features, and the cancer calculation module predicts whether the subject suffers from cancer based on the methylation level and cfDNA-related features acquired by the data acquisition module.
The target region of the subject herein may be a specific region on a chromosome of the subject, e.g., the target region includes any one or two or more of the following regions:
position 151445000-151450000 of chromosome 1,
position 191183500-191188500 of chromosome 2,
position 191184000-191189000 of chromosome 2,
position 68566500-68571500 of chromosome 4, or
position 30601500-30606500 of chromosome 11.
In a specific embodiment, the target region is chromosome 1, position 151445000-151450000.
In a specific embodiment, the target region is chromosome 2, position 191183500-191188500.
In a specific embodiment, the target region is chromosome 2, position 191184000-191189000.
In a specific embodiment, the target region is chromosome 4, position 68566500-68571500.
In a specific embodiment, the target region is chromosome 11, position 30601500-30606500.
In a specific embodiment, the target region is chromosome 1, position 151445000-151450000, and chromosome 2, position 191183500-191188500.
In a specific embodiment, the target region is chromosome 1, position 151445000-151450000, and chromosome 2, position 191184000-191189000.
In a specific embodiment, the target region is chromosome 1, position 151445000-151450000, and chromosome 4, position 68566500-68571500.
In a specific embodiment, the target region is chromosome 1, position 151445000-151450000, and chromosome 11, position 30601500-30606500.
In a specific embodiment, the target region is chromosome 2, position 191183500-191188500, and chromosome 2, position 191184000-191189000.
In a specific embodiment, the target region is chromosome 2, position 191183500-191188500, and chromosome 4, position 68566500-68571500.
In a specific embodiment, the target region is chromosome 2, position 191183500-191188500, and chromosome 11, position 30601500-30606500.
In a specific embodiment, the target region is chromosome 2, position 191184000-191189000, and chromosome 4, position 68566500-68571500.
In a specific embodiment, the target region is chromosome 2, position 191184000-191189000, and chromosome 11, position 30601500-30606500.
In a specific embodiment, the target region is chromosome 1, position 151445000-151450000, chromosome 2, position 191183500-191188500, and chromosome 2, position 191184000-191189000.
In a specific embodiment, the target region is chromosome 1, position 151445000-151450000, chromosome 2, position 191183500-191188500, and chromosome 4, position 68566500-68571500.
In a specific embodiment, the target region is chromosome 1, position 151445000-151450000, chromosome 2, position 191183500-191188500, and chromosome 11, position 30601500-30606500.
In a specific embodiment, the target region is chromosome 2 191183500-191188500, chromosome 2 191184000-191189000, and chromosome 4 68566500-68571500.
In a specific embodiment, the target region is chromosome 2 191183500-191188500, chromosome 2 191184000-191189000, and chromosome 11 30601500-30606500.
In a specific embodiment, the target region is chromosome 2 191183500-191188500, chromosome 4 68566500-68571500, and chromosome 11 30601500-30606500.
In a specific embodiment, the target region is chromosome 2 191184000-191189000, chromosome 4 68566500-68571500, and chromosome 11 30601500-30606500.
In a specific embodiment, the target region is chromosome 1, position 151445000-151450000, chromosome 2, position 191183500-191188500, chromosome 2, position 191184000-191189000, and chromosome 4, position 68566500-68571500.
In a specific embodiment, the target region is chromosome 1, position 151445000-151450000, chromosome 2, position 191183500-191188500, chromosome 2, position 191184000-191189000, and chromosome 11, position 30601500-30606500.
In a specific embodiment, the target region is chromosome 2 191183500-191188500, chromosome 2 191184000-191189000, chromosome 4 68566500-68571500, and chromosome 11 30601500-30606500.
In a specific embodiment, the target region is chromosome 1, position 151445000-151450000, chromosome 2, position 191184000-191189000, chromosome 4, position 68566500-68571500, and chromosome 11, position 30601500-30606500.
In a specific embodiment, the target region is chromosome 1, position 151445000-151450000, chromosome 2, position 191183500-191188500, chromosome 4, position 68566500-68571500, and chromosome 11, position 30601500-30606500.
In a specific embodiment, the target region is position 151445000-151450000 on chromosome 1, position 191183500-191188500 on chromosome 2, position 191184000-191189000 on chromosome 2, position 68566500-68571500 on chromosome 4, and position 30601500-30606500 on chromosome 11.
The methylation level of the target region is calculated based on the methylation level of each CG site in the target region, where the methylation level of the CG site is the ratio of the cytosine at which methylation is detected to the sum of the cytosine at which methylation and the cytosine at which no methylation is detected in all detected sequence results for that site.
For each window, the number of CG sites in the window is counted. Since the depth of methylated cytosine at each CG site and the total depth of the site are known, the methylation level of the entire window can then be calculated as the sum of the methylated-cytosine depths at all CG sites divided by the sum of the total depths of all CG sites. Each window thus obtains a corresponding methylation level. Here, the depth of methylated cytosine at a CG site is the number of reads whose sequencing result indicates methylated cytosine at that site, i.e., the number of reads showing C (cytosine) at that site, and the total depth of the site is the total number of sequencing reads covering that site, i.e., the total number of reads showing C or T (thymine) at that site. Both depths can be provided directly by the sequencing analysis software.
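The window-level calculation described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function name and the input layout (one pair of methylated depth and total depth per CG site) are assumptions made for the example, and the counts are hypothetical.

```python
from typing import Iterable, Optional, Tuple


def window_methylation_level(sites: Iterable[Tuple[int, int]]) -> Optional[float]:
    """Return the methylation level of one window.

    `sites` holds one (methylated_depth, total_depth) pair per CG site, where
    methylated_depth is the number of reads showing C and total_depth is the
    number of reads showing C or T at that site.
    Returns None when the window has no covered CG site.
    """
    methylated_sum = 0
    total_sum = 0
    for methylated_depth, total_depth in sites:
        if methylated_depth < 0 or total_depth < methylated_depth:
            raise ValueError("depths must satisfy 0 <= methylated <= total")
        methylated_sum += methylated_depth
        total_sum += total_depth
    if total_sum == 0:
        return None
    # Ratio of sums over the window, not the mean of per-site ratios.
    return methylated_sum / total_sum


# Hypothetical counts for three CG sites in one window.
example_sites = [(8, 10), (3, 12), (0, 5)]
print(round(window_methylation_level(example_sites), 4))  # 11 / 27 -> 0.4074
```

Because the level is a ratio of sums, sites with deeper coverage contribute more to the window value than shallowly covered sites.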
The methylation levels of cfDNA in the following target regions differ significantly between cancer patients and healthy people and can thus be used as markers for cancer detection:
the 151445000-151450000 position of chromosome 1,
the 191183500-191188500 position of chromosome 2,
the 191184000-191189000 position of chromosome 2,
68566500-68571500 on chromosome 4, or
Chromosome 11, position 30601500-30606500.
cfDNA-related features refer to features of cfDNA derived from cfDNA sequencing data.
In a specific embodiment, the cfDNA-related features include:
the average of all fragment sizes obtained in cfDNA sequencing data,
the mode of all fragment sizes obtained in cfDNA sequencing data,
the average coverage in cfDNA sequencing data,
the correlation coefficient between the pre-stored mean vector of short-fragment coverage in the corresponding intervals of healthy people and the short-fragment coverage obtained from cfDNA sequencing data of the subject,
the correlation coefficient between the pre-stored mean vector of long-fragment coverage in the corresponding intervals of healthy people and the long-fragment coverage obtained from cfDNA sequencing data of the subject,
and the correlation coefficient between the pre-stored mean vector of short-to-long fragment ratios in the corresponding intervals of healthy people and the short-to-long fragment ratio obtained from cfDNA sequencing data of the subject.
Wherein, the average value of all fragment sizes obtained in cfDNA sequencing data refers to the ratio of the sum of all fragment sizes obtained in cfDNA sequencing data of a subject to the number of all fragments.
The average coverage obtained in cfDNA sequencing data refers to the number of all fragments obtained from cfDNA sequencing data of the subject.
The short fragment coverage obtained in cfDNA sequencing data refers to the number of short fragments obtained from cfDNA sequencing data of a subject.
The long fragment coverage obtained in cfDNA sequencing data refers to the number of long fragments obtained from cfDNA sequencing data of a subject.
The pre-stored mean vector of short-fragment coverage in the corresponding intervals of healthy people refers to the per-interval mean of short-fragment coverage calculated from cfDNA sequencing data of known healthy people; it is supplied to the data acquisition module for calculation.
The pre-stored mean vector of long-fragment coverage in the corresponding intervals of healthy people refers to the per-interval mean of long-fragment coverage calculated from cfDNA sequencing data of known healthy people; it is supplied to the data acquisition module for calculation.
The pre-stored mean vector of short-to-long fragment ratios in the corresponding intervals of healthy people refers to the per-interval mean of the short-to-long fragment ratio calculated from cfDNA sequencing data of known healthy people; it is supplied to the data acquisition module for calculation.
In a specific embodiment, the long fragment is 201 to 320bp in length and the short fragment is 150 to 200bp in length.
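The size-based features defined above (mean fragment size, mode of fragment size, and the short and long fragment counts) can be sketched as follows. The function name, the returned keys, and the tie-breaking rule for the mode (the smallest size wins) are assumptions for illustration; only the 150-200 bp and 201-320 bp ranges come from the text.

```python
from collections import Counter
from typing import Dict, Sequence

SHORT_RANGE = (150, 200)  # bp, short fragments as defined in the text
LONG_RANGE = (201, 320)   # bp, long fragments as defined in the text


def fragment_size_features(sizes: Sequence[int]) -> Dict[str, float]:
    """Summarise a list of cfDNA fragment lengths (in bp)."""
    if not sizes:
        raise ValueError("at least one fragment is required")
    counts = Counter(sizes)
    highest = max(counts.values())
    # Assumption: when several sizes are equally frequent, take the smallest.
    mode_size = min(size for size, n in counts.items() if n == highest)
    n_short = sum(1 for s in sizes if SHORT_RANGE[0] <= s <= SHORT_RANGE[1])
    n_long = sum(1 for s in sizes if LONG_RANGE[0] <= s <= LONG_RANGE[1])
    return {
        "mean_size": sum(sizes) / len(sizes),  # sum of sizes / number of fragments
        "mode_size": mode_size,
        "short_count": n_short,
        "long_count": n_long,
    }


# Hypothetical fragment lengths; 90 bp and 400 bp fall outside both ranges.
print(fragment_size_features([166, 166, 150, 200, 201, 320, 90, 400]))
```

Fragments outside both ranges still contribute to the mean and the mode, since those two features are defined over all fragments.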
Further, cfDNA sequencing data is cfDNA sequencing data after removal of low quality sequencing fragments.
Still further, cfDNA sequencing data is sequencing data after removal of low-quality sequencing fragments and removal of sequencing data within low-alignment intervals. In particular, reference may be made to the low-alignment intervals provided at https://genome.ucsc.edu/cgi-bin/hgFileUi?db=hg19&g=wgEncodeMapability.
The data acquisition module of the invention further comprises three submodules, namely a sequencing module, a methylation level analysis module and a cfDNA related characteristic extraction module, wherein the sequencing module is used for carrying out whole genome sequencing on cfDNA of a subject. The methylation level analysis module is for obtaining sequencing data from the sequencing module to analyze the methylation level of the target region. The cfDNA related feature extraction module is used for extracting related features of cfDNA sequencing data from the sequencing data obtained by the sequencing module.
In the cancer calculation module of the present invention, a model fitted based on the data of the methylation level and cfDNA related characteristics of the known sample is prestored for predicting whether the subject suffers from cancer. Substituting the methylation level and cfDNA related characteristics of the target area of the subject obtained in the data acquisition module into a model in the cancer calculation module to obtain a prediction result of whether the subject suffers from cancer.
The model is a gradient boosting tree model fitted to the methylation levels and cfDNA-related feature data of known samples and selected by 5-fold cross-validation.
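The 5-fold cross-validation used for model selection can be illustrated by the fold partition alone. The sketch below is an assumption-laden illustration: it uses a stratified, round-robin split (the text does not say whether the folds were stratified), the function name and seed are hypothetical, and the class sizes 28 and 46 are borrowed from the training set described in Example 5. Fitting the tree model itself is outside the scope of this sketch.

```python
import random
from typing import List, Sequence, Tuple


def stratified_k_fold(labels: Sequence[int], k: int = 5,
                      seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Split sample indices into k (train, validation) pairs.

    Each class is shuffled and dealt round-robin into the k folds so that
    every fold keeps roughly the overall class proportions.
    """
    rng = random.Random(seed)
    folds: List[List[int]] = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        indices = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    splits = []
    for held_out in range(k):
        validation = sorted(folds[held_out])
        training = sorted(i for f in range(k) if f != held_out for i in folds[f])
        splits.append((training, validation))
    return splits


# 1 = lung cancer, 0 = healthy; class sizes taken from the Example 5 training set.
labels = [1] * 28 + [0] * 46
for training, validation in stratified_k_fold(labels):
    n_cancer = sum(labels[i] for i in validation)
    print(len(training), len(validation), n_cancer)
```

Each candidate model would be fitted on the training indices of a split and scored on the validation indices, and the scores averaged over the five splits.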
The system of the present invention may further comprise a bisulfite treatment module for bisulfite treatment of cfDNA of a subject. The bisulfite-treated cfDNA is used for subsequent cfDNA sequencing.
The present invention also provides a method for cancer screening comprising a data acquisition step for acquiring methylation level, cfDNA-related characteristics of a target region of a subject; and
a cancer calculation step of predicting whether the subject suffers from cancer based on the methylation level and cfDNA-related characteristics acquired in the data acquisition step.
Further, the data acquisition step comprises a sequencing step, a methylation level analysis step and a cfDNA related feature extraction step,
the sequencing step is used to perform whole genome sequencing on cfDNA of a subject,
the methylation level analysis step is for obtaining sequencing data from the sequencing step to analyze the methylation level of the target region, and the cfDNA-related feature extraction step is for extracting the related features of cfDNA sequencing data from the sequencing data obtained from the sequencing step.
The target region, methylation level of the target region, cfDNA-related features are as described above for the system for cancer screening.
In the cancer calculation step, a model fitted based on the methylation level of the known sample, the data of the cfDNA related features is pre-stored for predicting whether the subject suffers from cancer,
the model is a gradient boosting tree model fitted to the methylation levels and cfDNA-related feature data of known samples and selected by 5-fold cross-validation.
The method still further comprises a bisulfite treatment step for bisulfite treatment of cfDNA of a subject.
Examples
Example 1 calculation of differential methylation regions and calculation of fragment group characteristics
1.1cfDNA extraction purification
1.1.1 plasma sample preparation:
The blood samples were centrifuged at 2000 g for 10 min at 4 ℃, and the plasma was transferred to a new centrifuge tube. The plasma samples were then centrifuged at 16000 g for 10 min at 4 ℃; the next step depends on the type of collection tube used, and the tube type used in this experiment was "other".
1.1.2 Lysis and binding
1.1.2.1. The Binding Solution/Beads Mix was prepared according to the following table and then thoroughly mixed.
An appropriate volume of plasma sample was added.
1.1.2.2. The plasma sample and the Binding Solution/Beads Mix were thoroughly mixed.
1.1.2.3. The tube was mixed on a rotary mixer for 10 min so that the cfDNA bound fully to the magnetic beads.
1.1.2.4. The binding tube was placed on a magnetic rack for 5min until the solution became clear and the beads were fully adsorbed on the magnetic rack.
1.1.2.5. The supernatant was carefully discarded with a pipette, the tube was kept on the magnetic rack for several minutes, and the residual supernatant was removed with a pipette.
1.1.3 washing
1.1.3.1. The beads were resuspended in 1ml Wash Solution.
1.1.3.2. The resuspension was transferred to a new low-binding 1.5 ml centrifuge tube. The binding tube was retained.
1.1.3.3. The centrifuge tube containing the bead resuspension was placed on a magnetic rack for 20 s.
1.1.3.4. The separated supernatant was aspirated and used to rinse the binding tube; the residual beads rinsed out were pooled back into the resuspension, and the lysis/binding tube was discarded.
1.1.3.5. The tube was placed on a magnet rack for 2min until the solution became clear, the beads were collected on the magnet rack and the supernatant was removed with a 1ml pipette.
1.1.3.6. The tube was kept on the magnetic rack and as much of the remaining liquid as possible was removed with a 200 μL pipette.
1.1.3.7. The tube was removed from the magnetic rack, 1 ml Wash Solution was added, and the tube was vortexed for 30 s.
1.1.3.8. The tube was placed on a magnetic rack for 2 min until the solution cleared and the beads were collected on the rack, and the supernatant was removed with a 1 ml pipette.
1.1.3.9. The tube was kept on the magnetic rack and the residual liquid was removed thoroughly with a 200 μL pipette.
1.1.3.10. The tube was removed from the magnetic rack, 1 ml of 80% ethanol was added, and the tube was vortexed for 30 s.
1.1.3.11. The tube was placed on a magnetic rack for 2 min until the solution cleared, and the supernatant was removed with a 1 ml pipette.
1.1.3.12. The tube was kept on the magnetic rack and the residual liquid was removed with a 200 μL pipette.
1.1.3.13. Steps 1.1.3.10 to 1.1.3.12 were repeated once with 80% ethanol, removing as much of the supernatant as possible.
1.1.3.14. The tube was left on the magnetic rack and the beads were dried in air for 3-5 minutes.
1.1.4 elution of cfDNA
1.1.4.1. The elution solution was added according to the following table.
1.1.4.2. The tube was placed on a magnetic rack for 2 min until the solution cleared, and the supernatant containing the cfDNA was aspirated.
1.1.4.3. The purified cfDNA was used immediately, or the supernatant was transferred to a new centrifuge tube and stored at -20 ℃.
1.2 gDNA fragmentation and purification:
1.2.1. Based on the Qubit concentration, 2 μg of DNA was taken, made up to 125 μl with water, and added to a Covaris 130 μl fragmentation tube; the program was set to 50 W, 20%, 200 cycles, 250 s.
1.2.2. After fragmentation, 1 μl of the sample was used for fragment detection on an Agilent 2100; the main peak of a normally fragmented sample is approximately 150 bp to 200 bp.
For cfDNA samples, fragment detection was performed on an Agilent 2100, and the samples were used for subsequent experiments directly after Qubit quantification.
1.3 End repair and 3' A-tailing:
1.3.1. X ng of fragmented gDNA or cfDNA was added to a PCR tube and made up to 50 μl with nuclease-free water; the following reagents were then added and mixed by vortexing:
Component                            Volume
gDNA/cfDNA                           50 μl
End Repair & A-Tailing Buffer        7 μl
End Repair & A-Tailing Enzyme Mix    3 μl
Total volume                         60 μl
1.3.2. The following program was set up on the PCR instrument and the reaction was run:
Heated lid temperature: 85 ℃.
Temperature    Time
20 ℃           30 min
65 ℃           30 min
4 ℃
1.4 linker ligation and purification:
1.4.1. the linker was diluted in advance to the appropriate concentration with reference to the following table:
1.4.2. The following reagents were combined according to the table below, mixed by gentle pipetting, and centrifuged briefly:
Component                                    Volume
End repair and A-tailing reaction product    60 μl
Linker                                       5 μl
Nuclease-free water                          5 μl
Ligation Buffer                              30 μl
DNA Ligase                                   10 μl
Total volume                                 110 μl
1.4.3. The following program was set up on the PCR instrument and the reaction was run:
No heated lid.
Temperature    Time
20 ℃           30 min
4 ℃
1.4.4. Purification magnetic beads were added according to the following system (the Agencourt AMPure XP beads were brought to room temperature in advance and mixed by shaking before use):
Component                    Volume
Linker ligation product      110 μl
Agencourt AMPure XP beads    110 μl
Total volume                 220 μl
1.4.4.1. The mixture was mixed by gently pipetting up and down 6 times.
1.4.4.2. The mixture was left at room temperature for 5-15 min, and the PCR tube was then placed on a magnetic rack for 3 min until the solution cleared.
1.4.4.3. The supernatant was removed; with the PCR tube kept on the magnetic rack, 200 μl of 80% ethanol solution was added and left to stand for 30 s.
1.4.4.4. The supernatant was removed, 200 μl of 80% ethanol solution was added to the PCR tube, and after standing for 30 s the supernatant was removed thoroughly (a 10 μl pipette is recommended for removing the residual ethanol at the bottom).
1.4.4.5. The tube was left at room temperature for 3-5 min so that the residual ethanol evaporated completely.
1.4.4.6. 22 μl of nuclease-free water was added, the PCR tube was removed from the magnetic rack, the beads were resuspended by gentle pipetting without generating bubbles, and the tube was left at room temperature for 2 min.
1.4.4.7. The PCR tube was placed on a magnetic rack for 2 min until the solution cleared.
1.4.4.8. [...] μl of the supernatant was pipetted into a new PCR tube.
1.5 Bisulfite treatment and purification:
1.5.1. The required reagents were taken out in advance and dissolved. The reagents were added according to the following table:
Component                           High-concentration sample (1 ng-2 μg)    Low-concentration sample (1-500 ng)
Linker ligation purified product    20 μl                                    40 μl
Bisulfite solution                  85 μl                                    85 μl
DNA Protect Buffer                  35 μl                                    15 μl
Total volume                        140 μl                                   140 μl
1.5.2. After the DNA Protect Buffer was added, the liquid turned blue. The mixture was mixed by gentle pipetting and then split into two PCR tubes.
1.5.3. The following program was set up and run:
Heated lid temperature: 105 ℃.
Temperature    Time
95 ℃           5 min
60 ℃           10 min
95 ℃           5 min
60 ℃           10 min
4 ℃
1.5.4. After brief centrifugation, the two tubes of the same sample were combined in one clean 1.5 ml centrifuge tube.
1.5.5. 310 μl of Buffer BL was added to each sample (for sample inputs below 100 ng, 1 μl of carrier RNA (1 μg/μl) was also added), and the sample was vortexed and centrifuged briefly.
1.5.6. 250 μl of absolute ethanol was added to each sample, and the sample was vortexed for 15 s and centrifuged briefly; the mixture was then loaded onto the corresponding prepared spin column.
1.5.7. The column was left to stand for 1 min and centrifuged for 1 min; the flow-through in the collection tube was transferred back onto the spin column, centrifuged for 1 min, and then discarded.
1.5.8. 500 μl of Buffer BW was added (check that absolute ethanol has been added to the buffer), the column was centrifuged for 1 min, and the flow-through was discarded.
1.5.9. 500 μl of Buffer BD was added (check that absolute ethanol has been added to the buffer), the tube was capped and left at room temperature for 15 min. The column was centrifuged for 1 min and the flow-through was discarded.
1.5.10. 500 μl of Buffer BW was added (check that absolute ethanol has been added to the buffer), the column was centrifuged for 1 min, the flow-through was discarded, and the step was repeated 2 times.
1.5.11. 250 μl of absolute ethanol was added and the column was centrifuged for 1 min; the column was placed in a new 2 ml collection tube and all remaining liquid was discarded.
1.5.12. The column was placed in a clean 1.5 ml centrifuge tube, 20 μl of nuclease-free water was added to the center of the column membrane, the lid was closed gently, and the column was left at room temperature for 1 min and centrifuged for 1 min.
1.5.13. The eluate in the collection tube was transferred back onto the spin column, left at room temperature for 1 min, and centrifuged for 1 min.
1.6 Pre-amplification and purification before hybridization:
1.6.1. A reaction system was prepared according to the following table, mixed by pipetting, and centrifuged briefly:
1.6.2. The following program was set and the PCR program was started:
Heated lid temperature: 105 ℃.
1.6.3. The number of PCR cycles was adjusted according to the DNA input amount; the reference data are as follows:
1.6.4. After the reaction, 50 μl of Agencourt AMPure XP magnetic beads (mixed and equilibrated to room temperature in advance) was added to the PCR tube and mixed by pipetting, avoiding bubbles.
1.6.5. The mixture was incubated at room temperature for 5-15 min, and the PCR tube was then placed on a magnetic rack for 3 min until the solution cleared.
1.6.6. The supernatant was removed; with the PCR tube kept on the magnetic rack, 200 μl of 80% ethanol solution was added and left to stand for 30 s.
1.6.7. The supernatant was removed, 200 μl of 80% ethanol solution was added to the PCR tube, and after standing for 30 s the supernatant was removed thoroughly (a 10 μl pipette is recommended for removing the residual ethanol at the bottom).
1.6.8. The tube was left at room temperature for 5 min so that the residual ethanol evaporated completely.
1.6.9. [...] μl of nuclease-free water was added, the tube was removed from the magnetic rack, and the beads were resuspended by gentle pipetting.
1.6.10. The tube was left at room temperature for 2 min, and the 200 μl PCR tube was then placed on a magnetic rack for 2 min until the solution cleared.
1.6.11. The supernatant was transferred with a pipette to a new 200 μl PCR tube (kept on an ice box), and the reaction tube was labeled with the sample number in preparation for the next reaction.
1.6.12. 1 μl of the sample was used for library concentration determination using Qubit, and the library concentration was recorded.
1.6.13. 1 μl of the sample was used for library fragment length measurement on an Agilent 2100; the library length was approximately 270-320 bp.
1.6.14. Sequencing was performed using Illumina high throughput sequencing platform.
1.6.15. Methylation bioinformatics analysis workflow. The workflow is as follows: the quality of the raw sequencing data is checked with quality control software such as fastp, and low-quality reads are filtered, trimmed, or removed to obtain the corresponding clean data; the clean data after quality control are aligned to the reference genome (hg19) using Bismark with Bowtie 2; the initially aligned BAM file is de-duplicated with deduplicate_bismark; methylation site information is extracted with bismark_methylation_extractor to obtain the final methylation CG file (containing the information of every individual CG site); finally, the reference genome is divided into windows by a sliding-window method, and the overall methylation level of the CG sites in each window is calculated; for each sample, the methylation level of each window is computed, and differentially methylated windows are identified by comparing the sample groups.
cfDNA fragment feature extraction workflow.
Sequencing quality was checked with quality control software such as fastp and low-quality reads were removed; the clean data after quality control were aligned to the reference genome with alignment software such as Bismark to obtain the aligned BAM file; reads with MAPQ < 30 were filtered out, and whole-genome fragment information was extracted using the R package GCcontent.
The extracted fragment information was tiled into adjacent, non-overlapping 100 kb intervals across the autosomes of the hg19 reference genome, and low-alignment intervals were excluded based on previous work. Short fragments were defined as 150 to 200 bp and long fragments as 201 to 320 bp in length, and the short-fragment coverage, long-fragment coverage, and total coverage (short plus long fragments) of each interval were calculated.
Short-fragment, long-fragment, and total coverage were corrected using locally weighted regression (LOWESS) to remove coverage bias caused by GC content.
The 100 kb intervals were merged sequentially into 5 Mb intervals, giving 499 non-overlapping intervals, and the corrected short-fragment coverage, long-fragment coverage, short-to-long fragment ratio, and total coverage of each interval were calculated, giving a plurality of preliminary features for the sample.
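The binning and merging steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the workflow actually used: each fragment is assigned to a bin by its start coordinate, 5 Mb intervals are formed from fixed groups of 50 consecutive 100 kb bins, and the exclusion of low-alignment intervals and the LOWESS GC correction are omitted, so the sketch does not reproduce the 499 intervals reported. All function names and the input layout are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple

BIN_SIZE = 100_000        # 100 kb bins
BINS_PER_INTERVAL = 50    # 50 x 100 kb = 5 Mb

Fragment = Tuple[str, int, int]  # (chromosome, 0-based start, length in bp)


def count_per_bin(fragments: Iterable[Fragment]) -> Dict[Tuple[str, int], List[int]]:
    """Count short (150-200 bp) and long (201-320 bp) fragments per 100 kb bin."""
    counts: Dict[Tuple[str, int], List[int]] = defaultdict(lambda: [0, 0])
    for chrom, start, length in fragments:
        key = (chrom, start // BIN_SIZE)  # assumption: bin chosen by start coordinate
        if 150 <= length <= 200:
            counts[key][0] += 1
        elif 201 <= length <= 320:
            counts[key][1] += 1
    return counts


def merge_to_intervals(counts: Dict[Tuple[str, int], List[int]]
                       ) -> Dict[Tuple[str, int], Dict[str, Optional[float]]]:
    """Merge 100 kb bins into 5 Mb intervals and derive per-interval features."""
    merged: Dict[Tuple[str, int], List[int]] = defaultdict(lambda: [0, 0])
    for (chrom, bin_index), (n_short, n_long) in counts.items():
        key = (chrom, bin_index // BINS_PER_INTERVAL)
        merged[key][0] += n_short
        merged[key][1] += n_long
    features = {}
    for key, (n_short, n_long) in merged.items():
        features[key] = {
            "short": n_short,
            "long": n_long,
            "total": n_short + n_long,
            # Ratio is undefined when an interval has no long fragments.
            "ratio": n_short / n_long if n_long else None,
        }
    return features


# Hypothetical fragments on chr1; the 400 bp fragment is neither short nor long.
fragments = [("chr1", 10_000, 166), ("chr1", 120_000, 250),
             ("chr1", 4_999_999, 180), ("chr1", 5_000_000, 300),
             ("chr1", 50, 400)]
print(merge_to_intervals(count_per_bin(fragments)))
```

In the workflow described in the text, the GC correction is applied to the 100 kb coverage values before they are merged, so a fuller version would insert that step between the two functions.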
Example 2
Using a training set of cfDNA from 14 lung cancer patients and 22 healthy people, the methylation levels of 1583 initial markers were detected in these 14 lung cancer patients and 22 healthy people by the method described in Example 1, and the 5 methylation regions that most significantly distinguished lung cancer cfDNA from healthy cfDNA were selected as specific regions relevant to lung cancer detection. As shown in Table 1, the corresponding region information is as follows: first region, chromosome 1, 151445000-151450000; second region, chromosome 2, 191183500-191188500; third region, chromosome 2, 191184000-191189000; fourth region, chromosome 4, 68566500-68571500; fifth region, chromosome 11, 30601500-30606500.
The methylation level data of each of the above markers, detected by the method of Example 1 in each of the 14 lung cancer patients and 22 healthy subjects, were input into R, and a random forest model was constructed by regression using the randomForest package. The regression results showed that, in the training set, the cutoff of the comprehensive methylation level of the 5 markers for predicting lung cancer was 0.442, i.e., the specified threshold was 0.442 (a value greater than 0.442 is interpreted as a lung cancer patient); the final model reached an AUC of 1, with accuracy 100%, sensitivity 100%, specificity 100%, PPV 100%, and NPV 100%. See Table 1 and Fig. 1 for details.
TABLE 1
Example 3
Based on the 5 methylation markers in Example 2, the pROC package in R was used to calculate, from the methylation level of each methylation marker, the respective cutoff and AUC values of the comprehensive methylation level of the 5 methylation markers for predicting lung cancer in the test set (cfDNA from 10 lung cancer patients and from 16 healthy people, none of which was used in the training set); see Table 2.
TABLE 2
Example 4
Using the model constructed in Example 2 on the test set of cfDNA from 10 lung cancer patients and 16 healthy people, the cutoff of the comprehensive methylation level of the 5 markers for predicting lung cancer was 0.442, i.e., the specified threshold was 0.442 (a value greater than 0.442 is interpreted as a lung cancer patient). The AUC reached 0.919, with accuracy 84.62%, sensitivity 90%, specificity 81.25%, PPV 75%, and NPV 92.86%; see Table 3 and Fig. 2 for details.
TABLE 3
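The percentages reported for Example 4 can be checked with a short worked calculation. The confusion-matrix counts below are not stated in the text; they are inferred from the reported sensitivity (90% of 10 lung cancer samples, so 9 true positives) and specificity (81.25% of 16 healthy samples, so 13 true negatives), and the function name is hypothetical.

```python
from typing import Dict


def screening_metrics(tp: int, fn: int, tn: int, fp: int) -> Dict[str, float]:
    """Standard binary screening metrics from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),          # true positive rate
        "specificity": tn / (tn + fp),          # true negative rate
        "ppv": tp / (tp + fp),                  # positive predictive value
        "npv": tn / (tn + fn),                  # negative predictive value
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }


# Inferred counts for the Example 4 test set (10 lung cancer, 16 healthy).
metrics = screening_metrics(tp=9, fn=1, tn=13, fp=3)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

With these counts the five values come out as 90.00%, 81.25%, 75.00%, 92.86%, and 84.62%, which agrees with the figures given for Example 4.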
Example 5
A further 42 lung cancer samples and 64 healthy-person samples, different from those in Examples 2-4, were selected, giving 106 samples in total, and the samples were collected as in Example 1; libraries were constructed and sequenced on the Illumina platform; the methylation levels of the 5 differentially methylated regions were obtained through the methylation bioinformatics analysis workflow of the sequencing data. The 106 samples were divided into a 70% training set and a 30% test set. On all 106 samples, several machine learning models (logistic regression, support vector machine, random forest, gradient boosting tree, etc.) were trained with an R language tool using 5-fold cross-validation and their training results were evaluated; the gradient boosting tree model gave the best results and was therefore selected as the final model.
For the training set (28 of the 42 lung cancer samples and 46 of the 64 healthy-person samples), a gradient boosting tree model was used and the optimal model was obtained by 5-fold cross-validation. The results of the model on the test set are shown in Fig. 3 and Table 4: the AUC was 0.905, the sensitivity 85.7%, and the specificity 88.9%. The positive predictive value (PPV) was 85.7%, and the negative predictive value (NPV) was 88.9%.
TABLE 4
Example 6
The same 42 lung cancer samples and 64 healthy-person samples as in Example 5 were selected, and peripheral blood was collected as in Example 1; libraries were constructed and sequenced on the Illumina platform; a plurality of preliminary fragment group features of each sample were obtained through the fragment group bioinformatics analysis workflow: 499 short-fragment coverage features, 499 long-fragment coverage features, and 499 total coverage features. Four comprehensive indices were calculated from these three kinds of features: cov (coverage: the 499 intervals are combined into one interval, and the mean of the 499 total fragment coverages is calculated), short.cor (the correlation coefficient between the mean vector of short-fragment coverage in the corresponding intervals of all healthy people and the short-fragment coverage of the sample), long.cor (the correlation coefficient between the mean vector of long-fragment coverage in the corresponding intervals of all healthy people and the long-fragment coverage of the sample), and ratio.cor (the correlation coefficient between the mean vector of the short-to-long fragment ratio in the corresponding intervals of all healthy people and the short-to-long fragment ratio of the sample). These 4 features were used together with mean size and mode size, and the 106 samples were divided into a 70% training set and a 30% test set.
For the training set (28 of the 42 lung cancer samples and 46 of the 64 healthy-person samples), a gradient boosting tree model was constructed with an R language tool using 5-fold cross-validation, and its performance was then verified on the test set.
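The four comprehensive indices of this example (cov, short.cor, long.cor, ratio.cor) can be sketched as follows. The text says only "correlation coefficient"; the use of the Pearson coefficient here is an assumption, as are the function names, the argument layout, and the toy vectors (which have 4 intervals rather than 499). Every interval is assumed to have non-zero long-fragment coverage.

```python
from math import sqrt
from typing import Dict, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("vectors must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return sxy / sqrt(sxx * syy)


def comprehensive_indices(short: Sequence[float], long_: Sequence[float],
                          total: Sequence[float], ref_short: Sequence[float],
                          ref_long: Sequence[float],
                          ref_ratio: Sequence[float]) -> Dict[str, float]:
    """Compute cov, short.cor, long.cor and ratio.cor for one sample.

    short, long_, total: per-interval coverages of the sample.
    ref_*: pre-stored per-interval mean vectors from healthy people.
    """
    ratio = [s / l for s, l in zip(short, long_)]  # assumes long_ has no zeros
    return {
        "cov": sum(total) / len(total),            # mean of the total coverages
        "short.cor": pearson(ref_short, short),
        "long.cor": pearson(ref_long, long_),
        "ratio.cor": pearson(ref_ratio, ratio),
    }


# Hypothetical 4-interval sample compared with a hypothetical healthy reference.
sample_short = [10.0, 20.0, 30.0, 40.0]
sample_long = [20.0, 30.0, 35.0, 50.0]
sample_total = [30.0, 50.0, 65.0, 90.0]
reference_ratio = [0.5, 0.6, 0.9, 0.8]
print(comprehensive_indices(sample_short, sample_long, sample_total,
                            [12.0, 19.0, 33.0, 38.0], [21.0, 28.0, 37.0, 49.0],
                            reference_ratio))
```

A sample whose coverage profile follows the healthy reference closely gives correlation values near 1, while profiles that deviate from the reference give lower values.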
The results of the model of the application on the test set are shown in Fig. 4 and Table 5: the AUC was 0.937, the sensitivity 85.7%, and the specificity 88.9%. The positive predictive value (PPV) was 85.7%, and the negative predictive value (NPV) was 88.9%.
TABLE 5
Example 7
The same 42 lung cancer samples and 64 healthy-person samples as in Example 5 were selected, and the methylation features obtained in Example 5 were combined with the 6 fragment features used in Example 6, so that lung cancer discrimination was performed on methylation and fragment group features together. The 106 samples were again divided into a 70% training set and a 30% test set. On the training set (28 of the 42 lung cancer samples and 46 of the 64 healthy-person samples), a gradient boosting tree model was used and the optimal model (also called the model of the application) was selected by 5-fold cross-validation. The results of this model on the test set are shown in Fig. 5 and Table 6: the AUC was 0.978, the sensitivity 92.9%, and the specificity 94.4%. The positive predictive value (PPV) was 92.9%, and the negative predictive value (NPV) was 94.4%.
TABLE 6
The method of the present invention can be used for cancer screening using WGBS data, fragment group information obtained by the WGBS data is reliable, and WGBS can also calculate methylation levels, thus allowing early screening in combination of both aspects. The prior study uses WGS data to obtain fragment group information, and then combines the WGS data to obtain CNV copy number variation and other information for early screening of tumors.
Although the embodiments of the present invention have been described above with reference to the accompanying drawings, the present invention is not limited to the above-described specific embodiments and application fields, and the above-described specific embodiments are merely illustrative, and not restrictive. Those skilled in the art, having the benefit of this disclosure, may effect numerous forms of the invention without departing from the scope of the invention as claimed.

Claims (12)

1. A system for cancer screening, comprising:
a data acquisition module for acquiring methylation level, cfDNA-related characteristics of a target region of a subject; and
a cancer calculation module that predicts whether the subject suffers from cancer based on the methylation level and cfDNA-related characteristics acquired in the data acquisition module.
2. The system of claim 1, wherein,
the data acquisition module comprises a sequencing module, a methylation level analysis module and a cfDNA-related feature extraction module,
the sequencing module is used for performing whole genome sequencing on cfDNA of a subject,
the methylation level analysis module is used for obtaining sequencing data from the sequencing module and analyzing the methylation level of the target region, and the cfDNA-related feature extraction module is used for extracting cfDNA-related features from the sequencing data obtained from the sequencing module.
3. The system of claim 1, wherein,
the target region includes any one or two or more of the following regions:
positions 151445000-151450000 of chromosome 1,
positions 191183500-191188500 of chromosome 2,
positions 191184000-191189000 of chromosome 2,
positions 68566500-68571500 of chromosome 4, or
positions 30601500-30606500 of chromosome 11.
4. The system of claim 1, wherein,
the methylation level of the target region is calculated based on the methylation level of each CG site in the target region, where the methylation level of a CG site is the ratio of the number of cytosines detected as methylated to the sum of the cytosines detected as methylated and the cytosines detected as unmethylated, over all sequencing results covering that site.
5. The system of claim 2, wherein,
the cfDNA-related features include:
the average of all fragment sizes obtained in the cfDNA sequencing data,
the mode of all fragment sizes obtained in the cfDNA sequencing data,
the average coverage in the cfDNA sequencing data,
the correlation coefficient between a pre-stored average value vector of the short fragment coverage of the corresponding intervals in healthy persons and the short fragment coverage obtained from the cfDNA sequencing data of the subject,
the correlation coefficient between a pre-stored average value vector of the long fragment coverage of the corresponding intervals in healthy persons and the long fragment coverage obtained from the cfDNA sequencing data of the subject,
and the correlation coefficient between a pre-stored average value vector of the short-to-long fragment ratio of the corresponding intervals in healthy persons and the short-to-long fragment ratio obtained from the cfDNA sequencing data of the subject.
6. The system of claim 5, wherein,
the pre-stored average value vector of the short fragment coverage of the corresponding intervals in healthy persons refers to the average short fragment coverage of each corresponding interval in known healthy persons, calculated from the cfDNA sequencing data of the known healthy persons and provided to the data acquisition module for calculation;
the pre-stored average value vector of the long fragment coverage of the corresponding intervals in healthy persons refers to the average long fragment coverage of each corresponding interval in known healthy persons, calculated from the cfDNA sequencing data of the known healthy persons and provided to the data acquisition module for calculation;
the pre-stored average value vector of the short-to-long fragment ratio of the corresponding intervals in healthy persons refers to the average short-to-long fragment ratio of each corresponding interval in known healthy persons, calculated from the cfDNA sequencing data of the known healthy persons and provided to the data acquisition module for calculation.
7. The system of claim 5, wherein,
the average value of all fragment sizes obtained in cfDNA sequencing data refers to the ratio of the sum of all fragment sizes obtained in cfDNA sequencing data of a subject to the number of all fragments;
the average coverage obtained in cfDNA sequencing data refers to the number of all fragments obtained from cfDNA sequencing data of the subject;
the short fragment coverage obtained in cfDNA sequencing data refers to the number of short fragments obtained from cfDNA sequencing data of a subject;
the long fragment coverage obtained in cfDNA sequencing data refers to the number of long fragments obtained from cfDNA sequencing data of a subject.
8. The system of claim 5, wherein,
the length of the long fragment is 201-320 bp, and the length of the short fragment is 150-200 bp.
9. The system of claim 2, wherein,
cfDNA sequencing data is cfDNA sequencing data after removal of low quality sequencing fragments.
10. The system of claim 9, wherein,
cfDNA sequencing data is sequencing data after removal of low quality sequencing fragments, further excluding sequencing data in the low-alignment interval.
11. The system of claim 1, wherein,
in the cancer calculation module, a model fitted on the methylation levels and cfDNA-related feature data of known samples is pre-stored for predicting whether the subject suffers from cancer,
the model is obtained by fitting a gradient boosting tree model on the methylation levels and cfDNA-related feature data of known samples and selecting the model by 5-fold cross-validation.
12. The system of claim 1, wherein,
the system further includes a bisulfite treatment module for bisulfite treatment of cfDNA of a subject.
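The cfDNA-related features of claims 5 to 8 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the patented implementation: each fragment is represented as a hypothetical `(interval_index, size_bp)` pair, the genome intervals are assumed to be precomputed, the correlation is assumed to be Pearson's, and the healthy reference vectors are toy values. The short (150-200 bp) and long (201-320 bp) size ranges are taken from claim 8.

```python
from collections import Counter
from math import sqrt

SHORT_RANGE = (150, 200)  # bp, from claim 8
LONG_RANGE = (201, 320)   # bp, from claim 8


def pearson(x, y):
    """Pearson correlation coefficient (ASSUMPTION: the claims do not name the type)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # ASSUMPTION: treat a constant vector as uncorrelated
    return cov / sqrt(var_x * var_y)


def interval_coverage(fragments, n_intervals):
    """Count short and long fragments per interval."""
    short = [0] * n_intervals
    long_ = [0] * n_intervals
    for interval, size in fragments:
        if SHORT_RANGE[0] <= size <= SHORT_RANGE[1]:
            short[interval] += 1
        elif LONG_RANGE[0] <= size <= LONG_RANGE[1]:
            long_[interval] += 1
    return short, long_


def fragment_features(fragments, n_intervals, ref_short, ref_long, ref_ratio):
    """Compute the six fragment features listed in claim 5."""
    sizes = [size for _, size in fragments]
    short, long_ = interval_coverage(fragments, n_intervals)
    # ASSUMPTION: an interval without long fragments gets ratio 0.0.
    ratio = [s / l if l else 0.0 for s, l in zip(short, long_)]
    return {
        "mean_size": sum(sizes) / len(sizes),
        "mode_size": Counter(sizes).most_common(1)[0][0],
        "coverage": len(sizes),  # number of all fragments, as in claim 7
        "short_corr": pearson(ref_short, short),
        "long_corr": pearson(ref_long, long_),
        "ratio_corr": pearson(ref_ratio, ratio),
    }


# Toy subject data: (interval index, fragment size in bp).
fragments = [(0, 160), (0, 167), (0, 250), (1, 167), (1, 300),
             (1, 310), (2, 167), (2, 180), (2, 190), (2, 220)]
# Toy healthy reference vectors, one value per interval.
features = fragment_features(
    fragments, n_intervals=3,
    ref_short=[4, 2, 6], ref_long=[2, 1, 2], ref_ratio=[1.0, 0.25, 1.5],
)
print(features)
```

The resulting dictionary, together with the region methylation levels, would form the input row for the classifier described in claim 11.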
CN202210228206.4A 2022-03-08 2022-03-08 System for cancer screening Pending CN116779025A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210228206.4A CN116779025A (en) 2022-03-08 2022-03-08 System for cancer screening

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210228206.4A CN116779025A (en) 2022-03-08 2022-03-08 System for cancer screening

Publications (1)

Publication Number Publication Date
CN116779025A true CN116779025A (en) 2023-09-19

Family

ID=87984691

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210228206.4A Pending CN116779025A (en) 2022-03-08 2022-03-08 System for cancer screening

Country Status (1)

Country Link
CN (1) CN116779025A (en)

Similar Documents

Publication Publication Date Title
CN114045345B (en) Free DNA-based genome canceration information detection system and detection method
CN114736968B (en) Application of plasma free DNA methylation marker in lung cancer early screening and lung cancer early screening device
CN110760580B (en) Early diagnosis equipment for liver cancer
WO2012047899A2 (en) Novel dna hypermethylation diagnostic biomarkers for colorectal cancer
CN110964826A (en) High-throughput detection kit for methylation of colorectal cancer suppressor gene and application thereof
CN114974430A (en) System for cancer screening and method thereof
CN112941180A (en) Group of lung cancer DNA methylation molecular markers and application thereof in preparation of lung cancer early diagnosis kit
CN106845154B (en) A device for FFPE sample copy number variation detects
CN107142320B (en) Gene marker for detecting liver cancer and application thereof
CN114743593B (en) Construction method of prostate cancer early screening model based on urine, screening model and kit
CN115094142B (en) Methylation markers for diagnosing lung-intestinal adenocarcinoma
CN107630093B (en) Reagent, kit, detection method and application for diagnosing liver cancer
CN114182022A (en) Method for detecting liver cancer specific mutation based on cfDNA base mutation frequency distribution
WO2023142625A1 (en) Methylation sequencing data filtering method and application
CN116121390A (en) Marker for prognosis of cancer and suitability for immunotherapy and application thereof
CN116779025A (en) System for cancer screening
CN113817822B (en) Tumor diagnosis kit based on methylation detection and application thereof
CN117441027A (en) Headrich-BS: thermal enrichment of CpG-rich regions for bisulfite sequencing
US20240194295A1 (en) Cellular heterogeneity-adjusted clonal methylation (chalm): a methylation quantification method
CN116042820B (en) Colon cancer DNA methylation molecular markers and application thereof in preparation of early diagnosis kit for colon cancer
CN114231635B (en) Marker and probe composition for lung cancer screening and application thereof
CN115896258A (en) Method for screening cancer and system for screening cancer
CN114507734B (en) Marker for thyroid cancer screening, probe composition and application thereof
CN115772566B (en) Methylation biomarker for auxiliary detection of lung cancer somatic ERBB2 gene mutation and application thereof
CN117059163A (en) System and method for screening large fragment methylation markers

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination