CN114974430A - System for cancer screening and method thereof - Google Patents

Publication number
CN114974430A
Authority
CN
China
Prior art keywords
tumor
sequencing
cfdna
methylation level
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210182941.6A
Other languages
Chinese (zh)
Inventor
田继超
杨亚东
李永君
王小奇
彭勇飞
连明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Biochain Beijing Science and Technology Inc
Original Assignee
Biochain Beijing Science and Technology Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Biochain Beijing Science and Technology Inc
Publication of CN114974430A

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/10: Ploidy or copy number detection
    • G16B20/20: Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10: Sequence alignment; Homology search

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Medical Informatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Biophysics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Analytical Chemistry (AREA)
  • Chemical & Material Sciences (AREA)
  • Molecular Biology (AREA)
  • Genetics & Genomics (AREA)
  • Bioethics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Evolutionary Computation (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The present application provides a system for cancer screening, comprising: a data acquisition module for acquiring data on the methylation level, tumor fraction, tumor ploidy, cfDNA fragment length feature value and reads distribution of a target region of a subject; and a cancer calculation module that predicts the probability that the subject has cancer based on the methylation level, tumor fraction, tumor ploidy, cfDNA fragment length feature value and reads distribution data obtained by the data acquisition module. The noninvasive cancer screening approach uses a very simple model built on 5 markers, which can greatly reduce the cost of cancer screening while improving screening accuracy, with very high sensitivity and specificity.

Description

System for cancer screening and method thereof
Technical Field
The application belongs to the field of biotechnology, and particularly relates to a system and a method for screening cancer.
Background
Early cancer screening can effectively improve patient outcomes, including survival. Taking colorectal cancer as an example, according to the '2020 Chinese colorectal cancer screening and early diagnosis and treatment guideline', the occurrence and development of colorectal cancer mostly follow an 'adenoma-carcinoma' sequence, and progression from precancerous lesion to carcinoma generally takes 5-10 years, which provides a valuable time window for early diagnosis and clinical intervention. Furthermore, the prognosis of colorectal cancer is closely related to the stage at diagnosis: the 5-year relative survival rate of stage I colorectal cancer exceeds 90%, whereas that of stage IV colorectal cancer with distant metastasis is below 15%. A large body of research and practice has shown that colorectal cancer screening and early diagnosis and treatment can effectively reduce the incidence and mortality of colorectal cancer. For example, in the United States the population undergoing colonoscopy screening has gradually increased since about 1980; at first screening detected more patients, so the recorded incidence of colorectal cancer rose, but the incidence and mortality of colorectal cancer in the United States later fell markedly. For other cancer types, early screening and diagnosis likewise enable early intervention and improve patient survival.
cfDNA (cell-free DNA) consists of small DNA fragments isolated from peripheral blood that are released from normal or tumor cells during cell turnover and metabolism, and it carries genetic and epigenetic information such as somatic mutations and DNA methylation. Among these, ctDNA methylation can be detected early in tumorigenesis and is highly stable. In addition, the cfDNA fragment length distribution, tumor abundance and similar features can also be read out from the assay. Currently, DNA methylation, fragment length, tumor heterogeneity and the like have found some application in early cancer screening.
Disclosure of Invention
In view of the problems of the prior art, it is an object of the present application to provide a system for cancer screening and a method thereof.
Specifically, the application provides the following technical solutions:
1. A system for cancer screening, comprising:
a data acquisition module for acquiring data of methylation level, tumor fraction, tumor ploidy, cfDNA fragment length feature value and reads distribution of a target region of a subject; and
a cancer calculation module to predict a probability of the subject suffering from cancer based on the data for methylation level, tumor fraction, tumor ploidy, cfDNA fragment length feature value, and reads distribution obtained in the data acquisition module.
2. The system of item 1, wherein,
the data acquisition module comprises a sequencing module, a methylation level analysis module, a tumor score analysis module, a tumor ploidy analysis module, a cfDNA fragment length eigenvalue analysis module and a reads distribution analysis module;
the sequencing module is used for performing whole genome sequencing on cfDNA of a subject;
the methylation level analysis module analyzes the methylation level of the target region based on the sequencing data obtained from the sequencing module;
the tumor score analysis module analyzes a tumor score based on sequencing data obtained from the sequencing module (the terms "tumor score" and "tumor proportion" are used interchangeably in this specification and both denote the tumor fraction);
the tumor ploidy analysis module analyzes tumor ploidy based on sequencing data obtained from the sequencing module;
the cfDNA fragment length eigenvalue analysis module calculates a length-related risk eigenvalue of the tumor based on the length of the cfDNA fragment extracted from the sequencing data obtained by the sequencing module;
the reads distribution analysis module analyzes the reads distribution based on the sequencing data obtained from the sequencing module so as to predict the source of each read.
3. The system of item 1 or 2, wherein,
the target region includes any one or two or more of:
chromosome 2: 223721500-223726500,
chromosome 6: 170147000-170152000,
chromosome 8: 182500-187500,
chromosome 8: 64081500-64086500,
chromosome 9: 14688500-14693500,
chromosome 10: 119803500-119808500, or
chromosome 15: 56285500-56290500.
4. The system according to any one of items 1 to 3, wherein,
the methylation level of the target region is calculated from the methylation level of each CG site in the target region, wherein the methylation level of a CG site is the ratio of methylated cytosines to the sum of methylated and unmethylated cytosines detected at that site in the sequencing results;
the tumor fraction is the proportion of ctDNA in cfDNA, i.e. the proportion of free DNA released by tumor cells in total cfDNA;
the tumor ploidy refers to the real content of cancer cells in a tumor sample caused by abnormalities in chromosome structure and number, namely the ploidy number of the tumor;
the cfDNA fragment length feature value analysis module calculates a length-related tumor risk feature value based on the lengths of the cfDNA fragments extracted from the sequencing data obtained by the sequencing module, the cfDNA fragment length feature value being calculated by a gradient boosting tree model from the ratio between short and long fragments among the cfDNA fragment lengths;
preferably, the long fragment is a fragment with the length of 201-320 bp;
the short fragment is a fragment with the length of 150-200 bp;
the reads distribution refers to the probability of the sample source obtained by counting each read in cfDNA sequencing data, i.e., the probability that a given read is derived from three source components, healthy plasma, normal tissue, and tumor tissue.
5. The system according to any one of items 1 to 4, wherein,
the cfDNA sequencing data is cfDNA sequencing data after removing low-quality sequencing fragments;
preferably, the cfDNA sequencing data is sequencing data from which sequencing data within the low-alignment interval is further excluded after removal of low-quality sequencing fragments.
6. The system according to any one of items 1 to 5, wherein,
in the cancer calculation module, a formula fitted to data on the methylation level, tumor fraction, tumor ploidy, cfDNA fragment length feature value and reads distribution of known samples is stored in advance to predict the probability that the subject has cancer, wherein the formula is Formula I:
[Formula I: equation image, Figure RE-GDA0003756602660000031, not reproduced in this text]
$P$ is the probability of the subject suffering from cancer;
$X_1$ is the tumor fraction;
$X_2$ is the tumor ploidy;
$X_3$ is the methylation level;
$X_4$ is the cfDNA fragment length feature value;
$X_5$ is the reads distribution;
$\beta_0$ is any value selected from 100 to 150, preferably 114.12;
$\beta_1$ is any value selected from -700 to 0, preferably -686.25;
$\beta_2$ is any value selected from 20 to 40, preferably 31.16;
$\beta_4$ is any value selected from -20 to 0, preferably -17.45.
7. The system of item 6, wherein $\beta_3 X_3$ is obtained by Formula II:
$$\beta_3 X_3 = a X_{31} + b X_{32} + c X_{33} + d X_{34} + e X_{35} + f X_{36} + g X_{37} \quad \text{(Formula II)}$$
$X_{31}$ is the methylation level at chromosome 2: 223721500-223726500 in the target region;
$X_{32}$ is the methylation level at chromosome 6: 170147000-170152000 in the target region;
$X_{33}$ is the methylation level at chromosome 8: 182500-187500 in the target region;
$X_{34}$ is the methylation level at chromosome 8: 64081500-64086500 in the target region;
$X_{35}$ is the methylation level at chromosome 9: 14688500-14693500 in the target region;
$X_{36}$ is the methylation level at chromosome 10: 119803500-119808500 in the target region;
$X_{37}$ is the methylation level at chromosome 15: 56285500-56290500 in the target region;
$a$ is any value selected from -300 to 300, preferably -260.66;
$b$ is any value selected from -200 to 200, preferably -181.95;
$c$ is any value selected from -100 to 100, preferably -85.21;
$d$ is any value selected from -350 to 350, preferably -305.36;
$e$ is any value selected from -250 to 250, preferably 218.80;
$f$ is any value selected from -100 to 100, preferably 60.85;
$g$ is any value selected from -250 to 250, preferably -209.25.
8. The system of item 6, wherein $\beta_5 X_5$ is obtained by Formula III:
$$\beta_5 X_5 = l X_{51} + m X_{52} + n X_{53} \quad \text{(Formula III)}$$
$X_{51}$ is the reads distribution derived from healthy plasma;
$X_{52}$ is the reads distribution derived from normal tissue;
$X_{53}$ is the reads distribution derived from tumor tissue;
$l$ is any value selected from 50 to 200, preferably 166.02;
$m$ is any value selected from 0 to 10, preferably 8.41;
$n$ is any value selected from -0.1 to 1, preferably 0.00000001.
9. The system according to item 1, wherein,
the system further includes a bisulfite treatment module for bisulfite treating cfDNA of the subject.
10. A method of cancer screening using the system of any one of items 1 to 9, the method comprising:
a data acquisition step of obtaining data of methylation level, tumor fraction, tumor ploidy, cfDNA fragment length feature value and reads distribution of a target region of a subject; and
a cancer calculation step of predicting the probability of the subject suffering from cancer based on the data of methylation level, tumor fraction, tumor ploidy, cfDNA fragment length feature value and reads distribution obtained in the data acquisition step.
11. The method of cancer screening of item 10, wherein,
the data acquisition step comprises a sequencing step, a methylation level analysis step, a tumor score analysis step, a tumor ploidy analysis step, a cfDNA fragment length feature value analysis step and a reads distribution analysis step;
the sequencing step performs whole genome sequencing on cfDNA of a subject;
the methylation level analysis step analyzes the methylation level of the target region based on the sequencing data obtained from the sequencing step;
the tumor score analysis step analyzes the tumor score based on the sequencing data obtained from the sequencing step;
the tumor ploidy analysis step analyzes tumor ploidy based on the sequencing data obtained from the sequencing step;
the cfDNA fragment length feature value analysis step calculates a length-related tumor risk feature value based on the lengths of the cfDNA fragments extracted from the sequencing data obtained from the sequencing step;
the reads distribution analysis step analyzes the reads distribution based on the sequencing data obtained from the sequencing step to predict the source of each read.
ADVANTAGEOUS EFFECTS OF INVENTION
The system of the application is built by jointly using indicators for methylation level, tumor fraction, tumor ploidy, cfDNA fragment length (fragment size) feature value and reads distribution. It provides a noninvasive cancer screening approach that can greatly reduce the cost of cancer screening and improve screening accuracy, with very high sensitivity and specificity.
The model: 1) can be used noninvasively for early screening of asymptomatic populations and for prognosis monitoring of cancer patients, reducing the harm caused by invasive testing; and 2) has higher sensitivity and accuracy and enables real-time monitoring.
The 5 markers used in the method can greatly reduce the cost of cancer screening and improve the accuracy of noninvasive cancer screening, with very high sensitivity and specificity. In addition, with these 5 markers, good prediction can be achieved using a comparatively simple and robust generalized linear model. The generalized linear model is an important class of data model that generalizes the classical linear model; it is widely applied, particularly in the statistical analysis of medical, biological and economic data, and is suitable for both discrete and continuous data. The generalized linear model extends the standard linear model by fitting a function of the conditional mean of the response variable (rather than the conditional mean itself) and by assuming that the response variable follows a distribution in the exponential family (not limited to the normal distribution); the model parameter estimates are derived by maximum likelihood estimation. Logistic regression is a very important model when the dependent variable has two or more classes. The logistic regression model is widely used in many fields because it places no requirement on normality or homogeneity of variance of the data, nor on the type of the independent variables. Moreover, generalized linear models are generally very fast to train, and the training process is easy to interpret.
Drawings
FIG. 1 is the ROC curve on the test set of the MODE model in example 2;
FIG. 2 is the ROC curve on the test set of the generalized linear regression model constructed from the tumor score (tumor fraction) in example 3;
FIG. 3 is the ROC curve on the test set of the generalized linear regression model constructed from the tumor ploidy in example 3;
FIG. 4 is the ROC curve on the test set of the generalized linear regression model constructed from the reads distribution in example 3;
FIG. 5 is the ROC curve on the test set of the generalized linear regression model constructed from the cfDNA fragment length (fragment size) feature value in example 3;
FIG. 6 is the ROC curve on the test set of the generalized linear regression model constructed from the methylation level in example 3;
FIG. 7 is the ROC curve on the test set of the generalized linear regression model constructed from the combination of tumor ploidy and reads distribution in example 4;
FIG. 8 is the ROC curve on the test set of the generalized linear regression model constructed from the combination of tumor ploidy, tumor score (tumor fraction) and cfDNA fragment length (fragment size) feature value.
Detailed Description
Specific embodiments of the present application will be described in more detail below with reference to the accompanying drawings. While specific embodiments of the present application are shown in the drawings, it should be understood that the present application may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
It should be noted that certain terms are used throughout the description and claims to refer to particular components. As one skilled in the art will appreciate, various names may be used to refer to a component. This specification and claims do not intend to distinguish between components that differ in name but not function. In the following description and in the claims, the terms "include" and "comprise" are used in an open-ended fashion, and thus should be interpreted to mean "include, but not limited to". The description which follows is a preferred embodiment of the present application, but is made for the purpose of illustrating the general principles of the application and not for the purpose of limiting the scope of the application. The scope of the present application is to be considered as defined by the appended claims.
Definitions
Unless specifically defined elsewhere herein, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
Methylation
Methylation is an important modification of proteins and nucleic acids. It regulates the expression and silencing of genes, is closely related to many diseases such as cancer, aging and senile dementia, and is one of the major research topics in epigenetics. The most common methylation modifications are DNA methylation and histone methylation.
DNA methylation refers to the methylation of the 5th carbon atom of cytosine in a CpG dinucleotide. It is an important epigenetic mechanism: as a stable modification state, it can be passed on to the newly synthesized daughter DNA during DNA replication under the action of DNA methyltransferase. Aberrant methylation includes hypermethylation of tumor suppressor genes and DNA repair genes, hypomethylation of repetitive DNA and loss of imprinting of certain genes, and is associated with the development of a variety of tumors.
In this context, the ROC curve may reflect the classification effect of the classifier to some extent. The AUC is actually the area under the ROC curve. AUC intuitively reflects the classification ability of ROC curve expression.
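As an illustration of the relationship between the ROC curve and the AUC, the following is a minimal Python sketch. It relies on the standard identity that the AUC equals the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one (ties counted as one half). The function name `roc_auc` and the toy data are hypothetical and are not taken from the application.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney identity).

    labels: 1 for cancer, 0 for non-cancer (illustrative encoding).
    scores: predicted probabilities or risk scores, one per sample.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

An AUC of 1.0 corresponds to perfect separation of the two groups, and 0.5 to a classifier that ranks no better than chance.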
Whole genome methylation sequencing
Whole genome bisulfite sequencing (WGBS) is regarded as the "gold standard" for methylation sequencing. The principle is as follows: bisulfite treatment converts unmethylated C bases in the genome to U, which become T after PCR amplification and can thus be distinguished from C bases that carry a methylation modification; combined with high-throughput sequencing and alignment to a reference sequence, this makes it possible to determine whether CpG/CHG/CHH sites are methylated.
A system for cancer screening includes a data acquisition module and a cancer calculation module. The data acquisition module is used for acquiring data of methylation level, tumor fraction, tumor ploidy, cfDNA fragment length (fragment size) feature value and reads distribution of the target region of the subject; and the cancer calculation module predicts the probability that the subject suffers from cancer based on the data for methylation level, tumor fraction, tumor ploidy, cfDNA fragment length (fragment size) feature value and reads distribution obtained in the data acquisition module.
In some embodiments of the present application, the data acquisition module comprises a sequencing module, a methylation level analysis module, a tumor score analysis module, a tumor ploidy analysis module, a cfDNA fragment length eigenvalue analysis module, and a reads distribution analysis module;
the sequencing module is used for performing whole genome sequencing on cfDNA of a subject;
the methylation level analysis module analyzes the methylation level of the target region based on the sequencing data obtained from the sequencing module;
the tumor score analysis module analyzes a tumor score based on sequencing data obtained from the sequencing module (the terms "tumor score" and "tumor proportion" are used interchangeably in this specification and both denote the tumor fraction);
the tumor ploidy analysis module analyzes tumor ploidy based on sequencing data obtained from the sequencing module;
the cfDNA fragment length eigenvalue analysis module calculates a length-related risk eigenvalue of the tumor based on the length of the cfDNA fragment extracted from the sequencing data obtained by the sequencing module;
the reads distribution analysis module analyzes the reads distribution based on the sequencing data obtained from the sequencing module to predict the source of each read.
Herein, the target region of the subject may be a specific region on a chromosome of the subject; for example, the target region includes any one or two or more of the following regions:
chromosome 2: 223721500-223726500,
chromosome 6: 170147000-170152000,
chromosome 8: 182500-187500,
chromosome 8: 64081500-64086500,
chromosome 9: 14688500-14693500,
chromosome 10: 119803500-119808500, or
chromosome 15: 56285500-56290500.
In a specific embodiment, the target region is chromosome 2: 223721500-223726500.
In a specific embodiment, the target region is chromosome 6: 170147000-170152000.
In a specific embodiment, the target region is chromosome 8: 182500-187500.
In a specific embodiment, the target region is chromosome 8: 64081500-64086500.
In a specific embodiment, the target region is chromosome 9: 14688500-14693500.
In a specific embodiment, the target region is chromosome 10: 119803500-119808500.
In a specific embodiment, the target region is chromosome 15: 56285500-56290500.
To screen for target regions, a window is first slid along the reference genome (sliding window method) and the overall methylation level of the CpG sites in each window is calculated. The methylation level of each window is computed for every sample, differentially methylated windows are identified by comparing the sample groups, and windows with a methylation difference greater than 0.2 are selected as target regions.
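The window selection described above can be sketched as follows. This is a minimal illustration only: the application states the 0.2 threshold but does not specify how the difference between groups is computed, so comparing group means is an assumption, and the function name `select_target_windows` and the toy data are hypothetical.

```python
def select_target_windows(cancer_levels, healthy_levels, min_diff=0.2):
    """Return indices of windows whose methylation differs between groups.

    cancer_levels / healthy_levels: one list per sample, each holding the
    methylation level (0..1) of every window in the same order.
    ASSUMPTION: the group difference is the absolute difference of group
    means; the source only gives the threshold (> 0.2).
    """
    def column_means(rows):
        n = len(rows)
        return [sum(col) / n for col in zip(*rows)]

    cancer_mean = column_means(cancer_levels)
    healthy_mean = column_means(healthy_levels)
    return [
        i
        for i, (c, h) in enumerate(zip(cancer_mean, healthy_mean))
        if abs(c - h) > min_diff
    ]


if __name__ == "__main__":
    cancer = [[0.9, 0.50, 0.10], [0.8, 0.50, 0.20]]
    healthy = [[0.3, 0.45, 0.15], [0.4, 0.50, 0.10]]
    print(select_target_windows(cancer, healthy))
```

In practice a statistical test across samples would probably accompany the effect-size threshold, but the application does not describe one, so none is shown here.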
The methylation level of the target region is calculated from the methylation level of each CG site in the target region, wherein the methylation level of a CG site is the ratio of methylated cytosines to the sum of methylated and unmethylated cytosines detected at that site in the sequencing results.
For each window, the number of CG sites is counted. Since the depth of methylated cytosine and the total depth at each CG site are known, the methylation level of the whole window is calculated as the sum of the methylated-cytosine depths over all CG sites divided by the sum of the total depths over all CG sites. Each window is processed in this way to give its methylation level. Here, the depth of methylated cytosine at a CG site is the number of reads in which the sequencing result shows methylated cytosine at that site, i.e. reads in which the site is read as C (cytosine); the total depth of the site is the total number of sequencing reads covering it, i.e. reads in which the site is read as C or T (thymine). Both depths can be obtained directly from the output of the sequencing analysis software.
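The per-window calculation described above amounts to a ratio of summed depths. A minimal sketch follows; the function name `window_methylation_level` and the input layout (one `(methylated_depth, total_depth)` pair per CG site) are illustrative assumptions, not the output format of any particular sequencing software.

```python
def window_methylation_level(cg_sites):
    """Methylation level of one window.

    cg_sites: iterable of (methylated_depth, total_depth) pairs, one per CG
    site, where methylated_depth counts reads showing C at the site and
    total_depth counts reads showing C or T.
    Returns None when the window has no coverage.
    """
    cg_sites = list(cg_sites)
    methylated = sum(m for m, _ in cg_sites)
    total = sum(t for _, t in cg_sites)
    if total == 0:
        return None
    return methylated / total


if __name__ == "__main__":
    # Three CG sites: 8/10, 2/10 and 5/5 methylated reads.
    print(window_methylation_level([(8, 10), (2, 10), (5, 5)]))
```

Note that summing depths before dividing weights each site by its coverage, which differs from averaging the per-site ratios.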
Tumor fraction, also called tumor ratio, is the proportion of ctDNA in cfDNA, i.e. the proportion of cfDNA from tumor cells in the total cfDNA. Peripheral circulating cell-free DNA (cfDNA) consists of DNA fragments released into plasma after cell apoptosis, necrosis, lesion and the like.
Herein, cfDNA refers to the various forms of free DNA fragments in peripheral circulating blood, and ctDNA refers to the free DNA among the cfDNA that is derived from tumor cells; ctDNA is therefore one type of cfDNA.
The tumor fraction was determined with the ichorCNA software: whole genome sequencing of cfDNA is used to detect somatic copy number variations (SCNAs) in the cfDNA, from which the tumor fraction is quantitatively calculated.
Tumor ploidy refers to the true content of cancer cells in a tumor sample caused by abnormal chromosome structure and number, and represents the ploidy number of the tumor. Normal cells are diploid, whereas some tumor cells become polyploid as a result of such variation. The occurrence of DNA aneuploidy is related to the rate of malignant transformation and the degree of proliferation of precancerous lesions, and is an important indicator that a precancerous lesion is becoming cancerous. Tumor ploidy was likewise determined with the ichorCNA software: whole genome sequencing of cfDNA is used to detect somatic copy number variations (SCNAs) in the cfDNA, from which the tumor ploidy is quantitatively calculated.
Studies show that when a tumor arises, cells proliferate malignantly and copy number variation generally occurs; as the tumor develops, cfDNA from tumor cells is released into the blood. After evaluating read depth, the ichorCNA software uses an HMM to call copy number variation and, based on the copy number variation, uses the EM algorithm to estimate the tumor fraction and the tumor ploidy.
The cfDNA fragment length (fragment size) refers to the average size of the fragments found in the cfDNA sequencing data, i.e. the ratio of the sum of the sizes of all fragments found in a subject's cfDNA sequencing data to the number of fragments. The extracted fragment information is tiled into adjacent, non-overlapping 100 kb intervals along the autosomes of the hg19 reference genome. Fragments of 150 to 200 bp are defined as short, fragments of 201 to 320 bp are defined as long, and cov = short + long. The numbers of short, long and cov fragments in each interval, i.e. the fragment coverage, are counted. Short, long and cov are corrected with the LOWESS algorithm to remove coverage bias caused by GC content. The 100 kb intervals are then merged sequentially into 5 Mb intervals, giving 499 non-overlapping intervals, and the short, long and cov counts of each 5 Mb interval are tallied; short/long is the short-to-long fragment ratio. Following analysis, cov (499 intervals, corresponding to 499 cov features) was selected as the feature set, dimensionality was reduced with PCA, and a gradient boosting tree model was then used to derive the risk value.
The cfDNA fragment length feature value analysis module calculates a length-related tumor risk feature value based on the lengths of the cfDNA fragments extracted from the sequencing data obtained by the sequencing module, the cfDNA fragment length feature value being calculated by a gradient boosting tree model from the ratio between short and long fragments among the cfDNA fragment lengths;
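The counting part of this procedure (100 kb tiling, short/long classification, merging into 5 Mb intervals) can be sketched as below. This is an illustrative outline only: the GC correction, PCA and gradient boosting steps are omitted, the input layout `(chrom, start, length)` and all function names are hypothetical, and plain merging of every 50 consecutive bins will not by itself reproduce the 499 intervals reported in the text, which presumably depend on filtering not described here.

```python
from collections import defaultdict

SHORT = (150, 200)   # bp range defined as "short" in the text
LONG = (201, 320)    # bp range defined as "long" in the text


def bin_fragments(fragments, bin_size=100_000):
    """Count short and long fragments per 100 kb bin.

    fragments: iterable of (chrom, start, length) tuples (assumed layout).
    Returns {(chrom, bin_index): [short_count, long_count]}.
    """
    counts = defaultdict(lambda: [0, 0])
    for chrom, start, length in fragments:
        key = (chrom, start // bin_size)
        if SHORT[0] <= length <= SHORT[1]:
            counts[key][0] += 1
        elif LONG[0] <= length <= LONG[1]:
            counts[key][1] += 1
        # fragments outside both ranges are ignored
    return counts


def merge_bins(counts, factor=50):
    """Merge consecutive 100 kb bins into 5 Mb bins (50 x 100 kb)."""
    merged = defaultdict(lambda: [0, 0])
    for (chrom, idx), (short, long_) in counts.items():
        target = merged[(chrom, idx // factor)]
        target[0] += short
        target[1] += long_
    return merged


def bin_features(merged):
    """Per-bin short, long, cov (= short + long) and short/long ratio."""
    features = {}
    for key, (short, long_) in merged.items():
        features[key] = {
            "short": short,
            "long": long_,
            "cov": short + long_,
            "ratio": short / long_ if long_ else None,
        }
    return features


if __name__ == "__main__":
    frags = [("chr1", 10, 160), ("chr1", 120_000, 250), ("chr1", 5_000_000, 300)]
    print(bin_features(merge_bins(bin_fragments(frags))))
```

The resulting `cov` values per 5 Mb interval correspond to the features that the text says were passed on to PCA and the gradient boosting tree model.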
preferably, the long fragment is a fragment with the length of 201-320 bp;
the short fragment is a fragment with the length of 150-200 bp;
the reads distribution refers to the probability of cfDNA sample origin by counting each read in the cfDNA sequencing data. The probability of the reads from tumor cells is predicted according to the distribution of the reads of the existing cancer patients by using cancer Detecott software. The method adopts the mixed model hypothesis of three tissue sources of healthy sample plasma, normal tissue and tumor tissue instead of the traditional two-component mixed hypothesis of the healthy sample plasma and the tumor tissue, and theoretically, the mixed model hypothesis can not only give the judgment of health or sickness to the sample to be detected, but also give the accurate judgment of whether the sample is cancer or not for the sample judged to be sick. And calculating the characteristic value of the distribution pattern of the methylation level of each sample in three grouped samples of a healthy plasma sample, a normal tissue sample and a tumor tissue sample aiming at the data of the methylation level of each sample, thereby obtaining the tumor prediction probability of the sequenced sample. I.e. the probability that a given reads originates from three source components, healthy plasma, normal tissue and tumor tissue.
In a specific embodiment, the cfDNA sequencing data is cfDNA sequencing data from which low-quality sequencing fragments have been removed; preferably, sequencing data falling within low-alignment intervals is further excluded after the low-quality sequencing fragments have been removed. Low-quality sequencing fragments are generally reads of relatively low sequencing quality, for example reads that align to multiple chromosomal positions, reads that contain adapter (linker) sequence, or reads whose overall sequencing quality is below Q20. Removing low-quality reads can improve the accuracy of subsequent analysis and reduce false positives.
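The three exclusion criteria named above can be illustrated with a small filter. This is a sketch under stated assumptions: quality strings are Phred+33 encoded, "overall quality below Q20" is interpreted as a mean base quality below 20, and the function names and arguments are hypothetical rather than taken from any specific tool.

```python
def mean_phred(quality_string, offset=33):
    """Mean Phred score of a read, assuming Phred+33 ASCII encoding."""
    if not quality_string:
        return 0.0
    return sum(ord(ch) - offset for ch in quality_string) / len(quality_string)


def keep_read(quality_string, n_alignments, has_adapter, min_mean_q=20):
    """Keep a read only if it maps uniquely, has no adapter, and mean Q >= 20."""
    if n_alignments != 1:      # aligned to multiple (or zero) positions
        return False
    if has_adapter:            # adapter (linker) sequence still present
        return False
    return mean_phred(quality_string) >= min_mean_q
```

In practice these checks are usually delegated to quality-control and alignment software; the sketch only makes the decision rule explicit.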
In the cancer calculation module of the present application, a formula fitted to the methylation level, tumor score (tumor fraction), tumor ploidy (tumor ploidy), cfDNA fragment length (fragment size) characteristic value and reads distribution data of known samples is stored in advance and is used to predict the probability that the subject suffers from cancer. The methylation level of the target region, tumor score (tumor fraction), tumor ploidy (tumor ploidy), cfDNA fragment length (fragment size) characteristic value and reads distribution of the subject, acquired by the data acquisition module, are substituted into the formula in the cancer calculation module to obtain the probability that the subject suffers from cancer.
The formula is fitted by logistic regression using a generalized linear model (GLM) on the methylation level, tumor score (tumor fraction), tumor ploidy (tumor ploidy), cfDNA fragment length (fragment size) characteristic value and reads distribution data of known samples. The values of the 5 markers are imported into R, where they are integrated with the GLM algorithm: the response variable follows a binomial distribution, the link function is the logit, and the regression coefficients are solved by maximum likelihood estimation with 100 iterations, finally yielding the MODE model.
Wherein the MODE model is formula one:
logit(P) = ln(P/(1-P)) = β0 + β1*X1 + β2*X2 + β3*X3 + β4*X4 + β5*X5 (formula one)
Model interpretation: p is the probability of outcome of exposure to a certain condition, and logit (P) is a variable transformation mode, which means that P is subjected to logit transformation, and β is a partial regression coefficient, which means the estimated value of each unit of logit (P) change under the condition that other independent variables are unchanged.
P is the probability of the subject suffering from cancer;
X 1 is the tumor score;
X 2 is tumor ploidy;
X 3 is the methylation level;
X 4 is a cfDNA fragment length characteristic value;
X 5 is a reads distribution;
β 0 is any value selected from 100 to 150, preferably 114.12;
β 1 is any value selected from-700 to 0, preferably-686.25;
β 2 is any value selected from 20 to 40, preferably 31.16;
β 4 is any value selected from-20 to 0, preferably-17.45;
the prediction function in R can be used to obtain the probability value predicted by MODE, thereby judging the probability of the sample suffering from cancer.
The generalized linear model extends the standard linear model by fitting a function of the conditional mean of the response variable (rather than the conditional mean itself) and by assuming that the response variable follows a distribution in the exponential family (not limited to the normal distribution); the model parameter estimates are derived by maximum likelihood estimation. Logistic regression is an important model when the dependent variable has two or more classes. It is widely used in many fields because it does not require normality or homogeneity of variance of the data and places no restriction on the types of the independent variables.
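To make the maximum likelihood fitting concrete, the following sketch fits a binomial GLM with a logit link for a single marker by Newton-Raphson iteration, capped at 100 iterations. It is an illustration only and is not the R routine used in the present application; the function names and the toy data are hypothetical.

```python
import math


def sigmoid(z):
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, max_iter=100, tol=1e-8):
    """Fit logit(P) = b0 + b1*x by Newton-Raphson maximum likelihood."""
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)          # binomial variance weight
            g0 += y - p                # score for the intercept
            g1 += (y - p) * x          # score for the slope
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        if det <= 1e-12:               # information matrix is singular; stop
            break
        # Newton step: solve the 2x2 system (X'WX) d = score
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1


# Hypothetical marker values and labels (1 = cancer, 0 = healthy), with overlap
xs = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
ys = [0, 0, 1, 0, 1, 0, 1, 1]
b0, b1 = fit_logistic(xs, ys)
```

The toy classes overlap on purpose: when the two classes are perfectly separable, the maximum likelihood estimates diverge and the iteration does not settle on finite coefficients.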
In some embodiments of the present application, β3*X3 is obtained by formula two:
β3*X3 = a*X31 + b*X32 + c*X33 + d*X34 + e*X35 + f*X36 + g*X37 (formula two)
X31 is the methylation level at positions 223721500-223726500 of chromosome 2 in the target region;
X32 is the methylation level at positions 170147000-170152000 of chromosome 6 in the target region;
X33 is the methylation level at positions 182500-187500 of chromosome 8 in the target region;
X34 is the methylation level at positions 64081500-64086500 of chromosome 8 in the target region;
X35 is the methylation level at positions 14688500-14693500 of chromosome 9 in the target region;
X36 is the methylation level at positions 119803500-119808500 of chromosome 10 in the target region;
X37 is the methylation level at positions 56285500-56290500 of chromosome 15 in the target region;
a is any value selected from -300 to 300, preferably -260.66;
b is any value selected from -200 to 200, preferably -181.95;
c is any value selected from -100 to 100, preferably -85.21;
d is any value selected from -350 to 350, preferably -305.36;
e is any value selected from -250 to 250, preferably 218.80;
f is any value selected from -100 to 100, preferably 60.85;
g is any value selected from -250 to 250, preferably -209.25.
In some embodiments of the present application, β5*X5 is obtained by formula three:
β5*X5 = l*X51 + m*X52 + n*X53 (formula three)
X51 is the reads distribution derived from healthy plasma;
X52 is the reads distribution derived from normal tissue;
X53 is the reads distribution derived from tumor tissue;
l is any value selected from 50 to 200, preferably 166.02;
m is any value selected from 0 to 10, preferably 8.41;
n is any value selected from -0.1 to 1, preferably 0.00000001.
In one embodiment of the present application, MODE = 114.12 + SCORE_TF + SCORE_TP + SCORE_DMV + SCORE_FS + SCORE_READS
SCORE_TF = -686.25 * tumor score
SCORE_TP = 31.16 * tumor ploidy
SCORE_DMV = (-260.66 * methylation level 1) + (-181.95 * methylation level 2) + (-85.21 * methylation level 3) + (-305.36 * methylation level 4) + (218.80 * methylation level 5) + (60.85 * methylation level 6) + (-209.25 * methylation level 7)
Tag | Chromosome | Start | End
Methylation level 1 | Chr2 | 223721500 | 223726500
Methylation level 2 | Chr6 | 170147000 | 170152000
Methylation level 3 | Chr8 | 182500 | 187500
Methylation level 4 | Chr8 | 64081500 | 64086500
Methylation level 5 | Chr9 | 14688500 | 14693500
Methylation level 6 | Chr10 | 119803500 | 119808500
Methylation level 7 | Chr15 | 56285500 | 56290500
SCORE_FS = -17.45 * cfDNA fragment length characteristic value
SCORE_READS = (166.02 * reads distribution of healthy plasma) + (8.41 * reads distribution of normal tissue) + (0.00000001 * reads distribution of tumor tissue)
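The MODE calculation of this embodiment can be written out as a short sketch using the preferred coefficients given above. Treating the MODE value as logit(P) and converting it to a probability with the inverse logit follows the model interpretation of formula one; the function and constant names are hypothetical.

```python
import math

BETA0 = 114.12
B_TF, B_TP, B_FS = -686.25, 31.16, -17.45
# Coefficients a..g for methylation levels 1..7 (formula two)
METH_COEF = (-260.66, -181.95, -85.21, -305.36, 218.80, 60.85, -209.25)
# Coefficients l, m, n for healthy plasma, normal tissue, tumor tissue (formula three)
READS_COEF = (166.02, 8.41, 0.00000001)


def mode_score(tumor_fraction, tumor_ploidy, meth_levels, fragment_value, reads):
    """Linear predictor: intercept plus the five SCORE terms."""
    if len(meth_levels) != len(METH_COEF) or len(reads) != len(READS_COEF):
        raise ValueError("expected 7 methylation levels and 3 reads values")
    score = BETA0 + B_TF * tumor_fraction + B_TP * tumor_ploidy
    score += sum(c * x for c, x in zip(METH_COEF, meth_levels))
    score += B_FS * fragment_value
    score += sum(c * x for c, x in zip(READS_COEF, reads))
    return score


def probability(score):
    """Numerically stable inverse logit: P = 1 / (1 + exp(-score))."""
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    e = math.exp(score)
    return e / (1.0 + e)
```

The inverse logit is split into two branches only to avoid overflow when the score is large in magnitude, which can happen given the size of the coefficients.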
The system described herein can further include a bisulfite treatment module for bisulfite treating cfDNA of a subject. The bisulfite-treated cfDNA is used for subsequent cfDNA sequencing.
The present application also provides a method of cancer screening using the above system, the method comprising:
a data collection step of obtaining data on the methylation level of a target region, tumor score (tumor fraction), tumor ploidy (tumor ploidy), cfDNA fragment length (fragment size) characteristic value and reads distribution of a subject;
a cancer calculation step of predicting the probability of the subject suffering from cancer based on the data on methylation level, tumor score (tumor fraction), tumor ploidy (tumor ploidy), cfDNA fragment length (fragment size) characteristic value and reads distribution obtained in the data collection step.
In some specific embodiments, the data collection step comprises a sequencing step, a methylation level analysis step, a tumor score analysis step, a tumor ploidy analysis step, a cfDNA fragment length characteristic value analysis step, and a reads distribution analysis step;
the sequencing step performs whole genome sequencing on cfDNA of a subject;
the methylation level analysis step analyzes the methylation level of the target region based on the sequencing data obtained in the sequencing step;
the tumor score analysis step analyzes the tumor score based on the sequencing data obtained in the sequencing step;
the tumor ploidy analysis step analyzes the tumor ploidy based on the sequencing data obtained in the sequencing step;
the cfDNA fragment length characteristic value analysis step calculates a length-related tumor risk characteristic value based on the lengths of the cfDNA fragments extracted from the sequencing data obtained in the sequencing step;
the reads distribution analysis step analyzes the reads distribution based on the sequencing data obtained in the sequencing step to predict the source of each read.
Examples
Example 1 calculation of differentially methylated regions and calculation of fragment group characteristics
1.1cfDNA extraction purification
1.1.1 plasma sample preparation:
The blood samples were centrifuged at 2000g for 10min at 4 ℃ and the plasma was transferred to a new centrifuge tube. The plasma was then centrifuged at 16000g for 10min at 4 ℃ before proceeding to the next step. The conditions depend on the type of collection tube used; those used in this experiment were as follows.
Figure RE-GDA0003756602660000151
1.1.2 Lysis and binding
1.1.2.1. The Binding Solution/Beads Mix was prepared according to the following table and then thoroughly mixed.
Figure RE-GDA0003756602660000152
Figure RE-GDA0003756602660000161
An appropriate volume of plasma sample was added.
1.1.2.2. The plasma samples were thoroughly mixed with Binding Solution/Beads Mix.
1.1.2.3. Binding was carried out on a rotary mixer for 10min so that the cfDNA bound fully to the magnetic beads.
1.1.2.4. The binding tube was placed on a magnetic stand for 5min until the solution became clear and the magnetic beads were completely adsorbed on the magnetic stand.
1.1.2.5. The supernatant was carefully discarded with a pipette, the tube was kept on the magnetic rack for several minutes, and the residual supernatant was removed with a pipette.
1.1.3 washing
1.1.3.1. The beads were resuspended in 1ml of Wash Solution.
1.1.3.2. The resuspension was transferred to a new non-adsorbing 1.5ml centrifuge tube. The binding tube was retained.
1.1.3.3. The centrifuge tube containing the bead resuspension was placed on a magnetic rack for 20 s.
1.1.3.4. The separated supernatant was aspirated and used to rinse the binding tube; the residual beads recovered in this way were returned to the resuspension, and the lysis/binding tube was discarded.
1.1.3.5. The tube was placed on a magnetic rack for 2min until the solution became clear, the beads were collected in the magnetic rack, and the supernatant was removed with a 1ml pipette.
1.1.3.6. The tube was left on the magnetic rack and the remaining liquid was removed as much as possible with a 200 μ L pipette.
1.1.3.7. The tube was removed from the magnetic stand and 1ml of Wash Solution was added, vortexed for 30 s.
1.1.3.8. Place on magnetic rack for 2min until the solution cleared, the beads were collected on the magnetic rack, and the supernatant was removed with a 1ml pipette.
1.1.3.9. The tube was left on the magnetic rack and the residual liquid was removed thoroughly with a 200 μ L pipette.
1.1.3.10. The tube was removed from the magnetic stand, 1ml of 80% ethanol was added, and vortexed for 30 s.
1.1.3.11. The solution became clear by placing on a magnetic rack for 2min and the supernatant was removed with a 1ml pipette.
1.1.3.12. The tube was left on the magnetic rack and the remaining liquid was removed with a 200 μ L pipette.
1.1.3.13. Steps 1.1.3.10 to 1.1.3.12 were repeated once with 80% ethanol, removing as much of the supernatant as possible.
1.1.3.14. The tube was left on the magnetic stand and the beads were dried in air for 3-5 minutes.
1.1.4 elution of cfDNA
1.1.4.1. Elution Solution was added as in the following table.
Figure RE-GDA0003756602660000171
1.1.4.2. The solution was settled on a magnetic rack for 2min and cfDNA in the supernatant was aspirated.
1.1.4.3. Purified cfDNA was used immediately, or the supernatant was transferred to a new centrifuge tube and stored at-20 ℃.
1.2 gDNA disruption and purification:
1.2.1. According to the Qubit concentration, 2 μg of DNA was taken, made up to 125 μl with water, and added to a Covaris 130 μl stoptube, and the program was set to: 50W, 20%, 200 cycles, 250 s.
1.2.2. After disruption, 1 μl of sample was taken for fragment detection on the Agilent 2100; the main peak of a normally disrupted sample is at about 150bp-200bp.
For cfDNA samples, fragment detection was performed on the Agilent 2100 and the concentration was measured directly with Qubit before subsequent experiments.
1.3 end repair, 3' end addition of "A":
1.3.1. x ng of fragmented gDNA or cfDNA was placed in a PCR tube and made up to 50 μl with nuclease-free water; the following reagents were added and mixed by vortexing:
Component | Volume
gDNA/cfDNA | 50μl
End Repair & A-Tailing Buffer | 7μl
End Repair & A-Tailing Enzyme Mix | 3μl
Total volume | 60μl
1.3.2. The following program was set up to perform the reaction on a PCR instrument:
the hot lid temperature was 85 ℃.
Figure RE-GDA0003756602660000172
Figure RE-GDA0003756602660000181
1.4 linker ligation and purification:
1.4.1. The linker was diluted in advance to the appropriate concentration with reference to the following table:
Fragmented DNA per 50 μl ER&AT reaction | Adapter concentration
1μg | 10μM
500ng | 10μM
250ng | 10μM
100ng | 10μM
50ng | 10μM
25ng | 10μM
10ng | 3μM
5ng | 5μM
2.5ng | 2.5μM
1ng | 625nM
1.4.2. the following reagents were prepared according to the following table, gently pipetted and mixed, and briefly centrifuged:
Figure RE-GDA0003756602660000182
Figure RE-GDA0003756602660000191
1.4.3. The following program was set up to perform the reaction on a PCR instrument:
No heated lid.
Temperature | Time
20℃ | 30min
4℃
1.4.4. Purification beads were added according to the following system (the Agencourt AMPure XP beads were brought to room temperature in advance and mixed well before use):
Component | Volume
Linker ligation product | 110μl
Agencourt AMPure XP beads | 110μl
Total volume | 220μl
1.4.4.1. Mix by gently pipetting 6 times.
1.4.4.2. Incubate at room temperature for 5-15min, then place the PCR tube on a magnetic stand for 3min until the solution clears.
1.4.4.3. The supernatant was removed; with the PCR tube kept on the magnetic stand, 200 μl of 80% ethanol solution was added and the tube was left to stand for 30 s.
1.4.4.4. The supernatant was removed, 200 μl of 80% ethanol solution was added to the PCR tube, and after standing for 30 s the supernatant was removed thoroughly (a 10 μl pipette is recommended for removing the residual ethanol at the bottom).
1.4.4.5. Let stand at room temperature for 3-5min so that the residual ethanol evaporates completely.
1.4.4.6. Add 22 μl of nuclease-free water, remove the PCR tube from the magnetic stand, gently pipette to resuspend the beads while avoiding air bubbles, and let stand at room temperature for 2min.
1.4.4.7. The PCR tube was placed on a magnetic stand for 2min until the solution cleared.
1.4.4.8. Pipette 20 μl of the supernatant and transfer it to a new PCR tube.
1.5 bisulfite treatment and purification:
1.5.1. The required reagents were taken out in advance and thawed. The reagents were added according to the following table:
Component | High-concentration sample (1 ng-2 μg) volume | Low-concentration sample (1-500 ng) volume
Purified linker ligation product | 20μl | 40μl
Bisulfite solution | 85μl | 85μl
DNA Protect Buffer | 35μl | 15μl
Total volume | 140μl | 140μl
1.5.2. After the DNA Protect Buffer is added, the liquid turns blue. Mix by gentle pipetting, then divide into two tubes and place them on the PCR instrument.
1.5.3. The following program was set up and run:
Heated lid: 105 ℃.
Temperature | Time
95℃ | 5min
60℃ | 10min
95℃ | 5min
60℃ | 10min
4℃
1.5.4. After brief centrifugation, the two tubes of the same sample were combined into one clean 1.5ml centrifuge tube.
1.5.5. Mu.l Buffer BL (sample size less than 100ng plus 1. mu.l Carrier RNA (1. mu.g/. mu.l)) was added to each sample, vortexed, and briefly centrifuged.
1.5.6. Add 250 μl of absolute ethanol to each sample, vortex for 15 s, centrifuge briefly, and load the mixture onto the corresponding prepared spin column.
1.5.7. Let stand for 1min and centrifuge for 1min; transfer the liquid in the collection tube back onto the spin column, centrifuge for 1min, and discard the flow-through.
1.5.8. Add 500 μl of Buffer BW (check that absolute ethanol has been added to it), centrifuge for 1min, and discard the flow-through.
1.5.9. Add 500 μl of Buffer BD (check that absolute ethanol has been added to it), close the lid, and let stand at room temperature for 15min. Centrifuge for 1min and discard the flow-through.
1.5.10. Add 500 μl of Buffer BW (check that absolute ethanol has been added to it), centrifuge for 1min, and discard the flow-through; repeat once, for 2 washes in total.
1.5.11. Add 250 μl of absolute ethanol, centrifuge for 1min, place the column in a new 2ml collection tube, and discard all remaining liquid.
1.5.12. Place the column in a clean 1.5ml centrifuge tube, add 20 μl of nuclease-free water to the center of the column membrane, gently close the lid, let stand at room temperature for 1min, and centrifuge for 1min.
1.5.13. Transfer the liquid in the collection tube back onto the spin column, let stand at room temperature for 1min, and centrifuge for 1min.
1.6 Pre-hybridization Pre-amplification and purification:
1.6.1. Prepare the reaction system according to the following table, mix by pipetting, and centrifuge briefly:
Component | Volume
Purified product after bisulfite treatment | 20μl
Amplification enzyme | 25μl
Upstream primer (10 μM) | 2.5μl
Downstream primer (10 μM) | 2.5μl
Total volume | 50μl
1.6.2. The following program was set up and the PCR program was started:
Heated lid: 105 ℃.
Figure RE-GDA0003756602660000211
1.6.3. The number of PCR cycles was adjusted according to the DNA input; reference data are as follows:
Figure RE-GDA0003756602660000221
1.6.4. After the reaction, 50 μl of Agencourt AMPure XP magnetic beads was added to the PCR tube and mixed by pipetting, avoiding bubbles (the Agencourt AMPure XP beads were mixed and equilibrated to room temperature in advance).
1.6.5. Incubate at room temperature for 5-15min, then place the PCR tube on a magnetic stand for 3min until the solution clears.
1.6.6. The supernatant was removed; with the PCR tube kept on the magnetic stand, 200 μl of 80% ethanol solution was added and the tube was left to stand for 30 s.
1.6.7. The supernatant was removed, 200 μl of 80% ethanol solution was added to the PCR tube, and after standing for 30 s the supernatant was removed completely (a 10 μl pipette is recommended for removing the residual ethanol at the bottom).
1.6.8. Let stand at room temperature for 5min so that the residual ethanol evaporates completely.
1.6.9. 30 μl of nuclease-free water was added, the tube was removed from the magnetic stand, and the beads were gently resuspended by pipetting.
1.6.10. After standing at room temperature for 2min, the 200 μl PCR tube was placed on a magnetic stand for 2min until the solution cleared.
1.6.11. The supernatant was transferred with a pipette to a new 200 μl PCR tube (kept on ice), and the tube was labeled with the sample number, ready for the next reaction.
1.6.12. A 1 μl sample was taken to determine the library concentration with the Qubit, and the concentration was recorded.
1.6.13. A 1 μl sample was taken to determine the library fragment length on the Agilent 2100; the library length is approximately 270bp to 320bp.
1.6.14. Sequencing was performed using the Illumina high throughput sequencing platform.
1.6.15. Methylation and tumor data generation analysis procedure.
The process is as follows: the quality of the raw sequencing data is checked with quality control software such as fastp, and low-quality reads are filtered, trimmed or removed to obtain the corresponding clean data; the clean data are aligned to the reference genome (hg19) with Bismark using bowtie2; the initial alignment bam file is deduplicated with deduplicate_bismark; the methylation site information is extracted with bismark_methylation_extractor to obtain the final methylated CG file (containing the information for every single CG site); the methylation level of the target region is calculated with sliding windows, the tumor score (tumor fraction) and tumor ploidy (tumor ploidy) are calculated with ichorCNA, the tumor probability is predicted from the cfDNA fragment size, and the tumor proportion is calculated from the reads distribution.
Methylation level: sliding windows are used, and the number of CG sites in each window is counted. Since the depth of methylated cytosine at each CG site and the total depth of the site are known, the methylation level of the whole window can be calculated as the sum of the methylated-cytosine depths of all CG sites in the window divided by the sum of their total depths. Each window is processed in this way to give its methylation level. Here, the methylated-cytosine depth of a CG site is the number of reads in which the site is read as C (cytosine), i.e., detected as methylated cytosine, and the total depth of the site is the total number of sequenced reads covering the site, i.e., the number of reads in which the site is read as C or T (thymine).
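The window-level calculation can be sketched as below. The input layout, one (chromosome, position, methylated depth, total depth) tuple per CG site, and the inclusive window boundaries are assumptions made for illustration; the function name is hypothetical.

```python
def window_methylation(sites, chrom, start, end):
    """Return (number of CG sites, methylation level) for one window.

    The level is the sum of methylated-cytosine depths divided by the sum
    of total depths over all CG sites whose position lies in [start, end].
    """
    methylated = total = n_sites = 0
    for site_chrom, pos, meth_depth, total_depth in sites:
        if site_chrom == chrom and start <= pos <= end:
            methylated += meth_depth
            total += total_depth
            n_sites += 1
    level = methylated / total if total else None   # None if no coverage
    return n_sites, level


# Hypothetical CG sites; the third lies outside the example window.
sites = [
    ("chr2", 223721600, 8, 10),
    ("chr2", 223722000, 2, 10),
    ("chr2", 300, 5, 5),
]
n_sites, level = window_methylation(sites, "chr2", 223721500, 223726500)
```

Because depths are summed before dividing, deeply covered sites weigh more than shallow ones, which differs from averaging per-site methylation ratios.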
Tumor score (tumor fraction) and tumor ploidy (tumor ploidy): bam files are processed with the ichorCNA software, CNAs are called with an HMM (hidden Markov model), and the tumor score (tumor fraction) and tumor ploidy (tumor ploidy) are calculated with the EM (expectation-maximization) algorithm. The detailed workflow of the ichorCNA software is as follows:
(1) the bin size (10M or 5M) is set first, and the number of reads in each bin is calculated;
(2) the number of reads in each bin is corrected for GC content, mappability and depth differences;
(3) the corrected read count of each bin is compared with the read count of the corresponding bin in the panel of normals built into the software to calculate a logR value;
(4) the result and the maximum likelihood of each candidate solution are calculated with the HMM and EM algorithms, which are internal models of the ichorCNA software;
(5) the solution with the maximum likelihood is selected as the final result, namely the tumor score (tumor fraction) and the tumor ploidy (tumor ploidy).
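Step (3) of the workflow above can be illustrated with a simplified log-ratio calculation. This is not the ichorCNA implementation: the median normalization used here to account for depth differences is an assumption, the GC and mappability corrections of step (2) and the HMM/EM steps are not shown, and the function name is hypothetical.

```python
import math
from statistics import median


def log_ratios(sample_counts, normal_counts):
    """Per-bin log2 ratio of depth-normalized sample vs. normal read counts."""
    sample_median = median(sample_counts)
    normal_median = median(normal_counts)
    if sample_median <= 0 or normal_median <= 0:
        raise ValueError("median read count must be positive")
    ratios = []
    for s, n in zip(sample_counts, normal_counts):
        if s <= 0 or n <= 0:
            ratios.append(None)        # bin without usable coverage
        else:
            ratios.append(math.log2((s / sample_median) / (n / normal_median)))
    return ratios


# Hypothetical counts: the middle bin has twice the expected coverage.
logr = log_ratios([100, 200, 100], [100, 100, 100])
```

A log ratio near 0 indicates coverage in line with the normal reference, while positive or negative values point to a relative gain or loss in that bin.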
cfDNA fragment length calculation: for the resulting bam file, reads with MAPQ < 30 are filtered out and whole-genome fragment information is extracted using the R package GCcontent. The extracted fragment information is tiled into adjacent, non-overlapping 100 kb intervals along the autosomes of the hg19 reference genome. Fragments of 150 to 200 bp are defined as short, fragments of 201 to 320 bp are defined as long, and cov = short + long. The numbers of short, long and cov fragments in each interval, i.e., the fragment coverage, are counted. Short, long and cov are corrected using the LOWESS algorithm to remove coverage bias caused by GC content. The 100 kb intervals are then merged sequentially into 5 Mb intervals, giving 499 non-overlapping intervals, and the numbers of short, long and cov fragments in each 5 Mb interval are counted. short/long is the short-to-long fragment ratio. After analysis, cov (499 intervals corresponding to 499 cov features) was selected as the feature set, dimensionality reduction was performed using PCA, and the probability value of cancer risk was then obtained using the gradient boosting tree model.
The cfDNA fragment length characteristic value analysis module calculates a length-related tumor risk characteristic value based on the lengths of the cfDNA fragments extracted from the sequencing data obtained by the sequencing module, and calculates the cfDNA fragment length characteristic value with a gradient boosting tree model based on the ratio between short fragments and long fragments;
preferably, the long fragment is a fragment with a length of 201-320 bp;
the short fragment is a fragment with a length of 150-200 bp.
Reads distribution: from the bam file, the CancerDetector software calculates the likelihood that a given sequencing fragment in the sample to be tested comes from each of the three source components (healthy plasma, normal tissue and tumor tissue), from which the proportion of reads in the sample that come from the tumor source component is obtained, i.e., the probability that a given read is derived from each of the three source components: healthy plasma, normal tissue and tumor tissue.
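The three-source idea can be illustrated with Bayes' rule: given the likelihood of a read under each source and the mixing fraction of each source, the posterior probability of each origin follows directly. This sketch is not the software's actual implementation; the source labels, likelihood values and mixing fractions are hypothetical.

```python
def source_posteriors(likelihoods, fractions):
    """Posterior probability that a read comes from each source component.

    likelihoods: dict mapping source name -> P(read | source)
    fractions:   dict mapping source name -> mixing fraction of that source
    """
    weighted = {s: likelihoods[s] * fractions[s] for s in likelihoods}
    total = sum(weighted.values())
    if total == 0:
        raise ValueError("read has zero likelihood under every source")
    return {s: w / total for s, w in weighted.items()}


# Hypothetical per-source likelihoods for one read and equal mixing fractions
likelihoods = {"healthy_plasma": 0.2, "normal_tissue": 0.2, "tumor_tissue": 0.6}
fractions = {"healthy_plasma": 1 / 3, "normal_tissue": 1 / 3, "tumor_tissue": 1 / 3}
posterior = source_posteriors(likelihoods, fractions)
```

With equal mixing fractions the posterior simply mirrors the normalized likelihoods; unequal fractions shift the result toward the more abundant source.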
Example 2
This example uses 58 samples (37 healthy persons and 21 lung cancer patients), divided 70%/30% into a training set (cfDNA of 15 lung cancer patients and 26 healthy persons) and a test set (6 lung cancer patients and 11 healthy persons), as shown in fig. 1. On the training set, the lung cancer-related biomarkers (i.e., methylation level, cfDNA fragment length characteristic value, tumor fraction, tumor ploidy and reads distribution) were obtained according to the method described in example 1 and integrated with the GLM algorithm to obtain the MODE model. The specific calculation process is as follows.
Table 1: values for methylation levels, cfDNA fragment length eigenvalues, tumor scores, tumor ploidy, reads distribution in training sets
Marker substance Assignment of value
tumor fraction(β 1 ) -686.25
tumor ploidy(β 2 ) 31.16
Methylation level (. beta.) 3 ) ---
a -260.66
b -181.95
c -85.21
d -305.36
e 218.80
f 60.85
g -209.25
characteristic value (. beta.) of cfDNA fragment length 4 ) -17.45
reads distribution (. beta.) 5 ) ---
l 166.02
m 8.41
n 0.00000001
The numerical values of the 5 markers in table 1 are imported into R and integrated with the GLM algorithm: the response variable follows a binomial distribution, the link function is the logit, and the regression coefficients are solved by maximum likelihood estimation with 100 iterations; the MODE model finally fitted is shown in formula one.
logit(P) = ln(P/(1-P)) = β0 + β1*X1 + β2*X2 + β3*X3 + β4*X4 + β5*X5 (formula one)
Model interpretation: P is the probability of the outcome under a given exposure; logit(P) is a variable transformation, namely the logit transformation of P; and each β is a partial regression coefficient, i.e., the estimated change in logit(P) for each one-unit change in the corresponding independent variable while the other independent variables are held constant.
P is the probability of the subject suffering from cancer;
X1 is the tumor score;
X2 is the tumor ploidy;
X3 is the methylation level;
X4 is the cfDNA fragment length characteristic value;
X5 is the reads distribution;
β0 is any value selected from 100 to 150, preferably 114.12;
β1 is any value selected from -700 to 0, preferably -686.25;
β2 is any value selected from 20 to 40, preferably 31.16;
β4 is any value selected from -20 to 0, preferably -17.45;
wherein matrix analysis based on the methylation level of the target region yields formula two from Table 1:
β3*X3 = a*X31 + b*X32 + c*X33 + d*X34 + e*X35 + f*X36 + g*X37 (formula two)
X31 is the methylation level at positions 223721500-223726500 of chromosome 2 in the target region;
X32 is the methylation level at positions 170147000-170152000 of chromosome 6 in the target region;
X33 is the methylation level at positions 182500-187500 of chromosome 8 in the target region;
X34 is the methylation level at positions 64081500-64086500 of chromosome 8 in the target region;
X35 is the methylation level at positions 14688500-14693500 of chromosome 9 in the target region;
X36 is the methylation level at positions 119803500-119808500 of chromosome 10 in the target region;
X37 is the methylation level at positions 56285500-56290500 of chromosome 15 in the target region;
a is any value selected from -300 to 300, preferably -260.66;
b is any value selected from -200 to 200, preferably -181.95;
c is any value selected from -100 to 100, preferably -85.21;
d is any value selected from -350 to 350, preferably -305.36;
e is any value selected from -250 to 250, preferably 218.80;
f is any value selected from -100 to 100, preferably 60.85;
g is any value selected from -250 to 250, preferably -209.25.
Wherein matrix analysis based on the reads distribution yields formula three from Table 1:
β5*X5 = l*X51 + m*X52 + n*X53 (formula three)
X51 is the reads distribution derived from healthy plasma;
X52 is the reads distribution derived from normal tissue;
X53 is the reads distribution derived from tumor tissue;
l is any value selected from 50 to 200, preferably 166.02;
m is any value selected from 0 to 10, preferably 8.41;
n is any value selected from -0.1 to 1, preferably 0.00000001.
Wherein,
Tag | Chromosome | Start | End
Methylation level 1 (X31) | Chr2 | 223721500 | 223726500
Methylation level 2 (X32) | Chr6 | 170147000 | 170152000
Methylation level 3 (X33) | Chr8 | 182500 | 187500
Methylation level 4 (X34) | Chr8 | 64081500 | 64086500
Methylation level 5 (X35) | Chr9 | 14688500 | 14693500
Methylation level 6 (X36) | Chr10 | 119803500 | 119808500
Methylation level 7 (X37) | Chr15 | 56285500 | 56290500
The probability value predicted by the MODE model can be obtained with the predict function in R; the AUC obtained by the model on the test set (6 lung cancer patients and 11 healthy people) is 1, as shown in FIG. 1.
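For reference, an AUC can be computed from predicted probabilities as the fraction of (patient, healthy) pairs in which the patient receives the higher score, with ties counted as one half. The sketch below is a generic illustration of that definition, not the routine used in this example; the function name and scores are hypothetical.

```python
def auc(scores_pos, scores_neg):
    """AUC as the proportion of correctly ordered positive/negative pairs."""
    if not scores_pos or not scores_neg:
        raise ValueError("both groups must contain at least one score")
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5      # ties count as half a correct ordering
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical predicted probabilities
perfect = auc([0.9, 0.8], [0.1, 0.2])   # every patient outranks every healthy sample
chance = auc([0.9, 0.1], [0.5, 0.5])    # half of the pairs are ordered correctly
```

With a test set of 6 patients and 11 healthy persons there are 66 such pairs, so any AUC on this test set is a multiple of 1/66; for example, 52/66 is approximately 0.788 and 65/66 is approximately 0.985.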
Example 3
Based on the results of example 1 and following the method of example 2, a generalized linear regression model was built on the training set (cfDNA of 15 lung cancer patients and 26 healthy people) for each single marker in turn: tumor score (tumor fraction), tumor ploidy (tumor ploidy), reads distribution, cfDNA fragment length (fragment size) characteristic value, and methylation level. The AUCs obtained on the test set (6 lung cancer patients and 11 healthy people) were 0.788, 0.636, 0.712, 0.955 and 0.985, respectively, as shown in fig. 2 to fig. 6. The specific modeling process was the same as in example 2.
Example 4
Based on the results of example 1 and following the method of example 2, a generalized linear regression model was built on the training set (cfDNA of 15 lung cancer patients and 26 healthy people) using two markers, reads distribution and tumor ploidy (tumor ploidy). The AUC obtained on the test set (6 lung cancer patients and 11 healthy people) was 0.788, as shown in FIG. 7. The specific modeling process was the same as in example 2.
Example 5
Based on the results of example 1 and following the method of example 2, a generalized linear regression model was built on the training set (cfDNA of 15 lung cancer patients and 26 healthy people) using a combination of three markers: tumor ploidy (tumor ploidy), tumor score (tumor fraction) and the cfDNA fragment length (fragment size) characteristic value. The AUC obtained on the test set (6 lung cancer patients and 11 healthy people) was 0.970. The results are shown in fig. 8, and the specific modeling process was the same as in example 2.
In example 2, the MODE model obtained by using the 5 markers together (tumor fraction, tumor ploidy, reads distribution, cfDNA fragment length (fragment size) characteristic value, methylation level) reached an AUC of 1 on the test set, which is higher than the AUCs obtained with the same modeling method in example 3, example 4 and example 5. Meanwhile, the AUC obtained with the combination of multiple markers is higher than that obtained by modeling with a single marker, which indicates that using multiple markers in the model improves the classification effect. The above results also indicate that the MODE model obtained by integrating the 5 markers (tumor fraction, tumor ploidy, reads distribution, cfDNA fragment length (fragment size) characteristic value, and methylation level) achieves the best prediction and classification effect, based on multi-dimensional information including methylation, fragmentation and tumor score.
Although the embodiments of the present application have been described above with reference to the accompanying drawings, the present application is not limited to the above-described embodiments and application fields, and the above-described embodiments are illustrative, instructive, and not restrictive. Those skilled in the art, having the benefit of this disclosure, may effect numerous modifications thereto and changes may be made without departing from the scope of the invention as defined by the appended claims.

Claims (11)

1. A system for cancer screening, comprising:
a data acquisition module for acquiring data of methylation level, tumor fraction, tumor ploidy, cfDNA fragment length characteristic value and reads distribution of a target region of a subject; and
a cancer calculation module for predicting the probability that the subject has cancer based on the data on methylation level, tumor fraction, tumor ploidy, cfDNA fragment length characteristic value, and reads distribution obtained by the data acquisition module.
2. The system of claim 1, wherein,
the data acquisition module comprises a sequencing module, a methylation level analysis module, a tumor fraction analysis module, a tumor ploidy analysis module, a cfDNA fragment length characteristic value analysis module and a reads distribution analysis module;
the sequencing module is used for performing whole genome sequencing on cfDNA of a subject;
the methylation level analysis module analyzes the methylation level of the target region based on the sequencing data obtained from the sequencing module;
the tumor fraction analysis module analyzes the tumor fraction based on the sequencing data obtained from the sequencing module;
the tumor ploidy analysis module analyzes the tumor ploidy based on the sequencing data obtained from the sequencing module;
the cfDNA fragment length characteristic value analysis module calculates a length-related risk characteristic value of the tumor based on the lengths of the cfDNA fragments extracted from the sequencing data obtained by the sequencing module;
the reads distribution analysis module analyzes the reads distribution based on the sequencing data obtained from the sequencing module so as to predict the source of each read.
3. The system of claim 1 or 2,
the target region includes any one or two or more of:
chromosome 2, positions 223721500-223726500,
chromosome 6, positions 170147000-170152000,
chromosome 8, positions 182500-187500,
chromosome 8, positions 64081500-64086500,
chromosome 9, positions 14688500-14693500,
chromosome 10, positions 119803500-119808500, or
chromosome 15, positions 56285500-56290500.
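As an illustration only, the seven target regions can be held in a small lookup table, using the coordinates given in claims 3 and 7. The `chrN` naming, the inclusive treatment of both endpoints, and the helper name `region_of` are assumptions made for this sketch; the claims do not state a reference genome build.

```python
# Target regions from claims 3 and 7, as (chromosome, start, end).
# ASSUMPTIONS: "chrN" naming and inclusive endpoints; genome build is not stated.
TARGET_REGIONS = [
    ("chr2", 223721500, 223726500),
    ("chr6", 170147000, 170152000),
    ("chr8", 182500, 187500),
    ("chr8", 64081500, 64086500),
    ("chr9", 14688500, 14693500),
    ("chr10", 119803500, 119808500),
    ("chr15", 56285500, 56290500),
]


def region_of(chrom, pos):
    """Return the index of the target region containing (chrom, pos), or None."""
    for i, (c, start, end) in enumerate(TARGET_REGIONS):
        if c == chrom and start <= pos <= end:
            return i
    return None


print(region_of("chr8", 185000))  # falls in the first chromosome 8 region
print(region_of("chr1", 1000))    # chromosome 1 has no target region
```

In each listed region, the end coordinate minus the start coordinate is 5000.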
4. The system of any one of claims 1 to 3,
the methylation level of the target region is calculated from the methylation level of each CG site in the target region, wherein the methylation level of a CG site is the ratio of the number of methylated cytosines to the sum of the methylated and unmethylated cytosines detected in the sequencing results for that site;
the tumor fraction is the proportion of ctDNA in cfDNA, i.e. the proportion of free DNA released by tumor cells in total cfDNA;
the tumor ploidy is the ploidy number of the tumor, i.e., the actual content of cancer cells in a tumor sample resulting from abnormalities in chromosome structure and number;
the cfDNA fragment length characteristic value analysis module calculates a length-related risk characteristic value of the tumor based on the lengths of the cfDNA fragments extracted from the sequencing data obtained by the sequencing module, the cfDNA fragment length characteristic value being calculated by a gradient boosting tree model from the ratio of short fragments to long fragments among the cfDNA fragments;
preferably, the long fragment is a fragment with the length of 201-320 bp;
the short fragment is a fragment with the length of 150-200 bp;
the reads distribution refers to the sample-source probabilities computed for each read in the cfDNA sequencing data, i.e., the probability that a given read derives from each of three source components: healthy plasma, normal tissue, and tumor tissue.
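The definitions of the CG-site methylation level and of the short/long fragment ratio in this claim can be sketched as follows. This is a minimal illustration under stated assumptions: the claim does not say how per-site levels are aggregated into a region-level value, so a simple mean over covered sites is assumed here. The function names are hypothetical, and the gradient boosting tree model that turns the ratio into the final characteristic value is not reproduced.

```python
def site_methylation(methylated, unmethylated):
    """Methylation level of one CG site:
    methylated C / (methylated C + unmethylated C)."""
    total = methylated + unmethylated
    if total == 0:
        return None  # site has no coverage
    return methylated / total


def region_methylation(site_counts):
    """Region-level methylation from per-site (methylated, unmethylated) counts.
    ASSUMPTION: mean of per-site levels over covered sites."""
    levels = [site_methylation(m, u) for m, u in site_counts]
    levels = [x for x in levels if x is not None]
    if not levels:
        return None
    return sum(levels) / len(levels)


def short_long_ratio(fragment_lengths, short_range=(150, 200), long_range=(201, 320)):
    """Ratio of short (150-200 bp) to long (201-320 bp) cfDNA fragments,
    using the preferred length ranges stated in the claim."""
    n_short = sum(short_range[0] <= x <= short_range[1] for x in fragment_lengths)
    n_long = sum(long_range[0] <= x <= long_range[1] for x in fragment_lengths)
    if n_long == 0:
        return None  # ratio undefined without long fragments
    return n_short / n_long


# Toy inputs invented for illustration only.
print(region_methylation([(3, 1), (1, 1), (0, 0)]))
print(short_long_ratio([150, 167, 200, 201, 320, 400]))
```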
5. The system of any one of claims 1 to 4,
the cfDNA sequencing data is cfDNA sequencing data after removing low-quality sequencing fragments;
preferably, the cfDNA sequencing data is the sequencing data that remains after low-quality sequencing fragments are removed and sequencing data within low-alignment intervals are further excluded.
6. The system of any one of claims 1 to 5,
in the cancer calculation module, a formula fitted to data on the methylation level, tumor fraction, tumor ploidy, cfDNA fragment length characteristic value and reads distribution of known samples is stored in advance and used to predict the probability that the subject has cancer, wherein the formula is Formula 1:
[Formula 1: equation image FDA0003521796370000021]
$P$ is the probability that the subject has cancer;
$X_1$ is the tumor fraction;
$X_2$ is the tumor ploidy;
$X_3$ is the methylation level;
$X_4$ is the cfDNA fragment length characteristic value;
$X_5$ is the reads distribution;
$\beta_0$ is any value selected from 100 to 150, preferably 114.12;
$\beta_1$ is any value selected from -700 to 0, preferably -686.25;
$\beta_2$ is any value selected from 20 to 40, preferably 31.16;
$\beta_4$ is any value selected from -20 to 0, preferably -17.45.
7. The system of claim 6, wherein $\beta_3 X_3$ is obtained by Formula 2:
$$\beta_3 X_3 = a X_{31} + b X_{32} + c X_{33} + d X_{34} + e X_{35} + f X_{36} + g X_{37}$$ (Formula 2)
$X_{31}$ is the methylation level at positions 223721500-223726500 of chromosome 2 in the target region;
$X_{32}$ is the methylation level at positions 170147000-170152000 of chromosome 6 in the target region;
$X_{33}$ is the methylation level at positions 182500-187500 of chromosome 8 in the target region;
$X_{34}$ is the methylation level at positions 64081500-64086500 of chromosome 8 in the target region;
$X_{35}$ is the methylation level at positions 14688500-14693500 of chromosome 9 in the target region;
$X_{36}$ is the methylation level at positions 119803500-119808500 of chromosome 10 in the target region;
$X_{37}$ is the methylation level at positions 56285500-56290500 of chromosome 15 in the target region;
a is any value selected from -300, preferably -260.66;
b is any value selected from -200 to 200, preferably -181.95;
c is any value selected from -100 to 100, preferably -85.21;
d is any value selected from -350 to 350, preferably -305.36;
e is any value selected from -250 to 250, preferably 218.80;
f is any value selected from -100 to 100, preferably 60.85;
g is any value selected from -250 to 250, preferably -209.25.
8. The system of claim 6, wherein $\beta_5 X_5$ is obtained by Formula 3:
$$\beta_5 X_5 = l X_{51} + m X_{52} + n X_{53}$$ (Formula 3)
$X_{51}$ is the reads distribution derived from healthy plasma;
$X_{52}$ is the reads distribution derived from normal tissue;
$X_{53}$ is the reads distribution derived from tumor tissue;
l is any value selected from 50 to 200, preferably 166.02;
m is any value selected from 0 to 10, preferably 8.41;
n is any value selected from -0.1 to 1, preferably 0.00000001.
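The way claims 6 to 8 combine the five markers can be sketched as below, using the preferred coefficient values stated in those claims. Formula 1 itself survives in this text only as an image reference, so its exact form is not known here. The logistic link in `probability` is an assumption (it is the usual choice for a generalized linear model that outputs a probability), and all function and constant names are hypothetical.

```python
import math

# Preferred coefficient values as stated in claims 6, 7 and 8.
BETA0, BETA1, BETA2, BETA4 = 114.12, -686.25, 31.16, -17.45
METH_COEFS = [-260.66, -181.95, -85.21, -305.36, 218.80, 60.85, -209.25]  # a..g
READS_COEFS = [166.02, 8.41, 0.00000001]  # l, m, n


def linear_predictor(tumor_fraction, tumor_ploidy, methylation, fragment_value, reads):
    """beta0 + beta1*X1 + beta2*X2 + beta3*X3 + beta4*X4 + beta5*X5, where
    beta3*X3 follows Formula 2 (seven regional methylation levels) and
    beta5*X5 follows Formula 3 (three read-source values)."""
    if len(methylation) != len(METH_COEFS) or len(reads) != len(READS_COEFS):
        raise ValueError("expected 7 methylation levels and 3 reads values")
    beta3_x3 = sum(c * x for c, x in zip(METH_COEFS, methylation))
    beta5_x5 = sum(c * x for c, x in zip(READS_COEFS, reads))
    return (BETA0 + BETA1 * tumor_fraction + BETA2 * tumor_ploidy
            + beta3_x3 + BETA4 * fragment_value + beta5_x5)


def probability(z):
    """ASSUMPTION: logistic link, P = 1 / (1 + exp(-z)); the source formula
    image is not available. Written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


# With every marker set to zero, only the intercept beta0 remains.
print(linear_predictor(0.0, 0.0, [0.0] * 7, 0.0, [0.0] * 3))
```

Keeping the linear predictor separate from the link function means the assumed logistic step can be replaced without touching the coefficients if the actual Formula 1 differs.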
9. The system of claim 1, wherein,
the system also includes a bisulfite treatment module for bisulfite treating cfDNA of the subject.
10. A method of cancer screening using the system of any one of claims 1-9, the method comprising:
a data acquisition step of obtaining data on the methylation level of a target region, tumor fraction, tumor ploidy, cfDNA fragment length characteristic value and reads distribution of a subject; and
a cancer calculation step of predicting the probability that the subject has cancer based on the data on methylation level, tumor fraction, tumor ploidy, cfDNA fragment length characteristic value, and reads distribution obtained in the data acquisition step.
11. The method of cancer screening of claim 10, wherein:
the data acquisition step comprises a sequencing step, a methylation level analysis step, a tumor fraction analysis step, a tumor ploidy analysis step, a cfDNA fragment length characteristic value analysis step and a reads distribution analysis step;
the sequencing step performs whole genome sequencing on cfDNA of the subject;
the methylation level analysis step analyzes the methylation level of the target region based on the sequencing data obtained in the sequencing step;
the tumor fraction analysis step analyzes the tumor fraction based on the sequencing data obtained in the sequencing step;
the tumor ploidy analysis step analyzes the tumor ploidy based on the sequencing data obtained in the sequencing step;
the cfDNA fragment length characteristic value analysis step calculates a length-related risk characteristic value of the tumor based on the lengths of the cfDNA fragments extracted from the sequencing data obtained in the sequencing step;
the reads distribution analysis step analyzes the reads distribution based on the sequencing data obtained in the sequencing step so as to predict the source of each read.
CN202210182941.6A 2021-02-25 2022-02-25 System for cancer screening and method thereof Pending CN114974430A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2021102126451 2021-02-25
CN202110212645 2021-02-25

Publications (1)

Publication Number Publication Date
CN114974430A true CN114974430A (en) 2022-08-30

Family

ID=82976397

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210182941.6A Pending CN114974430A (en) 2021-02-25 2022-02-25 System for cancer screening and method thereof

Country Status (1)

Country Link
CN (1) CN114974430A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115424666A (en) * 2022-09-13 2022-12-02 江苏先声医学诊断有限公司 Method and system for screening pan-cancer early-screening molecular marker based on whole genome bisulfite sequencing data
CN115424666B (en) * 2022-09-13 2023-07-11 江苏先声医学诊断有限公司 Method and system for screening early-stage screening sub-markers of pan-cancer based on whole genome bisulfite sequencing data
CN115558716A (en) * 2022-09-29 2023-01-03 昂凯生命科技(苏州)有限公司 cfDNA fragment feature combination, system and application for predicting cancer
CN115662519A (en) * 2022-09-29 2023-01-31 昂凯生命科技(苏州)有限公司 cfDNA fragment feature combination and system for predicting cancer based on machine learning
CN115558716B (en) * 2022-09-29 2023-11-03 南京医科大学 cfDNA fragment characteristic combination, system and application for predicting cancer
CN115662519B (en) * 2022-09-29 2023-11-03 南京医科大学 cfDNA fragment characteristic combination and system for predicting cancer based on machine learning

Similar Documents

Publication Publication Date Title
JP7168247B2 (en) Mutation detection for cancer screening and fetal analysis
CN114974430A (en) System for cancer screening and method thereof
KR102587176B1 (en) Non-invasive determination of methylome of fetus or tumor from plasma
CN111742062B (en) Methylation markers for diagnosing cancer
CN110964826A (en) High-throughput detection kit for methylation of colorectal cancer suppressor gene and application thereof
CN114045345B (en) Free DNA-based genome canceration information detection system and detection method
CN110982907B (en) Thyroid nodule-related rDNA methylation marker and application thereof
CN114317738B (en) Methylation biomarker related to detection of gastric cancer lymph node metastasis or combination and application thereof
CN108588230B (en) Marker for breast cancer diagnosis and screening method thereof
JP2014519319A (en) Methods and compositions for detecting cancer through general loss of epigenetic domain stability
CN115516110A (en) Method and reagent for detecting DNA methylation of colorectal cancer
AU2017281099A1 (en) Compositions and methods for diagnosing lung cancers using gene expression profiles
CN115176034A (en) Cancer gene methylation detection system and cancer in-vitro detection method implemented in same
CN112899359A (en) Methylation marker for detecting benign and malignant lung nodules or combination and application thereof
CN107630093B (en) Reagent, kit, detection method and application for diagnosing liver cancer
CN108660209B (en) Product for early detection of colorectal cancer prepared based on BMP3 gene methylation
CN114743593B (en) Construction method of prostate cancer early screening model based on urine, screening model and kit
CN110656168A (en) COPD early diagnosis marker and application thereof
CN115851923A (en) Methylated biomarker for detecting colorectal cancer lymph node metastasis and application thereof
CN115279924A (en) Probe composition for detecting 11 cancers
CN116779025A (en) System for cancer screening
CN115772566B (en) Methylation biomarker for auxiliary detection of lung cancer somatic ERBB2 gene mutation and application thereof
WO2023078283A1 (en) Methylation biomarker for breast cancer diagnosis and use thereof
CN115772564A (en) Methylation biomarker for auxiliary detection of lung cancer somatic cell ATM gene fusion mutation and application thereof
CN114507734A (en) Marker and probe composition for thyroid cancer screening and application of marker and probe composition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination