CN117316278A - Cancer noninvasive early screening method and system based on cfDNA fragment length distribution characteristics - Google Patents
- Publication number
- CN117316278A (application number CN202210704961.5A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q1/00—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
- C12Q1/68—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
- C12Q1/6869—Methods for sequencing
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q1/00—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
- C12Q1/68—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
- C12Q1/6876—Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
- C12Q1/6883—Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material
- C12Q1/6886—Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material for cancer
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
Abstract
The invention discloses a non-invasive early screening method and system for cancers based on cfDNA fragment length distribution characteristics. Using low-depth whole-genome sequencing, the method compiles statistics on the differences in fragment length distribution between tumor-derived cfDNA and cfDNA from healthy individuals, establishes an early cancer screening model, and thereby realizes non-invasive early screening of cancer. First, the scheme relies on the size of blood cfDNA fragments to distinguish ctDNA from non-tumor cfDNA; it does not depend on mutation detection in oncogenes or tumor suppressor genes, and so eliminates interference from clonal hematopoietic mutations. Second, the data in the embodiments show that healthy people and early-stage tumor patients can be distinguished by the size characteristics of blood cfDNA fragments. Finally, because low-depth whole-genome sequencing is used, the scheme greatly reduces detection cost, which favors its future application in early screening of malignant tumors.
Description
Technical Field
The invention belongs to the technical field of medical detection, and particularly relates to a non-invasive early screening method and system for cancers based on cfDNA fragment length distribution characteristics.
Background
Most cancer deaths worldwide result from the limited effectiveness of therapeutic intervention when the disease is diagnosed at an advanced stage, so early diagnosis of tumors is particularly important. Traditional biomarkers and imaging techniques play an important role in tumor diagnosis; however, the specificity of traditional serum biomarkers is not satisfactory for therapeutic guidance, and imaging techniques cannot be used for "real-time" detection because of radiation exposure and cost. From the first plasma cfDNA (cell-free DNA) liquid biopsy product, based on EGFR gene mutation and approved by the FDA in 2016, to bTMB (blood tumor mutation burden), which was shown to predict the effect of immunotherapy, liquid biopsy has attracted great attention in the field of tumor therapy. Liquid biopsies are increasingly being considered for early tumor diagnosis, treatment guidance and recurrence monitoring, because they can provide information about the tumor. Furthermore, liquid biopsies provide a non-invasive alternative to traditional "solid biopsies", which in some cases cannot be performed at all, or cannot be repeated in "real time". Despite these advantages, liquid biopsies have limitations, such as a lack of consensus on detection methods, difficulty in analyzing large amounts of measurement data, and insufficient support from evidence-based medicine.
Currently, the conventional approach to liquid biopsy and early cancer screening is to identify cfDNA released by tumors through mutation detection in oncogenes or tumor suppressor genes. Unfortunately, the amount of circulating tumor DNA (ctDNA) in blood is typically much lower than that of non-cancer-related DNA fragments, which makes its detection very difficult, especially in the early stages of cancer. Previous studies have found that accurate tumor information can only be obtained when ctDNA accounts for 10% or more of cfDNA. However, apart from some advanced tumors that release large amounts of ctDNA, most tumor patients do not reach this abundance. At present, the sensitivity and accuracy of ctDNA detection are mainly improved by increasing the sequencing depth, but greater depth may cause false positives, because non-tumor-derived DNA may also carry various tumor-related mutations. High-depth sequencing is also extremely expensive and cannot be applied widely in clinical practice. Furthermore, mutation studies have mostly focused on patients with advanced, metastatic cancer. Razavi et al. (Razavi, P., Li, B.T., Brown, D.N. et al. High-intensity sequencing reveals the sources of plasma circulating cell-free DNA variants. Nat Med 25, 1928-1937 (2019)) sequenced 508 specific genes at a depth of about 60,000X in peripheral blood from 124 metastatic cancer patients and 47 healthy people and found large numbers of cfDNA mutations; 81.6% of the mutations in healthy people and 53.2% of those in cancer patients arose from clonal hematopoiesis in leukocytes rather than from tumor cells. Identifying cancer patients through mutations in specific target genes can therefore lead to a high false positive rate.
Moreover, because cancer patients vary greatly, some patients may be missed once a specific set of target genes has been defined. Detection of mutations in a specific cfDNA panel is thus inherently limited. These problems have restricted the use of liquid biopsies.
Another finding might solve this problem. Previous studies have shown that the length of cfDNA released into the blood by different cells may vary. For example, many cfDNA fragments are 167 bp long (similar to the length of DNA in one nucleosome), which may be associated with caspase-dependent DNA cleavage during apoptosis. Fetal cfDNA is significantly shorter than maternal cfDNA, a feature that is used for prenatal diagnosis. These studies suggest that cfDNA from different cell sources may exhibit distinctive length patterns, which could serve as a signal for distinguishing cfDNA sources. For a number of reasons, the packaging of cancer genomes becomes disorganized, meaning that when cancer cells die they release DNA into the blood in a disordered manner, so differences in fragment length distribution are likely to exist between circulating tumor DNA and non-tumor cfDNA. Studies have reported that using cfDNA fragment length as a feature for distinguishing tumor-derived from non-tumor-derived cfDNA can overcome the low sensitivity and false positives of cfDNA mutation detection. However, the few previous studies on cfDNA fragment length obtained conflicting results, so research in this area remains at an early stage, and a scientific, systematic approach to studying cfDNA fragmentation urgently needs to be developed and optimized.
At present, there are few reports on using cfDNA fragment size for liquid biopsy and early cancer screening. The two studies most similar to the present invention are PMID: 30404863 and PMID: 31142840.
Mouliere et al. (Mouliere, F., et al., Enhanced detection of circulating tumor DNA by fragment size analysis. Sci Transl Med, 2018. 10(466)) examined the fragment length characteristics of cfDNA by whole-genome sequencing of cfDNA extracted from 344 plasma samples collected from 200 cancer patients (covering 18 different types of cancer) and 65 plasma samples from healthy people. Their analysis showed that ctDNA fragments carrying cancer mutations are generally 20-40 bp shorter than nucleosomal DNA fragments (167 bp) and are enriched in the 90-150 bp interval. The researchers therefore increased ctDNA abundance by selectively sequencing the enriched short fragments, and used the fragment proportions in different length intervals as features for a machine learning algorithm that distinguishes tumor blood samples from healthy blood samples. However, useful information such as tumor mutations may also reside on fragments of other lengths, so selectively enriching only the 90-150 bp fragments may lose some information. In addition, that study trained its model on overall fragment distribution proportions combined with mutation features and did not systematically consider fragment size proportions at specific functional sites across the whole genome, so cfDNA fragment features as an indicator for early cancer diagnosis still have considerable room for improvement.
Cristiano et al. (Cristiano, S., et al., Genome-wide cell-free DNA fragmentation in patients with cancer. Nature, 2019. 570(7761): p. 385-389) examined, by whole-genome sequencing, blood samples from 208 patients with various stages of breast, colorectal, lung, ovarian, pancreatic, gastric and bile duct cancer and from 215 healthy persons; circulating tumor DNA was present in 57% to 99% of the patients. The method detects cfDNA by low-coverage whole-genome sequencing. The numbers of short and long cfDNA fragments mapped to different regions of the genome are analyzed in non-overlapping windows covering the genome to build a prediction model for early cancer. However, most patients in that cohort had advanced cancer of a limited number of types, and the fragment size statistics applied a single "one-size-fits-all" criterion across multiple cancer types (short fragments defined as 100 to 150 bp and long fragments as 151 to 220 bp). The study lacked more extensive large-sample statistics and cancer-type-specific optimization of fragment parameters for identifying a single cancer type.
On this basis, those skilled in the art need a non-invasive early cancer screening method that can identify early-stage cancer patients from low-depth whole-genome sequencing, greatly reducing the cost of early cancer screening while improving screening accuracy.
Disclosure of Invention
The invention aims to provide a non-invasive early screening method for cancers based on cfDNA fragment length distribution characteristics. It mainly addresses the low specificity and high false positive rate of early screening for biliopancreatic malignancies in the prior art.
The technical scheme adopted by the invention for solving the technical problems is as follows:
A non-invasive early screening method for cancers based on cfDNA fragment length distribution characteristics, comprising the following steps: using low-depth whole-genome sequencing, compiling statistics on the differences in fragment length distribution characteristics between tumor-derived cfDNA and cfDNA from healthy individuals; establishing an early cancer screening model; and thereby realizing non-invasive early screening of cancer. Low-depth whole-genome sequencing refers to a sequencing depth of 2X-4X.
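To give a sense of scale for the 2X-4X depth mentioned above, the following minimal sketch applies the standard relation "mean depth = total sequenced bases / genome size". The genome size (about 3.1 Gb) and the 150 bp read length are illustrative assumptions and are not stated in this description; the function names are hypothetical.

```python
import math


def mean_depth(n_reads: int, read_length: int, genome_size: int) -> float:
    """Mean sequencing depth = total sequenced bases / genome size."""
    return n_reads * read_length / genome_size


def reads_for_depth(depth: float, read_length: int, genome_size: int) -> int:
    """Number of reads needed to reach a target mean depth (rounded up)."""
    return math.ceil(depth * genome_size / read_length)


# Assumed values for illustration only (not taken from the description).
GENOME_SIZE = 3_100_000_000  # approximate human genome size in bp
READ_LENGTH = 150            # hypothetical read length in bp

reads_2x = reads_for_depth(2, READ_LENGTH, GENOME_SIZE)
reads_4x = reads_for_depth(4, READ_LENGTH, GENOME_SIZE)
print(f"2X needs about {reads_2x:,} reads; 4X needs about {reads_4x:,} reads")
```

Under these assumptions, the 2X-4X range corresponds to tens of millions of reads per sample, which is the basis for the cost argument made in this description.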
As a preferred embodiment, the normalized z-scores of the number of short cfDNA fragments and of the total number of fragments within each of 504 regions of 5 Mb across the whole genome are calculated as the feature input values for model training. The total number of fragments is the sum of the number of long fragments and the number of short fragments as defined here.
The cfDNA fragments comprise short fragments and long fragments, wherein the short fragments range from 130 bp to 177 bp in length and the long fragments range from 177 bp to 237 bp.
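The feature construction described above can be sketched as follows. This is a minimal illustration under one possible reading: the short-fragment counts and the total counts (short + long) are each standardized across the windows of a single sample, and the two standardized vectors are concatenated. The description does not state over which axis the z-score is taken, so that choice, the function names, and the toy counts are all assumptions.

```python
from statistics import mean, pstdev


def zscores(values):
    """Standardize a list of numbers to mean 0 and (population) SD 1."""
    m = mean(values)
    s = pstdev(values)
    if s == 0:
        return [0.0] * len(values)  # constant input: no variation to scale
    return [(v - m) / s for v in values]


def window_features(short_counts, long_counts):
    """Build one sample's feature vector from per-window fragment counts.

    Assumption: z-scores are computed across windows within a sample,
    separately for short counts and for total (short + long) counts.
    """
    if len(short_counts) != len(long_counts):
        raise ValueError("short and long counts must cover the same windows")
    totals = [s + l for s, l in zip(short_counts, long_counts)]
    return zscores(short_counts) + zscores(totals)


# Toy example with 4 windows instead of 504.
features = window_features([120, 150, 90, 140], [300, 310, 280, 330])
print(len(features))  # 2 features per window
```

With 504 windows this reading yields 1008 input values per sample; a different normalization (for example, against a healthy reference cohort) would change only the `zscores` step.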
As a preferred embodiment, the linear SVC algorithm is adopted, and the model coefficients are obtained by 5-fold cross-validation repeated 30 times, so as to establish the early cancer screening model.
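The resampling scheme named above (5-fold cross-validation repeated 30 times) can be sketched with the standard library alone. The sketch below only generates the 150 train/test index splits; fitting the linear SVC on each split is omitted. Stratifying the folds by class label is an assumption, as are the function name, the seed, and the toy labels. In practice a library such as scikit-learn offers `RepeatedStratifiedKFold` and `LinearSVC` for the same purpose.

```python
import random
from collections import defaultdict


def repeated_stratified_kfold(labels, n_splits=5, n_repeats=30, seed=0):
    """Yield (train_idx, test_idx) pairs for repeated stratified k-fold CV.

    Each repeat reshuffles the samples within every class and deals them
    round-robin into n_splits folds, so class proportions stay similar.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    for _ in range(n_repeats):
        folds = [[] for _ in range(n_splits)]
        for members in by_class.values():
            shuffled = members[:]
            rng.shuffle(shuffled)
            for pos, idx in enumerate(shuffled):
                folds[pos % n_splits].append(idx)
        for k in range(n_splits):
            test_idx = sorted(folds[k])
            train_idx = sorted(
                i for j, fold in enumerate(folds) if j != k for i in fold
            )
            yield train_idx, test_idx


# Toy labels: 1 = cancer, 0 = healthy control (counts are illustrative).
labels = [1] * 20 + [0] * 10
splits = list(repeated_stratified_kfold(labels))
print(len(splits))  # 30 repeats x 5 folds
```

Averaging the coefficients or scores over all 150 fits is one common way to use such splits; the description does not specify how the repeated results are combined.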
As a preferred embodiment, the cancer is a biliopancreatic malignancy. Biliopancreatic malignancies include pancreatic cancer, gallbladder cancer and cholangiocarcinoma.
The invention also provides a non-invasive early cancer screening system based on cfDNA fragment length distribution characteristics, which comprises:
a cfDNA fragment feature extraction module, used for obtaining cfDNA fragment size feature data from a sample;
a machine learning classification model building module, used for building an early cancer screening model from the statistical differences in fragment size features between tumor-derived cfDNA and cfDNA from healthy individuals;
and an independent validation cohort evaluation module, used for verifying the predictive performance of the established machine learning classification model on an independent validation cohort.
As a preferred embodiment, the cfDNA fragment feature extraction module comprises:
a sequencing data alignment unit, used for aligning the sequencing data to the human reference genome hg19 after removing the sequencing adapters;
a cfDNA fragment counting unit, used for compiling cfDNA fragment length data: dividing the hg19 autosomes into 504 contiguous, non-overlapping windows, each 5 Mb in length; computing, in each window, the ratio of the number of cfDNA fragments longer than 130 bp and shorter than 177 bp to the number of cfDNA fragments longer than 177 bp and shorter than 237 bp; and finally obtaining the numbers of long and short cfDNA fragments in each 5 Mb interval;
a cfDNA fragment feature determining unit, used for determining the length interval in which the fragment distributions of cancer patients and healthy controls differ most, according to the distribution of differences between the two groups; a short fragment range [130,177] and a long fragment range [177,237] are defined, and the normalized z-scores of the number of short cfDNA fragments and of the total number of fragments in each of the 504 windows are then calculated as the feature input values for model training.
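The counting step performed by these units can be sketched as below. The sketch assumes fragments are already available as `(chromosome, start, length)` tuples derived from aligned read pairs, which is a hypothetical input format. Because the two length ranges share the 177 bp boundary, the sketch assigns 177 bp fragments to the long class; this boundary handling is an assumption. It also does not reproduce how exactly 504 windows are selected, which this description does not detail.

```python
from collections import Counter

WINDOW_SIZE = 5_000_000   # 5 Mb windows, as in the description
SHORT_RANGE = (130, 177)  # assumed half-open: 130 <= length < 177
LONG_RANGE = (177, 237)   # assumed closed:    177 <= length <= 237


def classify_length(length):
    """Return 'short', 'long', or None for fragments outside both ranges."""
    if SHORT_RANGE[0] <= length < SHORT_RANGE[1]:
        return "short"
    if LONG_RANGE[0] <= length <= LONG_RANGE[1]:
        return "long"
    return None


def count_fragments(fragments):
    """Count short and long fragments per (chromosome, window index).

    fragments: iterable of (chrom, start, length) tuples, with start as a
    0-based coordinate on the reference genome.
    """
    counts = {}
    for chrom, start, length in fragments:
        kind = classify_length(length)
        if kind is None:
            continue  # fragment length outside both ranges is ignored
        key = (chrom, start // WINDOW_SIZE)
        counts.setdefault(key, Counter())[kind] += 1
    return counts


# Toy fragments for illustration only.
example = [
    ("chr1", 100, 150),        # short, window 0
    ("chr1", 4_999_999, 180),  # long, window 0
    ("chr1", 5_000_000, 167),  # short, window 1
    ("chr1", 6_000_000, 90),   # ignored (too short)
    ("chr2", 10, 237),         # long, window 0 of chr2
]
counts = count_fragments(example)
for key in sorted(counts):
    print(key, dict(counts[key]))
```

The per-window short and long counts produced this way are the inputs from which the short/long ratio and the z-score features are derived.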
As a preferred embodiment, the machine learning classification model building module includes:
a sample data classifying unit, used for dividing the samples into a training set and a test set at a ratio of 4:1, keeping the proportions of healthy control samples and of each type of cancer sample consistent between the two sets;
a model parameter acquisition unit, used for processing the sample data in the training set; in the training cohort, the model parameters are obtained by 5-fold cross-validation repeated 30 times;
and a model performance evaluation unit, used for drawing the receiver operating characteristic (ROC) curve of the training cohort from the model prediction value and the pathology result of each sample in the training cohort.
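The 4:1 split and the ROC-based evaluation performed by these units can be sketched as follows. This is a minimal standard-library illustration, not the implementation used in the invention: the split is stratified only by a binary label (the description also balances individual cancer types), the AUC is computed through its rank interpretation rather than by drawing the curve, and the function names, seed, and toy scores are assumptions. The 60/31 sample counts are the training-and-testing cohort sizes given in this description.

```python
import random


def stratified_split(labels, train_fraction=0.8, seed=0):
    """Split sample indices 4:1 while keeping class proportions similar."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(members)
        cut = round(len(members) * train_fraction)
        train.extend(members[:cut])
        test.extend(members[cut:])
    return sorted(train), sorted(test)


def roc_auc(y_true, scores):
    """Area under the ROC curve.

    Equals the probability that a randomly chosen positive sample scores
    higher than a randomly chosen negative one (ties count as 0.5).
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required to compute AUC")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# 60 tumor samples (label 1) and 31 healthy controls (label 0).
labels = [1] * 60 + [0] * 31
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))

# Toy prediction values checked against toy pathology labels.
auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1])
print(auc)
```

In the toy AUC example, three of the four positive-negative pairs are ranked correctly, giving an AUC of 0.75.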
Compared with the prior art, the invention has the following beneficial effects:
First, the scheme of the invention relies on the size of blood cfDNA fragments to distinguish ctDNA from non-tumor cfDNA and does not depend on mutation detection in oncogenes or tumor suppressor genes, so interference from clonal hematopoietic mutations is eliminated. Second, the data in the embodiments show that the size characteristics of blood cfDNA fragments can distinguish healthy people from early-stage tumor patients, whose ctDNA signals are otherwise hard to detect because their ctDNA content is low. Finally, because low-depth whole-genome sequencing is used, the detection cost is greatly reduced compared with other high-depth or ultra-deep NGS detection methods. These advantages favor the future application of the scheme in early screening of malignant tumors.
In the invention, plasma samples from 60 clinically diagnosed biliopancreatic tumor patients and 31 healthy controls were subjected to low-depth (2X-4X) whole-genome cfDNA sequencing. Taking into account the positional distribution of cfDNA fragment sizes across the whole genome, an analysis system was established for the factors that distinguish tumor-derived from non-tumor-derived DNA. By systematic statistical analysis of the sizes and numbers of DNA fragments in blood covering different regions of the whole genome, and by training and testing on the cfDNA fragment size features in the study cohort, a biomarker diagnostic model for early screening of biliopancreatic malignancies was established. Furthermore, the model was independently validated in 94 patients with biliopancreatic tumors and 40 healthy persons, which confirmed the performance of the diagnostic model based on the length distribution characteristics of cell-free DNA fragments in blood. By analyzing cfDNA length in blood more precisely to find clues for early tumor screening, the method provides more solid and reliable data support for precise clinical application.
Drawings
FIG. 1 is a fragment length distribution profile of cfDNA at the whole genome level in example 1 of the present invention.
FIG. 2 is a graph showing the difference distribution of cfDNA fragments of a cancer patient and a healthy individual in example 1 of the present invention.
FIG. 3 is the ROC curve of the training set in example 2 of the present invention.
FIG. 4 is the ROC curve of the test set in example 2 of the present invention.
FIG. 5 is the ROC curve of the independent validation cohort in example 2 of the present invention.
Detailed Description
The following describes the technical scheme of the present invention in detail by referring to examples. The reagents and biological materials used hereinafter are commercial products unless otherwise specified.
Example 1
(1) Study cohort and clinical information
The study enrolled 154 patients with biliary-pancreatic tumors (pancreatic cancer, gallbladder cancer and bile duct cancer), confirmed by tumor markers, imaging examination (such as ultrasound and abdominal CT scanning) and pathological results, together with 71 healthy controls. Blood samples from the patients were collected before surgery, and blood samples from the healthy controls were also collected. Each enrolled patient received a definitive diagnosis after surgery based on the pathological examination results.
(2) Blood collection, separation and storage
Whole blood from preoperative cancer patients and healthy controls was collected in 10 ml cell-free nucleic acid preservation tubes (REF 43803, BD, USA) and transported at room temperature. Plasma was separated from the received whole blood by two-step centrifugation. First, plasma and cellular components were separated by centrifugation at 1600 g for 10 minutes at 4 °C, and the supernatant was carefully aspirated without disturbing the leukocyte layer (buffy coat). The hemolysis grade of the plasma was recorded, and samples with a hemolysis grade > 5 were excluded from the subsequent study. The plasma was then centrifuged again at 16,000 g for 15 minutes at 4 °C to remove any remaining cells or cell debris. The supernatant was transferred to centrifuge tubes in 1 ml aliquots, and the separated plasma samples were stored in a freezer at -80 °C.
(3) cfDNA extraction
Plasma samples were taken out of the -80 °C freezer, incubated statically in a water bath at 37 °C for about 5 minutes, and then centrifuged in a refrigerated centrifuge at 4 °C and 1600 g for 10 minutes; the supernatant was carefully transferred to a centrifuge tube. Plasma cfDNA was extracted from 1 ml of plasma using the QIAamp Circulating Nucleic Acid Kit (55114, Qiagen, Shanghai, China), and the cfDNA was eluted with 30 μl of EB in the final step; see the product instructions for the procedure. The total amount of extracted cfDNA was quantified using a Qubit fluorometer and the matching reagents (Q32854, Thermo Fisher, USA). The cfDNA fragment distribution was measured using an Agilent 2100 Bioanalyzer with the corresponding Agilent High Sensitivity DNA Kit & Reagents (5067-4626, Agilent, USA).
(4) cfDNA library construction and WGS sequencing
Samples that passed cfDNA quality control were used for cfDNA library construction and WGS sequencing. Libraries were prepared using the KAPA DNA Hyper Prep kit (KK 8504, KAPA, USA); the detailed procedure follows the product instructions. The input for each cfDNA sample was 10 ng. A-tails were added to the fragment ends, adapters were ligated, and the product was purified, amplified by 7 cycles of PCR to enrich the library, and purified again; the DNA was finally eluted with 25 μl of eluent. Qubit was used to determine the concentration of the plasma cfDNA library, and a 4150 instrument was used to determine its fragment distribution. Qualified libraries were subjected to whole genome sequencing on a NovaSeq 6000 platform, with a 2x150 bp sequencing strategy and a sequencing volume of 10 G (3x).
(5) cfDNA fragment size feature extraction
Sequence information of cfDNA in patient and healthy control plasma was obtained based on low pass Whole Genome Sequencing (LP-WGS) detection technique. The analytical flow of the sequencing data is as follows:
1) Sequencing data alignment. After the adapters were removed (fastq) from the raw fastq data obtained by LP-WGS sequencing, the sequencing data were aligned to the human reference genome hg19 (genome download link: ftp://ftp-transfer.ncbi.nih.gov/1000genome/ftp/technical/reference/human_g1k_v37.fasta.gz) using BWA software (version: 0.7.12-r1039); low-quality sequences were then removed from the resulting BAM file and duplicate sequences were filtered out.
2) cfDNA fragment length statistics.
3) Whole-genome fragment size distribution profile. Low-coverage regions and the Duke blacklisted regions of the hg19 reference genome were excluded. The hg19 autosomes were then divided into 504 contiguous, non-overlapping windows, each 5 Mb in length. In each window, the number of short cfDNA fragments (longer than 130 bp and shorter than 177 bp) and the number of long cfDNA fragments (longer than 177 bp and shorter than 237 bp) were counted, and the ratio of the two was calculated. This ratio was then used to visualize the cfDNA fragment size profile across the whole genome; FIG. 1 shows the resulting fragment length distribution profile of cfDNA at the whole-genome level.
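The per-window counting in step 3) can be sketched as follows. This is a minimal illustration, assuming the aligned fragments have already been reduced to (chromosome, start, length) tuples; the function names `window_fragment_counts` and `short_long_ratio`, the half-open handling of the length boundaries, and the toy input are assumptions made for illustration and are not taken from the patent.

```python
from collections import defaultdict

WINDOW = 5_000_000   # 5 Mb windows, as described in the text
SHORT = (130, 177)   # short-fragment length range in bp
LONG = (177, 237)    # long-fragment length range in bp


def window_fragment_counts(fragments, window=WINDOW):
    """Count short and long fragments per (chromosome, window index).

    `fragments` is an iterable of (chrom, start, length) tuples.
    Boundary handling is an assumption: short is [130, 177), long is [177, 237).
    """
    counts = defaultdict(lambda: {"short": 0, "long": 0})
    for chrom, start, length in fragments:
        key = (chrom, start // window)
        if SHORT[0] <= length < SHORT[1]:
            counts[key]["short"] += 1
        elif LONG[0] <= length < LONG[1]:
            counts[key]["long"] += 1
    return dict(counts)


def short_long_ratio(counts):
    """Short/long ratio per window; windows without long fragments are skipped."""
    return {k: c["short"] / c["long"] for k, c in counts.items() if c["long"] > 0}


# Toy fragments (made-up values, for illustration only).
toy = [
    ("chr1", 100, 150), ("chr1", 4_999_999, 200),
    ("chr1", 5_000_000, 160), ("chr1", 5_100_000, 165),
    ("chr1", 5_200_000, 180), ("chr2", 10, 300),
]
counts = window_fragment_counts(toy)
ratios = short_long_ratio(counts)
```

In a real pipeline the blacklisted and low-coverage regions would be removed before counting; that filtering is left out of this sketch.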
(6) Machine learning classification model establishment
1) Determining the fragment size features. Based on the difference in fragment distribution between cancer patients and healthy controls, the interval in which the two distributions differ most was identified; see FIG. 2, which shows the difference distribution of cfDNA fragments between cancer patients and healthy individuals, where the ordinate is the difference in occurrence frequency of cfDNA fragments between cancer patients and healthy individuals. The short fragment range was defined as [130,177] and the long fragment range as [177,237]; the normalized z-scores of the short-fragment count and the total fragment count in each of the 504 windows were then calculated and used as the feature input values for model training.
2) Dividing the samples into a training set and a test set. All samples were divided into a training set and a test set at a ratio of 4:1, keeping the proportions of healthy controls and of each cancer type consistent between the two sets.
3) Processing the sample data in the training set. In the training cohort, model coefficients were obtained using 5-fold cross-validation repeated 30 times.
4) Evaluating the efficacy of the model. A receiver operating characteristic (ROC) curve of the training set was drawn from the model predicted value and the pathological result of each sample in the training set. A series of thresholds on the predicted values was used to divide the training set into healthy individuals and cancer patients, and the predictive efficacy of the model was evaluated with the pathological result as the true value. The evaluation metrics include the area under the curve (AUC, value range 0-1), positive predictive value (PPV, value range 0-1), specificity (value range 0-1), accuracy (value range 0-1) and sensitivity (value range 0-1); the higher the value, the better the performance.
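The 4:1 split and the 30-times-repeated 5-fold cross-validation described above can be sketched as index bookkeeping. This is a sketch under stated assumptions: the helper names `stratified_split` and `repeated_kfold`, the random seeds, and the use of plain (non-stratified) folds inside the cross-validation are illustrative choices, and the class sizes are those reported for the training/test cohort in Example 2. How the patent derives model coefficients from the folds is not specified, so only the splitting is shown.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split sample indices roughly 4:1 while keeping each class's share similar."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, test = [], []
    for label in sorted(by_class):
        members = by_class[label][:]
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return sorted(train), sorted(test)


def repeated_kfold(indices, k=5, repeats=30, seed=0):
    """Yield (train_idx, val_idx) pairs: `repeats` reshuffles of a k-fold split."""
    rng = random.Random(seed)
    for _ in range(repeats):
        shuffled = indices[:]
        rng.shuffle(shuffled)
        folds = [shuffled[i::k] for i in range(k)]
        for i in range(k):
            val = folds[i]
            train = [x for j, fold in enumerate(folds) if j != i for x in fold]
            yield train, val


# Class sizes of the training/test cohort as reported in Example 2.
labels = (["pancreatic"] * 29 + ["gallbladder"] * 15
          + ["bile_duct"] * 16 + ["healthy"] * 31)
train_idx, test_idx = stratified_split(labels)
splits = list(repeated_kfold(train_idx))  # 30 repeats x 5 folds
```

Each of the resulting train/validation pairs would be used to fit and score one candidate model; aggregating the fitted coefficients across pairs is left open here because the text does not describe it.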
(7) Verification of classification model predictive efficacy
In the independent validation cohort, the effectiveness of the model's predictive classification was verified using the classification model and the predicted values determined in the training cohort. The process is as follows:
1) Validation variables. In the independent validation cohort, the normalized z-scores of the cfDNA short-fragment counts and total fragment counts in the 504 genome-wide windows were used as variables.
2) Verifying model efficacy. An ROC curve of the test set was drawn from the molecular marker expression quantity and the pathological result of each sample in the test set. Based on the predicted values, the independent validation cohort was divided into a healthy group and a cancer group (in the same way as for the training set and test set), and the predictive efficacy of the model, including specificity, sensitivity and accuracy, was evaluated; the higher the values, the better the performance.
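The evaluation metrics named in this example (AUC, PPV, specificity, sensitivity and accuracy) can be computed from per-sample predicted values and pathology labels as sketched below. The function names, the label coding (1 for cancer, 0 for healthy), the rule that a score at or above the threshold is called cancer, and the toy scores are all assumptions made for illustration.

```python
def confusion(y_true, scores, threshold):
    """Return (tp, fp, tn, fn); a score >= threshold is called cancer (label 1)."""
    tp = fp = tn = fn = 0
    for y, s in zip(y_true, scores):
        predicted = 1 if s >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def metrics(tp, fp, tn, fn):
    """Standard definitions; assumes both classes and both predictions occur."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


def auc(y_true, scores):
    """AUC as the chance that a cancer sample outscores a healthy one (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels and scores (made-up values, not results from the patent).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.4, 0.2, 0.1]
result = metrics(*confusion(y_true, scores, threshold=0.5))
area = auc(y_true, scores)
```

Sweeping `threshold` over the observed scores and plotting sensitivity against 1 - specificity gives the ROC curve; the pairwise AUC above equals the area under that curve.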
Example 2
(1) Study cohort and clinical information
The study enrolled two cohorts comprising 154 patients with biliary-pancreatic malignancies, confirmed by tumor markers, imaging examination (such as ultrasound and abdominal CT scanning) and pathological results, and 71 healthy individuals. Blood samples from the patients were collected before surgery, and blood samples from the healthy controls were also collected. The training set and test set together included 60 patients (29 with pancreatic cancer, 15 with gallbladder cancer and 16 with bile duct cancer) and 31 healthy individuals (Table 1). Samples were divided into a training set and a test set at a ratio of 4:1, keeping the proportions of healthy controls and of each cancer type consistent between the two sets. Table 1 shows the grouping information of healthy controls and patients in the training set and test set. The analysis showed no significant difference between the training set and the test set in gender ratio or in the ratio of healthy controls to cancer patients.
Table 1: training set and test set information
The independent validation cohort consisted of 94 patients with biliary-pancreatic tumors (37 with pancreatic cancer, 17 with gallbladder cancer and 40 with bile duct cancer) and 40 healthy individuals. Table 2 shows the participant information for the training set and the independent validation cohort. The analysis showed no significant difference between the training set and the independent validation cohort in gender ratio or in the ratio of healthy controls to cancer patients.
Table 2: training set and independent verification queue information
(2) Health and cancer classification scoring model
A scoring model for distinguishing cancer patients from healthy individuals was constructed from the training set and the pathological results using a linear SVC algorithm. The model consists of three parts: the variables, the model formula and the predicted value. The process is as follows:
(1) Model variables and parameters. In the training cohort, model coefficients were obtained using 5-fold cross-validation repeated 30 times, with 1008 feature variables (Table 3): the normalized z-scores of the short-fragment count and the total fragment count in each of the 504 regions of 5 Mb length.
Table 3: model input variable examples
Sequence number | Model variables | Number of fragments |
---|---|---|
1 | Bin1 short fragment | N1 |
2 | All fragments of Bin1 | N2 |
…… | …… | …… |
1007 | Bin504 short fragment | N1007 |
1008 | All fragments of Bin504 | N1008 |
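The layout of the 1008 input variables in Table 3 can be illustrated with a short sketch. It assumes that the z-score is computed within one sample across its 504 bins, separately for short-fragment counts and total counts; the patent does not state whether normalization is per sample or across samples, so this is only one possible reading. The names `zscores` and `build_features` and the toy counts are hypothetical.

```python
from statistics import mean, pstdev


def zscores(values):
    """Standardize a list to mean 0 and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def build_features(short_counts, total_counts):
    """Interleave per-bin z-scores in Table 3 order: Bin1 short, Bin1 all, Bin2 short, ..."""
    if len(short_counts) != len(total_counts):
        raise ValueError("short and total counts must cover the same bins")
    z_short, z_total = zscores(short_counts), zscores(total_counts)
    features = []
    for s, t in zip(z_short, z_total):
        features.extend([s, t])
    return features


# Toy counts for 504 bins (made-up numbers, used only to show the shape).
short = [1000 + (i % 7) * 10 for i in range(504)]
total = [3000 + (i % 11) * 20 for i in range(504)]
x = build_features(short, total)  # one 1008-dimensional feature vector
```

Odd-numbered variables (1, 3, ..., 1007) then hold the short-fragment features and even-numbered variables hold the all-fragment features, matching the sequence numbers in Table 3.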
(2) Scoring model. The scoring model formula is as follows:
where x_i is the input variable; the model parameters w and b are calculated as follows:
where λ is the penalty parameter, n is the number of samples, and y_i is the true value of sample i (1 for cancer, -1 for healthy).
Using the classification model and the fragment distribution at different regions of the whole genome level for each sample, a class prediction result for each sample can be obtained.
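For reference, a standard soft-margin linear support vector classifier that uses the symbols defined above (w, b, λ, n, x_i, y_i) can be written as below. This is the textbook form, given here as an assumption; the sign convention for b and the exact scaling of the penalty term in the patent's own formulas may differ.

```latex
\[
  f(x_i) = w^{\top} x_i + b
\]
\[
  (w, b) = \arg\min_{w,\,b}\;
  \lambda \lVert w \rVert^{2}
  + \frac{1}{n} \sum_{i=1}^{n} \max\bigl(0,\; 1 - y_i \,(w^{\top} x_i + b)\bigr)
\]
```

Under this convention a positive score f(x_i) favors the cancer class (y_i = 1) and a negative score favors the healthy class (y_i = -1).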
(3) Model efficacy evaluation.
To construct an early screening classification model that distinguishes cancer patients from healthy individuals, the predicted value was used to divide the training set samples into healthy individuals and cancer patients. With the pathological result as the true value, the ROC curve of the training set was drawn from the predicted values and pathological results in the cohort; the AUC of the training set reached 1 (see FIG. 3). The PPV (accuracy), specificity and sensitivity of the trained model were 100%, 100% and 100%, respectively (Table 4). The results show that, in the training set, the risk prediction model has high sensitivity and NPV and good predictive efficacy for early diagnosis of cancer.
(4) Verification of the predictive efficacy of the discriminant model.
To verify the efficacy of the discriminant model, the test set subjects were divided into a healthy group and a cancer patient group using the threshold set on the predicted values (the same as for the training set). With the pathological result as the true value, the efficacy of the model was verified using the classification model and predicted values determined in the training cohort, and the ROC curve of the test set was drawn; the AUC reached 1 (see FIG. 4). The predictive efficacy of the model was also evaluated, including accuracy, specificity and sensitivity, which were 100% and 92%, respectively (Table 4). The results show that, in the test set, the risk prediction model also has high specificity, sensitivity and accuracy, that is, good predictive efficacy.
Table 4
(3) Independent validation queue validation
To further verify the efficacy of the discriminant model, the subjects of the independent validation cohort were divided into a healthy group and a cancer patient group using the threshold set on the predicted values. With the pathological result as the true value, the efficacy of the model was verified using the classification model and predicted values determined in the training set and test set, and the ROC curve of the independent validation cohort was drawn; the AUC of the independent validation cohort was as high as 0.94 (see FIG. 5). The predictive efficacy of the model was also evaluated: accuracy, specificity and sensitivity were 90.1%, 77.5% and 87.2%, respectively (Table 5). The results show that, in the independent validation cohort, the risk prediction model has high specificity, sensitivity and accuracy, that is, good predictive efficacy.
Table 5: independent validation queue model effectiveness validation
The foregoing is only a part of the preferred embodiments of the present invention, and the present invention is not limited to the contents of the embodiments. It will be apparent to those skilled in the art that various changes and modifications can be made within the scope of the technical solution of the present invention, and any changes and modifications are within the scope of the present invention.
Claims (9)
1. A non-invasive early screening method for cancer based on cfDNA fragment length distribution characteristics, characterized by comprising: determining, by low-depth whole-genome sequencing, the differences in fragment length distribution characteristics between tumor-derived cfDNA and cfDNA from healthy individuals, and establishing an early cancer screening model, thereby achieving non-invasive early screening of cancer.
2. The non-invasive early screening method for cancer based on cfDNA fragment length distribution characteristics according to claim 1, wherein: the normalized z-scores of the short-fragment count and the total fragment count of cfDNA within 504 regions of 5 Mb length across the whole genome are calculated as feature input values for model training.
3. The non-invasive early screening method for cancer based on cfDNA fragment length distribution characteristics according to claim 1, wherein: the cfDNA fragments comprise short fragments and long fragments, wherein the length of the cfDNA short fragments ranges from 130bp to 177bp, and the length of the long fragments ranges from 177bp to 237 bp.
4. The non-invasive early screening method for cancer based on cfDNA fragment length distribution characteristics according to claim 1, wherein: a linear SVC algorithm is adopted, and model coefficients are obtained using 5-fold cross-validation repeated 30 times to establish the early cancer screening model.
5. A non-invasive early screening method for cancer based on cfDNA fragment length distribution characteristics according to any of claims 1-4, characterized in that: the cancer is a biliary-pancreatic malignancy.
6. A non-invasive early screening method for cancer based on cfDNA fragment length distribution characteristics according to any of claims 1-4, characterized in that: the biliary-pancreatic malignancy includes pancreatic cancer, gallbladder cancer and cholangiocarcinoma.
7. A cancer noninvasive early screening system based on cfDNA fragment length distribution characteristics, the system comprising:
the cfDNA fragment characteristic extraction module is used for obtaining cfDNA fragment size characteristic data in a sample;
the machine learning classification model building module is used for building an early cancer screening model from the statistical differences in cfDNA fragment size characteristics between tumor-derived cfDNA and cfDNA from healthy individuals;
and the independent verification queue evaluation module is used for verifying the prediction efficiency of the established machine learning classification model through the independent verification queue.
8. The non-invasive early screening system for cancer based on cfDNA fragment length distribution characteristics of claim 7, wherein the cfDNA fragment characteristic extraction module comprises:
a sequencing data alignment unit for aligning the sequencing data to the human reference genome hg19 after the sequencing adapters have been removed;
the cfDNA fragment counting unit is used for counting cfDNA fragment length data information; dividing the hg19 autosome into 504 contiguous, non-intersecting window segments, each window segment 5Mb in length; counting the ratio of the number of cfDNAs with the length of more than 130bp and less than 177bp to the number of cfDNAs with the length of more than 177bp and less than 237bp in each window area; finally, obtaining the number of cfDNA long and short fragments in each 5Mb interval;
the cfDNA fragment characteristic determining unit is used for determining a section with the largest difference between the fragment distribution of the cancer patient and the healthy control according to the difference distribution of the fragment distribution between the cancer patient and the healthy control; a short segment range [130,177], a long segment range [177,237] are defined, and then the normalized z-score of the short segment cfDNA and the total segment number of each of the 504 windows is calculated as the feature input value of the model training.
9. The cfDNA fragment length distribution feature-based cancer noninvasive early screening system of claim 7, wherein the machine learning classification model building module comprises:
a sample data classifying unit for dividing the samples into a training set and a test set at a ratio of 4:1, keeping the proportions of healthy control samples and of each cancer type consistent between the two sets;
the model parameter acquisition unit is used for processing sample data in the training set; in a training queue, a model parameter is obtained by using a 30-time repeated 5-fold cross validation method;
and the model efficiency evaluation unit is used for drawing a receiver operation characteristic curve of the training queue according to the model predicted value and the pathology detection result of each sample in the training queue.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210704961.5A CN117316278A (en) | 2022-06-21 | 2022-06-21 | Cancer noninvasive early screening method and system based on cfDNA fragment length distribution characteristics |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210704961.5A CN117316278A (en) | 2022-06-21 | 2022-06-21 | Cancer noninvasive early screening method and system based on cfDNA fragment length distribution characteristics |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117316278A true CN117316278A (en) | 2023-12-29 |
Family
ID=89241284
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210704961.5A Pending CN117316278A (en) | 2022-06-21 | 2022-06-21 | Cancer noninvasive early screening method and system based on cfDNA fragment length distribution characteristics |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117316278A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117935914A (en) * | 2024-03-22 | 2024-04-26 | 北京求臻医学检验实验室有限公司 | Unknown-meaning clonal hematopoietic recognition and application method thereof |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20220186320A1 (en) | MicroRNA Marker Combination for Diagnosing Gastric Cancer and Diagnostic Kit | |
CN107727865A (en) | The systemic detection method of tumor markers and its application | |
CN111218513B (en) | Peripheral blood extracellular vesicle microRNA biomarker for early diagnosis of lung cancer and application thereof | |
CN107034301A (en) | A kind of detection Lung neoplasm is benign or pernicious kit and its application | |
WO2022161076A1 (en) | Methylation markers for detection of benign/malignant pulmonary nodules or combination thereof, and application thereof | |
CN111833963A (en) | cfDNA classification method, device and application | |
CN112553344B (en) | Biomarker related to colorectal cancer and application thereof | |
CN109112216A (en) | The kit and method of triple qPCR detection DNA methylations | |
CN108588230A (en) | A kind of marker and its screening technique for breast cancer diagnosis | |
CN112609015A (en) | Microbial marker for predicting colorectal cancer risk and application thereof | |
KR20170067137A (en) | METHOD FOR DISCOVERING miRNA BIOMARKER FOR CANCER DIAGNOSIS AND USE THEREOF | |
CN110570951A (en) | Method for constructing classification model of new auxiliary chemotherapy curative effect of breast cancer | |
CN117316278A (en) | Cancer noninvasive early screening method and system based on cfDNA fragment length distribution characteristics | |
CN117757928A (en) | Plasma exosome RNA biomarker group for early diagnosis of chronic pancreatitis and application thereof | |
CN112951325A (en) | Design method and application of probe combination for cancer detection | |
CN111690746A (en) | Platelet RNA marker related to lung cancer and application thereof | |
CN110408706A (en) | It is a kind of assess recurrent nasopharyngeal carcinoma biomarker and its application | |
CN114875155A (en) | Gene mutation and application thereof in diagnosis of pancreatic and biliary tract cancer | |
CN115803448A (en) | Micronucleus DNA from peripheral red blood cells and uses thereof | |
CN110628907B (en) | Gallbladder cancer plasma exosome microRNAs markers and application thereof | |
CN112852969A (en) | Epigenetically modified lncRNA as tumor diagnosis or tumor progression prediction marker | |
WO2019095541A1 (en) | Composition and method for diagnosing and predicting breast cancer bone metastases | |
CN116287252B (en) | Application of long-chain non-coding RNA APCDD1L-DT in preparation of pancreatic cancer detection products | |
CN115747333B (en) | Tumor marker detection kit, detection analysis system and application thereof | |
CN115820857B (en) | Kit for identifying gastric precancerous lesions and gastric cancer and diagnosing gastric cancer |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||