CN117316278A - Cancer noninvasive early screening method and system based on cfDNA fragment length distribution characteristics

Info

Publication number
CN117316278A
Authority
CN
China
Prior art keywords
cfdna
cancer
model
early screening
fragment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210704961.5A
Other languages
Chinese (zh)
Inventor
张大东
姜国娟
段侨南
许晓雅
陈升
张玮
陈灏
李志宽
年宝宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai 3D Medicines Co Ltd
Original Assignee
Shanghai 3D Medicines Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai 3D Medicines Co Ltd
Priority to CN202210704961.5A
Publication of CN117316278A
Legal status: Pending

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6869 Methods for sequencing
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6876 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
    • C12Q1/6883 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material
    • C12Q1/6886 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes for diseases caused by alterations of genetic material for cancer
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Organic Chemistry (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Analytical Chemistry (AREA)
  • Biotechnology (AREA)
  • Wood Science & Technology (AREA)
  • Medical Informatics (AREA)
  • Genetics & Genomics (AREA)
  • Zoology (AREA)
  • Immunology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Pathology (AREA)
  • Microbiology (AREA)
  • Biochemistry (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Hospice & Palliative Care (AREA)
  • Software Systems (AREA)
  • Oncology (AREA)
  • Public Health (AREA)
  • Evolutionary Computation (AREA)
  • Epidemiology (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Bioethics (AREA)
  • Artificial Intelligence (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract

The invention discloses a noninvasive early cancer screening method and system based on cfDNA fragment length distribution characteristics. Using low-depth whole-genome sequencing, the method measures the differences in fragment length distribution between tumor-derived cfDNA and cfDNA from healthy individuals, establishes an early cancer screening model, and thereby achieves noninvasive early screening of cancer. First, the scheme relies on the size characteristics of blood cfDNA fragments to distinguish ctDNA from cfDNA of non-tumor origin; it does not depend on detecting mutations in oncogenes or tumor suppressor genes, which eliminates interference from clonal hematopoiesis mutations. Second, the data in the embodiments show that the size characteristics of blood cfDNA fragments can distinguish healthy people from early-stage tumor patients. Finally, because low-depth whole-genome sequencing is used, the detection cost is greatly reduced, which favors future application in the early screening of malignant tumors.

Description

Cancer noninvasive early screening method and system based on cfDNA fragment length distribution characteristics
Technical Field
The invention belongs to the technical field of medical detection, and particularly relates to a non-invasive early screening method and system for cancers based on cfDNA fragment length distribution characteristics.
Background
Most cancer deaths worldwide result from late diagnosis, which leaves little room for effective therapeutic intervention, so early diagnosis of tumors is particularly important. Traditional biomarkers and imaging techniques play an important role in tumor diagnosis; however, the specificity of traditional serum biomarkers is not sufficient to guide treatment. Furthermore, imaging cannot be used for "real-time" monitoring because of radiation exposure and cost. From the first plasma cfDNA (cell-free DNA) liquid biopsy product, based on EGFR gene mutation and approved by the FDA in 2016, to the demonstration that bTMB (blood tumor mutation burden) can predict the effect of immunotherapy, liquid biopsy has attracted great attention in the field of tumor therapy. Liquid biopsies are increasingly being considered for early tumor diagnosis, treatment guidance and recurrence monitoring, because they can provide information about the tumor. Furthermore, liquid biopsies provide a non-invasive alternative to traditional "solid biopsies", which in some cases cannot be performed, or cannot be performed repeatedly in "real time". Despite these advantages, limitations remain, such as the lack of consensus on detection methods, the difficulty of analyzing large amounts of measurement data, and insufficient support from evidence-based medicine.
Currently, the conventional approach to liquid biopsy and early cancer screening is to identify tumor-released cfDNA by detecting mutations in oncogenes or tumor suppressor genes. Unfortunately, the amount of circulating tumor DNA (ctDNA) in blood is typically much lower than that of non-cancer-related DNA fragments, which makes its detection very difficult, especially in the early stages of cancer. Previous studies have found that accurate tumor information can only be obtained when ctDNA accounts for 10% or more of cfDNA. However, apart from some advanced tumors that release large amounts of ctDNA, most tumor patients do not reach this abundance. At present, the sensitivity and accuracy of ctDNA detection are improved mainly by increasing the sequencing depth, but deeper sequencing may cause false positives, because non-tumor-derived DNA may also carry various tumor-related mutations. High-depth sequencing is also extremely expensive and cannot be applied widely in the clinic. Furthermore, mutation studies have mostly targeted patients with advanced, metastatic cancer. Razavi et al. (Razavi, P., Li, B.T., Brown, D.N. et al. High-intensity sequencing reveals the sources of plasma circulating cell-free DNA variants. Nat Med 25, 1928-1937 (2019)) sequenced 508 genes at a depth of 60,000X in the peripheral blood of 124 metastatic cancer patients and 47 healthy people and found large numbers of cfDNA mutations; 81.6% of the mutations in healthy people and 53.2% of those in cancer patients arose from clonal hematopoiesis in leukocytes rather than from tumor cells. Identifying cancer patients through mutations in specific target genes can therefore produce a high false-positive rate.
Moreover, because cancer patients vary greatly, some patients may be missed once a specific set of target genes has been defined. Detection of mutations in a specific cfDNA panel is therefore limited. These problems have restricted the use of liquid biopsies.
Another finding might solve this problem. Previous studies have shown that the length of cfDNA released into the blood may differ between cell types. For example, much cfDNA is 167 bp long (similar to the length of DNA wrapped around one nucleosome), which may be associated with caspase-dependent DNA cleavage during apoptosis. Fetal cfDNA is significantly shorter than maternal cfDNA, a feature that is used for prenatal diagnosis. These studies suggest that cfDNA from different cell sources may show distinctive length patterns, which could serve as a signal of cfDNA origin. For a number of reasons, the packaging of cancer genomes becomes disorganized, so that when cancer cells die they release DNA into the blood in a disordered manner, and circulating tumor DNA is therefore likely to differ from non-tumor DNA in its fragment length distribution. Studies have reported that using cfDNA fragment length as a feature to distinguish tumor from non-tumor DNA can overcome the low sensitivity and false positives of cfDNA mutation detection. However, the few previous studies of cfDNA fragment length have produced conflicting results, so research on cfDNA fragment length is still at an early stage, and a scientific, systematic approach to studying cfDNA fragmentation urgently needs to be developed and optimized.
At present, few reports use cfDNA fragment size for liquid biopsy research and early cancer screening. The two studies most similar to the present invention are PMID: 30404863 and PMID: 31142840:
Mouliere et al. (Enhanced detection of circulating tumor DNA by fragment size analysis. Sci Transl Med, 2018. 10(466)) examined cfDNA fragment length characteristics by whole-genome sequencing of cfDNA extracted from 344 plasma samples collected from 200 cancer patients (covering 18 cancer types) and 65 plasma samples from healthy people. The analysis showed that ctDNA fragments carrying cancer mutations are generally 20-40 bp shorter than nucleosomal DNA fragments (167 bp) and are enriched in the 90-150 bp interval. The researchers could therefore raise the abundance of ctDNA by selectively sequencing short fragments, and used the fragment proportions in different length intervals as features for a machine learning algorithm that distinguishes tumor blood samples from healthy blood samples. However, useful information such as tumor mutations may also reside on fragments of other lengths, so selectively enriching only the 90-150 bp fragments may lose some information. In addition, that study trained its model on the overall fragment size proportions combined with mutation features and did not systematically consider fragment size proportions at specific functional sites across the whole genome, so the use of cfDNA fragment features as an index for early cancer diagnosis still leaves considerable room for improvement.
Cristiano et al. (Cristiano, S., et al., Genome-wide cell-free DNA fragmentation in patients with cancer. Nature, 2019. 570(7761): p. 385-389) examined, by whole-genome sequencing, blood samples from 208 patients at various stages of breast, colorectal, lung, ovarian, pancreatic, gastric and bile duct cancer and from 215 healthy persons, and detected circulating tumor DNA in 57% to 99% of the patients. The method is based on low-coverage whole-genome sequencing of cfDNA. The numbers of short and long cfDNA fragments mapped to different regions of the genome are analyzed in non-overlapping windows covering the genome in order to build a model for predicting early cancer types. However, most patients in that study cohort had advanced cancer of a limited number of cancer types, and the fragment size statistics applied a one-size-fits-all criterion across cancer types (short fragments defined as 100 to 150 bp and long fragments as 151 to 220 bp); the study lacked larger-sample statistics and cancer-type-specific optimization of the fragment parameters for identifying a single cancer type.
Based on this, those skilled in the art need a noninvasive early cancer screening method that can identify early-stage cancer patients from low-depth whole-genome sequencing, greatly reducing the cost of early cancer screening while improving its accuracy.
Disclosure of Invention
The invention aims to provide a noninvasive early screening method for cancers based on cfDNA fragment length distribution characteristics. It mainly solves the technical problems of low specificity and high false-positive rate in the early screening of biliopancreatic malignancies in the prior art.
The technical scheme adopted by the invention for solving the technical problems is as follows:
A non-invasive early screening method for cancers based on cfDNA fragment length distribution characteristics comprises the following steps: measuring the differences in fragment length distribution between tumor-derived cfDNA and cfDNA from healthy individuals by low-depth whole-genome sequencing, establishing an early cancer screening model, and thereby achieving noninvasive early screening of cancer. Low-depth whole-genome sequencing refers to a sequencing depth of 2X-4X.
As a preferred embodiment, the normalized z-scores of the number of short cfDNA fragments and of the total number of fragments within each of 504 regions of 5 Mb across the whole genome are calculated and used as feature inputs for model training. The total number of fragments is the sum of the numbers of long fragments and short fragments as defined below.
The cfDNA fragments comprise short fragments and long fragments; short fragments are 130 bp to 177 bp long and long fragments are 177 bp to 237 bp long.
As a preferred embodiment, the Linear SVC algorithm is used, and the model coefficients are obtained by 5-fold cross-validation repeated 30 times, so as to establish the early cancer screening model.
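The training scheme just described can be sketched as follows. This is only an illustration under stated assumptions: the text does not say how coefficients from the individual folds are combined, so averaging them is an assumption, as are the synthetic data, the value `C=1.0` and the use of scikit-learn's `LinearSVC` and `RepeatedStratifiedKFold`.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.svm import LinearSVC

# Synthetic stand-in data (assumption): 73 samples x 1008 z-score features.
rng = np.random.default_rng(0)
n_samples, n_features = 73, 1008
y = np.where(np.arange(n_samples) < 48, 1, -1)   # 1 = cancer, -1 = healthy
X = rng.normal(size=(n_samples, n_features))
X[y == 1, :20] += 1.0                            # artificial signal in 20 features

# 5-fold cross-validation repeated 30 times gives 150 fits.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=30, random_state=0)
coefs, intercepts, scores = [], [], []
for fit_idx, val_idx in cv.split(X, y):
    model = LinearSVC(C=1.0, max_iter=10000, random_state=0)
    model.fit(X[fit_idx], y[fit_idx])
    coefs.append(model.coef_[0])
    intercepts.append(model.intercept_[0])
    scores.append(model.score(X[val_idx], y[val_idx]))

# Assumption: the final coefficients are the mean over all fits.
w = np.mean(coefs, axis=0)
b = float(np.mean(intercepts))
mean_cv_accuracy = float(np.mean(scores))
```

Averaging is one plausible way to turn 150 fitted models into a single coefficient vector; selecting a hyperparameter by cross-validation and refitting on the full training set would be another.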
As a preferred embodiment, the cancer is a biliopancreatic malignancy. Biliopancreatic malignancies include pancreatic cancer, gallbladder cancer and cholangiocarcinoma.
The invention also provides a cancer noninvasive early screening system based on cfDNA fragment length distribution characteristics, which comprises:
a cfDNA fragment feature extraction module for obtaining cfDNA fragment size feature data from a sample;
a machine learning classification model building module for building an early cancer screening model from the statistical differences in cfDNA fragment size characteristics between tumor-derived cfDNA and cfDNA from healthy individuals;
and an independent validation cohort evaluation module for verifying the predictive efficacy of the established machine learning classification model in an independent validation cohort.
As a preferred embodiment, the cfDNA fragment feature extraction module comprises:
a sequencing data alignment unit for aligning the sequencing data to the human reference genome hg19 after removing the sequencing adapters;
a cfDNA fragment counting unit for collecting cfDNA fragment length statistics: dividing the hg19 autosomes into 504 contiguous, non-overlapping windows, each 5 Mb long; calculating, in each window, the ratio of the number of cfDNA fragments longer than 130 bp and shorter than 177 bp to the number longer than 177 bp and shorter than 237 bp; and finally obtaining the numbers of long and short cfDNA fragments in each 5 Mb window;
a cfDNA fragment feature determining unit for determining, from the difference in fragment length distribution between cancer patients and healthy controls, the intervals in which the two distributions differ most; the short fragment range [130,177] and the long fragment range [177,237] are defined, and the normalized z-scores of the short-fragment count and of the total fragment count in each of the 504 windows are then calculated as feature inputs for model training.
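As a hedged illustration of the feature construction performed by these units, the sketch below turns per-window short and long counts into a 1008-element vector. The text does not state over which axis the z-score is taken; normalizing within one sample across its 504 windows is an assumption here, and the Poisson counts are synthetic.

```python
import numpy as np

def zscore(values):
    """Standardize to zero mean and unit standard deviation."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()

def feature_vector(short_counts, long_counts):
    """Interleave z-scored short and total counts: bin1 short, bin1 total, ..."""
    short = np.asarray(short_counts, dtype=float)
    total = short + np.asarray(long_counts, dtype=float)  # total = short + long
    features = np.empty(2 * short.size)
    features[0::2] = zscore(short)   # assumption: z-score across the 504 windows
    features[1::2] = zscore(total)
    return features

# Synthetic counts for one sample over 504 windows (illustration only).
rng = np.random.default_rng(0)
short = rng.poisson(4000, size=504)
long_ = rng.poisson(6000, size=504)
x = feature_vector(short, long_)
```

If the intended normalization is instead per feature across training samples, the same `zscore` helper would be applied column-wise to the sample-by-feature matrix.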
As a preferred embodiment, the machine learning classification model building module includes:
a sample data classifying unit for dividing the samples into a training set and a test set at a ratio of 4:1, keeping the proportions of healthy control samples and of each cancer type consistent between the two sets;
a model parameter acquisition unit for processing the sample data in the training set; in the training cohort, the model parameters are obtained by 5-fold cross-validation repeated 30 times;
and a model efficacy evaluation unit for plotting the receiver operating characteristic curve of the training cohort from the model predicted value and the pathological result of each sample in the training cohort.
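The 4:1 division with consistent class proportions performed by the sample data classifying unit can be sketched with the standard library alone. The class sizes below are those reported for the training and test cohort in Example 2 (29 pancreatic, 15 gallbladder, 16 bile duct, 31 healthy); the rounding rule and the random seed are assumptions, so the resulting split need not match Table 1.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split indices about 4:1 while keeping each class's share similar."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)  # assumption: round per class
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)

labels = (["pancreatic"] * 29 + ["gallbladder"] * 15
          + ["bile_duct"] * 16 + ["healthy"] * 31)
train_idx, test_idx = stratified_split(labels)
```

With these counts the sketch places 6, 3, 3 and 6 samples of the four classes in the test set, 18 of 91 in total.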
Compared with the prior art, the invention has the following beneficial effects:
First, the scheme relies on the size characteristics of blood cfDNA fragments to distinguish ctDNA from cfDNA of non-tumor origin and does not depend on detecting mutations in oncogenes or tumor suppressor genes, so interference from clonal hematopoiesis mutations is eliminated. Second, the data in the embodiments show that the size characteristics of blood cfDNA fragments can distinguish healthy people from early-stage tumor patients, whereas mutation-based ctDNA signals often cannot be detected in early-stage patients because their ctDNA content is low. Finally, because low-depth whole-genome sequencing is used, the detection cost is greatly reduced compared with high-depth or ultra-deep NGS detection methods. These advantages favor the future application of the scheme in the early screening of malignant tumors.
In the invention, plasma samples from 60 clinically diagnosed biliopancreatic tumor patients and 31 healthy controls underwent low-depth (2X-4X) whole-genome cfDNA sequencing. Taking into account the distribution of cfDNA fragment sizes across the whole genome, an analysis system was established for the features that distinguish tumor-cell DNA from non-tumor-cell DNA. By systematic statistical analysis of the sizes and numbers of DNA fragments in blood covering different regions of the whole genome, and by training and testing on the cfDNA fragment size features in the study cohort, a biomarker diagnostic model for early screening of biliopancreatic malignancies was established. Furthermore, the model was independently validated in 94 patients with biliopancreatic tumors and 40 healthy persons, which confirmed the efficacy of the diagnostic model based on the length distribution of cell-free DNA fragments in blood. By analyzing cfDNA length characteristics in blood more precisely, the method finds clues for early tumor screening and provides more solid and reliable data support for precise clinical application.
Drawings
FIG. 1 is a fragment length distribution profile of cfDNA at the whole genome level in example 1 of the present invention.
FIG. 2 is a graph showing the difference distribution of cfDNA fragments of a cancer patient and a healthy individual in example 1 of the present invention.
FIG. 3 is the ROC curve of the training set in example 2 of the present invention.
FIG. 4 is the ROC curve of the test set in example 2 of the present invention.
FIG. 5 is the ROC curve of the independent validation cohort in example 2 of the present invention.
Detailed Description
The following describes the technical scheme of the present invention in detail by referring to examples. The reagents and biological materials used hereinafter are commercial products unless otherwise specified.
Example 1
(1) Study cohort and clinical information
The study enrolled 154 patients with biliopancreatic tumors (pancreatic cancer, gallbladder cancer and bile duct cancer), confirmed by tumor markers, imaging examination (such as ultrasound and abdominal CT) and pathological results, and 71 healthy controls. Blood samples from the patients were collected before surgery, and blood samples from the healthy controls were also collected. Each enrolled patient received a definitive diagnosis after surgery based on the pathological examination results.
(2) Blood collection, separation and storage
Whole blood from preoperative cancer patients and healthy controls was collected in 10 ml cell-free nucleic acid preservation tubes (REF 43803, BD, USA) and transported at room temperature. Plasma was separated from the received whole blood by two-step centrifugation. Plasma and cellular components were first separated by centrifugation at 1600 g for 10 minutes at 4 ℃; the supernatant was carefully aspirated without disturbing the buffy coat, and the hemolysis grade of the plasma was recorded. Samples with a hemolysis grade > 5 were excluded from the subsequent study. The plasma was then centrifuged again at 16,000 g for 15 minutes at 4 ℃ to remove any remaining cells or cell debris. The supernatant was transferred to centrifuge tubes in 1 ml aliquots, and the separated plasma samples were stored in a -80 ℃ freezer.
(3) cfDNA extraction
Plasma samples were taken out of the -80 ℃ freezer, incubated without agitation in a 37 ℃ water bath for about 5 minutes, transferred to a refrigerated centrifuge and centrifuged at 1600 g for 10 minutes at 4 ℃, and the supernatant was carefully transferred to a centrifuge tube. Plasma cfDNA was extracted from 1 ml of plasma using the QIAamp Circulating Nucleic Acid Kit (55114, Qiagen, Shanghai, China), and in the final step cfDNA was eluted with 30 μl of EB; see the product instructions for the detailed procedure. The total amount of extracted cfDNA was quantified using a Qubit fluorometer and the matching reagent (Q32854, Thermo Fisher, USA). The cfDNA fragment distribution was measured using an Agilent 2100 Bioanalyzer and the corresponding Agilent High Sensitivity DNA Kit & Reagents (5067-4626, Agilent, USA).
(4) cfDNA library construction and WGS sequencing
Samples that passed cfDNA quality control were used for cfDNA library construction and WGS sequencing. Libraries were prepared using the KAPA DNA Hyper Prep kit (KK8504, KAPA, USA); see the product instructions for the detailed procedure. The input for each cfDNA sample was 10 ng. The fragment ends were A-tailed, adapters were ligated, and the product was purified, amplified by 7 cycles of PCR to enrich the library, and purified again; the DNA was finally eluted with 25 μl of eluent. Qubit was used to measure the concentration of the plasma cfDNA library, and the 4150 was used to measure its fragment distribution. Qualified libraries underwent whole-genome sequencing on the NovaSeq 6000 platform with a 2x150 bp strategy and a sequencing output of 10 G (3X).
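The stated relation between 10 G of sequence and roughly 3X depth can be checked with a short calculation. The genome size of 3.1 Gb is an assumed round figure that does not come from the text, and the function names are illustrative.

```python
# Mean depth is approximately total sequenced bases divided by genome size.
GENOME_SIZE_BP = 3.1e9  # assumption: approximate human genome size

def mean_depth(total_bases, genome_size=GENOME_SIZE_BP):
    """Approximate average sequencing depth."""
    return total_bases / genome_size

def read_pairs_needed(total_bases, read_length=150):
    """Number of read pairs for a paired-end run (2 x read_length per pair)."""
    return round(total_bases / (2 * read_length))

depth = mean_depth(10e9)         # 10 G bases
pairs = read_pairs_needed(10e9)  # 2x150 bp strategy
```

Under this assumption 10 G bases correspond to about 3.2X, consistent with the "3X" quoted above and within the 2X-4X range defined for low-depth sequencing, and to about 33 million read pairs.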
(5) cfDNA fragment size feature extraction
Sequence information of cfDNA in patient and healthy control plasma was obtained by low-pass whole-genome sequencing (LP-WGS). The analysis workflow for the sequencing data is as follows:
1) Sequencing data alignment. After removal of the adapters (fastq) from the raw fastq data obtained by LP-WGS sequencing, the sequencing data were aligned to the human reference genome hg19 (genome download link: ftp://ftp-transfer.ncbi.nih.gov/1000genome/ftp/technical/reference/human_g1k_v37.fasta.gz) using BWA software (version: 0.7.12-r1039); low-quality sequences were removed from the resulting BAM file and duplicate sequences were filtered out.
2) cfDNA fragment length statistics.
3) Whole-genome fragment size distribution profile. Low-coverage regions and the Duke blacklisted regions of the hg19 reference genome are excluded; the hg19 autosomes are then divided into 504 contiguous, non-overlapping windows, each 5 Mb long; in each window, the ratio of the number of cfDNA fragments longer than 130 bp and shorter than 177 bp to the number longer than 177 bp and shorter than 237 bp is calculated; finally, the numbers of long and short cfDNA fragments in each 5 Mb window are obtained, and this ratio is used to visualize the genome-wide cfDNA fragment size profile, that is, the distribution of long and short cfDNA fragments at the whole-genome level (see FIG. 1).
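Steps 2) and 3) can be sketched in plain Python on SAM-format text. The field positions (FLAG, RNAME, POS, MAPQ, PNEXT, TLEN) follow the SAM specification; the MAPQ cutoff of 30, the assignment of exactly 177 bp to the long class and the demo records are assumptions, and the exclusion of low-coverage and blacklisted regions is not implemented here.

```python
from collections import Counter, defaultdict

WINDOW = 5_000_000                   # 5 Mb windows, as in the text
SKIP = 0x4 | 0x100 | 0x400 | 0x800   # unmapped, secondary, duplicate, supplementary
PROPER_PAIR, FIRST_IN_PAIR = 0x2, 0x40

def fragments(sam_lines, min_mapq=30):
    """Yield (chrom, start, length) once per properly paired fragment."""
    for line in sam_lines:
        if line.startswith("@"):     # header line
            continue
        f = line.rstrip("\n").split("\t")
        flag, chrom, pos, mapq = int(f[1]), f[2], int(f[3]), int(f[4])
        pnext, tlen = int(f[7]), int(f[8])
        if flag & SKIP or not flag & PROPER_PAIR or not flag & FIRST_IN_PAIR:
            continue
        if mapq < min_mapq or tlen == 0:   # assumption: MAPQ >= 30
            continue
        yield chrom, min(pos, pnext), abs(tlen)

def size_class(length):
    if 130 <= length < 177:
        return "short"
    if 177 <= length <= 237:         # assumption: 177 bp counts as long
        return "long"
    return None

def window_counts(frags):
    """Count short and long fragments per (chromosome, window index)."""
    counts = defaultdict(Counter)
    for chrom, start, length in frags:
        cls = size_class(length)
        if cls:
            counts[(chrom, (start - 1) // WINDOW)][cls] += 1  # POS is 1-based
    return counts

demo = [
    "@HD\tVN:1.6",
    "r1\t99\t1\t10000\t60\t150M\t=\t10017\t167\t*\t*",
    "r1\t147\t1\t10017\t60\t150M\t=\t10000\t-167\t*\t*",    # mate, not counted twice
    "r2\t99\t1\t20000\t60\t150M\t=\t20050\t200\t*\t*",
    "r3\t1123\t1\t30000\t60\t150M\t=\t30050\t200\t*\t*",    # duplicate, skipped
    "r4\t99\t1\t40000\t5\t150M\t=\t40050\t200\t*\t*",       # low MAPQ, skipped
    "r5\t83\t1\t5000100\t60\t100M\t=\t5000050\t-150\t*\t*",
]
counts = window_counts(fragments(demo))
```

The per-window short-to-long ratio is then `counts[key]["short"] / counts[key]["long"]` wherever the long count is non-zero. In practice the records would come from the filtered BAM file rather than from an in-memory list.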
(6) Machine learning classification model establishment
1) Determining the fragment size features. From the difference in fragment length distribution between cancer patients and healthy controls, the intervals in which the two distributions differ most are determined; see FIG. 2, which shows the difference in cfDNA fragment distribution between cancer patients and healthy individuals and whose ordinate is the difference in the frequency of occurrence of cfDNA fragments between cancer patients and healthy individuals. The short fragment range [130,177] and the long fragment range [177,237] are defined, and the normalized z-scores of the short-fragment count and of the total fragment count in each of the 504 windows are then calculated as feature inputs for model training.
2) Dividing the samples into a training set and a test set. All samples are divided into a training set and a test set at a ratio of 4:1, keeping the proportions of healthy controls and of each cancer type consistent between the two sets.
3) Processing the sample data in the training set. In the training cohort, the model coefficients are obtained by 5-fold cross-validation repeated 30 times.
4) Evaluating the efficacy of the model. The receiver operating characteristic curve (ROC curve) of the training set is plotted from the model predicted value and the pathological result of each sample in the training set. A series of thresholds on the predicted value is used to divide the training set into healthy people and cancer patients, and the predictive efficacy of the model is evaluated with the pathological result as the true value. The evaluation metrics include the area under the curve (AUC, Area Under Curve, range 0-1), positive predictive value (PPV, Positive Predictive Value, range 0-1), specificity (range 0-1), accuracy (range 0-1) and sensitivity (range 0-1); higher values indicate better performance.
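The evaluation metrics listed in step 4) can be computed as in the following minimal sketch. The scores and labels are made-up illustration values, the threshold of 0 is an assumption, and AUC is computed by pairwise comparison of cancer and healthy scores, which equals the area under the ROC curve.

```python
def auc(scores, labels):
    """AUC as the fraction of (cancer, healthy) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == -1]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def confusion_metrics(scores, labels, threshold=0.0):
    """Sensitivity, specificity, PPV and accuracy at one threshold.

    Assumes at least one sample is predicted positive and both classes occur.
    """
    tp = fp = tn = fn = 0
    for s, y in zip(scores, labels):
        predicted_cancer = s > threshold
        if predicted_cancer and y == 1:
            tp += 1
        elif predicted_cancer:
            fp += 1
        elif y == 1:
            fn += 1
        else:
            tn += 1
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "accuracy": (tp + tn) / len(labels),
    }

# Illustration values only: 1 = cancer, -1 = healthy.
scores = [2.1, 0.8, 0.3, -0.2, 0.4, -0.5, -1.2, -1.9]
labels = [1, 1, 1, 1, -1, -1, -1, -1]
area = auc(scores, labels)
metrics = confusion_metrics(scores, labels)
```

Sweeping `threshold` over the observed scores and recording sensitivity against 1 - specificity traces out the ROC curve itself.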
(7) Verification of classification model predictive efficacy
In the independent validation cohort, the effectiveness of the model's predicted classification is verified using the classification model and the predicted values determined in the training cohort. The process is as follows:
1) Validation variables. In the independent validation cohort, the normalized z-scores of the short-fragment count and of the total fragment count in the 504 windows of the whole genome are used as variables.
2) Validating model efficacy. The ROC curve of the test set is plotted from the molecular marker value and the pathological result of each sample in the test set. Based on the predicted values, the independent validation cohort is divided into a healthy group and a cancer group (as for the training set and the test set), and the predictive efficacy of the model, including specificity, sensitivity and accuracy, is evaluated; higher values indicate better performance.
Example 2
(1) Study cohort and clinical information
The study enrolled, in two study cohorts, 154 patients with biliopancreatic malignancies confirmed by tumor markers, imaging examination (such as ultrasound and abdominal CT) and pathological results, and 71 healthy persons. Blood samples from the patients were collected before surgery, and blood samples from the healthy controls were also collected. The training and test sets together comprised 60 patients (29 pancreatic cancer cases, 15 gallbladder cancer cases and 16 bile duct cancer cases) and 31 healthy persons (Table 1). The samples were divided into a training set and a test set at a ratio of 4:1, keeping the proportions of healthy controls and of each cancer type consistent between the two sets. Table 1 shows the grouping information of healthy controls and patients in the training set and the test set. The analysis shows no significant difference between the training set and the test set in sex ratio or in the proportions of healthy controls and cancer patients.
Table 1: training set and test set information
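The 4:1 split with consistent group proportions described above can be sketched as a stratified split. This is an illustrative sketch only: the function name `stratified_split`, the random seed and the rounding rule are assumptions, so the per-group counts it produces need not match those in Table 1.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), splitting each label group separately."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_label):
        members = by_label[label][:]
        rng.shuffle(members)
        # Assumption: the test share of each group is rounded to the nearest integer.
        n_test = round(len(members) * test_fraction)
        test_idx.extend(members[:n_test])
        train_idx.extend(members[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Group sizes taken from the text: 60 patients and 31 healthy individuals.
labels = (["pancreatic"] * 29 + ["gallbladder"] * 15
          + ["bile_duct"] * 16 + ["healthy"] * 31)
train_idx, test_idx = stratified_split(labels)
```

Splitting each group on its own, rather than shuffling all 91 samples together, is what keeps the proportion of healthy controls and of each cancer type roughly equal in both sets.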
The independent validation cohort consisted of 94 patients with biliary-pancreatic tumors (37 pancreatic cancers, 17 gallbladder cancers and 40 bile duct cancers) and 40 healthy individuals. Table 2 shows the participant information for the training set and the independent validation cohort; analysis showed no significant difference between them in sex ratio or in the ratio of healthy controls to cancer patients.
Table 2: Training set and independent validation cohort information
(2) Health and cancer classification scoring model
A scoring model distinguishing cancer patients from healthy individuals was constructed from the training set and the pathological detection results using a linear SVC algorithm. The model consists of three parts: the variables, the model formula and the predicted value. The process is as follows:
(1) Model variables and parameters. In the training cohort, the model coefficients were obtained by 5-fold cross-validation repeated 30 times, using as input 1008 feature variables: the standardized z-scores of the short-fragment count and total fragment count in each of 504 regions of 5 Mb length (Table 3).
Table 3: model input variable examples
Sequence number | Model variable | Number of fragments
1 | Bin1 short fragments | N1
2 | Bin1 all fragments | N2
…… | …… | ……
1007 | Bin504 short fragments | N1007
1008 | Bin504 all fragments | N1008
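The 30-times-repeated 5-fold cross-validation named in item (1) can be sketched as an index generator. This is a minimal sketch under stated assumptions: the function name `repeated_kfold`, the seed and the sample count of 73 are placeholders, the folds here are not stratified, and the text does not say how the fitted coefficients from the individual folds are combined.

```python
import random


def repeated_kfold(n_samples, n_splits=5, n_repeats=30, seed=0):
    """Yield (train_idx, val_idx) for each fold of each repeat."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        order = list(range(n_samples))
        rng.shuffle(order)               # a fresh shuffle for every repeat
        for k in range(n_splits):
            val_idx = order[k::n_splits]  # every n_splits-th sample forms fold k
            held_out = set(val_idx)
            train_idx = [i for i in order if i not in held_out]
            yield train_idx, val_idx


# Placeholder training-set size; the real value comes from Table 1.
splits = list(repeated_kfold(n_samples=73))
```

Each repeat partitions the samples into five disjoint validation folds, so 30 repeats give 150 train/validation pairs on which a linear SVC could be fitted and scored.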
(2) Scoring model. The scoring model formula is as follows:
where x_i is the input variable; the model parameters w and b are calculated as follows:
where λ is the penalty parameter, n is the number of samples, and y_i is the true label of sample i (1 for cancer, -1 for healthy).
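For reference, the textbook soft-margin linear SVC that uses exactly the symbols defined here (w, b, λ, n, x_i, y_i) can be written as below. This is the standard formulation, given as an assumption; it is not confirmed to be the exact expression used in this application.

```latex
% Decision (scoring) function for an input vector x
f(x) = w^{\top} x + b

% Parameters chosen by minimizing the regularized hinge loss
(w, b) = \arg\min_{w,\, b} \;
  \lambda \lVert w \rVert^{2}
  + \frac{1}{n} \sum_{i=1}^{n} \max\bigl(0,\; 1 - y_i \,(w^{\top} x_i + b)\bigr)
```

In this form a sample with f(x) > 0 falls on the cancer side (y = 1) and one with f(x) < 0 on the healthy side (y = -1), and a larger λ penalizes large coefficients more strongly.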
Applying the classification model to each sample's genome-wide fragment distribution across the different regions yields a class prediction for that sample.
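How a linear model turns a feature vector into a class prediction can be sketched as follows. The function names and the default threshold of 0 are assumptions for illustration; the text only states that a threshold is set on the predicted value.

```python
def decision_value(w, b, x):
    """Linear score f(x) = w . x + b."""
    if len(w) != len(x):
        raise ValueError("weight and feature vectors differ in length")
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w, b, x, threshold=0.0):
    """Return 1 (cancer) when the score exceeds the threshold, else -1 (healthy)."""
    return 1 if decision_value(w, b, x) > threshold else -1
```

In the setting described here, `w` and `x` would each hold 1008 values, one per model variable.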
(3) Model efficacy evaluation.
To construct an early screening classification model that distinguishes cancer patients from healthy individuals, the predicted value was used to divide the training set samples into healthy individuals and cancer patients. Taking the pathological detection result as the true value, an ROC curve of the training set was drawn from the predicted values in the cohort; the AUC of the training set reached 1 (Fig. 3). The PPV (accuracy), specificity and sensitivity of the trained model were 100%, 100% and 100%, respectively (Table 4). The results show that, in the training set, the risk prediction model has high sensitivity and NPV and performs well in predicting the early diagnosis of cancer.
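The AUC reported above can be illustrated with its rank-based definition: the probability that a randomly chosen cancer sample receives a higher predicted value than a randomly chosen healthy sample. The sketch below is a generic implementation, not the tool used in the study; the function name `roc_auc` is hypothetical.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (cancer, healthy) pairs ranked correctly.

    labels use 1 for cancer and -1 for healthy; tied scores count as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == -1]
    if not pos or not neg:
        raise ValueError("both classes are required to compute an AUC")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

Under this definition an AUC of 1 means every cancer sample scored above every healthy sample, which is consistent with the 100% sensitivity and specificity reported for the training set.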
(4) Verification of the predictive efficacy of the discriminant model.
To verify the efficacy of the discriminant model, the test set subjects were divided into a healthy group and a cancer patient group using a threshold set on the predicted value (the same as for the training set). Taking the pathological detection result as the true value, the efficacy of the model was verified using the classification model and the predicted value determined in the training cohort, and an ROC curve of the test set was drawn; the AUC reached 1 (Fig. 4). Model predictive efficacy was also assessed in terms of accuracy, specificity and sensitivity; the values reported are 100% and 92% (Table 4). The results show that, in the test set, the risk prediction model also has high specificity, sensitivity and accuracy; that is, the model predicts well.
Table 4
(3) Independent validation cohort validation
To further verify the efficacy of the discriminant model, the independent validation cohort was divided into a healthy group and a cancer patient group using a threshold set on the predicted value. Taking the pathological detection result as the true value, the efficacy of the model was verified using the classification model and the predicted value determined in the training set and test set, and an ROC curve was drawn for the independent validation cohort; its AUC reached 0.94 (Fig. 5). Model predictive efficacy was also evaluated: PPV (accuracy), specificity and sensitivity were 90.1%, 77.5% and 87.2%, respectively (Table 5). The results show that, in the independent validation cohort, the risk prediction model has high specificity, sensitivity and accuracy; that is, the model predicts well.
Table 5: Model efficacy in the independent validation cohort
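The reported percentages can be checked against the cohort sizes (94 patients, 40 healthy individuals). The counts below are inferred from those percentages and are not stated in the text: 82 of 94 patients and 31 of 40 healthy individuals classified correctly. The function name `screening_metrics` is hypothetical.

```python
def screening_metrics(tp, fn, tn, fp):
    """Standard screening metrics from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),               # cancers correctly flagged
        "specificity": tn / (tn + fp),               # healthy correctly cleared
        "ppv": tp / (tp + fp),                       # flagged samples that are cancer
        "accuracy": (tp + tn) / (tp + fn + tn + fp), # all correct calls
    }


# Inferred counts (an assumption): 82/94 patients and 31/40 healthy correct.
metrics = screening_metrics(tp=82, fn=12, tn=31, fp=9)
percentages = {name: round(100 * value, 1) for name, value in metrics.items()}
```

With these counts the sensitivity is 87.2%, the specificity 77.5% and the PPV 90.1%, while the overall proportion of correct calls is 84.3%. The 90.1% figure therefore corresponds to the positive predictive value, in line with the "PPV (accuracy)" wording used for the training set.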
The foregoing is only a part of the preferred embodiments of the present invention, and the present invention is not limited to the contents of the embodiments. It will be apparent to those skilled in the art that various changes and modifications can be made within the scope of the technical solution of the present invention, and any changes and modifications are within the scope of the present invention.

Claims (9)

1. A non-invasive early screening method for cancer based on cfDNA fragment length distribution characteristics, comprising the following steps: determining, by low-depth whole-genome sequencing, the differences in fragment length distribution characteristics between tumor-derived cfDNA and cfDNA from healthy individuals; establishing an early cancer screening model; and thereby achieving noninvasive early screening for cancer.
2. The non-invasive early screening method for cancer based on cfDNA fragment length distribution characteristics according to claim 1, wherein: the standardized z-scores of the short-fragment count and the total fragment count of cfDNA within each of 504 regions of 5 Mb length across the whole genome are calculated as feature input values for model training.
3. The non-invasive early screening method for cancer based on cfDNA fragment length distribution characteristics according to claim 1, wherein: the cfDNA fragments comprise short fragments and long fragments, the short fragments ranging from 130 bp to 177 bp in length and the long fragments ranging from 177 bp to 237 bp in length.
4. The non-invasive early screening method for cancer based on cfDNA fragment length distribution characteristics according to claim 1, wherein: a linear SVC algorithm is adopted, and model coefficients are obtained by 5-fold cross-validation repeated 30 times to establish the early cancer screening model.
5. A non-invasive early screening method for cancer based on cfDNA fragment length distribution characteristics according to any of claims 1-4, characterized in that: the cancer is a biliary-pancreatic malignancy.
6. A non-invasive early screening method for cancer based on cfDNA fragment length distribution characteristics according to any of claims 1-4, characterized in that: the biliary-pancreatic malignancy includes pancreatic cancer, gallbladder cancer and cholangiocarcinoma.
7. A cancer noninvasive early screening system based on cfDNA fragment length distribution characteristics, the system comprising:
a cfDNA fragment feature extraction module, configured to obtain cfDNA fragment size feature data from a sample;
a machine learning classification model building module, configured to establish an early cancer screening model based on statistics of the differences in cfDNA fragment size features between tumor-derived cfDNA and cfDNA from healthy individuals;
and an independent validation cohort evaluation module, configured to verify the predictive efficacy of the established machine learning classification model using the independent validation cohort.
8. The non-invasive early screening system for cancer based on cfDNA fragment length distribution characteristics of claim 7, wherein the cfDNA fragment feature extraction module comprises:
a sequencing data alignment unit, configured to align the sequencing data to the human reference genome hg19 after removing sequencing adapters;
a cfDNA fragment counting unit, configured to compile cfDNA fragment length information: dividing the hg19 autosomes into 504 contiguous, non-overlapping windows, each 5 Mb in length; counting, in each window, the ratio of the number of cfDNA fragments longer than 130 bp and shorter than 177 bp to the number of cfDNA fragments longer than 177 bp and shorter than 237 bp; and finally obtaining the numbers of long and short cfDNA fragments in each 5 Mb interval;
a cfDNA fragment feature determination unit, configured to determine, from the difference in fragment distribution between cancer patients and healthy controls, the interval in which that difference is largest; to define the short-fragment range [130, 177] and the long-fragment range [177, 237]; and then to calculate the standardized z-scores of the short-fragment cfDNA count and the total fragment count in each of the 504 windows as feature input values for model training.
9. The cfDNA fragment length distribution feature-based cancer noninvasive early screening system of claim 7, wherein the machine learning classification model building module comprises:
a sample data classification unit, configured to divide the samples into a training set and a test set at a ratio of 4:1, keeping the proportions of healthy control samples and of each cancer type consistent between the two sets;
a model parameter acquisition unit, configured to process the sample data in the training set and, in the training cohort, to obtain the model parameters by 5-fold cross-validation repeated 30 times;
and a model efficacy evaluation unit, configured to draw a receiver operating characteristic (ROC) curve of the training cohort from the model predicted value and the pathological detection result of each sample in the training cohort.
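The per-window counting performed by the cfDNA fragment counting unit of claim 8 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the claimed implementation: fragments are given as hypothetical (chromosome, start, end) tuples with half-open coordinates, each fragment is assigned to the window containing its start, and a length of exactly 177 bp (where the two stated ranges meet) is counted as long.

```python
WINDOW_SIZE = 5_000_000   # 5 Mb windows, as in the claims
SHORT_RANGE = (130, 177)  # short fragments, in bp
LONG_RANGE = (177, 237)   # long fragments, in bp


def count_fragments(fragments, window_size=WINDOW_SIZE):
    """Count short and long fragments per (chromosome, window index).

    fragments: iterable of (chrom, start, end), 0-based and half-open.
    Returns {(chrom, window): {"short": n, "long": m}}.
    """
    counts = {}
    for chrom, start, end in fragments:
        length = end - start
        if SHORT_RANGE[0] <= length < SHORT_RANGE[1]:
            kind = "short"
        elif LONG_RANGE[0] <= length <= LONG_RANGE[1]:
            kind = "long"   # assumption: exactly 177 bp is treated as long
        else:
            continue        # outside both ranges, not counted
        key = (chrom, start // window_size)  # assumption: window of the start
        bucket = counts.setdefault(key, {"short": 0, "long": 0})
        bucket[kind] += 1
    return counts
```

The window index here is a naive division of the start coordinate; how the 504 windows are actually selected on the hg19 autosomes is not specified beyond their being contiguous and 5 Mb long.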
CN202210704961.5A 2022-06-21 2022-06-21 Cancer noninvasive early screening method and system based on cfDNA fragment length distribution characteristics Pending CN117316278A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210704961.5A CN117316278A (en) 2022-06-21 2022-06-21 Cancer noninvasive early screening method and system based on cfDNA fragment length distribution characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210704961.5A CN117316278A (en) 2022-06-21 2022-06-21 Cancer noninvasive early screening method and system based on cfDNA fragment length distribution characteristics

Publications (1)

Publication Number Publication Date
CN117316278A true CN117316278A (en) 2023-12-29

Family

ID=89241284

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210704961.5A Pending CN117316278A (en) 2022-06-21 2022-06-21 Cancer noninvasive early screening method and system based on cfDNA fragment length distribution characteristics

Country Status (1)

Country Link
CN (1) CN117316278A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117935914A (en) * 2024-03-22 2024-04-26 北京求臻医学检验实验室有限公司 Unknown-meaning clonal hematopoietic recognition and application method thereof

Similar Documents

Publication Publication Date Title
US20220186320A1 (en) MicroRNA Marker Combination for Diagnosing Gastric Cancer and Diagnostic Kit
CN107727865A (en) The systemic detection method of tumor markers and its application
CN111218513B (en) Peripheral blood extracellular vesicle microRNA biomarker for early diagnosis of lung cancer and application thereof
CN107034301A (en) A kind of detection Lung neoplasm is benign or pernicious kit and its application
WO2022161076A1 (en) Methylation markers for detection of benign/malignant pulmonary nodules or combination thereof, and application thereof
CN111833963A (en) cfDNA classification method, device and application
CN112553344B (en) Biomarker related to colorectal cancer and application thereof
CN109112216A (en) The kit and method of triple qPCR detection DNA methylations
CN108588230A (en) A kind of marker and its screening technique for breast cancer diagnosis
CN112609015A (en) Microbial marker for predicting colorectal cancer risk and application thereof
KR20170067137A (en) METHOD FOR DISCOVERING miRNA BIOMARKER FOR CANCER DIAGNOSIS AND USE THEREOF
CN110570951A (en) Method for constructing classification model of new auxiliary chemotherapy curative effect of breast cancer
CN117316278A (en) Cancer noninvasive early screening method and system based on cfDNA fragment length distribution characteristics
CN117757928A (en) Plasma exosome RNA biomarker group for early diagnosis of chronic pancreatitis and application thereof
CN112951325A (en) Design method and application of probe combination for cancer detection
CN111690746A (en) Platelet RNA marker related to lung cancer and application thereof
CN110408706A (en) It is a kind of assess recurrent nasopharyngeal carcinoma biomarker and its application
CN114875155A (en) Gene mutation and application thereof in diagnosis of pancreatic and biliary tract cancer
CN115803448A (en) Micronucleus DNA from peripheral red blood cells and uses thereof
CN110628907B (en) Gallbladder cancer plasma exosome microRNAs markers and application thereof
CN112852969A (en) Epigenetically modified lncRNA as tumor diagnosis or tumor progression prediction marker
WO2019095541A1 (en) Composition and method for diagnosing and predicting breast cancer bone metastases
CN116287252B (en) Application of long-chain non-coding RNA APCDD1L-DT in preparation of pancreatic cancer detection products
CN115747333B (en) Tumor marker detection kit, detection analysis system and application thereof
CN115820857B (en) Kit for identifying gastric precancerous lesions and gastric cancer and diagnosing gastric cancer

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination