CN117316278A

CN117316278A - Cancer noninvasive early screening method and system based on cfDNA fragment length distribution characteristics

Info

Publication number: CN117316278A
Application number: CN202210704961.5A
Authority: CN
Inventors: 张大东; 姜国娟; 段侨南; 许晓雅; 陈升; 张玮; 陈灏; 李志宽; 年宝宁
Original assignee: Shanghai 3D Medicines Co Ltd
Current assignee: Shanghai 3D Medicines Co Ltd
Priority date: 2022-06-21
Filing date: 2022-06-21
Publication date: 2023-12-29

Abstract

The invention discloses a non-invasive early cancer screening method and system based on cfDNA fragment length distribution characteristics. The method uses low-depth whole-genome sequencing to quantify the differences in fragment length distribution between tumor-derived cfDNA and cfDNA from healthy individuals, builds an early cancer screening model from those differences, and thereby achieves non-invasive early screening of cancer. First, the solution relies on blood cfDNA fragment size to distinguish ctDNA from non-tumor-derived cfDNA; it does not depend on mutation detection in oncogenes or tumor suppressor genes, which removes the interference caused by clonal hematopoiesis mutations. Second, the data in the embodiments show that blood cfDNA fragment size characteristics can distinguish healthy individuals from early-stage tumor patients. Finally, because low-depth whole-genome sequencing is used, the detection cost is substantially reduced, which favors future application in early screening of malignant tumors.

Description

Non-invasive early cancer screening method and system based on cfDNA fragment length distribution characteristics

Technical field

The invention belongs to the field of medical detection technology, and specifically relates to a non-invasive early cancer screening method and system based on cfDNA fragment length distribution characteristics.

Background art

Worldwide, most deaths from human cancer result from late diagnosis, which leaves therapeutic intervention with little effect, so early diagnosis of tumors is particularly important. Traditional biomarkers and imaging techniques play an important role in tumor diagnosis; however, the specificity of traditional serum biomarkers is unsatisfactory for guiding treatment, and imaging cannot be used for "real-time" detection because of radiation exposure and cost. From the FDA's approval in 2016 of the first plasma cfDNA (cell-free DNA) liquid biopsy product, based on EGFR gene mutations, to the demonstration that bTMB (blood tumor mutation burden) can predict the efficacy of immunotherapy, liquid biopsy has become prominent in cancer treatment. It is increasingly considered for early tumor diagnosis, treatment guidance and recurrence monitoring, and it can provide information about the tumor. Liquid biopsy also offers a non-invasive alternative to traditional "solid biopsy", which cannot always be performed in certain circumstances or in "real time". Despite these advantages, limitations remain, such as a lack of consensus on detection methods, the difficulty of analyzing large amounts of sequencing data, and insufficient evidence from evidence-based medicine.

At present, the conventional approach in liquid biopsy and early cancer screening research is to identify tumor-released cfDNA by detecting mutations in oncogenes or tumor suppressor genes. Unfortunately, circulating tumor DNA (ctDNA) is usually far less abundant in blood than non-cancer-related DNA fragments, which makes it very difficult to detect, especially in the early stages of cancer. Previous studies found that accurate tumor information can be obtained only when ctDNA accounts for 10% or more of cfDNA. Apart from some advanced tumors that release large amounts of ctDNA, most tumor patients do not reach this level. The sensitivity and accuracy of ctDNA detection are currently improved mainly by increasing sequencing depth, but deeper sequencing may produce false positives, because non-tumor-derived DNA can also carry various tumor-associated mutations. High-depth sequencing is also extremely expensive and cannot be applied widely in the clinic. Moreover, mutation studies have focused on late-stage patients whose cancer has metastasized. In Razavi, P., Li, B.T., Brown, D.N. et al. High-intensity sequencing reveals the sources of plasma circulating cell-free DNA variants. Nat Med 25, 1928–1937 (2019), Razavi et al. performed ultra-deep sequencing (greater than 60,000×) of 508 specific genes on peripheral blood from 124 patients with metastatic cancer and 47 healthy individuals. They found a large number of cfDNA mutations, of which 81.6% in healthy individuals and 53.2% in cancer patients were caused by clonal hematopoiesis of white blood cells rather than by tumor cells. Identifying cancer patients through mutations in specific target genes is therefore prone to a high misclassification rate. In addition, because cancer patients are highly heterogeneous, restricting detection to particular target genes may miss some patients. Panel-based mutation detection in cfDNA thus has inherent limitations, and these problems have long restricted the application of liquid biopsy.

Another finding may solve this problem. Earlier studies showed that cfDNA released into the blood by different cells differs in length. For example, many cfDNA fragments are 167 bp long (similar to the length of DNA in one nucleosome), which may be related to caspase-dependent DNA cleavage during apoptosis. Fetal cfDNA fragments are markedly shorter than maternal cfDNA fragments, a property used in prenatal diagnosis. These studies indicate that cfDNA from different cell sources shows distinct length patterns, which may serve as a signal for distinguishing cfDNA origin. For various reasons, the packaging of the cancer genome becomes disorganized, so when cancer cells die they release DNA into the blood in a disordered manner; differences in cfDNA fragment length distribution are therefore likely to exist between circulating tumor DNA and non-tumor DNA. Published studies report that cfDNA fragment length, used as a feature for distinguishing tumor from non-tumor cells, can compensate for the low sensitivity and false positives of cfDNA mutation detection. However, the few earlier studies of this hypothesis reached conflicting results, which has stalled research in this area, so scientific and systematic approaches to studying cfDNA fragmentation urgently need to be improved and optimized.

At present, there are a small number of reports on the use of cfDNA fragment size in liquid biopsy and early cancer screening research. The two studies most similar to the present invention are PMID: 30404863 and PMID: 31142840:

In Mouliere, F., et al., Enhanced detection of circulating tumor DNA by fragment size analysis. Sci Transl Med, 2018. 10(466), Mouliere et al. examined cfDNA fragment length characteristics by low-depth (0.4×) whole-genome sequencing of cfDNA extracted from 344 plasma samples collected from 200 cancer patients (covering 18 cancer types) and from plasma samples of 65 healthy individuals. The analysis found that ctDNA fragments carrying cancer mutations are generally 20-40 bp shorter than nucleosomal DNA fragments (167 bp) and are enriched in the 90-150 bp range. The researchers then increased ctDNA abundance by selectively sequencing enriched short fragments, and used the proportions of fragments in different length intervals as features for machine learning algorithms that distinguish tumor blood samples from healthy blood samples. However, fragments of other lengths also carry useful information such as tumor mutations, so selecting only 90-150 bp fragments loses some information. In addition, that study trained its model on the overall proportions of the fragment distribution combined with mutation features, without systematically considering fragment size proportions at specific functional sites across the genome. There is therefore still considerable room to improve the use of cfDNA fragment characteristics as an indicator for early cancer diagnosis.

In Cristiano, S., et al., Genome-wide cell-free DNA fragmentation in patients with cancer. Nature, 2019. 570(7761): p. 385-389, Cristiano et al. used whole-genome sequencing to examine blood samples from 208 patients with breast, colorectal, lung, ovarian, pancreatic, gastric or bile duct cancer at various stages and from 215 healthy individuals; circulating tumor DNA was present in 57% to more than 99% of the patients. The method detects cfDNA by low-coverage whole-genome sequencing. The numbers of short and long cfDNA fragments mapped to different regions of the genome are analyzed in non-overlapping windows covering the genome to build a model for predicting early-stage cancer. However, most patients enrolled in the above studies were late-stage patients with a limited set of cancer types, and the fragment size statistics applied a "one-size-fits-all" standard across cancer types (short fragments defined as 100 to 150 bp, long fragments as 151 to 220 bp). Deeper, large-sample statistics and research on optimizing cancer-type-specific fragment parameters for a single cancer type are lacking.

Accordingly, it is highly desirable for those skilled in the art to design a non-invasive early cancer screening method that can predict early-stage cancer patients using only low-depth whole-genome sequencing, substantially reducing the cost of early cancer screening while improving screening accuracy.

Summary of the invention

The purpose of the present invention is to provide a non-invasive early cancer screening method based on cfDNA fragment length distribution characteristics. It mainly addresses the low specificity and high false-positive rate of early screening for biliopancreatic malignant tumors in the prior art.

The technical solution adopted by the present invention to solve the above technical problem is as follows:

A non-invasive early cancer screening method based on cfDNA fragment length distribution characteristics, in which low-depth whole-genome sequencing is used to quantify the differences in fragment length distribution between tumor-derived cfDNA and cfDNA from healthy individuals, and an early cancer screening model is built from those differences to achieve non-invasive early screening of cancer. Low-depth whole-genome sequencing here means a sequencing depth of 2× to 4×.

In a preferred embodiment, standardized z-scores of the number of short fragments and of the total number of fragments are calculated for each of 504 regions of 5 Mb across the whole genome and used as the feature input values for model training. The total number of fragments is the sum of the numbers of long fragments and short fragments as defined here.

The cfDNA fragments comprise short fragments and long fragments; the short cfDNA fragment length range is [130,177] bp and the long fragment length range is [177,237] bp.
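As an illustration of the short and long fragment definitions, a minimal Python sketch follows. The two stated ranges share the endpoint 177 bp, and the text does not say which class a 177 bp fragment belongs to; the sketch assumes it is counted as long. The function names are hypothetical and are not part of the disclosed method.

```python
SHORT_RANGE = (130, 177)  # bp, short-fragment range stated in the text
LONG_RANGE = (177, 237)   # bp, long-fragment range stated in the text


def classify_fragment(length_bp):
    """Return 'short', 'long', or None for a fragment length in bp.

    Assumption: a fragment of exactly 177 bp is treated as long,
    because the text lists 177 in both ranges without resolving it.
    """
    if SHORT_RANGE[0] <= length_bp < SHORT_RANGE[1]:
        return "short"
    if LONG_RANGE[0] <= length_bp <= LONG_RANGE[1]:
        return "long"
    return None


def count_short_and_total(lengths):
    """Return (short count, short + long count) for a list of lengths."""
    short = sum(1 for n in lengths if classify_fragment(n) == "short")
    long_ = sum(1 for n in lengths if classify_fragment(n) == "long")
    return short, short + long_
```

Fragments outside both ranges contribute to neither count, which matches the statement that the total is the sum of the long and short fragment numbers.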

In a preferred embodiment, the LinearSVC algorithm is used with 5-fold cross-validation repeated 30 times to obtain the model coefficients and build the early cancer screening model.
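The resampling scheme named here can be sketched in plain Python as follows. This is only an assumption-laden illustration of how 30 repeats of 5-fold cross-validation produce 150 train/validation partitions; it keeps class proportions roughly equal across folds (stratification is an assumption, since the text does not say so for the cross-validation step) and it does not fit the LinearSVC classifier itself. How the coefficients from the individual fits are combined is not specified in the text.

```python
import random


def repeated_stratified_kfold(labels, n_splits=5, n_repeats=30, seed=0):
    """Yield (train_indices, validation_indices) for repeated k-fold CV.

    Each repeat reshuffles the samples within every class and deals
    them round-robin into n_splits folds, so every fold keeps roughly
    the same class proportions as the full set.
    """
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    for _ in range(n_repeats):
        folds = [[] for _ in range(n_splits)]
        for members in by_class.values():
            shuffled = members[:]
            rng.shuffle(shuffled)
            for position, index in enumerate(shuffled):
                folds[position % n_splits].append(index)
        for k in range(n_splits):
            validation = sorted(folds[k])
            train = sorted(
                index
                for j, fold in enumerate(folds)
                if j != k
                for index in fold
            )
            yield train, validation
```

With the default arguments the generator yields 30 × 5 = 150 partitions, one classifier fit per partition.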

In a preferred embodiment, the cancer is a biliopancreatic malignant tumor. Biliopancreatic malignant tumors include pancreatic cancer, gallbladder cancer and bile duct cancer.

The present invention also provides a non-invasive early cancer screening system based on cfDNA fragment length distribution characteristics. The system comprises:

a cfDNA fragment feature extraction module, used to obtain cfDNA fragment size feature data from a sample;

a machine learning classification model building module, used to build an early cancer screening model from the differences in fragment size characteristics between tumor-derived cfDNA and cfDNA from healthy individuals;

an independent validation cohort evaluation module, used to validate the predictive performance of the established machine learning classification model on an independent validation cohort.

In a preferred embodiment, the cfDNA fragment feature extraction module comprises:

a sequencing data alignment unit, used to align the sequencing data to the human reference genome hg19 after sequencing adapters have been removed;

a cfDNA fragment statistics unit, used to compile cfDNA fragment length data: the hg19 autosomes are divided into 504 adjacent, non-overlapping windows of 5 Mb each; within each window, the ratio of the number of cfDNA fragments longer than 130 bp and shorter than 177 bp to the number longer than 177 bp and shorter than 237 bp is calculated; finally, the numbers of long and short cfDNA fragments in each 5 Mb window are obtained;

a cfDNA fragment feature determination unit, used to determine, from the distribution of differences in fragment length distribution between cancer patients and healthy controls, the intervals in which the two groups differ most; the short fragment range is defined as [130,177] and the long fragment range as [177,237], and the standardized z-scores of the short-fragment count and of the total fragment count are then calculated for each of the 504 windows as the feature input values for model training.
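The windowing and standardization performed by these units might be sketched as follows. The text does not state whether the z-score is taken across the windows of one sample or across samples for one window; this sketch assumes the former, uses the population standard deviation, and uses hypothetical function names throughout.

```python
from statistics import mean, pstdev

WINDOW_BP = 5_000_000  # 5 Mb window length stated in the text


def window_index(position_bp):
    """0-based index of the non-overlapping 5 Mb window on a chromosome."""
    return position_bp // WINDOW_BP


def zscores(values):
    """Standardize a list of per-window counts to mean 0 and unit spread.

    Assumption: standardization is done within one sample, across windows.
    """
    mu = mean(values)
    sd = pstdev(values)
    if sd == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sd for v in values]


def sample_features(short_counts, total_counts):
    """Concatenate z-scored short counts and z-scored total counts."""
    return zscores(short_counts) + zscores(total_counts)
```

Under this reading, a sample with 504 windows yields a feature vector of 2 × 504 = 1008 values: one short-fragment z-score and one total-fragment z-score per window.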

In a preferred embodiment, the machine learning classification model building module comprises:

a sample data partitioning unit, used to divide the samples into a training set and a test set at a ratio of 4:1 while keeping the proportions of healthy controls and of each cancer type the same in both sets;

a model parameter acquisition unit, used to process the sample data in the training set; in the training cohort, the model parameters are obtained by 5-fold cross-validation repeated 30 times;

a model performance evaluation unit, used to plot the receiver operating characteristic curve of the training cohort from the model prediction value and pathological test result of each sample in the training cohort.
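The area under the receiver operating characteristic curve that summarizes such a plot can be computed directly from prediction values and pathology labels. The sketch below uses the rank-based definition (the probability that a randomly chosen cancer sample scores higher than a randomly chosen healthy sample, with ties counted as one half); the function name and the 1/0 label encoding are assumptions made for illustration.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve from prediction values and 1/0 labels.

    Computed as the fraction of (positive, negative) pairs in which the
    positive sample has the higher score; ties contribute 0.5.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

The pairwise loop is quadratic in the number of samples, which is acceptable for cohorts of the size described here.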

Compared with the prior art, the beneficial effects of the present invention are as follows:

First, the solution relies on blood cfDNA fragment size to distinguish ctDNA from non-tumor-derived cfDNA and does not depend on mutation detection in oncogenes or tumor suppressor genes, which removes the interference caused by clonal hematopoiesis mutations. Second, the data in the embodiments show that blood cfDNA fragment size characteristics can distinguish healthy individuals from early-stage tumor patients, and that the low ctDNA content of early-stage patients did not prevent the tumor ctDNA signal from being detected. Finally, because low-depth whole-genome sequencing is used, the detection cost is substantially lower than that of other full-depth or ultra-deep NGS methods. These advantages favor future application of the solution in early screening of malignant tumors.

The present study used plasma samples from 60 clinically identified biliopancreatic tumor patients and 31 healthy controls for low-depth (2× to 4×) whole-genome cfDNA sequencing. An analysis framework was built that takes into account how the positional distribution of cfDNA fragment sizes across the genome helps distinguish tumor-cell DNA from non-tumor-cell DNA. Through systematic statistical analysis of the size and number of DNA fragments covering different regions of the genome in blood, the study trained and tested, in the research cohort, a biomarker diagnostic model showing that cfDNA fragment size characteristics can be used for early screening of biliopancreatic malignant tumors. Furthermore, an independent validation cohort of 94 biliopancreatic tumor patients and 40 healthy individuals was enrolled, and the performance of the diagnostic model based on the length distribution of circulating cell-free DNA fragments was successfully validated. By analyzing cfDNA length in blood more precisely to find clues for early tumor screening, the method provides more solid and reliable data to support precise clinical application.

Brief description of the drawings

Figure 1 is the genome-wide cfDNA fragment length distribution profile in Example 1 of the present invention.

Figure 2 is the distribution of differences in cfDNA fragment length between cancer patients and healthy individuals in Example 1 of the present invention.

Figure 3 is the ROC curve of the training set in Example 2 of the present invention.

Figure 4 is the ROC curve of the test set in Example 2 of the present invention.

Figure 5 is the ROC curve of the independent validation cohort in Example 2 of the present invention.

Detailed description

The technical solution of the present invention is described in detail below with reference to examples. Unless otherwise specified, the reagents and biological materials used below are commercial products.

Example 1

(1) Study cohort and clinical information

This study enrolled 154 patients diagnosed with biliopancreatic tumors (pancreatic cancer, gallbladder cancer and bile duct cancer) on the basis of tumor markers, imaging (such as ultrasound and abdominal CT) and pathological examination, together with 71 healthy controls. Blood samples were collected from patients before surgery and from the healthy controls. Each enrolled patient received a definitive diagnosis after surgery based on the pathological findings.

(2) Blood collection, separation and storage

Whole blood from cancer patients before surgery and from healthy controls was collected in 10 ml cell-free nucleic acid preservation tubes (REF43803, BD, USA) and transported at room temperature. Plasma was separated from the received whole blood by two-step centrifugation. First, plasma and cellular components were separated by centrifugation at 1,600 g and 4°C for 10 minutes; the supernatant was carefully aspirated without disturbing the white blood cell layer, and the hemolysis grade of the plasma was recorded. Samples with a hemolysis grade ≥5 were excluded from subsequent study. Second, the plasma was centrifuged again at 16,000 g and 4°C for 15 minutes to remove any remaining cells or cell debris. The supernatant was transferred to centrifuge tubes in 1 ml aliquots, and the separated plasma samples were stored in a -80°C freezer.

(3) cfDNA extraction

Plasma samples were taken from the -80°C freezer, placed in a water bath and incubated without agitation at 37°C for about 5 minutes. The plasma was then centrifuged in a refrigerated centrifuge at 1,600 g and 4°C for 10 minutes, and the supernatant was carefully transferred to a centrifuge tube. cfDNA was extracted from 1 ml of plasma with the QIAamp Circulating Nucleic Acid Kit (55114, Qiagen, Shanghai, China) according to the product instructions and eluted in 30 μl of EB. The total amount of extracted cfDNA was quantified with a Qubit fluorometer and the corresponding reagents (Q32854, Thermo Fisher, USA). The cfDNA fragment distribution was assessed on an Agilent 2100 Bioanalyzer with the corresponding Agilent High Sensitivity DNA Kit & Reagents (5067-4626, Agilent, USA).

(4) cfDNA library construction and WGS sequencing

Samples that passed cfDNA quality control were used for cfDNA library construction and WGS sequencing. Libraries were prepared with the KAPA DNA Hyper Prep kit (KK8504, KAPA, USA) following the product manual. The input for each cfDNA sample was 10 ng. Fragment ends were repaired and A-tailed, adapters were ligated and the ligation products purified, and the library was enriched by 7 cycles of PCR amplification. After purification, the DNA was eluted in 25 μl of elution buffer. Plasma cfDNA library concentration was measured by Qubit, and the library fragment distribution was measured on the 4150. Libraries that passed quality control were sequenced on the NovaSeq 6000 platform with a 2×150 bp strategy and a sequencing volume of about 10 Gb (about 3×).
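The relation between the stated sequencing volume and the stated depth can be checked with simple arithmetic. The sketch below assumes a human genome size of roughly 3.1 Gb, a figure that does not appear in the text, and uses hypothetical function names.

```python
def mean_depth(sequenced_bases, genome_size_bp=3.1e9):
    """Average depth = total sequenced bases / genome size.

    Assumption: a haploid human genome of about 3.1 Gb.
    """
    return sequenced_bases / genome_size_bp


def read_pairs_needed(target_depth, read_length_bp=150, genome_size_bp=3.1e9):
    """Read pairs required for a target depth with paired-end reads.

    Each pair contributes 2 * read_length_bp bases (2x150 bp strategy).
    """
    return target_depth * genome_size_bp / (2 * read_length_bp)


depth_from_10_gb = mean_depth(10e9)   # depth implied by about 10 Gb of data
pairs_for_3x = read_pairs_needed(3)   # read pairs for a nominal 3x depth
```

Under this assumption, about 10 Gb of data corresponds to a little over 3× average depth, consistent with the 2× to 4× range defined for low-depth sequencing, and a nominal 3× run needs on the order of 30 million read pairs.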

(5) cfDNA fragment size feature extraction

Sequence information for cfDNA in the plasma of patients and healthy controls was obtained by low-pass whole-genome sequencing (LP-WGS). The sequencing data were analyzed as follows:

1) Sequencing data alignment. After adapters were removed from the raw fastq data produced by LP-WGS (fastq), the reads were aligned to the human reference genome hg19 with BWA (version 0.7.12-r1039) (genome download link: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/human_g1k_v37.fasta.gz). Low-quality reads and duplicate reads were then removed from the resulting BAM files.

2) cfDNA fragment length statistics.

3) Genome-wide fragment size distribution profile. Low-coverage regions of the hg19 reference genome and Duke blacklisted regions were excluded. The hg19 autosomes were then divided into 504 adjacent, non-overlapping windows of 5 Mb each. Within each window, the ratio of the number of cfDNA fragments longer than 130 bp and shorter than 177 bp to the number longer than 177 bp and shorter than 237 bp was calculated, giving the numbers of long and short cfDNA fragments in each 5 Mb window. The ratios were then used to visualize the genome-wide cfDNA fragmentation size profile; see Figure 1, the genome-wide cfDNA fragment length distribution profile.
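The filtering and fragment length steps listed above could be sketched as follows, operating on alignment records in SAM text form. The flag bits are those defined in the SAM format; the mapping quality threshold of 30, the requirement of a proper pair, and the use of the template length field (TLEN) as the fragment length are assumptions for illustration, since the text only says that low-quality and duplicate sequences are removed.

```python
MIN_MAPQ = 30  # assumed threshold; the text only says "low-quality"

FLAG_PROPER_PAIR = 0x2
FLAG_UNMAPPED = 0x4
FLAG_SECONDARY = 0x100
FLAG_DUPLICATE = 0x400
FLAG_SUPPLEMENTARY = 0x800
EXCLUDE = FLAG_UNMAPPED | FLAG_SECONDARY | FLAG_DUPLICATE | FLAG_SUPPLEMENTARY


def fragment_from_sam_line(line):
    """Return (chromosome, 1-based start, length in bp) or None.

    Only the leftmost mate of a pair (TLEN > 0) is kept, so each
    fragment is counted once.
    """
    if line.startswith("@"):  # header line
        return None
    fields = line.rstrip("\n").split("\t")
    flag = int(fields[1])
    chrom = fields[2]
    pos = int(fields[3])
    mapq = int(fields[4])
    tlen = int(fields[8])
    if flag & EXCLUDE or not flag & FLAG_PROPER_PAIR:
        return None
    if mapq < MIN_MAPQ or tlen <= 0:
        return None
    return chrom, pos, tlen
```

The start coordinate returned for each fragment can then be assigned to a 5 Mb window, and the length can be classified as short or long, to obtain the per-window counts.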

(6) Building the machine learning classification model

1) Determining short and long fragment features. From the distribution of differences in fragment length distribution between cancer patients and healthy controls, the intervals in which the two groups differ most were determined; see Figure 2, the distribution of differences in cfDNA fragment length between cancer patients and healthy individuals, whose vertical axis is the difference in cfDNA fragment frequency between cancer patients and healthy individuals. The short fragment range was defined as [130,177] and the long fragment range as [177,237]. The standardized z-scores of the short-fragment count and of the total fragment count were then calculated for each of the 504 windows as the feature input values for model training.

2) Dividing the samples into a training set and a test set. All samples were divided into a training set and a test set at a ratio of 4:1, keeping the proportions of healthy controls and of each cancer type the same in both sets.

3) Processing the training set data. In the training cohort, the model coefficients were obtained by 5-fold cross-validation repeated 30 times.

4) Evaluating model performance. The receiver operating characteristic (ROC) curve of the training set was plotted from the model prediction value and pathological test result of each sample in the training set. Using the prediction value, a series of thresholds was set to divide the training set into healthy individuals and cancer patients, and the pathological test result was taken as the ground truth to evaluate the predictive performance of the model. The performance measures were the area under the ROC curve (AUC, range 0 to 1), positive predictive value (PPV, range 0 to 1), specificity (range 0 to 1), accuracy (range 0 to 1) and sensitivity (range 0 to 1); higher values indicate better performance.
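For one threshold from such a series, the threshold-dependent measures named in step 4) follow from the four cells of the confusion matrix. The sketch below is illustrative only: the function name, the 1/0 label encoding (1 for pathology-confirmed cancer) and the rule that a prediction value at or above the threshold counts as a positive call are assumptions.

```python
def screening_metrics(scores, labels, threshold):
    """Sensitivity, specificity, PPV and accuracy at one threshold.

    labels: 1 = cancer by pathology, 0 = healthy.
    A score >= threshold is treated as a positive (cancer) call.
    """
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1

    def ratio(numerator, denominator):
        # Undefined ratios (empty denominator) are reported as NaN.
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
    }
```

Sweeping the threshold over the observed prediction values and plotting sensitivity against 1 minus specificity traces the ROC curve.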

(7) Validation of the classification model's predictive performance

In the independent validation cohort, the classification performance of the model is validated using the classification model and prediction values determined in the training cohort. The process is as follows:

1) Confirming the variables. In the independent validation cohort, the variables are the standardized z-scores of the cfDNA short-fragment count and the total fragment count in the 504 genome-wide windows.

2) Validating model performance. The test-set ROC curve is drawn from the molecular marker expression level and the pathology result of each sample in the test set. Using the predicted values, the independent validation cohort is divided into a healthy group (as in the training and test sets) and a cancer group, and the model's predictive performance is evaluated in terms of specificity, sensitivity, and accuracy; higher values indicate better performance.

Example 2

(1) Study cohorts and clinical information

This study included two cohorts totalling 154 patients with biliopancreatic malignant tumors, confirmed by tumor markers, imaging (such as ultrasound and abdominal CT scans), and pathology, and 71 healthy individuals. Blood samples were collected from the patients before surgery, and blood samples were also collected from the healthy controls. The training and test sets together included 60 patients (29 with pancreatic cancer, 15 with gallbladder cancer, and 16 with cholangiocarcinoma) and 31 healthy individuals (Table 1). The samples were divided into a training set and a test set at a ratio of 4:1, keeping the proportions of healthy controls and of each cancer type consistent between the two sets. Table 1 shows the grouping of healthy controls and patients in the training and test sets. The analysis showed no significant difference between the training set and the test set in sex ratio or in the proportions of healthy controls and cancer patients.

Table 1: Training set and test set information

The independent validation cohort enrolled 94 patients with biliopancreatic tumors (37 with pancreatic cancer, 17 with gallbladder cancer, and 40 with cholangiocarcinoma) and 40 healthy individuals. Table 2 shows the participant information for the training set and the independent validation cohort. The analysis showed no significant difference between the training set and the independent validation cohort in sex ratio or in the proportions of healthy controls and cancer patients.

Table 2: Training set and independent validation cohort information

(2) Healthy versus cancer classification scoring model

Using the training set together with the pathology results, a scoring model for cancer patients and healthy individuals was built with the LinearSVC algorithm. The model consists of three parts: variables, model formula, and predicted values. The process is as follows:

① Model variables and parameters. In the training cohort, the model takes as feature variables the standardized z-scores of the short-fragment count and the total fragment count in each of the 504 regions of 5 Mb, giving 1008 feature variables in total (Table 3). 5-fold cross-validation repeated 30 times is used to obtain the model coefficients.

Table 3: Example of model input variables

Serial number    Model variable            Number of fragments
1                Bin1 short fragments      N1
2                Bin1 all fragments        N2
...              ...                       ...
1007             Bin504 short fragments    N1007
1008             Bin504 all fragments      N1008

② Scoring model. The scoring model formula is as follows:

where x_i is the input variable, and the model parameters w and b are calculated as follows:

where λ is the penalty parameter, n is the number of samples, and y_i is the true label of sample i, with 1 denoting cancer and -1 denoting healthy.
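
The two formulas referred to in item ② are not reproduced in this text. As a reading aid only, one standard soft-margin linear SVM formulation that uses the same symbols (x_i, w, b, λ, n, y_i) is sketched below. This is an assumption and not the patent's own formula: it treats x_i as the feature vector of sample i and uses the hinge loss, whereas scikit-learn's LinearSVC is parameterized by C rather than λ and uses the squared hinge loss by default.

```latex
% Assumed scoring function: x_i is the feature vector of sample i
f(x_i) = w^{\top} x_i + b

% Assumed training objective (soft-margin linear SVM with hinge loss)
\min_{w,\,b} \;\; \lambda \lVert w \rVert^{2}
  + \frac{1}{n} \sum_{i=1}^{n} \max\bigl(0,\; 1 - y_i \,(w^{\top} x_i + b)\bigr)
```

In this form a larger λ penalizes large weights more strongly, and a sample contributes to the loss only when y_i (w^T x_i + b) falls below 1.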

Using the classification model and the fragment distribution of each sample across the different genome-wide regions, the class prediction for each sample can be obtained.
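
For a linear model, obtaining the class prediction amounts to computing a weighted sum of the 1008 features plus a bias and comparing it with a threshold. The sketch below illustrates this; the function names and the default threshold of 0 are assumptions, since the text does not state the cut-off used.

```python
def decision_value(weights, bias, features):
    """Linear score w . x + b for one sample."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def predict_label(weights, bias, features, threshold=0.0):
    """Return 1 (cancer) or -1 (healthy); threshold=0 is an assumption."""
    return 1 if decision_value(weights, bias, features) >= threshold else -1
```

For example, with weights `[0.5, -1.0]` and bias `0.25`, the sample `[1.0, 0.5]` scores 0.25 and is labelled 1, while `[0.0, 1.0]` scores -0.75 and is labelled -1.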

③ Evaluation of model performance.

To build an early-screening classification model that distinguishes cancer patients from healthy individuals, the training-set samples were classified as healthy individuals or cancer patients according to their predicted values. With the pathology results as ground truth, the training-set ROC curve was drawn from the predicted values and pathology results of the cohort; the training-set AUC reached 1 (see Figure 3, the training-set ROC curve). The PPV (accuracy), specificity, and sensitivity of the trained model were 100%, 100%, and 100%, respectively (Table 4). These results indicate that, in the training set, this risk prediction model has high sensitivity and NPV and performs well in predicting the early diagnosis of cancer.

④ Validation of the discriminant model's predictive performance.

To validate the performance of the discriminant model, the test-set subjects were divided into a healthy group and a cancer group (as in the training set) using the threshold set on the predicted values. With the pathology results as ground truth, the model's performance was validated using the classification model and prediction values determined in the training cohort, and the test-set ROC curve was drawn; its AUC reached 1 (see Figure 4, the test-set ROC curve). The model's accuracy, specificity, and sensitivity were 100%, 100%, and 92%, respectively (Table 4). These results indicate that, in the test set, this risk prediction model likewise has high specificity, sensitivity, and accuracy, that is, good predictive performance.

Table 4

(3) Validation in the independent validation cohort

To further validate the performance of the discriminant model, the subjects in the independent validation cohort were divided into a healthy group and a cancer group using the threshold set on the predicted values. With the pathology results as ground truth, the model's performance was validated using the classification model and prediction values determined in the training and test sets, and the ROC curve of the independent validation cohort was drawn; its AUC was 0.94 (see Figure 5, the ROC curve of the independent validation cohort). The model's accuracy, specificity, and sensitivity were 90.1%, 77.5%, and 87.2%, respectively (Table 5). These results indicate that, in the independent validation cohort, this risk prediction model likewise has high specificity, sensitivity, and accuracy, that is, good predictive performance.
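
The reported percentages can be cross-checked against the cohort sizes given earlier (94 patients and 40 healthy individuals). The arithmetic below infers the confusion counts by rounding; these counts are an assumption and are not reported in the text. Under that assumption, the 90.1% figure matches the positive predictive value rather than the overall fraction of correct calls, which is consistent with the earlier passage that labels PPV as "accuracy".

```python
n_cancer, n_healthy = 94, 40               # cohort sizes stated in the text
sensitivity, specificity = 0.872, 0.775    # reported values

# Inferred counts (assumption: nearest integers to the reported rates).
tp = round(sensitivity * n_cancer)    # cancer samples called cancer
tn = round(specificity * n_healthy)   # healthy samples called healthy
fn = n_cancer - tp
fp = n_healthy - tn

ppv = tp / (tp + fp)
overall_accuracy = (tp + tn) / (n_cancer + n_healthy)

print(f"TP={tp} FN={fn} TN={tn} FP={fp}")             # TP=82 FN=12 TN=31 FP=9
print(f"PPV = {ppv:.1%}")                             # PPV = 90.1%
print(f"overall accuracy = {overall_accuracy:.1%}")   # overall accuracy = 84.3%
```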

Table 5: Model performance in the independent validation cohort

The above are only some preferred embodiments of the present invention, and the present invention is not limited to the content of these embodiments. Those skilled in the art may make various changes and modifications within the conceptual scope of the technical solution of the present invention, and any such changes and modifications fall within the protection scope of the present invention.

Claims

1. A non-invasive early screening method for cancer based on cfDNA fragment length distribution characteristics, wherein the method uses lower-depth whole-genome sequencing to compile statistics on the differences in fragment length distribution characteristics between tumor-derived cfDNA and cfDNA derived from healthy individuals, so as to establish an early cancer screening model and achieve non-invasive early screening of cancer.

2. The non-invasive early screening method for cancer based on cfDNA fragment length distribution characteristics according to claim 1, characterized in that a standardized z-score is calculated and used as the feature input value for model training.

3. The non-invasive early screening method for cancer based on cfDNA fragment length distribution characteristics according to claim 1, characterized in that the cfDNA fragments include short fragments and long fragments, wherein the length range of the short fragments is [130,177] bp and the length range of the long fragments is [177,237] bp.

4. The non-invasive early screening method for cancer based on cfDNA fragment length distribution characteristics according to claim 1, characterized in that the LinearSVC algorithm is adopted and 5-fold cross-validation repeated 30 times is used to obtain the model coefficients and establish the early cancer screening model.

5. The non-invasive early screening method for cancer based on cfDNA fragment length distribution characteristics according to any one of claims 1 to 4, characterized in that the cancer is a biliopancreatic malignant tumor.

6. The non-invasive early screening method for cancer based on cfDNA fragment length distribution characteristics according to any one of claims 1 to 4, characterized in that the biliopancreatic malignant tumors include pancreatic cancer, gallbladder cancer, and cholangiocarcinoma.

7. A non-invasive early screening system for cancer based on cfDNA fragment length distribution characteristics, characterized in that the system includes:

a cfDNA fragment feature extraction module, used to obtain cfDNA fragment size feature data from the sample;

a machine learning classification model building module, used to establish an early cancer screening model based on the statistical differences in size characteristics between tumor-derived cfDNA and cfDNA derived from healthy individuals;

an independent validation cohort evaluation module, used to verify the predictive performance of the established machine learning classification model on the independent validation cohort.

8. The non-invasive early screening system for cancer based on cfDNA fragment length distribution characteristics according to claim 7, wherein the cfDNA fragment feature extraction module includes:

a sequencing data alignment unit, used to align the sequencing data to the human reference genome hg19 after the sequencing adapters have been removed;

a cfDNA fragment statistics unit, used to compile cfDNA fragment length data, wherein the hg19 autosomes are divided into 504 adjacent, non-overlapping windows of 5 Mb each; within each window, the ratio of the number of cfDNA fragments longer than 130 bp and shorter than 177 bp to the number of cfDNA fragments longer than 177 bp and shorter than 237 bp is counted; and finally the numbers of long and short cfDNA fragments in each 5 Mb interval are obtained;

a cfDNA fragment feature determination unit, used to determine, from the difference in fragment distribution between cancer patients and healthy controls, the interval in which the two distributions differ most; to define the short-fragment range [130,177] and the long-fragment range [177,237]; and then to calculate the standardized z-scores of the short-fragment cfDNA count and the total fragment count in each of the 504 windows as the feature input values for model training.

9. The non-invasive early screening system for cancer based on cfDNA fragment length distribution characteristics according to claim 7, wherein the machine learning classification model establishment module includes:

a sample data classification unit, used to divide the samples into a training set and a test set at a ratio of 4:1, keeping the proportions of healthy controls and of each cancer type consistent between the two sets;

a model parameter acquisition unit, used to process the sample data in the training set, wherein, in the training cohort, the model parameters are obtained using 5-fold cross-validation repeated 30 times;

a model performance evaluation unit, used to draw the receiver operating characteristic curve of the training cohort based on the model prediction value and the pathology result of each sample in the training cohort.