CN115798582A - Model for identifying benign and malignant lung nodules or predicting risk of postoperative recurrence of lung cancer, kit and application - Google Patents

Publication number
CN115798582A
Authority
CN
China
Prior art keywords
markers
marker
benign
lung cancer
prediction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211552230.XA
Other languages
Chinese (zh)
Inventor
陈克终
杨浩
杨帆
张雪莹
王俊
杜凤霞
陈碧思
白健
郑璐
王寅
吴佳妍
杨爱蓉
周进兴
吴�琳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University Peoples Hospital
Berry Oncology Co Ltd
Original Assignee
Peking University Peoples Hospital
Berry Oncology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University Peoples Hospital, Berry Oncology Co Ltd filed Critical Peking University Peoples Hospital
Priority to CN202211552230.XA priority Critical patent/CN115798582A/en
Publication of CN115798582A publication Critical patent/CN115798582A/en
Pending legal-status Critical Current

Classifications

    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention discloses a model for differentiating benign from malignant pulmonary nodules or predicting the risk of postoperative recurrence of lung cancer, together with a kit and an application, and relates to the field of biomedical technology. The invention finds that the methylation levels of the PTGER4, RASSF1, SHOX2 and PCDHGB6 genes, the genomic instability of the chr6_46000000_48000000, chr14_78000000_80000000 and chr5_146000000_148000000 windows, and the fragment size distribution of chr2q, chr11p and chr14p can be used as markers. This enables effective differentiation of benign from malignant pulmonary nodules and postoperative monitoring of lung cancer recurrence, with advantages such as high sensitivity, good specificity and low detection cost.

Description

A model for differentiating benign from malignant pulmonary nodules or predicting the risk of postoperative recurrence of lung cancer, and a kit and application thereof

Technical field

The invention relates to the field of biomedical technology, and in particular to a model, a kit and an application for differentiating benign from malignant pulmonary nodules or predicting the risk of postoperative recurrence of lung cancer.

Background

With the spread of low-dose CT scanning and its ever higher resolution, pulmonary nodules are being found more and more often in the clinic. In clinical practice, resecting a pulmonary nodule is not difficult; the real test of skill is correctly identifying the nature of the nodule. Radiologically, a pulmonary nodule is defined as a well-circumscribed lesion ≤30 mm in diameter that is completely surrounded by lung parenchyma. Lesions >30 mm in diameter are masses rather than nodules, and their probability of malignancy is significantly higher. Incidental pulmonary nodules are nodules discovered by chance, without corresponding signs or symptoms. Multiple incidental pulmonary nodules are sometimes found, in which case the diagnostic evaluation targets the dominant type or the most suspicious nodule (for example, the largest or a growing nodule). Morphologically, nodules are classified as solid or subsolid, and subsolid nodules are further divided into pure ground-glass nodules (no solid component) and part-solid nodules (containing both ground-glass and solid components). Common benign pulmonary nodules include hamartomas, granulomas, rheumatoid nodules, arteriovenous malformations, infections (including tuberculosis and fungal infections), intrapulmonary lymph nodes and amyloidosis. In summary, determining the nature of a pulmonary nodule is the basis for deciding on surgery: it avoids resecting benign nodules mistaken for malignant ones, and avoids delaying treatment of malignant nodules mistaken for benign ones. Because lung cancer has a high mortality rate among malignant tumors and many patients are already at an advanced stage when diagnosed, patients with pulmonary nodules need to be assessed for malignancy risk, and the level of that risk determines whether further diagnostic and treatment measures are taken.

To date, low-dose computed tomography (LDCT) is the main strategy for substantially reducing long-term lung cancer-related mortality in high-risk asymptomatic populations. Two large randomized controlled trials, the US National Lung Screening Trial (NLST) and the Dutch-Belgian lung cancer screening trial (NELSON), have shown that LDCT-based screening can produce a statistically significant reduction of more than 20% in lung cancer-related mortality in high-risk populations. Low-dose spiral CT can be used in lung cancer screening to find some asymptomatic patients with pulmonary nodules, but it cannot accurately distinguish benign from malignant nodules. Suspicious nodules detected by LDCT can be further diagnosed by lung biopsy (including bronchoscopy and percutaneous puncture). However, peripheral lung lesions and lesions that are not visible bronchoscopically remain a challenge for biopsy-based diagnosis. PET-CT is recommended for characterizing solid nodules >8 mm; its sensitivity for diagnosing malignant pulmonary nodules is 72%-94%, but it is expensive and therefore not suitable for benign/malignant differentiation in the general population of nodule patients. Histopathological biopsy is the "gold standard" for diagnosing malignant tumors, but tissue biopsy is invasive and relatively complicated to perform, and small tumors may require repeated procedures to obtain enough tissue, so it is not realistic for large-scale lung cancer screening.

In view of this, the present invention is proposed.

Summary of the invention

The purpose of the present invention is to provide a model, a kit and an application for differentiating benign from malignant pulmonary nodules or predicting the risk of postoperative recurrence of lung cancer.

The present invention is achieved as follows:

In a first aspect, an embodiment of the present invention provides the use of a reagent for detecting a marker combination in the preparation of a product for differentiating benign from malignant pulmonary nodules and/or predicting the risk of postoperative recurrence of lung cancer. The marker combination includes the following three classes of markers: the methylation level of target genes, genomic instability within target chromosome windows, and the fragment size distribution within target chromosome arms. The target genes include at least one of the PTGER4, RASSF1, SHOX2 and PCDHGB6 genes; the target chromosome windows include at least one of chr6_46000000_48000000, chr14_78000000_80000000 and chr5_146000000_148000000; and the target chromosome arms include at least one of chr2q, chr11p and chr14p.

In a second aspect, an embodiment of the present invention provides a reagent or kit for differentiating benign from malignant pulmonary nodules and/or predicting the risk of postoperative recurrence of lung cancer, which includes the reagent for detecting the marker combination described in the foregoing embodiment.

In a third aspect, an embodiment of the present invention provides a method for training a prediction model for differentiating benign from malignant pulmonary nodules and/or predicting the risk of postoperative recurrence of lung cancer. The method includes: obtaining, for each training sample, the detection result of each marker in the marker combination and the corresponding annotation, where the annotation is a label representing whether the sample's pulmonary nodule is benign or malignant and/or the risk of postoperative recurrence of lung cancer, and the marker combination is the one described in the foregoing embodiment; inputting the detection results of all markers in the marker combination, or the scores of the three classes of markers, into a pre-built prediction model to obtain a prediction result; and updating the parameters of the prediction model based on the annotation and the prediction result. The score of each class of markers is obtained as follows: obtain the detection result of each marker in that class for the training samples, together with the corresponding annotations; input the detection results of all markers in that class into a pre-built prediction model; and take the predicted result as the score of that class. The pre-built prediction model is a machine learning model that can predict whether a sample's pulmonary nodule is benign or malignant and/or the risk of postoperative recurrence of lung cancer from the detection results of the marker combination, the scores of the three classes of markers, or the detection results of each marker within each class.

In a fourth aspect, an embodiment of the present invention provides a device for predicting whether pulmonary nodules are benign or malignant and/or predicting the risk of postoperative recurrence of lung cancer, which includes an acquisition module and a prediction module. The acquisition module obtains the detection result of each marker in the marker combination described in the foregoing embodiment for the sample to be tested. The prediction module inputs the detection results of all markers, or the scores of the three classes of markers, into the prediction model trained by the training method described in the foregoing embodiment, and obtains the prediction result for the sample to be tested. The score of each class of markers is obtained as described in the foregoing embodiment.

In a fifth aspect, an embodiment of the present invention provides an electronic device that includes a processor and a memory. The memory stores a program which, when executed by the processor, causes the processor to implement the training method described in the foregoing embodiment, or a method for predicting whether pulmonary nodules are benign or malignant and/or predicting the risk of postoperative recurrence of lung cancer. The prediction method includes: obtaining the detection result of each marker in the marker combination for the sample to be tested, where the marker combination is the one described in the foregoing embodiment; and inputting the detection results of all markers, or the scores of the three classes of markers, into the prediction model trained by the training method described in the foregoing embodiment to obtain the prediction result for the sample to be tested. The score of each class of markers is obtained as described in the foregoing embodiment.

In a sixth aspect, an embodiment of the present invention provides a computer-readable medium storing a computer program which, when executed by a processor, implements the training method or the prediction method described in the foregoing embodiments.

The present invention has the following beneficial effects:

Using methylation detection data, the present invention applies ensemble learning to the differences in fragment size, copy number and methylation between people with malignant pulmonary nodules and people with benign pulmonary nodules, and identifies genomic and epigenomic signals specific to lung carcinogenesis that can serve as new tumor markers. This enables effective differentiation of benign from malignant pulmonary nodules as well as monitoring of postoperative lung cancer recurrence, with advantages such as high sensitivity, good specificity and low detection cost.

Brief description of the drawings

To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the embodiments are briefly introduced below. It should be understood that the following drawings show only some embodiments of the present invention and therefore should not be regarded as limiting its scope; those of ordinary skill in the art can derive other related drawings from them without creative effort.

Figure 1 shows the distribution of the LC Score in benign and malignant nodules;

Figure 2 shows the ROC curve of the LC Score for differentiating benign from malignant pulmonary nodules;

Figure 3 shows the LC Score used to monitor the risk of postoperative recurrence of lung cancer;

Figure 4 is the ROC curve of experimental group 2 in Example 4;

Figure 5 is the ROC curve of experimental group 3 in Example 4.

Detailed description

To make the purpose, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below. Where specific conditions are not indicated in the examples, conventional conditions or those recommended by the manufacturer were used. Reagents and instruments whose manufacturer is not indicated are conventional, commercially available products.

First, an embodiment of the present invention provides the use of a reagent for detecting a marker combination in the preparation of a product for differentiating benign from malignant pulmonary nodules and/or predicting the risk of postoperative recurrence of lung cancer. The marker combination includes the following three classes of markers: the methylation level of target genes, genomic instability within target chromosome windows, and the fragment size distribution within target chromosome arms;

where the target genes include at least one of the PTGER4, RASSF1, SHOX2 and PCDHGB6 genes;

the target chromosome windows include at least one of chr6_46000000_48000000, chr14_78000000_80000000 and chr5_146000000_148000000;

the target chromosome arms include at least one of chr2q, chr11p and chr14p.

Using methylation detection data, the inventors applied ensemble learning to the differences in fragment size, copy number and methylation between people with malignant and benign pulmonary nodules and found the above marker combination, which enables differentiation of benign from malignant pulmonary nodules and monitoring of postoperative lung cancer recurrence. Compared with other marker combinations, it has better sensitivity, specificity and accuracy.

Given that the target genes, target chromosome windows and target chromosome arms are disclosed, methods for detecting and calculating the methylation level of the target genes, the genomic instability within the target chromosome windows and the fragment size distribution within the target chromosome arms can be obtained from existing knowledge.

Herein, "the methylation level of a target gene" refers to the methylation level of the promoter of that gene.

Herein, "window" refers to a window as used in bioinformatics analysis. chr6_46000000_48000000 refers to the genomic segment between base 46000000 and base 48000000 of chromosome 6, with hg19 as the reference genome; the other target chromosome windows are named in the same way.
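The window naming convention above can be unpacked mechanically. The following is a minimal Python sketch; the `Window` type and `parse_window` helper are illustrative names that do not appear in the patent, and the length is taken as the simple difference of the two coordinates, which matches the 2000000 bp window length stated elsewhere in the description.

```python
from typing import NamedTuple


class Window(NamedTuple):
    """A genomic window on the hg19 reference (hypothetical helper type)."""
    chrom: str
    start: int
    end: int

    @property
    def length(self) -> int:
        # Assumption: length is the plain difference of the two coordinates.
        return self.end - self.start


def parse_window(name: str) -> Window:
    """Split a name such as 'chr6_46000000_48000000' into chromosome, start and end."""
    chrom, start, end = name.split("_")
    return Window(chrom, int(start), int(end))


TARGET_WINDOWS = [
    "chr6_46000000_48000000",
    "chr14_78000000_80000000",
    "chr5_146000000_148000000",
]

windows = [parse_window(name) for name in TARGET_WINDOWS]
```

Each of the three target windows parsed this way spans 2 Mb.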

Herein, the "q" in "chr2q" refers to the long arm of the corresponding chromosome, and the "p" in "chr11p" and "chr14p" refers to the short arm of the corresponding chromosome.

In some embodiments, the method for detecting the methylation level of the target gene is any one of: methylation-specific PCR (MS-PCR), bisulfite treatment plus sequencing, combined bisulfite restriction analysis (COBRA), fluorescence quantification, methylation-sensitive high-resolution melting curve analysis, pyrosequencing, chip-based methylation profiling and high-throughput sequencing.

In some embodiments, the genomic instability within a target chromosome window is represented by a genomic instability score for that window.

Optionally, the genomic instability score is calculated by any one of: a Z-value algorithm based on sequencing depth, a log2ratio algorithm based on control samples, an algorithm based on soft-clipped split reads, and the BinCount_i algorithm.
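The patent names a depth-based Z-value algorithm without giving its details. As an illustration only, the sketch below uses the generic standard score: the depth of a window in the test sample is compared with the mean and sample standard deviation of the same window in a set of control samples. The function name and the choice of controls are assumptions, not the patent's procedure.

```python
from statistics import mean, stdev
from typing import Sequence


def depth_z_score(sample_depth: float, control_depths: Sequence[float]) -> float:
    """Generic Z value of a window's normalized depth against control samples.

    Assumption: controls are the same window measured in reference samples;
    the patent does not specify how the reference set is chosen.
    """
    mu = mean(control_depths)
    sigma = stdev(control_depths)  # sample standard deviation, needs >= 2 controls
    if sigma == 0:
        raise ValueError("control depths have zero variance")
    return (sample_depth - mu) / sigma


# Made-up numbers for illustration only.
z = depth_z_score(106.0, [98.0, 100.0, 102.0])
```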

Optionally, BinCount_i is calculated as follows:

[Formula image: Figure BDA0003981740380000041]

where BinCount_i is the genomic instability score; Fragment_i is the number of reads in the i-th window; TotalMappedFragments is the total number of reads for the sample; and WindowLength_i is the length of the i-th window. When the window length is 2000000 bp, WindowLength_i is 2E6.
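The formula itself is only present as an image and is not reproduced in this text. Purely as an illustration of how the three defined quantities could combine, the sketch below assumes a read count normalized by sequencing depth and window length and multiplied by a scaling constant. Both the functional form and the `scale` value are assumptions and are not confirmed by the source.

```python
def bin_count(fragment_i: int, total_mapped_fragments: int,
              window_length_i: int, scale: float = 1e9) -> float:
    """Depth- and length-normalized read count for window i.

    ASSUMED form: Fragment_i * scale / (TotalMappedFragments * WindowLength_i).
    The patent's actual formula image may differ, including in its scaling.
    """
    if total_mapped_fragments <= 0 or window_length_i <= 0:
        raise ValueError("totals and window length must be positive")
    return fragment_i * scale / (total_mapped_fragments * window_length_i)


# Made-up numbers: 2000 reads in a 2 Mb window out of 10 million mapped reads.
score = bin_count(2000, 10_000_000, 2_000_000)
```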

In some embodiments, the fragment size distribution refers to the ratio of, or difference between, the numbers of long and short fragments within the target chromosome window, where long fragments are 101-220 bp and short fragments are 20-100 bp.

Herein, "the ratio of, or difference between, the numbers of long and short fragments" can be understood as the ratio of long to short fragments, the ratio of short to long fragments, or a value representing the difference between long and short fragments.
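The length ranges above translate directly into a counting rule. The following Python sketch computes the long-to-short ratio for one region from a list of fragment lengths; the function name is illustrative, and the choice of long/short (rather than short/long or a difference) is just one of the options the text allows.

```python
from typing import Iterable

SHORT_RANGE = (20, 100)    # bp, inclusive, as stated in the description
LONG_RANGE = (101, 220)    # bp, inclusive, as stated in the description


def fragment_size_ratio(lengths: Iterable[int]) -> float:
    """Ratio of long (101-220 bp) to short (20-100 bp) fragments in one region."""
    short = long_ = 0
    for n in lengths:
        if SHORT_RANGE[0] <= n <= SHORT_RANGE[1]:
            short += 1
        elif LONG_RANGE[0] <= n <= LONG_RANGE[1]:
            long_ += 1
        # Fragments outside both ranges are ignored.
    if short == 0:
        raise ValueError("no short fragments in this region")
    return long_ / short


# Made-up fragment lengths for illustration only: 3 short, 4 long, 2 ignored.
ratio = fragment_size_ratio([50, 90, 100, 101, 167, 167, 220, 221, 19])
```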

In some embodiments, the lung cancer surgery includes conventional lung cancer surgery, specifically any one or any combination of: lobectomy plus lymph node dissection, wedge resection, segmentectomy, lobectomy, pneumonectomy and minimally invasive thoracoscopic surgery.

In some embodiments, the product is any one of a reagent, a kit and a prediction model.

In another aspect, an embodiment of the present invention provides a reagent or kit for differentiating benign from malignant pulmonary nodules and/or predicting the risk of postoperative recurrence of lung cancer, which includes the reagent for detecting the marker combination described in any of the foregoing embodiments.

In another aspect, an embodiment of the present invention provides a method for training a prediction model for differentiating benign from malignant pulmonary nodules and/or predicting the risk of postoperative recurrence of lung cancer, which includes:

obtaining, for each training sample, the detection result of each marker in the marker combination and the corresponding annotation, where the annotation is a label representing whether the sample's pulmonary nodule is benign or malignant and/or the risk of postoperative recurrence of lung cancer, and the marker combination is the one described in any of the foregoing embodiments;

inputting the detection results of all markers in the marker combination, or the scores of the three classes of markers, into a pre-built prediction model to obtain a prediction result. The score of each class of markers is obtained as follows: obtain the detection result of each marker in that class for the training samples, together with the corresponding annotations; input the detection results of all markers in that class into a pre-built prediction model; and take the predicted result as the score of that class. The pre-built prediction model is a machine learning model that can predict whether a sample's pulmonary nodule is benign or malignant and/or the risk of postoperative recurrence of lung cancer from the detection results of the marker combination, the scores of the three classes of markers, or the detection results of each marker within each class;

updating the parameters of the prediction model based on the annotation and the prediction result.

Optionally, the label is a character or a character string.

Herein, the "three classes of markers" are the methylation level of the target genes, the genomic instability within the target chromosome windows, and the fragment size distribution within the target chromosome arms; "each class of markers" refers to any one of them.

The prediction model can be built using the detection results of all markers in the marker combination as input data, or using the scores of the three classes of markers as input data; the two approaches have similar predictive performance. By comparison, the model built with the scores of the three classes of markers as input performs better.

In some embodiments, the machine learning model includes a logistic regression model.
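The two-stage variant (one score per marker class, then a final model on the three scores) can be sketched with logistic regression as follows. This is a minimal pure-Python illustration under stated assumptions: the class and function names, the plain batch gradient descent, the hyperparameters and every number in the toy data are invented for the example and are not taken from the patent.

```python
import math
from typing import Sequence


def sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


class LogisticModel:
    """Minimal logistic regression trained by batch gradient descent."""

    def __init__(self, n_features: int, lr: float = 0.5, epochs: int = 2000):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr
        self.epochs = epochs

    def predict(self, x: Sequence[float]) -> float:
        return sigmoid(sum(w * v for w, v in zip(self.w, x)) + self.b)

    def fit(self, X, y):
        n = len(X)
        for _ in range(self.epochs):
            grad_w = [0.0] * len(self.w)
            grad_b = 0.0
            for x, label in zip(X, y):
                err = self.predict(x) - label  # gradient of the log loss
                for j, v in enumerate(x):
                    grad_w[j] += err * v
                grad_b += err
            # Parameter update driven by labels and current predictions.
            self.w = [w - self.lr * g / n for w, g in zip(self.w, grad_w)]
            self.b -= self.lr * grad_b / n
        return self


def train_two_stage(groups, y):
    # Stage 1: one model per marker class; its output is that class's score.
    class_models = [LogisticModel(len(X[0])).fit(X, y) for X in groups]
    scores = [[m.predict(x) for m, x in zip(class_models, sample)]
              for sample in zip(*groups)]
    # Stage 2: the three class scores are the inputs of the final model.
    final_model = LogisticModel(len(groups)).fit(scores, y)
    return class_models, final_model


def predict_two_stage(class_models, final_model, sample):
    scores = [m.predict(x) for m, x in zip(class_models, sample)]
    return final_model.predict(scores)


# Made-up toy values (NOT data from the patent): 3 benign (0), 3 malignant (1).
labels = [0, 0, 0, 1, 1, 1]
methylation = [[0.10, 0.12, 0.08, 0.11], [0.15, 0.10, 0.09, 0.14],
               [0.12, 0.11, 0.10, 0.09], [0.60, 0.55, 0.70, 0.50],
               [0.65, 0.58, 0.62, 0.57], [0.55, 0.61, 0.66, 0.52]]
instability = [[0.10, 0.20, 0.10], [0.20, 0.10, 0.15], [0.15, 0.10, 0.20],
               [0.80, 0.70, 0.90], [0.70, 0.90, 0.80], [0.90, 0.80, 0.70]]
fragment = [[0.20, 0.10, 0.20], [0.10, 0.20, 0.10], [0.15, 0.15, 0.10],
            [0.70, 0.80, 0.75], [0.80, 0.70, 0.85], [0.75, 0.85, 0.70]]

class_models, final_model = train_two_stage(
    [methylation, instability, fragment], labels)
risk = predict_two_stage(
    class_models, final_model,
    [[0.62, 0.60, 0.65, 0.55], [0.85, 0.80, 0.75], [0.80, 0.75, 0.80]])
```

In this sketch the class scores are computed on the same samples that train the final model, which is the simplest reading of the text; a real pipeline would normally hold out data when producing the stage-1 scores to limit overfitting.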

Optionally, the number of training samples may be greater than or equal to any of 10, 50, 100, 200, 300, 400 and 500.

In some embodiments, the sample to be tested and the training samples are independently selected from: plasma samples, serum samples, whole blood samples, negative standards and positive standards. Optionally, the sample to be tested or the training samples may also be environmental samples containing at least one of a plasma sample or a serum sample.

In another aspect, an embodiment of the present invention provides a device for predicting whether pulmonary nodules are benign or malignant and/or predicting the risk of postoperative recurrence of lung cancer, which includes:

an acquisition module, used to obtain the detection result of each marker in the marker combination for the sample to be tested, where the marker combination is the one described in any of the foregoing embodiments;

a prediction module, used to input the detection results of all markers of the sample to be tested, or the scores of the three classes of markers, into the prediction model trained by the training method described in any of the foregoing embodiments, and to obtain the prediction result for the sample to be tested. The score of each class of markers is obtained as described in any of the foregoing embodiments.
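One way to picture the two modules is as a pair of small classes: one that collects the ten marker values in a fixed order and one that wraps a trained model. The sketch below is only a structural illustration; the class names, the dictionary input format and the `placeholder_model` stand-in are hypothetical and are not part of the patent.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Sequence

# The ten markers named in the description, in a fixed (arbitrary) order.
MARKER_COMBINATION = [
    "PTGER4", "RASSF1", "SHOX2", "PCDHGB6",
    "chr6_46000000_48000000", "chr14_78000000_80000000",
    "chr5_146000000_148000000",
    "chr2q", "chr11p", "chr14p",
]


@dataclass
class AcquisitionModule:
    """Collects one detection result per marker for the sample to be tested."""
    markers: List[str] = field(default_factory=lambda: list(MARKER_COMBINATION))

    def acquire(self, detection_results: Dict[str, float]) -> List[float]:
        missing = [m for m in self.markers if m not in detection_results]
        if missing:
            raise KeyError(f"missing detection results for: {missing}")
        return [float(detection_results[m]) for m in self.markers]


@dataclass
class PredictionModule:
    """Feeds the acquired values to an already trained prediction model."""
    model: Callable[[Sequence[float]], float]

    def predict(self, features: Sequence[float]) -> float:
        return self.model(features)


def placeholder_model(features: Sequence[float]) -> float:
    # Stand-in for a trained model; it simply averages the inputs.
    return sum(features) / len(features)


sample = {name: 0.5 for name in MARKER_COMBINATION}  # made-up values
features = AcquisitionModule().acquire(sample)
prediction = PredictionModule(model=placeholder_model).predict(features)
```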

Optionally, the modules described in the above embodiment may be stored in memory as software or firmware, or built into the operating system (OS) of the electronic device provided by this application, and may be executed by the processor of the electronic device. The data and program code needed to run these modules may also be stored in the memory.

In another aspect, an embodiment of the present invention provides an electronic device that includes a processor and a memory. The memory stores a program which, when executed by the processor, causes the processor to implement the training method described in any of the foregoing embodiments, or a method for predicting whether pulmonary nodules are benign or malignant and/or predicting the risk of postoperative recurrence of lung cancer;

The prediction method includes: obtaining the detection result of each marker in the marker combination for the sample to be tested, where the marker combination is the one described in any of the foregoing embodiments; and inputting the detection results of all markers of the sample to be tested, or the scores of the three classes of markers, into the prediction model trained by the training method described in any of the foregoing embodiments to obtain the prediction result for the sample to be tested. The score of each class of markers is obtained as described in any of the foregoing embodiments.

The electronic device may include a memory, a processor, a bus and a communication interface. The memory, processor and communication interface are electrically connected to one another, directly or indirectly, to transmit or exchange data. For example, these components may be electrically connected through one or more buses or signal lines.

The memory may be, but is not limited to, random access memory (RAM), read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), and the like.

The processor may be an integrated circuit chip with signal processing capability. It may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP) and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.

The electronic device may be a server, a cloud platform, a mobile phone, a tablet computer, a notebook computer, an ultra-mobile personal computer (UMPC), a handheld computer, a netbook, a personal digital assistant (PDA), a wearable electronic device, a virtual reality device or a similar device; the embodiments of this application therefore place no restriction on the type of electronic device.

此外,本发明实施例提供了一种计算机可读介质,所述计算机可读介质上存储有计算机程序,所述计算机程序被处理器执行时实现前述任意实施例所述的训练方法或前述任意实施例所述的预测方法。In addition, an embodiment of the present invention provides a computer-readable medium, on which a computer program is stored. When the computer program is executed by a processor, the training method described in any of the foregoing embodiments or any of the foregoing implementations can be implemented. The forecasting method described in the example.

计算机可读介质可以为通用的存储介质,如移动磁盘、硬盘等。The computer-readable medium may be a general storage medium, such as a removable disk, a hard disk, and the like.

以下结合实施例对本发明的特征和性能作进一步的详细描述。The characteristics and performance of the present invention will be described in further detail below in conjunction with the examples.

Example 1

Design of the lung cancer-specific differentially methylated region (DMR) gene panel:

Pan-cancer methylation data, TCGA Pan-Cancer (PANCAN), were downloaded from the XENA database. The data comprise the methylation sites and their methylation levels (β values) for the following malignant tumors (sample numbers are given in Table 1) and for adjacent healthy control tissue: adrenocortical carcinoma, bladder urothelial carcinoma, breast invasive carcinoma, cervical squamous cell carcinoma and endocervical adenocarcinoma, cholangiocarcinoma, colon adenocarcinoma, cell lymphoma, esophageal cancer, glioblastoma multiforme, head and neck squamous cell carcinoma, chromophobe renal cell carcinoma, clear cell renal cell carcinoma, papillary renal cell carcinoma, acute myeloid leukemia, brain lower-grade glioma, hepatocellular carcinoma, lung adenocarcinoma, lung squamous cell carcinoma, mesothelioma, ovarian serous cystadenocarcinoma, pancreatic cancer, pheochromocytoma and paraganglioma, prostate cancer, sarcoma, skin melanoma, gastric adenocarcinoma, testicular germ cell tumor, thyroid cancer, endometrial cancer, uterine carcinosarcoma, uveal melanoma, and others.

Table 1: TCGA sample counts

Figure BDA0003981740380000061

Figure BDA0003981740380000071

The combined panel of methylation sites that differ between each cancer type and healthy controls is the union of the CpG sites for which a cancer type, compared with its adjacent healthy control tissue, gives a Wilcoxon rank sum test P < 0.05 and a fold change > 1.2. The comparison is made separately for each cancer type (adrenocortical carcinoma, bladder urothelial carcinoma, breast invasive carcinoma, cervical squamous cell carcinoma and endocervical adenocarcinoma, cholangiocarcinoma, colon adenocarcinoma, cell lymphoma, esophageal cancer, glioblastoma multiforme, head and neck squamous cell carcinoma, chromophobe renal cell carcinoma, clear cell renal cell carcinoma, papillary renal cell carcinoma, acute myeloid leukemia, brain lower-grade glioma, hepatocellular carcinoma, lung adenocarcinoma, lung squamous cell carcinoma, mesothelioma, ovarian serous cystadenocarcinoma, pancreatic cancer, pheochromocytoma and paraganglioma, prostate cancer, sarcoma, skin melanoma, gastric adenocarcinoma, testicular germ cell tumor, thyroid cancer, endometrial cancer, uterine carcinosarcoma, uveal melanoma). The fold change is defined as the mean methylation level of the CpG site in the positive group divided by the mean methylation level of that CpG site in the negative group.
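The filter described above (rank sum P < 0.05 and fold change > 1.2) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the p-value uses the normal approximation to the Wilcoxon rank sum statistic without a tie correction, which is an assumption made here to keep the example dependency-free.

```python
import math
from statistics import mean


def rank_sum_p(a, b):
    """Two-sided Wilcoxon rank sum p-value (normal approximation, no tie correction)."""
    pooled = sorted((v, g) for g, vals in enumerate((a, b)) for v in vals)
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        avg = (i + j) / 2 + 1  # tied values share their average 1-based rank
        for k in range(i, j + 1):
            ranks[k] = avg
        i = j + 1
    n1, n2 = len(a), len(b)
    w = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)  # rank sum of group a
    mu = n1 * (n1 + n2 + 1) / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mu) / sigma
    return math.erfc(abs(z) / math.sqrt(2))


def is_differential(positive, negative, p_cut=0.05, fc_cut=1.2):
    """Apply both cut-offs from the text to one CpG site's beta values."""
    neg_mean = mean(negative)
    if neg_mean == 0:
        return False  # fold change undefined; treated as not passing (assumption)
    fold_change = mean(positive) / neg_mean
    return rank_sum_p(positive, negative) < p_cut and fold_change > fc_cut
```

Running `is_differential` on every CpG site for each cancer type and taking the union of the passing sites would give a panel of the kind described.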

The combined panel of CpG sites with tissue-of-origin specificity is defined as the union of the CpG sites with τ > 0.85, taken separately for each of the cancer types listed above (adrenocortical carcinoma through uveal melanoma). This panel can represent the methylation sites that drive cancer development. τ is defined as:

Figure BDA0003981740380000081

Figure BDA0003981740380000082

In Formula 1, n is the number of samples of any given cancer type. In Formula 2, x_i is the methylation level of any given sample of any given cancer type, and the symbol shown in Figure BDA0003981740380000083 is the maximum methylation level among samples of the same type for that cancer type.
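Formulas 1 and 2 survive only as images, so their exact form cannot be confirmed from the text. The sketch below assumes they follow the widely used τ specificity index, in which each level is divided by the maximum and τ is the sum of (1 - normalized level) divided by (n - 1). Both that assumption and the function name are illustrative, not taken from the source.

```python
def tau_index(levels):
    """Specificity index under the assumed form: sum(1 - x_i / max(x)) / (n - 1)."""
    peak = max(levels)
    if peak == 0 or len(levels) < 2:
        return 0.0  # no signal, or too few values to define specificity
    return sum(1 - x / peak for x in levels) / (len(levels) - 1)
```

Under this form, a site whose signal is confined to a single entry scores 1, a uniformly methylated site scores 0, and the text's cut-off keeps sites scoring above 0.85.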

Merging the methylation sites that differ between cancer types and healthy controls with the tissue-of-origin-specific CpG sites of each cancer type yields a methylation panel that characterizes features common across cancer types (Table 2).

Table 2: Methylation gene panel

Figure BDA0003981740380000084

Figure BDA0003981740380000091

Figure BDA0003981740380000101

Figure BDA0003981740380000111

Figure BDA0003981740380000121

Figure BDA0003981740380000131

Methylation library construction and sequencing:

1. Methylation libraries were constructed from 5-30 ng of cfDNA. After 50 pg of an internal reference DNA mixture (prepared in-house, 166 bp) was spiked in, the samples were bisulfite-treated, purified, and recovered using the Lightning conversion reagent kit from Zymo Research.

2. Libraries were constructed with the methylation library preparation kit from Yeasen Biotechnology (翌圣生物). DNA purification and recovery during library construction used AMPure XP beads (Beckman), and the libraries were eluted and collected in EB buffer (Qiagen).

3. 500 ng of pre-library DNA was taken, and the target-region DNA (DNA fragments carrying the genes shown in Table 2) was captured using hybridization capture and wash reagents from Twist Bioscience.

4. The washed DNA was amplified with KAPA HiFi HotStart ReadyMix, and the amplification product was purified with AMPure XP beads (Beckman) to give the final library.

5. The final library was quantified by qPCR (KAPA SYBR Fast Kit, Roche) and then sequenced on the Illumina NovaSeq 6000 platform with paired-end 150 bp reads.

Obtaining observed methylation levels:

Raw sequencer output was converted to FASTQ files with bcl2fastq. cutadapt was used to remove low-quality reads, reads in which N bases make up more than 5%, and reads shorter than 50 bp. The remaining reads were aligned to the hg19 reference genome with BSMAP, and PCR duplicates were then removed. Bismark was used to determine the observed methylation level of every CpG site within the target regions (DNA fragments carrying the genes listed in Table 2).
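The two numeric read filters stated above (N content above 5%, length below 50 bp) can be illustrated in a few lines. This is only a sketch of the stated criteria with a hypothetical function name; the low-quality criterion is not specified in the text and is therefore not modeled, and the actual filtering in the source is done by cutadapt.

```python
def keep_read(seq, max_n_fraction=0.05, min_length=50):
    """Return True if a read passes the length and N-content criteria in the text."""
    if len(seq) < min_length:
        return False  # reads shorter than 50 bp are removed
    n_fraction = seq.upper().count("N") / len(seq)
    return n_fraction <= max_n_fraction  # reads with more than 5% N are removed
```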

Genomic instability assessment:

The whole genome was divided into equal-length windows of 2M (two million) bases, excluding the X and Y sex chromosomes, the mitochondrial genome, and other contigs. featureCounts was used to count the reads in each window. Because library size differs between samples, the counts must be normalized within each sample; a "per million reads" normalization was applied, specifically:

Figure BDA0003981740380000132

where Fragment_i is the number of reads in the i-th window, TotalMappedFragments is the total number of reads in the sample, and WindowLength_i is the length of the i-th window, here 2E6.
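The windowing and normalization can be sketched as below. The window naming follows the identifiers that appear later in Table 4 (for example chr19_1_2000000 and chr12_132000000_133851895). The BinCount formula itself is an image in the source, so the expression used here (reads in the window, divided by millions of mapped reads, divided by the window length) is an assumption built from the three named quantities. Both function names are hypothetical.

```python
def make_windows(chrom, chrom_len, size=2_000_000):
    """Name consecutive windows as chrom_start_end; the last one ends at the chromosome end."""
    windows = []
    for start in range(0, chrom_len, size):
        end = min(start + size, chrom_len)
        windows.append(f"{chrom}_{max(start, 1)}_{end}")  # first window starts at 1
    return windows


def bin_count(fragments_in_window, total_mapped, window_len=2_000_000):
    """Assumed per-million, per-base normalization of a window's read count."""
    per_million = total_mapped / 1_000_000
    return fragments_in_window / per_million / window_len
```

With the hg19 length of chr12 (133,851,895 bp), `make_windows("chr12", 133_851_895)` ends in a shorter final window, which is why window length is kept as a parameter of `bin_count`.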

Fragment size:

The reads in the BAM file of the sample to be tested were reconstructed into the original DNA templates (fragments), and the whole genome was divided by chromosome arm, specifically: chr1p, chr1q, chr2p, chr2q, chr3p, chr3q, chr4p, chr4q, chr5p, chr5q, chr6p, chr6q, chr7p, chr7q, chr8p, chr8q, chr9p, chr9q, chr10p, chr10q, chr11p, chr11q, chr12p, chr12q, chr13p, chr13q, chr14p, chr14q, chr15p, chr15q, chr16p, chr16q, chr17p, chr17q, chr18p, chr18q, chr19p, chr19q, chr20p, chr20q, chr21p, chr21q, chr22p, chr22q. For each chromosome arm, the numbers of fragments with lengths of 20-100 bp and of 101-220 bp were counted. Because library size differs between samples, the counts must be normalized within each sample: the number of fragments in a given length range is divided by the total number of fragments in that chromosome arm. Dividing the number of 20-100 bp fragments by the total number of fragments in the arm gives the "short fragment" (20-100 bp) proportion; dividing the number of 101-220 bp fragments by the total gives the "long fragment" (101-220 bp) proportion. The ratio of the number of short fragments to the number of long fragments in each chromosome arm is also calculated and taken as the fragment size distribution of that arm.
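The per-arm fragment statistics described above reduce to three numbers. A minimal sketch, assuming the fragment lengths for one chromosome arm have already been extracted from the BAM file (the extraction step and the function name are not from the source):

```python
def fragment_profile(lengths):
    """Short/long fragment proportions and their ratio for one chromosome arm."""
    total = len(lengths)
    if total == 0:
        raise ValueError("no fragments in this chromosome arm")
    short = sum(1 for n in lengths if 20 <= n <= 100)   # "short fragments"
    long_ = sum(1 for n in lengths if 101 <= n <= 220)  # "long fragments"
    return {
        "short_fraction": short / total,
        "long_fraction": long_ / total,
        "short_long_ratio": short / long_ if long_ else float("nan"),
    }
```

Fragments outside both ranges still count toward the arm total, so the two fractions need not sum to 1.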

Construction of the multimodal ensemble learning model:

1. Feature extraction: the methylation levels, genomic instability scores, and fragment sizes produced by the methods above were extracted from each sample's sequencing data as input. The LASSO (least absolute shrinkage and selection operator) regression algorithm was used to further reduce the dimensionality of the three types of markers, and markers with nonzero weights were kept as the region combination for model construction;

2. Determining the optimal model parameters: logistic regression was used for model construction and iterative training, and the optimal threshold was determined by the Youden index method;

3. Model performance validation: the optimal parameters and optimal threshold of the model were validated on an independent test set, the receiver operating characteristic (ROC) curve was plotted, and the area under the curve (AUC) was calculated. LC Score is defined as the final output of the model:

Z = coef_i × P_meth + coef_j × P_bincount + coef_k × P_fragment − intercept;

where P_meth is the methylation model score, P_bincount is the genomic instability model score, and P_fragment is the fragment size distribution model score; coef_i, coef_j, and coef_k are the logistic regression weights for methylation, genomic instability, and fragment size distribution, respectively; and intercept is the intercept term.

In this example, coef_i, coef_j, and coef_k are 2.98, 1.59, and 3.27, respectively, and intercept is -5.75. In other embodiments, these parameters may change with the specific training and test samples, but this does not significantly affect the final result.
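The combination formula can be evaluated directly with the stated coefficients. The sketch below transcribes the formula with its sign exactly as printed, so with the stated intercept of -5.75 the subtraction adds 5.75 to Z; readers may want to check that sign convention against the original filing. Mapping Z to a probability with the logistic function is an assumption, made because Example 2 describes LC Score as a malignancy probability. Function and constant names are illustrative.

```python
import math

# Coefficients and intercept as stated for this example.
COEF_METH, COEF_BINCOUNT, COEF_FRAGMENT = 2.98, 1.59, 3.27
INTERCEPT = -5.75


def linear_score(p_meth, p_bincount, p_fragment):
    """Z as printed: weighted sum of the three sub-model scores minus the intercept."""
    return (COEF_METH * p_meth + COEF_BINCOUNT * p_bincount
            + COEF_FRAGMENT * p_fragment - INTERCEPT)


def logistic(z):
    """Standard logistic transform; its use for LC Score is an assumption."""
    return 1.0 / (1.0 + math.exp(-z))
```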

Note that, in this example, the methylation model, the genomic instability model, and the fragment size distribution model (the model for each type of marker) are built in the same way as the final prediction model: each is obtained by training a logistic regression model on the same training samples, and they differ only in their input data.
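Step 2 of the model construction selects the decision threshold with the Youden index, J = sensitivity + specificity - 1. A minimal sketch of that selection follows, assuming binary labels (1 = lung cancer, 0 = benign) and a "score >= threshold means positive" rule. The function name and the tie-breaking rule (the lowest threshold that reaches the maximum J) are choices made here, not details from the source.

```python
def youden_threshold(scores, labels):
    """Return (threshold, J) maximizing sensitivity + specificity - 1."""
    pos = sum(labels)
    neg = len(labels) - pos
    best_j, best_t = -1.0, None
    for t in sorted(set(scores)):  # each observed score is a candidate threshold
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < t)
        j = tp / pos + tn / neg - 1
        if j > best_j:
            best_j, best_t = j, t
    return best_t, best_j
```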

Example 2

Candidate biomarker combinations were screened in plasma samples from 52 patients with benign nodules (10 hamartoma, 5 bronchial epithelial hyperplasia, 8 fibrous tissue hyperplasia with hyaline degeneration, 12 interstitial lung disease, 7 cryptococcal infection, 10 fungal infection) and 76 patients with lung cancer (30 stage I, 12 stage II, 14 stage III, 20 stage IV). Sample details are given in Table 7 at the end of the description.

(1) The methylation panel was screened as follows. The methylation levels of all gene promoters in the methylation gene panel of Table 2 (Example 1) were extracted from each sample as input data, and the CpG sites with a Wilcoxon rank sum test P < 0.05 and a fold change > 1.2 between benign nodules and lung cancer were retained; the fold change is defined as the mean methylation level of the CpG site in the lung cancer group divided by the mean methylation level of that CpG site in the benign nodule group. The remaining CpG sites were then further reduced in dimensionality with recursive feature elimination (RFE), a machine learning method. Using the current samples, logistic regression models were fitted with the candidate CpG sites and the area under the curve (AUC) was evaluated by 5-fold cross-validation; the AUCs were sorted in descending order and the top 100 genes were retained. The results are shown in Table 3.
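The ranking step can be illustrated with a simplified stand-in. The source fits logistic regression models and scores them by 5-fold cross-validated AUC after RFE; the sketch below instead ranks each feature by the AUC of its raw values, computed as the probability that a positive sample outscores a negative one. It shows only the "sort by AUC, keep the top k" idea, and all names are hypothetical.

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def rank_features(feature_table, labels, top_k):
    """Sort feature names by descending AUC and keep the first top_k."""
    ranked = sorted(feature_table,
                    key=lambda name: auc(feature_table[name], labels),
                    reverse=True)
    return ranked[:top_k]
```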

Table 3: Diagnostic performance of individual methylated genes in distinguishing benign from malignant pulmonary nodules

Figure BDA0003981740380000151

Figure BDA0003981740380000161

Figure BDA0003981740380000171

On evaluation, combining the top 4 genes (PTGER4, RASSF1, SHOX2, PCDHGB6) gave excellent diagnostic performance; the methylation panel therefore uses the PTGER4, RASSF1, SHOX2, and PCDHGB6 genes.

(2) The genomic instability panel was screened as follows. The BinCount value of each window across the whole genome (the genome being divided into equal-length windows of 2M bases) was extracted from each sample as input data, and the windows with a Wilcoxon rank sum test P < 0.05 and a fold change > 1.2 between benign nodules and lung cancer were retained. The remaining windows were then further reduced in dimensionality with recursive feature elimination (RFE). Using the current samples, logistic regression models were fitted with the candidate windows and the area under the curve (AUC) was evaluated by 5-fold cross-validation; the AUCs were sorted in descending order and the top 30 windows were retained. The results are shown in Table 4.

Table 4: Diagnostic performance of genomic instability in individual windows in distinguishing benign from malignant pulmonary nodules

Window                       AUC
chr6_46000000_48000000       0.714252006
chr14_78000000_80000000      0.712321421
chr5_146000000_148000000     0.710674157
chr19_2000000_4000000        0.610669698
chr12_132000000_133851895    0.609940366
chr7_38000000_40000000       0.60479214
chr19_1_2000000              0.603827247
chr19_48000000_50000000      0.603462182
chr19_16000000_18000000      0.603125
chr3_130000000_132000000     0.603097516
chr19_58000000_59128983      0.60273285
chr7_68000000_70000000       0.602444698
chr16_1_2000000              0.602203301
chr19_10000000_12000000      0.600447683
chr7_24000000_26000000       0.597572858
chr16_2000000_4000000        0.597550913
chr19_44000000_46000000      0.59498906
chr19_12000000_14000000      0.591707066
chr4_1_2000000               0.587697507
chr10_134000000_135534747    0.586424684
chr21_8000000_10000000       0.581465063
chr19_52000000_54000000      0.580675035
chr16_88000000_90000000      0.580552576
chr12_24000000_26000000      0.567310393
chr1_232000000_234000000     0.563031074
chr11_1_2000000              0.56020014
chr19_46000000_48000000      0.554406601
chr9_18000000_20000000       0.551078982
chr3_152000000_154000000     0.544772647
chr8_120000000_122000000     0.539813027

On evaluation, combining the top 3 windows (chr6_46000000_48000000, chr14_78000000_80000000, chr5_146000000_148000000) gave excellent diagnostic performance; the genomic instability model therefore uses the chr6_46000000_48000000, chr14_78000000_80000000, and chr5_146000000_148000000 windows.

(3) The fragment size distribution panel was screened as follows. The fragment value of each chromosome arm was extracted from each sample as input data, and the chromosome arms with a Wilcoxon rank sum test P < 0.05 and a fold change > 1.2 between benign nodules and lung cancer were retained. The remaining chromosome arms were then further reduced in dimensionality with recursive feature elimination (RFE). Using the current samples, logistic regression models were fitted with the candidate chromosome arms and the area under the curve (AUC) was evaluated by 5-fold cross-validation; the AUCs were sorted in descending order. The results are shown in Table 5.

Table 5: Diagnostic performance of fragment size in individual chromosome arms in distinguishing benign from malignant pulmonary nodules

Figure BDA0003981740380000191

Figure BDA0003981740380000201

On evaluation, combining the top 3 chromosome arms (chr2q, chr11p, chr14p) gave excellent diagnostic performance; the fragment size model therefore uses the chr2q, chr11p, and chr14p chromosome arms.

In summary, the tumor marker panel that combines fragment size pattern, genomic instability, and methylation level to distinguish benign from malignant pulmonary nodules is defined as a logistic regression model comprising methylation of the PTGER4, RASSF1, SHOX2, and PCDHGB6 genes, instability values of the chr6_46000000_48000000, chr14_78000000_80000000, and chr5_146000000_148000000 windows, and fragment size ratios of the chr2q, chr11p, and chr14p chromosome arms.

Using the current samples (the order of the 128 samples is randomly shuffled; each time, 80% of the 128 samples are used for training and the remaining 20% for independent validation), a logistic regression model was fitted for each type of marker to obtain the score for that type.

Using the current samples, the scores of the three types of markers were used as input data to fit a logistic regression model, which assigns a malignancy probability (hereinafter LC Score) to each sample to be tested (sample details are given in Table 7). To reflect the stability of the model, parameters such as sensitivity and specificity are averages over 100 test iterations. The distribution of malignancy probability in benign nodules and lung cancer is shown in Figure 1, and the performance of LC Score in distinguishing benign from malignant pulmonary nodules is shown in Figure 2 and Table 6.
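The resampling scheme described above (shuffle 128 samples, train on 80%, validate on 20%, repeat 100 times, average the metrics) can be sketched as follows. The rounding of the 80% share down to 102 training samples, the fixed seed, and the function names are assumptions for illustration; the source does not state them.

```python
import random


def repeated_splits(n_samples=128, train_fraction=0.8, n_iterations=100, seed=0):
    """Yield (train_indices, test_indices) for each shuffled 80/20 split."""
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    n_train = int(n_samples * train_fraction)
    for _ in range(n_iterations):
        order = list(range(n_samples))
        rng.shuffle(order)
        yield order[:n_train], order[n_train:]


def mean_metric(values):
    """Average a metric (e.g. sensitivity) over all iterations."""
    return sum(values) / len(values)
```

In use, a model would be fitted on each training split, a metric computed on the matching test split, and the 100 resulting values passed to `mean_metric`.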

Table 6: Confusion matrix of LC Score for distinguishing benign from malignant pulmonary nodules

                            Malignant nodules (lung cancer)   Benign nodules   Positive predictive value / Negative predictive value
Positive                    68                                7                90.67%
Negative                    8                                 45               84.91%
Sensitivity / Specificity   89.47%                            86.54%           88.28%

Note: the sensitivity is 89.47%, the specificity is 86.54%, and the accuracy is 88.28%.
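The percentages in Table 6 follow directly from its four counts, which the short check below reproduces. The function name is illustrative; the formulas are the standard definitions of each metric.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Standard diagnostic metrics from a 2x2 confusion matrix."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }


# Counts from Table 6: 68 true positives, 7 false positives,
# 8 false negatives, 45 true negatives.
table6 = diagnostic_metrics(tp=68, fp=7, fn=8, tn=45)
for name, value in table6.items():
    print(f"{name}: {value * 100:.2f}%")
```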

Example 3

The LC Score model was applied to 20 pairs (40 samples in total) from patients who were positive before surgery and relapsed after surgery, and to 11 pairs (22 samples in total) from lung cancer patients who were positive before surgery and did not relapse after surgery, to assess the sensitivity of the LC Score model for monitoring the risk of postoperative recurrence of lung cancer. The results show that LC Score can accurately distinguish samples with postoperative recurrence from those without; the evaluation results are shown in Figure 3.

Example 4

The 10-marker combination of Example 2 (methylation levels of the PTGER4, RASSF1, SHOX2, and PCDHGB6 genes; genomic instability of the chr6_46000000_48000000, chr14_78000000_80000000, and chr5_146000000_148000000 windows; and fragment size distribution of chr2q, chr11p, and chr14p) was used as experimental group 1, and experimental groups 2 and 3 were set up alongside it:

Experimental group 2 consists of 3 markers (methylation levels of the PTGER4, RASSF1, and SHOX2 genes);

Experimental group 3 consists of a combination of 15 markers (the 10 markers above plus genomic instability of the chr19_2000000_4000000 and chr12_132000000_133851895 windows and fragment size distribution of chr1p, chr6p, and chr9p);

Each of the 3 marker combinations was used to fit a logistic regression model by the method of Example 2, and ROC curve analysis was performed on the 128 samples described in Example 2.

The ROC curve of experimental group 2 is shown in Figure 4, with an AUC of 0.772; the ROC curve of experimental group 1 is shown in Figure 2, with an AUC of 0.928; the ROC curve of experimental group 3 is shown in Figure 5, with an AUC of 0.92.

Table 7: Sample information

Figure BDA0003981740380000211

Figure BDA0003981740380000221

Figure BDA0003981740380000231

Figure BDA0003981740380000241

Figure BDA0003981740380000251

Figure BDA0003981740380000261

Table 8: Sample information

Figure BDA0003981740380000262

Figure BDA0003981740380000271

The above are merely preferred embodiments of the present invention and are not intended to limit it; those skilled in the art may make various modifications and changes. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (10)

1. Use of a reagent for detecting a marker combination in the preparation of a product for distinguishing benign from malignant pulmonary nodules and/or predicting the risk of postoperative recurrence of lung cancer, characterized in that the marker combination comprises the following three types of markers: the methylation level of a target gene, genomic instability within a target chromosome window, and the fragment size distribution within a target chromosome arm;
wherein the target gene comprises at least one of the PTGER4 gene, the RASSF1 gene, the SHOX2 gene and the PCDHGB6 gene;
the target chromosome window comprises at least one of chr6_46000000_48000000, chr14_78000000_80000000 and chr5_146000000_148000000;
the target chromosome arm comprises at least one of chr2q, chr11p and chr14p.

2. The use according to claim 1, characterized in that the genomic instability within the target chromosome window is represented by a genomic instability score within the target chromosome window;
preferably, the genomic instability score is calculated by any one of: a Z-score algorithm based on sequencing depth, a log2ratio algorithm based on control samples, an algorithm based on soft-clipped breakpoint reads, and the BinCount_i formula;
preferably, BinCount_i is calculated as follows:
[Formula image: Figure FDA0003981740370000011]
where BinCount_i is the genomic instability score; Fragment_i is the number of reads within the i-th window; TotalMappedFragments is the total number of reads for the sample; and WindowLength_i is the length of the i-th window.
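The formula image for BinCount_i is not available in this text, so the following is only a minimal sketch of one plausible reading of the definitions above: the read count of a window normalized by the sample's total read count and by the window length. The exact form, the `scale` factor, the half-open treatment of window coordinates, and all function names are assumptions made for illustration and are not taken from the patent. A generic depth-based Z-score against control samples is included as a sketch of the first listed alternative.

```python
import statistics


def parse_window(name):
    """Split a window name such as 'chr6_46000000_48000000' into its parts."""
    chrom, start, end = name.split("_")
    return chrom, int(start), int(end)


def window_length(name):
    # Assumption: coordinates are half-open, so length = end - start.
    _, start, end = parse_window(name)
    return end - start


def bin_count(fragments, total_mapped_fragments, length, scale=1.0):
    """Assumed normalization: Fragment_i / (TotalMappedFragments * WindowLength_i).

    `scale` is a hypothetical constant (for example 1e9) that only rescales
    the score; the patent text shown here does not state one.
    """
    if total_mapped_fragments <= 0 or length <= 0:
        raise ValueError("total_mapped_fragments and length must be positive")
    return scale * fragments / (total_mapped_fragments * length)


def z_score(value, control_values):
    """Generic Z-score of one sample's window value against control samples."""
    mean = statistics.mean(control_values)
    sd = statistics.stdev(control_values)  # sample standard deviation
    if sd == 0:
        raise ValueError("control values have zero variance")
    return (value - mean) / sd
```

Under these assumptions, a 2 Mb window such as chr6_46000000_48000000 holding 200 of 1,000,000 mapped reads would score `bin_count(200, 1_000_000, window_length("chr6_46000000_48000000"), scale=1e9)`.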
3. The use according to claim 1, characterized in that the fragment size distribution refers to the ratio of, or the difference between, the numbers of long fragments and short fragments within the target chromosome window, the long fragments being 101 to 220 bp in length and the short fragments being 20 to 100 bp in length.

4. The use according to any one of claims 1 to 3, characterized in that the product is selected from any one of a reagent, a kit and a prediction model.

5. A reagent or kit for distinguishing benign from malignant pulmonary nodules and/or predicting the risk of postoperative recurrence of lung cancer, characterized in that it comprises the reagent for detecting the marker combination according to any one of claims 1 to 4.

6. A method for training a prediction model for the benign or malignant status of pulmonary nodules and/or for predicting the risk of postoperative recurrence of lung cancer, characterized in that it comprises:
obtaining the detection result of each marker in a marker combination in training samples, together with the corresponding labels; wherein each label represents the benign or malignant status of the sample's pulmonary nodule and/or the risk of postoperative recurrence of lung cancer, and the marker combination is the marker combination according to any one of claims 1 to 4;
inputting the detection results of all markers in the marker combination, or the scores of the three types of markers, into a pre-built prediction model to obtain prediction results; wherein the score of each type of marker is obtained as follows: obtaining the detection result of each marker of that type in the training samples, together with the corresponding labels; inputting the detection results of all markers of that type into a pre-built prediction model, and taking the predicted result as the score of that type of marker; the pre-built prediction model is a machine learning model capable of predicting the benign or malignant status of a sample's pulmonary nodule and/or the risk of postoperative recurrence of lung cancer from the detection results of the marker combination, the scores of the three types of markers, or the detection results of each marker within each type;
updating the parameters of the prediction model based on the labels and the prediction results.

7. The training method according to claim 6, characterized in that the machine learning model comprises a logistic regression model.

8. A prediction device for the benign or malignant status of pulmonary nodules and/or for predicting the risk of postoperative recurrence of lung cancer, characterized in that it comprises:
an acquisition module, configured to obtain the detection result of each marker in the marker combination of a sample to be tested; wherein the marker combination is the marker combination according to any one of claims 1 to 4;
a prediction module, configured to input the detection results of all markers, or the scores of the three types of markers, into a prediction model trained by the training method according to claim 6 or 7, to obtain a prediction result for the sample to be tested; the score of each type of marker being obtained as described in claim 6.

9. An electronic device, characterized in that it comprises a processor and a memory; the memory is configured to store a program which, when executed by the processor, causes the processor to implement the training method according to claim 6 or 7, or a prediction method for the benign or malignant status of pulmonary nodules and/or for predicting the risk of postoperative recurrence of lung cancer;
the prediction method comprises: obtaining the detection result of each marker in the marker combination of a sample to be tested, the marker combination being the marker combination according to any one of claims 1 to 4; and inputting the detection results of all markers, or the scores of the three types of markers, into a prediction model trained by the training method according to claim 6 or 7, to obtain a prediction result for the sample to be tested, the score of each type of marker being obtained as described in claim 6.

10. A computer-readable medium, characterized in that a computer program is stored on the computer-readable medium, and the computer program, when executed by a processor, implements the training method according to claim 6 or 7 or the prediction method according to claim 9.
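The long/short fragment ratio of claim 3 and the two-stage scoring of claims 6 and 7 can be sketched as follows. This is an illustrative sketch only, not the applicants' implementation: the plain gradient-descent logistic regression, the function names, the learning rate and epoch count, and every toy feature value are assumptions. Each marker type gets its own logistic regression whose predicted probability serves as that type's score, and a final logistic regression combines the three scores. For brevity the second stage is fitted on in-sample first-stage scores; a real pipeline would normally use held-out or cross-validated scores.

```python
import math


def long_short_ratio(fragment_lengths):
    """Ratio of long (101-220 bp) to short (20-100 bp) fragment counts."""
    n_long = sum(1 for n in fragment_lengths if 101 <= n <= 220)
    n_short = sum(1 for n in fragment_lengths if 20 <= n <= 100)
    if n_short == 0:
        raise ValueError("no short fragments in this region")
    return n_long / n_short


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(weights, row):
    # weights holds one coefficient per feature, with the bias stored last.
    return sigmoid(sum(w * x for w, x in zip(weights, row)) + weights[-1])


def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Batch gradient descent on log loss: each epoch updates the parameters
    from the gap between predictions and labels."""
    weights = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for row, label in zip(X, y):
            err = predict_proba(weights, row) - label
            for j, value in enumerate(row):
                grad[j] += err * value
            grad[-1] += err
        weights = [w - lr * g / len(X) for w, g in zip(weights, grad)]
    return weights


def fit_stacked(groups, y):
    """groups maps a marker-type name to per-sample feature rows."""
    type_models = {name: fit_logistic(X, y) for name, X in groups.items()}
    scores = [[predict_proba(type_models[name], groups[name][i])
               for name in type_models] for i in range(len(y))]
    return type_models, fit_logistic(scores, y)


def predict_stacked(type_models, final_model, sample):
    scores = [predict_proba(model, sample[name])
              for name, model in type_models.items()]
    return predict_proba(final_model, scores)


# Toy, made-up data: label 1 = malignant, 0 = benign.
toy_groups = {
    "methylation": [[0.1], [0.2], [0.8], [0.9]],
    "instability": [[0.0], [0.1], [0.9], [1.0]],
    "fragment_size": [[0.9], [0.8], [0.2], [0.1]],
}
toy_labels = [0, 0, 1, 1]
type_models, final_model = fit_stacked(toy_groups, toy_labels)
p_malignant = predict_stacked(type_models, final_model, {
    "methylation": [0.85], "instability": [0.95], "fragment_size": [0.15]})
p_benign = predict_stacked(type_models, final_model, {
    "methylation": [0.15], "instability": [0.05], "fragment_size": [0.85]})
```

Feeding the three per-type scores into the final model, rather than all raw marker values, corresponds to the "scores of the three types of markers" branch of claim 6; the other branch would pass the concatenated raw detection results to a single `fit_logistic` call.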
CN202211552230.XA 2022-12-05 2022-12-05 Model for identifying benign and malignant lung nodules or predicting risk of postoperative recurrence of lung cancer, kit and application Pending CN115798582A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211552230.XA CN115798582A (en) 2022-12-05 2022-12-05 Model for identifying benign and malignant lung nodules or predicting risk of postoperative recurrence of lung cancer, kit and application

Publications (1)

Publication Number Publication Date
CN115798582A true CN115798582A (en) 2023-03-14

Family

ID=85445821

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211552230.XA Pending CN115798582A (en) 2022-12-05 2022-12-05 Model for identifying benign and malignant lung nodules or predicting risk of postoperative recurrence of lung cancer, kit and application

Country Status (1)

Country Link
CN (1) CN115798582A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119152926A (en) * 2024-11-20 2024-12-17 北京大学人民医院 Prediction model for predicting prognosis of lung cancer after neoadjuvant immunotherapy through primary tumor and construction method thereof
CN119152926B (en) * 2024-11-20 2025-02-18 北京大学人民医院 Prediction model and construction method for predicting the prognosis of lung cancer after neoadjuvant immunotherapy based on primary tumor

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination