CN115798582A - Model for identifying benign and malignant lung nodules or predicting risk of postoperative recurrence of lung cancer, kit and application - Google Patents
- Publication number
- CN115798582A (application number CN202211552230.XA)
- Authority
- CN
- China
- Prior art keywords
- marker
- markers
- sample
- lung
- lung cancer
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Landscapes
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
Abstract
The invention discloses a model for identifying benign and malignant lung nodules or predicting the risk of postoperative recurrence of lung cancer, a kit and an application, and relates to the technical field of biomedicine. The invention finds that the methylation levels of the PTGER4, RASSF1, SHOX2 and PCDHGB6 genes, the genomic instability of the chr6_46000000_48000000, chr14_78000000_80000000 and chr5_146000000_148000000 windows, and the fragment size distribution of chr2q, chr11p and chr14p can be used as markers to effectively identify benign and malignant lung nodules and to monitor postoperative recurrence of lung cancer, with the advantages of high sensitivity, good specificity and low detection cost.
Description
Technical Field
The invention relates to the technical field of biomedicine, and in particular to a model for identifying benign and malignant lung nodules or predicting the risk of postoperative recurrence of lung cancer, a kit and an application.
Background
With the spread of low-dose CT scanning and improvements in its resolution, lung nodules are found more and more often in clinical practice. Resecting a lung nodule is not difficult; the real test of clinical skill is correctly determining its nature. Radiologically, a lung nodule is defined as a well-defined lesion completely surrounded by lung parenchyma with a diameter of 30 mm or less. Lesions larger than 30 mm in diameter are masses rather than nodules and carry a significantly higher probability of malignancy. Incidental pulmonary nodules are nodules found by chance, without corresponding signs or symptoms. Multiple incidental lung nodules are sometimes found, in which case the diagnostic assessment focuses on the dominant or most suspicious nodule (e.g., the largest or a growing nodule). By morphology, nodules are classified as solid or subsolid, and subsolid nodules are further divided into pure ground-glass nodules (without solid components) and part-solid nodules (containing both ground-glass and solid components). Common benign pulmonary nodules include hamartomas, granulomas, rheumatoid nodules, arteriovenous malformations, infections (including tuberculosis and fungal infections), intrapulmonary lymph nodes, amyloidosis, and the like. In short, determining the nature of a pulmonary nodule is the basis for deciding on surgery: it avoids both resecting a benign nodule misjudged as malignant and delaying treatment of a malignant nodule misjudged as benign. Because lung cancer has a high mortality rate among malignant tumors and many patients are diagnosed at an advanced stage, patients with lung nodules need to be evaluated for malignancy risk, and whether to take further diagnostic and therapeutic measures is decided according to the level of that risk.
To date, low-dose computed tomography (LDCT) is the primary strategy for long-term, large-scale reduction of lung cancer-related mortality in high-risk asymptomatic populations. Two large randomized controlled trials, the National Lung Screening Trial (NLST) and the Dutch-Belgian lung cancer screening trial (NELSON), have demonstrated that LDCT-based screening can reduce lung cancer-related mortality in high-risk groups by more than 20%, a statistically significant reduction. Low-dose helical CT can be used for lung cancer screening and detects some asymptomatic patients with pulmonary nodules, but it cannot accurately distinguish benign from malignant nodules. Suspicious nodules detected by LDCT can be further diagnosed by lung biopsy (including bronchoscopy and percutaneous needle aspiration). However, peripheral lung lesions and lesions not visible under the bronchoscope remain a challenge for biopsy-based diagnosis. PET-CT is recommended for characterizing solid nodules larger than 8 mm, and its sensitivity for diagnosing malignant lung nodules is 72%-94%, but it is expensive and therefore unsuitable for distinguishing benign from malignant nodules in the general population of nodule patients. Histopathological biopsy is the "gold standard" for diagnosing malignancy; however, tissue biopsy is invasive and relatively complex to perform, and small tumors may require multiple procedures to obtain sufficient tissue, so it is not suitable for large-scale lung cancer screening.
In view of the above, the present invention is particularly proposed.
Disclosure of Invention
The invention aims to provide a model for identifying benign and malignant lung nodules or predicting the risk of postoperative recurrence of lung cancer, a kit and application.
The invention is realized by the following steps:
In a first aspect, the embodiments of the present invention provide the use of an agent for detecting a marker combination in the manufacture of a product for identifying benign and malignant lung nodules and/or predicting the risk of postoperative recurrence of lung cancer, the marker combination comprising the following three classes of markers: methylation levels of the target gene, genomic instability within the target chromosome window, and fragment size distribution within the target chromosome arm; wherein the target gene comprises: at least one of the PTGER4 gene, RASSF1 gene, SHOX2 gene and PCDHGB6 gene; the target chromosome window includes: at least one of chr6_46000000_48000000, chr14_78000000_80000000 and chr5_146000000_148000000; the target chromosome arm comprises: at least one of chr2q, chr11p and chr14p.
In a second aspect, embodiments of the present invention provide an agent or kit for identifying benign and malignant lung nodules and/or predicting risk of postoperative recurrence of lung cancer, comprising: reagents for detecting a combination of markers as described in the previous examples.
In a third aspect, an embodiment of the present invention provides a method for training a prediction model for benign and malignant lung nodules and/or predicting risk of postoperative recurrence of lung cancer, including: obtaining a detection result and a corresponding labeling result of each marker in a marker combination in a training sample; wherein the labeling result is a label representing benign and malignant lung nodules of the sample and/or predicting risk of recurrence after lung cancer surgery, and the marker combination is the marker combination described in the previous embodiment; inputting the detection results of all the markers in the marker combination or the scores of the three types of markers into a pre-constructed prediction model to obtain a prediction result; the score of each type of marker is obtained in the following mode: obtaining a detection result and a corresponding labeling result of each marker in each type of marker in a training sample; inputting the detection results of all the markers in each type of marker into a pre-constructed prediction model, and taking the prediction results as the scores of each type of marker; the pre-constructed prediction model is a machine learning model capable of predicting benign and malignant lung nodules of a sample and/or predicting the risk of postoperative recurrence of the lung cancer according to the detection result of the marker combination, the scores of the three types of markers or the detection result of each marker in each type of markers; and updating parameters of the prediction model based on the labeling result and the prediction result.
In a fourth aspect, embodiments of the present invention provide a device for predicting benign and malignant lung nodules and/or predicting risk of postoperative recurrence of lung cancer, including an acquisition module and a prediction module. The acquisition module is used for acquiring the detection result of each marker in the marker combination of the foregoing embodiment for a sample to be tested; the prediction module is used for inputting the detection results of all the markers or the scores of the three types of markers into the prediction model trained by the training method of the foregoing embodiment to obtain the prediction result of the sample to be tested; the score for each type of marker is obtained as described in the foregoing embodiment.
In a fifth aspect, an embodiment of the present invention provides an electronic device, which includes: a processor and a memory; the memory for storing a program that, when executed by the processor, causes the processor to implement the training method or the method of predicting benign and malignant lung nodules and/or the risk of post-operative recurrence of lung cancer of the preceding embodiments; the prediction method comprises the following steps: obtaining the detection result of each marker in the marker combination of the sample to be detected; the marker combination is the marker combination described in the previous embodiment; inputting the detection results of all the markers or the scores of the three types of markers into the prediction model trained by the training method in the embodiment to obtain the prediction result of the sample to be tested; the score for each class of markers was obtained as described in the previous examples.
In a sixth aspect, an embodiment of the present invention provides a computer-readable medium, where a computer program is stored, and the computer program, when executed by a processor, implements the training method described in the foregoing embodiment or the prediction method described in the foregoing embodiment.
The invention has the following beneficial effects:
The invention uses methylation detection data and, through ensemble learning of the fragment size differences, copy number differences and methylation differences between populations with malignant and benign lung nodules, identifies genomic and epigenomic signals specific to lung cancer. These signals can serve as new tumor markers, enabling effective identification of benign and malignant lung nodules and monitoring of postoperative recurrence of lung cancer.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained according to the drawings without inventive efforts.
FIG. 1 shows the distribution of LC Score in benign and malignant nodules;
FIG. 2 is the ROC curve of the LC Score for identifying benign and malignant lung nodules;
FIG. 3 shows the LC Score used for monitoring the risk of postoperative recurrence of lung cancer;
FIG. 4 is a ROC plot for experimental group 2 of example 4;
FIG. 5 is a ROC plot for experimental group 3 of example 4.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below. The examples, in which specific conditions are not specified, were conducted under conventional conditions or conditions recommended by the manufacturer. The reagents or instruments used are conventional products which are not indicated by manufacturers and are commercially available.
First, the embodiments of the present invention provide the use of an agent for detecting a marker combination for the preparation of a product for identifying benign and malignant lung nodules and/or predicting the risk of postoperative recurrence of lung cancer, wherein the marker combination comprises the following three types of markers: methylation levels of the target gene, genomic instability within the target chromosomal window, and fragment size distribution within the target chromosomal arm;
wherein the target gene comprises: at least one of PTGER4 gene, RASSF1 gene, SHOX2 gene and PCDHGB6 gene;
the target chromosome window includes: at least one of chr6_46000000_48000000, chr14_78000000_80000000 and chr5_146000000_148000000;
the target chromosome arm comprises: at least one of chr2q, chr11p and chr14p.
The inventors used methylation detection data and found this marker combination through ensemble learning of the fragment size differences, copy number differences and methylation differences between populations with malignant and benign lung nodules. The combination enables identification of benign and malignant lung nodules and monitoring of recurrence after lung cancer surgery, and has better sensitivity, specificity and accuracy than other marker combinations.
In the case where a target gene, a target chromosome window, and a target chromosome arm have been disclosed, a detection method and a calculation method of the methylation level of the target gene, genomic instability within the target chromosome window, and fragment size distribution within the target chromosome arm can be obtained based on the prior knowledge.
As used herein, "methylation level of a target gene" refers to the methylation level of a promoter of the target gene.
The "window" herein refers to a window in bioinformatics analysis. chr6_46000000_48000000 refers to the genome segment from base 46000000 to base 48000000 of chromosome 6, with hg19 as the reference genome, and so on for the other target chromosome windows.
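The window naming convention can be illustrated with a short sketch. The `Window` type and the `parse_window` function are hypothetical names introduced here for illustration only; they are not part of the disclosure.

```python
from typing import NamedTuple


class Window(NamedTuple):
    """A genomic window: chromosome name plus start and end base positions."""
    chrom: str
    start: int
    end: int


def parse_window(label: str) -> Window:
    """Split a label such as 'chr6_46000000_48000000' into its three parts."""
    chrom, start, end = label.split("_")
    return Window(chrom, int(start), int(end))


w = parse_window("chr6_46000000_48000000")
print(w.chrom, w.end - w.start)  # chromosome name and window span in bases
```

Each of the three target windows spans 2,000,000 bases under this reading of the labels.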
Herein, "q" in "chr2q" refers to the long arm of the corresponding chromosome, and "p" in "chr11p and chr14p" refers to the short arm of the corresponding chromosome.
In some embodiments, the method of detecting the methylation level of a target gene comprises any one of: methylation-specific PCR (MS-PCR), bisulfite treatment plus sequencing, combined bisulfite restriction analysis (COBRA), fluorescence quantification, methylation-sensitive high-resolution melting curve analysis, pyrosequencing, chip-based methylation profiling, and high-throughput sequencing.
In some embodiments, the genomic instability within the window of target chromosomes is represented by a genomic instability score within the window of target chromosomes.
Optionally, the method of calculating the genomic instability score is selected from any one of: a Z-value algorithm based on sequencing depth, a log2 ratio algorithm based on control samples, an algorithm based on soft-clipped breakpoint reads, and the BinCount_i algorithm.
Optionally, BinCount_i is calculated as follows:
where BinCount_i is the genomic instability score; Fragment_i is the number of reads in the i-th window; TotalMappedFragments is the total number of reads for the sample; and WindowLength_i is the length of the i-th window. When the window length is 2,000,000 bp, WindowLength_i is 2E6.
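The formula itself is not reproduced in this text, so the following is only a sketch of one plausible per-million-reads normalization built from the three quantities defined above. The arrangement of terms and the scaling constant (`scale`, defaulting to 1e6) are assumptions, and `bin_count` is a hypothetical function name.

```python
def bin_count(fragment_i: int,
              total_mapped_fragments: int,
              window_length_i: float = 2e6,
              scale: float = 1e6) -> float:
    """Depth- and length-normalized read count for one window.

    ASSUMPTION: reads in the window are divided by the sample's total reads
    and by the window length, then multiplied by `scale`. The source text
    names these quantities but does not show how they are combined.
    """
    if total_mapped_fragments <= 0 or window_length_i <= 0:
        raise ValueError("total reads and window length must be positive")
    return fragment_i * scale / (total_mapped_fragments * window_length_i)


# Two libraries of different depth with the same proportion of reads in the
# window produce the same normalized value.
shallow = bin_count(4_000, 2_000_000)
deep = bin_count(40_000, 20_000_000)
print(shallow, deep)
```

Under any normalization of this kind, the point is that the score depends on the share of reads falling in the window rather than on the library size.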
In some embodiments, the fragment size distribution refers to the ratio or difference in the number of long fragments to short fragments within a window of a target chromosome, the long fragments being 101-220 bp in length and the short fragments being 20-100bp in length.
The "ratio or difference in the number of long fragments to short fragments" herein means the ratio of the number of long fragments to the number of short fragments, the ratio of the number of short fragments to the number of long fragments, or a numerical value representing the difference between long fragments and short fragments.
In some embodiments, the surgery referred to in "postoperative recurrence of lung cancer" includes conventional lung cancer surgery, specifically any one or a combination of: lymph node dissection, wedge resection, segmentectomy, lobectomy, pneumonectomy and minimally invasive thoracoscopic surgery.
In some embodiments, the product is selected from: any one of a reagent, a kit and a predictive model.
In another aspect, the embodiments of the present invention provide an agent or a kit for identifying benign and malignant lung nodules and/or predicting risk of postoperative recurrence of lung cancer, comprising: a reagent for detecting a combination of markers as described in any of the preceding examples.
In another aspect, an embodiment of the present invention provides a method for training a prediction model for benign and malignant lung nodules and/or predicting risk of postoperative recurrence of lung cancer, including:
obtaining a detection result and a corresponding labeling result of each marker in a marker combination in a training sample; wherein the labeling result is a label representing benign and malignant lung nodules of the sample and/or predicting risk of postoperative recurrence of lung cancer, and the marker combination is the marker combination described in any embodiment;
inputting the detection results of all the markers in the marker combination or the scores of the three types of markers into a pre-constructed prediction model to obtain a prediction result; the score of each type of marker is obtained in the following mode: obtaining a detection result and a corresponding labeling result of each marker in each type of marker in a training sample; inputting the detection results of all the markers in each type of marker into a pre-constructed prediction model, and taking the prediction results as the scores of each type of marker; the pre-constructed prediction model is a machine learning model capable of predicting benign and malignant lung nodules of a sample and/or predicting the risk of postoperative recurrence of the lung cancer according to the detection result of the marker combination, the scores of the three types of markers or the detection result of each marker in each type of markers;
and updating parameters of the prediction model based on the labeling result and the prediction result.
Optionally, the label is a character or a string of characters.
The "three types of markers" herein refer to: the methylation level of the target gene, genomic instability within the target chromosome window, and fragment size distribution within the target chromosome arm; "each type of marker" refers to any one of these.
The prediction model can be constructed with the detection results of all the markers in the marker combination as input data, or with the scores of the three types of markers as input data; both approaches give similar predictive performance. By comparison, the model constructed with the scores of the three types of markers as input data performs better.
In some embodiments, the machine learning model comprises a logistic regression model.
Alternatively, the number of training samples may be equal to or greater than any of 10, 50, 100, 200, 300, 400, and 500.
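The training procedure described above (input marker scores, obtain a prediction, update parameters from the labeling result and the prediction result) can be sketched with a logistic regression fitted by plain gradient descent. This is a minimal illustration, not the inventors' implementation: the toy scores, the learning rate, the number of epochs and the function names `predict` and `train_logistic` are all assumptions.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, x):
    """Predicted probability of the positive label for one sample."""
    return sigmoid(bias + sum(w * v for w, v in zip(weights, x)))


def train_logistic(samples, labels, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n_features = len(samples[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(samples, labels):
            err = predict(weights, bias, x) - y  # prediction minus label
            for j, v in enumerate(x):
                grad_w[j] += err * v
            grad_b += err
        # Parameter update based on labels and predictions.
        weights = [w - lr * g / len(samples) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(samples)
    return weights, bias


# Hypothetical per-class scores: (methylation, instability, fragment size).
X = [(0.9, 0.8, 0.7), (0.8, 0.6, 0.9), (0.7, 0.9, 0.6),
     (0.2, 0.1, 0.3), (0.3, 0.2, 0.1), (0.1, 0.3, 0.2)]
y = [1, 1, 1, 0, 0, 0]  # 1 = malignant, 0 = benign (illustrative labels)

weights, bias = train_logistic(X, y)
print(predict(weights, bias, (0.85, 0.8, 0.8)))
print(predict(weights, bias, (0.15, 0.2, 0.2)))
```

Using the three class scores as inputs, rather than every individual marker, corresponds to the second of the two input options described in the text.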
In some embodiments, the test sample or the training sample is independently selected from: a plasma sample, a serum sample, a whole blood sample, a negative standard, or a positive standard. Optionally, the sample to be tested or the training sample may also be selected from: an environmental sample comprising at least one of a plasma sample or a serum sample.
In another aspect, an embodiment of the present invention provides a device for predicting benign and malignant lung nodules and/or predicting risk of recurrence after lung cancer surgery, including:
the acquisition module is used for acquiring the detection result of each marker in the marker combination of the sample to be detected; wherein the marker combination is the marker combination of any of the preceding embodiments;
the prediction module is used for inputting the detection results of all the markers, or the scores of the three types of markers, of the sample to be tested into the prediction model trained by the training method of any of the preceding embodiments to obtain the prediction result of the sample to be tested; the score for each type of marker is obtained as described in any of the preceding embodiments.
Alternatively, the modules described in the above embodiments may be stored in a memory in the form of software or Firmware (Firmware) or be fixed in an Operating System (OS) of the electronic device provided in the present application, and may be executed by a processor in the electronic device. Meanwhile, data, codes of programs, and the like required to execute the above modules may be stored in the memory.
In another aspect, an embodiment of the present invention provides an electronic device, which includes: a processor and a memory; the memory for storing a program that, when executed by the processor, causes the processor to implement the training method or the predictive method of malignancy of a lung nodule and/or predicting risk of postoperative recurrence of lung cancer of any of the preceding embodiments;
the prediction method comprises the following steps: obtaining the detection result of each marker in the marker combination of the sample to be detected; the marker combination is the marker combination of any of the preceding embodiments; inputting the detection results of all markers or the scores of the three types of markers of the sample to be detected into the prediction model trained by the training method in any of the embodiments to obtain the prediction result of the sample to be detected; the score for each type of marker is obtained as described in any of the preceding examples.
The electronic device may include a memory, a processor, a bus, and a communication interface, which are electrically connected to each other, directly or indirectly, to enable the transfer or interaction of data. For example, the components may be electrically connected to each other via one or more buses or signal lines.
The memory may be, but is not limited to, a random access memory (RAM), a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), and the like.
The processor may be an integrated circuit chip having signal processing capabilities. The processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
The electronic device may be a server, a cloud platform, a mobile phone, a tablet computer, a notebook computer, an ultra-mobile personal computer (UMPC), a handheld computer, a netbook, a Personal Digital Assistant (PDA), a wearable electronic device, a virtual reality device, or the like, and thus the embodiment of the present application does not limit the type of the electronic device.
Furthermore, an embodiment of the present invention provides a computer-readable medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the training method according to any of the foregoing embodiments or the prediction method according to any of the foregoing embodiments.
The computer readable medium may be a general storage medium such as a removable disk, a hard disk, etc.
The features and properties of the present invention are described in further detail below with reference to examples.
Example 1
Lung cancer specific Differentially Methylated Region (DMR) gene panel (panel) design:
Pan-cancer methylation data, TCGA Pan-Cancer (PANCAN), were downloaded from the XENA database, covering malignancies including adrenocortical carcinoma, bladder urothelial carcinoma, invasive breast carcinoma, cervical squamous cell carcinoma and endocervical adenocarcinoma, cholangiocarcinoma, colon adenocarcinoma, diffuse large B-cell lymphoma, esophageal carcinoma, glioblastoma multiforme, head and neck squamous cell carcinoma, renal chromophobe carcinoma, renal clear cell carcinoma, papillary renal cell carcinoma, acute myeloid leukemia, brain low-grade glioma, hepatocellular carcinoma, lung adenocarcinoma, lung squamous cell carcinoma, mesothelioma, ovarian serous cystadenocarcinoma, pancreatic carcinoma, pheochromocytoma and paraganglioma, prostate carcinoma, sarcoma, cutaneous melanoma, gastric adenocarcinoma, testicular germ cell tumor, thyroid carcinoma, endometrial carcinoma, uterine carcinosarcoma and uveal melanoma (sample numbers are given in Table 1), together with paracancerous healthy control tissue. The data comprise the methylation sites and their methylation levels (β values).
Table 1: statistics of TCGA sample number
The panel of sites differentially methylated between each cancer type and healthy controls is the union, over the cancer types listed above (adrenocortical carcinoma through uveal melanoma), of the CpG sites that, compared with the paracancerous healthy controls, have a Wilcoxon rank-sum test P < 0.05 and a fold difference > 1.2, where the fold difference is defined as: mean methylation level of the CpG site in the positive population / mean methylation level of the CpG site in the negative population.
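The two selection criteria for one CpG site can be sketched as follows. This is an illustration under stated assumptions: the rank-sum p-value uses the normal approximation without tie or continuity correction (a statistics package may give slightly different values), the β values are made up, and `rank_sum_p` and `is_differential` are hypothetical names.

```python
import math


def rank_sum_p(a, b):
    """Two-sided Wilcoxon rank-sum p-value via the normal approximation."""
    pooled = sorted((v, g) for g, grp in enumerate((a, b)) for v in grp)
    rank_sum_a = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of ranks i+1 .. j for tied values
        rank_sum_a += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    n1, n2 = len(a), len(b)
    mean = n1 * (n1 + n2 + 1) / 2.0
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (rank_sum_a - mean) / sd
    return math.erfc(abs(z) / math.sqrt(2.0))


def is_differential(positive, negative, p_cut=0.05, fold_cut=1.2):
    """Apply the P < 0.05 and fold difference > 1.2 criteria to one CpG site."""
    mean_neg = sum(negative) / len(negative)
    if mean_neg == 0:
        return False  # fold difference undefined; treated as not selected here
    fold = (sum(positive) / len(positive)) / mean_neg
    return rank_sum_p(positive, negative) < p_cut and fold > fold_cut


tumor = [0.80, 0.85, 0.78, 0.90, 0.82, 0.88, 0.79, 0.91]    # made-up β values
control = [0.30, 0.35, 0.28, 0.40, 0.32, 0.38, 0.29, 0.41]  # made-up β values
print(is_differential(tumor, control))
```

The panel for one cancer type is then the set of sites for which this check passes, and the overall panel is the union of those sets across cancer types.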
The tissue-tracing CpG site panel is defined as the union of the CpG sites with τ > 0.85 in each of the cancer types listed above (adrenocortical carcinoma through uveal melanoma); this panel may represent methylation sites that drive carcinogenesis. Here τ is defined as:
In formula 1, n is the number of samples of a given cancer type. In formula 2, x_i is the methylation level of any sample of that cancer type, and max(x_i) is the maximum methylation level among the samples of the same cancer type.
Combining the sites differentially methylated between each cancer type and healthy people with the tissue-tracing CpG sites of each cancer type yields a methylation panel that characterizes pan-cancer prevalence (as shown in Table 2).
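Formulas 1 and 2 are not reproduced in this text. The description (a count n, levels x_i, and normalization by the maximum level) matches the widely used specificity index τ = Σ(1 - x_i / max x) / (n - 1), so the sketch below assumes that form. Treat it as an assumption rather than the patent's exact definition; `tau_index` is a hypothetical name.

```python
def tau_index(levels):
    """Specificity index tau, ASSUMING the standard form
    tau = sum(1 - x_i / max(x)) / (n - 1).

    Returns 0 when all levels are equal and 1 when a single entry carries
    all of the signal.
    """
    n = len(levels)
    peak = max(levels)
    if n < 2 or peak <= 0:
        raise ValueError("need at least two levels and a positive maximum")
    return sum(1.0 - x / peak for x in levels) / (n - 1)


print(tau_index([0.9, 0.05, 0.04, 0.06]))  # highly specific profile
print(tau_index([0.5, 0.5, 0.5, 0.5]))     # uniform profile
```

Under this assumed form, the τ > 0.85 threshold in the text selects CpG sites whose methylation signal is concentrated in very few of the compared entries.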
Table 2: methylated gene panel
Methylation library construction and sequencing:
1. Methylation library construction was performed using 5-30 ng of cfDNA, to which 50 pg of an internal reference DNA mixture (prepared in-house, 166 bp) was added. The sample was then bisulfite-treated, purified and recovered using the Lightning conversion reagent kit from Zymo Research.
2. Library construction was performed using the methylation library construction kit from Yeasen. During library construction, the DNA was purified and recovered using AMPure XP beads (Beckman) and eluted with EB buffer (Qiagen).
3. 500 ng of the pre-capture library DNA was taken, and the target region DNA (DNA fragments carrying the genes shown in Table 2) was captured using hybridization capture reagents from Twist Bioscience.
4. The DNA recovered after washing was amplified using KAPA HiFi HotStart ReadyMix, and the amplification product was purified using AMPure XP beads (Beckman) to obtain the final library.
5. The final library was quantified by qPCR (KAPA SYBR Fast Kit, Roche) and then subjected to 150 bp paired-end sequencing on the Illumina NovaSeq 6000 platform.
Obtaining methylation level observations:
The sequencer output was converted to FASTQ files using bcl2fastq software, and cutadapt software was used to remove low-quality reads, reads in which N bases account for more than 5%, and reads shorter than 50 bp. The remaining reads were aligned to the hg19 reference genome using BSMAP software, and PCR duplicates were removed. The observed methylation levels of all CpG sites within the target region (DNA fragments carrying the genes described in Table 2) were measured using Bismark software.
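The two explicit read-level criteria above (length and N content) can be expressed as a simple predicate. This sketch is only an illustration of the thresholds and does not reproduce how cutadapt is invoked; the quality criterion is omitted because the text gives no threshold for it, and `keep_read` is a hypothetical name.

```python
MIN_LENGTH = 50     # reads shorter than 50 bp are removed
MAX_N_PERCENT = 5   # reads with more than 5% N bases are removed


def keep_read(sequence: str) -> bool:
    """Return True if the read passes the length and N-content criteria."""
    if len(sequence) < MIN_LENGTH:
        return False
    n_count = sequence.upper().count("N")
    # Integer comparison avoids floating-point issues at exactly 5%.
    return n_count * 100 <= MAX_N_PERCENT * len(sequence)


print(keep_read("ACGT" * 20))            # 80 bp, no N bases
print(keep_read("ACGT" * 10))            # 40 bp, too short
print(keep_read("A" * 90 + "N" * 10))    # 10% N bases
```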
Genome instability assessment:
The whole genome was divided into equal-length windows of 2 Mb (two million bases), excluding the X and Y sex chromosomes, the mitochondrial genome, and other contigs. The number of reads in each window was counted using featureCounts. Because library size differs between samples, the counts must be normalized within each sample; a "per million reads" normalization was used, as follows:
where Fragment_i is the number of reads in the i-th window, TotalMappedFragments is the total number of reads for the sample, and WindowLength_i is the length of the i-th window, here 2E6.
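The normalization formula itself is not reproduced in the text, so the following sketch rests on an assumption: that the window count is scaled to reads per million mapped fragments and divided by the window length, using only the three quantities defined above. The function name `bincount` and the scale factor of 1e6 are hypothetical choices for illustration.

```python
def bincount(fragments_in_window: int,
             total_mapped_fragments: int,
             window_length: int = 2_000_000,
             scale: float = 1e6) -> float:
    """Assumed per-million-reads normalization of one window's read count.

    The exact constant is an assumption; only the three input quantities
    come from the description.
    """
    return fragments_in_window * scale / (total_mapped_fragments * window_length)


# Example: 4,000 reads in a 2 Mb window of a sample with 20 million mapped reads.
value = bincount(4000, 20_000_000)
```

Whatever constant is used, it is the same for every window and sample, so it does not change the ranking of windows or the between-group comparisons.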
Fragment size:
The reads in the BAM file of the sample to be tested were reconstructed into the original DNA templates (fragments), and the whole genome was partitioned by chromosome arm into chr1p, chr1q, chr2p, chr2q, chr3p, chr3q, chr4p, chr4q, chr5p, chr5q, chr6p, chr6q, chr7p, chr7q, chr8p, chr8q, chr9p, chr9q, chr10p, chr10q, chr11p, chr11q, chr12p, chr12q, chr13p, chr13q, chr14p, chr14q, chr15p, chr15q, chr16p, chr16q, chr17p, chr17q, chr18p, chr19p, chr21p, chr22p and chr22. For each chromosome arm, the number of fragments with lengths of 20-100 bp and the number with lengths of 101-220 bp were counted separately. Because library size differs between samples, the counts must be normalized within each sample by dividing the fragment count in each length interval by the total fragment count in the chromosome arm: the count in the 20-100 bp range divided by the arm total gives the proportion of "short fragments", and the count in the 101-220 bp range divided by the arm total gives the proportion of "long fragments". The ratio of short fragments (20-100 bp) to long fragments (101-220 bp) in each chromosome arm was then calculated as the fragment size distribution of that arm.
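The per-arm calculation can be sketched as follows, assuming fragments have already been reconstructed and labeled with their chromosome arm; the input format and the function name `arm_size_ratios` are hypothetical. Because the short and long proportions share the same arm total, that total cancels in the ratio, so the sketch divides the counts directly.

```python
from collections import defaultdict

SHORT_RANGE = (20, 100)   # "short fragments", in bp
LONG_RANGE = (101, 220)   # "long fragments", in bp


def arm_size_ratios(fragments):
    """Return {arm: short/long fragment ratio} from (arm, length_bp) pairs."""
    short = defaultdict(int)
    long_ = defaultdict(int)
    for arm, length in fragments:
        if SHORT_RANGE[0] <= length <= SHORT_RANGE[1]:
            short[arm] += 1
        elif LONG_RANGE[0] <= length <= LONG_RANGE[1]:
            long_[arm] += 1
    # Arms without any long fragment are skipped to avoid division by zero.
    return {arm: short[arm] / n_long for arm, n_long in long_.items() if n_long > 0}


fragments = [
    ("chr2q", 90), ("chr2q", 150), ("chr2q", 167),
    ("chr11p", 60), ("chr11p", 180), ("chr11p", 300),  # 300 bp falls in neither range
]
ratios = arm_size_ratios(fragments)
```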
Constructing a multi-modal ensemble learning model:
1. Feature extraction: the methylation level, genomic instability score, and fragment size obtained as described above were extracted from the sequencing data of each sample as input data; the three classes of markers were then reduced in dimension using the LASSO (least absolute shrinkage and selection operator) regression algorithm, and markers with non-zero weights were retained as the region combination used to build the model;
2. Determining the optimal model parameters: the model was built and iteratively trained using logistic regression, and the optimal threshold was determined by the Youden index method;
3. Verifying model performance: the determined optimal parameters and optimal threshold were verified in an independent test set, the receiver operating characteristic (ROC) curve was plotted, and the area under the curve (AUC) was calculated. The LC Score is defined as the final output of the model:
$$Z = \mathrm{coef}_i \times P_{meth} + \mathrm{coef}_j \times P_{bincount} + \mathrm{coef}_k \times P_{fragment} - \mathrm{intercept}$$
where $P_{meth}$ is the methylation model score, $P_{bincount}$ is the genomic instability model score, and $P_{fragment}$ is the fragment size model score; $\mathrm{coef}_i$, $\mathrm{coef}_j$, and $\mathrm{coef}_k$ are the logistic regression weights of methylation, genomic instability, and fragment size distribution, respectively; and $\mathrm{intercept}$ is the intercept term.
In this embodiment, $\mathrm{coef}_i$, $\mathrm{coef}_j$, and $\mathrm{coef}_k$ are 2.98, 1.59, and 3.27, respectively, and the intercept is -5.75. In other embodiments, these parameters may vary with the samples in the particular training and test sets, but this does not significantly affect the final results.
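The weighted sum can be evaluated directly. The sketch below uses the coefficients of this embodiment and follows the sign convention of the formula exactly as printed; the function name `linear_score` is hypothetical. The step that maps Z to the LC Score probability is not shown in the text, so it is left out here.

```python
# Coefficients and intercept reported for this embodiment.
COEF_METH = 2.98
COEF_BINCOUNT = 1.59
COEF_FRAGMENT = 3.27
INTERCEPT = -5.75


def linear_score(p_meth: float, p_bincount: float, p_fragment: float) -> float:
    """Evaluate Z from the three per-class model scores, as the formula is printed."""
    return (COEF_METH * p_meth
            + COEF_BINCOUNT * p_bincount
            + COEF_FRAGMENT * p_fragment
            - INTERCEPT)


# Example with hypothetical per-class scores.
z = linear_score(p_meth=0.8, p_bincount=0.6, p_fragment=0.7)
```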
In this embodiment, the methylation model, the genomic instability model, and the fragment size distribution model (one model per marker class) are constructed in the same way as the final prediction model: each is a logistic regression model trained on the same training samples, and they differ only in their input data.
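The Youden index used in step 2 to choose the optimal threshold selects the cutoff that maximizes J = sensitivity + specificity - 1. A minimal sketch follows; the function name, the "score >= threshold is positive" convention, and the toy data are assumptions made for illustration.

```python
def youden_threshold(scores, labels):
    """Return (threshold, J) maximizing J = sensitivity + specificity - 1.

    labels: 1 for malignant, 0 for benign.
    A sample is called positive when its score is >= threshold.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best_threshold, best_j = None, -1.0
    for threshold in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
        j = tp / n_pos + tn / n_neg - 1
        if j > best_j:  # on ties, the lowest threshold is kept
            best_threshold, best_j = threshold, j
    return best_threshold, best_j


# Toy scores: three malignant and four benign samples.
scores = [0.10, 0.20, 0.30, 0.35, 0.60, 0.70, 0.90]
labels = [0, 0, 0, 1, 0, 1, 1]
threshold, j = youden_threshold(scores, labels)
```

For these toy data the best cutoff is 0.35, where sensitivity is 3/3 and specificity is 3/4.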
Example 2
Candidate biomarker combinations were screened in plasma samples from 52 patients with benign nodules (10 of them, 5 of them, 8 of them with fibroplasia, 12 of them, 7 of them with cryptococcus infection, 10 of them with fungi) and 76 lung cancer patients (30 of them, 12 of them, 14 of them, 20 of them). Detailed sample information is given in Table 7 at the end of the specification.
(1) The methylation panel was screened as follows: for each sample, the methylation levels of all gene promoters within the methylated gene panel shown in Table 2 of Example 1 were extracted as input data. The union of CpG sites with a Wilcoxon rank-sum test P < 0.05 between benign nodules and lung cancer and a fold difference > 1.2 was retained, where the fold difference is defined as the mean methylation level of the CpG site in the lung cancer group divided by the mean methylation level of that CpG site in the benign nodule group. The retained CpG sites were then further reduced in dimension using recursive feature elimination (RFE). Based on the current samples, a logistic regression model was fitted with the candidate CpG sites, the area under the curve (AUC) was estimated by 5-fold cross-validation, the AUCs were ranked from largest to smallest, and the top 100 genes were retained; the results are shown in Table 3.
Table 3: diagnostic performance of different methylated genes on benign and malignant pulmonary nodules
The evaluation showed that the combination of the top 4 genes (PTGER4, RASSF1, SHOX2, PCDHGB6) gave superior diagnostic performance, so the PTGER4, RASSF1, SHOX2, and PCDHGB6 genes were selected for the methylation panel.
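The first screening stage (rank-sum test plus fold difference) can be sketched in pure Python. This is an illustration under stated assumptions, not the procedure used in the embodiment: the P value uses the normal approximation without tie correction, both criteria are assumed to be required together, and all function names and data are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def rank_sum_p(x, y):
    """Two-sided Wilcoxon rank-sum P value (normal approximation)."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    w = sum(ranks[:n1])                      # rank sum of the first group
    mean = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean) / sd
    return math.erfc(abs(z) / math.sqrt(2))


def passes_screen(cancer, benign, p_cutoff=0.05, fold_cutoff=1.2):
    """Keep a feature if P < 0.05 and mean(cancer) / mean(benign) > 1.2."""
    fold = (sum(cancer) / len(cancer)) / (sum(benign) / len(benign))
    return rank_sum_p(cancer, benign) < p_cutoff and fold > fold_cutoff


# Hypothetical methylation levels of one CpG site in each group.
cancer = [0.50, 0.55, 0.60, 0.62, 0.65, 0.70, 0.72, 0.80]
benign = [0.10, 0.12, 0.15, 0.20, 0.22, 0.25, 0.30, 0.35]
keep = passes_screen(cancer, benign)
```

The same filter applies unchanged to window BinCount values and to chromosome-arm fragment ratios, since only the input feature differs between the three panels.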
(2) The genomic instability panel was screened as follows: for each sample, the BinCount value of every window across the whole genome (divided into equal-length windows of 2 Mb) was extracted as input data, and the union of windows with a Wilcoxon rank-sum test P < 0.05 between benign nodules and lung cancer and a fold difference > 1.2 was retained. The retained windows were then further reduced in dimension using recursive feature elimination (RFE). Based on the current samples, a logistic regression model was fitted with the candidate windows, the area under the curve (AUC) was estimated by 5-fold cross-validation, the AUCs were ranked from largest to smallest, and the top 30 windows were retained; the results are shown in Table 4.
Table 4: diagnostic performance of genomic instability of different windows on benign and malignant lung nodules
Window | AUC |
---|---|
chr6_46000000_48000000 | 0.714252006 |
chr14_78000000_80000000 | 0.712321421 |
chr5_146000000_148000000 | 0.710674157 |
chr19_2000000_4000000 | 0.610669698 |
chr12_132000000_133851895 | 0.609940366 |
chr7_38000000_40000000 | 0.60479214 |
chr19_1_2000000 | 0.603827247 |
chr19_48000000_50000000 | 0.603462182 |
chr19_16000000_18000000 | 0.603125 |
chr3_130000000_132000000 | 0.603097516 |
chr19_58000000_59128983 | 0.60273285 |
chr7_68000000_70000000 | 0.602444698 |
chr16_1_2000000 | 0.602203301 |
chr19_10000000_12000000 | 0.600447683 |
chr7_24000000_26000000 | 0.597572858 |
chr16_2000000_4000000 | 0.597550913 |
chr19_44000000_46000000 | 0.59498906 |
chr19_12000000_14000000 | 0.591707066 |
chr4_1_2000000 | 0.587697507 |
chr10_134000000_135534747 | 0.586424684 |
chr21_8000000_10000000 | 0.581465063 |
chr19_52000000_54000000 | 0.580675035 |
chr16_88000000_90000000 | 0.580552576 |
chr12_24000000_26000000 | 0.567310393 |
chr1_232000000_234000000 | 0.563031074 |
chr11_1_2000000 | 0.56020014 |
chr19_46000000_48000000 | 0.554406601 |
chr9_18000000_20000000 | 0.551078982 |
chr3_152000000_154000000 | 0.544772647 |
chr8_120000000_122000000 | 0.539813027 |
The evaluation showed that the combination of the top 3 windows (chr6_46000000_48000000, chr14_78000000_80000000, chr5_146000000_148000000) gave excellent diagnostic performance, so the genomic instability model uses the chr6_46000000_48000000, chr14_78000000_80000000, and chr5_146000000_148000000 windows.
(3) The fragment size distribution panel was screened as follows: for each sample, the fragment size value of every chromosome arm was extracted as input data, and the union of chromosome arms with a Wilcoxon rank-sum test P < 0.05 between benign nodules and lung cancer and a fold difference > 1.2 was retained. The retained chromosome arms were then further reduced in dimension using recursive feature elimination (RFE). Based on the current samples, a logistic regression model was fitted with the candidate chromosome arms, the area under the curve (AUC) was estimated by 5-fold cross-validation, and the AUCs were ranked from largest to smallest; the results are shown in Table 5.
Table 5: diagnostic performance of fragment sizes of different chromosome arms on benign and malignant pulmonary nodules
The evaluation showed that the combination of the top 3 chromosome arms (chr2q, chr11p, chr14p) gave excellent diagnostic performance, so the fragment size model uses the chr2q, chr11p, and chr14p chromosome arms.
Taken together, the tumor marker panel that combines fragment size patterns, genomic instability, and methylation levels for the identification of benign and malignant lung nodules is defined as: a logistic regression model comprising the methylation of the PTGER4, RASSF1, SHOX2, and PCDHGB6 genes, the instability values of the chr6_46000000_48000000, chr14_78000000_80000000, and chr5_146000000_148000000 windows, and the fragment size ratios of the chr2q, chr11p, and chr14p chromosome arms.
Based on the current samples (the order of the 128 samples was randomly shuffled, 80% of the 128 samples were used for training each time, and the remaining 20% were used for independent validation), a logistic regression model was fitted to each class of marker separately to obtain the score of each class of marker.
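The repeated shuffle-and-split scheme can be sketched as follows, using 100 repetitions to match the 100 iterative tests reported for this example. The function name, the fixed seed, and the rounding of the 80% training size are assumptions; the sketch only produces sample indices and does not fit any model.

```python
import random


def repeated_splits(n_samples, n_iterations=100, train_fraction=0.8, seed=0):
    """Yield (train_indices, validation_indices) for repeated random 80/20 splits."""
    rng = random.Random(seed)          # fixed seed so the splits are reproducible
    n_train = round(n_samples * train_fraction)
    for _ in range(n_iterations):
        order = list(range(n_samples))
        rng.shuffle(order)             # randomly shuffle the sample order
        yield order[:n_train], order[n_train:]


splits = list(repeated_splits(128))
train_idx, valid_idx = splits[0]
```

With 128 samples, each repetition trains on 102 samples and validates on the remaining 26.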
Based on the current samples, the scores of the three classes of markers were used as input data to fit a logistic regression model, which assigns each sample to be tested (see Table 7 for sample details) a malignancy probability value (hereinafter referred to as the LC Score). To reflect the stable performance of the model, parameters such as sensitivity and specificity are reported as the average of 100 iterative tests. The distribution of the malignancy probability among benign nodules and lung cancer is shown in FIG. 1, and the performance of the LC Score in identifying benign and malignant lung nodules is shown in FIG. 2 and Table 6.
Table 6 confusion matrix for identification of benign and malignant lung nodules by LC Score
LC Score result | Malignant nodule (lung cancer) | Benign nodule | Predictive value |
---|---|---|---|
Positive | 68 | 7 | 90.67% (positive predictive value) |
Negative | 8 | 45 | 84.91% (negative predictive value) |
Sensitivity / specificity | 89.47% (sensitivity) | 86.54% (specificity) | 88.28% (accuracy) |
Note: the sensitivity was 89.47%, the specificity was 86.54%, and the accuracy was 88.28%.
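The percentages in Table 6 follow directly from the four cell counts. A short sketch that recomputes them (the function name is hypothetical):

```python
def diagnostic_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute standard diagnostic metrics from a 2x2 confusion matrix."""
    return {
        "sensitivity": tp / (tp + fn),           # malignant nodules called positive
        "specificity": tn / (tn + fp),           # benign nodules called negative
        "ppv": tp / (tp + fp),                   # positive predictive value
        "npv": tn / (tn + fn),                   # negative predictive value
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }


# Counts from Table 6: 68 true positives, 7 false positives,
# 8 false negatives, 45 true negatives.
metrics = diagnostic_metrics(tp=68, fp=7, fn=8, tn=45)
percentages = {name: round(value * 100, 2) for name, value in metrics.items()}
```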
Example 3
To assess the sensitivity of the LC Score model for monitoring the risk of postoperative recurrence of lung cancer, the model was applied to 20 pairs of samples from patients who were positive before surgery and had postoperative recurrence (40 samples in total) and 11 pairs of samples from patients who were positive before surgery and had no postoperative recurrence (22 samples in total). The results show that the LC Score can accurately distinguish postoperative recurrence samples from non-recurrence samples; the evaluation results are shown in FIG. 3.
Example 4
The combination of 10 markers from Example 2 (methylation levels of the PTGER4, RASSF1, SHOX2, and PCDHGB6 genes; genomic instability of the chr6_46000000_48000000, chr14_78000000_80000000, and chr5_146000000_148000000 windows; and fragment size distributions of chr2q, chr11p, and chr14p) was used as experimental group 1, and experimental groups 2 and 3 were set up in parallel:
experimental group 2 is a combination of 3 markers (methylation levels of the PTGER4, RASSF1, and SHOX2 genes);
experimental group 3 is a combination of 15 markers (the 10 markers above plus genomic instability of the chr19_2000000_4000000 and chr12_132000000_133851895 windows and fragment size distributions of chr1p, chr6p, and chr9p);
a logistic regression model was fitted to each of the 3 marker combinations by the method provided in Example 2, and ROC curve analysis was performed on the 128 samples described in Example 2.
The ROC curve of experimental group 1 is shown in FIG. 2, with an AUC of 0.928; the ROC curve of experimental group 2 is shown in FIG. 4, with an AUC of 0.772; and the ROC curve of experimental group 3 is shown in FIG. 5, with an AUC of 0.92.
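The AUC values compared above equal the probability that a randomly chosen malignant sample receives a higher score than a randomly chosen benign sample, with ties counted as one half. A minimal rank-based sketch of that calculation, with a hypothetical function name and toy data:

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Each positive/negative pair counts 1 if ranked correctly, 0.5 if tied.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: two benign (0) and two malignant (1) samples.
example_auc = auc([0.10, 0.40, 0.35, 0.80], [0, 0, 1, 1])
```

Three of the four malignant/benign pairs are ordered correctly in this toy example, giving an AUC of 0.75.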
TABLE 7 sample information
TABLE 8 sample information
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (10)
1. Use of reagents for detecting a marker combination for the manufacture of a product for identifying benign and malignant lung nodules and/or predicting the risk of postoperative recurrence of lung cancer, wherein the marker combination comprises the following three classes of markers: methylation levels of the target gene, genomic instability within the target chromosomal window, and fragment size distribution within the target chromosomal arm;
wherein the target gene comprises: at least one of PTGER4 gene, RASSF1 gene, SHOX2 gene and PCDHGB6 gene;
the target chromosome window includes: at least one of chr6_46000000_48000000, chr14_78000000_80000000 and chr5_146000000_148000000;
the target chromosome arm comprises: at least one of chr2q, chr11p and chr14p.
2. The use of claim 1, wherein the genomic instability within the window of target chromosomes is represented by a genomic instability score within the window of target chromosomes;
preferably, the method of calculating the genomic instability score is selected from any one of: a Z-value algorithm based on sequencing depth, a log2 ratio algorithm based on control samples, an algorithm based on soft-clipped breakpoint reads, and the BinCount_i calculation formula;
preferably, the calculation formula of BinCount_i is as follows:
3. The use according to claim 1, wherein the fragment size distribution is the ratio or difference in the number of long fragments and short fragments within a window of a target chromosome, the long fragments are 101-220 bp in length, and the short fragments are 20-100bp in length.
4. Use according to any one of claims 1 to 3, characterized in that the product is selected from: any one of a reagent, a kit and a predictive model.
5. An agent or kit for identifying the benign or malignant nature of lung nodules and/or predicting the risk of postoperative recurrence of lung cancer, comprising: the reagent for detecting a marker combination according to any one of claims 1 to 4.
6. A method of training a predictive model for malignancy of a pulmonary nodule and/or predicting risk of postoperative recurrence of lung cancer, comprising:
obtaining a detection result and a corresponding labeling result of each marker in a marker combination in a training sample; wherein the labeling result is a label representing benign and malignant lung nodules of the sample and/or predicting the risk of recurrence after lung cancer surgery, and the marker combination is the marker combination according to any one of claims 1 to 4;
inputting the detection results of all the markers in the marker combination or the scores of the three types of markers into a pre-constructed prediction model to obtain a prediction result; the score of each type of marker is obtained in the following mode: obtaining a detection result and a corresponding labeling result of each marker in each type of marker in a training sample; inputting the detection results of all the markers in each type of marker into a pre-constructed prediction model, and taking the prediction results as the scores of each type of marker; the pre-constructed prediction model is a machine learning model capable of predicting benign and malignant lung nodules of a sample and/or predicting the risk of postoperative recurrence of the lung cancer according to the detection result of the marker combination, the scores of the three types of markers or the detection result of each marker in each type of markers;
and updating parameters of the prediction model based on the labeling result and the prediction result.
7. The training method according to claim 6, wherein the machine learning model comprises a logistic regression model.
8. A device for predicting the benign or malignant condition of a lung nodule and/or predicting the risk of postoperative recurrence of lung cancer, comprising:
the acquisition module is used for acquiring the detection result of each marker in the marker combination of the sample to be detected; wherein the marker combination is the marker combination according to any one of claims 1 to 4;
the prediction module is used for inputting the detection results of all the markers or the scores of the three types of markers into the prediction model trained by the training method of claim 6 or 7 to obtain the prediction result of the sample to be tested; the score for each class of markers is obtained as described in claim 6.
9. An electronic device, comprising: a processor and a memory; the memory for storing a program that, when executed by the processor, causes the processor to implement the training method of claim 6 or 7 or the predictive method of malignancy of a lung nodule and/or predicting risk of postoperative recurrence of lung cancer;
the prediction method comprises the following steps: obtaining the detection result of each marker in the marker combination of the sample to be detected; the marker combination is the marker combination according to any one of claims 1 to 4; inputting the detection results of all the markers or the scores of the three types of markers into the prediction model trained by the training method according to claim 6 or 7 to obtain the prediction result of the sample to be tested, wherein the score of each type of marker is obtained according to the method in claim 6.
10. A computer-readable medium, characterized in that the computer-readable medium has stored thereon a computer program which, when being executed by a processor, carries out the training method of claim 6 or 7 or the prediction method of claim 9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211552230.XA CN115798582A (en) | 2022-12-05 | 2022-12-05 | Model for identifying benign and malignant lung nodules or predicting risk of postoperative recurrence of lung cancer, kit and application |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115798582A (en) | 2023-03-14 |
Family
ID=85445821
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211552230.XA Pending CN115798582A (en) | 2022-12-05 | 2022-12-05 | Model for identifying benign and malignant lung nodules or predicting risk of postoperative recurrence of lung cancer, kit and application |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||