CN117385027A

CN117385027A - Lung cancer specific methylation marker and application thereof in diagnosis of lung cancer

Info

Publication number: CN117385027A
Application number: CN202210787412.9A
Authority: CN
Inventors: 谢可辉; 郗大勇; 马成城; 陈桦; 徐敏杰; 李威; 何其晔; 苏志熙; 刘蕊
Original assignee: Jiangsu Huayuan Biotechnology Co ltd
Current assignee: Jiangsu Huayuan Biotechnology Co ltd
Priority date: 2022-07-04
Filing date: 2022-07-04
Publication date: 2024-01-12

Abstract

The invention provides a lung cancer-specific methylation marker and its application in diagnosing lung cancer. The invention relates to the use of a reagent or component in the preparation of a kit or device for distinguishing lung cancer patients from non-lung cancer patients, or for tissue tracing of lung cancer during pan-cancer screening, wherein the reagent or component comprises a reagent or component for detecting the methylation level of a lung cancer tissue-specific methylation marker, such as the gene ARHGEF16 or a sequence such as SEQ ID NOs: 1-48. The method is used for tissue tracing of lung cancer during early pan-cancer screening, so that lung cancer can be better distinguished from other cancers.

Description

Lung cancer specific methylation marker and application thereof in diagnosis of lung cancer

Technical Field

The invention belongs to the field of molecular auxiliary diagnosis, and particularly relates to a lung cancer tissue specific methylation marker and application thereof in diagnosing lung cancer.

Background

Lung cancer is the cancer responsible for the highest mortality worldwide. Although the combined use of surgery, chemotherapy, targeted therapy, and immunotherapy has significantly improved the survival rate of lung cancer, the prognosis of lung cancer patients is still relatively poor compared with other cancers. The main reason is that most lung cancer is diagnosed at a late stage, which is associated with the lack of widespread early screening for lung cancer.

Cancer screening detects early cancer-related signals in high-risk groups so that patients with early-stage cancer can be found in time. Such patients can be cured by surgical resection, so screening can greatly reduce cancer mortality. About 85% of lung cancers are non-small cell lung cancer (NSCLC). The five-year survival rate of patients with early-stage in situ cancer is as high as 55.6%, but metastasis readily occurs in the middle and late stages, and the five-year survival rate after metastasis is only 4.5%. Early-stage NSCLC is asymptomatic, and more than 80% of NSCLC patients are already in the middle or late stage at diagnosis, with lymph node spread or distant metastasis and lower survival (Weichert W et al., 2014). From 1990 to 2015, overall cancer mortality in the United States fell by 25%, and lung cancer mortality in men fell by as much as 45%. Part of the reason for this marked reduction in cancer mortality is the widespread use of cancer screening techniques (Byers T et al., 2016).

Traditional cancer screening methods include endoscopy, imaging (CT, MRI, etc.), and tumor markers (such as alpha-fetoprotein, used clinically to assist in diagnosing primary liver cancer; carcinoembryonic antigen, a broad-spectrum tumor marker; and the cytokeratin 19 fragment CYFRA 21-1, a tumor marker for lung cancer), but these methods have certain limitations. For example, the most widely used early screening method for lung cancer in clinical practice is low-dose CT (LDCT). Although LDCT can detect early-stage NSCLC to a certain extent, its specificity is low, and confirming a positive finding requires long follow-up, repeated review, or other diagnostic means, which significantly increases patient suffering and wastes medical resources through overdiagnosis. Existing tumor markers generally perform poorly, can serve only as clinical references, and are difficult to apply to large-scale screening.

In recent years, liquid biopsy has become a hot research topic. It is based on cell-free DNA released by tumor cells into plasma (ctDNA) and, compared with traditional methods, offers convenient sampling, non-invasiveness, the ability to screen early for multiple cancer types (pan-cancer screening), and the ability to overcome tumor heterogeneity, and it is widely applied. ctDNA can reflect cancer information through mutations, fragment length distribution, methylation, and other features. ctDNA methylation in particular has become a focus for developing high-performing early cancer screening products, and there are numerous applications of ctDNA methylation in early screening, such as PanSeer, a pan-cancer methylation screening test that reaches 88% sensitivity across 5 cancer types (gastric cancer, esophageal cancer, liver cancer, colorectal cancer, and lung cancer) at 96% specificity and can detect cancer up to 4 years earlier than traditional methods (Xingdong Chen et al., 2020).

Cancer screening, especially early pan-cancer screening, requires not only predicting whether a cancer signal is present but also tracing positive samples to their tissue of origin. Cancers arising at different sites in the human body have different methylation characteristics (Kundaje A et al., 2015), which makes tissue tracing possible. However, discovering tissue-specific methylation markers requires extensive methylation sequencing data from multiple cancer types together with stringent screening and validation, which is a challenging task. There is a need in the art for tissue-specific methylation markers for lung cancer.

Disclosure of Invention

In view of the current lack of tissue-specific methylation markers for lung cancer in the art, the present inventors screened a large amount of next-generation sequencing (NGS) cfDNA methylation-targeted sequencing data covering 7 cancer types (lung cancer, liver cancer, colorectal cancer, gastric cancer, esophageal cancer, pancreatic cancer, and breast cancer). The inventors used the methylation markers obtained by this screening to construct and validate a machine learning model for tissue tracing of lung cancer during early pan-cancer screening, so that lung cancer can be better distinguished.

In one aspect, the invention provides the use of a reagent or component in the preparation of a kit or device for (1) distinguishing between a lung cancer patient and a non-lung cancer patient; (2) diagnosing or aiding in the diagnosis of lung cancer; or (3) tissue tracing of lung cancer during pan-cancer screening, wherein the reagent or component comprises a reagent or component that detects the methylation level of a lung cancer tissue-specific methylation marker in the genomic DNA of a sample, the methylation marker being one of the following regions or a site thereof, each region being one of the following genes together with the 2.2kb upstream region and the 2.2kb downstream region on the chromosome in which the gene is located: genes ARHGEF16, CASZ1, MAP3K6, TRIM58, ARHGEF33, PSD4, HOXD4, SLC12A8, DGKG, TERT, NR2F1, PCDHGC5, KCNMB1, FOXC1, HIST1H4F, TYW, LRRC4, DGKI, PDLIM2, RHOBTB2, TMEM75, OPLAH, NR5A1, SPAG6, WAPAL, BTBD16, DPYSL4, TTC40, ADAM8, SLC22A11, CPT1A, B4GALNT1, FBRSL1, XPO4, TFDP1, GCH1, TMEM179, ITPKA, SOX8, SLC9A3R2, SEPT-9, MBP, NFATC1, DNM2, RASAL3, TAF4, NTSR1, and SLC17A9; or a complementary sequence or variant of any one of these genes, provided that the methylation site in the variant is not mutated. In one embodiment, the length of the site is 120bp to 500bp, preferably 200bp to 480bp.

In one embodiment, the cancer other than lung cancer includes colorectal cancer, liver cancer, gastric cancer, esophageal cancer, pancreatic cancer, and/or breast cancer.

In one embodiment, the methylation marker comprises a nucleotide sequence set forth in any one or more of the following, or a complement or variant sequence thereof: SEQ ID NOS: 1-48.

In one embodiment, the reagent or component comprises a reagent or component used in one or more of the following methods of detecting methylation: bisulfite conversion-based PCR, DNA sequencing, methylation-sensitive restriction enzyme analysis, fluorescence quantification, methylation-sensitive high resolution melting curve, and chip-based methylation profile analysis and mass spectrometry.
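Several of the listed methods rely on bisulfite conversion, in which unmethylated cytosines are deaminated and read as thymine after amplification, while methylated cytosines are preserved. As a rough illustration only, the sketch below simulates this on a single strand. The function name `bisulfite_convert` and the 0-based `methylated_positions` argument are hypothetical and are not part of the invention.

```python
def bisulfite_convert(sequence, methylated_positions=()):
    """Simulate bisulfite conversion of one DNA strand.

    An unmethylated C is read as T after conversion and PCR;
    a C at a methylated (0-based) position is left unchanged.
    """
    protected = set(methylated_positions)
    converted = []
    for index, base in enumerate(sequence.upper()):
        if base == "C" and index not in protected:
            converted.append("T")
        else:
            converted.append(base)
    return "".join(converted)


# Illustrative input: only the cytosine at position 1 is methylated.
print(bisulfite_convert("ACGTCCGA", methylated_positions=[1]))
```

Comparing the converted read with the untreated sequence shows which cytosines were protected, which is the basis for calling a site methylated.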

In one embodiment, the reagent or assembly comprises primers and/or probes for detecting a methylation marker, and/or the sample is a cell, tissue, fine needle biopsy and/or plasma, preferably the sample genomic DNA is free DNA in plasma.

In another aspect, the invention provides a method of constructing a predictive model for distinguishing lung cancer from other cancers, comprising:

(1) obtaining methylation levels of methylation markers in genomic DNA of lung cancer samples and non-lung cancer samples as a training set; the methylation marker is selected from the following regions or sites of the regions, each region being one of the following genes together with the 2.2kb upstream region and the 2.2kb downstream region on the chromosome in which the gene is located: genes ARHGEF16, CASZ1, MAP3K6, TRIM58, ARHGEF33, PSD4, HOXD4, SLC12A8, DGKG, TERT, NR2F1, PCDHGC5, KCNMB1, FOXC1, HIST1H4F, TYW, LRRC4, DGKI, PDLIM2, RHOBTB2, TMEM75, OPLAH, NR5A1, SPAG6, WAPAL, BTBD16, DPYSL4, TTC40, ADAM8, SLC22A11, CPT1A, B4GALNT1, FBRSL1, XPO4, TFDP1, GCH1, TMEM179, ITPKA, SOX8, SLC9A3R2, SEPT-9, MBP, NFATC1, DNM2, RASAL3, TAF4, NTSR1, and SLC17A9; or a complementary sequence or variant of any one of these genes, provided that the methylation site in the variant is not mutated; and

(2) constructing a logistic regression machine learning model using the methylation level data of the methylation markers.

In one embodiment, the length of the site is 120bp to 500bp, preferably 200bp to 480bp. In one embodiment, the non-lung cancer is colorectal cancer, liver cancer, gastric cancer, esophageal cancer, pancreatic cancer, and/or breast cancer.

In one embodiment, the sample is a cell, tissue, fine needle biopsy, or plasma. In one embodiment, the genomic DNA is free DNA in plasma.

In one embodiment, step (1) comprises obtaining methylation sequencing data of the sample DNA.

In one embodiment, step (2) includes building a logistic regression model to obtain model prediction scores; training the model using the methylation levels of the obtained methylation markers as a training set; and determining the relevant threshold of the model from the samples of the training set. For example, the logistic regression model in the sklearn (v1.0.1) package in Python (v3.9.7) may be used. The model has the standard logistic form y = 1 / (1 + e^(-(w·x + b))), where x is the methylation level value of the methylation marker in the sample, w is the coefficient of the methylation marker, b is the intercept, and y is the model prediction score.

The methylation levels of the obtained methylation markers can be used as a training set for training: AllModel.fit(TrainData, TrainPheno), where TrainData is the training set data and TrainPheno is the class label of each training set sample, with lung cancer coded as 1 and other cancer types coded as 0. The relevant threshold of the model may be determined from the samples of the training set.
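The scoring and thresholding steps can be sketched in plain Python. This is an illustrative sketch only: the weights, intercept, and sample values are made up, and choosing the threshold by maximizing Youden's J (sensitivity + specificity - 1) is an assumption, since the text does not state how the threshold is derived from the training set. In practice the coefficients would come from fitting the sklearn logistic regression model named above.

```python
import math


def logistic_score(x, w, b):
    """Return y = 1 / (1 + exp(-(w.x + b))) for one sample."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    # Branch on the sign of z so math.exp never receives a large positive value.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def youden_threshold(scores, labels):
    """Pick the cutoff that maximizes sensitivity + specificity - 1.

    Samples scoring strictly above the cutoff are called lung cancer (label 1).
    This selection rule is an assumption, not taken from the source text.
    """
    pos = [s for s, lab in zip(scores, labels) if lab == 1]
    neg = [s for s, lab in zip(scores, labels) if lab == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    best_t, best_j = None, -1.0
    for t in sorted(set(scores)):
        sensitivity = sum(s > t for s in pos) / len(pos)
        specificity = sum(s <= t for s in neg) / len(neg)
        j = sensitivity + specificity - 1.0
        if j > best_j:
            best_t, best_j = t, j
    return best_t


# Hypothetical coefficients and methylation levels for two markers.
w, b = [2.0, -1.0], -0.5
train_data = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.7], [0.1, 0.9]]
train_pheno = [1, 1, 0, 0]  # 1 = lung cancer, 0 = other cancer type

train_scores = [logistic_score(x, w, b) for x in train_data]
threshold = youden_threshold(train_scores, train_pheno)
print(train_scores, threshold)
```

Other threshold rules (for example, fixing specificity at a target value) would fit the same interface.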

In another aspect, a predictive model of lung cancer constructed according to the methods of the invention is provided.

In another aspect, there is provided an apparatus for diagnosing lung cancer comprising a memory and a processor for processing instructions stored in the memory, the instructions performing a method according to the present invention to construct a predictive model of lung cancer. The methylation levels of the methylation markers in the genomic DNA of a sample to be tested are used as a test set to obtain a model prediction score, and the score is compared with the threshold: a sample scoring above the threshold is predicted to be lung cancer, and otherwise it is predicted to be another cancer type. For example: TestPred = AllModel.predict_proba(TestData)[:, 1], where TestData is the test set data and TestPred is the model prediction score.
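The prediction step can be sketched without any library. The sketch assumes a `predict_proba`-style output in which each row holds [P(other cancer type), P(lung cancer)], so that taking column 1 yields the lung cancer score, which is then compared with the threshold. The probabilities and the 0.5 threshold are made-up illustrative values, and the helper names are hypothetical.

```python
def lung_scores(proba_rows):
    """Extract the positive-class (lung cancer, label 1) probability from each row."""
    return [row[1] for row in proba_rows]


def call_samples(scores, threshold):
    """Call a sample lung cancer when its score is strictly above the threshold."""
    return ["lung cancer" if s > threshold else "other cancer" for s in scores]


# Made-up class probabilities for three test samples.
proba = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
test_pred = lung_scores(proba)
print(call_samples(test_pred, threshold=0.5))
```

A score exactly equal to the threshold is assigned to the other cancer types here, matching the "larger than the threshold" wording.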

In another aspect, a kit or device for detecting lung cancer tissue-specific methylation markers is provided, comprising reagents or components for detecting the status and/or level of one or more lung cancer tissue-specific methylation markers in genomic DNA from a sample, the lung cancer tissue-specific methylation markers being the following regions or sites thereof, each region being one of the following genes together with the 2.2kb upstream region and the 2.2kb downstream region on the chromosome in which the gene is located: genes ARHGEF16, CASZ1, MAP3K6, TRIM58, ARHGEF33, PSD4, HOXD4, SLC12A8, DGKG, TERT, NR2F1, PCDHGC5, KCNMB1, FOXC1, HIST1H4F, TYW, LRRC4, DGKI, PDLIM2, RHOBTB2, TMEM75, OPLAH, NR5A1, SPAG6, WAPAL, BTBD16, DPYSL4, TTC40, ADAM8, SLC22A11, CPT1A, B4GALNT1, FBRSL1, XPO4, TFDP1, GCH1, TMEM179, ITPKA, SOX8, SLC9A3R2, SEPT-9, MBP, NFATC1, DNM2, RASAL3, TAF4, NTSR1, and SLC17A9; or a complementary sequence or variant of any one of these genes, provided that the methylation site in the variant is not mutated. In one embodiment, the length of the site is 120bp to 500bp, preferably 200bp to 480bp.

In one embodiment, the sample is a cell, tissue, fine needle biopsy, or plasma. In one embodiment, the nucleic acid is free DNA in plasma.

In one embodiment, the reagent or component comprises a reagent or component used in one or more of the following methods: bisulfite conversion-based PCR, DNA sequencing, methylation-sensitive restriction enzyme analysis, fluorescence quantification, methylation-sensitive high resolution melting curve, and chip-based methylation profile analysis and mass spectrometry.

In one embodiment, the reagent comprises an oligonucleotide for detecting a methylation marker. In one embodiment, the oligonucleotide is a primer and/or probe.

In one embodiment, the primer is a primer for detecting the methylation level or state of a site by methylation sequencing, or a PCR primer for amplifying one or more methylation sites.

In one embodiment, the reagent comprises bisulfite and derivatives thereof, PCR buffers, polymerase, dNTPs, primers, probes, methylation-sensitive or methylation-insensitive restriction enzymes, cleavage buffers, fluorescent dyes, fluorescence quenchers, fluorescent reporters, exonucleases, alkaline phosphatase, internal standards, and/or controls, the controls being the aforementioned specific methylation markers from normal subjects or from patients with cancers other than lung cancer. In one embodiment, the non-lung cancer is colorectal cancer, liver cancer, gastric cancer, esophageal cancer, pancreatic cancer, and/or breast cancer.

The present invention provides isolated nucleic acids that are one or more specific methylation markers. In one embodiment, the isolated nucleic acid is a lung cancer tissue-specific methylation marker. In one embodiment, the lung cancer tissue-specific methylation marker is one of the following regions or a site thereof, each region being one of the following genes together with the 2.2kb upstream region and the 2.2kb downstream region on the chromosome in which the gene is located: genes ARHGEF16, CASZ1, MAP3K6, TRIM58, ARHGEF33, PSD4, HOXD4, SLC12A8, DGKG, TERT, NR2F1, PCDHGC5, KCNMB1, FOXC1, HIST1H4F, TYW, LRRC4, DGKI, PDLIM2, RHOBTB2, TMEM75, OPLAH, NR5A1, SPAG6, WAPAL, BTBD16, DPYSL4, TTC40, ADAM8, SLC22A11, CPT1A, B4GALNT1, FBRSL1, XPO4, TFDP1, GCH1, TMEM179, ITPKA, SOX8, SLC9A3R2, SEPT-9, MBP, NFATC1, DNM2, RASAL3, TAF4, NTSR1, and SLC17A9; or a complementary sequence or variant of any one of these genes, provided that the methylation site in the variant is not mutated. In one embodiment, the length of the site is 120bp to 500bp, preferably 200bp to 480bp. In one embodiment, the methylation marker comprises a nucleotide sequence set forth in any one or more of the following, or a complement or variant sequence thereof: SEQ ID NOS: 1-48. In one embodiment, the isolated nucleic acid is isolated from a sample. In one embodiment, the sample is a cell, tissue, fine needle biopsy, or plasma. In one embodiment, the isolated nucleic acid is obtained from a lung cancer patient. For example, the isolated nucleic acid is obtained from free DNA in plasma.

In embodiments of aspects of the invention, the variant comprises a sequence having at least 70% identity to the sequence of any one of the genes. For example, a variant comprises a sequence that is at least 75%, 76%, 77%, 78%, 79%, 80%, 81%, 82%, 83%, 84%, 85%, 86%, 87%, 88%, 89%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98%, or 99% identical to the sequence of any one of the genes.

In embodiments of aspects of the invention, the region is the gene and the 2.2kb upstream region and the 2.2kb downstream region of the chromosome in which the gene is located. In one embodiment, the upstream region is a 2.1kb, 2kb, 1.9kb, 1.8kb, 1.7kb, 1.6kb, 1.5kb, 1.4kb, 1.3kb, 1.2kb, 1.1kb, 1kb, 900bp, 800bp, 700bp, 600bp, 500bp, 400bp, 300bp, 200bp, 100bp, 90bp, 80bp, 70bp, 60bp, 50bp, 40bp, 30bp, 20bp, 10bp or 5bp upstream region upstream of the gene. The downstream region is a 2.1kb, 2kb, 1.9kb, 1.8kb, 1.7kb, 1.6kb, 1.5kb, 1.4kb, 1.3kb, 1.2kb, 1.1kb, 1kb, 900bp, 800bp, 700bp, 600bp, 500bp, 400bp, 300bp, 200bp, 100bp, 90bp, 80bp, 70bp, 60bp, 50bp, 40bp, 30bp, 20bp, 10bp or 5bp downstream region downstream of the gene.

In embodiments of aspects of the invention, the length of the sites may vary. In one embodiment, the length of the site may be 120bp to 500bp, preferably 200bp to 480bp. In one embodiment, the length of the site may be 130bp, 140bp, 150bp, 160bp, 170bp, 180bp, 190bp, 200bp, 210bp, 220bp, 230bp, 240bp, 250bp, 260bp, 270bp, 280bp, 290bp, 300bp, 310bp, 320bp, 330bp, 340bp, 350bp, 360bp, 370bp, 380bp, 390bp, 400bp, 410bp, 420bp, 430bp, 440bp, 450bp, 460bp, 470bp, 480bp, 490bp or 500bp.

In embodiments of aspects of the invention, a variant is a variant sequence having at least 70%, 75%, 76%, 77%, 78%, 79%, 80%, 81%, 82%, 83%, 84%, 85%, 86%, 87%, 88%, 89%, 90%, 91%, 92%, 93%, 94%, 95%, 96%, 97%, 98% or 99% identity to a nucleotide sequence set forth in any one or more of the above.

In one aspect, the invention provides a method for (1) distinguishing between a lung cancer patient and a non-lung cancer patient; (2) diagnosing or aiding in the diagnosis of lung cancer; or (3) tissue tracing of lung cancer during pan-cancer screening, the method comprising determining the methylation level of one or more methylation markers described herein in the genomic DNA of a sample. In one embodiment, the method is performed using the lung cancer prediction model of the present invention.

Advantages of the invention include:

1. the invention provides a novel lung cancer tissue-specific methylation marker that can be used for tissue tracing of lung cancer during early pan-cancer screening, so that lung cancer can be better distinguished;

2. the method is based on cell-free DNA released by tumor cells into plasma (ctDNA), is non-invasive, and enables early screening of lung cancer;

3. the lung cancer tissue-specific methylation marker can detect lung cancer with high sensitivity and specificity.

Drawings

Fig. 1: Methylation levels of the selected lung cancer tissue-specific methylation markers in the training set.

Fig. 2: Methylation levels of the selected lung cancer tissue-specific methylation markers in the test set.

Fig. 3: Methylation level of lung cancer tissue-specific methylation marker SEQ ID NO: 1 in each cancer type of the training set.

Fig. 4: Methylation level of lung cancer tissue-specific methylation marker SEQ ID NO: 1 in each cancer type of the test set.

Fig. 5: Distribution of model scores for lung cancer and other cancer types in the training and test sets, using all lung cancer tissue-specific methylation markers.

Fig. 6: ROC curves of the model using all lung cancer tissue-specific methylation markers in the training and test sets.

Fig. 7: Model scores for lung cancer tissue-specific methylation marker combination 1.

Fig. 8: ROC curve of the model for lung cancer tissue-specific methylation marker combination 1.

Fig. 9: Model scores for lung cancer tissue-specific methylation marker combination 2.

Fig. 10: ROC curve of the model for lung cancer tissue-specific methylation marker combination 2.

Detailed Description

The inventors screened lung cancer tissue-specific methylation markers from a large amount of NGS methylation sequencing data covering 7 cancer types. The markers achieve a good tissue tracing effect in the related validation data and provide important technical support for tissue tracing of lung cancer during early pan-cancer screening.

Machine learning modeling is a process of finding the most appropriate representation of the input data features so that a specific problem, such as a classification problem, can be solved. The modeled output discriminates better than each individual input feature. The best model and the classification effect of each marker in the model are presented herein; the discrimination achieved by modeling with any combination of features lies between that of the best model and that of a single feature. As shown herein, each individual marker has a distinguishing effect, and the results of classifying with randomly selected markers are also shown in the examples of this patent. Thus, the present application protects any one marker, any combination of the markers, and the models constructed from them.
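One common way to compare the discrimination of a single marker with that of a model score is the area under the ROC curve (AUC). The sketch below computes AUC as the fraction of lung cancer versus other cancer sample pairs that are ranked correctly, counting ties as half. It is an illustrative assumption about how such a comparison could be made, and the scores shown are made up.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive sample (label 1)
    scores higher than a randomly chosen negative sample (label 0).
    """
    pos = [s for s, lab in zip(scores, labels) if lab == 1]
    neg = [s for s, lab in zip(scores, labels) if lab == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


# Made-up values: one marker's methylation level versus a combined model score.
labels = [0, 0, 1, 1]
single_marker = [0.10, 0.40, 0.35, 0.80]
model_score = [0.05, 0.20, 0.70, 0.90]
print(roc_auc(single_marker, labels), roc_auc(model_score, labels))
```

The same function applies to a marker that is less methylated in lung cancer if its values are negated before scoring.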

The inventors found that lung cancer is associated with the methylation level of the following gene regions or the regions upstream and downstream thereof: genes ARHGEF16, CASZ1, MAP3K6, TRIM58, ARHGEF33, PSD4, HOXD4, SLC12A8, DGKG, TERT, NR2F1, PCDHGC5, KCNMB1, FOXC1, HIST1H4F, TYW, LRRC4, DGKI, PDLIM2, RHOBTB2, TMEM75, OPLAH, NR5A1, SPAG6, WAPAL, BTBD16, DPYSL4, TTC40, ADAM8, SLC22A11, CPT1A, B4GALNT1, FBRSL1, XPO4, TFDP1, GCH1, TMEM179, ITPKA, SOX8, SLC9A3R2, SEPT-9, MBP, NFATC1, DNM2, RASAL3, TAF4, NTSR1, and SLC17A9.

DNA methylation is an epigenetic mechanism and a common epigenetic modification of eukaryotic genomes that can alter gene expression without altering the DNA sequence. DNA methylation refers to the covalent attachment of a methyl group to the carbon-5 position of the cytosine in a genomic CpG dinucleotide by a DNA methyltransferase. DNA methylation plays an important role in cell proliferation, differentiation, and development, is closely related to the occurrence and development of tumors, and contributes to transcriptional repression, regulation of chromatin structure, X chromosome inactivation, genomic imprinting, and the like. Abnormal DNA methylation can be involved in tumor development and progression by affecting chromatin structure and the expression of oncogenes and tumor suppressor genes.

As used herein, "primer" refers to a nucleic acid molecule with a specific nucleotide sequence that directs synthesis at the start of nucleotide polymerization. Primers are typically a pair of synthetic oligonucleotides, one complementary to one strand of the DNA template at one end of the target region and the other complementary to the other strand of the DNA template at the other end of the target region, and they serve as the starting points for nucleotide polymerization. Primers designed in vitro are widely used in polymerase chain reaction (PCR), qPCR, sequencing, probe synthesis, and the like. Typically, primers are designed so that the amplified product is 50-150bp, 60-140bp, 70-130bp, or 80-120bp in length. The primers contained in the reagents herein may be genome sequencing primers, such as whole-genome sequencing primers or sequencing primers directed to a region of the genome, or PCR primers for amplifying a specific region or for amplifying one or more methylation sites in a region. Whole-genome sequencing primers may yield many amplification products, which contain the region either directly or after assembly. From the whole-genome sequencing results, the methylation status of each methylation site (CpG) in the region is obtained, and from these the methylation level of the entire region. The primer is complementary or substantially complementary to the gene or region of interest.

As used herein, the term "variant" refers to a polynucleotide whose nucleic acid sequence differs from a reference sequence by insertion, deletion, or substitution of one or more nucleotides, while retaining its ability to hybridize to other nucleic acids. Variants of any of the embodiments herein include nucleotide sequences that have at least 70%, preferably at least 80%, preferably at least 85%, preferably at least 90%, preferably at least 95%, preferably at least 97% sequence identity to a reference sequence or reference gene and retain the methylation site of the reference sequence or reference gene. Sequence identity between two aligned sequences can be calculated using, for example, BLASTn from NCBI. Variants also include nucleotide sequences that have one or more mutations (insertions, deletions, or substitutions) in the nucleotide sequence of the reference sequence while still retaining the methylation site of the reference sequence. Multiple mutations generally means 1-10 mutations, such as 1-8, 1-5, or 1-3. A substitution may exchange a purine nucleotide for a pyrimidine nucleotide or vice versa, or may exchange one purine nucleotide for another or one pyrimidine nucleotide for another. The substitution is preferably a conservative substitution. For example, it is known in the art that conservative substitutions with nucleotides of similar properties generally do not alter the stability or function of the polynucleotide. Conservative substitutions include exchanges between purine nucleotides (A and G) and exchanges between pyrimidine nucleotides (T or U and C). Thus, substituting one or several sites in a polynucleotide of the invention with residues of the same class will not substantially affect its activity. Furthermore, the methylation sites described herein that are contained in the variants of the invention are not mutated. That is, the method of the present invention detects methylation at the methylation sites in the corresponding sequence, and mutations may occur at bases other than those sites.
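The two conditions on a variant (a minimum sequence identity and unmutated methylation sites) can be sketched as a simple check. The sketch makes several simplifying assumptions that are not in the source: the variant differs from the reference by substitutions only (equal length, no alignment needed), and every CpG dinucleotide in the reference is treated as a methylation site. The function names are hypothetical.

```python
def cpg_positions(sequence):
    """Return 0-based start positions of CpG dinucleotides in a sequence."""
    seq = sequence.upper()
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == "CG"]


def is_acceptable_variant(reference, variant, min_identity=70.0):
    """Check identity >= min_identity (%) and that all reference CpGs are intact.

    Assumes substitution-only variants of the same length as the reference.
    """
    if not reference or len(reference) != len(variant):
        raise ValueError("sequences must be non-empty and of equal length")
    ref, var = reference.upper(), variant.upper()
    matches = sum(r == v for r, v in zip(ref, var))
    identity = 100.0 * matches / len(ref)
    cpgs_intact = all(var[i:i + 2] == "CG" for i in cpg_positions(ref))
    return identity >= min_identity and cpgs_intact


reference = "ACGTTACGAA"  # made-up sequence with CpGs at positions 1 and 6
print(is_acceptable_variant(reference, "ACGATACGAA"))  # substitution outside any CpG
print(is_acceptable_variant(reference, "ATGTTACGAA"))  # substitution destroys a CpG
```

Variants containing insertions or deletions would first need to be aligned to the reference, for example with BLASTn, before a check of this kind could be applied.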

As used herein, the term "biological sample" or "sample" generally refers to a sample obtained or derived from a biological source of interest (e.g., a tissue or organism or cell culture). In some embodiments, the organism from which the sample is derived is an animal or a human, preferably a human. In some embodiments, the sample is or includes biological tissue or fluid. In some embodiments, the biological sample may be or include a cell, tissue, or body fluid. In some embodiments, the biological sample may be or include blood, blood cells, cell-free DNA, free floating nucleic acid, ascites, biopsy samples, surgical samples, cell-containing body fluids, sputum, saliva, stool, urine, cerebrospinal fluid, peritoneal fluid, pleural fluid, lymph, gynecological fluid, secretions, excretions, skin swabs, vaginal swabs, oral swabs, nasal swabs, lavage fluids such as catheter or bronchoalveolar lavage, aspirates, swabs, and the like. In some embodiments, the biological sample is or includes cells obtained from a single subject or from multiple subjects. The sample may be a "primary sample" obtained directly from a biological source, or may be a "treated sample".

As used herein, the term "cancer" is used to refer to a disease or disorder in which cells exhibit abnormal, uncontrolled and/or autonomous growth such that they exhibit an abnormally elevated proliferation rate and/or abnormal growth phenotype. In the present invention, the cancer of interest may be lung cancer.

As used herein, the term "diagnosis" refers to determining, quantitatively and/or qualitatively, the probability that a subject has or is at risk of developing cancer. For example, in the diagnosis of cancer, the diagnosis may include a determination as to the risk, type, stage, malignancy, etc. of the cancer.

As used herein, the term "marker" is consistent with its use in the art and refers to an entity whose presence, level or form is associated with a particular biological event or state of interest, and thus is considered to be a "marker" of that event or state. One of skill in the art will recognize that in the context of a methylation marker, the methylation marker can be or include a locus (e.g., one or more methylation loci) and/or a state of a locus (e.g., a state of one or more methylation loci). The marker may be or include a marker of a particular disease, or may be a marker of a quantitative probability of a particular disease developing, or relapsing in a subject. Methylation markers of the invention may be markers for the prediction, prognosis and/or diagnosis of lung cancer.

As used herein, "DNA region" or "region" refers to any contiguous portion of a larger DNA molecule. In this context, a DNA region refers to a gene of interest and the regions upstream and downstream thereof. "Upstream" of a gene or region refers to the region that is 5' relative to the gene or region. "Downstream" of a gene or region refers to the region that is 3' relative to the gene or region.
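Because upstream and downstream are defined relative to the 5' and 3' ends, their genomic coordinates depend on the strand on which the gene lies. The sketch below derives the two flanking intervals for the 2.2kb flank used herein. It assumes 1-based inclusive coordinates on the reference (plus) strand, and the coordinates in the example are made up rather than taken from any real gene.

```python
def flanking_regions(gene_start, gene_end, strand, flank=2200):
    """Return (upstream, downstream) as 1-based inclusive (start, end) tuples.

    For a plus-strand gene, upstream lies at lower coordinates; for a
    minus-strand gene it lies at higher coordinates. The low-coordinate flank
    is clipped at position 1 (None if the gene starts at position 1); the
    chromosome end is not known here, so the high-coordinate flank is not clipped.
    """
    if gene_start < 1 or gene_start > gene_end:
        raise ValueError("require 1 <= gene_start <= gene_end")
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    low = (max(1, gene_start - flank), gene_start - 1)
    if low[0] > low[1]:
        low = None
    high = (gene_end + 1, gene_end + flank)
    return (low, high) if strand == "+" else (high, low)


# Made-up gene coordinates on each strand.
print(flanking_regions(10000, 20000, "+"))
print(flanking_regions(10000, 20000, "-"))
```

With equal flanks on both sides, the overall marker region (gene plus both flanks) covers the same coordinates regardless of strand; only the upstream and downstream labels swap.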

As used herein, the term "identity" refers to the overall relatedness between nucleic acid molecules (e.g., DNA molecules and/or RNA molecules). Methods for calculating the percent identity between two provided sequences are known in the art. For example, the percent identity of two nucleic acids can be calculated as follows: alignment of the two sequences for optimal comparison purposes (e.g., gaps may be introduced in one or both of the first and second sequences for optimal alignment, and non-identical sequences may be omitted for comparison purposes); then comparing the nucleotides at the corresponding positions; when a position in a first sequence is occupied by the same residue (e.g., nucleotide or amino acid) as the corresponding position in a second sequence, then the molecules are identical at that position. The percent identity between two sequences is a function of the number of identical positions shared by the sequences (considering the number of gaps introduced for optimal alignment and the length of each gap). Comparison of sequences and determination of percent identity between two sequences may be accomplished using a computational algorithm such as BLAST (basic local alignment search tool).
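The identity calculation described above can be sketched for two sequences that have already been aligned, with gaps written as "-". Dividing by the number of alignment columns (including gapped columns) is one common convention and is an assumption here; tools such as BLAST report identity over their own local alignment.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity between two pre-aligned sequences of equal length.

    Gaps are written as '-'. A column counts as identical only when both
    sequences carry the same non-gap residue; columns that are gaps in both
    sequences are ignored.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" and b == "-":
            continue
        columns += 1
        if a == b:
            matches += 1
    if columns == 0:
        raise ValueError("alignment has no informative columns")
    return 100.0 * matches / columns


print(percent_identity("ACGT", "ACGA"))    # one mismatch in four columns
print(percent_identity("AC-GT", "ACGGT"))  # one gap in five columns
```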

As used herein, the term "methylation" includes (i) methylation at the C5 position of cytosine; (ii) methylation at the N4 position of cytosine; (iii) methylation at the N6 position of adenine; and (iv) other types of nucleotide methylation. Methylated nucleotides can be referred to as "methylated nucleotides" or "methylated nucleotide bases". In certain embodiments, methylation as described herein specifically refers to methylation of cytosine residues. In some cases, methylation refers to methylation of cytosine residues present in CpG sites.

As used herein, the term "methylation analysis" refers to any technique that can be used to determine the methylation state or level of a methylation site.

As used herein, the term "methylation marker" refers to a marker of at least one methylation site and/or the methylation state of at least one methylation site (e.g., a hypermethylation site). In particular, the methylation marker is characterized by a methylation state of one or more nucleic acid sites that changes between a first state and a second state (e.g., between a cancerous state and a non-cancerous state).

As used herein, "methylation state" refers to the number, frequency, or pattern of methylation sites within a methylation locus. Thus, the change in methylation state between the first state and the second state may be or include an increase in the number, frequency or pattern of methylation sites, or may be or include a decrease in the number, frequency or pattern of methylation sites. In each case, the change in methylation state is a change in methylation value.

As used herein, the term "methylation value" refers to a numerical representation of methylation status, for example, a number representing the frequency or ratio of methylation of a methylation locus. In some cases, the methylation value can be generated by a method comprising quantifying the amount of intact nucleic acid present in the sample after restriction digestion of the sample with a methylation-dependent restriction endonuclease. In some cases, the methylation value can be generated by a method comprising comparing amplification profiles of samples after bisulfite reaction. In some cases, methylation values can be generated by comparing the sequences of bisulfite-treated and untreated nucleic acids. In some cases, the methylation value is, includes, or is based on a quantitative PCR result. Herein, methylation level represents the proportion of one or more sites that are in a methylated state. The methylation level of a region (or group of sites) is the average of the methylation levels of all sites in the region (or all sites in the group). Thus, an increase or decrease in the methylation level of a region does not imply an increase or decrease in the methylation level of every methylation site in the region. The process of converting the results obtained by methods for detecting DNA methylation (e.g., reduced representation bisulfite sequencing) into methylation levels is known in the art. For example, the methylation level of CpG sites can be obtained using the software Bismark (v0.17.0).
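The relationship between site-level and region-level methylation described above can be illustrated with a short sketch. It assumes, for illustration only, that each site is summarized by a pair of read counts (methylated reads, total reads); the function names and the example counts are hypothetical.

```python
from statistics import fmean


def site_level(methylated_reads: int, total_reads: int) -> float:
    """Methylation level of one site: fraction of reads that are methylated."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return methylated_reads / total_reads


def region_level(site_counts: list[tuple[int, int]]) -> float:
    """Methylation level of a region: mean of the levels of its sites."""
    return fmean(site_level(m, t) for m, t in site_counts)


# Hypothetical counts: the region mean rises from 0.5 to 0.6 in the second
# state even though the last site falls, so a regional increase does not
# mean that every site increased.
state_a = [(90, 100), (10, 100), (50, 100)]
state_b = [(100, 100), (40, 100), (40, 100)]
print(region_level(state_a), region_level(state_b))
```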
Methods of detecting DNA methylation are known in the art and include, but are not limited to, bisulfite conversion-based PCR (e.g., methylation-specific PCR (MSP)), DNA sequencing (e.g., bisulfite sequencing (BS), whole-genome bisulfite sequencing (WGBS), reduced representation bisulfite sequencing (RRBS)), methylation-sensitive restriction enzyme analysis (Methylation-Sensitive Dependent Restriction Enzymes), fluorescent quantitation, methylation-sensitive high-resolution melting (MS-HRM), chip-based methylation profile analysis or mass spectrometry (e.g., time-of-flight mass spectrometry), and massively parallel sequencing techniques (e.g., next generation sequencing techniques), such as sequencing by synthesis, real-time (e.g., single molecule) sequencing, bead emulsion sequencing, nanopore sequencing, and the like. In one or more embodiments, detecting includes detecting either strand at a gene or site. DNA methylation can also be detected using reduced representation bisulfite sequencing (RRBS). RRBS is a technique that uses restriction enzymes to cleave the genome, treats the fragments with bisulfite, and sequences the CpG-rich regions of the genome. For example, reagents used for RRBS include: a plasma nucleic acid purification kit, ligase, bisulfite and derivatives thereof, dNTPs, polymerase, primers, nuclease-free water and/or magnetic beads, etc.

As used herein, "specificity" of a marker refers to the percentage of samples characterized by the absence of an event or state of interest for which measurement of the marker accurately indicates the absence of the event or state of interest (true negative rate). In various embodiments, the characterization of the negative sample is independent of the marker and may be accomplished by any relevant measurement, such as any relevant measurement known to those of skill in the art. Thus, specificity reflects the probability that the marker will detect the absence of an event or state of interest when measured in a sample that is not characterized by the event or state of interest. In certain embodiments in which the event or state of interest is lung cancer, specificity refers to the probability that a marker will detect the absence of lung cancer in a subject lacking lung cancer. The absence of lung cancer may be determined, for example, by histology.

As used herein, "sensitivity" of a marker refers to the percentage of a sample characterized by the presence of an event or state of interest, wherein measurement of the marker accurately indicates the presence of the event or state of interest (true positive rate). In various embodiments, the characterization of the positive sample is independent of the marker and may be accomplished by any relevant measurement, such as any relevant measurement known to those of skill in the art. Thus, sensitivity reflects the probability that a marker will detect the presence of an event or state of interest when measured in a sample characterized by the presence of the event or state of interest. In particular embodiments in which the event or state of interest is lung cancer, sensitivity refers to the probability that a marker will detect the presence of lung cancer in a subject having lung cancer. The presence of lung cancer may be determined, for example, by histology.
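The two definitions above reduce to simple counts over samples whose true state is known independently of the marker. The following is a minimal sketch, assuming labels are coded 1 for lung cancer and 0 otherwise; the function name and the example labels are hypothetical.

```python
def sensitivity_specificity(truth: list[int], predicted: list[int]) -> tuple[float, float]:
    """Return (sensitivity, specificity) for binary labels coded 1/0."""
    pairs = list(zip(truth, predicted))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative sample")
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical example: 4 lung cancer samples and 5 non-lung-cancer samples.
truth = [1, 1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 1]
print(sensitivity_specificity(truth, predicted))  # (0.75, 0.8)
```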

The term "subject" as used herein refers to an organism, typically a mammal (e.g., a human). In one embodiment, the subject has cancer. In one embodiment, the subject has lung cancer.

Nucleic acid isolated from lung cancer patients

The invention provides isolated nucleic acids that are isolated from a sample of a subject. For example, the isolated nucleic acid is isolated from free DNA in the plasma of a lung cancer patient. The isolated nucleic acid is one or more specific methylation markers, preferably lung cancer tissue-specific methylation markers. The methylation markers are the following regions, or sites within those regions, each region consisting of one of the following genes together with the 2.2 kb upstream region and the 2.2 kb downstream region on the chromosome in which the gene is located: gene ARHGEF16; gene CASZ1; gene MAP3K6; gene TRIM58; gene ARHGEF33; gene PSD4; gene HOXD4; gene SLC12A8; gene DGKG; gene TERT; gene NR2F1; gene PCDHGC5; gene KCNMB1; gene FOXC1; gene HIST1H4F; gene TYW; gene LRRC4; gene DGKI; gene PDLIM2; gene RHOBTB2; gene TMEM75; gene OPLAH; gene NR5A1; gene SPAG6; gene WAPAL; gene BTBD16; gene DPYSL4; gene TTC40; gene ADAM8; gene SLC22A11; gene CPT1A; gene B4GALNT1; gene FBRSL1; gene XPO4; gene TFDP1; gene GCH1; gene TMEM179; gene ITPKA; gene SOX8; gene SLC9A3R2; gene SEPT-9; gene MBP; gene NFATC1; gene DNM2; gene RASAL3; gene TAF4; gene NTSR1; gene SLC17A9. The site is a methylation site. It will be appreciated by those skilled in the art that mutations may be present in the genes of the genome, and thus it is contemplated that variants of these genes may also serve as methylation markers, provided that the methylation sites in the variants are not mutated. A variant may comprise a sequence having at least 70% identity to the sequence of any of the genes. The site selected as a marker may comprise 1 or more CpGs, for example 2 CpGs, 3 CpGs, 4 CpGs, 5 CpGs, 6 CpGs, 10 CpGs, 20 CpGs or 30 CpGs. Suitable sites may be 150bp to 500bp in length. For example, the length of the site may be 160bp, 170bp, 180bp, 190bp, 200bp, 210bp, 220bp, 230bp, 240bp, 250bp, 260bp, 270bp, 280bp, 290bp, 300bp, 310bp, 320bp, 330bp, 340bp, 350bp, 360bp, 370bp, 380bp, 390bp, 400bp, 410bp, 420bp, 430bp, 440bp, 450bp, 460bp, 470bp, 480bp, 490bp or 500bp.

Those skilled in the art understand that a gene has the same or similar methylation levels or status as the regions upstream and downstream thereof. Thus, when the inventors found a methylation site within a particular gene, it was contemplated that the gene together with its 2.2kb upstream region and 2.2kb downstream region on the chromosome also possessed the same or similar methylation levels or status. The invention also encompasses the gene of the invention together with the 1.9kb, 1.8kb, 1.7kb, 1.6kb, 1.5kb, 1.4kb, 1.3kb, 1.2kb, 1.1kb, 1kb, 900bp, 800bp, 700bp, 600bp, 500bp, 400bp, 300bp, 200bp, 100bp, 90bp, 80bp, 70bp, 60bp, 50bp, 40bp, 30bp, 20bp, 10bp or 5bp upstream and downstream regions on the chromosome in which the gene is located.

In this context, the following nucleotide sequences are used as methylation markers in the present invention.

The coordinates of the chromosomal locations are determined with reference to the human whole genome sequence hg19. Based on the methylation markers screened for lung cancer tissue specificity and the genes in which they reside, one skilled in the art will appreciate that loci located within, or upstream or downstream of, any of the following genes can be used as methylation markers: ARHGEF16; CASZ1; MAP3K6; TRIM58; ARHGEF33; PSD4; HOXD4; SLC12A8; DGKG; TERT; NR2F1; PCDHGC5; KCNMB1; FOXC1; HIST1H4F; TYW; LRRC4; DGKI; PDLIM2; RHOBTB2; TMEM75; OPLAH; NR5A1; SPAG6; WAPAL; BTBD16; DPYSL4; TTC40; ADAM8; SLC22A11; CPT1A; B4GALNT1; FBRSL1; XPO4; TFDP1; GCH1; TMEM179; ITPKA; SOX8; SLC9A3R2; SEPT-9; MBP; NFATC1; DNM2; RASAL3; TAF4; NTSR1; SLC17A9. A single methylation marker, or a combination of several methylation markers, may be used as the lung cancer specific methylation marker. In one embodiment, the methylation marker is within the region from 2kb upstream to 2kb downstream of any of the genes described above.

Andy Feinberg, a pioneer in the field of epigenetics, has pointed out that most methylation changes in colon cancer occur neither in promoters nor in CpG islands, but in the sequences within 2kb of the islands, which are termed "CpG island shores" (Andy Feinberg et al., 2009). CpG island shore methylation is closely related to gene expression, is highly conserved in mammals, and can distinguish tissue types. In subsequent studies, researchers have found this phenomenon not only in colorectal cancer, but also in breast cancer, gastric cancer and bladder cancer, and, in some tissue types, in the vicinity of these target methylation sites (Guo YL et al., 2016; Rao X et al., 2013; Dudziec E et al., 2011; Chae H et al., 2016). Therefore, protection of these neighboring regions is as important as protection of the target regions.

Kit for diagnosing lung cancer

Based on the methylation markers of the present invention, one skilled in the art can prepare kits or devices that detect the methylation level or status of these markers for diagnosing lung cancer, or for distinguishing lung cancer from other cancer types in a pan-cancer setting. The kit or device may comprise reagents or components to detect the status and/or level of one or more lung cancer tissue-specific methylation markers in nucleic acid from the sample. For example, the reagent or component may comprise a reagent or component used in one or more of the following methods: bisulfite conversion-based PCR, DNA sequencing, methylation-sensitive restriction enzyme analysis, fluorescence quantification, methylation-sensitive high resolution melting curve, and chip-based methylation profile analysis and mass spectrometry. The reagent may comprise an oligonucleotide for detecting a methylation marker. For example, the oligonucleotide is a primer and/or probe. Preferably, the primer is a primer for detecting the methylation level/state of a site using methylation sequencing or a PCR primer for amplifying one or more methylation sites. Preferably, the reagent comprises bisulfite and derivatives thereof, PCR buffers, polymerase, dNTPs, primers, probes, methylation-sensitive or -insensitive restriction enzymes, cleavage buffers, fluorescent dyes, fluorescence quenchers, fluorescence reporters, exonucleases, alkaline phosphatase, internal standards, and/or controls, the controls being the aforementioned specific methylation markers from normal subjects or from patients with cancers other than lung cancer. Preferably, the non-lung cancer is colorectal cancer, liver cancer, gastric cancer, esophageal cancer, pancreatic cancer and/or breast cancer.

Methods for diagnosing lung cancer

The present invention provides a method of diagnosing lung cancer in a subject comprising: (1) determining the methylation status or level of one or more lung cancer tissue-specific methylation markers of the invention in a sample of the subject; and (2) identifying lung cancer based on the determined methylation status or level of the lung cancer tissue-specific methylation markers. In one embodiment, the subject is a cancer patient or a subject at risk for cancer. In one embodiment, the non-lung cancer is colorectal cancer, liver cancer, gastric cancer, esophageal cancer, pancreatic cancer, and/or breast cancer. In one embodiment, the sample is a cell, tissue, fine needle biopsy, or plasma. In one embodiment, the method of obtaining the methylation level data may be any suitable method of determining the methylation level of a nucleic acid sequence, such as bisulfite conversion-based PCR, DNA sequencing, methylation-sensitive restriction enzyme analysis, fluorometry, methylation-sensitive high resolution melting curve, and chip-based methylation profile analysis and mass spectrometry.

The present invention also provides a method for diagnosing lung cancer, comprising: (1) detecting in a sample from a subject the methylation level of a sequence described herein; (2) comparing it with that of a control sample, or calculating a score; and (3) identifying lung cancer in the subject based on the comparison or the score. Typically, the method further comprises, prior to step (1): extracting the sample DNA and converting unmethylated cytosines in the DNA to bases that do not pair with guanine. In one or more embodiments, the subject sample has an elevated or reduced methylation level when compared to a control sample. When the methylation level meets a certain threshold, lung cancer is identified. Alternatively, a mathematical analysis of the methylation levels of the measured genes is performed to obtain a score. For the detected sample, when the score is greater than the threshold, the result is judged to be lung cancer; otherwise, the result is negative, namely a cancer other than lung cancer. Conventional methods of mathematical analysis and processes for determining thresholds are known in the art.

The invention also provides a method comprising: (1) obtaining the methylation level of a methylation marker described herein in genomic DNA of a lung cancer sample and a non-lung cancer sample; and (2) constructing a logistic regression machine learning model using the methylation level data of the methylation markers. The sample may be cells, tissue, fine needle biopsy or plasma. Genomic DNA may be free DNA in plasma. Step (1) may include obtaining methylation sequencing data of sample DNA by a method that includes MethylTitan, and step (2) may include using the logistic regression model in the sklearn (V1.0.1) package in python (V3.9.7): AllModel = LogisticRegression(). The formula of the model is as follows, wherein x is the methylation level value of a sample target marker, w is the coefficient of a methylation marker, b is the intercept value, and y is the model predictive value:

$$y = \frac{1}{1 + e^{-(w x + b)}}$$

Training uses the methylation levels of the obtained methylation markers as a training set: AllModel.fit(TrainData, TrainPheno), wherein TrainData is the data of the training set and TrainPheno is the phenotype of the training set samples, wherein lung cancer is 1 and other cancer types are 0, and the relevant threshold of the model is determined according to the training set samples. The method further comprises using the methylation level of the methylation marker in the genomic DNA of the sample to be tested as a test set: TestPred = AllModel.predict_proba(TestData)[:, 1], wherein TestData is the test set data and TestPred is the model predictive score. Whether a sample is lung cancer is judged by comparing the predictive score with the threshold: a score greater than the threshold is predicted to be lung cancer, otherwise another cancer type. The method may be used (1) to distinguish between a lung cancer patient and a non-lung cancer patient; (2) to diagnose or aid in diagnosing lung cancer; or (3) to trace lung cancer as the tissue of origin in the process of pan-cancer screening.
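The scoring and thresholding step described above can be sketched in plain Python. This is only an illustration of how a fitted logistic regression turns methylation levels into a score: the coefficients, intercept and threshold below are hypothetical placeholders rather than values from the invention, and the function names are not part of sklearn.

```python
import math


def logistic_score(x: list[float], w: list[float], b: float) -> float:
    """Predicted probability y = 1 / (1 + exp(-(w.x + b))) for one sample."""
    if len(x) != len(w):
        raise ValueError("x and w must have the same length")
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def classify(score: float, threshold: float) -> int:
    """Return 1 (lung cancer) if the score exceeds the threshold, else 0."""
    return 1 if score > threshold else 0


# Hypothetical fitted parameters for two markers, and one test sample.
w, b, threshold = [2.0, -1.0], -0.5, 0.5
sample = [0.8, 0.1]
score = logistic_score(sample, w, b)
print(score, classify(score, threshold))
```

With a fitted sklearn model, the same score corresponds to the second column returned by `predict_proba`.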

System or device for diagnosing lung cancer

The invention also provides a system or a device. The system or device may include a computer-readable storage medium or memory for storing programs or instructions. The programs or instructions are for executing a predictive model of the invention that distinguishes lung cancer from non-lung cancers, or for executing a method of the invention. Computer-readable storage media or memory include, but are not limited to, tangible storage media, carrier wave media, and physical transmission media. Nonvolatile storage media include, for example, optical or magnetic disks, such as any storage devices in any computer or the like; volatile storage media include dynamic memory, such as the main memory of a computer platform. Tangible transmission media include coaxial cables, copper wire and fiber optics, including the wires that comprise a bus within a computer system. Carrier wave transmission media can take the form of electrical or electromagnetic signals, or acoustic or light waves, such as those generated during radio frequency and infrared data communications. Thus, common forms of computer-readable media include, for example: a floppy disk, a flexible disk, hard disk, magnetic tape, any other magnetic medium, a CD-ROM, DVD or DVD-ROM, any other optical medium, punch cards, paper tape, any other physical storage medium with patterns of holes, RAM, ROM, PROM and EPROM, FLASH-EPROM, any other memory chip or cartridge, a carrier wave transporting data or instructions, a cable or link transporting such a carrier wave, or any other medium from which a computer can read programming code and/or data. Many of these forms of computer-readable media may be involved in carrying one or more sequences of one or more instructions to a processor for execution. The memory and processor may be physically separate. In this case, the operative connection may be realized via a wired and/or wireless connection between the units that allows data transmission. The wireless connection may use a wireless LAN (WLAN) or the internet. The wired connection may be made through optical or non-optical cable connections between the units. The cable for the wired connection is preferably suitable for high-throughput data transmission.

Use for diagnosing lung cancer

The invention also provides the use of an isolated nucleic acid or reagent or component in the preparation of a kit or device for (1) distinguishing between a lung cancer patient and a non-lung cancer patient; (2) diagnosing or aiding in diagnosing lung cancer; or (3) tracing lung cancer as the tissue of origin in the process of pan-cancer screening. Preferably, the non-lung cancer is colorectal cancer, liver cancer, gastric cancer, esophageal cancer, pancreatic cancer and/or breast cancer. The kit or device may contain reagents for determining the methylation level in a variety of available ways.

Examples

The invention will now be described in further detail with reference to the drawings and to specific examples. In the following examples, experimental procedures for which no specific conditions are given were generally carried out under conventional conditions.

Example 1: screening of lung cancer specific methylation sites by methylation targeted sequencing

The inventors collected plasma samples from a total of 490 patients across the cancer types; all enrolled patients signed informed consent. The samples were divided into a training set and a test set in a fixed proportion, wherein the training set was used for constructing a machine learning model and the test set was used for testing the performance of the model. Sample information is shown in Table 1 below: there are 51 lung cancer samples in the training set and 20 lung cancer samples in the test set.

TABLE 1 statistical table of the number of plasma samples for each cancer species

Methylation sequencing data of the plasma cfDNA of the target samples were obtained using the MethylTitan™ assay developed by the applicant, and DNA methylation classification markers were identified. The process is as follows:

1. Extraction of plasma cfDNA samples

2ml whole blood samples of patients were collected in Streck blood collection tubes. Plasma was separated by centrifugation in a timely manner (within 3 days) and transported to the laboratory, after which cfDNA was extracted with the QIAGEN QIAamp Circulating Nucleic Acid Kit according to the manufacturer's instructions.

2. Sequencing and data preprocessing

a) The library was subjected to paired-end sequencing using an Illumina NextSeq 500 sequencer.

b) PEAR (v0.6.0) software was used to merge the paired-end reads of the same fragment, sequenced at 150bp from both ends on an Illumina HiSeq X10/NextSeq 500/NovaSeq sequencer, into a single sequence, with a minimum overlap length of 20bp and a minimum merged length of 30bp.

c) The merged sequencing data were subjected to adapter trimming using trim_galore v0.6.0 and cutadapt v1.8.1 software. The adapter sequence "AGATCGGAAGAGCAC" was removed from the 5' end of the sequences, and bases with a sequencing quality value below 20 were removed from both ends.
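The end-quality trimming in step c) can be illustrated with a deliberately simplified sketch. It strips bases whose quality is below the cutoff from each end until a base at or above the cutoff is reached; this is an assumption made for illustration and is not the algorithm that trim_galore or cutadapt actually implement. The function name and the example read are hypothetical.

```python
def trim_low_quality_ends(seq: str, quals: list[int], min_q: int = 20) -> tuple[str, list[int]]:
    """Remove bases with quality < min_q from both ends of a read.

    Simplified stand-in: interior low-quality bases are kept.
    """
    if len(seq) != len(quals):
        raise ValueError("seq and quals must have the same length")
    start, end = 0, len(seq)
    while start < end and quals[start] < min_q:
        start += 1  # advance past low-quality bases at the 5' end
    while end > start and quals[end - 1] < min_q:
        end -= 1  # retreat past low-quality bases at the 3' end
    return seq[start:end], quals[start:end]


# Hypothetical read: the first and last bases fall below Q20 and are removed;
# the interior Q15 base is kept.
print(trim_low_quality_ends("ACGTAC", [10, 30, 30, 15, 30, 5]))
```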

3. Sequencing data alignment

The reference genome data used herein are from the UCSC database (UCSC: HG19, http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz).

a) HG19 was first subjected to cytosine-to-thymine (CT) conversion and guanine-to-adenine (GA) conversion, respectively, using Bismark software, and the converted genomes were each indexed using Bowtie2 software.

b) CT and GA conversion were also performed on the preprocessed sequencing data from the Illumina NextSeq 500 sequencer.

c) The transformed sequences were aligned to the transformed HG19 reference genome using Bowtie2 software, respectively, with a minimum seed sequence length of 20, the seed sequence not allowing for mismatches.
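The in-silico conversions used in steps a) and b) amount to simple base substitutions. The sketch below illustrates the idea only; the function names are hypothetical, and Bismark performs these conversions internally as part of a larger workflow.

```python
def c_to_t(seq: str) -> str:
    """CT conversion: replace every cytosine with thymine."""
    return seq.upper().replace("C", "T")


def g_to_a(seq: str) -> str:
    """GA conversion: replace every guanine with adenine."""
    return seq.upper().replace("G", "A")


read = "ACGTCG"
print(c_to_t(read))  # ATGTTG
print(g_to_a(read))  # ACATCA
```

Because both the reads and the reference are converted in the same way, alignment is not affected by whether a given cytosine was methylated; the methylation state is recovered afterwards by comparing the original read with the original reference.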

4. Calculation of Methylation Haplotype Frequencies (MHF)

For the CpG sites of each target region in HG19, the methylation state corresponding to each site was obtained from the alignment results. The nucleotide numbering of the sites herein corresponds to the nucleotide position numbering of HG19. There may be multiple methylation haplotypes for a target methylation region, and the MHF value is calculated for each methylation haplotype within the target region according to the following formula:

$$\mathrm{MHF}_{i,h} = \frac{N_{i,h}}{N_i}$$

wherein i represents a target methylation interval, h represents a target methylation haplotype, $N_i$ represents the number of reads located in the target methylation interval, and $N_{i,h}$ represents the number of reads comprising the target methylation haplotype.
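The MHF definition above can be sketched for a single target interval. The example assumes, for illustration, that each read covering the interval has already been reduced to a haplotype string with one character per CpG ("C" for methylated, "T" for unmethylated); this encoding and the function name are hypothetical.

```python
from collections import Counter


def methylation_haplotype_frequencies(read_haplotypes: list[str]) -> dict[str, float]:
    """MHF for one interval: reads carrying haplotype h divided by all reads."""
    n_i = len(read_haplotypes)  # number of reads in the interval
    if n_i == 0:
        return {}
    counts = Counter(read_haplotypes)  # reads per haplotype
    return {h: n_ih / n_i for h, n_ih in counts.items()}


# Four hypothetical reads over an interval with three CpG sites.
reads = ["CCC", "CCC", "CTC", "TTT"]
print(methylation_haplotype_frequencies(reads))
# {'CCC': 0.5, 'CTC': 0.25, 'TTT': 0.25}
```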

5. Methylation data matrix

a) The methylation sequencing data (methylation haplotype frequencies) of the samples of the training set and of the test set were each combined into a data matrix, and every site with a depth lower than 200 was treated as a missing value.

b) Sites with a missing value ratio higher than 10% were removed.

c) The remaining missing values in the data matrix were imputed using the KNN algorithm.
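Steps a) and b) above can be sketched as follows. The sketch assumes a samples-by-sites matrix stored as nested lists with `None` marking a missing value, and it omits the KNN imputation of step c); the function names and the tiny example matrices are hypothetical.

```python
from typing import Optional

Matrix = list[list[Optional[float]]]


def mask_low_depth(values: list[list[float]], depths: list[list[int]], min_depth: int = 200) -> Matrix:
    """Set entries whose sequencing depth is below min_depth to None."""
    return [
        [v if d >= min_depth else None for v, d in zip(v_row, d_row)]
        for v_row, d_row in zip(values, depths)
    ]


def drop_sparse_sites(matrix: Matrix, max_missing: float = 0.10) -> tuple[Matrix, list[int]]:
    """Remove site columns whose fraction of missing samples exceeds max_missing."""
    if not matrix:
        return [], []
    n_samples = len(matrix)
    keep = [
        j for j in range(len(matrix[0]))
        if sum(row[j] is None for row in matrix) / n_samples <= max_missing
    ]
    return [[row[j] for j in keep] for row in matrix], keep


# Two samples, two sites: the second site of sample 1 has depth 100 (< 200),
# so it becomes missing, and that site (50% missing) is then removed.
values = [[0.1, 0.2], [0.3, 0.4]]
depths = [[250, 100], [300, 500]]
masked = mask_low_depth(values, depths)
print(drop_sparse_sites(masked))  # ([[0.1], [0.3]], [0])
```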

6. Finding out lung cancer tissue specific methylation markers according to training set samples

a) The AUC of each methylation haplotype marker for distinguishing lung cancer from the other cancer types was calculated in the training set, the markers were sorted by AUC from high to low, and the methylation markers that best distinguish lung cancer from the other cancer types were selected as candidate markers;

b) A logistic regression model was constructed in the training set using the methylation markers selected in the previous step, and the effect of the model was then verified using the test set samples. These steps are mainly based on the LogisticRegression function of the linear_model module of the python3 sklearn package, and the specific steps are as follows:

1. The training set data were standardized using StandardScaler and the fitted standardization transform was stored. The formula is x' = (x - μ)/σ, wherein μ is the mean value of all sample data and σ is the standard deviation of all sample data;

2. The standardized data were input into the logistic regression function, and a logistic regression model was trained;

3. The stored standardization transform was applied to the test set data to standardize the test set;

4. The trained logistic regression model was applied to the test set samples for testing.
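The key point of steps 1 and 3 above is that the mean and standard deviation are estimated on the training set only and then reused unchanged on the test set. The sketch below shows this for a single marker in plain Python as a stand-in for sklearn's StandardScaler; it uses the population standard deviation and falls back to a scale of 1 for a constant column, the function names are hypothetical, and the example values are made up.

```python
from statistics import fmean, pstdev


def fit_scaler(train_column: list[float]) -> tuple[float, float]:
    """Estimate mean and (population) standard deviation on training data."""
    mu = fmean(train_column)
    sigma = pstdev(train_column)
    return mu, (sigma if sigma > 0 else 1.0)  # avoid division by zero


def transform(column: list[float], mu: float, sigma: float) -> list[float]:
    """Apply x' = (x - mu) / sigma with previously fitted parameters."""
    return [(x - mu) / sigma for x in column]


# Hypothetical methylation levels of one marker.
train = [0.2, 0.4, 0.6]
test = [0.4, 0.8]
mu, sigma = fit_scaler(train)          # fitted on the training set only
train_std = transform(train, mu, sigma)
test_std = transform(test, mu, sigma)  # same mu and sigma reused
print(train_std, test_std)
```

Reusing the training-set parameters keeps the test set strictly held out, so the reported test performance is not influenced by test-set statistics.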

The methylation levels of these methylation markers in lung cancer and the other 6 cancer types are shown in Table 2 below and in Figures 1 and 2. In both the training set and the test set, the methylation levels of these markers differ significantly between lung cancer and the other cancer types (U test, p-value less than 0.05), and the magnitude of the difference in methylation level is also large.

TABLE 2 methylation level mean of methylation markers in lung cancer and other 6 cancer species in training and test sets

Taking the single lung cancer tissue-specific methylation marker Seq ID NO: 1 as an example, the distributions of its methylation level across the seven cancer types in the training set and the test set are shown in Figures 3 and 4, respectively. It can be seen that the methylation level of this marker differs significantly in lung cancer compared with the other 6 cancer types (Wilcoxon test: P ≤ 0.05), making it a good lung cancer tissue-specific methylation marker.

Example 2: discrimination performance of single lung cancer tissue-specific methylation markers

To verify the potential of a single lung cancer tissue-specific methylation marker to differentiate lung cancer from the other 6 cancer types, a model was trained on the training set data of Example 1 using the methylation level data of a single lung cancer tissue-specific methylation marker, and the performance of the model was verified using the test set samples, as follows:

1. The logistic regression model in the sklearn (V1.0.1) package in python (V3.9.7) was used: AllModel = LogisticRegression(). The formula of the model is as follows, where x is the methylation level value of the sample target lung cancer tissue-specific methylation marker, w is the coefficient of the different markers, b is the intercept value, and y is the model predictive value:

$$y = \frac{1}{1 + e^{-(w x + b)}}$$

2. Training was performed using the samples of the training set: AllModel.fit(TrainData, TrainPheno), wherein TrainData is the data of the target methylation sites in the training set samples and TrainPheno is the phenotype of the training set samples (lung cancer is 1, other cancer types are 0), and the relevant threshold of the model was determined according to the samples of the training set.

3. Testing was performed using the samples of the test set: TestPred = AllModel.predict_proba(TestData)[:, 1], where TestData is the data of the target methylation sites in the test set samples and TestPred is the model predictive score, which is used to determine whether the sample is lung cancer according to the above-mentioned threshold.

4. The AUC of the model was calculated, and metrics such as sensitivity, specificity and accuracy were calculated at the determined threshold.
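The AUC in step 4 can be computed directly from the prediction scores and the true labels. The sketch below uses the pairwise definition (the fraction of lung cancer versus non-lung-cancer pairs in which the lung cancer sample scores higher, with ties counted as one half) rather than a library routine; the function name and the example scores are hypothetical.

```python
def auc_from_scores(scores: list[float], labels: list[int]) -> float:
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for s, y in zip(scores, labels) if y == 1]  # lung cancer
    neg = [s for s, y in zip(scores, labels) if y == 0]  # other cancer types
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties contribute one half
    return wins / (len(pos) * len(neg))


labels = [1, 1, 0, 0]
print(auc_from_scores([0.9, 0.8, 0.2, 0.1], labels))  # 1.0, perfect separation
print(auc_from_scores([0.5, 0.1, 0.4, 0.3], labels))  # 0.5, no better than chance
```

The pairwise form is quadratic in the number of samples, which is acceptable for cohorts of a few hundred samples.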

The performance of the logistic regression model for each single lung cancer tissue-specific methylation marker in this example is shown in Table 3. It can be seen that all lung cancer tissue-specific methylation markers reach an AUC above 0.67 and an accuracy above 0.58 in both the test set and the training set, and are all good lung cancer tissue-specific markers. Among them, excellent markers such as Seq ID NO: 45, Seq ID NO: 23 and Seq ID NO: 42 reach a sensitivity above 75% at a specificity above 80% in the test set, with an overall accuracy above 80%.

TABLE 3 Performance of the logistic regression models for single lung cancer tissue-specific methylation markers

Example 3: machine learning model for all target lung cancer tissue-specific methylation markers

In this example, a logistic regression machine learning model was constructed using the methylation levels of all 48 lung cancer tissue-specific methylation markers to accurately distinguish lung cancer samples from multiple cancer species data. The specific procedure is consistent with example 2 except that the relevant sample brings data for all 48 methylation markers of interest. The method comprises the following steps:

1. The logistic regression model in the sklearn (V1.0.1) package in Python (V3.9.7) is used: AllModel = LogisticRegression(). The formula of the model is as follows, where x is the methylation level value of the target methylation markers of the sample, w is the coefficient of the different methylation markers, b is the intercept value (w and b are obtained by training the logistic regression model), and y is the model predictive score:

$$y = \frac{1}{1 + e^{-(w^{\top}x + b)}}$$

2. Training is performed using the samples of the training set: AllModel.fit(TrainData, TrainPheno), where TrainData is the training set data (methylation haplotype frequency) and TrainPheno is the phenotype of the training set samples (lung cancer is 1, other cancer species are 0). The relevant threshold of the model is determined from the training set samples.

3. Testing is performed using the samples of the test set: TestPred = AllModel.predict_proba(TestData)[:, 1], where TestData is the test set data (methylation haplotype frequency) and TestPred is the model predictive score, which is compared against the above threshold to determine whether the sample is lung cancer.

The distribution of model predictive scores in the training set and the test set is shown in Fig. 5, from which it can be seen that the model scores of lung cancer and of other cancer species differ significantly (Wilcoxon test: P <= 0.05). The ROC curve is shown in Fig. 6. In the test set, the AUC for distinguishing lung cancer from other cancer species reaches 0.903. With the threshold set to 0.336, a sample scoring above the threshold is predicted as lung cancer and otherwise as another cancer species; at a specificity of 94.7%, the sensitivity reaches 80.0% and the overall prediction accuracy reaches 85.0%, so lung cancer samples can be well distinguished within the samples of the 7 cancer species.
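The threshold rule and the reported indexes can be illustrated with a short sketch. The labels and scores below are hypothetical values chosen for illustration only; the threshold 0.336 is the one stated above, and the helper names `classify` and `metrics` are assumptions.

```python
def classify(scores, threshold):
    """Predict lung cancer (1) when the score exceeds the threshold, else other cancer species (0)."""
    return [1 if s > threshold else 0 for s in scores]

def metrics(labels, predictions):
    """Return sensitivity, specificity and accuracy for binary labels (1 = lung cancer)."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    return {
        "sensitivity": tp / (tp + fn),  # fraction of lung cancer samples detected
        "specificity": tn / (tn + fp),  # fraction of other cancers correctly rejected
        "accuracy": (tp + tn) / len(pairs),
    }

# Hypothetical labels and model predictive scores (not data from this example).
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [0.91, 0.72, 0.40, 0.35, 0.20,
          0.50, 0.30, 0.25, 0.10, 0.05, 0.12, 0.08, 0.22, 0.15, 0.02]

predictions = classify(scores, threshold=0.336)
result = metrics(labels, predictions)
print(result)
```

Because the non-lung samples usually outnumber the lung cancer samples, the overall accuracy lies between the sensitivity and the specificity and is pulled toward the specificity.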

Example 4: Lung cancer tissue-specific methylation marker combination 1 machine learning model

To verify the effect of combinations of the lung cancer tissue-specific methylation markers, this example randomly selected 10 of the 48 lung cancer tissue-specific methylation markers and constructed a new machine learning model based on the methylation level data of Seq ID NO:2, Seq ID NO:5, Seq ID NO:9, Seq ID NO:13, Seq ID NO:24, Seq ID NO:37, Seq ID NO:39, Seq ID NO:41, Seq ID NO:46 and Seq ID NO:48.
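A random marker subset of this kind could be drawn and applied as in the sketch below. It is an illustration under stated assumptions: the seed, the helper names `pick_markers` and `subset_columns`, and the placeholder methylation values are hypothetical, and the draw is not expected to reproduce the 10 markers listed above.

```python
import random

# The 48 marker identifiers used in this document.
ALL_MARKERS = [f"Seq ID NO:{i}" for i in range(1, 49)]

def pick_markers(k, seed):
    """Draw k distinct markers at random and return them in numeric order."""
    rng = random.Random(seed)
    chosen = rng.sample(ALL_MARKERS, k)
    return sorted(chosen, key=lambda name: int(name.split(":")[1]))

def subset_columns(samples, selected):
    """Keep only the selected markers for each sample (dict of marker -> level)."""
    return [[sample[m] for m in selected] for sample in samples]

selected = pick_markers(10, seed=0)

# Placeholder sample: every marker has methylation level 0.5 (hypothetical).
sample = {m: 0.5 for m in ALL_MARKERS}
matrix = subset_columns([sample], selected)
print(selected)
```

The resulting rows can then be passed to the same training and testing steps as in Example 2, with 10 columns instead of 48.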

The machine learning model was constructed in the same way as in Example 2, except that the samples used only the data of the 10 lung cancer tissue-specific methylation markers of this example. The model scores in the training set and the test set are shown in Fig. 7, and the ROC curve of the model is shown in Fig. 8. In both the training set and the test set, the scores of lung cancer samples differ significantly from those of other cancer species (Wilcoxon test: P <= 0.05), and the AUC of the model in the test set reaches 0.895. With the threshold set to 0.226, a sample scoring above the threshold is predicted as lung cancer and one scoring below it as another cancer species; at a specificity of 88.7%, the sensitivity reaches 80.0% and the overall accuracy reaches 87.7%, illustrating the good performance of the combined model.
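The AUC and the rank-based test quoted above are closely related: the AUC equals the probability that a randomly chosen lung cancer sample scores higher than a randomly chosen sample of another cancer species, which is the Mann-Whitney U statistic divided by the number of sample pairs. The sketch below computes it directly; the scores are hypothetical and the function name `auc_from_scores` is an assumption.

```python
def auc_from_scores(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical model predictive scores.
lung_scores = [0.9, 0.8, 0.4]
other_scores = [0.5, 0.3, 0.2, 0.1]

auc = auc_from_scores(lung_scores, other_scores)
print(f"AUC = {auc:.4f}")
```

An AUC of 0.5 corresponds to scores that do not separate the two groups, and 1.0 to perfect separation; the pairwise loop is quadratic in sample count, which is acceptable for cohorts of this size.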

Example 5: Lung cancer tissue-specific methylation marker combination 2 machine learning model

This example uses another combination of lung cancer tissue-specific methylation markers: a machine learning model was constructed from a total of 5 lung cancer tissue-specific methylation markers, Seq ID NO:24, Seq ID NO:36, Seq ID NO:41, Seq ID NO:43 and Seq ID NO:46.

The model was constructed in the same way as in Example 2, except that the samples used only the data of the 5 markers of this example. The model scores in the training set and the test set are shown in Fig. 9, and the ROC curve is shown in Fig. 10. In both the training set and the test set, the scores of lung cancer samples are significantly higher than those of other cancer species (Wilcoxon test: P <= 0.05). With the threshold set to 0.253, the sensitivity in the test set reaches 75.0% at a specificity of 95.4%, and the overall accuracy reaches 93.0%, so lung cancer and other cancer species can be well distinguished.

In the present invention, 48 lung cancer-specific methylation markers were selected from methylation NGS sequencing data of 7 cancer species. A machine learning model constructed from the methylation level data of these markers distinguishes lung cancer samples well within the data of the 7 cancer species, showing that they are good lung cancer tissue-specific methylation markers and provide an important reference for tissue tracing of lung cancer in early pan-cancer screening.

While various embodiments have been described, it will be apparent that the basic disclosure and examples can provide other embodiments that utilize or are encompassed in the markers and methods described herein. It is, therefore, to be understood that the scope of the invention is to be defined by the scope of the disclosure and the appended claims, and not by the specific embodiments.

Sequences as used herein:

>Seq ID NO:1

CTGGCCCTGACAGACTGCAGACCAGACCGGGGCATTGTTCTCTTTCTCGGCCTTCCCCGCCGTGGACGGGCCCCCCACCTGGTTTGTGAAACCTGCGCCCAGGCTGAGTTCACAGCTAAACTTAGCGCCTCCCATTGTTTCCCCGGGGCCGTGGAGTTTGGTTAATAACTTCCCCTGATTTTCCTCGGGATGGGCTGGAAAGAGCCACGAGCCAGCCAGGCGCATCCTGCGTTTGTTTGTGCGGGGAGCGAGGCCGGGAATATCTGATCGGGCGGAGCAAGCCGGGCGGGAGAGGCCCACCCAGGCCCGAGGAAGGGAGCCCAGCGGGGGGCAGTTTCCATTGTCCCTCCTGCCCGCTGCCCCCACGG

>Seq ID NO:2

CGAGAGAGTGCATTCAAGAAGGGCGATCCGGGCACATATGCGACCTGTGAGAGGCGGAGTCGGTGACAGGTGGGTCTTGTTTTTTAATAAAGAGCTTGTTCCTaatcagatcatggcactcagaactcttcaaaaagcttcttatttcactctgggtaaaagccagagttctcacaatggcctgcaaggcctacgggatctgagggccccccaccctgaccccctcgacttcagatggcatctgcccctcactctgctctagcca

>Seq ID NO:3

CCCAGTCCACAGGGCTCGAACTCTCAGGTCCTACGAGCCCGCCCACTAGGCCCCGCCCACAGGAGCCGCTCCGCTCGTGGCCCGGCTCACTCGGCCCTCGCGAGCCCTCAGCCCCACCCGCGCTGCCACGCACCGCACCTGCTGTCCCGCTCCGGGATCTCCTTGATGGCGATGCGCACCCTCGTGTGGCGATCGCGGCCCGCGTACACCACCCCATACGTGCCCTTGCCCAGCACCAGCCGCTCGCCCGTCTCCGTGTACTCATAATCAAACTGCCGGGCGCGGGGTGAGATGGGAGTTCAGCAGGGCCCGCGGCCCCTCGCCCTCCGCGAGCTCCCAGTCCCGCGTCCTCACCTCCAACATCTCCCCCGCGCCCTCCGCCTCCTCCGCGG

>Seq ID NO:4

GCGTGCGGCGGCTGGGGTTgggcgcggggcccggggcgcggcgATGCGCGCGGCACGGCGAGGACCTGAGCCGCTTCTGCGAGGAGGACGAGGCGGCGCTGTGCTGGGTGTGCGACGCCGGCCCCGAGCACAGGACGCACCGCACGGCGCCGCTGCAGGAGGCCGCCGGCAGCTACCAGGTGAggcgccccccggcgggggctgcgggcgcTGCGGTGACCGGGAAGCGGGCGACAGTCCGGAGCGGAGCCGCCGAGGCCACCCGTCTCCTGAGCGGCTCCCACGGCCGCTCCCCCCACCGCGCGCCGTCCCCCCCGCCCACGCGGCTCACTCAGTGTGGGTCTCTTTGCCTTGGCTGTGGTAACCCCCTTTGCGACACACACCCAT

>Seq ID NO:5

CAAACTGGAGGCGGCGGCGCAGGCGCACGGCAAGGCCAAGCCGCTGAGCCGCTCTCTCAAAGAGTTCCCGCGTGCGCCGCCAGCCGACGGCGTGGCCCCACGCCTCTACAGCACGCGCAGCAGCAGCGGCGGCCGCGCGCCCATCAAGGCCGAgcgcgccgcgcaggcgcacggcccggccgccgccgccgtcgccgcccg

>Seq ID NO:6

TGAGGAGGAGCGGAAGTCGGAAGCTCCAGCCGTCACAGCCACATTCACTGGGCAAGCCGACTGTGAGCCAGGAAGTGCTCTTGGGGAGCCCAGGCCAAGCCATCCATTCTTGGGTCCTTTGGAGGTGAGCTAAGTGGGTCTGCCTAGGTTGGGGCTGGTGGAACCTGTGGGAGCAGGGAATGTGGAGAGTCACATGTGGGT

>Seq ID NO:7

AGCGGTTgcggcgggccggcgggcccggggAAGCGGGCGGTGGCCGCTCAGAGAATACCTTCCTTCCGGCAGGAGACCGTTTGGCCCTGTATTCCGGGCCTGCGGTTGGGCCTCCAAGCTGAGTTGGGCAACTTCCCAGCACCGCAAGAAAGGGCGAGCCAGACCTATTTGGCACCCCTTTCCCAGGAGGAGCAGGGGATGGCGCCGGCGGAGTTTGGGGAGGCTGCCCTGGCCAGTTCCCCGGGCTAGAGGGTGGAGGAGAGGAGGAGGGAGAGGAAAGGGCAGCTGAGGACTTGGAAGAAATGAGAAGCCGTGC

>Seq ID NO:8

GGGCGCAGGAAGAGCGGCTCTGCGAGGAAAGGGAAAGGAGAGGCCGCTTCTGGGAAGGGACCCGCACGACGACGCCCGAAGGGCGTCGGGGGAAGTGGTAGGCCCCGGAGACTGCGCGAGGCTCCTCAGCAAAGGAAGTGGGCGCGGCGCGCACGCAAGACCTCGCACCCGGCCTCGCGCGCCGCCTCTGGACAGCCCAGC

>Seq ID NO:9

AAAACTAATGTTTCTTCCTCCTTCTGTGATCTTCCTTCTTTCTGTTTTGAGCAGCTTCTATCACCTGTGTCCTCTGCGGATGAACTGCATAAAGCTCTCCGCCAAAGCCTACTTCTCCCTCATGGTGGAGAGGGAGCCGTGTGAGTAGTCCGGTACCGCAGCCATCCACCCTCTGCAGATCAGCTTTTCCTTCCTTGGCTC

>Seq ID NO:10

ACTCACCCTGCACGGGACAGGGACACCCGGGGACAGTGCCTCACTCACCCTACACGTGACAGGGACACCTGGGGACCGCGCCTCACTCACCCTGCACGTGACAGGGACACCCGGGGACAGTGCCTCACTCACCCTATACCTGGGAGGGACACCCAGGGACGGTGCCTCACTCACCCTACACGTGACAGGGACACCTGGGGC

>Seq ID NO:11

CCAACTGCCCGCGCGGAACCGGGCCGTGGGCCTGGGGTTCGGGAAGCGTGCGCCACCCCCGGTCGGGCCTGGCTTCCTTCTTGAATGCCCCCGGCGCAGGCCCGGTGCTTTGTCCCTCCGGCCTTCTCAAGGAGTGGTGGCCTTCTGCGGGGGCGAGAGCACGGCCTCTAGCCTTCCGCCGACGTCTCAGTGCGCAGATAccgcggcccgggcccctccgccgcgcgggggACCGCACTAGCGTCGACCTCCCGGCAGCCAACCCCGCGCGCAAGGCTCCGCGGCCGGATATGGGCCTAGCTTCCGGGATCCGCTCCCTGCGGGGCCGCGCTTAGGGTCGGAGTTCGCTAGTCCAGGGAAAGG

>Seq ID NO:12

GCGTGTCAGTGTGCAGTGGAGTGTGCAGTCTAAGCTTGCGGCTGTCTCCAGGCAGAAGAGGAGAccccggcgcgggcgggggcgggTTGGCGCCGGGCAAACGCCTTGGGTAGAGGGGAGAGGACGTTTCGTTAGTTCCCGCCCCTTCCTGACTAAAATTGCCTACCCGAAGCGCCCCGGAGGGCTTCACGGGAGGAGGGTAGACTCTCC

>Seq ID NO:13

GGAATAGGACGCTGGTTTCGTTCCCCCGAGGTGCGGAGAAGCAGTAGAAGACCTGCTGCTCTTGGAATTTGGCTCTGACCTTCTCCACGTCGGCCCGGGCCGTCTGGTAATTGTCCACGCTGCCTGGGATGTAGGAGCACTGTGGGGAGAAACAAGAGCAGCTGTGGGCTTGGAAATCCCCATTTCTTAGCCAAGGGCTTG

>Seq ID NO:14

CTTAATGCtttttttttttttttttttttttttttATAACATGAAGTTGTCAGGGACGCTCCTATGAGAACTGTTTGGAATTGCTGCACTTCTCTGGCTAGGAGGGAAGTGAGTAAATCACCAGGCGCCCCTCCCAGCTGCCCGTGTCCCTGCGCCGCTCAGCTCCTGCCGCAGGGCTGGCCGCGCCAAGCGCGCGTCCTA

>Seq ID NO:15

CAAGCGCCATCGCAAAGTGCTGCGTGACAACATACAGGGCATCACGAAGCCCGCCATCCGTCGCTTGGCCCGACGCGGCGGCGTGAAACGCATTTCGGGCCTCATTTATGAGGAGACCCGCGGTGTTCTTAAGGTGTTCCTGGAGAATGTGATACGGGACGCCGTAACCTACACGGAGCACGCCAAGCGTAAGACAGTCAC

>Seq ID NO:16

AGCCGTGGCTTCCCGTGGCTGCACTTGGAAAAAGCACTCGACGCTGCCCGGGCAGCTTTCCATCTCAAGTGGGAACGCGGCTGCCGGCTGTCTCCGCTCTTCAAAGTTAGTGGAGGCTCATTTGGAATAAACTCTTCTCTTCTGCTTCCCAGTCAGGCCCTGGTGGAATACAGAGTCTGTCCTGATCCCTGCCCTTTGACA

>Seq ID NO:17

ctcggcaacgcgccctcggcccgcagcctcctgccCCCTGTGCCCCGCTTCGGCCCCCAGCGCAGCTGCAGAGGGGCCCCCCTCGACGCATACACTCAAGAGCCCGACCGCGCGGCTGAAATCGCGGAGCTCGGAGCCGCGGCTGGCTGAGCGATCGCGGTTCCTGGGCTGCGTGCGCGCCCCTTGGAGCTGAAAGGAGCGCCAGGATCGGGGGCGCTGCACCGGGCTGGGCCCCTCAACGCTCGCAGACCGGGCCGGGCTGCAGCTGGAGATGGCAGCAATCCCGGGAGGTCTCCGGGCCTCTTCAGGGTGCGTCCAGGAGGCGGGTTCCGTGCGACGCGGCGCAGCCCACCCCCACGAGACCGCTTAACTTCGCGGGGGCAGCCTCGGGCGCTCGGAGACGCGGAGGCCCAGACTGCAGCCTCCGGATGCTGGAAGCCCAGACTCCCTGGGGTCACCGGCTCTCCCGCCACCCCAGCTGCAAAGAGTCCCATTGCTTCACCGTCCGGAGCTTAGTCTCCTTGTTCCTCTACCAGTCCCTCCCTCCGCAGGTCTCTGGGGACTTCTGACCGCCTGTTCTTA

>Seq ID NO:18

atctcggctcactgcaagctctgcctcccgggttcacgccattctcctgcctcagcctcccaagtagctgggactacaggtgcccgccaccacgcccggctaattttttgtatttttagtagagacggggtttcactgtgttagccgggatggtctcgatctcctgatctcgtgatccacctgccttggcctcccaaagtg

>Seq ID NO:19

CGGGCCAGCGCCCTGGGGCTTCCGTATCACAGGGGGCAGGGATTTCCACACGCCCATCATGGTGACTAAGGTAAGGATGGTGGCTCAAAGAGATGAGAAGGTCCTGCCAGAAGCGAGGTCGGCCCTGTTCACCCCACTCTGCACAGATGGCTTGCTTTTTCTGTTCTGGAGCTAGGGATCTGCTGCTGCCTGGCGTGCTGG

>Seq ID NO:20

TGGCGGCAAAGAGGGGTTTGGTCTCGGGGCTTAAATGGCACCAGACTCTTGCTTTTGCCCATCTGGAGACTGCAGGCTCCCTTCCTTACCCTCAGAGAGTGCTTATGGTGGGTGTTTTTGCGGGGCTGCAATAGGGGCCAAAAGTCAGGGAAAGGGGCACTGACCTGTAGTGAAAGGCCACAGGACACAGCCTTATTACTG

>Seq ID NO:21

CTGGTGCTCTGCAGTGGCAGGGCTGAGATGATTATACAACCTGCACTCCAGGCCAAGTCCGGTACTCGTCCCAGCTGTCGGCTAAGCCTGCACTGCTATGGGTGAGGGAATCACTCCTCTCCAGCTGGCTTTCTCACGCTGGAGAAGCCTGACCTTTATTCAGAATCATCCTCCAGCGCCCACATCACACAGCACCCTGGC

>Seq ID NO:22

CTGCCGGCTGGGCACGCGCCAAAAGCAGCCCTGGGCCCTGGGTATCGCGCTTGGGGGGAGGGTACCCCCGCCGGCTGGGCACGCGCCAAGAGCAGCCCTGGGCCCTGGGTATCGTGCTTAGGGGGAGGGTATCGGAGCGGGAAGTGGACCTGGGGAGCGCCGTCGGCTGAGGCTCTGGCTGATGCCGCCCTCCCCCGGATCCCCCAGGGACCGCGCTGAGCACCTCCGTGCTCCACCAGTCCATGGCCTCCTCCCCCAAGATGCCGAGGCGGTGAGTTGCGACCTGGATGTAGGCACTGCCCGCCCGAAGCGCGCGGAGGGGCCCTGGCCTTGATGACACCGCCCCCCTACCAGGGCCCTGGAGCAGGAGAAAGGGCGCCACCTCTACCTGGCCGGCCTTCCCGGCAGAAGCCGCCGAGCTAAGCCCTGGAGAGGTCGGCGCCTGGACTACATCACGTACCGCGGAGTTCCCGGGTGGCTGGGCCTGCGGCACTGG

>Seq ID NO:23

TGAGGAGATAAGGCTTCAGGCCAAAAGCAGATGGGTCACGGTGACCCGGCTGGCCCAGCCCTGGGAGCAGGCTCTGTACCCAGACCTTAGACCCTGGATGGGGCAGCCCTGCCCAGTGAGGCTGATAGGGGTGCCAGGGGCACAGAGCCACAATATGGTCGCTGAGGCTTTGGTGCCCCGTGCCCTGCATTCGAGCCCCCATCCGGCCATGCATCCTCCACCCTAATTTCCTGTTTTGTGAAGCAGGAAATGTAATTTCTCTCTTTTTTGGTTAAAACGTAAGAACACACATTGGGATGTATGGGAATCGGTGGACCTGCTGTTGGTTCTTACGTGGATGCT

>Seq ID NO:24

CGAGTCCTCGAGCTCGGGCGTCTTCGCGCCGCCGCCCCGCTCAGTGCGCCCAGGCACCGCGGCCGTGACGTCACGCCCGGGACTGGCCGTTGCAGCAAGACGGCCGCGTTCCGGTTCCGGTAGGTTGCCCGGGAGACGCGGGTACACAGAGAAGCGGCTCCCGTCGGAGGCCGAGTCGTCGCCACGATCGCCCCCTTGGTG

>Seq ID NO:25

AGCCGCGGCGGATTAGGCCGCCCGCCCCAACCTGGGCTTTGATCTTATCTGAGACTTGTGAGTCCAAAAGGGCTTAGCAACCGCAGCCATGGCAGCCCCAACGACGTGAACATCCGCACCTCTGAGCCTCCCCCTGAGAAGTACCTTCGAGGTGAGGCCTGCGCAGCCCCAGGAAGAGGGTGTGGGCGCAAACCTGAGGTGGGGAGCAAGGCCCGCCGGCTACACGGTTCCTGCCATCCTCGCTGCGCCCTTT

>Seq ID NO:26

TGCGCTCTGGTGGACGTTCCGTCTAGTTAGCCTAAGCATCATCCACATACTCTGGTGAACACTCGAGGACAAGGCCGCTTGCTATTATTAGTAAAGGGCCGAACCGTCCTGTCATTGGTGGAGGCAGTGCTTGACTGTGCATCGATCCAGGAATCCGATCTTTTCTCTCAACCACAGAGCTAACGTGCTCAGAAGTGGCCT

>Seq ID NO:27

GCCTGCCGTGGTCATAAGTCAGGGCCGAGTGGCGCTGGAGGACGGGAAGATGTTTGTCACCCCGGGGGCGGGCCGCTTCGTCCCTCGGAAAACATTCCCGGACTTTGTCTACAAGAGGATCAAAGCTCGCAACAGGGTAGGGCGGCACCCGCAAGGGTGTTGTGCAGGTAGGCAGGTGGGCGCTGAGTTCTAGGCCCAGAACGCACCCCTGGTCA

>Seq ID NO:28

GGGCGACCCCGGGGGCTGGGCCTCCCCTGGCTGGTGTCCACCCTCTCGGCCAGCACAGGGGTTCACCTTCAGGAGCCACTCAACGGCATCCTCCCCTGGAGCCCGTGCCGCCCTCACTGCCCCTGGGCAGGGCCCCGCAGCACCTCCTGCTGGGTGTAGGTGCTGTCTCGGCCCCACAGCCAGCAGTGGACATGCACCTGACCCCCAGGCAGCCAGCAGCACA

>Seq ID NO:29

TCGCGTCCTGCGGGGAGAGCCACCCTGCCCCGCGCTGCGCCCGGGACGGTTCCCTGGAACCACTCACCAGGCAGCATCATCGCGCCCAGCAGCCAGAGCCCGAGGCCGCGCATGGCCGGGTCGGGGAGCAGAGGCGGAGGTGACAGCCCCGCGGGACACGGTCTGGTTCCTGCGCTCCTGGCCCGAGGCTCTTTTccgcgcgccccgccccggcgcc

>Seq ID NO:30

TACCACTTTCCTAGAGACCATGGCCATGCTCCTAGAGGGTGAACCTGCATTCGCTGACCCCTCCATGCAAccccacttcactgatggggaaagaggatcccagaggggtaaggaacaagcccaaaataatagagcCTGCATTGGAACCGGGCTGAGCTAACACTTGGCTTACCGGCACTGTCACTGCCAGGGCCCGCGCGA

>Seq ID NO:31

CCTCCTCTAAGGCCCAGGGTCGGGGGAGGTGGGGAGGGAGCGGCCGACCGGCCGAATAGCGCTGCTTTCTTTGTTTTTCATGCAACATAATTCCATGGCCAGTCCAGGCGCTGCAGCCCCCTCCCCTGCCGGCCCCGGCGCCCGCGCAGGACCGCAGAGGGGCTGGGGGTCCAGGGCGCAGTCTAGTTCCAGGGCGCCCGC

>Seq ID NO:32

CGCGTGACCGTGCGCCAGCTCCCCGTGGGGCTCCTGCCAGGGTCGACCGGGAGGGGGTGCCACTCACCCAGATGAGCCACGCGGCTGAGGCGGGGGTCGAAACCGACCTCGCGCACCTTGTCAGTCCGCGCCAGGAAGAAGTTAACCACGCCGTCGGTGACCACGCAGCCTGGGAAGCCGACGAGCTCGTGGTGGAAGCCG

>Seq ID NO:33

CCATCCTCAGGCCTGGCGTTGGCTGCTCCTTGGCTTGTGTGCCCCTCCCTGCACCCCAATATGCCAGGATCTCCCCGCACCTCCTCATTCTACCATCACCTCACGGAGACATCCTGGTCACCCCGTGAGGCATTGCTCACGCCCTCCCCGGCACTCCACAGCCTTGAAGGGCACTGACCGCCAGTGCCTCCACCCACTGTG

>Seq ID NO:34

AGGGCTCCGGAAAACTGCGTTCTCACAAGACCAAAGGGAGGGGAGGGAGGGGGAGATGTGGCTGCAAGTGCAGTTGGAGAGGGTGTGAAGAGATCGGGAGTCCTCTGCGAGGCTCTGGAGCACCCGGCGCCTAAGAGGCTAGTGCGCCCCGTGCCGCTGCGGTAGGACCTGGCGGTCCGCAGCTCCTGAAGGGCCTGGCCG

>Seq ID NO:35

GTCACGGGTCTGGACGGGGTCGCAGGTCTGGACGGGGTCGCAGGTCTGGATGGGGTCGCACAGCTTTGGACCGGGTCGCGGGTCTGGACGGGGTCGCGGGTCTGGACGGGGTTGCACAGGTCTGGATGGGGTCGCACAGGTCTGGACGGGGTCGCGAAGGTCTGGACAGGGTCGTGGGTCTGGACAGGGTCGCAGGTCTGG

>Seq ID NO:36

TGCAAGCCCCTTTTCTAGAAGTTAGAGTTCTCCTGGGATCTTTGCCTCCCAAATTCTTGCTGGCGGCTCTGCTCTCCACCCCAGTGGGGCTGAACTAACAAGTTCCCCTTTTGCTTTTCTCACCAGAACCTGTGGTTTGCCAACCCCGGGGGCAGCAATAGCATGCCAAGCCGCACCCACAGCTCAGTCCAGAGGACCCGC

>Seq ID NO:37

AGTGCTGCACTGGGGCCCCGGGAAGCAGAAGACGGCTCCTGGCACATCTCCTGGGTGCATCTGTGGATTGCTGGGGCCCCCAGCAGCTCTCCCAATCCCCAGAAACCCCTCCTGGATCTGCTGTATCCACCTGGAGCCTCTTGGTGCACAGCGGCACACACAATACCTCCACTCTCCACCCCGAAGGATGCCCACTGCAGCGGGGTCCTCA

>Seq ID NO:38

TCCTGAAGCGCTGCTCGGAGCCGGAGCGCTACTGCCTGGCGCGGCTGATGGCTGACGCGCTGCGCGGCTGCGTGCCTGCCTTCCACGGCGTGGTGGAGCGCGACGGCGAAAGCTACCTGCAGCTGCAGGACCTGCTCGATGGCTTCGACGGACCTTGTGTGCTCGACTGCAAAATGGGCGTCAGGTATGCGTGCCCTGCCAGGTCGGTTGGGGGGATCAAGTAGGGGTCCGGGGCCGGGACAGCTGCTTGAGGGGGACCCGGGGCGAGTGCTCGAAGGGGTCTCCGTGTGCGCCCCCTCATGCCCTGGCCGCTGCCTGCGCCCCCACAGGACTTACCTAGAGGAGGAGCTGACCAAGGCCCGTGAGCGGCCCAAGCTGCGGAAGGACATGTACAAGAAAATGCTGGCGGTGGATCCTGAAGCTCCCACGGAGGAGGAGCACGCGCAGCGCGCCGTCACCAAGCCGCGCTACATGCAGTGGCGGGAAGGCATCAGCTCCA

>Seq ID NO:39

ACCTGAGGCTGGTGCGGGGGCGTCTCGGGGCTGGGGGCCACCCCTGGGGTGCAGACACCCGGCTTCTCAAGGCATCTTGGTCGGGGGTGGCAGAGGATGCACTGCTCACAGGAACCCAAATTCGAAAGACAGCCGCATCTACAATTTTAACACGGTGGCCTGGGTAGGGGGCCACCCACCCCGTCTCCTTGCCCGCCTGGCCGCCCTGCCCCTCACCCCACAGTGG

>Seq ID NO:40

CCTGCCCCAGCCCCTGCTTGCTGGGCCCACGGGGGTGGGGCGGCTCATTTTCCTGGAATGTGAAAGCAAACAGAGCCGCCACCGCAGCCAGCCCCACGGAGGCCTCTGGAGAGAAAACAAAACTGCTGGCCTAGGAGCGCCTGCCCCACGCTCTGGAGGAGAGCCCGGGGCAGGGGGACGCACAGGCAGAGCCCTCAGGGACAACCGCCCCAGGAGGCCAACGGCGACAGTTCATCCCACCTGGTGCTTCCTCCCACCCTGCCTGTGCGCCACGCTGGCCTCGAGCCAAAGGAATTCTCCCAGCAACCCGGGAAGGCGGCTGGGCCCGTCGGGGAGGCTTCTGGGTTTGAAAACAGGCTTTGCCCAAGTTCCCACAGCT

>Seq ID NO:41

TTTGGCTCTCTCCTGTCTTCGGGGTTTACAAAGTGTGTTGGGACTTGCGGGGCTGCTCTGTCCAAGCCTGGGTCTGGCGTCCGCGTCTCTGAGCCTGTGAGTGCGTGCGCTTTCCTGCGTCCTCTTGACTGCCGGTGCTGGGGCTCTGCGTCCTGCGTCCGCGGGAGTAAATACAGCAGGCGAAGGGGAAGCTCACACAATGGTCTCCAGCGCTCTGGGGCAGGGCTTCTGAGGGGCGGGCCTGCCTCT

>Seq ID NO:42

ATTGTGTTCCTCAAAAGTCTCTCTTTAGAAAAGAGAATTGCCTGACAGCTGAGCTTTTCCATCTCCCATGTTACCGGGGTCCCTTTTTGGTGGCTCAGGAAGACTGGCTGAGGACACTTTTCTGCAGGCGGGCACCCCCATCACCCCACAGCCACTGGAAGGATTGCTGAGAAGAGAAGCAAACGCCTACAGCACAGTCGC

>Seq ID NO:43

CCACACGGAACGATGGCTTATCACTGGAGAAAACCAGCCAGTGAAAGGGTCGCGGGAGAAGCCCGGGGACGACCCTGGGACTGGAGGGTTTCTCGCCTCTGGAAAAGGCAGTGCCCGCGGGGCAGGCCAGAGGGAGCGCTCCGAGGAGCTTTGGGGTTGCCAGCCTTGACACGCGCACCCCTCCGCCCGGGCCGGCTCCCCTCCGCCCTCAGACTCCCACCATCCTCCTACTATTCCACATGTCGGGTGTATATGGTGCGGAGAGCCCGGGGGAAGTTAGAACACGCGGCGGGAGAGGCAGGCCCAGGGCGGCCTCAGCTAAGCAGCCCGGCTTTCCGGATCCCCGCCGCGCACAGGC

>Seq ID NO:44

TAACTTACAGAGTGTGTCTGTGTCTTCTTGAGGAAGTGGCCTGTCTGGGTCCCCCTCCCAGTCTGAGCGTCATTGCAGTGGAATATCTCCCCTTCTCACCAATCATAACACGTCACTGTGGCAGCAGCGGATAGCTGGAAACCACCTGCCAGTGCCCAGCATGTAGGGCGTGCCCCTAGAGCGGGAGCTGCCACCTGCTTC

>Seq ID NO:45

GGCTGTGCGGGCACAGCTGTTACAGGCAGGGGGCAGGGGCCTCGTGGAGCTTGTGTAGACGGAGGGGCGGCGGGCCGTGTAGTGCAGGCTGCGAAGACTCACCGCGGTGAAGTGCGGCCAGGTGCGCAGCAGGTCGAAGAGCGCGTCGCCGGGGCAGTCGGTGCGCACCAGCTGGCGGTGGCCCAGCAGCGCGTAGTCTGGCCGCAGGAGGCCGGCGCGCACCGCACAACTCGGGAGCGTGTCGCGCACCGTGCGCAGAGCGGCCTCGGTGGGCAGCGCCGCGGTGTAGTTGCCCACTATGGCCACGCCGAAGCCCCGGGAGTTGTGGCCGAGCGTGTGGGCGCCCACCCAGTGCCAGCCGCGTCCCTCGTACACGTAG

>Seq ID NO:46

CAGCAGGGCAAGCTGAGCACACACGTGTGCAGAGCCAGGGCAGGAACACCGGAAGGTGGCGGGCAGAGTCCAGCCCCAGGACTTCCAGGTGAGAGAGCCCGCCGTGCCAGCATCAGGAGACAGCAGTCAGGAGCTCACAGAGCGGGGCCTCCACCGGGTACAGCGCTAGCACAGAGTTGGTGCTCAGTAGGCAGGGACTAAAGCCCCCACCCACCACTGCTCCCAGCAGAGCTTGGTCCTCAGACCTGGAGATGTCCTGAGGCCA

>Seq ID NO:47

GTGGCGTCCAGGGCAGGGCAGGTGCGTCATCCGGGCGGGATGCAGAGACACGTCCTTCCACCAACCATCTGAGGAGCACTTGGCACCCACACAATGAGCCCGGCAAGGGCCACGCCAGGAGGCAGCGCACGGGGCAGAGCCTCTGAGCCAGAGAGGGGGAGGTCCCTTGGGAGGCCCCTGCCATCCCCCGCTCTGGGTGGGCCTCTCCAGCCAGACTCTGCGCCCCAA

>Seq ID NO:48

GTTGGAGGAGGGAAGGCTGTTCACTGAGAGAGCAGACCCAGGAGCCCCAGTGGCAGAAGGGGCCCGGCAGGGAGTGCTGGGCAGGGAGCGCCCATGTGCCCACCCGAGTGCCAGTGCCAGCCAGCTGCTGCCCGGAGAGCCCCGGCCCTCTGTAGCTATCTGGCCTCTGCTCATGGCTGTTGCTCAGAGAGAATCTGACCAGCACTGACTTCACCTCCGCCCACCCCCTGAGGCGGCAGCTGGACCTCAGCGTTGCTTCAGGAAGAAGTCCTCAGCCAATAGTGTCC

Claims

1. Use of a reagent or component in the preparation of a kit or device for (1) distinguishing a lung cancer patient from a non-lung cancer patient; (2) diagnosing or aiding in the diagnosis of lung cancer; or (3) tissue tracing of lung cancer during a pan-cancer screening procedure, wherein the reagent or component comprises a reagent or component for detecting the methylation level of a lung cancer tissue-specific methylation marker in the genomic DNA of a sample, the methylation marker being one of the following regions or a site thereof, each region being the following gene together with the 2.2kb region upstream and the 2.2kb region downstream of it on the chromosome in which it is located: gene ARHGEF16; gene CASZ1; gene MAP3K6; gene TRIM58; gene ARHGEF33; gene PSD4; gene HOXD4; gene SLC12A8; gene DGKG; gene TERT; gene NR2F1; gene PCDHGC5; gene KCNMB1; gene FOXC1; gene HIST1H4F; gene TYW; gene LRRC4; gene DGKI; gene PDLIM2; gene RHOBTB2; gene TMEM75; gene OPLAH; gene NR5A1; gene SPAG6; gene WAPAL; gene BTBD16; gene DPYSL4; gene TTC40; gene ADAM8; gene SLC22a11; gene CPT1A; gene B4GALNT1; gene FBRSL1; gene XPO4; gene TFDP1; gene GCH1; gene TMEM179; gene ITPKA; gene SOX8; gene SLC9A3R2; gene SEPT-9; gene MBP; gene NFATC1; gene DNM2; gene RASAL3; gene TAF4; gene NTSR1; gene SLC17A9; or a complementary sequence or variant of any of the genes, provided that the methylation site in the variant is not mutated; preferably, the length of the site is 120bp-500bp, more preferably 200bp-480bp.

2. The use of claim 1, wherein the non-lung cancer or pan-cancer comprises colorectal cancer, liver cancer, gastric cancer, esophageal cancer, pancreatic cancer, and/or breast cancer.

3. The use of claim 1 or 2, wherein the methylation marker comprises a nucleotide sequence set forth in any one or more of the following, or a complement or variant sequence thereof: SEQ ID NOS: 1-48.

4. The use according to any one of claims 1-3, wherein the reagent or component comprises a reagent or component for use in a method of detecting methylation by one or more of: bisulfite conversion-based PCR, DNA sequencing, methylation-sensitive restriction enzyme analysis, fluorescence quantification, methylation-sensitive high resolution melting curve, chip-based methylation profile analysis, and mass spectrometry.

5. Use according to any one of claims 1-4, wherein the reagent or component comprises primers and/or probes for detecting methylation markers and/or the sample is a cell, a tissue, a fine needle biopsy and/or plasma, preferably the sample genomic DNA is free DNA in plasma.

6. A method of constructing a predictive model for distinguishing lung cancer from other non-lung cancer cancers, comprising:

(1) Obtaining methylation levels of methylation markers in genomic DNA of lung cancer samples and non-lung cancer samples as a training set; the methylation marker is selected from the following regions or sites thereof, each region being the following gene together with the 2.2kb region upstream and the 2.2kb region downstream of it on the chromosome in which it is located: gene ARHGEF16; gene CASZ1; gene MAP3K6; gene TRIM58; gene ARHGEF33; gene PSD4; gene HOXD4; gene SLC12A8; gene DGKG; gene TERT; gene NR2F1; gene PCDHGC5; gene KCNMB1; gene FOXC1; gene HIST1H4F; gene TYW; gene LRRC4; gene DGKI; gene PDLIM2; gene RHOBTB2; gene TMEM75; gene OPLAH; gene NR5A1; gene SPAG6; gene WAPAL; gene BTBD16; gene DPYSL4; gene TTC40; gene ADAM8; gene SLC22a11; gene CPT1A; gene B4GALNT1; gene FBRSL1; gene XPO4; gene TFDP1; gene GCH1; gene TMEM179; gene ITPKA; gene SOX8; gene SLC9A3R2; gene SEPT-9; gene MBP; gene NFATC1; gene DNM2; gene RASAL3; gene TAF4; gene NTSR1; gene SLC17A9; or a complementary sequence or variant of any of the genes, provided that the methylation site in the variant is not mutated; preferably, the length of the site is 120bp-500bp, more preferably 200bp-480bp; preferably, the non-lung cancer is colorectal cancer, liver cancer, gastric cancer, esophageal cancer, pancreatic cancer and/or breast cancer; and

7. The method of claim 6, wherein the methylation marker comprises a nucleotide sequence set forth in any one or more of the following, or a complement or variant sequence thereof: SEQ ID NOS 1-48;

preferably, wherein the sample is a cell, tissue, fine needle biopsy or plasma, preferably the genomic DNA is free DNA in plasma.

8. The method of claim 6 or 7, wherein step (1) comprises obtaining methylation sequencing data of the sample DNA.

9. The method according to any one of claims 6-8, wherein step (2) comprises building a logistic regression model to obtain model predictive scores, training the model using the obtained methylation levels of the methylation markers as a training set, and determining the relevant threshold of the model from the samples of the training set.

10. A predictive model of lung cancer constructed according to the method of any one of claims 6-9.

11. An apparatus for diagnosing lung cancer, comprising a memory and a processor that processes instructions stored in the memory, the instructions performing the method of any one of claims 6-9 to construct a predictive model of lung cancer, and using the methylation level of the methylation marker in the genomic DNA of a sample to be tested as a test set to obtain a model predictive score, the predictive score being compared with a threshold to judge whether the sample is lung cancer: a score greater than the threshold is predicted as lung cancer, and otherwise as another cancer species.

12. A kit or device for detecting lung cancer tissue-specific methylation markers, comprising reagents or components for detecting the status and/or level of one or more lung cancer tissue-specific methylation markers in genomic DNA from a sample, the lung cancer tissue-specific methylation markers being the following regions or sites thereof, each region being the following gene together with the 2.2kb region upstream and the 2.2kb region downstream of it on the chromosome in which it is located: gene ARHGEF16; gene CASZ1; gene MAP3K6; gene TRIM58; gene ARHGEF33; gene PSD4; gene HOXD4; gene SLC12A8; gene DGKG; gene TERT; gene NR2F1; gene PCDHGC5; gene KCNMB1; gene FOXC1; gene HIST1H4F; gene TYW; gene LRRC4; gene DGKI; gene PDLIM2; gene RHOBTB2; gene TMEM75; gene OPLAH; gene NR5A1; gene SPAG6; gene WAPAL; gene BTBD16; gene DPYSL4; gene TTC40; gene ADAM8; gene SLC22a11; gene CPT1A; gene B4GALNT1; gene FBRSL1; gene XPO4; gene TFDP1; gene GCH1; gene TMEM179; gene ITPKA; gene SOX8; gene SLC9A3R2; gene SEPT-9; gene MBP; gene NFATC1; gene DNM2; gene RASAL3; gene TAF4; gene NTSR1; gene SLC17A9; or a complementary sequence or variant of any of the genes, provided that the methylation site in the variant is not mutated; preferably, the length of the site is 120bp-500bp, more preferably 200bp-480bp;

Preferably, wherein the methylation marker comprises a nucleotide sequence set forth in any one or more of the following, or a complement or variant sequence thereof: SEQ ID NOS: 1-48.

13. The kit or device of claim 12, wherein the sample is a cell, tissue, fine needle biopsy or plasma, preferably wherein the nucleic acid is free DNA in plasma.

14. The kit or device of claim 12 or 13, wherein the reagents or components comprise reagents or components for use in one or more of the following methods: bisulfite conversion-based PCR, DNA sequencing, methylation-sensitive restriction enzyme analysis, fluorescence quantification, methylation-sensitive high resolution melting curve, and chip-based methylation profile analysis and mass spectrometry;

preferably, the reagent comprises an oligonucleotide for detecting a methylation marker, preferably the oligonucleotide is a primer and/or a probe;

preferably, the primer is a primer for detecting the methylation level/state of a site using methylation sequencing or a PCR primer for amplifying one or more methylation sites;

preferably, the reagent comprises bisulfite and derivatives thereof, PCR buffers, polymerase, dNTPs, primers, probes, methylation-sensitive or methylation-insensitive restriction enzymes, cleavage buffers, fluorescent dyes, fluorescence quenchers, fluorescence reporters, exonucleases, alkaline phosphatase, internal standards, and/or controls that are the aforementioned specific methylation markers from normal subjects or from patients with cancers other than lung cancer; preferably, the non-lung cancer is colorectal cancer, liver cancer, gastric cancer, esophageal cancer, pancreatic cancer and/or breast cancer.