CN112862018A - Tumor classification device based on 5hmC modified lncRNA - Google Patents

Tumor classification device based on 5hmC modified lncRNA

Info

Publication number
CN112862018A
Authority
CN
China
Prior art keywords
lncrna
data
5hmc
tumor
5hmc modified
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110421040.3A
Other languages
Chinese (zh)
Other versions
CN112862018B (en)
Inventor
周猛
孙杰
白玉
苏建忠
侯萍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wenzhou Medical University
Original Assignee
Wenzhou Medical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wenzhou Medical University
Priority to CN202110421040.3A
Publication of CN112862018A
Application granted
Publication of CN112862018B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/088 Non-supervised learning, e.g. competitive learning
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30 Detection of binding sites or motifs

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Biotechnology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Genetics & Genomics (AREA)
  • Analytical Chemistry (AREA)
  • Chemical & Material Sciences (AREA)
  • Investigating Or Analysing Biological Materials (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention relates to a tumor classification device based on 5hmC modified lncRNA. The method takes 5hmC modified lncRNA as the analysis object, screens for 5hmC modified lncRNAs with tissue-specific differences as feature data, performs feature selection with a machine learning method, and establishes a tumor classification model from the feature selection result. This yields a panel of 5hmC modified lncRNAs that can accurately distinguish tumor patients from healthy people and distinguish between different types of tumors, and the method has very good clinical application value.

Description

Tumor classification device based on 5hmC modified lncRNA
Technical Field
The invention relates to the technical field of biological information, in particular to a tumor classification device, equipment and a computer-readable storage medium based on 5hmC modified lncRNA.
Background
5-hydroxymethylcytosine (5hmC) is produced by the oxidation of 5-methylcytosine (5mC) by TET enzymes. Compared with the classical 5mC, the role and function of 5hmC are not yet fully understood. As research has progressed, however, 5hmC has come to be recognized as a stable epigenetic mark in the human genome rather than merely an intermediate product of 5mC demethylation. 5hmC differs significantly between cell and tissue types and may be a potential biomarker. However, 5hmC has mostly been studied in the gene bodies and promoters of protein-coding genes, and the role of 5hmC modified lncRNAs (long non-coding RNAs) has not yet been clarified.
cfDNA (circulating cell-free DNA) consists of degraded DNA fragments released into plasma from dead cells in different tissues, and it can be used for early tumor screening and classification. Tumor classification is very important for tumor treatment, and whether tumors can be classified using tissue-specific, plasma-derived 5hmC modified lncRNA remains a question to be studied.
Disclosure of Invention
The invention aims to provide a tumor classification device based on 5hmC modified lncRNA, which comprises: a memory for storing program instructions and a processor;
the processor is configured to invoke program instructions that, when executed, are configured to:
acquiring 5hmC modified lncRNA data of a sample to be detected;
inputting the 5hmC modified lncRNA data into a tumor classification model to obtain a prediction result of the tumor classification of the sample to be detected;
the tumor classification model obtains the prediction result of the tumor classification of the sample to be detected from the feature data of one or more of the following 22 5hmC modified lncRNAs: ENSG00000125899, ENSG00000203971, ENSG00000215112, ENSG00000215304, ENSG00000224189, ENSG00000224267, ENSG00000225960, ENSG00000227068, ENSG00000227716, ENSG 02300000292, ENSG00000231662, ENSG00000234182, ENSG00000234567, ENSG00000255229, ENSG00000257568, ENSG00000258026, ENSG00000259926, ENSG 00000260220220220220223, ENSG 00000262722728, ENSG00000263904, ENSG00000272129, ENSG00000273792.
The invention provides a tumor classification device based on 5hmC modified lncRNA, which comprises: a memory and a processor;
the memory is to store program instructions;
the processor is configured to invoke program instructions that, when executed, are configured to:
acquiring 5hmC modified lncRNA data of a sample to be detected;
inputting the 5hmC modified lncRNA data into a tumor classification model to obtain a prediction result of the tumor classification of the sample to be detected;
the tumor classification model is determined in a manner that comprises the following steps:
acquiring 5hmC modified lncRNA data of tumor patients and healthy people;
screening 5hmC modified lncRNA data with tissue specificity difference as characteristic data, wherein the 5hmC modified lncRNA with the tissue specificity difference has difference in tumor patients and healthy people and also has difference in different types of tumors;
and performing feature selection on the feature data by adopting a machine learning method, and establishing a tumor classification model by using a feature selection result.
Further, the screening of 5hmC modified lncRNAs with tissue-specific differences specifically comprises: comparing the 5hmC modified lncRNA data of the patients with each tumor type against that of the healthy population to obtain, for each tumor type, the 5hmC modified lncRNAs that differ from the healthy population; then screening these for 5hmC modified lncRNAs that differ between tumor types, and removing those that show no difference between different types of tumors.
Further, the screening of 5hmC modified lncRNAs with tissue-specific differences specifically comprises: first taking the 5hmC modified lncRNAs of the different types of tumors and removing those shared by two or more types of tumors; then comparing the data of the remaining 5hmC modified lncRNAs with the 5hmC modified lncRNA data of the healthy population, and selecting the 5hmC modified lncRNAs whose data differ between the tumor type and the healthy population.
Further, obtaining the 5hmC modified lncRNA data comprises:
obtaining 5hmC sequencing data, aligning the sequencing data to the human genome, and retaining the unique, non-duplicate matches to the human genome;
downloading the most recently released lncRNA reference gene annotation file;
obtaining data for 5hmC modified lncRNAs from the retained unique, non-duplicate matches based on the annotation file;
wherein, when the human genome version is the same as the version of the most recently released lncRNA reference gene annotation file, the data for 5hmC modified lncRNAs are obtained directly based on the annotation file; when the human genome version differs from the version of the most recently released lncRNA reference gene annotation file, the lncRNA location information is first converted from the version of the annotation file to the version of the human genome, and the data for 5hmC modified lncRNAs are obtained based on the lncRNA reference gene annotation file in the same version as the human genome.
Further, 5hmC modified lncRNAs that differ are identified using indicators including fold change and P-value;
preferably, a 5hmC modified lncRNA with |fold change| > 0.58 and P-value < 0.05 is judged to be a 5hmC modified lncRNA with a difference.
Further, the feature selection is followed by cluster analysis; preferably, the cluster analysis is unsupervised hierarchical clustering analysis.
Further, the feature selection is performed in parallel using several different machine learning methods, and the features of the model that yields the highest accuracy are selected;
preferably, the machine learning method includes one or more of recursive feature elimination, CART, random forest, linear regression, naive Bayes, and a custom training model.
Further, the tumor classification model also includes a regularization term.
A tumor classification system based on 5hmC modified lncRNA comprising:
the acquisition unit is used for acquiring 5hmC modified lncRNA data of a sample to be detected;
the processing unit is used for inputting the 5hmC modified lncRNA data into a tumor classification model to obtain a prediction result of the tumor classification of the sample to be detected;
the tumor classification model is determined in a manner that comprises the following steps:
acquiring 5hmC modified lncRNA data of tumor patients and healthy people;
screening 5hmC modified lncRNA data with tissue specificity difference as characteristic data, wherein the 5hmC modified lncRNA with the tissue specificity difference has difference in tumor patients and healthy people and also has difference in different types of tumors;
and performing feature selection on the feature data by adopting a machine learning method, and establishing a tumor classification model by using a feature selection result.
A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the above-mentioned tumor classification system.
The application has the advantages that: according to the method, 5hmC modified lncRNA is used as an analysis object, 5hmC modified lncRNA data with tissue specificity difference are screened, a tumor classification model is established by adopting various machine learning methods, a 5hmC modified lncRNA composition which can accurately distinguish tumor patients from healthy people and different types of tumors is obtained, the characteristics of the model are obtained, and the tumor classification model is established by utilizing the characteristic selection result.
Drawings
To illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flow chart of a tumor classification method based on 5hmC modified lncRNA provided in an embodiment of the present invention;
Fig. 2 is a schematic block diagram of a tumor classification system based on 5hmC modified lncRNA provided by an embodiment of the present invention;
FIG. 3 shows the 5hmC modified lncRNA profiles enriched in each tumor; A, positive enrichment; B, negative enrichment;
FIG. 4 shows the consensus cluster analysis of tissue-specific 5hmC modified lncRNAs;
FIG. 5 is a summary of the classification performance of the classification model of FIG. 3;
FIG. 6 shows the detection results of the classification model on the training set;
FIG. 7 shows the detection results of the classification model on the test set;
FIG. 8 shows the agreement between the actual tumor type and the classification model predictions; A, training set predictions; B, test set predictions.
Detailed Description
In order that those skilled in the art may better understand the technical solutions of the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments of the present invention.
Some of the flows described in the specification, the claims, and the above figures include a number of operations that appear in a particular order. It should be clearly understood that these operations may be performed out of the order in which they appear herein, or in parallel. Operation numbers such as 101 and 102 merely distinguish the operations from one another and do not themselves imply any order of execution. In addition, the flows may include more or fewer operations, and the operations may be performed sequentially or in parallel.
The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments that a person skilled in the art can derive from the embodiments given herein without creative effort fall within the protection scope of the present invention.
Fig. 1 is a schematic flow chart of a tumor classification method based on 5hmC modified lncRNA provided in an embodiment of the present invention. Specifically, the method includes the following steps:
101: acquiring 5hmC modified lncRNA data of a sample to be detected;
102: inputting the 5hmC modified lncRNA data into a tumor classification model to obtain a prediction result of the tumor classification of the sample to be detected.
The tumor classification model is determined in a manner that comprises the following steps:
acquiring 5hmC modified lncRNA data of tumor patients and healthy people;
screening 5hmC modified lncRNA data with tissue specificity difference as characteristic data, wherein the 5hmC modified lncRNA with the tissue specificity difference has difference in tumor patients and healthy people and also has difference in different types of tumors;
and performing feature selection on the feature data by adopting a machine learning method, and establishing a tumor classification model by using a feature selection result.
The term "sample" may be any biological sample isolated from a subject. For example, a sample may include, but is not limited to, body fluids, whole blood, platelets, serum, plasma, stool, red blood cells, white blood cells, endothelial cells, tissue biopsies, synovial fluid, lymph, ascites, interstitial or extracellular fluids, fluids of the intercellular spaces including gingival crevicular fluid, bone marrow, cerebrospinal fluid, saliva, mucus, sputum, semen, sweat, urine, nasal brush fluid, pap smear fluid, or any other bodily fluid. The bodily fluid may include saliva, blood, or serum. For example, the polynucleotide may be cell-free DNA isolated from a bodily fluid such as blood or serum. The sample may also be a tumor sample, which may be obtained from a subject by various methods including, but not limited to, venipuncture, excretion, ejaculation, massage, biopsy, needle aspiration, lavage, scraping, surgical incision or intervention, or other methods. The sample may be a cell-free sample (i.e., one that does not contain any cells).
In one embodiment, the screening of 5hmC modified lncRNAs with tissue-specific differences specifically comprises: comparing the 5hmC modified lncRNA data of the patients with each tumor type against that of the healthy population to obtain, for each tumor type, the 5hmC modified lncRNAs that differ from the healthy population; then screening these for 5hmC modified lncRNAs that differ between tumor types, and removing those that show no difference between different types of tumors.
In one embodiment, the screening of 5hmC modified lncRNAs with tissue-specific differences specifically comprises: first taking the 5hmC modified lncRNAs of the different types of tumors and removing those shared by two or more types of tumors; then comparing the data of the remaining 5hmC modified lncRNAs with the 5hmC modified lncRNA data of the healthy population, and selecting the 5hmC modified lncRNAs whose data differ between the tumor type and the healthy population.
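The removal of lncRNAs shared by two or more tumor types can be illustrated with plain set operations. The following is a minimal sketch, assuming the differential 5hmC modified lncRNAs of each tumor type are already available as sets of identifiers; the function name, the input layout, and the toy identifiers are hypothetical and do not come from the patent.

```python
from collections import Counter

def tissue_specific(diff_by_tumor):
    """Keep, per tumor type, the lncRNAs that are differential in that type only.

    diff_by_tumor maps a tumor label to the set of lncRNA IDs that differ
    from healthy controls for that tumor (hypothetical input layout).
    """
    # Count in how many tumor types each lncRNA is differential.
    counts = Counter(g for genes in diff_by_tumor.values() for g in genes)
    # Drop lncRNAs shared by two or more tumor types.
    return {
        tumor: {g for g in genes if counts[g] == 1}
        for tumor, genes in diff_by_tumor.items()
    }

# Toy identifiers, for illustration only.
example = {
    "CC": {"lnc1", "lnc2", "lnc3"},
    "GC": {"lnc2", "lnc4"},
    "HCC": {"lnc3", "lnc5"},
}
specific = tissue_specific(example)
```

In this toy input, lnc2 and lnc3 each appear in two tumor types and are dropped, so each tumor type keeps a single identifier.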
In one embodiment, obtaining the 5hmC modified lncRNA data comprises: obtaining 5hmC sequencing data, aligning the sequencing data to the human genome, and retaining the unique, non-duplicate matches to the human genome; downloading the most recently released lncRNA reference gene annotation file; and obtaining data for 5hmC modified lncRNAs from the retained matches based on the annotation file.
In one embodiment, obtaining 5hmC modified lncRNA data for tumor patients and healthy populations comprises: obtaining 5hmC sequencing data of tumor patients and healthy people, aligning the sequencing data to the human genome, and retaining the unique non-duplicate matches to the human genome; and downloading the most recently released lncRNA reference gene annotation file and obtaining data for 5hmC modified lncRNAs based on the annotation.
In one embodiment, the 5hmC modified lncRNA data is a vector, having a magnitude and a sign.
When the human genome version is the same as the version of the most recently released lncRNA reference gene annotation file, the data for 5hmC modified lncRNAs are obtained directly based on the annotation file; when the human genome version differs from the version of the most recently released lncRNA reference gene annotation file, the lncRNA location information is first converted from the version of the annotation file to the version of the human genome, and the data for 5hmC modified lncRNAs are obtained based on the lncRNA reference gene annotation file in the same version as the human genome.
In one embodiment, obtaining 5hmC modified lncRNA data for tumor patients and healthy populations comprises: aligning the 5hmC sequencing reads obtained from tumor patients and healthy populations to the human genome GRCh37 using Bowtie2, and retaining only the unique, non-duplicate matches to the human genome using picard-2.18.4. The released lncRNA reference gene annotation file (GRCh38 version) is downloaded from the GENCODE database, and liftOver is used to convert the location information of the lncRNA reference gene annotation file from the GRCh38 version to the GRCh37 version. lncRNA genes are extracted based on the GRCh37 annotation, and the 5hmC modified lncRNA data are obtained by counting the fragments in each RefSeq lncRNA obtained by the tool.
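The final counting step can be pictured as an interval-overlap count. The sketch below is illustrative only: it does not reproduce Bowtie2, Picard, or liftOver, and it assumes the aligned fragments and the lncRNA annotation have already been reduced to simple coordinate tuples in the same genome version. The function name, the data layout, and the coordinates are hypothetical.

```python
def count_fragments(fragments, lncrnas):
    """Count aligned fragments overlapping each lncRNA interval.

    fragments: list of (chrom, start, end), 0-based half-open coordinates.
    lncrnas: dict mapping lncRNA ID -> (chrom, start, end), same convention.
    Both layouts are assumptions made for this sketch.
    """
    counts = {gene_id: 0 for gene_id in lncrnas}
    for f_chrom, f_start, f_end in fragments:
        for gene_id, (g_chrom, g_start, g_end) in lncrnas.items():
            # Two half-open intervals overlap when each starts before the other ends.
            if f_chrom == g_chrom and f_start < g_end and g_start < f_end:
                counts[gene_id] += 1
    return counts

# Toy coordinates, for illustration only.
lncrnas = {"lncA": ("chr1", 100, 500), "lncB": ("chr2", 0, 300)}
fragments = [
    ("chr1", 90, 150),   # overlaps lncA
    ("chr1", 500, 600),  # starts exactly at the end of lncA, so no overlap
    ("chr2", 10, 170),   # overlaps lncB
    ("chr2", 250, 400),  # overlaps lncB
]
counts = count_fragments(fragments, lncrnas)
```

The nested loop is quadratic and only suitable for small inputs; a real pipeline would more likely rely on an indexed interval tool.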
In one embodiment, the read counts obtained for the 5hmC modified lncRNAs are converted to TPM (5hmC per kilobase of lncRNA transcript per million mapped reads).
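A minimal sketch of a TPM conversion follows. It uses the common TPM definition (counts are first divided by transcript length in kilobases, then rescaled so that the values sum to one million). Whether the patent's pipeline normalizes in exactly this way, or by the total number of mapped reads, is not stated, so the formula is an assumption, as are the function name and the toy numbers.

```python
def counts_to_tpm(counts, lengths_bp):
    """Convert per-lncRNA read counts to TPM.

    counts: dict lncRNA ID -> read count.
    lengths_bp: dict lncRNA ID -> transcript length in base pairs.
    """
    # Reads per kilobase for each lncRNA.
    rpk = {g: counts[g] / (lengths_bp[g] / 1000.0) for g in counts}
    # Scaling factor so that the final values sum to one million.
    scale = sum(rpk.values()) / 1_000_000
    if scale == 0:
        return {g: 0.0 for g in counts}
    return {g: v / scale for g, v in rpk.items()}

# Toy values, for illustration only.
counts = {"lncA": 100, "lncB": 300}
lengths = {"lncA": 1000, "lncB": 2000}
tpm = counts_to_tpm(counts, lengths)
```

With these toy values, lncA has 100 reads per kilobase and lncB has 150, so lncA receives 40% of the million and lncB 60%.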
In one embodiment, the 5hmC sequencing data for tumor patients and healthy populations may be sequencing data generated in-house or sequencing data published in a database. For example, a subset of the samples is taken from sequencing data published in a database, such as GSE8957; the details of the samples are given in Table 1.
TABLE 1 (sample details; reproduced in the original publication as images BDA0003027852410000071 and BDA0003027852410000081)
In one embodiment, the number of different types of tumors may be n (n is an integer).
In one embodiment, the number of tumors of different types is at least 2.
In one embodiment, the tumor may be one or more of the following tumors: acute lymphoblastic leukemia (ALL, also known as acute lymphocytic leukemia), acute myelogenous leukemia, adrenocortical carcinoma, adult acute myelogenous leukemia, adult cancer of unknown primary site, adult malignant mesothelioma, AIDS-related cancer, AIDS-related lymphoma, anal cancer, appendiceal cancer, astrocytoma, childhood cerebellar or cerebral cancer, basal cell carcinoma, cholangiocarcinoma, bladder cancer, bone tumor, osteosarcoma/malignant fibrous histiocytoma, brain cancer, brain stem glioma, breast cancer, bronchial adenoma/carcinoid, Burkitt's lymphoma, carcinoid tumor, cancer of unknown primary site, central nervous system lymphoma, cerebellar astrocytoma, cerebral astrocytoma/malignant glioma, cervical cancer, childhood acute myelogenous leukemia, childhood cancer of unknown primary site, childhood cancer, childhood cerebral astrocytoma, childhood mesothelioma, chondrosarcoma, chronic lymphocytic leukemia, chronic myelogenous leukemia, chronic myeloproliferative disorder, colon cancer, cutaneous T-cell lymphoma, desmoplastic small round cell tumor, endometrial cancer, endometrial uterine cancer, ependymoma, epithelioid hemangioendothelioma (EHE), esophageal cancer, Ewing family of tumors, Ewing's sarcoma in the Ewing family of tumors, extracranial germ cell tumor, extragonadal germ cell tumor, extrahepatic bile duct cancer, eye cancer, intraocular melanoma, gallbladder cancer, gastric (stomach) cancer, gastric carcinoid, gastrointestinal carcinoid tumor, gastrointestinal stromal tumor (GIST), gestational trophoblastic tumor, hairy cell leukemia, head and neck cancer, heart cancer, hepatocellular (liver) cancer, Hodgkin's lymphoma, hypopharyngeal cancer, hypothalamic and visual pathway glioma, glioma, pancreatic islet cell carcinoma (endocrine pancreas), Kaposi's sarcoma, kidney cancer (renal cell carcinoma), laryngeal cancer, leukemia, lip and oral cancer, liposarcoma, liver cancer (primary), non-small cell lung cancer, lymphoma (AIDS-related), lymphoma, macroglobulinemia, male breast cancer, malignant fibrous histiocytoma of bone/osteosarcoma, medulloblastoma, melanoma, Merkel cell carcinoma, metastatic squamous neck cancer with occult primary, oral cancer, multiple endocrine neoplasia syndrome, multiple myeloma, childhood multiple myeloma, multiple myeloma/plasma cell neoplasm, mycosis fungoides, myelodysplastic syndrome, myelodysplastic/myeloproliferative disorders, myxoma, cancer of the nasal cavity and paranasal sinuses, nasopharyngeal carcinoma, neuroblastoma, non-Hodgkin's lymphoma, oligodendroglioma, oropharyngeal cancer, ovarian cancer, ovarian epithelial cancer (surface epithelial-stromal tumor), ovarian germ cell tumor, ovarian low malignant potential tumor, pancreatic cancer, islet cell cancer, parathyroid carcinoma, penile cancer, pharyngeal cancer, pheochromocytoma, pineal astrocytoma, pineal germ cell tumor, pineoblastoma and supratentorial primitive neuroectodermal tumor, pituitary adenoma, pleuropulmonary blastoma, primary central nervous system lymphoma, prostate cancer, rectal cancer, transitional cell carcinoma of the renal pelvis and ureter, retinoblastoma, rhabdomyosarcoma, salivary gland carcinoma, Sézary syndrome, skin cancer (melanoma), skin cancer (non-melanoma), Merkel cell skin cancer, small cell lung cancer, small cell carcinoma of the intestine, soft tissue sarcoma, squamous cell carcinoma, supratentorial primitive neuroectodermal tumor, testicular cancer, thymoma and thymic carcinoma, thyroid cancer, urethral carcinoma, uterine sarcoma, vaginal carcinoma, childhood visual pathway and hypothalamic glioma, vulvar cancer, and nephroblastoma (renal cancer).
In one embodiment, 5hmC modified lncRNAs that differ are identified using indicators including fold change and P-value. For example, a 5hmC modified lncRNA with |fold change| > 0.58 and P-value < 0.05 is judged to be a 5hmC modified lncRNA with a difference.
In one embodiment, a 5hmC modified lncRNA with a difference refers to a difference at the transcriptome level, e.g., a difference in the expression level of the 5hmC modified lncRNA between tumor patients and the healthy population. As another example, if the relative level of a 5hmC modified lncRNA between tumor patients and the healthy population satisfies |fold change| > 0.58 and P-value < 0.05, the 5hmC modified lncRNA is considered to be a 5hmC modified lncRNA with a difference.
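The threshold rule can be written as a small filter. The sketch below assumes that the fold change is expressed on a log2 scale, which is how differential-analysis tools commonly report it; 0.58 is close to log2(1.5), i.e. roughly a 1.5-fold change. The text does not state the scale, so this reading is an assumption, and the function name and toy values are hypothetical.

```python
def is_differential(log2_fold_change, p_value, fc_cutoff=0.58, p_cutoff=0.05):
    """Apply the |fold change| > 0.58 and P-value < 0.05 rule.

    The fold change is assumed to be a log2 fold change (an assumption).
    """
    return abs(log2_fold_change) > fc_cutoff and p_value < p_cutoff

# Toy (log2 fold change, P-value) pairs, for illustration only.
results = {
    "lncA": (1.20, 0.001),   # passes both criteria
    "lncB": (-0.75, 0.030),  # passes both criteria (negative direction)
    "lncC": (0.40, 0.001),   # fold change too small
    "lncD": (2.00, 0.200),   # P-value too large
}
selected = sorted(g for g, (lfc, p) in results.items() if is_differential(lfc, p))

# On a log2 scale, the 0.58 cutoff corresponds to about a 1.5-fold change.
linear_cutoff = 2 ** 0.58
```

Using the absolute value keeps both positively and negatively enriched lncRNAs, matching the "positive" and "negative" directions reported for the markers.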
In one embodiment, for example in the Li cohort, 5hmC modified lncRNAs that differ between colon cancer (CC), gastric cancer (GC), hepatocellular carcinoma (HCC), and the healthy population were identified using the DESeq2 software package. lncRNAs with |fold change| > 0.58 and FDR-adjusted P-value < 0.05 were taken as 5hmC modified lncRNAs with a difference. Compared with healthy people, the analysis identified 1402 colon cancer molecular markers (1340 positive, 62 negative), 3189 gastric cancer molecular markers (2583 positive, 606 negative), and 230 liver cancer molecular markers (201 positive, 29 negative) among the 5hmC modified lncRNAs of tumor patients, as shown in FIG. 3. After the intersection of the 5hmC modified lncRNAs from the different tumor types was taken, 2081 tumor-enriched 5hmC modified lncRNAs were regarded as 5hmC modified lncRNAs with tissue-specific differences.
In one embodiment, after the 5hmC modified lncRNAs with tissue-specific differences between tumor types are screened, they are subjected to cluster analysis to test the screening effect. For example, consensus clustering analysis of the 2081 tissue-specific 5hmC modified lncRNAs revealed three distinct patient groups, and the patient groups obtained by unsupervised hierarchical clustering analysis separated by cancer type (FIG. 4). These results indicate that 5hmC modified lncRNA profiles differ significantly depending on tissue of origin, and that plasma-derived 5hmC modified lncRNAs may be used for liquid biopsy of patients.
In one embodiment, 5hmC modified lncRNA data from tumor patients and healthy populations are obtained, and the cohort is randomly divided into three quarters (training set) and one quarter (internal validation set) using a data splitting function. For example, the CC, GC, HCC, and healthy individuals of the Li cohort were randomly divided into three quarters of the cohort (training set) and one quarter of the cohort (internal validation set) using the data splitting function "createDataPartition".
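A three-quarter/one-quarter split that preserves class proportions can be sketched as follows. This is a plain-Python stand-in for the idea of a stratified split, not a reimplementation of caret's createDataPartition; the function name, the seed, and the toy label counts are assumptions made for illustration.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_fraction=0.75, seed=0):
    """Split sample indices into train/test sets, class by class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, test = [], []
    for label in sorted(by_class):
        idxs = by_class[label]
        rng.shuffle(idxs)  # random assignment within each class
        cut = round(len(idxs) * train_fraction)
        train.extend(idxs[:cut])
        test.extend(idxs[cut:])
    return sorted(train), sorted(test)

# Toy cohort, for illustration only.
labels = ["CC"] * 8 + ["GC"] * 8 + ["HCC"] * 4 + ["healthy"] * 12
train_idx, test_idx = stratified_split(labels)
```

Splitting within each class keeps the proportion of CC, GC, HCC, and healthy samples roughly equal in the training and validation sets, which matters when some classes are small.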
In one embodiment, a machine learning method is used to perform feature selection on the feature data, and a tumor classification model is established using the feature selection result. For example, feature selection uses recursive feature elimination (RFE) based on bagged classification and regression trees (CART) with 10-fold cross-validation; the feature selection process for cancer diagnosis was repeated 5 times, and the model that yielded the highest accuracy was selected. Selection was performed using the "rfe" and "treebagFuncs" functions of the caret R package. Thus, two reduced subgroups of 5hmC modified lncRNAs were generated and used as input for further analysis.
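The general idea of eliminating features step by step and keeping the subset with the best resampled accuracy can be sketched in a few lines. The example below is a simplified stand-in, not caret's rfe with treebagFuncs: it uses greedy backward elimination, a nearest-centroid classifier in place of bagged CART, and leave-one-out accuracy in place of repeated 10-fold cross-validation. All names and the toy data are hypothetical.

```python
def centroid_accuracy(X, y, feats):
    """Leave-one-out accuracy of a nearest-centroid classifier on the given features."""
    correct = 0
    for i in range(len(X)):
        sums, n = {}, {}
        for j, (row, label) in enumerate(zip(X, y)):
            if j == i:
                continue  # hold out sample i
            acc = sums.setdefault(label, [0.0] * len(feats))
            for k, f in enumerate(feats):
                acc[k] += row[f]
            n[label] = n.get(label, 0) + 1

        def dist(label):
            # Squared distance from the held-out sample to the class centroid.
            return sum((X[i][f] - sums[label][k] / n[label]) ** 2
                       for k, f in enumerate(feats))

        if min(sums, key=dist) == y[i]:
            correct += 1
    return correct / len(X)

def backward_elimination(X, y, n_features):
    """Drop one feature at a time; return the best-scoring subset seen."""
    feats = list(range(n_features))
    best_feats, best_acc = feats[:], centroid_accuracy(X, y, feats)
    while len(feats) > 1:
        # Try removing each feature and keep the removal that scores best.
        candidates = [[f for f in feats if f != drop] for drop in feats]
        feats = max(candidates, key=lambda c: centroid_accuracy(X, y, c))
        acc = centroid_accuracy(X, y, feats)
        if acc >= best_acc:  # prefer the smaller subset on ties
            best_feats, best_acc = feats[:], acc
    return best_feats, best_acc

# Toy data: feature 0 separates the classes, features 1 and 2 carry no class signal.
X = [[0, 5, 3], [1, 1, 7], [2, 8, 2], [100, 5, 3], [101, 1, 7], [102, 8, 2]]
y = [0, 0, 0, 1, 1, 1]
best_feats, best_acc = backward_elimination(X, y, n_features=3)
```

On the toy data the procedure discards the two uninformative features and keeps only feature 0, which is the behavior a feature selection step is meant to show.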
In one embodiment, the machine learning method includes one or more of recursive feature elimination, CART, random forest, linear regression, naive Bayes, and custom training models.
In one embodiment, the tumor classification model further includes a regularization term. For example, elastic net regularization is applied to a multivariate logistic regression model using the glmnet method. The model was trained with 10-fold cross-validation over a grid of alpha and lambda parameter values (alpha range: 0.05-1, length: 10; lambda range: 10^-1 to 5 x 10^-1, in increments of 0.1) to optimize the receiver operating characteristic (ROC) curve, where alpha controls the relative ratio between the Ridge and Lasso penalties and lambda controls the overall strength of the penalty. This selection process was repeated 20 times.
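As a rough illustration, the tuning grid can be enumerated in Python. The sketch reads the lambda range as 0.1 to 0.5 in steps of 0.1 and spaces the 10 alpha values evenly between 0.05 and 1; even spacing and the helper name linspace are assumptions, since the text gives only the range and the length.

```python
from itertools import product

def linspace(start, stop, num):
    """num evenly spaced values from start to stop, inclusive."""
    step = (stop - start) / (num - 1)
    return [round(start + i * step, 6) for i in range(num)]

alphas = linspace(0.05, 1.0, 10)                    # mixing ratio between ridge and lasso
lambdas = [round(0.1 * k, 1) for k in range(1, 6)]  # overall penalty strength
grid = list(product(alphas, lambdas))               # every (alpha, lambda) pair to evaluate
```

Under this reading the grid holds 50 candidate pairs, each of which would be scored by 10-fold cross-validation.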
In one embodiment, the L1 penalty of the lasso and the L2 penalty of the ridge method are linearly combined and used to build a tumor classification model (5hmC-lncRNA classification score model, abbreviated as 5hLC model).
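For reference, the linear combination of the two penalties can be written out for the binary logistic case in the parameterization used by glmnet. The symbols N (number of samples), x_i (feature vector), y_i (0/1 label), beta_0 and beta (coefficients) are notation introduced here for illustration and do not appear in the patent text.

```latex
\min_{\beta_0,\,\beta}\;
-\frac{1}{N}\sum_{i=1}^{N}\Bigl[\,y_i\,(\beta_0 + x_i^{\top}\beta)
      - \log\bigl(1 + e^{\beta_0 + x_i^{\top}\beta}\bigr)\Bigr]
\;+\;\lambda\Bigl[\tfrac{1-\alpha}{2}\,\lVert\beta\rVert_2^{2}
      + \alpha\,\lVert\beta\rVert_1\Bigr]
```

Setting alpha = 1 gives the pure lasso penalty and alpha = 0 the pure ridge penalty, while lambda scales the combined term.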
In one embodiment, feature selection based on bagged CART is performed on 140 5hmC modified lncRNAs to obtain 22 tumor-associated, plasma-derived 5hmC modified lncRNAs as non-invasive biomarkers, namely: ENSG00000125899, ENSG00000203971, ENSG00000215112, ENSG00000215304, ENSG00000224189, ENSG00000224267, ENSG00000225960, ENSG00000227068, ENSG00000227716, ENSG 02300000292, ENSG00000231662, ENSG00000234182, ENSG00000234567, ENSG00000255229, ENSG00000257568, ENSG00000258026, ENSG00000259926, ENSG 00000260220220220220223, ENSG 00000262722728, ENSG00000263904, ENSG00000272129, ENSG00000273792.
In one embodiment, in the Li cohort, the applicants split the samples in a stratified manner by sample type (healthy population samples and tumor patient samples), using 75% of the samples as the training set and the remaining 25% as the test set. Based on these tumor-associated 5hmC modified lncRNAs, a 5hLC model was constructed using the elastic-net algorithm. The classification performance of the model is shown in Fig. 5: ten-fold cross-validation on the training set yielded an AUC of 0.839 (95% CI: 0.769-0.910) for distinguishing CC from GC and HCC samples, indicating that the 5hLC score is predictive of CC (Fig. 6). Similarly, the AUC for distinguishing GC from HCC and CC was 0.843 (95% CI: 0.767-0.918), and the AUC for distinguishing HCC from CC and GC was 0.906 (95% CI: 0.823-0.989) (Fig. 6). In addition, analysis of the test data set in the Li cohort confirmed the tissue-specific plasma-derived 5hmC modified lncRNAs, with AUC values above 0.7 as in training (Fig. 7). Furthermore, the 5hmC TPM patterns of the CC-derived, GC-derived and HCC-derived tissue-specific 5hmC modified lncRNAs are consistent with the 5hLC scores observed in the Li cohort (training and testing) (Figs. 8A, 8B). These data underscore the potential of these 5hmC modified lncRNAs as tissue-specific biomarkers.
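The AUC values reported above are one-versus-rest measures (for example, CC against GC and HCC). A minimal sketch of such a computation follows, using the rank-based definition of AUC as the probability that a positive sample scores higher than a negative one. The scores and labels are made-up illustrative values, not patent data, and the confidence intervals are not reproduced because the text does not state how they were derived.

```python
def auc(scores, is_positive):
    """Rank-based AUC: fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, p in zip(scores, is_positive) if p]
    neg = [s for s, p in zip(scores, is_positive) if not p]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical 5hLC-style scores for six samples.
scores = [0.9, 0.8, 0.35, 0.6, 0.3, 0.2]
labels = ["CC", "CC", "CC", "GC", "HCC", "GC"]
cc_vs_rest = auc(scores, [label == "CC" for label in labels])
```

Repeating the call with GC or HCC as the positive class gives the other two one-versus-rest values.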
In one embodiment, after the feature selection, clustering analysis is performed to check the feature selection effect.
In one embodiment, the cluster analysis is consensus clustering analysis, an unsupervised clustering method that automatically selects the number of clusters, implemented using the R package "ConsensusClusterPlus". Hierarchical clustering is performed using the R package "pheatmap".
In one embodiment, unsupervised hierarchical cluster analysis is performed on the classification markers from patients with three tumor types and a healthy population; substantially all cancer samples are grouped into cancer clusters, while the vast majority of healthy samples fall into separate healthy clusters. Hierarchical clustering analysis using the 22 tumor tissue-specific 5hmC modified lncRNAs separated tumor patients well from healthy controls and also separated tumors of different kinds.
A device for tumor classification based on 5hmC modified lncRNA, the device comprising: a memory for storing program instructions and a processor;
the processor is configured to invoke program instructions that, when executed, are configured to:
acquiring 5hmC modified lncRNA data of a sample to be detected;
inputting the 5hmC modified lncRNA data into a tumor classification model to obtain a prediction result of the tumor classification of the sample to be detected;
the tumor classification model obtains the prediction result of the tumor classification of the sample to be detected from the feature data of one or more of the following 22 5hmC modified lncRNAs: ENSG00000125899, ENSG00000203971, ENSG00000215112, ENSG00000215304, ENSG00000224189, ENSG00000224267, ENSG00000225960, ENSG00000227068, ENSG00000227716, ENSG 02300000292, ENSG00000231662, ENSG00000234182, ENSG00000234567, ENSG00000255229, ENSG00000257568, ENSG00000258026, ENSG00000259926, ENSG 00000260220220220220223, ENSG 00000262722728, ENSG00000263904, ENSG00000272129, ENSG00000273792.
Fig. 2 is a diagram of a tumor classification system based on 5hmC modified lncRNA according to an embodiment of the present invention, including:
the acquisition unit is used for acquiring 5hmC modified lncRNA data of a sample to be detected;
the processing unit is used for inputting the 5hmC modified lncRNA data into a tumor classification model to obtain a prediction result of the tumor classification of the sample to be detected;
the tumor classification model is determined in a manner that comprises the following steps:
acquiring 5hmC modified lncRNA data of tumor patients and healthy people;
screening 5hmC modified lncRNA data with tissue specificity difference as characteristic data, wherein the 5hmC modified lncRNA with the tissue specificity difference has difference in tumor patients and healthy people and also has difference in different types of tumors;
and performing feature selection on the feature data by adopting a machine learning method, and establishing a tumor classification model by using a feature selection result.
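The division into an acquisition unit and a processing unit can be sketched as two small Python functions. This is a hedged illustration only: the tab-separated input format, the binary tumor/healthy output, the 0.5 threshold and the coefficient values in MODEL are all hypothetical, since the patent does not publish fitted coefficients; the two lncRNA identifiers are taken from the list of 22 above.

```python
import math

# Hypothetical coefficients for illustration only; not values from the patent.
MODEL = {
    "intercept": -1.0,
    "weights": {"ENSG00000125899": 1.0, "ENSG00000203971": -0.5},
}

def acquire(lines):
    """Acquisition unit: parse 'lncRNA_id<TAB>value' lines into a feature dict."""
    sample = {}
    for line in lines:
        gene_id, value = line.strip().split("\t")
        sample[gene_id] = float(value)
    return sample

def classify(sample, model=MODEL, threshold=0.5):
    """Processing unit: logistic score over the model features, then a label.

    Features missing from the sample are treated as 0.0 (an assumption).
    """
    z = model["intercept"] + sum(
        weight * sample.get(gene_id, 0.0)
        for gene_id, weight in model["weights"].items()
    )
    score = 1.0 / (1.0 + math.exp(-z))
    return score, ("tumor" if score >= threshold else "healthy")

sample = acquire(["ENSG00000125899\t3.0", "ENSG00000203971\t2.0"])
score, label = classify(sample)
```

Keeping parsing and scoring separate mirrors the unit structure described above and lets either part be replaced independently, for instance by a multi-class model that also distinguishes tumor types.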
A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, implements the system for tumor classification as defined above.
The validation results of this validation example show that assigning an intrinsic weight to an indication can moderately improve the performance of the method relative to the default settings.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by associated hardware instructed by a program, which may be stored in a computer-readable storage medium, and the storage medium may include: read-only memory (ROM), random access memory (RAM), magnetic or optical disks, and the like.
It will be understood by those skilled in the art that all or part of the steps in the method for implementing the above embodiments may be implemented by hardware that is instructed to implement by a program, and the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
While the invention has been described in detail with reference to specific embodiments thereof, it will be apparent to one skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention.

Claims (10)

1. A device for tumor classification based on 5hmC modified lncRNA, the device comprising: a memory for storing program instructions and a processor;
the processor is configured to invoke program instructions that, when executed, are configured to:
acquiring 5hmC modified lncRNA data of a sample to be detected;
inputting the 5hmC modified lncRNA data into a tumor classification model to obtain a prediction result of the tumor classification of the sample to be detected;
the tumor classification model obtains the prediction result of the tumor classification of the sample to be detected from the feature data of one or more of the following 22 5hmC modified lncRNAs: ENSG00000125899, ENSG00000203971, ENSG00000215112, ENSG00000215304, ENSG00000224189, ENSG00000224267, ENSG00000225960, ENSG00000227068, ENSG00000227716, ENSG 02300000292, ENSG00000231662, ENSG00000234182, ENSG00000234567, ENSG00000255229, ENSG00000257568, ENSG00000258026, ENSG00000259926, ENSG 00000260220220220220223, ENSG 00000262722728, ENSG00000263904, ENSG00000272129, ENSG00000273792.
2. A device for tumor classification based on 5hmC modified lncRNA, the device comprising: a memory and a processor;
the memory is to store program instructions;
the processor is configured to invoke program instructions that, when executed, are configured to:
acquiring 5hmC modified lncRNA data of a sample to be detected;
inputting the 5hmC modified lncRNA data into a tumor classification model to obtain a prediction result of the tumor classification of the sample to be detected;
the tumor classification model is determined in a manner that comprises the following steps:
acquiring 5hmC modified lncRNA data of tumor patients and healthy people;
screening 5hmC modified lncRNA data with tissue specificity difference as characteristic data, wherein the 5hmC modified lncRNA with the tissue specificity difference has difference in tumor patients and healthy people and also has difference in different types of tumors;
and performing feature selection on the feature data by adopting a machine learning method, and establishing a tumor classification model by using a feature selection result.
3. The device of claim 2, wherein the screening for 5hmC modified lncRNAs with tissue-specific differences specifically comprises:
comparing the 5hmC modified lncRNA data of each tumor type with that of the healthy population to obtain the 5hmC modified lncRNAs that differ between each tumor and the healthy population; then screening these for 5hmC modified lncRNAs that differ between tumor types, and removing those that show no difference between different types of tumors.
4. The device of claim 2, wherein the screening for 5hmC modified lncRNAs with tissue-specific differences specifically comprises: first taking the 5hmC modified lncRNAs of the different types of tumors and removing those shared by two or more types of tumors; then comparing the remaining 5hmC modified lncRNA data with the 5hmC modified lncRNA data of healthy people, and selecting the 5hmC modified lncRNAs that differ between the different types of tumors and healthy people.
5. The device of claim 2, wherein the 5hmC modified lncRNA data is obtained by:
obtaining 5hmC sequencing data, aligning the sequencing data to the human genome, and retaining the reads that match the human genome uniquely and non-repetitively;
downloading the most recently released lncRNA reference gene annotation file;
obtaining data for 5hmC modified lncRNAs from the retained unique, non-repetitive matches based on the annotation file;
wherein when the human genome version is the same as the version of the most recently released lncRNA reference gene annotation file, the data for 5hmC modified lncRNAs is obtained based on that annotation file; when the versions differ, the lncRNA location information is first converted from the version of the lncRNA reference gene annotation file to the same version as the human genome, and the data for 5hmC modified lncRNAs is obtained based on the lncRNA reference gene annotation file of the same version as the human genome.
6. The device of claim 2, wherein the 5hmC modified lncRNAs with a difference are determined using indicators comprising fold change and P-value; preferably, a 5hmC modified lncRNA with |fold change| > 0.58 and P-value < 0.05 is judged to be a 5hmC modified lncRNA with a difference.
7. The device of claim 2, wherein the feature selection is followed by a cluster analysis; preferably, the cluster analysis is unsupervised hierarchical clustering analysis.
8. The device of claim 2, wherein the feature selection is performed in parallel by employing a plurality of different machine learning methods, selecting the features in the model that result in the greatest accuracy; preferably, the machine learning method comprises one or more of recursive feature elimination, CART, random forest, linear regression, naive Bayes and custom training models; preferably, the tumor classification model further comprises a regularization term.
9. A tumor classification system based on 5hmC modified lncRNA comprising:
the acquisition unit is used for acquiring 5hmC modified lncRNA data of a sample to be detected;
the processing unit is used for inputting the 5hmC modified lncRNA data into a tumor classification model to obtain a prediction result of the tumor classification of the sample to be detected;
the tumor classification model is determined in a manner that comprises the following steps:
acquiring 5hmC modified lncRNA data of tumor patients and healthy people;
screening 5hmC modified lncRNA data with tissue specificity difference as characteristic data, wherein the 5hmC modified lncRNA with the tissue specificity difference has difference in tumor patients and healthy people and also has difference in different types of tumors;
and performing feature selection on the feature data by adopting a machine learning method, and establishing a tumor classification model by using a feature selection result.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out a tumor classification according to any one of claims 1 to 9.
CN202110421040.3A 2021-04-19 2021-04-19 Tumor classification device based on 5hmC modified lncRNA Active CN112862018B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110421040.3A CN112862018B (en) 2021-04-19 2021-04-19 Tumor classification device based on 5hmC modified lncRNA

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110421040.3A CN112862018B (en) 2021-04-19 2021-04-19 Tumor classification device based on 5hmC modified lncRNA

Publications (2)

Publication Number Publication Date
CN112862018A true CN112862018A (en) 2021-05-28
CN112862018B CN112862018B (en) 2022-09-02

Family

ID=75992643

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110421040.3A Active CN112862018B (en) 2021-04-19 2021-04-19 Tumor classification device based on 5hmC modified lncRNA

Country Status (1)

Country Link
CN (1) CN112862018B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110031624A (en) * 2019-02-28 2019-07-19 中国科学院上海高等研究院 Tumor markers detection system based on multiple neural networks classifier, method, terminal, medium
CN110272985A (en) * 2019-06-26 2019-09-24 广州市雄基生物信息技术有限公司 Tumor screening kit and its System and method for based on peripheral blood plasma DNA high throughput sequencing technologies
CN111243673A (en) * 2019-12-25 2020-06-05 北京橡鑫生物科技有限公司 Tumor screening model, and construction method and device thereof

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HANYANG HU et al.: "Epigenomic landscape of 5-hydroxymethylcytosine reveals its transcriptional regulation of lncRNAs in colorectal cancer", 《BRITISH JOURNAL OF CANCER》 *

Also Published As

Publication number Publication date
CN112862018B (en) 2022-09-02

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant