CN114464255A - Methylation age assessment method based on DNA methylation level data - Google Patents

Methylation age assessment method based on DNA methylation level data

Info

Publication number
CN114464255A
Authority
CN
China
Prior art keywords
age
methylation
model
data
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210107574.3A
Other languages
Chinese (zh)
Inventor
马玉昆
陈晨
张晓伟
孙琼琳
李伟华
李峰峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Fruit Shell Biotechnology Co ltd
Original Assignee
Beijing Fruit Shell Biotechnology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Fruit Shell Biotechnology Co ltd
Priority to CN202210107574.3A
Publication of CN114464255A

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30 Detection of binding sites or motifs
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B5/00 ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks

Landscapes

  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Molecular Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Physiology (AREA)
  • Artificial Intelligence (AREA)
  • Bioethics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Evolutionary Computation (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Genetics & Genomics (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention discloses a methylation age assessment method based on DNA methylation level data. The invention constructs a methylation age prediction model whose input file must contain information on important methylation sites, namely methylation sites in gene regions that are reported in the literature and/or associated with age changes. In the embodiment of the invention, R-square, MAE, MSE and RMSE are calculated between the predicted methylation age of the validation set and the known chronological age of the validation set samples; the resulting R-square is 0.855.

Description

Methylation age assessment method based on DNA methylation level data
Technical Field
The invention relates to the technical field of biological information analysis, in particular to a methylation age assessment method based on DNA methylation level data.
Background
DNA methylation refers to the chemical modification in which a particular base in a DNA sequence acquires a methyl group under the catalytic action of a methyltransferase. DNA methylation affects the growth and development of organisms in various ways, for example by regulating gene expression and suppressing transposable elements. In mammalian cells, methylation occurs at about three quarters of CpG sites, but the methylation level within an individual changes dynamically with age.
Over a person's limited lifespan, chronological age is commonly used to describe the growth and aging state of an individual. Describing this state by elapsed time is simple and applicable to every individual. However, because individuals differ in genetic information and in the external environment in which they grow and develop, the physiological condition of individuals of the same chronological age varies greatly, and the variation becomes more pronounced as chronological age increases. Chronological age therefore describes only how long an individual has been growing and developing; the actual physiological condition of the individual needs to be measured by a more objective and accurate standard.
In 2013, Horvath et al. established an epigenetic clock that evaluates the age of an individual from DNA methylation levels in the tissues of different individuals, but the evaluation results may be affected by genetic differences among individuals and by differences between tissue sites (such as skin and blood) within an individual. Ribosomal DNA (rDNA) is a highly evolutionarily conserved genomic nucleic acid fragment. In 2019, Wang et al. used a ribosomal clock to evaluate methylation age in different species, including human, mouse and dog, and considered the ribosomal clock to be a universal cross-species tool for methylation age evaluation. However, DNA methylation levels in different tissue sites can distort the prediction when differences between individuals or between tissues and sites are large. In addition, the physiological state of an individual is shaped by the interaction of all genes in the body, so predicting methylation age only from sites in conserved sequence regions risks losing information from important gene sites.
Establishing a methylation clock based on important sites and predicting the methylation age of individuals has therefore become an important problem to be solved.
Disclosure of Invention
The technical problem to be solved by the invention is how to predict or evaluate the methylation age of an individual through important sites.
In order to solve the above technical problems, the present invention first provides a method for predicting methylation age. The method comprises the following steps: extracting methylation site information in the sample methylation level file to obtain a methylation site information file; splitting the methylation site information file into a known age data set and an unknown age data set according to age information; constructing a model for obtaining a predicted methylation age by using an elastic network function for the known age data set; predicting a methylation age of the sample of the unknown age data set using the model of predicted methylation age.
In the method described above, the samples may be classified into known age dataset samples and unknown age dataset samples. The known age dataset samples may be modeling samples. The unknown age dataset samples may be samples to be tested. The sample may be a biological tissue and/or cell sample.
In the method described above, the methylation site information can be important methylation site information. The important methylation sites can be methylation sites within regions of genes reported in the literature and/or associated with age changes.
The genes can be ELOVL2, FHL2, SMARCB1, AEBP2, ZYG11A, OTUD7A, TRIM59, TTC7A, EVL, WBP1L, KLF14, and GRIN1 genes on human genome.
The important methylation sites can be the methylation sites targeted by the following probes of the EPIC methylation array on the Illumina methylation chip (annotation file path: https:// support. Illumina. com. cn/array/array _ kit/infinite _ loads. html; annotation file update time Mar 13, 2020): cg01912455, cg20793144, cg13868026, cg18071806, cg09038267, cg06784991, cg07955995, cg22454769, cg11667847, cg04875128, cg07553761 and cg16867657.
Of the genes described above, the nucleotide sequence of the ELOVL2 Gene may correspond to position 10980992-11044538 (Gene ID:54898, update date 17-Jan-2022) of chromosome 6 of GRCh37 version of the human genome.
The nucleotide sequence of the FHL2 Gene may correspond to position 105974169-106054986 of chromosome 2 of the GRCh37 version of the human genome (Gene ID:2274, update date 5-Jan-2022).
The nucleotide sequence of the SMARCB1 Gene may correspond to position 24129153-24180196 of chromosome 22 of GRCh37 version of the human genome (Gene ID:6598, update date 17-Jan-2022).
The nucleotide sequence of the AEBP2 Gene may correspond to position 19592426-19675161 of chromosome 12 of GRCh37 version of the human genome (Gene ID:121536, update date 5-Jan-2022).
The nucleotide sequence of the ZYG11A Gene may correspond to position 53308432-53360670 of chromosome 1 of GRCh37 version of the human genome (Gene ID:440590, update date 5-Jan-2022).
The nucleotide sequence of the OTUD7A Gene may correspond to position 317676767601-32162876 of chromosome 15 of GRCh37 version of the human genome (Gene ID:161725, update date 5-Jan-2022).
The nucleotide sequence of the TRIM59 Gene may correspond to position 160153291-160167574 of chromosome 3 of GRCh37 version of the human genome (Gene ID:286827, update date 5-Jan-2022).
The nucleotide sequence of the TTC7A Gene may correspond to position 47143005-47303262 of chromosome 2 of GRCh37 version of the human genome (Gene ID:57217, update date 5-Jan-2022).
The nucleotide sequence of the EVL Gene may correspond to position 100437759-100610573 of chromosome 14 of the human genome version GRCh37 (Gene ID:51466, update date 5-Jan-2022).
The nucleotide sequence of the WBP1L Gene may correspond to positions 104503705-104576019 of chromosome 10 of GRCh37 version of the human genome (Gene ID:54838, update date 5-Jan-2022).
The nucleotide sequence of the KLF14 Gene may correspond to position 130415525-130418967 of chromosome 7 of GRCh37 version of the human genome (Gene ID:136259, update date 5-Jan-2022).
The nucleotide sequence of the GRIN1 Gene may correspond to positions 140033606-140063208 of chromosome 9 of GRCh37 of the human genome (Gene ID:2902, update date 23-Jan-2022).
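The gene regions above can be used to decide whether a CpG site counts as lying within one of the listed genes. The following is a minimal sketch of that interval check, not the applicant's implementation: it uses four of the GRCh37 regions quoted above, while the function name `gene_for_site` and the probe names and positions in `example_sites` are hypothetical and chosen only for illustration.

```python
from typing import Dict, Optional, Tuple

# A few of the GRCh37 gene regions listed in the text: gene -> (chromosome, start, end).
GENE_REGIONS: Dict[str, Tuple[str, int, int]] = {
    "ELOVL2": ("6", 10980992, 11044538),
    "FHL2": ("2", 105974169, 106054986),
    "TRIM59": ("3", 160153291, 160167574),
    "KLF14": ("7", 130415525, 130418967),
}


def gene_for_site(chrom: str, pos: int) -> Optional[str]:
    """Return the gene whose region contains the site, or None if no region matches."""
    for gene, (g_chrom, start, end) in GENE_REGIONS.items():
        if chrom == g_chrom and start <= pos <= end:
            return gene
    return None


# Hypothetical probe annotations (NOT real array coordinates): name -> (chromosome, position).
example_sites = {
    "probe_a": ("6", 11000000),
    "probe_b": ("2", 1000),
    "probe_c": ("7", 130418000),
}

# Keep only the sites that fall inside one of the gene regions.
selected = {}
for name, (chrom, pos) in example_sites.items():
    gene = gene_for_site(chrom, pos)
    if gene is not None:
        selected[name] = gene
```

In practice the chromosome and position of each probe would come from the array annotation file rather than from a hand-written dictionary.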
In the method described above, the age information may be derived from the methylation level file. The age information of the sample to be tested may be unknown (unknown), and the age information of the modeling sample may be the known actual chronological age.
In the method described above, the methylation level file can be a file obtained by quality control of the methylation raw data.
The raw methylation data can be the raw instrument output obtained by running a sample on a methylation chip. The methylation chip can be an Illumina methylation chip.
The quality control can comprise the following steps: processing the raw methylation data with the R package ChAMP, normalizing the raw data with the parameters cor = 5, method = BMIQ and arraytype = the corresponding methylation chip type to obtain normalized methylation level data. The normalized methylation data are then combined with the age data of the samples (obtained from the raw data) into one merged file as follows: if the age of a sample is known, the corresponding age is added to the age row of the merged file; if the age of a sample is unknown, 'unknown' is entered in the age row of the merged file. The merged file is used for subsequent analysis.
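The merging rule described above (known ages are copied into the age row, missing ages are filled with 'unknown') can be sketched as follows. This is only an illustration under stated assumptions: the merged file is represented as a dictionary of rows, the function name `merge_with_age` is hypothetical, and the beta values and ages are invented.

```python
from typing import Dict, Union

AgeValue = Union[float, str]


def merge_with_age(
    beta_rows: Dict[str, Dict[str, float]],
    known_ages: Dict[str, float],
) -> Dict[str, Dict[str, AgeValue]]:
    """Combine normalized methylation rows with an 'age' row.

    beta_rows maps a methylation site to {sample_id: beta value};
    known_ages maps a sample_id to its chronological age, if known.
    """
    sample_ids = sorted({sid for row in beta_rows.values() for sid in row})
    # Samples without a known age receive the literal string 'unknown'.
    age_row: Dict[str, AgeValue] = {
        sid: known_ages.get(sid, "unknown") for sid in sample_ids
    }
    merged: Dict[str, Dict[str, AgeValue]] = {"age": age_row}
    merged.update(beta_rows)
    return merged


# Invented example values for one probe named in the text.
beta_rows = {"cg16867657": {"S1": 0.61, "S2": 0.74, "S3": 0.58}}
merged = merge_with_age(beta_rows, {"S1": 34, "S2": 61})
```

Here `merged["age"]` holds 34 and 61 for the two modeling samples and 'unknown' for the sample to be tested.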
In the method, the process of constructing a model for obtaining the predicted methylation age by using an elastic network function may include the steps of constructing a model and evaluating the model.
The model construction may comprise the steps of:
A1) data format conversion: converting the known age data set into a data frame (DataFrame) using the pandas module;
A2) data splitting: splitting the known age data frame into a training data set and a validation data set using the train_test_split function in the sklearn module; the validation data set accounts for 20%;
A3) selecting optimal parameters: optimally selecting the parameters of the elastic network function using GridSearchCV in the model_selection submodule of the sklearn module, wherein the selected parameter pool is 'alpha': [0.0001, 0.001, 0.01, 0.1, 0.5, 1], 'l1_ratio': [0.2, 0.1, 0.05, 0.01];
A4) obtaining a model: training on the training data set using the elastic network function to obtain the model for predicting methylation age;
the model evaluation may comprise predicting the methylation age of the validation data set samples using the obtained model for predicting methylation age, and calculating a correlation coefficient, a mean absolute error, a mean squared error and/or a root mean squared error between the predicted methylation age and the chronological age of the validation data set samples to evaluate the prediction accuracy of the model.
The closer the correlation coefficient is to 1, the better the methylation age predicted by the model agrees with the actual chronological age. The closer the mean absolute error, mean squared error and/or root mean squared error is to 0, the better the methylation age predicted by the model agrees with the actual chronological age.
In the method described above, the correlation coefficient may be greater than 0.80.
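For reference, the objective minimized by an elastic network model can be written out explicitly. The form below assumes that the elastic network function is scikit-learn's ElasticNet, as the parameter names alpha and l1_ratio suggest; the symbols $X$ (methylation levels of $n$ modeling samples), $y$ (their chronological ages), $w$ (site coefficients), $\alpha$ (alpha) and $\rho$ (l1_ratio) are notation introduced here, not taken from the source.

```latex
\min_{w}\; \frac{1}{2n}\,\lVert y - Xw \rVert_2^2
  \;+\; \alpha\,\rho\,\lVert w \rVert_1
  \;+\; \frac{\alpha\,(1-\rho)}{2}\,\lVert w \rVert_2^2
```

Under this reading, small values of $\rho$ such as those in the parameter pool weight the penalty toward the squared $\ell_2$ term, which shrinks the coefficients of correlated sites together rather than selecting only one of them.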
In order to solve the technical problem, the invention also provides a device for predicting the methylation age of the sample to be detected. The device comprises:
B1) A data identification module: for identifying whether the sample data is a standard methylation level file. When the sample data is raw methylation chip data, the data quality control module B2) is executed; when the sample data is a standard methylation level file, the methylation site information extraction module B3) is executed. The samples comprise a sample to be tested and a modeling sample; the sample data contains age information; the age information of the sample to be tested is unknown (unknown), and the age information of the modeling sample is the known actual chronological age.
B2) A data quality control module: for performing quality control on the data to obtain a standard methylation level file.
B3) A methylation site information extraction module: for extracting the methylation site information of the standard methylation level file to obtain a methylation site information file, and for splitting the methylation site information file into a known age data set and an unknown age data set according to the age information.
B4) A module for constructing a model for predicting methylation age: for constructing a model for predicting methylation age from the known age data set of B3) using an elastic network function.
B5) A module for predicting the methylation age of the sample to be tested: for predicting the methylation age of the sample to be tested using the unknown age data set of B3) and the model for predicting methylation age of B4).
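The control flow between modules B1) to B5) can be sketched as a simple dispatcher. This is a hedged illustration only: the `kind` field, the function names and the placeholder quality control step are all hypothetical, and the sketch records which modules would run rather than performing any real analysis.

```python
from typing import Dict, List


def is_standard_methylation_file(sample_data: Dict[str, str]) -> bool:
    """B1 (hypothetical check): treat data tagged kind == 'standard' as already normalized."""
    return sample_data.get("kind") == "standard"


def quality_control(sample_data: Dict[str, str]) -> Dict[str, str]:
    """B2 placeholder: a real module would normalize the raw chip data here."""
    return {**sample_data, "kind": "standard"}


def run_device(sample_data: Dict[str, str]) -> List[str]:
    """Return the ordered list of modules executed for the given input."""
    executed = ["B1"]
    if not is_standard_methylation_file(sample_data):
        executed.append("B2")  # raw data must pass quality control first
        sample_data = quality_control(sample_data)
    # Site extraction, model construction and prediction always follow.
    executed.extend(["B3", "B4", "B5"])
    return executed


raw_path = run_device({"kind": "raw"})
standard_path = run_device({"kind": "standard"})
```

The only branch is at B1): raw input passes through B2), while a standard methylation level file goes straight to B3).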
In the above-mentioned apparatus, B3) the methylation site information may be important methylation site information. The important methylation sites can be methylation sites in gene regions reported in the literature or associated with age changes.
The genes can be ELOVL2, FHL2, SMARCB1, AEBP2, ZYG11A, OTUD7A, TRIM59, TTC7A, EVL, WBP1L, KLF14 and GRIN1 genes on human genome.
The important methylation sites can be the methylation sites targeted by the following probes of the EPIC methylation array on the Illumina methylation chip (annotation file path: https:// support. Illumina. com. cn/array/array _ kit/infinite _ loads. html; annotation file update time Mar 13, 2020): cg01912455, cg20793144, cg13868026, cg18071806, cg09038267, cg06784991, cg07955995, cg22454769, cg11667847, cg04875128, cg07553761 and cg16867657.
Of the genes described above, the nucleotide sequence of the ELOVL2 Gene may correspond to position 10980992-11044538 (Gene ID:54898, update date 17-Jan-2022) of chromosome 6 of GRCh37 version of the human genome.
The nucleotide sequence of the FHL2 Gene may correspond to position 105974169-106054986 of chromosome 2 of GRCh37 version of the human genome (Gene ID:2274, update date 5-Jan-2022).
The nucleotide sequence of the SMARCB1 Gene may correspond to position 24129153-24180196 of chromosome 22 of GRCh37 version of the human genome (Gene ID:6598, update date 17-Jan-2022).
The nucleotide sequence of the AEBP2 Gene may correspond to position 19592426-19675161 of chromosome 12 of GRCh37 version of the human genome (Gene ID:121536, update date 5-Jan-2022).
The nucleotide sequence of the ZYG11A Gene may correspond to position 53308432-53360670 of chromosome 1 of GRCh37 version of the human genome (Gene ID:440590, update date 5-Jan-2022).
The nucleotide sequence of the OTUD7A Gene may correspond to position 317676767601-32162876 of chromosome 15 of GRCh37 version of the human genome (Gene ID:161725, update date 5-Jan-2022).
The nucleotide sequence of the TRIM59 Gene may correspond to position 160153291-160167574 of chromosome 3 of GRCh37 version of the human genome (Gene ID:286827, update date 5-Jan-2022).
The nucleotide sequence of the TTC7A Gene may correspond to position 47143005-47303262 of chromosome 2 of GRCh37 version of the human genome (Gene ID:57217, update date 5-Jan-2022).
The nucleotide sequence of the EVL Gene may correspond to position 100437759-100610573 of chromosome 14 of GRCh37 version of the human genome (Gene ID:51466, update date 5-Jan-2022).
The nucleotide sequence of the WBP1L Gene may correspond to positions 104503705-104576019 of chromosome 10 of GRCh37 version of the human genome (Gene ID:54838, update date 5-Jan-2022).
The nucleotide sequence of the KLF14 Gene may correspond to position 130415525-130418967 of chromosome 7 of GRCh37 version of the human genome (Gene ID:136259, update date 5-Jan-2022).
The nucleotide sequence of the GRIN1 Gene may correspond to position 140033606-140063208 of chromosome 9 of the GRCh37 version of the human genome (Gene ID:2902, update date 23-Jan-2022).
The raw methylation data can be the raw instrument output obtained by running the sample on a methylation chip. The methylation chip can be an Illumina methylation chip. The sample may be a biological tissue and/or cell sample.
The quality control can comprise the following steps: processing the raw methylation data with the R package ChAMP, normalizing the raw data with the parameters cor = 5, method = BMIQ and arraytype = the corresponding methylation chip type to obtain normalized methylation level data. The normalized methylation data are then combined with the age data of the samples (obtained from the raw data) into one merged file as follows: if the age of a sample is known, the corresponding age is added to the age row of the merged file; if the age of a sample is unknown, 'unknown' is entered in the age row of the merged file. The merged file is used for subsequent analysis.
In the above device, the module B4) for constructing a model for predicting methylation age may specifically include a model building module and a model evaluation module.
The model building module may include the following modules:
A1) A data format conversion module: for converting the known age data set into a data frame (DataFrame) using the pandas module.
A2) A data splitting module: for splitting the known age data frame into a training data set and a validation data set using the train_test_split function in the sklearn module; the validation data set accounts for 20%.
A3) An optimal parameter selection module: for optimally selecting the parameters of the elastic network function using GridSearchCV in the model_selection submodule of the sklearn module; the selected parameter pool can be 'alpha': [0.0001, 0.001, 0.01, 0.1, 0.5, 1], 'l1_ratio': [0.2, 0.1, 0.05, 0.01].
A4) A model obtaining module: for training on the training data set, specifically using ElasticNet, to obtain the model for predicting methylation age.
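To make the size of the parameter pool concrete, the sketch below enumerates every (alpha, l1_ratio) combination that a grid search over it would evaluate. It assumes the alpha pool reads 0.0001, 0.001, 0.01, 0.1, 0.5, 1; the variable names are illustrative only.

```python
from itertools import product

# Parameter pool for the elastic network function (alpha values as assumed above).
param_grid = {
    "alpha": [0.0001, 0.001, 0.01, 0.1, 0.5, 1],
    "l1_ratio": [0.2, 0.1, 0.05, 0.01],
}

# Cartesian product of the two lists: one dict per candidate setting.
candidates = [
    dict(zip(param_grid, values)) for values in product(*param_grid.values())
]
n_candidates = len(candidates)
```

With six alpha values and four l1_ratio values, 24 candidate settings are compared; with k-fold cross-validation each setting is fitted k times.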
The model evaluation comprises predicting the methylation age of the validation data set samples using the obtained model for predicting methylation age, and calculating a correlation coefficient, a mean absolute error, a mean squared error and a root mean squared error between the predicted methylation age and the chronological age of the validation data set samples so as to evaluate the prediction accuracy of the model.
The closer the correlation coefficient is to 1, the better the methylation age predicted by the model agrees with the actual chronological age. The closer the mean absolute error, mean squared error and/or root mean squared error is to 0, the better the methylation age predicted by the model agrees with the actual chronological age.
The sample described above may be derived from a human, an animal, a plant or a microorganism.
In order to solve the above technical problem, the present invention also provides a computer-readable storage medium storing a computer program. The computer program may cause a computer to perform the steps of the method as described above.
The invention provides a methylation clock age prediction method based on the methylation levels of methylation sites. In the embodiment of the invention, R-square, MAE, MSE and RMSE are calculated between the predicted methylation age of the validation set and the known chronological age of the validation set samples in order to evaluate the prediction accuracy of the model. The resulting R-square is 0.965, close to 1, indicating that the methylation age predicted by the model for the validation set samples is highly consistent with their actual chronological age; the MAE is 3.575, indicating that the ages predicted by the constructed methylation age prediction model deviate from the true values by 3.575 years on average in absolute terms.
The invention has the following beneficial effects: the important loci are used for predicting the methylation age of the sample to be detected, the retained information is complete, the obtained methylation age predicted value is accurate, and the degree of coincidence with the real time age is higher.
Drawings
FIG. 1 shows the analysis flow of the methylation age assessment method based on DNA methylation level data.
Detailed Description
The present invention is described in further detail below with reference to specific embodiments, which are given for the purpose of illustration only and are not intended to limit the scope of the invention. The examples provided below serve as a guide for further modifications by a person skilled in the art and do not constitute a limitation of the invention in any way.
The experimental procedures in the following examples, unless otherwise indicated, are conventional and are carried out according to the techniques or conditions described in the literature in the field or according to the instructions of the products. Materials, reagents and the like used in the following examples are commercially available unless otherwise specified.
Example 1 establishment of methylation age assessment method
1.1 input data quality control
When the input sample data for methylation age assessment is raw methylation chip data, quality control must be performed on the raw data.
The quality control comprises the following specific steps: the raw methylation data are processed with the R package ChAMP, and the raw data are normalized with the parameters cor = 5, method = BMIQ and arraytype = the corresponding methylation chip type to obtain normalized methylation level data.
When the input sample data for methylation age assessment is already a standardized methylation level data file, the quality control step can be omitted.
The samples comprise a sample to be tested and a modeling sample; the sample data contains age information; the age information of the sample to be tested is unknown (unknown), and the age information of the modeling sample is the known actual chronological age.
The normalized methylation data are combined with the age data of the samples (obtained from the raw data) into one merged file as follows: if the age of a sample is known (modeling sample), the corresponding age is added to the age row of the merged file; if the age of a sample is unknown (sample to be tested), 'unknown' is entered in the age row of the merged file. The merged file is used for subsequent analysis.
1.2 extraction of methylation site information
Methylation site information of the samples is extracted from the merged file obtained in step 1.1 to obtain a methylation site information file. The methylation sites are important methylation sites, that is, methylation sites within the regions of 12 genes that are reported in the literature or related to age changes; the correlation threshold used for screening is 0.6. The 12 genes are ELOVL2, FHL2, SMARCB1, AEBP2, ZYG11A, OTUD7A, TRIM59, TTC7A, EVL, WBP1L, KLF14 and GRIN1 of the human genome.
Wherein the nucleotide sequence of the ELOVL2 Gene corresponds to the nucleotide sequence at positions 10980992-11044538 (Gene ID:54898, update date 17-Jan-2022) of chromosome 6 of GRCh37 version of the human genome;
the nucleotide sequence of the FHL2 Gene corresponds to position 105974169-106054986 of chromosome 2 of GRCh37 version of the human genome (Gene ID:2274, update date 5-Jan-2022);
the nucleotide sequence of the SMARCB1 Gene corresponds to position 24129153-24180196 of chromosome 22 of GRCh37 version of the human genome (Gene ID:6598, update date 17-Jan-2022);
the nucleotide sequence of the AEBP2 Gene corresponds to position 19592426-19675161 of chromosome 12 of GRCh37 version of the human genome (Gene ID:121536, update date 5-Jan-2022);
the nucleotide sequence of the ZYG11A Gene corresponds to position 53308432-53360670 of chromosome 1 of GRCh37 version of the human genome (Gene ID:440590, update date 5-Jan-2022);
the nucleotide sequence of the OTUD7A Gene corresponds to position 317676767601-32162876 of chromosome 15 of GRCh37 version of the human genome (Gene ID:161725, update date 5-Jan-2022);
the nucleotide sequence of the TRIM59 Gene corresponds to position 160153291-160167574 of chromosome 3 of GRCh37 version of the human genome (Gene ID:286827, update date 5-Jan-2022);
the nucleotide sequence of the TTC7A Gene corresponds to the 47143005-47303262 locus of chromosome 2 of GRCh37 version of the human genome (Gene ID:57217, update date 5-Jan-2022);
the nucleotide sequence of the EVL Gene corresponds to the 100437759-100610573 position of chromosome 14 of GRCh37 version of the human genome (Gene ID:51466, update date 5-Jan-2022);
the nucleotide sequence of the WBP1L Gene corresponds to positions 104503705-104576019 of chromosome 10 of GRCh37 version of the human genome (Gene ID:54838, update date 5-Jan-2022);
the nucleotide sequence of the KLF14 Gene corresponds to position 130415525-130418967 of chromosome 7 of GRCh37 version of the human genome (Gene ID:136259, update date 5-Jan-2022);
the nucleotide sequence of the GRIN1 Gene corresponds to positions 140033606-140063208 of chromosome 9 of GRCh37 version of the human genome (Gene ID:2902, update date 23-Jan-2022).
The samples in the methylation site information file are split according to the age data into a known age data set (information of the modeling samples) and an unknown age data set (information of the samples to be tested). The known age data set is used to establish the methylation clock model; the samples of unknown age are taken as the samples to be tested, and the unknown age data set is fed into the methylation clock model to obtain the methylation age of the samples to be tested.
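The split described above depends only on the age row of the merged file. A minimal sketch, assuming the age row is available as a mapping from sample ID to either a number or the string 'unknown' (the function name `split_by_age` and the sample IDs are hypothetical):

```python
from typing import Dict, List, Tuple, Union


def split_by_age(
    age_row: Dict[str, Union[float, str]],
) -> Tuple[Dict[str, float], List[str]]:
    """Separate modeling samples (known age) from samples to be tested ('unknown')."""
    known: Dict[str, float] = {}
    unknown: List[str] = []
    for sample_id, age in age_row.items():
        if age == "unknown":
            unknown.append(sample_id)
        else:
            known[sample_id] = float(age)  # also accepts ages stored as text
    return known, unknown


# Invented age row for three samples.
known_ages, samples_to_test = split_by_age({"S1": 34, "S2": "unknown", "S3": "61"})
```

The known ages become the regression target for model building, while the IDs in `samples_to_test` select the columns that are later passed to the trained model.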
1.3 building a methylation clock model and performing validation
1.3.1 building a methylation clock model
The known age data set (information of the samples of known age) obtained in step 1.2 is processed with the elastic network (ElasticNet) function to construct the methylation clock model; processing and construction are carried out in Python version 3.7.
1) Data format conversion: the known age data set is read in and converted to a data frame (DataFrame) using the pandas module.
2) Data splitting: the known age data set is split into a training data set and a validation data set using the train_test_split function in the sklearn module; the training data set is used to train the model and the validation data set is used to validate it. The validation data set accounts for 20% of the samples, using the parameters test_size = 0.2 and random_state = 1. The training data set is trained using the ElasticNet function in the linear_model submodule of the sklearn module.
3) Selecting optimal parameters: before training, GridSearchCV in the model_selection submodule of the sklearn module is used to optimally select the parameters of the elastic network function, wherein the selected parameter pool is 'alpha': [0.0001, 0.001, 0.01, 0.1, 0.5, 1], 'l1_ratio': [0.2, 0.1, 0.05, 0.01].
4) Obtaining a model: after the optimal parameters have been selected, the training data set is trained using the elastic network (ElasticNet) function; when training is complete, the methylation age evaluation model, namely the methylation clock model, is obtained.
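Steps 1) to 4) can be sketched end to end with pandas and scikit-learn. This is a hedged illustration rather than the applicant's code: the methylation data are synthetic random numbers, the column names are invented, the alpha pool assumes the value 0.5, and `cv=5` and `max_iter=100000` are choices made here that the source does not specify.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the known age data set: 80 samples x 12 sites (invented values).
rng = np.random.default_rng(0)
n_samples, n_sites = 80, 12
beta = rng.uniform(0.0, 1.0, size=(n_samples, n_sites))
age = 45 + beta @ rng.uniform(-30, 30, n_sites) + rng.normal(0, 2, n_samples)

# 1) Data format conversion: hold the data in a pandas DataFrame.
frame = pd.DataFrame(beta, columns=[f"site_{i}" for i in range(n_sites)])
frame["age"] = age

# 2) Data splitting: 20% validation, with the parameters quoted in the text.
X = frame.drop(columns="age")
y = frame["age"]
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# 3) Optimal parameter selection over the parameter pool (cv=5 is an assumption).
param_grid = {
    "alpha": [0.0001, 0.001, 0.01, 0.1, 0.5, 1],
    "l1_ratio": [0.2, 0.1, 0.05, 0.01],
}
search = GridSearchCV(ElasticNet(max_iter=100000), param_grid, cv=5)
search.fit(X_train, y_train)

# 4) Obtaining the model: GridSearchCV refits the best setting on the training set.
model = search.best_estimator_
predicted_val_age = model.predict(X_val)
```

Because `refit=True` is the GridSearchCV default, `best_estimator_` is already trained on the full training set, so no separate fitting call is needed after the search.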
1.3.2 methylation clock model and validation evaluation
The obtained methylation age evaluation model is applied to the verification data set to obtain the predicted methylation age of each verification sample. The correlation coefficient (R-square), mean absolute error (MAE), mean squared error (MSE), root mean squared error (RMSE) and similar statistics between the predicted ages and the known chronological ages of the verification samples are then calculated; these values objectively reflect the prediction accuracy of the model and evaluate its quality numerically. The closer R-square is to 1, and the closer MAE, MSE and/or RMSE are to 0, the better the methylation age predicted by the model agrees with the true chronological age of the sample.
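The four statistics can be computed as in the sketch below. The source does not give formulas, so R-square is taken here to be the coefficient of determination, 1 - SS_res/SS_tot, which is an assumption; the function name and the sample ages are illustrative only.

```python
import math


def regression_metrics(true_ages, predicted_ages):
    """Return R-square, MAE, MSE and RMSE for paired true and predicted ages."""
    n = len(true_ages)
    errors = [p - t for t, p in zip(true_ages, predicted_ages)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    # Coefficient of determination: 1 - residual sum of squares / total sum of squares.
    mean_true = sum(true_ages) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in true_ages)
    r_square = 1.0 - ss_res / ss_tot
    return {"R-square": r_square, "MAE": mae, "MSE": mse, "RMSE": rmse}


# Made-up ages, used only to exercise the function.
metrics = regression_metrics([30.0, 40.0, 50.0, 60.0], [32.0, 38.0, 51.0, 59.0])
```

For these made-up values the errors are 2, -2, 1 and -1 years, so MAE is 1.5, MSE is 2.5 and R-square is 0.98.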
1.4 predicting methylation age of sample to be tested according to methylation clock model
The unknown age samples obtained in step 1.2 are taken as the samples to be tested, and the corresponding unknown age data set is substituted into the methylation clock model to obtain the methylation age of the samples to be tested.
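Because an elastic net model is linear, substituting a sample into the fitted model amounts to an intercept plus a weighted sum of methylation levels. The sketch below shows that calculation; the intercept, the coefficients and the sample levels are hypothetical values chosen for illustration, not fitted values from the source.

```python
def predict_methylation_age(intercept, coefficients, levels):
    """Apply a fitted linear clock: intercept + sum(coefficient * level)."""
    return intercept + sum(coefficients[p] * levels[p] for p in coefficients)


# Hypothetical model parameters and sample values.
intercept = 10.0
coefficients = {"cg16867657": 60.0, "cg22454769": 25.0}
sample_levels = {"cg16867657": 0.5, "cg22454769": 0.4}

age = predict_methylation_age(intercept, coefficients, sample_levels)
```

With these illustrative numbers the result is 10 + 60 × 0.5 + 25 × 0.4 = 50 years.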
Example 2: methylation clock age prediction for 12 samples
Sample information: 92 samples (downloaded from the database at https://www.ncbi.nlm.nih.gov/geo/geo2r/?acc=GSE184047), of which 80 samples are treated as having a known chronological age and 12 samples as having an unknown chronological age:
2.1 sequencing data quality control
Table 1 shows the integrated file of the methylation level result files of the 80 samples with age data, after quality control and removal of sites with missing values.
TABLE 1 statistics of data quality control results
Figure BDA0003493911440000091
Figure BDA0003493911440000101
Methyl_site column (excluding the age row): methylation site name
Age row: age of the corresponding sample
Sample columns: all columns other than the Methyl_site column, one per sample
The value for a methylation site in each sample column is the methylation level of that site; in the chip data, a value greater than 0.6 indicates that the site is methylated, and a value less than 0.2 indicates that the site is unmethylated.
Note: an unknown age indicates that the age of the sample needs to be predicted.
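The two thresholds in the legend can be expressed as a small helper. The source does not name the state of values between 0.2 and 0.6, so the label "intermediate" below is an assumption, as is the function name.

```python
def methylation_state(level):
    """Classify a chip methylation level using the 0.6 and 0.2 thresholds."""
    if level > 0.6:
        return "methylated"
    if level < 0.2:
        return "unmethylated"
    # Not named in the source; label chosen for illustration.
    return "intermediate"


states = [methylation_state(v) for v in (0.85, 0.10, 0.40)]
```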
2.2 extraction of methylation site information
Methylation site information is extracted from the integrated file in step 2.1.
The methylation sites comprise important methylation sites, which are methylation sites in genes that are reported in the literature or that are related to age-associated change, screened with a correlation threshold of 0.6. They comprise methylation-related sites in the gene regions of 12 genes of the human genome: ELOVL2, FHL2, SMARCB1, AEBP2, ZYG11A, OTUD7A, TRIM59, TTC7A, EVL, WBP1L, KLF14 and GRIN1. Specifically, they are the methylation sites of the following probes on the Illumina EPIC methylation array (annotation file path: https://support.illumina.com.cn/array/array_kits/infinite-methylation-metadata-kit/downloads.html, annotation file update time Mar 13, 2020): cg01912455, cg20793144, cg13868026, cg18071806, cg09038267, cg06784991, cg07955995, cg22454769, cg11667847, cg04875128, cg07553761 and cg16867657.
Of the 12 genes, the nucleotide sequence of the ELOVL2 gene corresponds to positions 10980992-11044538 of chromosome 6 of the GRCh37 version of the human genome (Gene ID: 54898, update date 17-Jan-2022);
the nucleotide sequence of the FHL2 gene corresponds to positions 105974169-106054986 of chromosome 2 of the GRCh37 version of the human genome (Gene ID: 2274, update date 5-Jan-2022);
the nucleotide sequence of the SMARCB1 gene corresponds to positions 24129153-24180196 of chromosome 22 of the GRCh37 version of the human genome (Gene ID: 6598, update date 17-Jan-2022);
the nucleotide sequence of the AEBP2 gene corresponds to positions 19592426-19675161 of chromosome 12 of the GRCh37 version of the human genome (Gene ID: 121536, update date 5-Jan-2022);
the nucleotide sequence of the ZYG11A gene corresponds to positions 53308432-53360670 of chromosome 1 of the GRCh37 version of the human genome (Gene ID: 440590, update date 5-Jan-2022);
the nucleotide sequence of the OTUD7A gene corresponds to positions 317676767601-32162876 of chromosome 15 of the GRCh37 version of the human genome (Gene ID: 161725, update date 5-Jan-2022);
the nucleotide sequence of the TRIM59 gene corresponds to positions 160153291-160167574 of chromosome 3 of the GRCh37 version of the human genome (Gene ID: 286827, update date 5-Jan-2022);
the nucleotide sequence of the TTC7A gene corresponds to positions 47143005-47303262 of chromosome 2 of the GRCh37 version of the human genome (Gene ID: 57217, update date 5-Jan-2022);
the nucleotide sequence of the EVL gene corresponds to positions 100437759-100610573 of chromosome 14 of the GRCh37 version of the human genome (Gene ID: 51466, update date 5-Jan-2022);
the nucleotide sequence of the WBP1L gene corresponds to positions 104503705-104576019 of chromosome 10 of the GRCh37 version of the human genome (Gene ID: 54838, update date 5-Jan-2022);
the nucleotide sequence of the KLF14 gene corresponds to positions 130415525-130418967 of chromosome 7 of the GRCh37 version of the human genome (Gene ID: 136259, update date 5-Jan-2022);
the nucleotide sequence of the GRIN1 gene corresponds to positions 140033606-140063208 of chromosome 9 of the GRCh37 version of the human genome (Gene ID: 2902, update date 23-Jan-2022).
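The extraction of the 12 probe rows in step 2.2 can be sketched as a simple filter on probe ID. The probe IDs are those listed in this step; the table layout (probe ID mapped to per-sample levels), the function name `extract_sites` and the extra probe `cg00000000` are hypothetical.

```python
AGE_PROBES = [
    "cg01912455", "cg20793144", "cg13868026", "cg18071806",
    "cg09038267", "cg06784991", "cg07955995", "cg22454769",
    "cg11667847", "cg04875128", "cg07553761", "cg16867657",
]


def extract_sites(table, probes=AGE_PROBES):
    """Keep only the rows of `table` whose probe ID is in `probes`.

    `table` maps a probe ID to a dict of sample name -> methylation level.
    Raises KeyError if a required probe is absent from the table.
    """
    missing = [p for p in probes if p not in table]
    if missing:
        raise KeyError(f"probes missing from table: {missing}")
    return {p: table[p] for p in probes}


# Toy table: the 12 probes plus one unrelated, made-up probe ID.
table = {p: {"S1": 0.5} for p in AGE_PROBES}
table["cg00000000"] = {"S1": 0.9}
site_info = extract_sites(table)
```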
The file obtained after extracting the methylation site information is split according to the age data into a known age data set (information of the 80 known age samples, Table 2) and an unknown age data set (information of the 12 unknown age samples, Table 3).
TABLE 2 known age data set
Figure BDA0003493911440000111
TABLE 3 unknown age data set
Figure BDA0003493911440000112
2.3 methylation clock model construction and evaluation results
2.3.1 methylation clock model construction
The known age data set obtained in step 2.2 is processed and analyzed with the ElasticNet function; the analysis is carried out in Python version 3.7.
1) Data conversion: the known age data set is read in and converted to a data frame (DataFrame) using the pandas module.
2) Data splitting: the known age data set is split into two parts, a training data set and a verification data set, using the train_test_split function in the sklearn module with the parameters test_size=0.2 and random_state=1, giving 64 training data set samples and 16 verification data set samples; the training data set is trained with the ElasticNet function in the linear_model submodule of the sklearn module.
3) Selecting optimal parameters: before training, GridSearchCV in the model_selection submodule of the sklearn module is used to select the optimal parameters of the ElasticNet function; the parameter pool is 'alpha': [0.0001,0.001,0.01,0.1,0,5,1], 'l1_ratio': [0.2,0.1,0.05,0.01].
4) Obtaining a model: after the optimal parameters have been selected, the training data set is trained with the ElasticNet function; the trained model is the methylation age evaluation model, namely the methylation clock model.
2.3.2 methylation clock model evaluation
The obtained methylation age evaluation model is applied to the verification data set to obtain the predicted methylation age of the verification samples. R-square, MAE, MSE and RMSE between the predicted methylation age and the known chronological age of the verification samples are calculated to evaluate the prediction accuracy of the model. The results are shown in Table 4: R-square is 0.965, close to 1, indicating that the methylation age predicted by the model for the verification samples agrees closely with their true chronological age; the MAE is 3.575, indicating that the mean absolute error between the predicted and true values is 3.575 years.
TABLE 4 methylation clock model evaluation statistics
Figure BDA0003493911440000121
Note: called: calculated value; ideal: optimal value.
2.4 methylation clock age prediction results
The unknown age samples obtained in step 2.2 (12 samples) are taken as the samples to be tested, and their unknown age data set is substituted into the methylation age evaluation model obtained in step 2.3 to obtain the methylation clock age prediction results of the samples to be tested (Table 5).
TABLE 5 prediction of methylation clock age
Figure BDA0003493911440000122
Figure BDA0003493911440000131
Note: Sample: sample name; PredictAge: predicted age; Age: actual age
The present invention has been described in detail above. It will be apparent to those skilled in the art that the invention can be practiced in a wide range of equivalent parameters, concentrations, and conditions without departing from the spirit and scope of the invention and without undue experimentation. While the invention has been described with reference to specific embodiments, it will be appreciated that the invention can be further modified. In general, this application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. The use of some of the essential features is possible within the scope of the claims attached below.

Claims (8)

1. A method of predicting methylation age, comprising the steps of: extracting methylation site information from a sample methylation level file to obtain a methylation site information file; splitting the methylation site information file into a known age data set and an unknown age data set according to age information; constructing, from the known age data set, a model for predicting methylation age by using an elastic net function; and predicting the methylation age of the samples of the unknown age data set by using the model for predicting methylation age.
2. The method of claim 1, wherein: the methylation level file is a file obtained by performing quality control on the methylation raw data.
3. The method according to claim 1 or 2, characterized in that: constructing the model for predicting methylation age by using the elastic net function comprises model construction and model evaluation;
the model construction comprises the following steps:
A1) data format conversion: converting the known age data set into a known age data set DataFrame using the pandas module;
A2) data splitting: splitting the known age data set DataFrame into a training data set and a verification data set using the train_test_split function in the sklearn module; the verification data set accounts for 20%;
A3) selecting optimal parameters: selecting the optimal parameters of the elastic net function using GridSearchCV in the model_selection submodule of the sklearn module, wherein the parameter pool is 'alpha': [0.0001,0.001,0.01,0.1,0,5,1], 'l1_ratio': [0.2,0.1,0.05,0.01];
A4) obtaining a model: training the training data set using the elastic net function to obtain the model for predicting methylation age;
the model evaluation comprises predicting the methylation age of the verification data set samples using the obtained model for predicting methylation age, and calculating a correlation coefficient, a mean absolute error, a mean squared error and/or a root mean squared error between the predicted methylation age and the chronological age of the verification data set samples, so as to evaluate the prediction accuracy of the model.
4. The method of claim 3, wherein: the correlation coefficient is greater than 0.80.
5. A device for predicting the methylation age of a sample to be tested, characterized in that the device comprises:
B1) a data identification module: identifying whether the sample data is a standard methylation level file; when the sample data is raw methylation sequencing data, executing B2) the data quality control module; when the sample data is a standard methylation level file, executing B3) the methylation site information extraction module; the samples comprise a sample to be tested and modeling samples; the sample data contains age information; the age information of the sample to be tested is unknown, and the age information of the modeling samples is the known true chronological age;
B2) the data quality control module: performing quality control on the data to obtain a standard methylation level file;
B3) the methylation site information extraction module: extracting methylation site information from the standard methylation level file to obtain a methylation site information file; splitting the methylation site information file into a known age data set and an unknown age data set according to age information;
B4) a module for constructing a model for predicting methylation age: using the known age data set from module B3) to construct a model for predicting methylation age by using an elastic net function;
B5) a module for predicting the methylation age of the sample to be tested: predicting the methylation age of the sample to be tested by using the unknown age data set from B3) and the model for predicting methylation age from B4).
6. The device of claim 5, wherein: B4) the module for constructing a model for predicting methylation age specifically performs model construction and model evaluation;
the model construction comprises the following steps:
A1) data format conversion: converting the known age data set into a known age data set DataFrame using the pandas module;
A2) data splitting: splitting the known age data set DataFrame into a training data set and a verification data set using the train_test_split function in the sklearn module; the verification data set accounts for 20%;
A3) selecting optimal parameters: selecting the optimal parameters of the elastic net function using GridSearchCV in the model_selection submodule of the sklearn module, wherein the parameter pool is 'alpha': [0.0001,0.001,0.01,0.1,0,5,1], 'l1_ratio': [0.2,0.1,0.05,0.01];
A4) obtaining a model: training the training data set using ElasticNet to obtain the model for predicting methylation age;
the model evaluation comprises predicting the methylation age of the verification data set samples using the obtained model for predicting methylation age, and calculating a correlation coefficient, a mean absolute error, a mean squared error and a root mean squared error between the predicted methylation age and the chronological age of the verification data set samples, so as to evaluate the prediction accuracy of the model.
7. The method according to any one of claims 1-4 and/or the device according to claim 5 or 6, characterized in that: the sample is an animal, plant or microorganism.
8. A computer-readable storage medium having stored thereon a computer program for causing a computer to perform the steps of the method according to any one of claims 1-4.
CN202210107574.3A 2022-01-28 2022-01-28 Methylation age assessment method based on DNA methylation level data Pending CN114464255A (en)

Publications (1)

Publication Number Publication Date
CN114464255A true CN114464255A (en) 2022-05-10

Family

ID=81411772

Country Status (1)

Country Link
CN (1) CN114464255A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination