CN114045333B - Method for predicting age by pyrosequencing and random forest regression analysis

Info

Publication number
CN114045333B
CN114045333B CN202111223180.6A CN202111223180A CN114045333B CN 114045333 B CN114045333 B CN 114045333B CN 202111223180 A CN202111223180 A CN 202111223180A CN 114045333 B CN114045333 B CN 114045333B
Authority
CN
China
Prior art keywords
sites
gene
cpg
age
random forest
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111223180.6A
Other languages
Chinese (zh)
Other versions
CN114045333A (en)
Inventor
严江伟
杨丰隆
张更谦
张君
郝青青
张晓梦
漆小琴
杨婷婷
王雅雅
余代静
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanxi Medical University
Original Assignee
Shanxi Medical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Classifications

    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6876 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6844 Nucleic acid amplification reactions
    • C12Q1/6858 Allele-specific amplification
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30 Detection of binding sites or motifs
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00 ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G16B25/20 Polymerase chain reaction [PCR]; Primer or probe design; Probe optimisation
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q2600/00 Oligonucleotides characterized by their use
    • C12Q2600/154 Methylation markers

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Organic Chemistry (AREA)
  • Biophysics (AREA)
  • Biotechnology (AREA)
  • General Health & Medical Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Molecular Biology (AREA)
  • Genetics & Genomics (AREA)
  • Medical Informatics (AREA)
  • Wood Science & Technology (AREA)
  • Zoology (AREA)
  • Evolutionary Biology (AREA)
  • Theoretical Computer Science (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Microbiology (AREA)
  • Immunology (AREA)
  • Chemical Kinetics & Catalysis (AREA)
  • Biochemistry (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Bioethics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Evolutionary Computation (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention provides a method for age prediction that combines pyrosequencing with random forest regression analysis. The random forest regression model is constructed with the R package randomForest, and the optimal site combination is determined by forward selection. The method requires only 0.1 ng of template DNA and can therefore be applied to difficult forensic bloodstain samples; the whole workflow can be completed within 10 hours; two independent age prediction models are established, one per sex, to account for sex differences; and a prediction accuracy of MAD < 3 years is reached using only 3-4 CpG sites.

Description

Method for predicting age by pyrosequencing and random forest regression analysis
Technical Field
The invention belongs to the field of forensic medicine, and particularly relates to a method for predicting age by pyrosequencing and random forest regression analysis.
Background
Assessing the physiological age of an unknown sample donor is one of the most important tools in forensic investigation. It narrows the range of suspects and complements the prediction of externally visible characteristics and the inference of biogeographic ancestry. Established age estimation methods rely on morphological analysis of skeletal features: when solid tissues such as bones and teeth are available, age can be determined precisely by anthropological methods. Such methods are difficult to use in practice, however, because other materials, such as body fluids, are encountered far more often in forensic investigations. More recently, several molecular methods have been proposed for age estimation, including telomere length analysis, age-dependent deletions of mitochondrial DNA, T-cell DNA rearrangement, and protein alterations such as aspartic acid racemization and advanced glycation end products. All of these methods have drawbacks that restrict their use at crime scenes, in particular low accuracy and stringent sample requirements. For example, the standard error of age prediction based on quantification of signal joint T-cell receptor excision circles (sjTRECs) is ±8.0 years.
One possible alternative is to measure epigenetic modifications such as DNA methylation, which are now known to change with age. To date, forensic age prediction studies have focused mainly on whole blood samples and have reported mean absolute deviations (MAD) of 3-10 years, mostly using multiple linear regression models. A small number of studies have achieved lower prediction errors (3.24-4.7 years) with machine learning algorithms such as support vector machines (SVM), artificial neural networks (ANN) and random forest regression (RFR); however, these studies were performed only on fresh body fluids. Furthermore, age prediction from bloodstains, which are more common in crime scene investigation, has not been systematically investigated.
The invention therefore aims to establish a sensitive, rapid and reliable age prediction method, based on pyrosequencing and a random forest regression model, that is suitable for a variety of sample types, including bloodstains.
Disclosure of Invention
The invention screens the genome for a set of DNA methylation sites for age prediction in forensic casework samples, designs primers for each site, measures the methylation level of each site by pyrosequencing, and then uses random forest regression to build separate age prediction models for men and women. The aim is to provide a sensitive, rapid and reliable age prediction method that retains high accuracy with few sites and can be applied to bloodstains and other sample types. The DNA extraction, primer design and sequencing protocols are also optimized.
Terms:
RFR: Random Forest Regression.
SVR: Support Vector Regression.
MAD: Mean Absolute Deviation, i.e. the mean absolute error between predicted and actual age.
In one aspect, the present invention provides a method for age prediction.
The method comprises pyrosequencing and random forest regression analysis, wherein the random forest regression model is constructed with the R package randomForest and the optimal site combination is determined by forward selection.
The parameters for constructing the random forest regression model are set as follows: the mtry parameter equals the number of CpG sites used in each model, the minimum node size is 5, and the number of trees is 1000.
The random forest regression model uses age-related DNA methylation markers located in the ELOVL2, C1orf132, TRIM59, KLF14, FHL2 and NPTX2 genes.
When the age prediction model is built by random forest regression, 7 age-related DNA methylation sites are selected: 3 sites for males (TRIM59.Pos7, KLF14.Pos2 and ELOVL2.Pos7) and 4 sites for females (TRIM59.Pos8, KLF14.Pos3, C1orf132.Pos2 and FHL2.Pos6).
The volume of PCR product used in pyrosequencing is 12 μL.
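The sex-specific panels described above can be pictured as a simple lookup that picks the model inputs from a sample's measured methylation values. The following is a minimal sketch, not the patented implementation: the names `SITE_PANELS` and `select_features` and the dictionary layout of a sample are assumptions made for illustration; only the site names come from the text.

```python
# Sex-specific CpG panels as listed in the text.
# The data layout (dict of site name -> methylation %) is an assumption.
SITE_PANELS = {
    "male": ["TRIM59.Pos7", "KLF14.Pos2", "ELOVL2.Pos7"],
    "female": ["TRIM59.Pos8", "KLF14.Pos3", "C1orf132.Pos2", "FHL2.Pos6"],
}


def select_features(sample, sex):
    """Return the methylation values for the panel matching the donor's sex.

    `sample` maps site names to methylation percentages; sites outside the
    panel are ignored, and a missing panel site raises KeyError.
    """
    panel = SITE_PANELS[sex]
    missing = [site for site in panel if site not in sample]
    if missing:
        raise KeyError(f"missing methylation values for: {missing}")
    return [sample[site] for site in panel]


# Illustrative values only, not measured data.
example = {"TRIM59.Pos7": 40.0, "KLF14.Pos2": 5.0, "ELOVL2.Pos7": 60.0}
print(select_features(example, "male"))
```

The resulting feature vector would then be passed to the model trained for that sex.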
In some embodiments, the method comprises the steps of:
(1) DNA extraction;
(2) Bisulfite conversion;
(3) PCR;
(4) Pyrosequencing;
(5) Model prediction.
In another aspect, the invention provides a set of gene combinations for age prediction using random forest regression analysis.
The gene combination comprises ELOVL2, C1orf132, TRIM59, KLF14, FHL2 and NPTX2.
The methylation sites comprise male-associated sites (TRIM59.Pos7, KLF14.Pos2 and ELOVL2.Pos7) and female-associated sites (TRIM59.Pos8, KLF14.Pos3, C1orf132.Pos2 and FHL2.Pos6).
In yet another aspect, the invention provides a set of primers for use in random forest regression analysis for age prediction.
The primers are used for pyrosequencing.
The primers and their sequencing sites are as follows:
[Table image BDA0003313375470000031: primers and their sequencing sites]
In the table, F, R and S denote the forward primer, reverse primer and sequencing primer, respectively, and a "biotin" prefix on a sequence indicates that the primer carries a biotin label.
In a further aspect, the invention provides the use of the aforementioned methods and/or gene combinations and/or methylation sites and/or primers in the preparation of a kit for predicting age.
In yet another aspect, the present invention provides a kit for predicting age.
The kit comprises the following primers:
[Table images BDA0003313375470000032 and BDA0003313375470000041: kit primers]
The kit also comprises other reagents for pyrosequencing.
The kit is used in combination with a random forest regression model.
The invention has the beneficial effects that:
(1) Only 0.1 ng of template DNA is needed, so the method can be applied to difficult forensic bloodstain samples
Several techniques, such as EpiTYPER, SNaPshot, pyrosequencing and massively parallel sequencing (MPS), can measure DNA methylation accurately. One of the main factors limiting the forensic use of EpiTYPER is that it requires up to 1 μg of genomic DNA; such large amounts are difficult to obtain in real crime scene investigations, where the material encountered is most often body fluid stains. Compared with EpiTYPER, MPS reduces the required template DNA to 10 ng and SNaPshot to 4 ng, and previous studies have reported that 10-20 ng of template DNA is needed for successful methylation-based age prediction. In the present invention, by contrast, accurate age prediction can be performed with 0.1 ng of template DNA. The method therefore has the highest sensitivity among existing bloodstain detection methods and has good prospects for forensic application.
(2) The whole process can be completed within 10 hours
The method of the invention can be completed within one day, much faster than other available methods. DNA extraction/quantification, sodium bisulfite conversion, PCR and pyrosequencing require 2 h, 2.5 h, 3 h and 2 h, respectively. In contrast, the standard procedures for both EpiTYPER and MPS take more than 2 days. MPS in particular requires specialized equipment and a complex bioinformatics analysis pipeline, and is difficult to complete within 3 days.
(3) Two independent age prediction models are established to account for sex differences
Random forest regression (RFR) was chosen to build the age prediction models, using 3 sites for males (TRIM59.Pos7, KLF14.Pos2 and ELOVL2.Pos7) and 4 sites for females (TRIM59.Pos8, KLF14.Pos3, C1orf132.Pos2 and FHL2.Pos6). The final models gave MADs of 2.8 years (R = 0.99) for males and 2.93 years (R = 0.98) for females.
(4) A prediction accuracy of MAD < 3 years is reached using only 3-4 CpG sites
A meta-analysis of age prediction studies from the past few years showed that almost all previously established age prediction models had MADs > 3 years. Owing to the use of RFR, our model is the most efficient: it reaches MAD < 3 years with only 3 CpG sites for male samples and 4 CpG sites for female samples. An age prediction model that combines few sites with high accuracy is more practical for forensic inference.
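The claim under effect (2) that the whole process fits within 10 hours can be checked directly from the four step durations quoted there. This is a small arithmetic sketch; the dictionary name `STEP_HOURS` is hypothetical, and hands-on time between steps, which the text does not quantify, is not included.

```python
# Step durations in hours, as quoted in the text under effect (2).
STEP_HOURS = {
    "DNA extraction/quantification": 2.0,
    "bisulfite conversion": 2.5,
    "PCR": 3.0,
    "pyrosequencing": 2.0,
}

total_hours = sum(STEP_HOURS.values())
print(f"total: {total_hours} h, within 10 h: {total_hours <= 10}")
```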
Drawings
FIG. 1 shows that Random Forest Regression (RFR) outperforms Support Vector Regression (SVR) in age prediction.
FIG. 2 is a graph of predicted age versus actual age for a Random Forest Regression (RFR) test data set.
FIG. 3 shows the sensitivity detection of 7 methylation markers in trace DNA.
FIG. 4 is an analysis of the correlation between the methylation level of 7 CpG sites and age.
Figure 5 is a comparison of the accuracy of the age prediction method with published studies.
Detailed Description
The present invention is described in further detail below with reference to specific examples, which illustrate the invention and are not intended to limit it. Unless otherwise specified, the experimental methods used in the following examples are conventional methods carried out under conventional conditions, and the materials, reagents and the like used in the following examples are commercially available.
Example 1 DNA extraction and site selection
(1) DNA extraction:
The DNA extraction protocol was optimized to reduce the loss of trace DNA from bloodstains.
The accuracy of methylation analysis depends on extracting high-quality DNA from bloodstains. The QIAamp DNA Investigator Kit has been shown to be a reliable method for extracting DNA from forensic samples, yielding high-quality DNA within 2 hours. We further optimized the kit protocol by adding a short incubation at a higher temperature, adding carrier RNA to the lysate, and heating the reagent used to dissolve the DNA.
The original protocol incubates the sample at 56 ℃ for 1 hour; the modified protocol first incubates the sample at 85 ℃ for 10 minutes and then at 56 ℃ for 1 hour, which increases the yield of bloodstain DNA.
Heating the reagent used to dissolve the DNA can also accelerate the release of cells from the bloodstain and improve DNA dissolution, thereby reducing the loss of trace DNA.
(2) Site selection:
According to the literature, six age-related DNA methylation markers, located in ELOVL2, C1orf132, TRIM59, KLF14, FHL2 and NPTX2, were selected to ensure that the analysis focused on age-related regions.
(3) Primer design:
Because the amount of DNA obtained from bloodstains is very small and its quality is low, the accuracy and sensitivity of the PCR are of great importance. PCR primers and sequencing primers were designed with PyroMark Assay Design software, version 2.0 (Qiagen, Germany). During primer design, the target sequence was adjusted to include as many cytosines (C) as possible so that more methylation sites could be detected. SNPs and other polymorphisms in the target region were avoided because they can bias the sequencing reaction. In addition, primers with high specificity (i.e., no primer dimer formation) were selected by keeping the GC content below 60% and excluding possible methylation sites from the primer binding sequence. Where necessary, published protocols were modified (e.g., by adding dimethyl sulfoxide (DMSO) to avoid dimer formation). In each PCR primer pair, the 5' end of one primer must be labeled with biotin so that the product can bind to streptavidin-coated magnetic beads for the subsequent separation and purification of single-stranded PCR product; the other primer is unlabeled. Because free biotin in biotin-labeled primer preparations competes with the template for binding to the streptavidin-coated magnetic beads and lowers the signal, HPLC-purified biotin-labeled primers were used. The amplicon length for the genes of interest ranged from 105 to 306 bp. The final primers are shown in the following table:
TABLE 1 PCR primers, pyrosequencing primers and CpG sequences for age-related methylation analysis
[Table 1 images BDA0003313375470000061 and BDA0003313375470000071]
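The GC-content rule used during primer design (GC content below 60%) is easy to check programmatically. The sketch below is illustrative only: the function names `gc_fraction` and `passes_gc_rule` are hypothetical, and the example sequence is SEQ ID NO: 1 from the sequence listing, used here simply as a sample input because the listing does not state which gene it targets.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in a sequence (spaces are ignored)."""
    seq = seq.replace(" ", "").upper()
    if not seq:
        raise ValueError("empty sequence")
    return sum(base in "GC" for base in seq) / len(seq)


def passes_gc_rule(seq, limit=0.60):
    """True if the GC fraction is below the limit stated in the text (60%)."""
    return gc_fraction(seq) < limit


# SEQ ID NO: 1 as printed in the sequence listing (29 nt).
seq_id_1 = "tagtaaatat ataagtgggg gaagaaggg"
print(round(gc_fraction(seq_id_1), 3), passes_gc_rule(seq_id_1))
```

Primers for bisulfite-converted templates tend to be AT-rich, so this check is usually passed easily; it is shown only to make the stated rule concrete.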
Example 2 Detection of DNA methylation by pyrosequencing
(1) Bisulfite conversion
Extracted DNA (40 μL) was bisulfite converted using the EpiTect Fast DNA Bisulfite Kit (Qiagen, Germany). The DNA sample was mixed with CT conversion reagent (bisulfite kit) to a final volume of 140 μL, incubated at 95 ℃ for 5 minutes and at 60 ℃ for 20 minutes, and then purified.
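To make the purpose of bisulfite conversion concrete: unmethylated cytosines are deaminated to uracil and are read as T after PCR, whereas methylated cytosines are protected and remain C. The following is a minimal in-silico sketch of that rule on a single strand; the function name `bisulfite_convert` and the toy sequence are hypothetical and are not taken from the patent.

```python
def bisulfite_convert(seq, methylated_positions=frozenset()):
    """Simulate bisulfite conversion followed by PCR on one strand.

    Unmethylated C is read as T; C at a 0-based index listed in
    `methylated_positions` is protected and stays C.
    """
    converted = []
    for index, base in enumerate(seq.upper()):
        if base == "C" and index not in methylated_positions:
            converted.append("T")
        else:
            converted.append(base)
    return "".join(converted)


# Toy sequence: the C at index 1 (part of a CpG) is methylated.
print(bisulfite_convert("ACGTCCG", {1}))
print(bisulfite_convert("ACGTCCG"))
```

The C/T difference at each CpG after conversion is what pyrosequencing later quantifies.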
(2)PCR
The reaction mixture (25 μL) contained 2 μL of bisulfite-converted DNA, 12.5 μL of PCR premix (Qiagen, Germany) and 0.1-0.5 mM primers. Primer concentrations were adjusted to obtain specific DNA products without dimers. The thermal cycling conditions were as follows: initial denaturation at 95 ℃ for 10 minutes; 45 cycles of 95 ℃ for 30 seconds, 56 ℃ for 30 seconds (NPTX 2 ℃, 30 seconds) and 72 ℃ for 30 seconds; and a final extension at 72 ℃ for 5 minutes. PCR products were checked by agarose gel electrophoresis.
(3) Pyrosequencing
The template prepared from the biotin-labeled PCR amplification product was sequenced using a PyroMark Q48 pyrosequencing instrument (Qiagen, Germany) and a Pyro-Gold kit (Qiagen, Germany). In earlier pyrosequencing protocols, a PCR product volume of 10 μL produced an unstable signal that could not be clearly distinguished from the background. Our method increases the volume of PCR product to 12 μL, which effectively avoids the unstable signal.
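The thermal cycling program above can be written down as data and its programmed hold time summed. This is a sketch under stated assumptions: the name `PROGRAM` and the tuple layout are hypothetical, the standard 56 ℃ annealing step is used, and ramping between temperatures is ignored, so the real run time is longer than the figure computed here.

```python
# (step name, hold time in seconds, number of repeats), from the text.
PROGRAM = [
    ("initial denaturation, 95 C", 10 * 60, 1),
    ("denaturation, 95 C", 30, 45),
    ("annealing, 56 C", 30, 45),
    ("extension, 72 C", 30, 45),
    ("final extension, 72 C", 5 * 60, 1),
]


def hold_time_seconds(program):
    """Total programmed hold time, excluding temperature ramping."""
    return sum(seconds * repeats for _, seconds, repeats in program)


total_seconds = hold_time_seconds(PROGRAM)
print(total_seconds, "s =", total_seconds / 60, "min")
```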
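For readers unfamiliar with how a pyrosequencing run yields a methylation level: at each CpG position the instrument software reports the proportion of C signal relative to the combined C and T signal. The sketch below shows that ratio; the function name `methylation_percent` and the peak values are hypothetical, and it assumes the sequenced strand reads C/T at the CpG (on the opposite strand the same ratio is taken over G/A).

```python
def methylation_percent(c_signal, t_signal):
    """Methylation level at one CpG as 100 * C / (C + T)."""
    total = c_signal + t_signal
    if total <= 0:
        raise ValueError("no signal at this position")
    return 100.0 * c_signal / total


# Illustrative peak heights only, not measured data.
print(methylation_percent(30, 70))
```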
Example 3 Construction of the bloodstain age prediction model
(1) Comparing age prediction accuracy of SVR and RFR models
Previous studies have shown that the SVR model is more accurate than methods such as multiple linear regression, multiple nonlinear regression and back-propagation neural networks. SVR and RFR models were therefore built on combinations of all 46 CpG sites to establish the best-fitting age prediction model and to calculate its prediction accuracy. The SVR model was constructed with the R package e1071, with the parameters cost = 2, gamma = 0.8 and epsilon = 0.1. The RFR model was constructed with the R package randomForest; the mtry parameter equals the number of CpG sites used in each model, the minimum node size is 5, and the number of trees is 1000.
To increase computation speed, the optimal combination of sites was determined by forward selection. The 241 bloodstain samples were prepared from whole blood of 241 healthy Chinese Han volunteers aged 10-79 years (128 males and 113 females). All donors provided informed consent, and the study was approved by the ethics committee of the Beijing Institute of Genomics, Chinese Academy of Sciences. 70% of the samples were randomly drawn to form the training data set, and the remaining 30% were used as the test data set to evaluate the accuracy of the RFR model. Training was repeated 100 times, and in each repeat the best site (i.e., the one giving the minimum MAD) was recorded. The site recorded most frequently was selected for the final model. In the two-site model, the site recorded most frequently and with the lowest MAD after the first optimal site was taken as the second optimal site.
For the age prediction models built with RFR, 4 sites were used for women (TRIM59.Pos8, KLF14.Pos3, C1orf132.Pos2 and FHL2.Pos6) and 3 sites for men (TRIM59.Pos7, KLF14.Pos2 and ELOVL2.Pos7), and the resulting MADs were < 3 years. Under the SVR model, the MAD plateaued at about 4.5 years even when 8 sites were used for both men and women, which indicates that RFR outperforms SVR in age prediction (FIG. 1).
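The model settings in this paragraph can be summarized as plain configuration data. The sketch below is written in Python for illustration only and does not call R: the helper name `rfr_settings` is hypothetical, and mapping "minimum node size" to `nodesize` and "number of trees" to `ntree` is an assumption about how the settings correspond to arguments of `randomForest()`; `cost`, `gamma` and `epsilon` are arguments of `svm()` in e1071.

```python
def rfr_settings(n_sites):
    """Random forest settings for a model built on `n_sites` CpG sites.

    mtry is tied to the number of sites, as stated in the text.
    """
    if n_sites < 1:
        raise ValueError("at least one CpG site is required")
    return {"mtry": n_sites, "nodesize": 5, "ntree": 1000}


# SVR settings quoted in the text for the e1071 model.
SVR_SETTINGS = {"cost": 2, "gamma": 0.8, "epsilon": 0.1}

for n in (3, 4):  # male and female panel sizes
    print(n, rfr_settings(n))
```

One consequence worth noting: when mtry equals the number of predictors, every split considers all sites, so the trees differ only through bootstrap resampling, which makes the forest equivalent to bagged regression trees.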
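The forward selection procedure with repeated 70/30 splits can be sketched as follows. This is a simplified illustration, not the authors' code: all function names are hypothetical, a one-variable least-squares fit on the summed methylation of the selected sites stands in for the random forest model so that the example needs only the standard library, and sites are ranked by how often they win across repeats (the additional lowest-MAD criterion mentioned in the text is not implemented).

```python
import random
from collections import Counter


def mad(predicted, actual):
    """Mean absolute deviation between predicted and actual ages."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def fit_line(xs, ys):
    """Least-squares fit of y = a + b * x (stand-in for the RFR model)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return mean_y - slope * mean_x, slope


def score_sites(train, test, sites):
    """Fit on `train` and return the MAD on `test` for a set of sites."""
    def feature(sample):
        return sum(sample["meth"][site] for site in sites)

    intercept, slope = fit_line([feature(s) for s in train],
                                [s["age"] for s in train])
    predicted = [intercept + slope * feature(s) for s in test]
    return mad(predicted, [s["age"] for s in test])


def forward_step(samples, candidates, chosen,
                 n_repeats=100, train_frac=0.7, seed=0):
    """Return the candidate that most often gives the lowest test MAD."""
    rng = random.Random(seed)
    wins = Counter()
    for _ in range(n_repeats):
        shuffled = samples[:]
        rng.shuffle(shuffled)
        cut = int(round(train_frac * len(shuffled)))
        train, test = shuffled[:cut], shuffled[cut:]
        best = min(candidates,
                   key=lambda c: score_sites(train, test, chosen + [c]))
        wins[best] += 1
    return wins.most_common(1)[0][0]


def forward_selection(samples, candidates, n_sites, **kwargs):
    """Greedily add one site at a time until `n_sites` are chosen."""
    chosen, remaining = [], list(candidates)
    while remaining and len(chosen) < n_sites:
        best = forward_step(samples, remaining, chosen, **kwargs)
        chosen.append(best)
        remaining.remove(best)
    return chosen
```

Each sample is assumed to be a dictionary such as `{"age": 35, "meth": {"TRIM59.Pos7": 42.1, ...}}`. Replacing `score_sites` with a function that trains a random forest would bring the sketch closer to the procedure described in the text.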
(2) Test data set validation prediction accuracy
The remaining 30% of the bloodstain samples (38 men, 33 women) were used as the test data set. The age prediction accuracy of the 7 sites selected for the final models (3 sites for men: TRIM59.Pos7, KLF14.Pos2 and ELOVL2.Pos7; 4 sites for women: TRIM59.Pos8, KLF14.Pos3, C1orf132.Pos2 and FHL2.Pos6) was verified with the RFR model, giving a MAD of 2.8 years (R = 0.99) for men and 2.93 years (R = 0.98) for women (FIG. 2).
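The two accuracy figures reported here, MAD and R, can be computed from paired predicted and actual ages as sketched below. The function names are hypothetical, R is assumed to be the Pearson correlation between predicted and actual age (the text does not name the coefficient), and the ages in the usage lines are illustrative only.

```python
import math


def mean_absolute_deviation(predicted, actual):
    """Mean of |predicted - actual| over all test samples, in years."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Illustrative ages only, not data from the study.
predicted = [12.0, 18.0, 33.0]
actual = [10.0, 20.0, 30.0]
print(mean_absolute_deviation(predicted, actual), pearson_r(predicted, actual))
```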
Example 4 sensitivity detection
Whole blood samples were collected from 241 healthy Chinese Han volunteers (128 males and 113 females) aged 10-79 years. All donors provided informed consent, and the study was approved by the ethics committee of the Beijing Institute of Genomics, Chinese Academy of Sciences.
20 μL of whole blood was spotted onto filter paper to prepare a bloodstain, which was then stored at room temperature for 1 year. To determine the detection sensitivity, DNA extracted from the bloodstains was serially diluted to 100, 50, 10, 5, 2.5, 1.0, 0.50, 0.25 and 0.10 ng. The DNA samples of different amounts were subjected to methylation analysis: bisulfite conversion followed by PCR amplification and pyrosequencing (see Example 2). The methylation percentages obtained with 0.1 ng of DNA were compared with those obtained with higher amounts to determine the sensitivity of the proposed methylation detection method for bloodstains.
We observed no significant difference in methylation percentage between 0.1 ng of DNA and higher amounts for the 4 CpG sites used for age prediction in women (TRIM59.Pos8, KLF14.Pos3, C1orf132.Pos2 and FHL2.Pos6) or for the 3 CpG sites used in men (TRIM59.Pos7, KLF14.Pos2 and ELOVL2.Pos7) (P ≥ 0.05, KS test; FIG. 3). For the ELOVL2.Pos7 site, however, 1.0 ng of DNA was required to achieve similar levels.
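The comparison above relies on a KS test between methylation percentages at different DNA inputs. As a rough illustration of what that test measures, the sketch below computes the two-sample Kolmogorov-Smirnov statistic D, the largest gap between the two empirical distribution functions. The function name `ks_statistic` and the numbers are hypothetical, and the p-value calculation that the reported P ≥ 0.05 depends on is deliberately left out.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: max |F_a(x) - F_b(x)| over observed x."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    largest_gap = 0.0
    for value in sorted(set(a) | set(b)):
        cdf_a = bisect_right(a, value) / len(a)
        cdf_b = bisect_right(b, value) / len(b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


# Illustrative methylation percentages only, not data from the study.
low_input = [41.0, 43.5, 42.2, 40.8]
high_input = [42.0, 42.9, 41.5, 43.1]
print(ks_statistic(low_input, high_input))
```

A small D indicates that the two sets of measurements are distributed similarly, which is the situation the text reports for 0.1 ng versus higher inputs.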
Example 5 correlation of DNA methylation level with age
Whole blood samples were collected from 241 healthy Chinese Han volunteers (128 males and 113 females) aged 10-79 years, bloodstain samples were prepared and subjected to methylation analysis, and the correlation between each site and age was analyzed. The final bloodstain age prediction model comprises 7 CpG sites spanning 3 genes, of which 3 are known sites and 4 are new CpG sites. The results showed that 5 of these CpG sites, from 3 genes (TRIM59, KLF14 and C1orf132), were age-related in the analysis of bloodstains from Chinese subjects (FIG. 4).
Example 6 age prediction accuracy versus published studies
A meta-analysis of age prediction studies from the past few years showed that almost all MADs were > 3 years (FIG. 5). Compared with previous studies, our model is the most efficient owing to the use of RFR (MAD < 3 years with only 3-4 CpG sites). In FIG. 5, the filled dots represent published results, and the different shapes represent the mathematical methods used in the different models. The cross and asterisk symbols represent the age prediction models we established for women and men, respectively.
Sequence listing
<110> Shanxi Medical University
<120> method for age prediction using pyrosequencing and random forest regression analysis
<160> 18
<170> SIPOSequenceListing 1.0
<210> 1
<211> 29
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 1
tagtaaatat ataagtgggg gaagaaggg 29
<210> 2
<211> 27
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 2
ttaataaaac caaattctaa aacattc 27
<210> 3
<211> 24
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 3
caccttacca ccaaaccaaa attt 24
<210> 4
<211> 21
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 4
aggggagtag ggtaagtgag g 21
<210> 5
<211> 30
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 5
caaaaccatt tccccctaat atatacttca 30
<210> 6
<211> 20
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 6
gggaggagat ttgtaggttt 20
<210> 7
<211> 21
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 7
gggttttggg agtatagtag t 21
<210> 8
<211> 27
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 8
acacctccta aaacttctcc aatctcc 27
<210> 9
<211> 21
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 9
gttttgggag tatagtagtt a 21
<210> 10
<211> 28
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 10
ggttttaggt taagttatgt ttaatagt 28
<210> 11
<211> 30
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 11
actaaaaaat ttccctctat taccattacc 30
<210> 12
<211> 24
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 12
atagttttag aaattatttt gttt 24
<210> 13
<211> 29
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 13
tagtaaatat ataagtgggg gaagaaggg 29
<210> 14
<211> 28
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 14
atttaataaa accaaattct aaaacatt 28
<210> 15
<211> 25
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 15
ggggttaagt tattaagttt tgaag 25
<210> 16
<211> 21
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 16
tataggtggt ttgggggaga g 21
<210> 17
<211> 27
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 17
aaaaaacact accctccaca acataac 27
<210> 18
<211> 15
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 18
ttgggggaga ggttg 15

Claims (4)

1. A method for age prediction, characterized in that the method comprises pyrosequencing and random forest regression analysis, wherein the random forest regression model is constructed with the R package randomForest and the optimal site combination is determined by forward selection; when the age prediction model is built by random forest regression, 7 age-related DNA methylation sites are selected, of which the 3 male sites are TRIM59.pos7, KLF14.pos2 and ELOVL2.pos7, and the 4 female sites are TRIM59.pos8, KLF14.pos3, C1orf132.pos2 and FHL2.pos6;
the specific information of the DNA methylation sites is as follows:
gene C1orf132: chromosome coordinate chr1:207823675, CpG_ID cg10501210;
gene ELOVL2: chromosome coordinate chr6:11044644, CpG_ID cg16867657;
gene FHL2: chromosome coordinate chr2:105399282, CpG_ID cg06639320;
gene KLF14: chromosome coordinate chr7:130734355, CpG_ID cg14361627;
gene NPTX2: chromosome coordinate chr7:98616518, CpG_ID cg00548268;
gene TRIM59: chromosome coordinate chr3:160450199;
the specific information of the DNA methylation sites uses hg38 as the reference genome;
the parameters for constructing the random forest regression model are set as follows: the mtry parameter equals the number of CpG sites used in each model, the minimum node size is 5, and the number of trees is 1000.
2. The method of claim 1, wherein the volume of PCR product used in pyrosequencing is 12 μL.
3. Use of a set of methylation site combinations for the prediction of age, wherein said methylation sites consist of male-associated sites and female-associated sites; the male-associated sites are TRIM59.pos7, KLF14.pos2 and ELOVL2.pos7; the female-associated sites are TRIM59.pos8, KLF14.pos3, C1orf132.pos2 and FHL2.pos6;
the specific information of the DNA methylation sites is as follows:
gene C1orf132: chromosome coordinate chr1:207823675, CpG_ID cg10501210;
gene ELOVL2: chromosome coordinate chr6:11044644, CpG_ID cg16867657;
gene FHL2: chromosome coordinate chr2:105399282, CpG_ID cg06639320;
gene KLF14: chromosome coordinate chr7:130734355, CpG_ID cg14361627;
gene NPTX2: chromosome coordinate chr7:98616518, CpG_ID cg00548268;
gene TRIM59: chromosome coordinate chr3:160450199;
the specific information of the DNA methylation sites uses hg38 as the reference genome;
the age prediction is realized by random forest regression analysis.
4. Use of a reagent for detecting a combination of methylation sites according to claim 3 for the preparation of a kit for predicting age.
CN202111223180.6A 2021-10-20 2021-10-20 Method for predicting age by pyrosequencing and random forest regression analysis Active CN114045333B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111223180.6A CN114045333B (en) 2021-10-20 2021-10-20 Method for predicting age by pyrosequencing and random forest regression analysis

Publications (2)

Publication Number Publication Date
CN114045333A (en) 2022-02-15
CN114045333B (en) 2022-10-11

Family

ID=80205735

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111223180.6A Active CN114045333B (en) 2021-10-20 2021-10-20 Method for predicting age by pyrosequencing and random forest regression analysis

Country Status (1)

Country Link
CN (1) CN114045333B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115992259B (en) * 2022-11-23 2023-10-10 四川大学 Primer group and kit based on 13Y chromosome methylation genetic markers

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012162139A1 (en) * 2011-05-20 2012-11-29 The Regents Of The University Of California Method to estimate age of individual based on epigenetic markers in biological sample
CN110257494B (en) * 2019-07-19 2020-08-11 华中科技大学 Method and system for obtaining individual ages of Chinese population and amplification detection system
CN111139292A (en) * 2019-12-03 2020-05-12 河南远止生物技术有限公司 Biological age inference method established based on pyrosequencing
CN113373236B (en) * 2021-02-19 2021-12-31 中国科学院北京基因组研究所(国家生物信息中心) Method for obtaining individual age of Chinese population

Also Published As

Publication number Publication date
CN114045333A (en) 2022-02-15

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant