CN114045333A - Method for predicting age by pyrosequencing and random forest regression analysis - Google Patents

Method for predicting age by pyrosequencing and random forest regression analysis

Info

Publication number
CN114045333A
Authority
CN
China
Prior art keywords
age
random forest
regression analysis
sites
forest regression
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111223180.6A
Other languages
Chinese (zh)
Other versions
CN114045333B (en)
Inventor
严江伟
杨丰隆
张更谦
张君
郝青青
张晓梦
漆小琴
杨婷婷
王雅雅
余代静
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanxi Medical University
Original Assignee
Shanxi Medical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanxi Medical University
Priority to CN202111223180.6A
Publication of CN114045333A
Application granted
Publication of CN114045333B
Legal status: Active
Anticipated expiration

Classifications

    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6876 Nucleic acid products used in the analysis of nucleic acids, e.g. primers or probes
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6844 Nucleic acid amplification reactions
    • C12Q1/6858 Allele-specific amplification
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30 Detection of binding sites or motifs
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00 ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G16B25/20 Polymerase chain reaction [PCR]; Primer or probe design; Probe optimisation
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q2600/00 Oligonucleotides characterized by their use
    • C12Q2600/154 Methylation markers

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Organic Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biophysics (AREA)
  • Biotechnology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Analytical Chemistry (AREA)
  • Wood Science & Technology (AREA)
  • Zoology (AREA)
  • Molecular Biology (AREA)
  • Genetics & Genomics (AREA)
  • Theoretical Computer Science (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Immunology (AREA)
  • General Engineering & Computer Science (AREA)
  • Chemical Kinetics & Catalysis (AREA)
  • Biochemistry (AREA)
  • Microbiology (AREA)
  • Epidemiology (AREA)
  • Bioethics (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention provides a method for age prediction comprising pyrosequencing and random forest regression analysis, wherein the random forest regression model is constructed with the R package randomForest and the optimal combination of sites is determined by forward selection. The method requires only 0.1 ng of template DNA and can therefore be applied to difficult forensic bloodstain samples; the whole workflow can be completed within 10 hours; to account for sex differences, two independent age prediction models are established, one for each sex; and a prediction accuracy of MAD < 3 years is reached using only 3-4 CpG sites.

Description

Method for predicting age by pyrosequencing and random forest regression analysis
Technical Field
The invention belongs to the field of forensic medicine, and particularly relates to a method for predicting age by pyrosequencing and random forest regression analysis.
Background
Assessing the physiological age of an unknown sample donor is one of the most important tools in forensic investigation. It narrows the range of suspects and supplements the prediction of externally visible characteristics and the inference of biogeographic ancestry. Established age estimation methods rely on morphological analysis of skeletal features. When solid tissues such as bones and teeth are available, age can be determined precisely by anthropological methods. In practice, however, such methods are difficult to apply, because other materials, such as body fluids, are more likely to be encountered during forensic investigations. More recently, several molecular methods have been proposed for age estimation, including telomere length analysis, age-dependent deletions of mitochondrial DNA, T-cell DNA rearrangement, and protein alterations such as aspartic acid racemization and advanced glycation end products. However, all of these methods have limitations that restrict their use at crime scenes, in particular low accuracy and stringent sample requirements. For example, the standard error of age prediction based on quantification of signal joint T-cell receptor excision circles (sjTRECs) is ± 8.0 years.
One possible alternative is to detect epigenetic modifications (e.g., DNA methylation), which are now known to change with age. To date, forensic age prediction studies have focused mainly on whole blood samples and have reported a mean absolute deviation (MAD) of 3-10 years, mostly using multiple linear regression models. A small number of studies have achieved relatively low prediction errors (3.24-4.7 years) using machine learning algorithms such as support vector machines (SVM), artificial neural networks (ANN) and random forest regression (RFR); however, these studies were performed only on fresh body fluids. Furthermore, age prediction based on bloodstains, which are more common in crime scene investigation, has not been systematically investigated.
Therefore, the invention aims to establish a sensitive, rapid and reliable age prediction method, based on pyrosequencing and a random forest regression model, that is suitable for various sample types, including bloodstains.
Disclosure of Invention
The invention screens the genome for a set of DNA methylation sites for age prediction in forensic casework samples, designs primers for each site, measures the methylation level of each site by pyrosequencing, and then builds separate age prediction models for males and females by random forest regression. The aim is to provide a sensitive, rapid and reliable age prediction method that retains high accuracy with few sites and can be applied to bloodstains and other sample types; the DNA extraction, primer design and sequencing protocols are optimized in the method.
Terms:
RFR: Random Forest Regression.
SVR: Support Vector Regression.
MAD: Mean Absolute Deviation, i.e. the mean absolute error between predicted and actual age.
In one aspect, the present invention provides a method for age prediction.
The method comprises pyrosequencing and random forest regression analysis, wherein the random forest regression model is constructed with the R package randomForest and the optimal combination of sites is determined by forward selection.
In constructing the random forest regression model, the parameters are set as follows: the mtry parameter equals the number of CpG sites modeled each time, the minimum node size is 5, and the number of trees is 1000.
The age-related DNA methylation markers used by the random forest regression model are located in the ELOVL2, C1orf132, TRIM59, KLF14, FHL2 and NPTX2 genes.
When the age prediction models are built by random forest regression, a total of 7 age-related DNA methylation sites are selected: 3 for males (TRIM59.pos7, KLF14.pos2 and ELOVL2.pos7) and 4 for females (TRIM59.pos8, KLF14.pos3, C1orf132.pos2 and FHL2.pos6).
The volume of PCR product used in pyrosequencing is 12 μL.
In some embodiments, the method comprises the steps of:
(1) DNA extraction;
(2) bisulfite conversion;
(3) PCR;
(4) pyrosequencing;
(5) model prediction.
In another aspect, the invention provides a gene combination for age prediction by random forest regression analysis.
The gene combination comprises ELOVL2, C1orf132, TRIM59, KLF14, FHL2 and NPTX2.
The methylation sites comprise the male-related sites TRIM59.pos7, KLF14.pos2 and ELOVL2.pos7 and the female-related sites TRIM59.pos8, KLF14.pos3, C1orf132.pos2 and FHL2.pos6.
In yet another aspect, the invention provides a set of primers for use in random forest regression analysis for age prediction.
The primer is used for pyrosequencing.
The primers and the sequencing sites thereof are as follows:
Figure BDA0003313375470000031
wherein F, R and S denote the forward primer, the reverse primer and the sequencing primer, respectively, and "biotin" in front of a sequence indicates that the primer carries a biotin label.
In a further aspect, the invention provides the use of the aforementioned methods and/or combinations of genes and/or methylation sites and/or primers in the manufacture of a kit for predicting age.
In yet another aspect, the present invention provides a kit for predicting age.
The kit comprises the following primers:
Figure BDA0003313375470000032
Figure BDA0003313375470000041
The kit also comprises other reagents for pyrosequencing.
The kit is used in combination with a random forest regression model.
The beneficial effects of the invention are as follows:
(1) Only 0.1 ng of template DNA is needed, so the method can be applied to difficult forensic bloodstain samples
Many techniques, such as EpiTYPER, SNaPshot, pyrosequencing and massively parallel sequencing (MPS), can provide accurate measurements of DNA methylation. One of the main reasons the use of EpiTYPER in forensics is limited is that it requires up to 1 μg of genomic DNA; such large amounts of DNA are difficult to obtain in real crime scene investigations, where body fluid stains are far more commonly encountered. Compared with EpiTYPER, the template DNA required by MPS can be reduced to 10 ng, and SNaPshot requires 4 ng. Previous studies have indicated that 10-20 ng of template DNA is required for successful methylation-based age prediction. In the present invention, by contrast, accurate age prediction can be performed with 0.1 ng of template DNA. The method therefore has the highest sensitivity among existing bloodstain detection methods and has good prospects for forensic application.
(2) The whole process can be completed within 10 hours
The method of the invention can be completed within one day, much faster than other available methods. DNA extraction and quantification, sodium bisulfite conversion, PCR and pyrosequencing take 2 h, 2.5 h, 3 h and 2 h, respectively. In contrast, the standard procedures for both EpiTYPER and MPS take more than 2 days. MPS in particular requires specialized equipment and a complex bioinformatics analysis pipeline and is difficult to complete within 3 days.
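The time budget can be checked with simple arithmetic. The sketch below only sums the four stage durations quoted in the text; the dictionary keys and variable names are illustrative and not part of the method.

```python
# Stage durations in hours, as stated in the description.
stage_hours = {
    "DNA extraction and quantification": 2.0,
    "bisulfite conversion": 2.5,
    "PCR": 3.0,
    "pyrosequencing": 2.0,
}

total_hours = sum(stage_hours.values())
within_limit = total_hours <= 10  # the stated upper bound for the whole workflow

print(f"total: {total_hours} h, within 10 h: {within_limit}")
```

The four stages add up to 9.5 h, which leaves about half an hour for the model prediction step within the stated 10 hours.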
(3) Two independent age prediction models are established to account for sex differences
Random forest regression (RFR) was chosen to build separate age prediction models using 3 male sites (TRIM59.pos7, KLF14.pos2 and ELOVL2.pos7) and 4 female sites (TRIM59.pos8, KLF14.pos3, C1orf132.pos2 and FHL2.pos6). With these 7 sites, the final models gave a predicted mean absolute deviation (MAD) of 2.8 years (R = 0.99) for males and 2.93 years (R = 0.98) for females.
(4) A prediction accuracy of MAD < 3 years is reached using only 3-4 CpG sites
A meta-analysis of age prediction studies from recent years showed that almost all previously established age prediction models had a MAD > 3 years. Owing to the use of RFR, our model is the most efficient: it achieves a MAD < 3 years with only 3-4 CpG sites (3 for male samples and 4 for female samples). An age prediction model with few sites and high accuracy is more practical for forensic inference.
Drawings
FIG. 1 shows that Random Forest Regression (RFR) outperforms Support Vector Regression (SVR) in age prediction.
FIG. 2 is a graph of predicted age versus actual age for a Random Forest Regression (RFR) test data set.
FIG. 3 shows the sensitivity of detection of 7 methylation markers in trace amounts of DNA.
FIG. 4 is an analysis of the correlation of methylation levels of 7 CpG sites with age.
FIG. 5 is a comparison of the accuracy of the present age prediction method with published studies.
Detailed Description
The present invention is further illustrated in detail by the following specific examples, which are illustrative only and are not intended to limit the invention. Unless otherwise specified, the experimental methods used in the following examples are conventional methods, and the materials and reagents used are commercially available.
Example 1 DNA extraction and site selection
(1) DNA extraction:
The DNA extraction protocol was optimized to reduce the loss of trace DNA from bloodstains.
The accuracy of methylation analysis depends on extracting high-quality DNA from bloodstains. The QIAamp DNA Investigator Kit has been identified as a reliable method for extracting DNA from forensic samples and yields high-quality DNA within 2 hours. We further optimized the kit protocol by using a shorter incubation at a higher temperature, adding carrier RNA to the lysate, and heating the reagent used to dissolve the DNA.
In the original protocol the sample is incubated at 56 ℃ for 1 hour; in the improved protocol the sample is first incubated at 85 ℃ for 10 minutes and then incubated at 56 ℃ for 1 hour, which increases the yield of DNA from bloodstains.
Heating the reagent used to dissolve the DNA may also accelerate the release of cells from the bloodstain and improve DNA dissolution, thereby reducing the loss of trace DNA.
(2) Site selection:
According to the literature, six age-related DNA methylation markers were selected, located in ELOVL2, C1orf132, TRIM59, KLF14, FHL2 and NPTX2, respectively, to ensure a focus on age-related regions.
(3) Primer design:
Because the DNA obtained from bloodstains is scarce and of low quality, the accuracy and sensitivity of the PCR are critical. PCR primers and sequencing primers were designed using PyroMark Assay Design version 2.0 (Qiagen, Germany). When designing the primers, the target sequence was adjusted so that the sequenced region contained as many cytosines (C) as possible, allowing more methylation sites to be detected. SNPs and other polymorphisms in the target region were avoided because they could bias the sequencing reaction. In addition, primers with high specificity (i.e., no primer-dimer formation) were selected by keeping the GC content below 60% and excluding possible methylation sites from the primer binding sequences. Where necessary, the published protocols were modified (e.g., by adding dimethyl sulfoxide (DMSO) to avoid dimer formation). Of each PCR primer pair, one primer must be biotin-labeled at its 5' end so that it can bind to streptavidin-coated magnetic beads for subsequent separation and purification of the single-stranded PCR product; the other primer is unlabeled. Because biotin-labeled primer preparations can contain free biotin, which competes with the template for binding to the streptavidin-coated beads and lowers the signal, HPLC-purified biotin-labeled primers were used. The amplicon lengths of the target genes ranged from 105 to 306 bp. The final primers are shown in the following table:
TABLE 1 PCR primers, pyrosequencing primers and CpG sequences for age-related methylation analysis
Figure BDA0003313375470000061
Figure BDA0003313375470000071
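Two of the primer design rules described in Example 1 (GC content below 60%, no possible methylation site in the primer binding sequence) can be sketched as a simple check. This is a minimal illustration, not the logic of the design software used: the function names are hypothetical, and a "possible methylation site" is approximated here as any CG dinucleotide in the primer. The test sequence is SEQ ID NO: 1 from the sequence listing.

```python
def gc_percent(primer: str) -> float:
    """Percentage of G and C bases in a primer sequence."""
    seq = primer.replace(" ", "").upper()
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)


def contains_cpg(primer: str) -> bool:
    """True if the primer contains a CG dinucleotide (assumed proxy for a methylation site)."""
    return "CG" in primer.replace(" ", "").upper()


def passes_design_rules(primer: str, max_gc: float = 60.0) -> bool:
    """Apply both rules: GC content below max_gc and no CG dinucleotide."""
    return gc_percent(primer) < max_gc and not contains_cpg(primer)


# SEQ ID NO: 1 from the sequence listing (spaces are layout only).
seq_id_1 = "tagtaaatat ataagtgggg gaagaaggg"
print(round(gc_percent(seq_id_1), 1), passes_design_rules(seq_id_1))
```

Dimer formation and specificity are not modeled here; those still need dedicated design software or experimental checks.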
Example 2 Detection of DNA methylation by pyrosequencing
(1) Bisulfite conversion
Extracted DNA (40 μL) was bisulfite-converted using the EpiTect Fast DNA Bisulfite Kit (Qiagen, Germany). The DNA sample was mixed with the CT conversion reagent of the bisulfite kit to a final volume of 140 μL, incubated at 95 ℃ for 5 minutes and at 60 ℃ for 20 minutes, and then purified.
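The effect of bisulfite conversion on the sequence that is later read can be sketched in a few lines. This is a simplified model under stated assumptions: conversion is taken to be complete, only one strand is considered, and the function name and example template are hypothetical. Unmethylated cytosines are read as T after conversion and PCR, while methylated cytosines remain C.

```python
def bisulfite_convert(seq: str, methylated: set) -> str:
    """Return the sequence as read after bisulfite conversion and PCR.

    Unmethylated C becomes T; a C whose 0-based index is in `methylated` stays C.
    """
    out = []
    for i, base in enumerate(seq.upper()):
        if base == "C" and i not in methylated:
            out.append("T")  # unmethylated cytosine is converted
        else:
            out.append(base)  # methylated cytosine and other bases are unchanged
    return "".join(out)


template = "ACGTCCG"  # hypothetical template; the C at index 1 is methylated
print(bisulfite_convert(template, methylated={1}))  # prints ACGTTTG
```

Because the C/T difference at each CpG now encodes the methylation state, a sequencing method that quantifies C versus T at that position can report the methylation level.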
(2) PCR
The reaction mixture (25 μL) contained 2 μL of bisulfite-converted DNA, 12.5 μL of PCR premix (Qiagen, Germany) and 0.1-0.5 mM primers. Primer concentrations were adjusted to obtain specific DNA products without dimers. The thermal cycling conditions were as follows: initial denaturation at 95 ℃ for 10 minutes; 45 cycles of 95 ℃ for 30 seconds, 56 ℃ for 30 seconds (58 ℃ for 30 seconds for NPTX2) and 72 ℃ for 30 seconds; and a final extension at 72 ℃ for 5 minutes. The PCR products were checked by agarose gel electrophoresis.
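The thermal cycling conditions can be written down as a small data structure, which makes the gene-specific annealing temperature explicit and allows the programmed hold time to be computed. This is a sketch only: the function names and tuple layout are assumptions, and ramping time between steps is ignored, so the real run takes longer than the figure computed here.

```python
def pcr_program(gene: str) -> list:
    """Return the thermal program as (label, temperature in C, seconds, repeats)."""
    anneal_c = 58 if gene == "NPTX2" else 56  # NPTX2 anneals at 58 C, all others at 56 C
    return [
        ("initial denaturation", 95, 10 * 60, 1),
        ("denaturation", 95, 30, 45),
        ("annealing", anneal_c, 30, 45),
        ("extension", 72, 30, 45),
        ("final extension", 72, 5 * 60, 1),
    ]


def hold_minutes(program: list) -> float:
    """Total programmed hold time in minutes (ramping between steps is ignored)."""
    return sum(seconds * repeats for _, _, seconds, repeats in program) / 60


print(hold_minutes(pcr_program("ELOVL2")))  # prints 82.5
```

The programmed holds come to 82.5 minutes, so the 3 h allotted to PCR in the description also has to cover reaction setup, ramping and the gel check.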
(3) Pyrosequencing
Templates prepared from the biotin-labeled PCR products were sequenced on a PyroMark Q48 pyrosequencer (Qiagen, Germany) with the Pyro-Gold kit (Qiagen, Germany). In earlier pyrosequencing runs, 10 μL of PCR product produced an unstable signal that could not be clearly distinguished from background. Our method increases the volume of PCR product to 12 μL, which effectively avoids unstable signals.
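The text does not spell out how a methylation level is derived from the pyrosequencing signal. A common convention, assumed here rather than taken from the source, is to report the C signal as a percentage of the combined C and T signal at each CpG position. The function name and the peak values below are hypothetical.

```python
def methylation_percent(c_signal: float, t_signal: float) -> float:
    """Percent methylation at one CpG from the C and T peak signals."""
    total = c_signal + t_signal
    if total <= 0:
        raise ValueError("no signal at this position")
    return 100.0 * c_signal / total


# Hypothetical (C, T) peak heights for three CpG positions in one read.
peaks = {"pos1": (12.0, 48.0), "pos2": (30.0, 30.0), "pos3": (45.0, 15.0)}
levels = {pos: methylation_percent(c, t) for pos, (c, t) in peaks.items()}
print(levels)  # prints {'pos1': 20.0, 'pos2': 50.0, 'pos3': 75.0}
```

A weak or unstable signal makes both peaks noisy, which is one way to read the motivation for raising the PCR product volume from 10 μL to 12 μL.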
Example 3 Construction of the bloodstain age prediction model
(1) Comparing the age prediction accuracy of the SVR and RFR models
Previous studies have shown that the SVR model is more accurate than multiple linear regression, multiple nonlinear regression and back-propagation neural networks. Therefore, SVR and RFR models were each fitted to combinations of sites drawn from all 46 CpG sites to establish the best-fit age prediction model, and their prediction accuracy was calculated. The SVR model was built with the R package e1071, with parameters cost = 2, gamma = 0.8 and epsilon = 0.1. The RFR model was built with the R package randomForest; the mtry parameter was set equal to the number of CpG sites modeled each time, the minimum node size was 5, and the number of trees was 1000.
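When mtry equals the number of sites, every split may consider every site, so the forest reduces to bagged regression trees. The sketch below illustrates that setting in plain Python. It is not the randomForest package and not the authors' code: the function names are hypothetical, the data are synthetic, and treating a node with at most `min_node_size` samples as a leaf is one reading of the minimum node size setting.

```python
import random
from statistics import mean


def build_tree(rows, ages, min_node_size=5):
    """Grow one regression tree; each split searches all features (mtry = number of sites)."""
    if len(rows) <= min_node_size or len(set(ages)) == 1:
        return mean(ages)  # leaf: predicted age
    best = None  # (sum of squared errors, feature index, threshold)
    for j in range(len(rows[0])):
        for threshold in sorted({r[j] for r in rows})[:-1]:
            left = [a for r, a in zip(rows, ages) if r[j] <= threshold]
            right = [a for r, a in zip(rows, ages) if r[j] > threshold]
            ml, mr = mean(left), mean(right)
            sse = sum((a - ml) ** 2 for a in left) + sum((a - mr) ** 2 for a in right)
            if best is None or sse < best[0]:
                best = (sse, j, threshold)
    if best is None:  # all feature values identical, so no split is possible
        return mean(ages)
    _, j, threshold = best
    go_left = [r[j] <= threshold for r in rows]
    left_node = build_tree([r for r, g in zip(rows, go_left) if g],
                           [a for a, g in zip(ages, go_left) if g], min_node_size)
    right_node = build_tree([r for r, g in zip(rows, go_left) if not g],
                            [a for a, g in zip(ages, go_left) if not g], min_node_size)
    return (j, threshold, left_node, right_node)


def predict_tree(node, row):
    """Walk from the root to a leaf and return the leaf's mean age."""
    while isinstance(node, tuple):
        j, threshold, left_node, right_node = node
        node = left_node if row[j] <= threshold else right_node
    return node


def fit_forest(rows, ages, n_trees=1000, min_node_size=5, seed=0):
    """Fit each tree on a bootstrap sample (drawn with replacement) of the training data."""
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(build_tree([rows[i] for i in idx], [ages[i] for i in idx], min_node_size))
    return forest


def predict_forest(forest, row):
    """Average the predictions of all trees."""
    return mean(predict_tree(tree, row) for tree in forest)


# Synthetic data: one site whose methylation rises linearly with age (illustrative only).
rows = [[age / 100] for age in range(10, 80)]
ages = list(range(10, 80))
forest = fit_forest(rows, ages, n_trees=25)  # 25 trees to keep the demo fast
print(round(predict_forest(forest, [0.45]), 1))
```

The defaults mirror the stated settings (1000 trees, minimum node size 5); the demo uses fewer trees only to keep the run short.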
To speed up computation, the optimal combination of sites was determined by forward selection. From the 241 bloodstain samples (prepared from whole blood of 241 healthy Han Chinese volunteers aged 10-79 years, 128 males and 113 females; all donors provided informed consent, and ethical approval for this study was granted by the Beijing Institute of Genomics, Chinese Academy of Sciences), 70% were randomly drawn to form the training data set, and the remaining 30% were used as the test data set to evaluate the accuracy of the RFR model. Training was repeated 100 times, and each time the best site (i.e., the one giving the minimum MAD) was recorded. The site recorded most frequently was selected for the final model. For the two-site model, with the best site fixed, the site recorded most frequently with the lowest MAD was taken as the second-best site.
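The selection procedure described here (repeated random 70/30 splits, recording the best site each time, keeping the most frequently recorded site, then repeating with that site fixed) can be sketched as follows. This is an illustration under loud assumptions: a nearest-neighbor predictor stands in for the random forest so that the example needs only the standard library, and the site names and data are synthetic.

```python
import random
from collections import Counter


def mean_abs_dev(predicted, actual):
    """Mean absolute deviation (MAD) between predicted and actual ages."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def nearest_neighbor_predict(train, test, sites):
    """Stand-in model (NOT random forest): return the age of the closest training sample."""
    def dist(a, b):
        return sum((a[s] - b[s]) ** 2 for s in sites)
    return [min(train, key=lambda t: dist(t, x))["age"] for x in test]


def pick_next_site(samples, candidates, fixed, predict, repeats=100, train_frac=0.7, rng=None):
    """Over repeated random splits, tally which candidate gives the lowest MAD when added to `fixed`."""
    rng = rng or random.Random(0)
    votes = Counter()
    for _ in range(repeats):
        shuffled = list(samples)
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_frac)
        train, test = shuffled[:cut], shuffled[cut:]
        actual = [s["age"] for s in test]

        def mad_with(site):
            return mean_abs_dev(predict(train, test, fixed + [site]), actual)

        votes[min(candidates, key=mad_with)] += 1  # record the best site for this split
    return votes.most_common(1)[0][0]  # most frequently recorded site


def forward_selection(samples, sites, n_sites, predict, repeats=100, seed=0):
    """Add one site at a time, keeping previously chosen sites fixed."""
    rng = random.Random(seed)
    chosen = []
    while len(chosen) < n_sites:
        candidates = [s for s in sites if s not in chosen]
        chosen.append(pick_next_site(samples, candidates, chosen, predict, repeats, rng=rng))
    return chosen


# Synthetic samples: "siteA" tracks age, "siteB" is noise (hypothetical names and data).
noise = random.Random(1)
samples = [{"age": a, "siteA": a / 100, "siteB": noise.random()} for a in range(10, 70)]
selected = forward_selection(samples, ["siteA", "siteB"], n_sites=1,
                             predict=nearest_neighbor_predict, repeats=10)
print(selected)
```

Forward selection evaluates far fewer combinations than an exhaustive search over all subsets of 46 sites, which is the speed-up the text refers to.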
The RFR age prediction models used 4 sites for females (TRIM59.pos8, KLF14.pos3, C1orf132.pos2 and FHL2.pos6) and 3 sites for males (TRIM59.pos7, KLF14.pos2 and ELOVL2.pos7), and both achieved a MAD < 3 years. With the SVR model, the MAD remained at about 4.5 years even when 8 sites were used for both males and females, indicating that RFR outperforms SVR in age prediction (FIG. 1).
(2) Validation of prediction accuracy on the test data set
The remaining 30% of the bloodstain samples (38 males and 33 females) were used as the test data set to verify the age prediction accuracy of the 7 sites selected for the final RFR models (3 male sites: TRIM59.pos7, KLF14.pos2 and ELOVL2.pos7; 4 female sites: TRIM59.pos8, KLF14.pos3, C1orf132.pos2 and FHL2.pos6). The predicted MADs for males and females were 2.8 years (R = 0.99) and 2.93 years (R = 0.98), respectively (FIG. 2).
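The two accuracy figures reported for the test data set, MAD and the correlation coefficient R, can be computed as shown below. This is a minimal sketch: R is assumed to be the Pearson correlation between predicted and actual age, and the ages listed are hypothetical, not data from the study.

```python
from math import sqrt


def mean_abs_dev(predicted, actual):
    """Mean absolute deviation between predicted and actual ages, in years."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))


# Hypothetical actual and predicted ages for five test samples.
actual = [18, 25, 40, 52, 67]
predicted = [20, 24, 43, 50, 66]
print(mean_abs_dev(predicted, actual), round(pearson_r(predicted, actual), 3))
```

MAD is in years and measures the typical size of the error, while R only measures how linearly the predictions track the true ages, so both are worth reporting.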
Example 4 Sensitivity testing
Whole blood samples were collected from 241 healthy Han Chinese volunteers (128 males and 113 females) aged 10-79 years. All donors provided informed consent, and ethical approval for this study was granted by the Beijing Institute of Genomics, Chinese Academy of Sciences.
To prepare bloodstains, 20 μL of whole blood was spotted onto filter paper and stored at room temperature for 1 year. To determine the detection sensitivity, DNA extracted from the bloodstains was serially diluted to 100, 50, 10, 5, 2.5, 1.0, 0.50, 0.25 and 0.10 ng. The diluted samples were subjected to methylation analysis: bisulfite conversion, followed by PCR amplification and pyrosequencing (see Examples 1 and 2). The methylation percentages obtained with 0.1 ng of DNA were compared with those obtained with higher DNA amounts to determine the sensitivity of the proposed methylation detection method for bloodstains.
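The dilution series is not evenly spaced, which is easy to see by computing the fold change between neighboring amounts. The short sketch below uses only the nine amounts listed in the text; the variable names are illustrative.

```python
# DNA input amounts in ng, as listed in the description.
amounts_ng = [100, 50, 10, 5, 2.5, 1.0, 0.50, 0.25, 0.10]

# Fold reduction from each amount to the next one in the series.
step_folds = [a / b for a, b in zip(amounts_ng, amounts_ng[1:])]

# Overall range covered by the series.
overall_fold = amounts_ng[0] / amounts_ng[-1]

print([round(f, 2) for f in step_folds], round(overall_fold))
```

The steps alternate between 2-fold, 2.5-fold and 5-fold reductions and together span a 1000-fold range of input DNA.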
For the 4 female CpG sites (TRIM59.pos8, KLF14.pos3, C1orf132.pos2 and FHL2.pos6) and the 3 male CpG sites (TRIM59.pos7, KLF14.pos2 and ELOVL2.pos7) used for age prediction, we observed no significant difference in methylation percentage between 0.1 ng of DNA and higher amounts (P ≥ 0.05, KS test; FIG. 3). For the ELOVL2.pos7 site, however, 1.0 ng of DNA was required to reach similar levels.
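The KS test compares two samples through the largest gap between their empirical cumulative distribution functions. The sketch below computes only that statistic, assuming the two-sample form of the test; it does not compute a P value, and the methylation values are hypothetical.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for value in sorted(set(a) | set(b)):
        cdf_a = bisect_right(a, value) / len(a)  # fraction of sample a at or below value
        cdf_b = bisect_right(b, value) / len(b)  # fraction of sample b at or below value
        d = max(d, abs(cdf_a - cdf_b))
    return d


# Hypothetical methylation percentages at one site for two DNA input amounts.
low_input = [41.0, 43.5, 39.8, 42.2, 40.9]
high_input = [40.5, 42.9, 41.7, 43.0, 39.9]
print(ks_statistic(low_input, high_input))
```

A statistic near 0 means the two distributions overlap closely, which corresponds to the "no significant difference" outcome; converting the statistic to a P value needs the sample sizes and is left to a statistics package.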
Example 5 Correlation of DNA methylation level with age
Whole blood samples were collected from 241 healthy Han Chinese volunteers (128 males and 113 females) aged 10-79 years; bloodstain samples were prepared and subjected to methylation analysis, and the correlation between each site and age was analyzed. The final bloodstain age prediction model comprises 7 CpG sites spanning 3 genes, 3 known sites and 4 new CpG sites. The results showed that 5 of these CpG sites, from 3 genes (TRIM59, KLF14 and C1orf132), were age-related in the bloodstain analysis of Chinese subjects (FIG. 4).
Example 6 Comparison of age prediction accuracy with published studies
A meta-analysis of age prediction studies from recent years showed that almost all reported MADs were > 3 years (FIG. 5). Compared with previous studies, our model is the most efficient owing to the use of RFR (MAD < 3 years with only 3-4 CpG sites). In FIG. 5, the filled dots represent published results, and the mathematical methods used in the different models are indicated by different shapes. The cross-shaped and asterisk-shaped (米) symbols represent the age prediction models we established for females and males, respectively.
Sequence listing
<110> Shanxi Medical University
<120> method for age prediction using pyrosequencing and random forest regression analysis
<160> 18
<170> SIPOSequenceListing 1.0
<210> 1
<211> 29
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 1
tagtaaatat ataagtgggg gaagaaggg 29
<210> 2
<211> 27
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 2
ttaataaaac caaattctaa aacattc 27
<210> 3
<211> 24
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 3
caccttacca ccaaaccaaa attt 24
<210> 4
<211> 21
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 4
aggggagtag ggtaagtgag g 21
<210> 5
<211> 30
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 5
caaaaccatt tccccctaat atatacttca 30
<210> 6
<211> 20
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 6
gggaggagat ttgtaggttt 20
<210> 7
<211> 21
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 7
gggttttggg agtatagtag t 21
<210> 8
<211> 27
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 8
acacctccta aaacttctcc aatctcc 27
<210> 9
<211> 21
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 9
gttttgggag tatagtagtt a 21
<210> 10
<211> 28
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 10
ggttttaggt taagttatgt ttaatagt 28
<210> 11
<211> 30
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 11
actaaaaaat ttccctctat taccattacc 30
<210> 12
<211> 24
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 12
atagttttag aaattatttt gttt 24
<210> 13
<211> 29
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 13
tagtaaatat ataagtgggg gaagaaggg 29
<210> 14
<211> 28
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 14
atttaataaa accaaattct aaaacatt 28
<210> 15
<211> 25
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 15
ggggttaagt tattaagttt tgaag 25
<210> 16
<211> 21
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 16
tataggtggt ttgggggaga g 21
<210> 17
<211> 27
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 17
aaaaaacact accctccaca acataac 27
<210> 18
<211> 15
<212> DNA
<213> Artificial Sequence (Artificial Sequence)
<400> 18
ttgggggaga ggttg 15

Claims (10)

1. A method for predicting age, characterized by comprising pyrosequencing and random forest regression analysis, wherein the random forest regression model is constructed with the R package randomForest and the optimal combination of sites is determined by forward selection.
2. The method of claim 1, wherein the parameters for constructing the random forest regression model are set as follows: the mtry parameter equals the number of CpG sites modeled each time, the minimum node size is 5, and the number of trees is 1000.
3. The method of claim 2, wherein the age-related DNA methylation markers used by the random forest regression model are located in the ELOVL2, C1orf132, TRIM59, KLF14, FHL2 and NPTX2 genes.
4. The method of claim 3, wherein a total of 7 age-related DNA methylation sites are selected when the age prediction models are built by random forest regression: 3 for males (TRIM59.pos7, KLF14.pos2 and ELOVL2.pos7) and 4 for females (TRIM59.pos8, KLF14.pos3, C1orf132.pos2 and FHL2.pos6).
5. The method of claim 1, wherein the volume of PCR product used in pyrosequencing is 12 μL.
6. A gene combination for age prediction by random forest regression analysis, characterized by comprising ELOVL2, C1orf132, TRIM59, KLF14, FHL2 and NPTX2.
7. A set of methylation sites for age prediction by random forest regression analysis, wherein the methylation sites comprise the male-related sites TRIM59.pos7, KLF14.pos2 and ELOVL2.pos7 and the female-related sites TRIM59.pos8, KLF14.pos3, C1orf132.pos2 and FHL2.pos6.
8. A set of primers for age prediction by random forest regression analysis, characterized in that the primers are used for pyrosequencing, and the primers and their sequencing sites are as follows:
Figure FDA0003313375460000011
Figure FDA0003313375460000021
wherein F, R and S denote the forward primer, the reverse primer and the sequencing primer, respectively, and "biotin" in front of a sequence indicates that the primer carries a biotin label.
9. Use of the method of any one of claims 1 to 5 and/or the gene combination of claim 6 and/or the methylation site of claim 7 and/or the primer of claim 8 for the preparation of a kit for predicting age.
10. A kit for predicting age, comprising the primer of claim 8.
CN202111223180.6A 2021-10-20 2021-10-20 Method for predicting age by pyrosequencing and random forest regression analysis Active CN114045333B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111223180.6A CN114045333B (en) 2021-10-20 2021-10-20 Method for predicting age by pyrosequencing and random forest regression analysis

Publications (2)

Publication Number Publication Date
CN114045333A (en) 2022-02-15
CN114045333B (en) 2022-10-11

Family

ID=80205735

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111223180.6A Active CN114045333B (en) 2021-10-20 2021-10-20 Method for predicting age by pyrosequencing and random forest regression analysis

Country Status (1)

Country Link
CN (1) CN114045333B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140228231A1 (en) * 2011-05-20 2014-08-14 The Regents Of The University Of California Method to estimate age of individual based on epigenetic markers in biological sample
CN110257494A (en) * 2019-07-19 2019-09-20 Huazhong University of Science and Technology A kind of method, system and augmentation detection system obtaining Chinese population individual age
CN111139292A (en) * 2019-12-03 2020-05-12 Henan Yuanzhi Biotechnology Co., Ltd. Biological age inference method established based on pyrosequencing
CN113373236A (en) * 2021-02-19 2021-09-10 Beijing Institute of Genomics, Chinese Academy of Sciences (China National Center for Bioinformation) Method for obtaining individual age of Chinese population

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
FAN H et al.: "Chronological Age Prediction: Developmental Evaluation of DNA Methylation-Based Machine Learning Models", 《FRONTIERS IN BIOENGINEERING AND BIOTECHNOLOGY》 *
JUNG S E et al.: "DNA methylation of the ELOVL2, FHL2, KLF14, C1orf132/MIR29B2C, and TRIM59 genes for age prediction from blood, saliva, and buccal swab samples", 《FORENSIC SCIENCE INTERNATIONAL: GENETICS》 *
NAUE J et al.: "Chronological age prediction based on DNA methylation: massive parallel sequencing and random forest regression", 《FORENSIC SCIENCE INTERNATIONAL: GENETICS》 *
MENG Hang et al.: "Research progress on age inference based on DNA methylation", 《Journal of Forensic Medicine》 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115992259A (en) * 2022-11-23 2023-04-21 Sichuan University Primer group and kit based on 13Y chromosome methylation genetic markers
CN115992259B (en) * 2022-11-23 2023-10-10 Sichuan University Primer group and kit based on 13Y chromosome methylation genetic markers

Also Published As

Publication number Publication date
CN114045333B (en) 2022-10-11

Similar Documents

Publication Publication Date Title
US11342047B2 (en) Using cell-free DNA fragment size to detect tumor-associated variant
EP3283657B1 (en) Quality assessment of circulating cell-free dna using multiplexed droplet digital pcr
US20180068058A1 (en) Methods and compositions for sample identification
Pośpiech et al. Towards broadening Forensic DNA Phenotyping beyond pigmentation: Improving the prediction of head hair shape from DNA
Coudry et al. Successful application of microarray technology to microdissected formalin-fixed, paraffin-embedded tissue
CN110257494B (en) Method and system for obtaining individual ages of Chinese population and amplification detection system
CN108350500A (en) Nucleic acid for detecting chromosome abnormality and method
CN109593862B (en) Method and system for obtaining age of male individuals of Chinese population
CN107881249B (en) Application of lncRNA and target gene thereof in breeding high-quality livestock and poultry variety
CN109943643B (en) Method for obtaining individual age of Chinese population
TW201936921A (en) A primer for next generation sequencer and a method for producing the same, a DNA library obtained through the use of a primer for next generation sequencer and a method for producing the same, and a DNA analyzing method using a DNA library
Hitzemann et al. Introduction to sequencing the brain transcriptome
CN114045333B (en) Method for predicting age by pyrosequencing and random forest regression analysis
CN108085399B (en) Novel application of lncRNA and trans-regulatory gene WNT11 thereof
Naue Getting the chronological age out of DNA: using insights of age-dependent DNA methylation for forensic DNA applications
CN110295234B (en) Real-time fluorescence PCR kit for acquiring age and predicting occurrence of tumor diseases
Refn et al. Prediction of chronological age and its applications in forensic casework: methods, current practices, and future perspectives
Viteri et al. Identifying polymorphic microsatellite loci for Andean bear research
KR102368835B1 (en) Primer sets for determining genotype of HSPB1 gene related to meat of Korean native cattle and uses thereof
CN108875314B (en) Target gene detection method based on epigenetics modification difference
CN108103064B (en) Long-chain non-coding RNA and application thereof
KR101902481B1 (en) SNP molecular biomarker composition for discrimination of horse temperament in AR gene
CN110643713A (en) STR locus set for pandas and application thereof
CN113278697B (en) Lung cancer diagnostic kit based on peripheral blood internal gene methylation
CN108950005A (en) A kind of the Forensic detection system and its application of 30 SNP sites of autosome first ancestor

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant