CN114045333B

CN114045333B - Method for predicting age by pyrosequencing and random forest regression analysis

Info

Publication number: CN114045333B
Application number: CN202111223180.6A
Authority: CN
Inventors: 严江伟; 杨丰隆; 张更谦; 张君; 郝青青; 张晓梦; 漆小琴; 杨婷婷; 王雅雅; 余代静
Original assignee: Shanxi Medical University
Current assignee: Shanxi Medical University
Priority date: 2021-10-20
Filing date: 2021-10-20
Publication date: 2022-10-11
Anticipated expiration: 2041-10-20
Also published as: CN114045333A

Abstract

The present invention provides a method for age prediction. The method includes pyrosequencing and random forest regression analysis. The random forest regression model is constructed using the R package randomForest, and the optimal site combination is determined by forward selection. The method needs only 0.1 ng of template DNA, so it can be used for difficult forensic bloodstain samples; the whole process can be completed within 10 hours; to account for sex differences, two independent age prediction models are established, one per sex; and with only 3-4 CpG sites, the age prediction accuracy reaches MAD < 3 years.

Description

Methods for Age Prediction Using Pyrosequencing and Random Forest Regression Analysis

Technical field

The invention belongs to the field of forensic medicine and relates in particular to a method for age prediction using pyrosequencing and random forest regression analysis.

Background

Assessing the biological age of the donor of an unknown sample is one of the most important tools in forensic investigation. It narrows the pool of suspects and complements the prediction of externally visible characteristics and the inference of biogeographic ancestry. Established age estimation methods rely on morphological analysis of skeletal features. When solid tissues such as bones and teeth are available, age can be determined accurately by anthropological methods. In practice, however, such methods are hard to apply, because other materials, such as body fluids, are encountered far more often in forensic investigations. More recently, several molecular methods have been proposed for estimating age, including telomere length analysis, age-dependent deletions of mitochondrial DNA or T-cell DNA rearrangements, and protein alterations such as aspartic acid racemization and advanced glycation end products. All of these methods have limitations that restrict their use at crime scenes, in particular low accuracy and strict sample requirements. For example, age prediction based on quantification of signal-joint T-cell receptor rearrangement excision circles (sjTRECs) has a standard error of ±8.0 years.

A possible alternative to these methods is the detection of epigenetic modifications, such as DNA methylation, which are now known to change with age. To date, forensic age prediction studies have focused mainly on whole blood samples, with mean absolute deviations (MAD) of 3-10 years, and have mostly used multiple linear regression models. A small number of studies have used machine learning algorithms such as support vector machines (SVM), artificial neural networks (ANN) and random forest regression (RFR) and achieved relatively low prediction errors (3.24-4.7 years); however, these studies were carried out only on fresh body fluids. In addition, age prediction from stains, which are more common in crime scene investigations, has not been studied systematically.

Therefore, the present invention aims to establish a sensitive, fast and reliable age prediction method, based on pyrosequencing and a random forest regression model, that is suitable for a variety of sample types, including bloodstains.

Summary of the invention

The present invention screens the genome sequence for a set of DNA methylation age prediction sites suitable for analyzing samples in forensic casework, designs primers for each site, analyzes the methylation level of each site by pyrosequencing, and then uses random forest regression to build separate age prediction models for males and females. The aim is a sensitive, fast and reliable age prediction method that maintains high accuracy while using few sites. The method can be used for age prediction from bloodstains and other samples; in it, the DNA extraction, primer design and sequencing protocols have all been optimized.

Terms:

RFR: Random Forest Regressor (random forest regression).

SVR: Support Vector Regression.

MAD: Mean Absolute Deviation (mean absolute error).
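As an illustration of the MAD term defined above, the following minimal sketch computes the mean absolute difference between predicted and actual ages. The function name and the example ages are hypothetical and are not taken from the patent data.

```python
def mean_absolute_deviation(predicted, actual):
    """Mean of |predicted - actual| over paired ages, in years."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


# Hypothetical ages, for illustration only.
predicted_ages = [23.0, 41.5, 58.0, 66.0]
actual_ages = [25.0, 40.0, 61.0, 64.0]
mad = mean_absolute_deviation(predicted_ages, actual_ages)
print(f"MAD = {mad:.3f} years")
```

A model meets the accuracy level stated in this document when this value is below 3 years on held-out samples.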

In one aspect, the present invention provides a method for age prediction.

The method includes pyrosequencing and random forest regression analysis. The random forest regression model is constructed using the R package randomForest, and forward selection is used to determine the optimal site combination.

The parameters for constructing the random forest regression model are set as follows: the mtry parameter equals the number of CpG sites used in each model, the minimum node size is 5, and the number of trees is 1000.

The age-related DNA methylation markers used for the random forest regression model are located in the ELOVL2, C1orf132, TRIM59, KLF14, FHL2 and NPTX2 genes.

When the age prediction models are built by random forest regression, a total of 7 age-related DNA methylation sites are used: 3 for males (TRIM59.pos7, KLF14.pos2 and ELOVL2.pos7) and 4 for females (TRIM59.pos8, KLF14.pos3, C1orf132.pos2 and FHL2.pos6).
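The sex-specific site panels can be represented as a simple lookup. The sketch below is only an illustration: the dictionary layout, the function name and the methylation values are assumptions, and the gene is written C1orf132 as in the gene list above.

```python
# Site panels as listed in this document; the data structure is an assumption.
SITES_BY_SEX = {
    "male": ["TRIM59.pos7", "KLF14.pos2", "ELOVL2.pos7"],
    "female": ["TRIM59.pos8", "KLF14.pos3", "C1orf132.pos2", "FHL2.pos6"],
}


def feature_vector(methylation, sex):
    """Return methylation values in panel order for the given sex."""
    sites = SITES_BY_SEX[sex]
    missing = [s for s in sites if s not in methylation]
    if missing:
        raise KeyError(f"missing methylation values for: {missing}")
    return [methylation[s] for s in sites]


# Hypothetical methylation percentages for one male sample.
sample = {"TRIM59.pos7": 31.0, "KLF14.pos2": 4.5, "ELOVL2.pos7": 62.0}
print(feature_vector(sample, "male"))
```

The resulting vector would be passed to the model trained for the same sex.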

The volume of PCR product used in the pyrosequencing is 12 μL.

In some embodiments, the method includes the following steps:

(1) DNA extraction;

(2) bisulfite conversion;

(3) PCR;

(4) pyrosequencing;

(5) model prediction.

In another aspect, the present invention provides a gene combination for age prediction by random forest regression analysis.

The gene combination includes ELOVL2, C1orf132, TRIM59, KLF14, FHL2 and NPTX2.

The methylation sites include the male-related sites TRIM59.pos7, KLF14.pos2 and ELOVL2.pos7 and the female-related sites TRIM59.pos8, KLF14.pos3, C1orf132.pos2 and FHL2.pos6.

In yet another aspect, the present invention provides a set of primers for age prediction by random forest regression analysis.

The primers are used for pyrosequencing.

The primers and their sequencing sites are as follows:

In the primer names, F, R and S denote the forward primer, reverse primer and sequencing primer, respectively; the label "biotin" before a sequence indicates that the primer is biotin-labeled.

In yet another aspect, the present invention provides the use of the aforementioned method and/or gene combination and/or methylation sites and/or primers in the preparation of a kit for predicting age.

In yet another aspect, the present invention provides a kit for predicting age.

The kit includes the following primers:

The kit also includes other reagents for pyrosequencing.

The kit is used in conjunction with the random forest regression model.

Beneficial effects of the present invention:

(1) Only 0.1 ng of template DNA is needed, so the method can be used for difficult forensic bloodstain samples

Many techniques, such as EpiTYPER, SNaPshot, pyrosequencing and massively parallel sequencing (MPS), can measure DNA methylation fairly accurately. A major factor limiting the forensic use of EpiTYPER is that it requires up to 1 μg of genomic DNA; such quantities are hard to obtain in real crime scene investigations, where body fluid stains are the more common finding. Compared with EpiTYPER, MPS reduces the required template DNA to 10 ng, and SNaPshot requires 4 ng. Previous studies have indicated that successful methylation-based age prediction requires 10-20 ng of template DNA. In the present invention, by contrast, accurate age prediction is possible with 0.1 ng of template DNA. The method of the present invention therefore has the highest sensitivity among existing bloodstain methods and good prospects for forensic application.
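The template requirements quoted above can be put on one scale. The following sketch simply divides the stated amounts by 0.1 ng; the dictionary and function names are illustrative assumptions, and 1 μg is entered as 1000 ng.

```python
# Template DNA amounts stated in the text, in nanograms (1 ug = 1000 ng).
TEMPLATE_NG = {
    "EpiTYPER": 1000.0,
    "MPS": 10.0,
    "SNaPshot": 4.0,
    "this method": 0.1,
}


def fold_more_template(method, reference="this method"):
    """How many times more template DNA `method` needs than `reference`."""
    return TEMPLATE_NG[method] / TEMPLATE_NG[reference]


for name in ("EpiTYPER", "MPS", "SNaPshot"):
    print(f"{name}: about {fold_more_template(name):.0f}x more template")
```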

(2) The whole process can be completed within 10 hours

The method of the present invention can be completed within one day, far faster than other available methods. DNA extraction and quantification, bisulfite conversion, PCR and pyrosequencing take 2 h, 2.5 h, 3 h and 2 h, respectively, 9.5 h in total. By comparison, the standard procedures for both EpiTYPER and MPS take more than 2 days. MPS in particular requires specialized equipment and a complex bioinformatics pipeline and is difficult to complete within 3 days.
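The 10-hour figure follows from adding the four step durations given above. A minimal check, with hypothetical variable names:

```python
# Step durations in hours, as stated in the text.
STEP_HOURS = {
    "DNA extraction and quantification": 2.0,
    "bisulfite conversion": 2.5,
    "PCR": 3.0,
    "pyrosequencing": 2.0,
}

total_hours = sum(STEP_HOURS.values())
print(f"total bench time: {total_hours} h")
```

The sum leaves half an hour of slack within the stated 10-hour limit; hands-on time between steps is not itemized in the text.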

(3) To account for sex differences, two independent age prediction models are established, one per sex

Random forest regression (RFR) was chosen to build the age prediction models, using 3 sites for males (TRIM59.pos7, KLF14.pos2 and ELOVL2.pos7) and 4 sites for females (TRIM59.pos8, KLF14.pos3, C1orf132.pos2 and FHL2.pos6), 7 sites in total. The final models gave a prediction mean absolute deviation (MAD) of 2.8 years (R = 0.99) for males and 2.93 years (R = 0.98) for females.

(4) With only 3-4 CpG sites, the age prediction accuracy reaches MAD < 3 years

A meta-analysis of age prediction studies from the past few years shows that nearly all of the previously established models have a MAD above 3 years. Because it uses RFR, our model is the most efficient (MAD < 3 years with only 3-4 CpG sites: 3 for male samples and 4 for female samples). An age prediction model that uses few sites yet maintains high accuracy is more practical for forensic inference.

Description of drawings

Figure 1 shows that random forest regression (RFR) outperforms support vector regression (SVR) in age prediction.

Figure 2 shows predicted age versus actual age for the random forest regression (RFR) test dataset.

Figure 3 shows the sensitivity testing of the 7 methylation markers in trace amounts of DNA.

Figure 4 shows the correlation analysis between the methylation levels of the 7 CpG sites and age.

Figure 5 compares the accuracy of the age prediction method with that of published studies.

Detailed description

The present invention is described in further detail below with reference to specific examples. The following examples serve only to illustrate the invention and do not limit it. Unless otherwise specified, experimental methods for which no specific conditions are given were carried out under conventional conditions, and the materials, reagents and the like used in the following examples are commercially available.

Example 1 DNA extraction and site screening

(1) DNA extraction:

The DNA extraction protocol was optimized to reduce the loss of trace DNA from bloodstains.

The accuracy of methylation analysis depends on extracting high-quality DNA from bloodstains. The QIAamp DNA Investigator kit is regarded as a relatively reliable method for extracting DNA from forensic samples and yields high-quality DNA within 2 hours. We optimized the kit protocol further by using a short incubation at a higher temperature, adding carrier RNA to the lysis buffer, and heating the reagent used to dissolve the DNA.

In the previous method, samples were incubated at 56°C for 1 hour. In the improved method, samples are incubated at 85°C for 10 minutes and then at 56°C for 1 hour, which increases the yield of bloodstain DNA.

Heating the reagent used to dissolve the DNA also accelerates the release of cells from the bloodstain and increases DNA dissolution, which reduces the loss of trace DNA.

(2) Site selection:

Based on the literature, six age-related DNA methylation markers, located in ELOVL2, C1orf132, TRIM59, KLF14, FHL2 and NPTX2, were selected to ensure that we focused on age-related regions.

(3) Primer design:

Because the DNA obtained from bloodstains is very limited in quantity and low in quality, the accuracy and sensitivity of the PCR are critical. PCR primers and sequencing primers were designed using PyroMark Assay Design version 2.0 (Qiagen, Germany). During design, the target sequence was adjusted so that the primers covered as many cytosines (C) as possible, in order to detect more methylation sites. We avoided SNPs and other polymorphisms in the target region, because they can bias the sequencing reaction. In addition, possible methylation sites were excluded from the primer-binding sequences, the GC content was kept below 60%, and primers with high specificity (that is, forming no primer dimers) were selected. Where necessary, we modified published methods (for example, by adding dimethyl sulfoxide (DMSO) to avoid dimer formation) to optimize the protocol. Of the two PCR primers, one must be biotin-labeled at its 5' end so that it binds to streptavidin-coated magnetic beads for the subsequent isolation and purification of single-stranded PCR product; the other must not be labeled. Biotin-labeled primer preparations contain free biotin, which competes with the template for binding to the streptavidin-coated beads and lowers the signal, so HPLC-purified biotin-labeled primers must be used. The amplicon lengths for the target genes ranged from 105 to 306 bp. The final primers are shown in the table below:

Table 1 PCR primers, pyrosequencing primers and CpG sequences for age-related methylation analysis
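Two of the primer design rules described in this example, GC content below 60% and no possible methylation site (CpG) in the primer-binding sequence, can be checked mechanically. The sketch below is a simplified illustration and not the PyroMark Assay Design algorithm; the function names are hypothetical, and the example primer is SEQ ID NO: 4 from the sequence listing of this document.

```python
def gc_fraction(seq):
    """Fraction of G and C bases; spaces from the listing layout are ignored."""
    seq = seq.replace(" ", "").lower()
    return sum(seq.count(base) for base in "gc") / len(seq)


def has_cpg(seq):
    """True if the sequence contains a CpG dinucleotide."""
    return "cg" in seq.replace(" ", "").lower()


def passes_basic_checks(seq, max_gc=0.60):
    """GC content below max_gc and no CpG in the primer sequence."""
    return gc_fraction(seq) < max_gc and not has_cpg(seq)


SEQ_ID_4 = "aggggagtag ggtaagtgag g"
print(round(gc_fraction(SEQ_ID_4), 3), passes_basic_checks(SEQ_ID_4))
```

Primer-dimer formation and SNP avoidance, which the text also requires, need sequence alignment and variant databases and are outside this sketch.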

Example 2 Detection of DNA methylation by pyrosequencing

(1) Bisulfite conversion

The extracted DNA (40 μL) was bisulfite-converted using the EpiTect Fast DNA Bisulfite Kit (Qiagen, Germany). The DNA sample was mixed with the CT conversion reagent (from the bisulfite kit) to a final volume of 140 μL, incubated at 95°C for 5 minutes and at 60°C for 20 minutes, and then purified.
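For readers less familiar with the chemistry: bisulfite treatment converts unmethylated cytosine to uracil, which is read as thymine after PCR, while methylated cytosine is protected and still reads as cytosine. The following in-silico sketch illustrates this on a hypothetical 8-base template; the function name and the template are assumptions made for illustration.

```python
def bisulfite_convert(seq, methylated_positions=frozenset()):
    """Return the sequence as read after bisulfite conversion and PCR.

    Cytosines at the given 0-based positions are treated as methylated and
    kept as C; every other C is reported as T.
    """
    converted = []
    for i, base in enumerate(seq.upper()):
        if base == "C" and i not in methylated_positions:
            converted.append("T")
        else:
            converted.append(base)
    return "".join(converted)


template = "ACGTCCGA"  # hypothetical template with CpGs at positions 1 and 5
print(bisulfite_convert(template))          # no methylation
print(bisulfite_convert(template, {1, 5}))  # both CpG cytosines methylated
```

The C/T difference at each CpG is what the pyrosequencing step quantifies.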

(2) PCR

The reaction mixture (25 μL) contained 2 μL of bisulfite-converted DNA, 12.5 μL of PCR master mix (Qiagen, Germany) and 0.1-0.5 mM primers. Primer concentrations were adjusted to obtain a specific DNA product free of dimers. Thermal cycling conditions were as follows: initial denaturation at 95°C for 10 minutes; 45 cycles of 95°C for 30 seconds, 56°C for 30 seconds (58°C for 30 seconds for NPTX2) and 72°C for 30 seconds; and a final extension at 72°C for 5 minutes. The products were checked by agarose gel electrophoresis.
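The programmed hold time of this cycling protocol can be worked out directly. The sketch below counts only the hold steps stated above; temperature ramping between steps is instrument-dependent and is not included, and the function name is hypothetical.

```python
def hold_time_seconds(initial_s, cycle_steps_s, cycles, final_s):
    """Total programmed hold time of a PCR protocol, excluding ramp time."""
    return initial_s + cycles * sum(cycle_steps_s) + final_s


total_s = hold_time_seconds(
    initial_s=10 * 60,           # 95 C for 10 min
    cycle_steps_s=[30, 30, 30],  # denaturation, annealing, extension
    cycles=45,
    final_s=5 * 60,              # 72 C for 5 min
)
print(f"{total_s} s = {total_s / 60} min")
```

The result, 82.5 minutes of hold time, is consistent with the roughly 3 h allotted to the PCR step elsewhere in this document once ramping, setup and gel electrophoresis are added.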

(3) Pyrosequencing

Templates prepared from the biotin-labeled PCR products were sequenced on a PyroMark Q48 pyrosequencer (Qiagen, Germany) with the Pyro-Gold kit (Qiagen, Germany). In the previous pyrosequencing procedure, 10 μL of PCR product was used, which gave unstable signals that could not be clearly distinguished from the background. Our method increases the volume of PCR product to 12 μL, which effectively prevents unstable signals.
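The text does not spell out how the methylation level is derived from the pyrosequencing signal. A commonly used definition, assumed here, is the C signal as a percentage of the combined C and T signals at a CpG position, for an assay read on the strand where methylated positions appear as C. The function name and the signal values are hypothetical.

```python
def methylation_percent(c_signal, t_signal):
    """Percent methylation at one CpG as C / (C + T) * 100."""
    total = c_signal + t_signal
    if total <= 0:
        raise ValueError("no signal at this position")
    return 100.0 * c_signal / total


# Hypothetical peak heights at one CpG position.
print(methylation_percent(c_signal=30, t_signal=70))
```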

Example 3 Construction of the bloodstain age prediction model

(1) Comparing the age prediction accuracy of the SVR and RFR models

Our previous results showed that the SVR model is more accurate than methods such as multiple linear regression, multiple nonlinear regression and back-propagation neural networks. We therefore used SVR and RFR models, built on combinations of all 46 CpG sites, to establish the best-fitting age prediction model and calculate its prediction accuracy. The SVR model was built with the R package e1071 using the parameters cost = 2, gamma = 0.8 and epsilon = 0.1. The RFR model was built with the R package randomForest; the mtry parameter equaled the number of CpG sites in each model, the minimum node size was 5, and the number of trees was 1000.
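The model settings above can be collected in one place. In the sketch below the keys follow the argument names of the R functions `svm` (package e1071) and `randomForest` (package randomForest); the text names only `mtry`, `cost`, `gamma` and `epsilon` explicitly, so mapping "number of trees" to `ntree` and "minimum node size" to `nodesize` is an assumption of this sketch, as is the helper function itself.

```python
# SVR settings as stated in the text (argument names of e1071::svm).
SVR_SETTINGS = {"cost": 2, "gamma": 0.8, "epsilon": 0.1}


def rfr_settings(n_sites):
    """Random forest settings for a model built on n_sites CpG sites.

    mtry equals the number of sites, as stated in the text; ntree and
    nodesize are the assumed names for tree count and minimum node size.
    """
    if n_sites < 1:
        raise ValueError("a model needs at least one CpG site")
    return {"ntree": 1000, "mtry": n_sites, "nodesize": 5}


print(rfr_settings(3))  # male model, 3 sites
print(rfr_settings(4))  # female model, 4 sites
```

Setting mtry equal to the number of predictors means every site is considered at every split, so the forest behaves like bagged regression trees rather than sampling a random subset of sites per split.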

To speed up computation, forward selection was used to determine the optimal site combination. The 241 bloodstain samples were prepared from whole blood of 241 healthy Han Chinese volunteers aged 10-79 years (128 males and 113 females). All donors gave informed consent, and the study received ethical approval from the Beijing Institute of Genomics, Chinese Academy of Sciences. From these samples, 70% were drawn at random to form the training dataset, and the remaining 30% served as the test dataset for assessing the accuracy of the RFR model. Training was repeated 100 times, and the best site (the one giving the smallest MAD) was recorded each time. The site recorded most often was chosen as the site for the final model. For the two-site model, the site recorded most often and with the smallest MAD, once the best site had been fixed, was taken as the second-best site.
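The forward selection procedure can be sketched as a greedy loop with voting over repeated random splits. This is an illustration under several assumptions: a small nearest-neighbour regressor stands in for the random forest so that the sketch runs with the standard library alone, a fresh 70/30 split is drawn in every repetition, and all function names are hypothetical.

```python
import random
from collections import Counter


def knn_predict(train_x, train_y, x, k=3):
    """Stand-in regressor (NOT a random forest): mean age of k nearest samples."""
    order = sorted(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], x)),
    )[:k]
    return sum(train_y[i] for i in order) / len(order)


def mad_for_sites(samples, ages, sites, train_idx, test_idx):
    """MAD on the test indices for a model restricted to the given sites."""
    def row(i):
        return [samples[i][s] for s in sites]
    train_x = [row(i) for i in train_idx]
    train_y = [ages[i] for i in train_idx]
    errors = [abs(knn_predict(train_x, train_y, row(i)) - ages[i]) for i in test_idx]
    return sum(errors) / len(errors)


def best_next_site(samples, ages, chosen, candidates, rng, repeats=100):
    """Vote over repeated random 70/30 splits for the site with the lowest MAD."""
    votes = Counter()
    n = len(samples)
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        cut = int(n * 0.7)
        train_idx, test_idx = idx[:cut], idx[cut:]
        scores = {
            s: mad_for_sites(samples, ages, chosen + [s], train_idx, test_idx)
            for s in candidates if s not in chosen
        }
        votes[min(scores, key=scores.get)] += 1
    return votes.most_common(1)[0][0]


def forward_select(samples, ages, candidates, n_sites, seed=0):
    """Greedily add the most frequently winning site until n_sites are chosen."""
    rng = random.Random(seed)
    chosen = []
    for _ in range(min(n_sites, len(candidates))):
        chosen.append(best_next_site(samples, ages, chosen, candidates, rng))
    return chosen
```

Here `samples` is a list of dictionaries mapping site names to methylation values and `ages` is the matching list of ages. Evaluating one added site at a time keeps the number of fitted models linear in the number of candidate sites, which is the speed advantage over testing every combination.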

With the RFR age prediction models, which use 4 sites for females (TRIM59.pos8, KLF14.pos3, C1orf132.pos2 and FHL2.pos6) and 3 sites for males (TRIM59.pos7, KLF14.pos2 and ELOVL2.pos7), the resulting MADs were below 3 years. With the SVR model, the MAD leveled off at about 4.5 years even when 8 sites were used for both males and females. This result indicates that RFR outperforms SVR in age prediction (Figure 1).

(2) Verifying prediction accuracy on the test dataset

The remaining 30% of the bloodstain samples (38 males and 33 females) served as the test dataset. The age prediction accuracy of the 7 sites selected for the final models (3 male sites: TRIM59.pos7, KLF14.pos2 and ELOVL2.pos7; 4 female sites: TRIM59.pos8, KLF14.pos3, C1orf132.pos2 and FHL2.pos6) was verified with the RFR model, giving a prediction MAD of 2.8 years (R = 0.99) for males and 2.93 years (R = 0.98) for females (Figure 2).
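The test set sizes quoted above are consistent with taking 30% of each sex separately and rounding down. That rounding convention is an assumption of this sketch, since the text does not state how the split was rounded; the function name is hypothetical.

```python
def split_counts(n, test_percent=30):
    """Return (training, test) sample counts, rounding the test share down."""
    test = n * test_percent // 100
    return n - test, test


male_train, male_test = split_counts(128)
female_train, female_test = split_counts(113)
print(male_train, male_test, female_train, female_test)
```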

Example 4 Sensitivity testing

Whole blood samples were collected from 241 healthy Han Chinese volunteers (128 males and 113 females) aged 10-79 years. All donors gave informed consent, and the study received ethical approval from the Beijing Institute of Genomics, Chinese Academy of Sciences.

Bloodstains were prepared by spotting 20 μL aliquots of whole blood onto filter paper and were then stored at room temperature for 1 year. To determine the detection sensitivity, DNA extracted from the bloodstains was serially diluted to 100, 50, 10, 5, 2.5, 1.0, 0.50, 0.25 and 0.10 ng. Samples at each DNA amount underwent methylation analysis: bisulfite conversion followed by PCR amplification and pyrosequencing (following the method of Example 1). The methylation percentages obtained with 0.1 ng of DNA were compared with those obtained with higher DNA amounts to assess the sensitivity of the proposed methylation assay for bloodstains.

For the 4 female CpG sites (TRIM59.pos8, KLF14.pos3, C1orf132.pos2 and FHL2.pos6) and the 3 male CpG sites (TRIM59.pos7, KLF14.pos2 and ELOVL2.pos7) used for age prediction, we observed no significant difference in methylation percentage between 0.1 ng of DNA and higher amounts of DNA (P ≥ 0.05, KS test; Figure 3). For the ELOVL2.pos7 site, 1.0 ng of DNA was needed to reach a similar level.
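The KS test compares two samples through the largest gap between their empirical cumulative distribution functions. The sketch below computes only that two-sample statistic D; converting D to a P value is left to a statistics package, and the text does not say which implementation was used. The function name and the methylation percentages are hypothetical.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic D = max |F_a(x) - F_b(x)|."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for x in sorted(set(a) | set(b)):
        cdf_a = bisect_right(a, x) / len(a)  # share of sample_a <= x
        cdf_b = bisect_right(b, x) / len(b)  # share of sample_b <= x
        d = max(d, abs(cdf_a - cdf_b))
    return d


# Hypothetical methylation percentages at one site for two DNA amounts.
low_input = [41.0, 43.5, 42.2, 44.1]
high_input = [42.0, 43.0, 42.8, 43.9]
print(ks_statistic(low_input, high_input))
```

A small D, and therefore a large P value, means the methylation readings at the low DNA amount are distributed like those at higher amounts.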

Example 5 Correlation analysis between DNA methylation level and age

Whole blood samples were collected from 241 healthy Han Chinese volunteers (128 males and 113 females) aged 10-79 years, prepared as bloodstain samples and subjected to methylation analysis, and the correlation between the sites of the present invention and age was analyzed. The final bloodstain age prediction model of the present invention contains 7 CpG sites spanning 3 genes: 3 known sites and 4 new CpG sites. The results show that 5 of these CpG sites, from 3 genes (TRIM59, KLF14 and C1orf132), correlate with age in the bloodstain analysis of Chinese subjects (Figure 4).

Example 6 Comparison with the age prediction accuracy of published studies

A meta-analysis of age prediction studies from past years shows that nearly all reported MADs exceed 3 years (Figure 5). Compared with previous studies, our model is the most efficient because it uses RFR (MAD < 3 years with only 3-4 CpG sites). In Figure 5, solid points represent published results, and the mathematical methods used in the different models are indicated by different shapes. The cross-shaped ("+") and asterisk-shaped ("*") symbols represent our age prediction results for females and males, respectively.

Sequence listing

<110> Shanxi Medical University

<120> Methods for Age Prediction Using Pyrosequencing and Random Forest Regression Analysis

<160> 18

<170> SIPOSequenceListing 1.0

<210> 1
<211> 29
<212> DNA
<213> Artificial Sequence
<400> 1
tagtaaatat ataagtgggg gaagaaggg 29

<210> 2
<211> 27
<212> DNA
<213> Artificial Sequence
<400> 2
ttaataaaac caaattctaa aacattc 27

<210> 3
<211> 24
<212> DNA
<213> Artificial Sequence
<400> 3
caccttacca ccaaaccaaa attt 24

<210> 4
<211> 21
<212> DNA
<213> Artificial Sequence
<400> 4
aggggagtag ggtaagtgag g 21

<210> 5
<211> 30
<212> DNA
<213> Artificial Sequence
<400> 5
caaaaccatt tccccctaat atatacttca 30

<210> 6
<211> 20
<212> DNA
<213> Artificial Sequence
<400> 6
gggaggagat ttgtaggttt 20

<210> 7
<211> 21
<212> DNA
<213> Artificial Sequence
<400> 7
gggttttggg agtatagtag t 21

<210> 8
<211> 27
<212> DNA
<213> Artificial Sequence
<400> 8
acacctccta aaacttctcc aatctcc 27

<210> 9
<211> 21
<212> DNA
<213> Artificial Sequence
<400> 9
gttttgggag tatagtagtt a 21

<210> 10
<211> 28
<212> DNA
<213> Artificial Sequence
<400> 10
ggttttaggt taagttatgt ttaatagt 28

<210> 11
<211> 30
<212> DNA
<213> Artificial Sequence
<400> 11
actaaaaaat ttccctctat taccattacc 30

<210> 12
<211> 24
<212> DNA
<213> Artificial Sequence
<400> 12
atagttttag aaattatttt gttt 24

<210> 13
<211> 29
<212> DNA
<213> Artificial Sequence
<400> 13
tagtaaatat ataagtgggg gaagaaggg 29

<210> 14
<211> 28
<212> DNA
<213> Artificial Sequence
<400> 14
atttaataaa accaaattct aaaacatt 28

<210> 15
<211> 25
<212> DNA
<213> Artificial Sequence
<400> 15
ggggttaagt tattaagttt tgaag 25

<210> 16
<211> 21
<212> DNA
<213> Artificial Sequence
<400> 16
tataggtggt ttgggggaga g 21

<210> 17
<211> 27
<212> DNA
<213> Artificial Sequence
<400> 17
aaaaaacact accctccaca acataac 27

<210> 18
<211> 15
<212> DNA
<213> Artificial Sequence
<400> 18
ttgggggaga ggttg 15
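Each entry in the listing declares its length under <211>, so transcription errors can be caught by comparing that number with the actual sequence length. The sketch below checks four entries copied from the listing; the data structure and function name are assumptions made for illustration.

```python
# (declared length from <211>, sequence from <400>) for selected SEQ ID NOs.
ENTRIES = {
    1: (29, "tagtaaatat ataagtgggg gaagaaggg"),
    4: (21, "aggggagtag ggtaagtgag g"),
    16: (21, "tataggtggt ttgggggaga g"),
    18: (15, "ttgggggaga ggttg"),
}


def length_mismatches(entries):
    """Return SEQ ID numbers whose sequence length differs from the declared one."""
    return [
        seq_id
        for seq_id, (declared, seq) in entries.items()
        if len(seq.replace(" ", "")) != declared
    ]


print(length_mismatches(ENTRIES))
```

An empty list means every checked sequence matches its declared length; a copy with an inserted or dropped base would be reported by its SEQ ID number.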

Claims

1. A method for age prediction, characterized in that the method comprises pyrosequencing and random forest regression analysis; the random forest regression model is constructed using the R package randomForest, and the optimal site combination is determined by forward selection; when the age prediction models are built by random forest regression, a total of 7 age-related DNA methylation sites are used, of which the 3 male sites are TRIM59.pos7, KLF14.pos2 and ELOVL2.pos7 and the 4 female sites are TRIM59.pos8, KLF14.pos3, C1orf132.pos2 and FHL2.pos6;

The specific information of the DNA methylation sites is as follows:

The chromosomal coordinate of the gene C1orf132 is chr1:207823675, and the CpG_ID is cg10501210;

The chromosomal coordinate of the gene ELOVL2 is chr6:11044644, and the CpG_ID is cg16867657;

The chromosomal coordinate of the gene FHL2 is chr2:105399282, and the CpG_ID is cg06639320;

The chromosomal coordinate of the gene KLF14 is chr7:130734355, and the CpG_ID is cg14361627;

The chromosomal coordinate of the gene NPTX2 is chr7:98616518, and the CpG_ID is cg00548268;

The chromosomal coordinate of the gene TRIM59 is chr3:160450199;

The specific information of the DNA methylation site takes hg38 as the reference genome;

The parameters for constructing the random forest regression model in the method are set as follows: the mtry parameter equals the number of CpG sites used in each model, the minimum node size is 5, and the number of trees is 1000.

2. The method according to claim 1, wherein the volume of the PCR product used in the pyrosequencing is 12 μL.

3. Use of a combination of methylation sites in age prediction, wherein the methylation sites consist of male-related sites and female-related sites; the male-related sites are TRIM59.pos7, KLF14.pos2 and ELOVL2.pos7; the female-related sites are TRIM59.pos8, KLF14.pos3, C1orf132.pos2 and FHL2.pos6;

The specific information of the DNA methylation sites is as follows:

The chromosomal coordinate of the gene TRIM59 is chr3:160450199;

The described age prediction is achieved by random forest regression analysis.

4. Use of the reagent for detecting the combination of methylation sites described in claim 3 in the preparation of a kit for predicting age.