CN109402241A

CN109402241A - Method for identifying and analyzing ancient DNA samples

Info

Publication number: CN109402241A
Application number: CN201710667605.XA
Authority: CN
Inventors: 郭小森; 兰天明; 蒋慧
Original assignee: BGI Shenzhen Co Ltd
Current assignee: BGI Shenzhen Co Ltd
Priority date: 2017-08-07
Filing date: 2017-08-07
Publication date: 2019-03-01

Abstract

The invention discloses a method for identifying and analyzing ancient DNA samples, including a method for obtaining the DNA information of a DNA sample to be tested. The method comprises the following steps: constructing a library from the DNA sample to be tested and sequencing it to obtain sequencing data; filtering the sequencing data; and aligning the filtered sequencing data to obtain an alignment result, wherein the alignment result contains the DNA information of the DNA sample to be tested and the alignment allows at most 4 mismatched bases. With this method, the DNA information of an ancient DNA sample to be tested can be obtained effectively through library construction and sequencing. The information is accurate and highly reliable, and can be used for genome analysis of the ancient DNA to be tested, such as variant detection, ancient DNA authentication, sex determination, and assessment of the modern human DNA contamination rate.

Description

Method for identifying and analyzing ancient DNA samples

Technical field

The present invention relates to the technical field of biological sequencing, and in particular to a method for identifying and analyzing ancient DNA samples.

Background art

Ancient biological samples are essential to research on the evolutionary history of modern biological populations. Research on ancient human genomes has shown that the genetic makeup of modern humans does not derive from African ancestors alone: after leaving Africa, modern humans also exchanged genes with archaic Neanderthals and Denisovans, which overturned the earlier understanding of modern human evolutionary history. Research on ancient genomes also plays an irreplaceable role in studying natural selection and disease in modern biological populations, especially humans; for example, the high-altitude adaptation gene of Tibetans has been shown to result from introgression from the Denisovan genome. As a genetic resource that cannot be reproduced, ancient biological samples greatly advance research on the evolution, selection, and disease of modern biological populations and cannot be substituted.

Genetic research on ancient organisms has reached the genomic level. China is rich in paleontological resources: it has abundant animal and plant fossils and subfossils, and ancient human samples continue to be unearthed. One of the biggest bottlenecks limiting the development of ancient human research in China is the lack of a systematic summary of methods for ancient DNA processing and information analysis.

Thus, current methods for identifying and analyzing ancient DNA samples still need improvement.

Summary of the invention

The present invention aims to solve at least one of the technical problems in the prior art. To this end, one object of the invention is to construct a standard information analysis workflow for ancient DNA based on Illumina second-generation sequencing data, and to provide a set of analysis methods for ancient human genomes.

It should be noted that the present invention was completed on the basis of the following findings and work of the inventors:

The inventors carried out a series of theoretical studies and experimental explorations on methods for ancient DNA processing and information analysis, and found the following:

1. Ancient DNA is already highly fragmented, so no DNA fragmentation step is needed when constructing the DNA library; library construction can proceed directly once DNA extraction is complete.

2. For ancient DNA, long reads should not be chosen for sequencing; the read length should be kept within 100 bp. The average length of ancient DNA is about 50-70 bp, so if the read length exceeds 100 bp, a large amount of adapter contamination is introduced and a large amount of data is wasted.

3. For ancient DNA, the most important step after the raw Fastq data come off the sequencer is to filter the data according to the characteristics of Illumina data and the sequence features of ancient DNA, with the aim of removing low-quality sequences and exogenous contaminating DNA sequences as far as possible. Data filtering mainly covers 4 aspects: filtering adapters, filtering low-quality bases with quality value Q≤10, filtering N regions (regions that cannot be identified), and removing reads shorter than 30 bp or longer than 99 bp. Reads shorter than 30 bp cause more misalignments in the subsequent alignment. Because ancient DNA is highly fragmented, with an average length of generally 50-70 bp, an overly long read (longer than 99 bp) most likely comes from modern DNA contamination; therefore, to retain ancient DNA as far as possible, these reads should be deleted. This step is extremely important: if reads longer than 99 bp are not deleted, the accuracy of subsequent species identification is affected. This is also a major difference from species identification in modern biological samples.

4. To be compatible with most downstream analyses of alignment results, after quality control the raw ancient human DNA data are aligned using both SoapAligner and BWA. Alignment with SoapAligner ultimately produces results in Soap format; alignment with BWA ultimately produces results in sam format. Considering that deamination of ancient DNA causes many mutations, at most 4 mismatched bases are allowed during alignment. The alignment results are thus accurate and reliable, which benefits subsequent analysis.

5. Variant detection is performed on the aligned data with both SoapSnp and GATK, mainly to detect single nucleotide variants. When SoapSnp is used for variant detection, results are output in cns format, that is, all sites are output. This benefits subsequent analysis.

6. Ancient DNA authentication: authenticating ancient DNA is the most basic prerequisite for subsequent customized information analysis. Drawing on the molecular features of ancient DNA, the inventors propose a method for authenticating ancient DNA based on at least one of the following 2 aspects:

(1) Based on deamination mutation features:

Deamination mutation features of ancient DNA: during long-term preservation of ancient biological samples, double-stranded DNA suffers an important kind of chemical damage, namely cytosine deamination. Deamination occurs mainly at the ends of DNA fragments, that is, the 5' and 3' ends. Deamination converts cytosine to uracil, which introduces C->T mutations during library construction and sequencing. Therefore, when ancient DNA undergoes second-generation sequencing, a large number of C->T and G->A mutations appear at the 5' and 3' ends of the reads. The inventors consider that this mutation pattern can serve as one piece of evidence for identifying whether the obtained sequences are ancient DNA.

(2) Based on DNA fragmentation features:

Depurination (DNA fragmentation) features: depurination is one of the most important chemical processes causing DNA strand breaks during preservation of ancient DNA; that is, a considerable part of ancient DNA fragmentation is caused by depurination. The inventors consider that when ancient DNA fragments are aligned to a reference genome, depurination shows up as a greatly increased proportion of purines at the base immediately preceding the 5' end of the reads and, conversely, a greatly increased proportion of pyrimidines at the base immediately following the 3' end. The inventors therefore consider that this breakage pattern of ancient DNA, like deamination, can serve as one piece of important evidence for identifying whether a sample is ancient DNA.

7. For female ancient DNA samples, the inventors also constructed a method for assessing exogenous DNA contamination using the Y chromosome. The method first obtains the Y chromosome specific region (YUR, a region that is not homologous to any other chromosome and contains no repetitive sequence). The reads of the ancient DNA are then aligned to the YUR, and the expected value under the assumption that the sample is male is calculated from the YUR and the number of specific reads. The ratio between the number of reads actually aligned and the expected value is the contamination rate from males.

Thus, in a first aspect, the present invention provides a method for obtaining the DNA information of a DNA sample to be tested. According to an embodiment of the invention, the method comprises the following steps: constructing a library from the DNA sample to be tested and sequencing it to obtain sequencing data, wherein no DNA fragmentation step is performed during library construction and the length of the sequencing reads does not exceed 100 bp; filtering the sequencing data to obtain filtered sequencing data; and aligning the filtered sequencing data to obtain an alignment result, the alignment result containing the DNA information of the DNA sample to be tested. The filtering includes at least one of the following: (1) removing adapter sequences; (2) removing low-quality bases with quality value Q≤10, wherein when the low-quality bases account for 50% or more of the total bases of a read, the whole read is deleted, and when the low-quality bases are at the end of a read and do not exceed 50% of the whole read, only the low-quality bases are trimmed; (3) filtering N regions, wherein when the proportion of N in a read is greater than 10%, the read is removed, and when N regions exist only at the two ends of a read, only the N regions at the two ends are trimmed; (4) removing reads shorter than 30 bp or longer than 99 bp. The alignment allows at most 4 mismatched bases.

It should be noted that, in the filtering of N regions described herein (removing a read when its proportion of N is greater than 10%, and trimming only the N regions at the two ends when N regions exist only at the two ends of a read), "N region" refers to a region that cannot be identified, and "proportion of N" refers to the proportion of bases that cannot be identified.
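The filtering criteria above can be sketched in Python as follows. This is a minimal illustration rather than the inventors' implementation: the function name `filter_read`, the order in which the checks run, and the choice to trim N bases and low-quality bases from the ends in a single pass are assumptions, and adapter removal (criterion 1) is left out because the source gives no adapter sequences.

```python
from typing import List, Optional, Tuple

MIN_LEN = 30   # reads shorter than 30 bp are removed
MAX_LEN = 99   # reads longer than 99 bp are removed
LOW_Q = 10     # bases with quality value Q <= 10 count as low quality


def filter_read(seq: str, quals: List[int]) -> Optional[Tuple[str, List[int]]]:
    """Return the trimmed (sequence, qualities), or None if the read is dropped.

    Hypothetical helper: adapter trimming is assumed to have been done already.
    """
    if not seq or len(seq) != len(quals):
        return None
    # Criterion 3: drop the read if more than 10% of its bases are N.
    if seq.upper().count("N") / len(seq) > 0.10:
        return None
    # Criterion 2: drop the read if 50% or more of its bases are low quality.
    low = sum(1 for q in quals if q <= LOW_Q)
    if low / len(seq) >= 0.50:
        return None
    # Trim N bases and low-quality bases from the two ends only.
    start, end = 0, len(seq)
    while start < end and (seq[start] in "Nn" or quals[start] <= LOW_Q):
        start += 1
    while end > start and (seq[end - 1] in "Nn" or quals[end - 1] <= LOW_Q):
        end -= 1
    seq, quals = seq[start:end], quals[start:end]
    # Criterion 4: length filter applied to the trimmed read.
    if len(seq) < MIN_LEN or len(seq) > MAX_LEN:
        return None
    return seq, quals
```

Applying the length filter after trimming is a design choice made here so that a read shortened below 30 bp by trimming is also discarded; the source does not state the order.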

According to an embodiment of the invention, this method can effectively obtain the DNA information of an ancient DNA sample to be tested through library construction and sequencing. The information is accurate and highly reliable, and can be used for genome analysis of the ancient DNA to be tested, such as variant detection, ancient DNA authentication, sex determination, and assessment of the modern human DNA contamination rate.

According to an embodiment of the invention, the alignment is performed with both SoapAligner and BWA. The alignment results are thus accurate and reliable.

According to some embodiments of the invention, alignment with SoapAligner produces an alignment result in Soap format, and alignment with BWA produces an alignment result in sam format. This makes it convenient to merge the two alignment results, and the final alignment result is highly reliable.

In a second aspect, the present invention also provides a method for determining whether a DNA sample to be tested is ancient DNA. According to an embodiment of the invention, the method comprises the following steps: obtaining the DNA information of the DNA sample to be tested according to the aforementioned method for obtaining the DNA information of a DNA sample to be tested; performing variant detection based on the DNA information of the DNA sample to be tested, to determine the variant information of the DNA sample to be tested; and determining, based on the variant information, whether the DNA sample to be tested is ancient DNA, wherein the presence of at least one of the following indicates that the DNA sample to be tested is ancient DNA: (1) the sequencing reads show the following deamination feature: relative to the reference genome, C->T and G->A mutations occur at a rate greater than 10% at the 5' and 3' ends of the sequencing reads; (2) the sequencing reads show the following fragmentation feature: relative to the reference genome, the proportion of purines at the base immediately preceding the 5' end of the sequencing reads is significantly increased, and the proportion of pyrimidines at the base immediately following the 3' end is significantly increased. This method can authenticate ancient DNA effectively, with accurate, reliable, and reproducible results.

According to an embodiment of the invention, the variant detection is performed with both GATK and SoapSnp. The detection results are thus accurate and reliable.

According to some embodiments of the invention, when SoapSnp is used for variant detection, results are output in cns format. This facilitates subsequent analysis.

According to an embodiment of the invention, the variant information of the DNA sample to be tested includes single nucleotide variant information.
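As an illustration of the deamination criterion, the following Python sketch counts C->T substitutions at the 5' end and G->A substitutions at the 3' end of aligned reads. It is a simplified sketch under stated assumptions: each read is given together with the reference bases it aligns to, both in 5'->3' orientation, of equal length and without gaps; the function names are hypothetical; and reading "greater than 10%" as a test on the outermost position of each end is an interpretation, not something the source specifies.

```python
from typing import Iterable, List, Tuple


def end_damage_rates(pairs: Iterable[Tuple[str, str]],
                     n_pos: int = 5) -> Tuple[List[float], List[float]]:
    """Per-position C->T rate from the 5' end and G->A rate from the 3' end.

    pairs: (read, reference) strings of equal length, ungapped (assumption).
    Index 0 of each returned list is the outermost base of that end.
    """
    ct_hits, c_total = [0] * n_pos, [0] * n_pos
    ga_hits, g_total = [0] * n_pos, [0] * n_pos
    for read, ref in pairs:
        read, ref = read.upper(), ref.upper()
        length = min(len(read), len(ref))
        for i in range(min(n_pos, length)):
            # 5' end: reference C observed as T in the read.
            if ref[i] == "C":
                c_total[i] += 1
                ct_hits[i] += read[i] == "T"
            # 3' end: reference G observed as A in the read.
            j = length - 1 - i
            if ref[j] == "G":
                g_total[i] += 1
                ga_hits[i] += read[j] == "A"
    ct = [h / t if t else 0.0 for h, t in zip(ct_hits, c_total)]
    ga = [h / t if t else 0.0 for h, t in zip(ga_hits, g_total)]
    return ct, ga


def shows_deamination(ct: List[float], ga: List[float],
                      threshold: float = 0.10) -> bool:
    """True if both terminal positions exceed the 10% threshold (assumed rule)."""
    return bool(ct) and bool(ga) and ct[0] > threshold and ga[0] > threshold
```

In practice the read and reference pairs would be reconstructed from the alignment output; that step depends on the alignment format and is not shown.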

In a third aspect, the present invention also provides a method for determining the sex of the individual to which an ancient DNA sample belongs. According to an embodiment of the invention, the method comprises the following steps: obtaining the DNA information of the ancient DNA sample to be tested according to the aforementioned method for obtaining the DNA information of a DNA sample to be tested; based on the DNA information of the DNA sample to be tested, determining at least one of the following sex determination parameters: the ratio of the number of sequencing reads aligned to the X chromosome to the number of sequencing reads aligned to the Y chromosome, the ratio of the number of sequencing reads aligned to the X chromosome to the number of sequencing reads aligned to chromosome 8, the sequencing depth of each chromosome, and the heterozygote ratio of each chromosome; and based on at least one of the sex determination parameters, determining the sex of the individual to which the ancient DNA sample to be tested belongs, wherein: (1) a ratio of the number of sequencing reads aligned to the X chromosome to the number aligned to the Y chromosome close to 9:1 indicates that the individual is male; a ratio of the number of sequencing reads aligned to the X chromosome to the number aligned to chromosome 8 close to 1:1 indicates that the individual is female; (2) a Y chromosome sequencing depth close to the sequencing depth of the other chromosomes indicates that the individual is male; a Y chromosome sequencing depth significantly lower than that of the other chromosomes indicates that the individual is female; (3) an X chromosome heterozygote ratio significantly lower than that of the other chromosomes indicates that the individual is male; an X chromosome heterozygote ratio not significantly lower than that of the other chromosomes indicates that the individual is female. This method can effectively determine the sex of the individual to which an ancient DNA sample belongs, with accurate, reliable, and reproducible results.
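The read-count criterion (1) can be sketched as a small Python function. The function name and the relative tolerance `rel_tol` used to decide what counts as "close to" 9:1 or 1:1 are hypothetical; the source does not give a numerical tolerance.

```python
def call_sex_from_counts(x_reads: int, y_reads: int, chr8_reads: int,
                         rel_tol: float = 0.25) -> str:
    """Return 'male', 'female' or 'undetermined' from aligned read counts.

    rel_tol is an assumed tolerance for "close to"; it is not from the source.
    """
    def near(value: float, target: float) -> bool:
        return abs(value - target) <= rel_tol * target

    # X:Y read-count ratio close to 9:1 indicates a male individual.
    if y_reads > 0 and near(x_reads / y_reads, 9.0):
        return "male"
    # X:chromosome 8 read-count ratio close to 1:1 indicates a female individual.
    if chr8_reads > 0 and near(x_reads / chr8_reads, 1.0):
        return "female"
    return "undetermined"
```

Returning "undetermined" when neither ratio matches leaves room for the depth and heterozygote criteria to decide the call.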

In a fourth aspect, the present invention also provides a method for determining the male modern DNA contamination rate in a female ancient DNA sample. According to an embodiment of the invention, the method comprises the following steps:

obtaining the DNA information of the ancient DNA sample to be tested according to the aforementioned method for obtaining the DNA information of a DNA sample to be tested;

assuming that the female ancient DNA sample derives from a male and, based on the DNA information of the ancient DNA sample to be tested, determining the expected proportion of sequencing reads aligned to the Y chromosome specific region, wherein the expected proportion of sequencing reads aligned to the Y chromosome specific region is calculated as:

R = (number of sequencing reads aligned to the Y chromosome specific region / number of sequencing reads aligned to the genome) × 0.5; and

based on the expected proportion of sequencing reads aligned to the Y chromosome specific region, determining the Y chromosome contamination rate of the ancient DNA sample to be tested, the Y chromosome contamination rate of the ancient DNA sample to be tested being the male modern DNA contamination rate in the female ancient DNA sample,

wherein the Y chromosome contamination rate of the ancient DNA sample to be tested is calculated as:

C=(y/R) × (1/n),

where C is the Y chromosome contamination rate, y is the number of sequencing reads aligned to the Y chromosome specific region, R is the expected proportion of sequencing reads aligned to the Y chromosome specific region, and n is the total number of sequencing reads aligned to the genome.
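The formula C = (y/R) × (1/n) can be evaluated directly, as in the Python sketch below. The function name is hypothetical, and R is taken as an input here on the assumption that the expected proportion under the male hypothesis has already been determined separately.

```python
def y_contamination_rate(y: int, n: int, r: float) -> float:
    """Y chromosome contamination rate C = (y / R) * (1 / n).

    y: reads aligned to the Y chromosome specific region
    n: total reads aligned to the genome
    r: expected proportion of reads aligned to the Y chromosome specific
       region if the sample were male (assumed to be computed beforehand)
    """
    if n <= 0 or r <= 0:
        raise ValueError("n and r must be positive")
    return (y / r) * (1.0 / n)
```

For example, with y = 50, n = 1,000,000 and R = 0.001, the observed proportion y/n is 0.00005, which is 5% of the expected proportion, so C is 0.05.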

According to an embodiment of the invention, this method can effectively determine the male modern DNA contamination rate in a female ancient DNA sample. The method is reproducible, and its results are accurate and reliable.

According to an embodiment of the invention, the Y chromosome specific region is obtained as follows: the Y chromosome sequence of the human reference genome is split into a set of artificial reads about 30 bp long; the set of artificial reads is aligned to the part of the human reference genome that does not include the Y chromosome, to obtain aligned artificial reads; of all aligned artificial reads, only those with alignment errors of 3 or more bases are retained, and artificial reads containing repetitive sequence regions are then removed; all remaining artificial reads make up the Y chromosome specific region.
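The construction of the Y chromosome specific region can be outlined in Python as below. This is only a sketch: the alignment of artificial reads against the rest of the genome and the repeat annotation are assumed to come from external tools, so their results are passed in as plain inputs (`mismatches` and `repeat_starts`, both hypothetical names), and non-overlapping 30 bp tiles are an assumption since the source only says "about 30 bp".

```python
from typing import Dict, List, Set, Tuple


def tile_sequence(seq: str, read_len: int = 30,
                  step: int = 30) -> List[Tuple[int, str]]:
    """Split a sequence into artificial reads, returned as (start, bases)."""
    return [(start, seq[start:start + read_len])
            for start in range(0, len(seq) - read_len + 1, step)]


def select_specific(mismatches: Dict[int, int], repeat_starts: Set[int],
                    min_mismatches: int = 3) -> List[int]:
    """Start positions of artificial reads kept for the specific region.

    mismatches: start -> mismatch count of the artificial read when aligned
        to the genome without the Y chromosome (from an external aligner).
    repeat_starts: starts of artificial reads overlapping repetitive sequence.
    """
    return sorted(start for start, mm in mismatches.items()
                  if mm >= min_mismatches and start not in repeat_starts)
```

Keeping only tiles with 3 or more mismatches discards Y sequence that aligns well elsewhere in the genome, which is what makes the remaining region Y-specific.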

According to some specific examples of the invention, the human reference genome is Hg19.

It should further be noted that, according to embodiments of the invention, the method of the invention has at least one of the following advantages:

1. The method of the invention for determining the sex of the individual to which an ancient DNA sample belongs provides sex determination for ancient biological samples, and can be extended to humans and all other animal species that have sex chromosomes. This sex determination differs from that used for modern organisms; it is a dedicated identification method based on the characteristics of ancient DNA discovered and summarized by the inventors.

2. The method of the invention for determining the male modern DNA contamination rate in a female ancient DNA sample can detect the modern DNA contamination rate of ancient DNA sequencing data. This is important for ancient DNA analysis, because subsequent analysis of ancient DNA can only proceed once the modern DNA contamination rate has been assessed accurately.

Additional aspects and advantages of the invention will be set forth in part in the description below, and in part will become apparent from the description or be learned through practice of the invention.

Brief description of the drawings

The above and/or additional aspects and advantages of the invention will become apparent and readily understood from the following description of the embodiments in conjunction with the drawings, in which:

Fig. 1 is a flow chart of ancient human DNA information analysis according to an embodiment of the invention;

Fig. 2 shows the DNA damage analysis of the ancient human bone sample in Example 1,

wherein

panel a shows the DNA fragmentation analysis: the bases inside the grey box are the bases of the ancient DNA fragment, and the bases outside the grey box are the sequence preceding the first base at the 5' end of the ancient DNA fragment and the sequence following the last base at the 3' end;

panels b and c show the deamination analysis: the abscissa of both panels is the base position on the DNA fragment in the 5'-3' direction, where 0-25 in panel b denotes the first 25 bases at the 5' end of the DNA fragment and 25-0 in panel c denotes the last 25 bases at the 3' end of the DNA fragment; the ordinate is the percentage;

Fig. 3 shows, for each chromosome in Example 1, the heterozygotes as a percentage of the total number of bases;

Fig. 4 compares the numbers of reads (i.e. sequencing reads) aligned to chromosome 8 and to the sex chromosomes in Example 1;

Fig. 5 shows the sequencing depth distribution of the hair sample in Example 1; the vertical axis is depth and the horizontal axis is chromosome.

Detailed description of the embodiments

The solution of the invention is explained below with reference to examples. Those skilled in the art will understand that the following examples only illustrate the invention and should not be taken as limiting its scope. Where specific techniques or conditions are not indicated in the examples, they follow the techniques or conditions described in the literature in the art or the product instructions. Reagents or instruments whose manufacturers are not indicated are conventional, commercially available products.

General method:

According to an embodiment of the invention, and referring to Fig. 1, standardized information analysis of an ancient DNA sample to be tested by the method of the invention generally comprises the following steps:

1. Obtaining the DNA information of the DNA sample to be tested

The specific steps are as follows:

constructing a library from the DNA sample to be tested and sequencing it to obtain sequencing data, wherein no DNA fragmentation step is performed during library construction and the length of the sequencing reads does not exceed 100 bp;

filtering the sequencing data to obtain filtered sequencing data; and

aligning the filtered sequencing data with both SoapAligner and BWA to obtain an alignment result, the alignment result containing the DNA information of the DNA sample to be tested,

Wherein,

the filtering includes at least one of the following:

(1) removing adapter sequences;

(2) removing low-quality bases with quality value Q≤10, wherein when the low-quality bases account for 50% or more of the total bases of a read, the whole read is deleted; when the low-quality bases are at the end of a read and do not exceed 50% of the whole read, only the low-quality bases are trimmed;

(3) filtering N regions, wherein when the proportion of N in a read is greater than 10%, the read is removed; when N regions exist only at the two ends of a read, only the N regions at the two ends are trimmed;

(4) removing reads shorter than 30 bp or longer than 99 bp,

the alignment allows at most 4 mismatched bases,

alignment with SoapAligner produces an alignment result in Soap format, and alignment with BWA produces an alignment result in sam format.
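The limit of 4 mismatched bases is normally enforced through the aligner's own parameters. As a complementary illustration only, the Python sketch below checks an alignment record in sam format after the fact by reading its `NM` optional tag. This is an assumption-laden sketch: `NM` is the edit distance defined by the SAM specification, so it also counts inserted and deleted bases and is used here as a stand-in for the mismatch count; the function names are hypothetical.

```python
from typing import Optional


def edit_distance_from_sam(line: str) -> Optional[int]:
    """Return the NM:i value of a SAM alignment line, or None if absent."""
    fields = line.rstrip("\n").split("\t")
    # The first 11 columns are mandatory; optional TAG:TYPE:VALUE fields follow.
    for field in fields[11:]:
        if field.startswith("NM:i:"):
            return int(field[5:])
    return None


def within_mismatch_limit(line: str, max_mismatches: int = 4) -> bool:
    """True if the record carries an NM tag no larger than the limit."""
    nm = edit_distance_from_sam(line)
    return nm is not None and nm <= max_mismatches
```

Records without an `NM` tag are treated as failing the check, which is a conservative choice made for this sketch.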

2. Determining whether the DNA sample to be tested is ancient DNA

The specific steps are as follows:

performing variant detection with both GATK and SoapSnp, based on the DNA information of the DNA sample to be tested, to determine the variant information of the DNA sample to be tested; and

determining, based on the variant information of the DNA sample to be tested, whether the DNA sample to be tested is ancient DNA,

wherein the presence of at least one of the following indicates that the DNA sample to be tested is ancient DNA:

(1) the sequencing reads show the following deamination feature: relative to the reference genome, C->T and G->A mutations occur at a rate greater than 10% at the 5' and 3' ends of the sequencing reads;

(2) the sequencing reads show the following fragmentation feature: relative to the reference genome, the proportion of purines at the base immediately preceding the 5' end of the sequencing reads is significantly increased, and the proportion of pyrimidines at the base immediately following the 3' end is significantly increased,

when SoapSnp is used for the variant detection, results are output in cns format,

the variant information of the DNA sample to be tested includes single nucleotide variant information.
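The fragmentation feature in (2) can be measured with a short Python sketch such as the following. It assumes, for simplicity, that every fragment is aligned to the forward strand of a single reference sequence and is described by 0-based half-open coordinates; the function name is hypothetical. Deciding whether the resulting fractions are "significantly increased" requires a comparison against the background base composition of the reference, which is not shown.

```python
from typing import Iterable, Tuple


def flanking_base_fractions(reference: str,
                            fragments: Iterable[Tuple[int, int]]
                            ) -> Tuple[float, float]:
    """Fraction of purines just before the 5' end and pyrimidines just after the 3' end.

    fragments: (start, end) of each aligned fragment, 0-based, end exclusive,
    forward strand only (simplifying assumption).
    """
    ref = reference.upper()
    purine_before = n_before = 0
    pyrimidine_after = n_after = 0
    for start, end in fragments:
        if start > 0:
            # Reference base immediately preceding the fragment's 5' end.
            n_before += 1
            purine_before += ref[start - 1] in "AG"
        if end < len(ref):
            # Reference base immediately following the fragment's 3' end.
            n_after += 1
            pyrimidine_after += ref[end] in "CT"
    frac_before = purine_before / n_before if n_before else 0.0
    frac_after = pyrimidine_after / n_after if n_after else 0.0
    return frac_before, frac_after
```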

3. Determining the sex of the individual to which the ancient DNA sample belongs

The specific steps are as follows:

based on the DNA information of the DNA sample to be tested, determining at least one of the following sex determination parameters: the ratio of the number of sequencing reads aligned to the X chromosome to the number of sequencing reads aligned to the Y chromosome, the ratio of the number of sequencing reads aligned to the X chromosome to the number of sequencing reads aligned to chromosome 8, the sequencing depth of each chromosome, and the heterozygote ratio of each chromosome; and

based on at least one of the sex determination parameters, determining the sex of the individual to which the ancient DNA sample to be tested belongs, wherein:

(1) a ratio of the number of sequencing reads aligned to the X chromosome to the number aligned to the Y chromosome close to 9:1 indicates that the individual to which the ancient DNA sample to be tested belongs is male; a ratio of the number of sequencing reads aligned to the X chromosome to the number aligned to chromosome 8 close to 1:1 indicates that the individual is female;

(2) a Y chromosome sequencing depth close to the sequencing depth of the other chromosomes indicates that the individual to which the ancient DNA sample to be tested belongs is male; a Y chromosome sequencing depth significantly lower than that of the other chromosomes indicates that the individual is female;

(3) an X chromosome heterozygote ratio significantly lower than that of the other chromosomes indicates that the individual to which the ancient DNA sample to be tested belongs is male; an X chromosome heterozygote ratio not significantly lower than that of the other chromosomes indicates that the individual is female.
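Criteria (2) and (3) can be sketched in Python as follows. All numerical cut-offs (`low`, `high`, `fraction`) are hypothetical placeholders for "close to" and "significantly lower", which the source does not quantify, and the function names and the dictionary layout keyed by chromosome name are likewise assumptions.

```python
from typing import Dict


def _mean_of_others(values: Dict[str, float]) -> float:
    """Mean over all chromosomes except X and Y."""
    others = [v for chrom, v in values.items() if chrom not in ("X", "Y")]
    if not others:
        raise ValueError("no non-sex chromosomes supplied")
    return sum(others) / len(others)


def sex_from_depth(depth: Dict[str, float],
                   low: float = 0.1, high: float = 0.4) -> str:
    """Criterion (2): compare Y depth with the mean depth of other chromosomes."""
    mean_other = _mean_of_others(depth)
    ratio = depth.get("Y", 0.0) / mean_other
    if ratio < low:      # Y depth far below the other chromosomes
        return "female"
    if ratio >= high:    # Y depth comparable to the other chromosomes
        return "male"
    return "undetermined"


def sex_from_heterozygosity(het: Dict[str, float],
                            fraction: float = 0.5) -> str:
    """Criterion (3): a markedly lower X heterozygote ratio suggests male."""
    mean_other = _mean_of_others(het)
    return "male" if het["X"] < fraction * mean_other else "female"
```

Because each criterion can be inconclusive on low-coverage ancient samples, combining the calls from several criteria is a reasonable way to use these functions.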

4. Determining the male modern DNA contamination rate in a female ancient DNA sample

The specific steps are as follows:

C=(y/R) × (1/n),

wherein the Y chromosome specific region is obtained as follows: the Y chromosome sequence of the human reference genome is split into a set of artificial reads about 30 bp long; the set of artificial reads is aligned to the part of the human reference genome that does not include the Y chromosome, to obtain aligned artificial reads; of all aligned artificial reads, only those with alignment errors of 3 or more bases are retained, and artificial reads containing repetitive sequence regions are then removed; all remaining artificial reads make up the Y chromosome specific region. The human reference genome is Hg19.

Example 1

Following the method of the invention described in the "General method" above, standardized information analysis was performed on ancient DNA samples to be tested, as follows:

There were 2 ancient DNA samples to be tested: 1 ancient human bone sample and 1 ancient human hair sample. Both ancient biological samples were provided by the Institute of Vertebrate Paleontology and Paleoanthropology, Chinese Academy of Sciences, and date from about 3000-8000 years ago. One is an ancient human bone sample (Human_Bone) and one is an ancient human hair sample (Human_Hair) (see Table 1).

The detailed procedure is as follows:

I. Acquisition of Illumina second-generation sequencing data

The invention is based on Illumina second-generation sequencing data. For the DNA extraction and library construction methods used for the 2 ancient biological samples, see:

[1] N. Rohland, M. Hofreiter. Ancient DNA extraction from bones and teeth [J]. Nature Protocols, 2007, 2(7): 1756-1762. doi:10.1038/nprot.2007.247;

[2] M.T. Gansauge, M. Meyer. Single-stranded DNA library preparation for the sequencing of ancient or damaged DNA [J]. Nature Protocols, 2013, 8(3): 737-748. doi:10.1038/nprot.2013.038

both of which are incorporated herein by reference in their entirety.

The ancient human samples were sequenced on an Illumina Hiseq 2000 with a PE 50 strategy; the amount of raw Fastq sequencing data for each sample is detailed in Table 1. During library construction, the uracil produced by deamination was removed for the hair sample, while the uracil in the bone sample was left untreated. The final sequencing data volume of the bone and hair samples was 15 Gb.

II. Quality control of the raw Fastq data

Following the data filtering method described in the technical solution, the raw Fastq data of the 2 samples (ancient human bone and hair) were strictly filtered. The specific criteria were as follows: 1) if a read contains adapter sequence, the adapter portion is trimmed; 2) if bases with quality value Q≤10 account for 50% or more of the total bases of a read, the whole read is deleted; if the low-quality bases are at the end of a read and do not exceed 50% of the whole read, only the low-quality portion is trimmed; 3) reads with a proportion of N greater than 10% are removed; if N regions exist only at the two ends of a read, only the N regions at the two ends are trimmed and the remaining bases are retained; 4) reads shorter than 30 bp or longer than 49 bp are removed. The data volume after filtering is detailed in Table 2.

Three, alignment analysis

After quality control of the raw off-machine data, the DNA data of the ancient human bone sample and of the hair sample were aligned using SoapAligner and BWA respectively; the reference genome version used was human Hg19.

The command parameters for the SoapAligner alignment are as follows:

Hair sample:

soap -D hg19.fa.index -a Human_Hair.fq1.gz -b Human_Hair.fq2.gz -o Human_Hair.soap -2 Human_Hair.single -u Human_Hair.unmapped -n 5 -r 1 -l 30 -s 30 -v 2 -p 4 -m 0 -x 80

Bone sample:

soap -D hg19.fa.index -a Human_Hair.fq1.gz -b Human_Hair.fq2.gz -o Human_Hair.soap -2 Human_Hair.single -u Human_Hair.unmapped -n 5 -r 1 -l 30 -s 30 -v 4 -p 4 -m 0 -x 80

The command parameters for the BWA alignment are as follows:

Hair sample:

bwa aln hg19.fa Human_Hair.fq1.gz -l 30 -k 2 -t 4 -q 15 -I > Human_Hair.fq1.sai; bwa aln hg19.fa Human_Hair.fq2.gz -l 30 -k 2 -t 4 -q 15 -I > Human_Hair.fq2.sai; bwa sampe -a 80 hg19.fa Human_Hair.fq1.sai Human_Hair.fq2.sai Human_Hair.fq1.gz Human_Hair.fq2.gz > Human_Hair.sam

Bone sample:

bwa aln hg19.fa Human_Hair.fq1.gz -l 30 -k 4 -t 4 -q 15 -I > Human_Hair.fq1.sai; bwa aln hg19.fa Human_Hair.fq2.gz -l 30 -k 4 -t 4 -q 15 -I > Human_Hair.fq2.sai; bwa sampe -a 80 hg19.fa Human_Hair.fq1.sai Human_Hair.fq2.sai Human_Hair.fq1.gz Human_Hair.fq2.gz > Human_Hair.sam

After alignment, the uniquely aligned reads were extracted from the alignment results, and low-quality alignments and unpaired alignments were filtered out, for use in the next analysis step. The resulting data are given in Table 2 and Table 3. For the Chinese ancient human bone sample, only a very small fraction of the sequencing data aligned to the human genome (0.1%~0.2%), which is not enough to support downstream analyses such as variant detection. The information analysis of the bone sample was therefore limited to filtering, alignment and DNA damage analysis. For the Chinese ancient human hair sample, the alignment rate after filtering reached 10%, which is enough to support downstream analyses such as SNP calling, so the inventors carried out a more comprehensive information analysis of the sequencing results of this sample, including filtering, alignment, DNA damage analysis, depth and coverage analysis, SNP calling analysis, sex determination analysis and modern human contamination rate analysis.
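The post-alignment selection described above can be sketched in Python for the SAM output of BWA. This is an illustration under stated assumptions rather than the procedure actually used: the text gives no thresholds, so the mapping-quality cut-off `min_mapq` is hypothetical, "unpaired" is interpreted as the SAM proper-pair flag (0x2) not being set, and "unique" is interpreted through the `XT:A:U` optional tag that `bwa aln`/`bwa sampe` writes for unique hits.

```python
def keep_alignment(sam_line: str, min_mapq: int = 30) -> bool:
    """Return True for a uniquely aligned, properly paired, good-quality SAM record.

    sam_line must be an alignment record (not an '@' header line).
    min_mapq is a hypothetical threshold; the source text does not state one.
    """
    fields = sam_line.rstrip("\n").split("\t")
    flag, mapq = int(fields[1]), int(fields[4])
    if flag & 0x4:          # read is unmapped
        return False
    if not flag & 0x2:      # read is not mapped in a proper pair
        return False
    if mapq < min_mapq:     # low-quality alignment
        return False
    # Optional fields start at column 12; XT:A:U marks a unique hit in BWA output.
    return "XT:A:U" in fields[11:]


def filter_sam(lines, min_mapq: int = 30):
    """Yield header lines unchanged and only the alignment records that pass."""
    for line in lines:
        if line.startswith("@") or keep_alignment(line, min_mapq):
            yield line
```

A caller would typically stream the lines of `Human_Hair.sam` through `filter_sam` and write the result to a new file before variant detection.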

Four, variant detection

The present invention performs variant detection on the ancient human hair sample using both GATK and SoapSnp.

GATK:

Variant detection with GATK was carried out entirely according to the GATK workflow; for details see https://www.broadinstitute.org/gatk/. The sam file generated by the bwa alignment was first reordered in karyotypic order; the sam-format alignment file was then converted to bam format; the entries in the bam file were sorted by physical position in ascending order; duplicate reads aligned to the same chromosomal position were marked; reads aligned to indel regions were realigned; and the base quality values were recalibrated, finally producing Human_Hair.bam and Human_Hair.metrics. Variant detection was then performed with UnifiedGenotyper. The specific parameters are as follows:

java -jar GenomeAnalysisTK.jar -glm SNP -l INFO -R hg19.fa -T UnifiedGenotyper -I Human_Hair.bam -D dbsnp_137.hg19.vcf -o Human_Hair.vcf -metrics Human_Hair.metrics -stand_call_conf 10 -stand_emit_conf 30

SoapSnp:

Variant detection with SoapSnp likewise first reorders the SoapAligner alignment results in karyotypic order, and then sorts them within each chromosome by physical position in ascending order. The specific parameters are as follows:

soapsnp -i Human_Hair.soap.gz -d hg19.fa -o Human_Hair.cns -r 0.0001 -t -u -L 49 -m -M Human_Hair.mat

Five, ancient DNA identification

During single-stranded DNA library construction for the ancient human hair sample, the inventors used a special enzyme, UDG, to remove uracil, so as to prevent C->T substitutions from making the results of subsequent analyses inaccurate. As a result, no obvious DNA damage pattern can be seen in the library constructed by this single-stranded method. The ancient human bone sample was not treated with UDG during library construction; therefore, the inventors used the ancient human bone sample for ancient DNA identification.

The inventors used mapDamage to compute statistics on, and plot, the mismatch pattern and the fragmentation pattern of the sequencing results of the ancient human bone sample. The specific parameters used are as follows:

perl mapDamage-0.3.3.pl map -i Human_Bone.sam -d directory -r hg19.fa -c -t Hair -l 49; perl mapDamage-0.3.3.pl merge -d directory; mapDamage-0.3.3.pl plot -d directory

The results are shown in Fig. 2. In terms of the fragmentation pattern, the proportion of purines at the 5' end is significantly increased, and the proportion of pyrimidines is correspondingly significantly reduced. In terms of the deamination pattern, a large number of C->T substitutions have accumulated at the 5' end, and a correspondingly large number of G->A substitutions have accumulated at the 3' end. Both the fragmentation pattern and the deamination features are therefore fully consistent with the characteristics of ancient DNA, and the inventors can conclude that the sequencing data obtained are ancient DNA.
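The deamination pattern discussed above can be illustrated with a small Python sketch that computes, for the first few positions from each read end, the C->T frequency at the 5' end and the G->A frequency at the 3' end. This is not the mapDamage implementation; it is a simplified stand-in that assumes each read is supplied together with the reference segment it aligns to, with equal length, no gaps, and both strings written in the read's 5'->3' orientation. The function name and input format are hypothetical.

```python
from typing import Iterable, List, Tuple


def damage_profile(pairs: Iterable[Tuple[str, str]],
                   k: int = 5) -> Tuple[List[float], List[float]]:
    """Per-position substitution frequencies near the read ends.

    pairs: (read, reference) strings of equal length, ungapped.
    Returns (ct, ga) where ct[i] is the C->T frequency at position i from the
    5' end and ga[i] is the G->A frequency at position i from the 3' end.
    """
    ct_hits, c_total = [0] * k, [0] * k
    ga_hits, g_total = [0] * k, [0] * k
    for read, ref in pairs:
        for i in range(min(k, len(read))):
            if ref[i] == "C":                 # reference C seen at 5' position i
                c_total[i] += 1
                ct_hits[i] += read[i] == "T"
            j = len(read) - 1 - i             # mirror position from the 3' end
            if ref[j] == "G":                 # reference G seen at 3' position i
                g_total[i] += 1
                ga_hits[i] += read[j] == "A"
    ct = [h / t if t else 0.0 for h, t in zip(ct_hits, c_total)]
    ga = [h / t if t else 0.0 for h, t in zip(ga_hits, g_total)]
    return ct, ga
```

Under claim 4, terminal frequencies greater than 10% would be read as a deamination signature; with real data the reference segments would be extracted from Hg19 using the alignment coordinates.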

Six, sex determination

The inventors performed sex determination analysis on the ancient individual to whom the hair sample belonged from three aspects:

1: Hypothesis: if the ancient individual to whom the hair sample belonged is male, the proportion of heterozygous sites on the X chromosome will be far lower than on the other chromosomes.

Result: the proportion of heterozygous sites on the X chromosome is not significantly lower than on the other chromosomes (see Fig. 3). The result is female.

2: Hypothesis: the effective length ratio of the X chromosome to the Y chromosome is 9:1, and that of the X chromosome to chromosome 8 is close to 1:1. If the individual is male, the ratio of the number of reads aligned to the X chromosome to the number of reads aligned to the Y chromosome should be close to 9:1; if the individual is female, the ratio between the X chromosome and chromosome 8 should be close to 1:1.

Result: the ratio of the number of reads mapped to the X chromosome to the number mapped to chromosome 8 is close to 1:1, and the ratio of the X chromosome to the Y chromosome is 40:1, far greater than 9:1 (see Fig. 4). The result is female.

3: Hypothesis: if the individual is male, the sequencing depth of the Y chromosome should be close to that of the other chromosomes.

Result: the sequencing depth of the Y chromosome is significantly lower than that of every region of the other chromosomes (see Fig. 5). The result is female.

Taking the results of the three aspects together, the ancient individual to whom the hair sample belonged is female.
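The read-count test of aspect 2 can be sketched in Python as follows. The 9:1 and 1:1 reference ratios come from the text; the tolerance `tol`, the function names and the decision rule that combines the two ratios are assumptions made for illustration, since the text only says "close to".

```python
from typing import Tuple


def read_count_ratios(x_reads: int, y_reads: int, chr8_reads: int) -> Tuple[float, float]:
    """Return (X:Y ratio, X:chr8 ratio) of aligned read counts."""
    if chr8_reads <= 0:
        raise ValueError("chr8_reads must be positive")
    x_to_y = x_reads / y_reads if y_reads else float("inf")
    return x_to_y, x_reads / chr8_reads


def call_sex(x_reads: int, y_reads: int, chr8_reads: int, tol: float = 0.25) -> str:
    """Classify using the 9:1 (male) and 1:1 (female) expectations.

    tol is a hypothetical relative tolerance for 'close to'.
    """
    x_to_y, x_to_8 = read_count_ratios(x_reads, y_reads, chr8_reads)
    if abs(x_to_y - 9.0) <= 9.0 * tol:
        return "male"
    if x_to_y > 9.0 * (1 + tol) and abs(x_to_8 - 1.0) <= tol:
        return "female"
    return "undetermined"
```

With illustrative counts such as 40000 reads on X, 1000 on Y and 41000 on chromosome 8 (an X:Y ratio of 40:1, as reported above), `call_sex` returns "female". The counts themselves are made up for the example.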

Seven, assessment of the modern human DNA contamination rate

The inventors sequenced samples from only 1 individual, could not know the kinship between this ancient individual and other ancient or modern humans, and had a relatively small amount of sequencing data. Segregating sites specific to the ancient individual therefore could not be found, and mtDNA and autosomal data could not be used to assess the modern human contamination rate. However, since the inventors determined that the individual is female, the contamination rate from modern male individuals can be assessed. The basic principle is to align the obtained reads to the Y chromosome unique region (YUR, a region that is not homologous to any other chromosome and contains no repetitive sequence), and then, from the YUR and the number of specifically aligned reads, calculate the value expected if the individual were male; the ratio between the number of reads actually aligned and this expected value is the contamination rate from males.
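How a Y chromosome unique region might be derived, following the procedure set out in claim 10 (tile the Y sequence into artificial reads of about 30 bp, compare them with the rest of the genome, keep those with 3 or more mismatches, then drop those in repeat regions), can be shown with a toy Python sketch. It is a brute-force stand-in for a real aligner and is only practical for short test sequences; the non-overlapping tiling, the use of Hamming distance on the forward strand only, and the representation of repeats as coordinate intervals are all assumptions.

```python
from typing import Iterable, List, Tuple


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def y_unique_windows(y_seq: str, other_seq: str, read_len: int = 30,
                     min_mismatch: int = 3,
                     repeats: Iterable[Tuple[int, int]] = ()) -> List[Tuple[int, int]]:
    """Return half-open (start, end) tiles of y_seq treated as Y-unique.

    other_seq stands in for the reference genome without the Y chromosome.
    repeats are half-open intervals on y_seq marking repetitive sequence.
    """
    repeats = list(repeats)
    # Every window of the non-Y sequence that an artificial read could match.
    others = {other_seq[i:i + read_len]
              for i in range(len(other_seq) - read_len + 1)}
    kept = []
    for start in range(0, len(y_seq) - read_len + 1, read_len):
        end = start + read_len
        if any(start < r_end and r_start < end for r_start, r_end in repeats):
            continue  # tile overlaps a repeat region
        window = y_seq[start:end]
        best = min((hamming(window, o) for o in others), default=read_len)
        if best >= min_mismatch:  # best non-Y match still has >= 3 mismatches
            kept.append((start, end))
    return kept
```

On real data the comparison step would be done with an aligner against Hg19 without the Y chromosome, and the repeat intervals would come from a repeat annotation; neither is specified in more detail by the text.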

The modern male contamination rate finally obtained is 1.72%~5.98%, which is higher than the contamination rates reported in the literature for other ancient human DNA; because the amount of usable data obtained is relatively low, a certain degree of underestimation or overestimation is possible. In subsequent analyses, reads that may come from modern humans need to be thoroughly filtered out to ensure the reliability of the results.
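The contamination estimate can be written as a short Python function using the formula C = (y/R) × (1/n) given in claim 9, where y is the number of reads aligned to the Y chromosome specific region, n is the total number of reads aligned to the genome, and R is the expected proportion of reads falling in that region if the individual were male. In this sketch R is simply passed in as a value computed beforehand; the function name and the numbers in the usage note are hypothetical and are not the data behind the 1.72%~5.98% figure.

```python
def male_contamination_rate(y: int, n: int, r: float) -> float:
    """C = (y / R) * (1 / n): observed Y-specific fraction over the expected one.

    y: reads aligned to the Y chromosome specific region
    n: total reads aligned to the genome
    r: expected proportion of reads in that region under the male assumption
    """
    if n <= 0 or r <= 0:
        raise ValueError("n and r must be positive")
    return (y / r) * (1 / n)
```

For instance, with made-up inputs y = 30, n = 1000000 and R = 0.001, the function gives a contamination rate of about 0.03, that is, 3%. Algebraically the formula is the observed fraction y/n divided by the expected fraction R.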

Table 1. Sequencing data of the 2 ancient biological samples

Table 2. Filtering and alignment analysis results of the ancient human hair sample

Table 3. Filtering and alignment analysis results of the ancient human bone sample

In the description of this specification, reference to the terms "one embodiment", "some embodiments", "example", "specific example" or "some examples" means that a specific feature, structure, material or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, such schematic uses of the above terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

Although embodiments of the present invention have been shown and described, those skilled in the art will understand that various changes, modifications, substitutions and variations can be made to these embodiments without departing from the principle and purpose of the present invention, the scope of which is defined by the claims and their equivalents.

Claims

1. A method for obtaining DNA information of a DNA sample to be tested, characterized by comprising the following steps:

performing library construction and sequencing on the DNA sample to be tested to obtain sequencing data, wherein the library construction does not include a DNA fragmentation step, and the length of the sequencing reads does not exceed 100 bp;

aligning the sequencing data that have undergone filtering to obtain an alignment result, the alignment result comprising the DNA information of the DNA sample to be tested,

wherein,

the filtering includes at least one of the following:

(1) filtering out adapter sequences;

(2) filtering out low-quality bases with quality value Q≤10, wherein when the number of low-quality bases accounts for 50% or more of the total number of bases of the whole read, the whole read is deleted; when the low-quality bases are at the end of the read and their number does not exceed 50% of the whole read, only the low-quality bases are trimmed;

(3) filtering N regions, wherein when the proportion of N in a read is greater than 10%, the read is removed; when N regions exist only at the two ends of the read, only the N regions at the two ends of the read are trimmed;

the alignment allows at most 4 mismatched bases.

2. The method according to claim 1, characterized in that the alignment is performed using both SoapAligner and BWA.

3. The method according to claim 2, characterized in that when the alignment is performed using SoapAligner, an alignment result in Soap format is generated; when the alignment is performed using BWA, an alignment result in sam format is generated.

4. A method for determining whether a DNA sample to be tested is ancient DNA, characterized by comprising the following steps:

obtaining the DNA information of the DNA sample to be tested by the method according to any one of claims 1-3;

performing variant detection based on the DNA information of the DNA sample to be tested, to determine the variant information of the DNA sample to be tested; and

(1) the sequencing reads exhibit the following deamination feature: relative to the reference genome, the 5' ends and 3' ends of the sequencing reads show more than 10% C->T and G->A substitutions, respectively;

(2) the sequencing reads exhibit the following fragmentation feature: relative to the reference genome, the proportion of purines at the base immediately preceding the 5' end of the sequencing reads is significantly increased, and the proportion of pyrimidines at the base immediately following the 3' end is significantly increased.

5. The method according to claim 4, characterized in that the variant detection is performed using both GATK and SoapSnp.

6. The method according to claim 5, characterized in that when the variant detection is performed using SoapSnp, a result in cns format is output.

7. The method according to claim 4, characterized in that the variant information of the DNA sample to be tested includes single nucleotide variation information.

8. A method for determining the sex of the individual to whom an ancient DNA sample belongs, characterized by comprising the following steps:

obtaining the DNA information of the ancient DNA sample to be tested by the method according to any one of claims 1-3;

determining, based on the DNA information of the DNA sample to be tested, at least one of the following sex determination parameters: the ratio of the number of sequencing reads aligned to the X chromosome to the number of sequencing reads aligned to the Y chromosome, the ratio of the number of sequencing reads aligned to the X chromosome to the number of sequencing reads aligned to chromosome 8, the sequencing depth of each chromosome, and the heterozygote ratio of each chromosome; and

determining, based on at least one of the sex determination parameters, the sex of the individual to whom the ancient DNA sample to be tested belongs, wherein:

(1) a ratio of the number of sequencing reads aligned to the X chromosome to the number of sequencing reads aligned to the Y chromosome close to 9:1 indicates that the individual to whom the ancient DNA sample to be tested belongs is male; a ratio of the number of sequencing reads aligned to the X chromosome to the number of sequencing reads aligned to chromosome 8 close to 1:1 indicates that the individual is female;

(2) a sequencing depth of the Y chromosome close to the sequencing depth of the other chromosomes indicates that the individual to whom the ancient DNA sample to be tested belongs is male; a sequencing depth of the Y chromosome significantly lower than the sequencing depth of the other chromosomes indicates that the individual is female;

(3) a heterozygote ratio of the X chromosome significantly lower than the heterozygote ratio of the other chromosomes indicates that the individual to whom the ancient DNA sample to be tested belongs is male; a heterozygote ratio of the X chromosome not significantly lower than the heterozygote ratio of the other chromosomes indicates that the individual is female.

9. A method for determining the male modern DNA contamination rate in a female ancient DNA sample, characterized by comprising the following steps:

assuming that the female ancient DNA sample is derived from a male, and determining, based on the DNA information of the ancient DNA sample to be tested, the expected proportion of sequencing reads aligned to the Y chromosome specific region, wherein the expected proportion of sequencing reads aligned to the Y chromosome specific region is calculated as:

R = (number of sequencing reads aligned to the Y chromosome specific region / number of sequencing reads aligned to the genome) × 0.5; and

determining the Y chromosome contamination rate of the ancient DNA sample to be tested based on the expected proportion of sequencing reads aligned to the Y chromosome specific region, the Y chromosome contamination rate of the ancient DNA sample to be tested being the male modern DNA contamination rate in the female ancient DNA sample,

C = (y/R) × (1/n),

wherein C is the Y chromosome contamination rate, y is the number of sequencing reads aligned to the Y chromosome specific region, R is the expected proportion of sequencing reads aligned to the Y chromosome specific region, and n is the total number of sequencing reads aligned to the genome.

10. The method according to claim 9, characterized in that the Y chromosome specific region is obtained by the following method:

dividing the Y chromosome sequence of the human reference genome into a set of artificial reads of about 30 bp in length;

aligning the set of artificial reads to the part of the human reference genome that does not include the Y chromosome, to obtain the aligned artificial reads;

for all aligned artificial reads, retaining only the artificial reads showing 3 or more mismatched bases, and then removing the artificial reads that contain repetitive sequence regions, the remaining artificial reads constituting the Y chromosome specific region,

optionally, the human reference genome is Hg19.