CN110970086B - Method for filtering modern DNA pollution from ancient DNA data and application thereof - Google Patents

Method for filtering modern DNA pollution from ancient DNA data and application thereof Download PDF

Info

Publication number
CN110970086B
CN110970086B CN201811161836.4A CN201811161836A CN110970086B CN 110970086 B CN110970086 B CN 110970086B CN 201811161836 A CN201811161836 A CN 201811161836A CN 110970086 B CN110970086 B CN 110970086B
Authority
CN
China
Prior art keywords
dna
data
ancient
pollution
filtering
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811161836.4A
Other languages
Chinese (zh)
Other versions
CN110970086A (en
Inventor
兰天明
徐文浩
罗忻
郭小森
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Huada Sansheng Garden Technology Co ltd
Original Assignee
Shenzhen Huada Sansheng Garden Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Huada Sansheng Garden Technology Co ltd filed Critical Shenzhen Huada Sansheng Garden Technology Co ltd
Priority to CN201811161836.4A priority Critical patent/CN110970086B/en
Publication of CN110970086A publication Critical patent/CN110970086A/en
Application granted granted Critical
Publication of CN110970086B publication Critical patent/CN110970086B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Abstract

The application discloses a method for filtering modern DNA pollution from ancient DNA data and application thereof. The method for filtering modern DNA pollution comprises the steps of performing ancient DNA characteristic filtration on the Illumina second-generation sequencing original data of the ancient DNA, and particularly comprises the steps of performing modern DNA pollution filtration on the Illumina second-generation sequencing original data according to the ancient DNA deamination characteristics, namely retaining at least N C-T mutations at the 5 'end and/or at least M C-T mutation reading lengths at the 3' end caused by the deamination of the ancient DNA after filtration. The method can process the currently widely used Illumina platform data, adopts the comparison algorithm and parameters which are most suitable for the ancient DNA, can lead the pollution rate after filtration to reach 0 under the proper N and M values, is one of the methods with the maximum filtration intensity so far, and lays the foundation for the subsequent deep study of the ancient DNA.

Description

Method for filtering modern DNA pollution from ancient DNA data and application thereof
Technical Field
The application relates to the field of ancient DNA detection and analysis, in particular to a method for filtering modern DNA pollution from ancient DNA data and application thereof.
Background
Ancient DNA refers to DNA information of ancient organisms obtained from archaeological materials, paleofossils, biological remains, or sediments. Mainly from ancient organism tissues preserved under special conditions such as museum specimens, amber or permafrost, forensic samples and the like. Modern DNA is DNA information of modern biological materials relative to paleo DNA.
With the advent and development of PCR technology and sequencing technology, ancient DNA research has progressed rapidly in recent 30 years. In 1984, higuchi et al successfully extracted and amplified 228bp of DNA from the ancient muscle tissue of the vanishing donkey, indicating the first successful extraction of ancient DNA by humans. Thereafter, with the advent and development of molecular technology, particularly PCR technology, it has become possible to obtain paleo DNA from various paleo-biological species and tissues, including bones, teeth, soft tissues, hair and even fecal fossils. Although the PCR technology has driven the development of the whole ancient DNA field, almost all researches are limited to the researches of mitochondrial DNA and short nuclear DNA in the period from 1984 to 2006, which greatly limits the utilization of the ancient DNA.
In 2006, the advent of new generation sequencing technology has truly pushed ancient DNA research to the full genome level and has progressed rapidly over the next decade. For example, whole genome sequencing of the first palace in 2008 was completed, whole genome sequencing of the first paleo-human in 2010 was completed, and high-depth genome data release of the first danish soldier in 2012; in 2015, eske Willerslev et al performed genomic sequencing on 101 ancient human genomes, formally marking that the study of ancient DNA was upgraded to population study level. There were then several groups of subsequent studies of the ancient genomics that have successively reported that push the ancient DNA study toward the top. At present, more than 1300 ancient human genome data are published, more than 10 and more than 50 ancient animal genome data are published, and the development speed of the ancient DNA field is far more than the estimation of the development speed of the computer field by moore's law.
With the rapid development of the field of ancient genome, the problems are increased, and the most critical problem is how to remove as much modern DNA pollution as possible while obtaining endogenous ancient DNA to the maximum extent. In the long-term preservation process of the archaea, serious putrefaction degradation can occur, a large amount of animal, plant and microorganism pollution can be carried, endogenous archaea DNA is difficult to directly obtain through DNA extraction, exogenous DNA pollution is difficult to remove through experimental means, and exogenous DNA pollution can only be filtered through subsequent information analysis means. Although most of the foreign DNA contamination can be removed by alignment with the reference genome, modern DNA contamination for closely related species is difficult to remove by alignment. Such as contamination of the human genome data in ancient times with DNA of modern people, which is the most difficult to remove. In this case, the most recent approach is to evaluate the contamination of modern DNA by means of bioinformatics, and if the percentage of contamination is controlled to be within 1%, the contamination can be not treated, since the subsequent analysis is hardly affected by too much contamination. However, in many cases, the contamination cannot be evaluated by a later bioinformatic analysis method, or the evaluation result is greater than 1%, and in all cases, some methods are necessary to remove the contamination to ensure the reliability of the subsequent analysis. At present, no method can well filter exogenous DNA pollution, especially modern DNA pollution of kindred species.
Disclosure of Invention
The application aims to provide a novel method for filtering modern DNA pollution from ancient DNA data and application thereof.
The application adopts the following technical scheme:
the application discloses a method for filtering modern DNA pollution from ancient DNA data, which comprises an ancient DNA characteristic filtering step for the Illumina second-generation sequencing original data of the ancient DNA, wherein the ancient DNA characteristic filtering step comprises the step of filtering the modern DNA pollution of the Illumina second-generation sequencing original data according to the deamination characteristics of the ancient DNA, and specifically comprises the steps of reserving the reading length of at least N C-T mutations at the 5 'end caused by deamination of the ancient DNA after filtering and/or reserving the reading length of at least M C-T mutations at the 3' end caused by deamination of the ancient DNA after filtering.
It should be noted that, the result of Illumina sequencing data shows that, the two ends of the ancient DNA read length have C- > T mutation characteristics of the 5 'end and the 3' end of the DNA fragment caused by deamination, and the damage frequency is generally higher than 10% at the two ends of the ancient DNA, based on the above knowledge, the application particularly provides a method for filtering modern DNA pollution from the ancient DNA data, and the modern DNA pollution in the Illumina second-generation sequencing original data can be effectively removed by the filtering method; in the other method, the comparison algorithm and the comparison algorithm parameters, and N and M are further optimized, so that the pollution rate after filtration can reach 0, and the method is one of the methods with the largest filtration intensity so far; the specific optimization method is described in detail in the following schemes.
It should be further noted that, the application retains the read length of at least N C- > T mutations at the 5 'end caused by deamination of the paleo DNA after filtration, and/or retains the read length of at least M C- > T mutations at the 3' end caused by deamination of the paleo DNA after filtration, specifically, only the screening result of the 5 'end or the 3' end may be retained, and also the screening result of the 5 'end and the screening result of the 3' end may be retained. In one implementation of the application, it has been demonstrated that the retention rate of RA for single-terminal filtration is too small, i.e., only 5 'or 3' screening results are retained, and that the retention rate is small, and therefore, in a preferred embodiment of the application it is suggested to retain both 5 'screening results and 3' screening results.
Preferably, the filtering method of the present application further comprises comparing the Illumina second generation sequencing raw data using at least one comparison algorithm of bwmem, bwialn seed, BWAaln disable seed and bowtie 2.
The application also discloses a method for filtering the modern DNA pollution from the ancient DNA data, which is one of the important improvements that the comparison algorithm and the comparison algorithm parameters which are most suitable for the ancient DNA are screened by comparing a plurality of comparison algorithms which are conventionally used, so that a foundation is laid for efficiently filtering the modern DNA pollution and obtaining the high-quality ancient DNA. It will be appreciated that BWAMEM, BWAAln seed, BWAaln disable seed and bowtie2 are but a few of the currently widely used alignment algorithms employed in one implementation of the present application, and that other alignment algorithms may be employed without limitation.
Preferably, the step of filtering the characteristic of the ancient DNA further comprises the step of filtering modern DNA pollution according to the characteristic of apurinic of the ancient DNA, and specifically comprises the step of reserving the reading lengths of guanine and adenine at the front and rear bases of the breaking position of the DNA chain caused by apurinic of the ancient DNA after filtering.
The filtering of the apurinic characteristic of the ancient DNA is particularly suitable for the ancient DNA with remarkable apurinic characteristic, and if the apurinic characteristic is not remarkable, the filtering by using the characteristic is not recommended. Therefore, prior to filtration using the apurinic profile of the paleo DNA, it is suggested to evaluate the significance of the apurinic profile of the paleo DNA in advance. Wherein the degree of significance is statistically defined as the significance or non-significance of the apurinic feature, and is not specifically defined herein.
Preferably, the step of filtering the paleo DNA features further comprises filtering modern DNA contamination according to paleo DNA fragmentation features, specifically comprises setting a threshold according to paleo DNA fragments, and filtering to remove read lengths greater than the set threshold.
It should be noted that the length distribution of the ancient DNA fragments is mainly 40bp to 80bp, while the length of the fragments polluted by modern DNA is usually more than 80bp; thus, the present application filters ancient DNA according to this feature. Under a loose filtering condition, the threshold is set to be 100bp, namely, the reading length larger than 100bp is removed through filtering; under a strict filtering condition, the threshold is set to 80bp, namely, the reading length larger than 80bp is removed by filtering; the specific set threshold may be changed or optimized according to the ancient DNA of different species or different ages, and is not particularly limited herein.
Preferably, the method for filtering modern DNA pollution from the ancient DNA data further comprises a data quality control step of the Illumina second-generation sequencing original data of the ancient DNA; the data quality control step comprises the following steps:
1) Filtering the joint, specifically comprising cutting off the joint sequence if the read length contains the joint sequence, and reserving the rest part;
2) Filtering low-quality bases, wherein the method specifically comprises deleting the whole read length if the number of bases with the quality value Q less than or equal to 10 is more than 50% of the total number of bases in the whole read length; if the base with the mass value Q less than or equal to 10 is at the end of the reading length and the number is not more than 50% of the whole reading length, only the low-mass base is cut off, and the rest part is reserved;
3) Filtering in the N area, specifically comprising removing the read length with the N content ratio of more than 10%; if the N region exists only at the two ends of the reading length, cutting off the N region at the two ends of the reading length, and reserving the rest bases;
4) The read length less than 30bp in length was removed.
It should be noted that, the data quality control may refer to the existing quality control scheme of Illumina second generation sequencing, for example, removing a linker sequence, removing a low-quality base or region, removing an N region, etc., and may refer to the existing data quality control; however, in order to obtain better ancient DNA filtering effect, the specific parameters of the conditions are optimized. In addition, it is proposed to filter out a read length of less than 30bp, because the read length is less than 30bp, which can cause more erroneous comparison in the subsequent comparison process, in order to improve the quality and efficiency of the comparison, the present application particularly removes a read length of less than 30 bp.
The application also discloses another method for filtering modern DNA pollution from ancient DNA data, which is hereinafter referred to as a second method, comprising the following steps,
performing ancient DNA damage mode analysis on the obtained Illumina second-generation sequencing original data of the ancient DNA conforming to the data quality control, and selecting data with typical ancient DNA damage modes for subsequent analysis;
quantitatively adding modern DNA data into data with typical ancient DNA damage modes to obtain simulated modern DNA pollution data;
adopting at least two comparison algorithms to analyze the comparison rate and the pollution rate of the ancient DNA fragments in the simulated modern DNA pollution data, comparing the analysis result with the real comparison rate and the pollution rate, and screening the comparison algorithm with the comparison rate and the pollution rate closest to the real situation; in one implementation mode of the application, a BWAMEM comparison algorithm which is widely used at present and BWAAln, BWAaln disable seed and bowtie2 comparison algorithms are specifically adopted, so that the modern DNA pollution filtering method has better universal practicability;
the comparison algorithm evaluation specifically comprises the steps of gradually adjusting various parameters of the comparison algorithm, comparing the comparison rate and pollution rate analysis results obtained after parameter adjustment with the actual comparison rate and pollution rate, and screening comparison algorithm parameters of which the comparison rate and pollution rate are closest to the actual situation;
Performing an ancient DNA characteristic filtering step on simulated modern DNA pollution data according to the screened comparison algorithm and comparison algorithm parameters; the step of filtering the ancient DNA features comprises the step of filtering modern DNA pollution according to the characteristic of deamination of the ancient DNA, and specifically comprises the step of reserving the reading length of at least N C-T mutations at the 5 'end or M C-T mutations at the 3' end caused by deamination after filtering; the filtered ancient DNA retention rate and pollution rate of N and M are analyzed by adopting a screened comparison algorithm and comparison algorithm parameters, and the values of N and M are determined according to the required retention rate and pollution rate;
and carrying out modern DNA pollution filtering on the Illumina second-generation sequencing original data of the ancient DNA conforming to the data quality control by adopting a screened comparison algorithm, comparison algorithm parameters and determined N and M values.
It should be noted that the key point of the above method is to optimize the comparison algorithm, the comparison algorithm parameters and the ancient DNA deamination characteristic parameters so as to further improve the effect and quality of filtering the modern DNA pollution. In addition, the optimization method adopts real ancient DNA data to artificially and quantitatively add modern DNA pollution, has real and referenceable pollution rate and comparison rate, and can evaluate and optimize the comparison algorithm and comparison algorithm parameters more accurately. In addition, according to the optimization method, all related data are Illumina platform data, and an optimal comparison algorithm suitable for ancient DNA filtering of an Illumina sequencing platform and a comparison algorithm parameter combination are provided in a targeted manner.
Preferably, N and M are each integers from 1 to 5.
Preferably, the data with typical ancient DNA damage pattern refers to the read length with the following conditions, 1) the proportion of 5' -end purine of the read length is obviously increased, and the proportion of pyrimidine is correspondingly obviously reduced; 2) Sequencing data of a single-stranded DNA library construction method accumulate a large number of C- > T mutations at both the 5 'end and the 3' end; sequencing data of the double-stranded DNA library construction method accumulate a large number of C- > T mutations at the 3' end.
It should be noted that, in the preferred scheme of the application, the actual ancient DNA data is adopted to artificially and quantitatively add the modern DNA pollution, so as to simulate the modern DNA pollution, in order to improve the quality and effect of the simulated data, the ancient DNA in the simulated data is screened so as to ensure that the ancient DNA used for the simulation has a typical ancient DNA damage mode, for example, from the aspect of a fragmentation mode, the proportion of 5' -purine is obviously increased, and the proportion of pyrimidine is correspondingly obviously reduced; from the deamination model, if the sample uses a single-stranded DNA library construction method during library construction, a large number of C- > T mutations accumulate at the 5 'and 3' ends; if the sample uses a double-stranded DNA library construction method in the library construction process, a large number of G- > A mutations are accumulated at the 3' end; such data are fully compatible with both fragmentation and deamination profiles, and thus with the requirements of further analysis.
Preferably, the second method of the present application further comprises, in the case that the result of the analysis of the damage pattern of the ancient DNA shows that the ancient DNA has significant apurinic characteristics, performing apurinic characteristic filtration on Illumina second generation sequencing raw data of the ancient DNA conforming to the quality control of data, specifically comprising, retaining the reading lengths of guanine and adenine respectively before and after the breaking position of the DNA strand after filtration.
It is understood that the filtering of the apurinic characteristic of the ancient DNA is only applicable to the ancient DNA with obvious apurinic characteristic, and the filtering of the apurinic characteristic of the ancient DNA is not needed for the ancient DNA with insignificant apurinic characteristic.
Preferably, the second method of the application further comprises performing archaic DNA fragmentation feature filtering on the simulated modern DNA pollution data, specifically, setting a threshold according to the archaic DNA fragments, and filtering to remove the read length larger than the set threshold; the filtered paleo DNA retention rate and pollution rate are analyzed by adopting a screened comparison algorithm and comparison algorithm parameters, and the set threshold value is determined according to the required retention rate and pollution rate; and adopting the determined set threshold value to perform archaea DNA fragmentation characteristic filtering on the Illumina second-generation sequencing original data of the archaea conforming to the data quality control.
It should be noted that, in the process of filtering modern DNA contamination, the paleo DNA retention rate and the contamination rate are two important parameters; it can be understood that under strict filtration conditions, the filtration removes modern DNA and also removes part of the paleo DNA, so that the pollution rate is reduced and the paleo DNA retention rate is reduced; under loose filtration conditions, the contamination rate increases, and the paleo DNA retention rate also increases. In the actual filtration of the DNA contamination, the retention rate is as high as possible and the contamination rate is as low as possible, so that the retention rate and the contamination rate are required to be selected and removed according to the test requirements, and the filtration is not particularly limited.
Preferably, in the second method of the application, the Illumina second generation sequencing raw data of the paleo DNA conforming to the data quality control is the data obtained by screening under the following conditions,
1) Filtering the joint, specifically comprising cutting off the joint sequence if the read length contains the joint sequence, and reserving the rest part;
2) Filtering low-quality bases, wherein the method specifically comprises deleting the whole read length if the number of bases with the quality value Q less than or equal to 10 is more than 50% of the total number of bases in the whole read length; if the base with the mass value Q less than or equal to 10 is at the end of the reading length and the number is not more than 50% of the whole reading length, only the low-mass base is cut off, and the rest part is reserved;
3) Filtering in the N area, specifically comprising removing the read length with the N content ratio of more than 10%; if the N region exists only at the two ends of the reading length, cutting off the N region at the two ends of the reading length, and reserving the rest bases;
4) The read length less than 30bp in length was removed.
Preferably, in the second method of the present application, illumina second generation sequencing raw data of paleo DNA comprises at least one set of paleo DNA data less than 10000 years and at least one set of paleo DNA data greater than 10000 years, each set of data being analyzed independently.
It should be noted that in the preferred method of the present application, the comparison and the filtering conditions of the ancient DNA data greater than 10000 years and the ancient DNA data less than 10000 years are respectively optimized, so that the method of the present application has the same effect as the filtering of the ancient genome data within 10000 years for the filtering of the ancient genome data greater than 10000 years, and the defect that the existing method only has good filtering effect for the recent samples less than 10000 years, but has larger limitation for the samples exceeding 10000 years is overcome. In one implementation of the application, the ancient DNA from 1300 to 50000 years ago was analyzed separately, and the test data was more representative.
Preferably, the modern DNA data added to the simulated modern DNA pollution data comprises modern DNA of the same species or closely related species as the archaea of the archaea data.
In the preferred method of the application, the modern DNA of the same species or a nearby species is particularly added into the ancient DNA, so that the filtration method of the application can better filter and remove the modern DNA pollution of the same species or nearby species, thereby improving the quality of the obtained ancient DNA and laying a foundation for the ancient DNA research.
The application also discloses a method for filtering modern DNA pollution from ancient human DNA data, which comprises the steps of performing ancient human DNA feature filtering on Illumina second-generation sequencing original data of the ancient human DNA, wherein the ancient human DNA feature filtering step comprises the step of filtering the modern DNA pollution on the Illumina second-generation sequencing original data according to the deamination features and the fragmentation features of the ancient human DNA;
filtering according to the deamination characteristics of the ancient human DNA, wherein the filtering specifically comprises the steps of reserving at least 1C- > T mutation in five bases at the 5 'end after filtering, and simultaneously, at least 2C- > T mutation reading lengths in five bases at the 3' end; meanwhile, after filtration, at least 2C- > T mutations exist in five bases at the 5 'end, and at least 1C- > T mutation exists in five bases at the 3' end;
the filtering is carried out according to the ancient human DNA fragmentation characteristics, and specifically comprises the steps of filtering to remove the reading length of more than 100bp, preferably, filtering to remove the reading length of more than 80 bp.
In the specific embodiment of the application, the analysis test is carried out by taking ancient human DNA as an example, and specific parameters and conditions suitable for filtering modern DNA pollution of the ancient human DNA are obtained; it will be appreciated that the present application is not limited to only ancient human DNA, but that other ancient DNA from other archaea are equally applicable to the present application, and that the specific parameters may vary, as long as the second method is optimized according to the present application, and is not specifically limited herein.
Preferably, the step of filtering the characteristic of the ancient human DNA further comprises filtering modern DNA pollution according to the characteristic of apurinic characteristic of the ancient DNA, and specifically comprises the steps of keeping the reading lengths of guanine and adenine respectively before and after the breaking position of the DNA strand after filtering.
Preferably, the method further comprises the step of performing data quality control on Illumina second-generation sequencing original data of the ancient human DNA; the data quality control step comprises the steps of,
1) Filtering the joint, specifically comprising cutting off the joint sequence if the read length contains the joint sequence, and reserving the rest part;
2) Filtering low-quality bases, wherein the method specifically comprises deleting the whole read length if the number of bases with the quality value Q less than or equal to 10 is more than 50% of the total number of bases in the whole read length; if the base with the mass value Q less than or equal to 10 is at the end of the reading length and the number is not more than 50% of the whole reading length, only the low-mass base is cut off, and the rest part is reserved;
3) Filtering in the N area, specifically comprising removing the read length with the N content ratio of more than 10%; if the N region exists only at the two ends of the reading length, cutting off the N region at the two ends of the reading length, and reserving the rest bases;
4) The read length less than 30bp in length was removed.
The application also discloses the application of all the methods in ancient DNA research, forensic identification or judicial identification; or in the preparation of devices or apparatus for filtering modern DNA contamination from ancient DNA data.
It should be noted that, the modern DNA pollution filtering method of the present application not only can remove the modern DNA pollution in the ancient DNA by high-efficiency filtration, in one implementation mode of the present application, the modern DNA pollution rate of the filtered sample can reach 0 at most; moreover, the filtering method is suitable for ancient DNA of different ages; more importantly, the filtering method of the application adopts the widely used comparison algorithm and Illumina platform data at present, so that the filtering method of the application has better practicability. Therefore, the pollution filtering method can be applied to the ancient DNA research, and lays a foundation for the deep research of the ancient DNA. Meanwhile, the pollution filtering method is also suitable for forensic identification or judicial identification, for example, for materials which have long death time and cause high degradation, the filtering method can effectively distinguish DNA pollution of modern people during sample collection, and can improve detection quality and accuracy.
It should also be noted that, the filtering method of the present application adopts the existing equipment or the combination of software to implement the steps; it can be understood that under the inventive concept of the filtering method of the present application, each device or software can be combined together to prepare a new device or equipment which is particularly suitable for filtering modern DNA pollution in ancient DNA according to the filtering method of the present application; thus, the filtration method of the present application can be used to prepare a device or apparatus for filtering modern DNA contamination from ancient DNA data.
The application has the beneficial effects that:
the method for filtering modern DNA pollution from ancient DNA data can process the currently widely used Illumina platform data, and the pollution filtering method of the application adopts a comparison algorithm and comparison algorithm parameters which are most suitable for the ancient DNA, can enable the pollution rate after filtering to reach 0 under the proper N and M values, is one of the methods with the maximum filtering strength so far, and lays a foundation for the subsequent deep study of the ancient DNA.
Drawings
FIG. 1 is a block flow diagram of filtering modern DNA contamination from paleo DNA data in an embodiment of the application;
FIG. 2 is a graph showing the damage profile of an ancient DNA sample JK2911 according to an embodiment of the present application;
FIG. 3 is a graph showing the damage profile of an ancient DNA sample Villabruna in an example of the present application;
FIG. 4 is a graph showing the damage profile of the ancient DNA sample Afontova Cora3 in the example of the present application;
FIG. 5 is a graph showing the damage profile of the ancient DNA sample Denisonva_8 according to the example of the present application;
FIG. 6 is a graph showing the percentage of deviation of the comparison rate of 4 samples from the comparison rate of the default value when the "BWAaln disable seed" algorithm "-n" takes different values in an embodiment of the present application;
FIG. 7 is a graph showing the percentage deviation of the contamination rate of 4 samples from the contamination rate of the default value when the "BWAaln disable seed" algorithm "-n" takes different values in an embodiment of the present application;
FIG. 8 is the RA retention of JK2911 samples filtered with different values for N and M in num=N/M based on the old DNA deamination feature filtering in the examples of the present application;
FIG. 9 is the contamination rate of JK2911 samples filtered with different values for N and M in num=N/M based on the old DNA deamination feature filtering in the example of the present application;
FIG. 10 shows the pollution rate reduction degree of JK2911 samples filtered by different values of N and M in num=N/M in the filtering based on the ancient DNA deamination characteristic in the embodiment of the present application;
FIG. 11 is a statistical plot of the length distribution of the ancient DNA of the Afontova 3 sample in the example of the present application;
FIG. 12 is a statistical chart of the length distribution of the Denisonva_8 sample paleo-DNA according to the example of the present application;
FIG. 13 is a statistical chart of the length distribution of the ancient DNA of the Villabruna sample in the embodiment of the application;
FIG. 14 is a statistical diagram of the length distribution of the ancient DNA of the JK2911 sample in the embodiment of the application;
FIG. 15 is a statistical plot of the length distribution of modern DNA contamination samples in an embodiment of the present application.
Detailed Description
The existing research report mainly aims at the evaluation of the modern DNA pollution condition, namely the comparison and judgment of the ancient DNA, and a complete and effective modern DNA filtering scheme does not exist yet. For example, in 2012, schubert et al, on the level of alignment, have sought the best combination of parameters for bwa software when performing paleo DNA alignment; in 2013, skoglund et al used the characteristic of C- > T mutation information formed at both ends of the paleo DNA fragment by paleo DNA deamination to determine whether a DNA sequence was paleo DNA.
The alignment method of Schubert et al, reference Schubert et al, improving ancient DNA read mapping against modern reference genome BMC genome 2012,13:178, mainly comprises the following steps:
1. ancient horses were sequenced for their old genome by Illumina and Helicos sequencing platform.
2. Sequencing data were aligned to the coumare genome using bwa software with different combinations of parameters.
3. Comparing the quantity of the ancient horse DNA obtained under different values of each parameter, calculating time consumption, and estimating pollution condition approximately by deamination characteristics of the ancient DNA fragments.
4. The best parameter combination for the paleo DNA alignment of bwa was evaluated by the results in step 3.
5. The optimal combination effect is obtained after artificial excision of C- > T and G- > A mutations at the ends of the ancient DNA fragments caused by deamination.
6. Finally, the best bwa parameter combination which is most suitable for the data of the Illumina (-l 1024) and the helicobacter (-l 1024-i 0-o 2-n 0.03) is obtained respectively.
The method for judging Skoglund et al refers to Skoglund P et al separating endogenous ancient DNA from modern day contamination in a Siberian Neandertal PNAS 2013,111:6 in detail, in the judging process, a maximum likelihood estimation method is used for integrating deamination characteristics, base quality values and natural mutation information of ancient DNA into a maximum likelihood algorithm, a PMD score is obtained through maximum likelihood estimation of the characteristic values, and whether a DNA fragment is from the ancient DNA is finally judged through the PMD score. If the PMD score is positive and the ancient DNA is supported, the higher the score is, the smaller the probability of DNA pollution of modern people is; if negative, it is judged that the DNA is contaminated. The method has been written as a tool called PMDtools.
For the technology of Schubert et al in 2012, the main drawbacks are:
1. The ancient Ma Yuanshi sequencing data are used, and the real ancient DNA comparison rate and pollution rate cannot be accurately estimated.
2. The evaluation of the data of the helicobacter sequencing platform is mainly performed, while the evaluation of the data of the Illumina platform and the lack of parameter suggestion are performed.
3. Only one ancient horse data about 13000 years is mainly tested, and the universal applicability is lacking.
4. Only the aln algorithm of the bwa software was tested, while other software and the now widely used bwa mem algorithm were not evaluated, on one side.
5. The bwa alignment parameters were estimated solely by the feature of paleo DNA deamination, and other paleo DNA features were not considered.
6. Only the bwa comparison algorithm was developed for the combination of parameters applicable to paleo-DNA, and the paleo-DNA features were not used to filter modern DNA contamination, but the importance of filtering contamination was far higher than the evaluation of contamination rate.
For the technology of Skoglund et al in 2013, the main drawbacks are:
1. the filter effect is good only for the recent samples less than 10000 years, and the filter effect is more limited for the samples more than 10000 years.
2. Only the paleodna characteristics resulting from deamination were considered and no other pattern was estimated.
3. The pollution rate is high.
In one implementation mode, the comparison effect of the comparison algorithms such as BWAMEM, BWA aln seed, BWA aln disable seed, and bowtie2 on the ancient DNA is specifically compared, the parameters of each comparison algorithm are optimized, and finally, the optimal comparison algorithm and the parameter combination which are suitable for the comparison of the ancient DNA are obtained through screening. The comparison effect is mainly considered from two aspects, namely, the comparison rate of the real ancient DNA in the comparison result and the pollution rate of the modern DNA in the comparison result. The higher the comparison rate is, the lower the pollution rate is, and the better the effect is. In order to accurately calculate the comparison effect of different algorithms, the application adopts the test of the known ancient DNA sequence and the simulated modern DNA pollution data of the modern DNA data which are added quantitatively by people, and can truly and accurately reflect the comparison rate and the pollution rate of the ancient DNA. Also, in one implementation of the present application, known paleo DNA sequences are divided into four groups: the first group was the ancient genome data in the samples of the ancient Egypt mummy 1300 years ago, the second group was the ancient human genome data about 10000 years ago, the third group was the ancient human genome data about 17000 years ago, and the fourth group was the danish soldier data 50000 years ago. The sequence of human contamination is divided into two groups: the first group is original machine-off data of the old cattle; the second group is modern DNA; modern DNA contamination is added primarily to make the problem of contamination of closely related species the most difficult to solve. Thus, the final simulated data are divided into 3 groups, namely real paleo DNA data of four paleo-individuals and simulated pollution data combined with two groups of pollution respectively, all data being Illumina data. And then, firstly, respectively comparing 3 groups of data through default parameters of 4 algorithms, calculating the comparison rate and pollution rate in the comparison result, and comparing the comparison rate and pollution rate with a true value to find the optimal comparison algorithm. After the optimal comparison algorithm is found, further exploration is carried out on each parameter under the algorithm, and finally, a parameter combination with the highest ancient DNA comparison rate and the lowest pollution rate is found; the modern DNA pollution filtration of the application is performed based on the obtained optimal alignment algorithm and the parameters of the alignment algorithm.
The second key technical problem solved by the application is to filter modern DNA pollution from ancient DNA data through the characteristics of deamination, depurination and fragmentation of the ancient DNA. The application integrates 3 ancient DNA features, and simultaneously sets a plurality of filtering schemes to ensure that the application is applicable to ancient genome data of various ages. The most relaxed filtering scheme is applied to the simulation data, so that the real ancient DNA can be reserved to the maximum extent, and the most strict filtering scheme is applied to the simulation data, so that the pollution rate can be reduced to 0.
The method for filtering modern DNA pollution from ancient DNA data is provided on the basis of mainly solving the two key problems. Compared with the existing comparison and judgment method of the ancient DNA, the filtering method has the following advantages:
with respect to the technology of Schubert et al in 2012:
1, the technology of Schubert et al cannot accurately evaluate the real ancient DNA comparison rate and pollution rate; the application uses the real ancient DNA data and the pollution data which are quantitatively added manually for simulation, has real and referent pollution rate and comparison rate, and can evaluate the test method more accurately.
2. The prior art mainly evaluates the helicobacter, but does not suggest much actual parameters for the Illumina platform data. The data related by the application are all Illumina platform data, and the comparison effect of a plurality of individuals is comprehensively evaluated, and the optimal parameter combination suitable for the paleo DNA comparison of an Illumina sequencing platform is provided in a targeted manner.
3. The prior art only tests one ancient horse data about 13000 years, and lacks universal applicability. The application is more representative of test data for four ancient people, with the annual distribution ranging from 1300 to 50000 years ago.
4. The prior art only tested the aln algorithm of the bwa software, and did not evaluate other software and the now widely used bwa mem algorithm. The application also tests and evaluates BWA mem, BWA aln seed, BWAaln disable seed and bowtie 2.
5. Compared with the technology of Schubert et al, the application not only filters pollution through C- > T mutation information caused by deamination, but also integrates DNA damage characteristics and fragmentation characteristics caused by depurination to filter and evaluate the pollution of modern DNA, and the evaluation method is more comprehensive.
6. The prior art only groves the combination of parameters of the comparison algorithm for the ancient DNA, and the characteristic of the ancient DNA is not used for filtering the modern DNA pollution. The application not only futures the comparison parameters, but also filters the modern DNA pollution by using the ancient DNA characteristic, and explores a set of optimal filtering scheme.
For the technology of Skoglund et al in 2013:
the technique of skoglund et al is only better for recent samples less than 10000 years, but has a large limitation for samples exceeding 10000 years. The filtering method considers samples of each year at the beginning of construction, so the method for filtering modern DNA pollution by using the ancient DNA has no limitation on the sample year, and greatly supplements the defects of the prior art.
2. Compared with the technology of Skoglund et al, the application not only filters pollution through C- > T mutation information caused by deamination, but also integrates DNA damage characteristics and fragmentation characteristics caused by depurination to filter and evaluate the pollution of modern DNA, and the evaluation method is more comprehensive.
3. Compared with the Skoglund et al technology, the application has the advantages that the application has practical application to the data of Danish soldier before 50000 years and is compared with the PMDtools result, and the application is obviously superior to the Skoglund et al technology in comparison rate and pollution rate.
One of the most important processes in ancient DNA research is the removal of modern DNA contamination present in the ancient genome, especially the filtration of modern DNA contamination in closely related species and its difficulties. The method for filtering the modern DNA pollution in the ancient genome provides a key technology for developing the relevant scientific research projects of the ancient DNA, and lays a foundation for the deep research of the ancient DNA. According to the method for filtering the modern DNA pollution in the ancient DNA data finally obtained through comparison analysis of 4 comparison algorithms which are popular at present, a set of optimal comparison tools is provided for subsequent ancient DNA related researches, the working efficiency of the ancient DNA researches is improved, and the risk of blindly using comparison software is reduced. According to the method for filtering modern DNA pollution from the ancient DNA data, the ancient DNA samples of different ages are considered at the beginning of construction, so that the pollution rate after filtration can reach 0, and the method is one of the methods with the maximum filtering strength so far; the filtering method has the same effect as that of the ancient genome data filtering within 10000 years for the ancient genome data filtering within 10000 years, and the key defect of the existing method is overcome.
The application is further illustrated by the following examples. The following examples are merely illustrative of the present application and should not be construed as limiting the application.
Examples
The method for acquiring endogenous palDNA from paleobiont genome sequencing data, namely a method for filtering modern DNA pollution, is developed based on the Illumina platform second-generation sequencing data and the biological information analysis method. The method of the present example, as shown in fig. 1, mainly comprises the following steps: 1) Acquiring Illumina second generation sequencing data; 2) Original off-machine data Fastq quality control; 3) Ancient DNA damage pattern analysis; 4) Comparing the algorithm to analyze the quality; 5) Searching an optimal parameter combination under an optimal comparison algorithm; 6) Filtering modern DNA contamination based on deamination-induced C- > T and G- > a mutations; 7) Filtering modern DNA contamination based on fragmentation characteristics resulting from depurination; 8) Filtering modern DNA pollution based on ancient DNA length analysis; 9) Statistics and comparison of the final results and acquisition of the best solution. The details are as follows:
1. samples and data
In the embodiment, the method is required to be compared with the real data by improving the method and optimizing the result of the endogenous ancient DNA obtained after relevant parameters, and further the effectiveness of the method is estimated, so that the real ancient DNA data in published articles are selected as the real data, and then the ancient DNA pollution and modern DNA pollution data are quantitatively added into the real data by people to form simulated pollution data. In this way, which DNA fragments in the pollution data are real ancient DNA fragments and which DNA fragments are from pollution can be clearly simulated, so that the authenticity and the effectiveness of the comparison result of the method are ensured.
In addition, to further determine whether the ancient genomes of different ages are suitable for the same filtering criteria, this example compares the effect of 4 ancient human genome samples of different ages on the data results to determine the optimal filtering conditions. Specifically, in this example, 4 data samples with sample ages of 1300 years to 50000 thousands of years were selected, and the relevant information of 4 sets of data is shown in table 1, and the relevant information of artificially added contaminated DNA is shown in table 2.
TABLE 1 archaeological DNA sample information
Age of year Sample name Source Data size
1300 years JK2911 Schuenemannetal.,2017 150M
7 thousands to 1.4 years Villabruna Fuetal.,2016 818M
1.7 perpetual AfontovaCora3 Fuetal.,2016 78M
Over 5 ten thousand years Denisova_8 Sawyeretal.,2015 54M
TABLE 2 information on artificially added contaminating DNA
Contaminating DNA Sequencing platform Data format
Ancient cow DNA Illumina Fq file
Modern DNA Illumina Fq file
All data in Table 1 were performed using the Illumina sequencing platform, data format of Bam, and all data were not processed by uracil-DNA glycosylase (UDG).
It should be noted that the data disclosed in the prior art is used in this example, and thus, data quality control is not required. If the data is the original down-machine Fastq data of the Illumina sequencing platform, the data quality control is also needed. The most important step after the original Fastq data is taken off the machine is to filter the data according to the characteristics of the Illumina data and the sequence characteristics of the ancient DNA, and the aim is to remove the low-quality sequence and the exogenous pollution DNA sequence to the maximum extent. The data filtering of the example mainly comprises 4 aspects:
First, the joint is filtered. In this step, if the read length is found to contain a linker sequence, the linker sequence is partially excised and the remainder of the read length is retained.
Second, low quality bases are filtered. When the number of the bases with the mass value Q less than or equal to 10 is more than 50% of the total number of the bases in the whole read length, deleting the whole read length. If the low-mass base is at the end of the read length and the number does not exceed 50% of the entire read length, only the low-mass portion of the base is excised.
Third, the N region is filtered. The read length with a N content greater than 10% is removed. If the N region exists only at the two ends of the reading length, only the N region at the two ends of the reading length is excised, and the rest base is reserved.
Fourth, the length of the removal is less than 30 bp. If the read length is less than 30bp, more erroneous comparisons can be made in the subsequent alignment process.
In addition, for the self-sequenced ancient DNA data, long fragment sequencing is not suitable to be selected when the machine is used for sequencing, and the reading length is controlled within 100bp, because the average length of the ancient DNA is about 50-80bp, if the reading length exceeds 100bp during sequencing, a large amount of joint pollution can be introduced on one hand, and a large amount of data waste can be caused on the other hand.
2. Ancient DNA damage characterization
Because the pollution filtration is performed based on the characteristics of the ancient DNA, particularly the C- > T replacement characteristics of the 5 'and 3' ends of the DNA fragments caused by deamination, the result of the Illumina sequencing data shows that the damage frequency is generally higher than 10% at the two ends of the ancient DNA, and therefore, before the method exploration is performed, the method is firstly performed, all data are ensured to have the damage characteristics of the ancient DNA. The damage profile is mainly analyzed from two aspects: 1. c- > T mutation frequency statistics caused by deamination at two ends of the ancient DNA fragment; 2. the fragmentation characteristics of ancient DNA due to depurination.
Since the endogenous palindromic DNA acquisition method of this example is based primarily on deamination and depurination characteristics of palindromic DNA, sequencing data is required to be untreated by uracil-DNA glycosylase (UDG), thereby preserving the sequence of DNA damage characteristics resulting from deamination at both ends. Thus, this example first performed a series of evaluations of the nature of the ancient DNA damage. The mismatch and fragmentation patterns of the 4 ancient human genome data published were statistically analyzed and plotted using mapDamage in this example, with the following parameters:
perl mapDamage-0.3.3.pl map–i Human_Bone.sam–d directory–rhg19.fa-c-t Hair-l 49;perl mapDamage-0.3.3.pl merge–d directory;mapDamage-0.3.3.plplot–d directory
as a result, as shown in FIGS. 2 to 5, the ancient DNA damage profiles of JK2911, villabruna, afontovaCora3 and Denisonva_8 are shown in FIGS. 2 to 5, wherein A, G, C, T respectively shows the distribution of the corresponding base over the read length, the left graph represents the frequency distribution of the substitution of C- > T and the right graph represents the frequency distribution of the substitution of G- > A in the two bottom graphs. The results of fig. 2-5 show that from the fragmentation pattern, the proportion of 5' purine increases significantly, while the proportion of pyrimidine decreases significantly correspondingly; from the deamination model, if the sample uses a single-stranded DNA library construction method during library construction, a large number of C- > T mutations accumulate at both the 5 'and 3' ends; if the sample is constructed by using a double-stranded DNA library construction method in the library construction process, a large number of G- > A mutations are accumulated at the 3' -end. Thus, both the fragmentation pattern and the deamination profile are fully compatible with the paleo-DNA profile, and thus, these data are compatible with the requirements for further analysis.
3. Alignment algorithm analysis
The present example uses the default parameters of the 4 comparison algorithms of "bwmem", "BWA aln seed", "BWA aln disable seed" and "bowtie2" for comparison of 4 samples, respectively, and compares the comparison results.
Specific parameters were used as follows:
BWAmem:bwa mem hg19.fa-t 4sample.fq>sample.sam
BWA aln seed:bwa aln-t 4hg19.fa sample.fq>sample.sai;bwa samse hg19.fa sampe.sai sample.fq>sample.sam
BWA aln disable seed:bwa aln-l 1024-t 4hg19.fa sample.fq>sample.sai;bwa samse hg19.fa sample.sai sample.fq>sample.sam
Bowtie2:bowtie2-x hg19.fa-U sample.fq-S sample.sam–p 4
the comparison results of the 4 comparison algorithms on the 4 samples are shown in table 3.
Table 3 comparison results of different comparison algorithms
In Table 3, RA is the true palindromic DNA read length aligned to the modern reference gene hg38, and RAR is the alignment of the true palindromic DNA read length.
The results of table 3 show that, in the comparison results based on the default parameters of the "BWA aln disable seed" comparison algorithm, the comparison rate (RAR) of the true paleo DNA read lengths of 4 samples is in the range of 94.9874% -99.4570%, and the contamination rate is in the range of 0.0023% -0.0349%. In the comparison result obtained based on the default parameters of the BWAAln seed algorithm, the comparison rate of the real ancient DNA read lengths of 4 samples is in the range of 94.9742% -99.5562%, and the pollution rate is in the range of 0.0022% -0.0345%. In the comparison result obtained based on the default parameters of the BWAMEM algorithm, the comparison rate of the real ancient DNA read lengths of 4 samples is in the range of 94.9415% -97.0078%, and the pollution rate is in the range of 0.0023% -0.0333%. In the comparison result obtained based on the default parameters of the bowtie2 algorithm, the comparison rate of the real ancient DNA read lengths of 4 samples is in the range of 68.6286% -89.8522%, and the pollution rate is in the range of 0.0025% -0.0499%. From the comparison results obtained by the 4 algorithms, the comparison result obtained by the bowtie2 is significantly lower than the other three methods. Therefore, only "BWA mem", "BWA alnseed" and "BWAaln disable seed" are further explored in parameter combinations in the following.
Based on the 'BWAaln disable seed' algorithm, the influence of the change of important parameters on the pollution rate and the comparison rate in the comparison result is further explored. The parameter "-e" is first tested, in this example a series of parameter values are set for testing, including default value-1 and set to 1, 2, 3, 4, respectively. The results show that the comparison rate (abbreviated as relative comparison rate) and the pollution rate (abbreviated as relative pollution rate) of the results obtained by the default value are improved along with the increase of the set value of the parameter '-e'. Setting "-e 4", wherein the relative comparison rate of 4 samples is in the range of-0.0004% -0.092%; the relative contamination rate is in the range of 0.004% -0.0064%. The parameter "-i" is then tested, which in this example tests a series of-i parameter values, including setting them to 1, 2, 3, 4, and default value 5. The result shows that the relative comparison rate slightly increases with the increase of the set value of the parameter '-i'; the relative contamination rate was unchanged and was always equal to 0%. For the parameter "-o", 5 sequencing values including a default value of 1 and 2, 3, 4 and 5 are set for testing, and the result shows that the relative comparison rate and the relative pollution rate are basically unchanged along with the increase of the set value of the parameter "-o". Finally, the parameter "-n" was tested, and this example set a series of test parameter values, including being set to 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, where 0.04 is the default value. The result shows that the relative comparison rate is in a descending trend along with the increase of parameter set values, and the descending amplitude of the relative comparison rate is 2.5-13% compared with the value of minus n 0.01; when the value of n exceeds 0.03, the decrease amplitude increases significantly, as shown in fig. 6, and fig. 6 is a deviation percentage of the comparison rate of 4 samples from the comparison rate of the default value when "-n" takes different values.
The relative contamination rate clearly shows a valley trend when the parameter "-n" is within the set value range of 0.01 to 0.06, i.e. when "-n 0.03" the relative contamination rate reaches or is very close to the minimum value and the value of the relative contamination rate reaches a negative value or is very close to a positive value of 0.0002%, as shown in fig. 7, which is the percentage deviation of the contamination rate of 4 samples from the default value when "-n" takes different values.
Based on the "bwialn seed" algorithm, the example first explores the parameter "-l", sets a series of parameter values, including setting them to 8, 16, 24, 32, 40 for testing, where 32 is the default value. Through statistical analysis of the test results, as the parameter "-l" increases, i.e., the seed search sequence length increases, the relative alignment shows a pattern of increasing followed by decreasing, with the relative alignment being at a maximum at "-l 24". However, the relative alignment of the three other samples was negative except for sample Denisonva_8, which was 0.0281%, i.e., all samples showed the best alignment at "-l 32", except for sample Denisonva_8. The relative contamination rate of 4 samples was equal to 0. Finally, the parameter "-k" is tested, and a series of parameter values are set: 0. 1, 2, 3, 4, 5, where 2 is a default value. The results show that as the set value of the parameter "-k" increases, the relative contrast ratio of the 4 samples is in an upward trend, and at "-k 1", a plateau is reached, and the relative contrast ratio of the parameter value is negative, i.e., the contrast ratio of the parameter value is lower than the contrast ratio value of the parameter value "-k 2".
Based on the BWAMEM algorithm, the influence of the change of an important parameter '-r' on the pollution rate and the comparison rate is further explored. A series of "-r" parameter values are set: 1.1, 1.3, 1.5, 1.7, 2.1 and 2.3, and according to test results, the comparative comparison rate shows that the sample Denisonva_8 is in a descending trend in the range of 1.1 to 1.3, the other samples show a slowly ascending trend in the parameter value range of 1.1 to 2.3, and the improvement of the comparative comparison rate reaches the maximum of 0.2438% -0.6060% at "-r 2.3". The relative time consumption was essentially unchanged over a range of parameter values from 1.1 to 2.3, except for the large decreasing fluctuation at-r 1.3 for sample afontovivagora 3.
In summary, the comparison analysis of the comparison method, the finally obtained algorithm and parameter combination for comparing the ancient DNA are as follows: "BWAaln disable seed-n 0.03". The algorithm can improve the endogenous ancient DNA comparison rate to the greatest extent and reduce the pollution rate at the same time.
4. Filtration of modern DNA contamination based on ancient DNA deamination features
The present example filters modern DNA contamination based on C- > T features formed by deamination of ancient DNA. In this example, 35 filtering conditions are set, that is, at least 1, 2, 3, 4 or 5C- > T substitutions occur in 5 bases at the 5 'end respectively, at least 1, 2, 3, 4 or 5C- > T substitutions occur in 5 bases at the 3' end respectively, 0C- > T substitutions occur in 5 bases at the 3 'end respectively, and at least 1, 2, 3, 4 or 5C- > T substitutions occur in 5 bases at the 5' end respectively. The modern DNA pollution filtration test was performed on the above 35 filtration conditions, respectively.
Firstly, modern DNA pollution is filtered according to the replacement condition that C- > T appears at the 3 '-end and the 5' -end respectively, and the result shows that the RA retention rate and the pollution rate of 4 samples for filtering the 3 '-end and the 5' -end are reduced along with the increase of the quantity of C- > T, when the filtering condition is increased to 2 from at least 1 mutation of C- > T at two ends of a reading length, the descending trend is particularly obvious, but the filtering results among samples and among different ends in the samples are also different due to the different specific deamination degrees of different ends of different sample reading lengths. Analysis shows that RA retention rate through filtering aiming at a single end is too small, so that the method further adopts a 3 'and 5' end combined filtering mode to filter DNA pollution of modern people so as to improve RA retention rate; the combination is expressed by num=n/M, which means that at least N C- > T mutations occur at the 5 'end, or M C- > T mutations occur at the 3' end; that is, the read length of at least N C- > T mutations at the 5 'end is reserved, and the read length of M C- > T mutations at the 3' end is reserved.
The test results of the JK2911 samples among the 4 samples are shown in fig. 8 to 10, wherein fig. 8 is RA retention rate of the JK2911 samples when N and M take different values in num=n/M, fig. 9 is pollution rate of the JK2911 samples when N and M take different values, and fig. 10 is pollution rate reduction degree of the JK2911 samples when N and M take different values. The results of fig. 8 to 10 show that as the values of the parameters N and M are increased, both the RA retention and the contamination rate decrease, and the contamination rate decreases as m+n increases. For the Denisonva_8 sample, the overall trend of retention, contamination and contamination reduction was the same as the other three samples, but when the parameters N/M=3/5 and N/M=3/4, a large rise fluctuation of contamination and contamination reduction occurred.
To further compare this example to the filtering effect of existing software PMDtools, this example tested a series of "-threshold" parameters using PMDtools.
The calculated parameters for the PMDtools are as follows:
pmdtools.0.60.py--threshold[1-5]--header|samtools view-S->sample.sam
the test results show that the RA retention of the denisova_8 samples is low and the contamination rate is high. Mainly because the PMDtools algorithm has more limitations on a sequencing platform, sample ages and the like, and is more suitable for samples of not more than 1 million years, so that the retention rate and the pollution rate of the Denisonva_8 sample are poor. The method for filtering the DNA pollution of the modern people by utilizing the ancient DNA features can effectively break through the limitation of the method for acquiring the endogenous DNA by using the PMDtools software, so that the DNA pollution of the modern people in the more ancient samples can be effectively filtered. By filtering modern DNA contamination by C- > T substitution caused by deamination of any of the 3 'and 5' ends, the modern contamination rate of the Denisonva_8 sample can be controlled within 1%, the contamination reduction degree reaches more than 99%, the retention rate reaches more than 2%, and the recommended parameter N/M=1/2 or 2/1 can reach more than 50%, which is far higher than the filtering result of PMDtools. According to the increase of the parameters N and M, the pollution rate and RA retention rate are reduced, and although the retention rate of the filtration result of the PMDtools software is higher than the filtration result based on the deamination characteristic, the pollution rate of the filtration result of the PMDtools software is far higher than the filtration result based on the deamination characteristic. Combining the pollution rate and RA retention rate, and recommending to use the parameters N/M=1/2 or 2/1 to carry out the filtration of modern DNA pollution, namely, retaining the read length of at least 1C- > T substitution in the first 5 bases of the 5 'end, and retaining the read length of at least 2C- > T substitutions in the last 5 bases of the 3' end; or the read length of at least 2C- > T substitutions in the first 5 bases of the 5 'end is reserved, while the read length of at least 1C- > T substitutions in the last 5 bases of the 3' end is reserved.
5. Filtration of modern DNA contamination based on ancient DNA depurination feature
In the data set filtered by the deamination features, a certain proportion of DNA pollution still exists, so that the modern DNA pollution is further filtered based on the deamination features and the depurination features on the basis of the data filtered by the deamination features. Specifically, in this example, the filtering results of N/m=1/2 of 4 samples were subjected to further depurination characteristic filtering; the depurination characteristic filtering of the embodiment adopts two modes, firstly, the depurination occurs at both ends of the reading length; second, depurination occurs at any terminal. The retention rate, the pollution rate and other data after filtering in two modes are counted, and the results are shown in tables 4 and 5; table 4 shows the results of filtration when depurination occurs at both ends of the sample, and Table 5 shows the results of filtration when depurination occurs at either end.
Table 4 shows the filtering results of apurinic characteristics at both ends
Table 5 shows the filtering results of the apurinic character at either end
The results in tables 4 and 5 show that, except for the obvious improvement of the pollution rate and the RA retention rate of the JK2911 samples, the improvement of the pollution rate and the RA retention rate of the other three samples is not large, and even the pollution rate is increased and the RA retention rate is reduced. The effect of using this filtration is limited, probably because the depurination formation characteristics of the 3 other samples are not apparent, except for the JK2911 sample. Thus, when modern DNA contamination filtration is performed using the depurination feature, the degree of significance of the depurination feature is first evaluated, and if the depurination feature is not obvious, it is not recommended to use the feature for filtration, and if it is significant, filtration can be performed on the basis of the data after deamination filtration.
6. Filtration of modern DNA contamination based on ancient DNA fragmentation features
The method aims to further improve the acquisition rate of the endogenous DNA of the ancient human, and simultaneously reduce pollution to the greatest extent. Statistical analysis is carried out on the length distribution of the ancient DNA fragments and the modern human DNA pollution, and the overlapping area range is further filtered on the basis of the statistical analysis. The length distribution of 4 ancient DNA samples and modern DNA contamination samples was counted in this example, and the results are shown in FIGS. 11 to 15. FIG. 11 is a statistical plot of the length distribution of ancient DNA from an AfontovaCava3 sample; FIG. 12 is a statistical plot of the length distribution of Denisonva_8 sample paleo-DNA; FIG. 13 is a statistical plot of the length distribution of the paleo DNA of the Villabruna sample; FIG. 14 is a statistical plot of the length distribution of the ancient DNA of the JK2911 sample; FIG. 15 is a statistical plot of the length distribution of modern DNA contamination samples. The results in FIGS. 11 to 15 show that the length distribution of the paleo DNA fragments is mainly 40bp to 80bp, while the length of the DNA fragments contaminated by modern people is mainly 80bp or more.
According to the above length statistics results, the pollution rate and retention rate of the original non-length filtration are respectively counted, and after the reading length of more than 80bp, the reading length of more than 90bp or the reading length of more than 100bp are respectively removed by filtration, the pollution rate and retention rate of each filtration condition are respectively counted, and the results are shown in table 6.
Table 6 4 results of samples filtered based on different read lengths
The results in Table 6 show that as the length of the filter removal read increases in the conditions, the contamination rate increases, the RA retention increases but the degree of change in both the contamination rate and the RA retention is not great. Therefore, when the requirement on the contamination rate is greater than the RA retention rate, it is recommended to use a more stringent filtration condition of 80bp for filtration; when the contamination rate is less than the RA retention, the filtration is performed using 100bp looser filtration conditions.
The foregoing is a further detailed description of the application in connection with specific embodiments, and it is not intended that the application be limited to such description. It will be apparent to those skilled in the art that several simple deductions or substitutions can be made without departing from the spirit of the application.

Claims (8)

1. A method for filtering modern DNA contamination from ancient DNA data, characterized by: comprises the steps of,
performing ancient DNA damage mode analysis on the obtained Illumina second-generation sequencing original data of the ancient DNA conforming to the data quality control, and selecting data with typical ancient DNA damage modes for subsequent analysis;
quantitatively adding modern DNA data into data with typical ancient DNA damage modes to obtain simulated modern DNA pollution data;
Adopting at least two comparison algorithms to analyze the comparison rate and the pollution rate of the ancient DNA fragments in the simulated modern DNA pollution data, comparing the analysis result with the real comparison rate and the pollution rate, and screening the comparison algorithm with the comparison rate and the pollution rate closest to the real situation;
the comparison algorithm evaluation specifically comprises the steps of gradually adjusting various parameters of the comparison algorithm, comparing the comparison rate and pollution rate analysis results obtained after parameter adjustment with the actual comparison rate and pollution rate, and screening comparison algorithm parameters of which the comparison rate and pollution rate are closest to the actual situation;
according to the screened comparison algorithm and comparison algorithm parameters, performing an ancient DNA characteristic filtering step on the simulated modern DNA pollution data; the step of filtering the ancient DNA features comprises the step of filtering modern DNA pollution according to the characteristic of deamination of the ancient DNA, and specifically comprises the step of retaining the reading length of at least N C-T mutations in the first 5 bases of the 5 'end or at least M C-T mutations in the last 5 bases of the 3' end caused by deamination after filtering; the filtered ancient DNA retention rate and pollution rate of N and M are analyzed by adopting a screened comparison algorithm and comparison algorithm parameters, and the values of N and M are determined according to the required retention rate and pollution rate;
Carrying out modern DNA pollution filtration on the Illumina second generation sequencing original data of the ancient DNA conforming to data quality control by adopting a screened comparison algorithm, comparison algorithm parameters and determined N and M values;
n and M are each integers of 1 to 5.
2. The method according to claim 1, characterized in that: the data with the typical ancient DNA damage mode refers to the read length with the following conditions, 1) the proportion of 5' -end purine of the read length is obviously increased, and the proportion of pyrimidine is correspondingly obviously reduced; 2) Sequencing data of a single-stranded DNA library construction method accumulate a large number of C- > T mutations at both the 5 'end and the 3' end; sequencing data of the double-stranded DNA library construction method accumulate a large number of C- > T mutations at the 3' end.
3. The method according to claim 1, characterized in that: and under the condition that the result of the ancient DNA damage pattern analysis shows that the ancient DNA has obvious apurinic characteristics, carrying out the apurinic characteristic filtration of the ancient DNA on the Illumina second generation sequencing original data of the ancient DNA which accords with the data quality control, and specifically, reserving the reading lengths of guanine and adenine as the bases before and after the breaking position of the DNA chain after filtration.
4. The method according to claim 1, characterized in that: the method also comprises the steps of performing paleoDNA fragmentation characteristic filtering on the simulated modern DNA pollution data, specifically, setting a threshold according to paleoDNA fragments, and filtering to remove reading lengths larger than the set threshold; the filtered paleo DNA retention rate and pollution rate are analyzed by adopting a screened comparison algorithm and comparison algorithm parameters, and the set threshold value is determined according to the required retention rate and pollution rate; and adopting the determined set threshold value to perform archaea DNA fragmentation characteristic filtering on the Illumina second-generation sequencing original data of the archaea conforming to the data quality control.
5. The method according to any one of claims 1-4, wherein: the Illumina second generation sequencing original data of the ancient DNA conforming to the data quality control is data obtained by screening under the following conditions,
1) Filtering the joint, specifically comprising cutting off the joint sequence if the read length contains the joint sequence, and reserving the rest part;
2) Filtering low-quality bases, wherein the method specifically comprises deleting the whole read length if the number of bases with the quality value Q less than or equal to 10 is more than 50% of the total number of bases in the whole read length; if the base with the mass value Q less than or equal to 10 is at the end of the reading length and the number is not more than 50% of the whole reading length, only the low-mass base is cut off, and the rest part is reserved;
3) Filtering in the N area, specifically comprising removing the read length with the N content ratio of more than 10%; if the N region exists only at the two ends of the reading length, cutting off the N region at the two ends of the reading length, and reserving the rest bases;
4) The read length less than 30bp in length was removed.
6. The method according to any one of claims 1-4, wherein: the Illumina second-generation sequencing original data of the paleo DNA comprises at least one group of paleo DNA data smaller than 10000 years and at least one group of paleo DNA data larger than 10000 years, and all the groups of data are independently analyzed.
7. The method according to claim 6, wherein: the modern DNA data added to the simulated modern DNA pollution data includes modern DNA of the same species or closely related species of the paleoliving organism as the paleodna data.
8. Use of the method according to any one of claims 1-7 in ancient DNA research, forensic identification or judicial identification; or in the preparation of devices or apparatus for filtering modern DNA contamination from ancient DNA data.
CN201811161836.4A 2018-09-30 2018-09-30 Method for filtering modern DNA pollution from ancient DNA data and application thereof Active CN110970086B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811161836.4A CN110970086B (en) 2018-09-30 2018-09-30 Method for filtering modern DNA pollution from ancient DNA data and application thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811161836.4A CN110970086B (en) 2018-09-30 2018-09-30 Method for filtering modern DNA pollution from ancient DNA data and application thereof

Publications (2)

Publication Number Publication Date
CN110970086A CN110970086A (en) 2020-04-07
CN110970086B true CN110970086B (en) 2023-08-15

Family

ID=70029262

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811161836.4A Active CN110970086B (en) 2018-09-30 2018-09-30 Method for filtering modern DNA pollution from ancient DNA data and application thereof

Country Status (1)

Country Link
CN (1) CN110970086B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101921743A (en) * 2010-07-26 2010-12-22 中国农业大学 Method for acquiring genetic information of ancient animal
CN103806111A (en) * 2012-11-15 2014-05-21 深圳华大基因科技有限公司 Construction method and application of high-throughout sequencing library
CN107944223A (en) * 2017-11-10 2018-04-20 深圳裕策生物科技有限公司 Point mutation detection filter method, device and storage medium based on the sequencing of two generations
WO2018085862A2 (en) * 2016-11-07 2018-05-11 Grail, Inc. Methods of identifying somatic mutational signatures for early cancer detection

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101921743A (en) * 2010-07-26 2010-12-22 中国农业大学 Method for acquiring genetic information of ancient animal
CN103806111A (en) * 2012-11-15 2014-05-21 深圳华大基因科技有限公司 Construction method and application of high-throughout sequencing library
WO2018085862A2 (en) * 2016-11-07 2018-05-11 Grail, Inc. Methods of identifying somatic mutational signatures for early cancer detection
CN107944223A (en) * 2017-11-10 2018-04-20 深圳裕策生物科技有限公司 Point mutation detection filter method, device and storage medium based on the sequencing of two generations

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
True single-molecule DNA sequencing of a pleistocene horse bone;Ludovic Orlando etc;《Genome Research》;20111030;第1705-1719页 *

Also Published As

Publication number Publication date
CN110970086A (en) 2020-04-07

Similar Documents

Publication Publication Date Title
CN105886616B (en) Efficient specific sgRNA recognition site guide sequence for pig gene editing and screening method thereof
DE202013012824U1 (en) Systems for the detection of rare mutations and a copy number variation
CN110520542A (en) Method for targeting nucleic acid sequence enrichment and the application in the nucleic acid sequencing of error correcting
CN109554488B (en) Female sturgeon specific DNA fragment and application thereof
CN109767810B (en) High-throughput sequencing data analysis method and device
JP2014505935A (en) DNA sequence data analysis method
CN110622250A (en) Method and system for detecting insertions and deletions
CN105624322A (en) Method for developing SSR (Simple Sequence Repeat) molecular mark of nibea albiflora for population identification
CN110970086B (en) Method for filtering modern DNA pollution from ancient DNA data and application thereof
CN114530199A (en) Method and device for detecting low-frequency mutation based on double sequencing data and storage medium
CN107217091A (en) A kind of detection method of milch goat Fecundity Trait related gene SNP
CN110959178A (en) Systems and methods for targeted genome editing
CN110729025B (en) Paraffin section sample somatic mutation detection method and device based on second-generation sequencing
Chung et al. Tissue requirements and DNA quality control for clinical targeted next-generation sequencing of formalin-fixed, paraffin-embedded samples: a mini-review of practical issues
CA3004527A1 (en) Methods for determining the origin of dna molecules
CN113151489B (en) Molecular diagnosis method for evaluating growth traits based on cow ZNF146 gene CNV marker and application thereof
CN110964839B (en) Method for auxiliary detection of cattle growth traits through SERPINA3-1 gene CNV labeling and application thereof
CN111411167B (en) DNA fingerprint spectrum library of tobacco variety and application thereof
CN113337578A (en) Method for efficiently screening positive SNP of aquatic animals based on transcriptome data
CHIAPPINI et al. Genetic identity and relationship between four Anagrus species (Hymenoptera: Mymaridae) using RAPD analysis
CN110914454B (en) Genome sequence analysis of human DNA samples contaminated with microorganisms using whole genome capture inter-transposon segment sequences
CN111445956B (en) Efficient genome data utilization method and device for second-generation sequencing platform
CN113930518B (en) Molecular marker C49 for identifying ammonia nitrogen tolerance character of portunus trituberculatus and application thereof
CN117757951B (en) Megalobrama amblycephala genetic sex specific molecular marker, detection primer and application
Ng Metrics for Understanding the Degraded Metagenome of Stomach Cancer Patient

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
TA01 Transfer of patent application right

Effective date of registration: 20220106

Address after: 518083 8th floor, building 11, Beishan Industrial Zone, 146 Beishan Road, Yangang community, Yantian street, Yantian District, Shenzhen, Guangdong

Applicant after: BGI SHENZHEN Co.,Ltd.

Address before: 518083 comprehensive building, Beishan Industrial Zone, Yantian District, Guangdong, Shenzhen

Applicant before: BGI SHENZHEN

TA01 Transfer of patent application right
TA01 Transfer of patent application right

Effective date of registration: 20230515

Address after: 518124 Peng Peng Road, Dapeng office, Dapeng New District, Shenzhen, Guangdong 7

Applicant after: Shenzhen Huada Sansheng Garden Technology Co.,Ltd.

Address before: 518083 8th floor, building 11, Beishan Industrial Zone, 146 Beishan Road, Yangang community, Yantian street, Yantian District, Shenzhen, Guangdong

Applicant before: BGI SHENZHEN Co.,Ltd.

GR01 Patent grant
GR01 Patent grant