Embodiment
In order to make purpose of the present invention, technical scheme and advantage clearer,, the present invention is further elaborated below in conjunction with drawings and Examples.Should be appreciated that specific embodiment described herein only in order to explanation the present invention, and be not used in qualification the present invention.
In embodiments of the present invention, the order-checking fragment that high throughput sequencing technologies obtains is compared on the reference genome, the massfraction of each base in the testing gene group that order-checking is obtained, calculate the likelihood probability of range gene type on each position of testing gene group, according to this likelihood probability be that the default prior probability of every kind of genotype calculates every kind of genotypic posterior probability, the genotype that posterior probability is the highest is defined as the most possible correct genotype in this site of testing gene group, obtain the concensus sequence of testing gene group, by in the concensus sequence that detects the testing gene group with the inconsistent site of reference genome sequence, can detect the pleomorphism site in the testing gene group.
Fig. 1 shows the realization flow of the method for detecting single nucleotide polymorphism that the embodiment of the invention provides, and details are as follows:
In step S101, the order-checking fragment that high throughput sequencing technologies is obtained is compared with reference to genome sequence and is listed.
In embodiments of the present invention, high throughput sequencing technologies can be Illumina GA sequencing technologies, also can be existing other high throughput sequencing technologies.Can be by any short sequence mapping program, as mapping programs such as soap, the order-checking fragment that high throughput sequencing technologies is obtained is compared with reference to genome sequence and is listed.
In step S102, the sequencing quality mark of each base is converted to the base mispairing rate according to the definition of PHRED mark in the testing gene group that order-checking is obtained, and long-pending according to mispairing rate to all order-checking fragment bases of comparing each site on the reference genome, obtain the likelihood probability of the range gene type in corresponding site on the testing gene group.
In step S103, according to likelihood probability be the default prior probability of every kind of genotype, calculating is with reference to every kind of genotypic posterior probability on each site on the genome, the genotype that posterior probability is the highest is defined as the most possible correct genotype in the corresponding site of testing gene group, obtains the concensus sequence of testing gene group.
Wherein every kind of genotypic prior probability can rule of thumb be worth and sets in advance, and as a given available reference genome, can estimate that testing gene group sequence is with respect to reference to genomic mutation rate.For example, the probability that most positions per generations undergos mutation on the human genome is at the 10-8 order of magnitude, and two individualities and its common ancestor's difference approximately will form through the time in 10000 generations.The haplochromosome SNP probability of calculating two people that come out in view of the above is approximately 0.001.Each monoploid probably has the difference of 1000 nucleotide with reference sequences.Suppose that people's reference genome sequence shows 0.00001 error rate, then this error rate can be ignored with respect to the incidence of true pleomorphism site.Comprehensive these data can estimate that the isozygoty probability of SNP of amphiploid is 0.0005, and the probability of generation heterozygosis SNP is 0.001.
To the research of NCBI dbSNPs, the frequency that transition mutations takes place was 4 times of transversion according in the past, but but almost was that equifrequent occurs between all types of in these two kinds of variations.Thus, in our SNP detection model, these ratios have been used.For example, if certain loci gene type is G on the reference sequences, monoploid base type may be A, the prior probability of C and T all is 6.67E-4, and the prior probability of G is 0.999, and be 0.9985 for the probability that GG combination appears in amphiploid, AA is 3.33E-4, and TT is 8.33E-5, and AC and AT are 1.11E-7, GC and GT are 1.67E-4, AG is 6.67E-4, and CT is 2.78E-8, shown in table 1.1.
|
A |
C |
G |
T |
A |
3.33E-4 |
1.11E-7 |
6.67E-4 |
1.11E-7 |
C |
|
8.33E-5 |
1.67E-4 |
2.78E-8 |
G |
|
|
0.9985 |
1.67E-4 |
T |
|
|
|
8.33E-5 |
According to above-mentioned analysis, can be that every kind of genotype is provided with prior probability according to the testing gene group to prior probability.
According to likelihood probability be the default prior probability of every kind of genotype, calculating is with reference on each site on the genome during every kind of genotypic posterior probability, according to likelihood probability be the default prior probability of every kind of genotype, adopt Bayesian formula to calculate with reference to every kind of genotypic posterior probability on each site on the genome, the genotype that posterior probability is the highest obtains the concensus sequence of testing gene group as the most possible correct genotype in the corresponding site of testing gene group.Wherein the probability that this genotype is correct is its posterior probability shared ratio in all genotype posterior probability sums.
In step S104, by in the concensus sequence that detects the testing gene group with the inconsistent site of reference genome sequence, detect the pleomorphism site in the mononucleotide in the testing gene group.
In embodiments of the present invention, with in the concensus sequence of testing gene group with the inconsistent site of reference genome sequence as potential pleomorphism site, thereby detect the polymorphism of mononucleotide in the testing gene group.
In order to make the SNP testing result more accurate, in another embodiment of the present invention, this method also comprises the steps:
The probability that genotype among the step S103 is correct is converted to massfraction, and by massfraction is provided with threshold values, filter out the pleomorphism site of massfraction in the detected potential pleomorphism site less than the threshold values that is provided with, thereby from detected potential pleomorphism site, filter out the bigger pleomorphism site of assurance, thereby optimize the accuracy of the pleomorphism site that is extracted.
In order further to optimize the accuracy of the pleomorphism site that is extracted, in another embodiment of the present invention, this method also comprises the steps:
Number to the order-checking fragment of supporting polymorphism base type is set threshold values, filters out the pleomorphism site of the number of the order-checking fragment of supporting polymorphism base type less than pre-set threshold value, thereby filters out the mismatched bases type of the overwhelming majority.Illustrate as follows, if the number of order-checking fragment of supporting detected potential pleomorphism site then filters out this potential pleomorphism site less than default threshold values.
Because the mistake in the order-checking process and the comparison mistake of short-movie section all may cause the appearance of difference between sequence to be measured and the reference sequences, and the order-checking fragment length that high throughput sequencing technologies obtains is very short, have more order-checking segment so and the comparison mistake occurs, wrong SNP therefore just occurred.In embodiments of the present invention, by the minimum number of the order-checking fragment of supporting polymorphism base type is set, can filter out the mismatched bases type of the overwhelming majority.
Below with test figure explanation, by the minimum number to the order-checking fragment of supporting polymorphism base type is set, filter out the mismatched bases type of the overwhelming majority.
Error rate by calculating base from simulated data is to detect the SNPs of wrong identification.At least 95% wrong base has all only occurred once in the order-checking fragment (35bp) of unidirectional order-checking and the order-checking fragment of two-way order-checking (inserting fragment 200bp).Based on these data, by all low-frequency wrong bases (being made as 4 usually) of a threshold value filtering are set.Search SNPs with this method on genome of Yan Di and Huang Di, two legendary rulers of remote antiquity, the result has only about 0.036% wrong allele not to be detected.In the simulation of random dna fragment, 36 times of sequencing datas have 0.008% order-checking mistake greatly down approximately not by filtering.In order to distinguish mistake with high frequency base type and real heterozygosis pleomorphism site, we utilize binomial distribution (P=0.0001) to detect the difference of these two kinds of base types, find in the remaining wrong base type 87.3% all by filtering.Altogether 99.93% since the wrong base type that causes of order-checking mistake of order-checking fragment all by filtering.
Contain the insertion deletion in the order-checking fragment owing to cause the factor of SNP mistake also to comprise, this mistake is relevant with the comparison mode of order-checking fragment.Because 5~10 times of the insertion of SNPs quantity the chances are small fragment deletion have been paid the utmost attention to the comparison situation that does not add gap in the comparison, this class contains the order-checking fragment of inserting deletion may compare failure.For fear of the problems referred to above, in another embodiment of the present invention, this method also comprises the steps:
Distance between the more detected potential pleomorphism site, and filter out the pleomorphism site of distance less than pre-set threshold value, contain the order-checking fragment of inserting deletion really thereby filter out.
We have simulated 10000 small fragments and have inserted the potential impact that the SNPs detection is estimated in deletion, found that 0.6% the reads that inserts deletion that comprises does not detect.If we require at least 4 reads to support just believable words, have only 3 (0.03%) individual wrong SNP allele to reexamine out.
Fig. 2 shows the realization flow of the method for detecting single nucleotide polymorphism that another embodiment of the present invention provides, and details are as follows:
In step S201, the order-checking fragment that high throughput sequencing technologies is obtained is compared with reference to genome sequence and is listed.
In step S202, the sequencing quality mark of each base is proofreaied and correct in the testing gene group that order-checking is obtained, and obtains the likelihood probability of the range gene type in corresponding site on the testing gene group according to the sequencing quality mark after proofreading and correct.The method that the sequencing quality mark of each base is proofreaied and correct in wherein a kind of testing gene group that order-checking is obtained is to proofread and correct the sequencing quality mark by four-matrix, obtains likelihood probability, and its detailed process is as follows:
A, according to unique order-checking fragment of comparing with reference to genome sequence, add up under specific sequencing quality mark and the sequencing sequence coordinate, mispairing ratio between per two kinds of bases, with of the estimation of this mispairing ratio as the mispairing probability, be recorded in the four-dimensional probability matrix, as the basis of parameters in the statistical models.
Wherein the dimension of four-dimensional probability matrix is respectively sequence coordinate, original base and the observed base on sequencing quality mark, the order-checking fragment.Add up under specific sequencing quality mark and the sequencing sequence coordinate, mispairing ratio between per two kinds of bases is meant the order-checking fragment of comparing on the reference sequences according to unique, statistics is observed the probability of order-checking base from original base under the sequence coordinate on sequencing quality mark, the order-checking fragment.
In embodiments of the present invention, when the order-checking fragment is compared the position that has minimum base mispairing on the reference sequences, think optimum matching, if an order-checking fragment lists at the reference genome sequence and has only a best match position, then thinking order-checking, fragment is unique compares with reference to genome sequence, if the order-checking fragment has listed a plurality of best match position at the reference genome sequence, think that then the order-checking fragment repeats to compare with reference to genome sequence.
In order to estimate accuracy and the uniqueness that the order-checking fragment covers, with the mankind with reference to No. 12 chromosomes of genome as the reference sequence, the order-checking fragment of simulation different length (SNP and the order-checking that have comprised 0.001 ratio are wrong) is compared the order-checking fragment of these simulations in the full genome reference sequences of the mankind then again.From the order-checking fragment of these simulations, can calculate number percent with unique comparison order-checking fragment.
For the order-checking fragment in the unique comparison of unidirectional order-checking, the ratio of its unique comparison changes greatly when 15bp rises to 25bp in order-checking length, and when increasing order-checking length more afterwards, the ratio of unique comparison only has variation seldom.For length is the order-checking fragment of 25bp, has 78.6% order-checking fragment can compare uniquely on the full genome reference sequences of the mankind.For length is the order-checking fragment of 50bp, has 91.5% order-checking fragment can compare uniquely on the full genome reference sequences of the mankind.As the Illumina sequencing technologies length that typically checks order is 35bp, in comparison result, find, there is not wrong order-checking fragment to have 85.7% can compare uniquely on the genome, the order-checking fragment of 1 mispairing has 86.3% can compare uniquely on the genome, and the order-checking fragment of two mispairing has 85.9% can compare on the genome uniquely.A genome actual measurement of the comparison result of simulated data and Yan Di and Huang Di, two legendary rulers of remote antiquity comparison result is similar.From analyze, can find, with two-way sequencing technologies can be very big the ratio of order-checking fragment of the unique comparison of raising.Two-way sequencing technologies inserts fragment length in 100bp to 10kbp (± 10%) scope, and order-checking fragment ratio that can unique comparison increases along with the increase of inserting fragment length.Wherein, when being 200bp, fragment has in the comparison that 95.4% paired order-checking fragment can be unique inserting.
Go to compare with short sequence order-checking fragment, the SNP that order-checking fragment the inside comprises and some order-checking mistakes fragment that all might cause checking order is compared on the incorrect position.For the order-checking fragment of simulation, owing to knowing in advance original position, so can estimate to compare the content of correct order-checking fragment.Length is the unidirectional order-checking order-checking fragment of 25bp, and 2.3% 2 the wrong order-checking fragments of order-checking that have with the wrong order-checking fragment of 1 order-checking and 3.5% are failed on the correct coupling, referring to Fig. 3 c.If order-checking length is 50bp, these two ratios have dropped to 0.6% and 0.8% respectively.In like manner simulated and contained a two-way order-checking order-checking fragment that order-checking is wrong, inserted fragment, 0.4% and 0.06% order-checking fragment comparison mistake has been arranged respectively from 100bp-10kbp; Have the wrong order-checking fragment of 2 order-checkings have under the same insertion fragment 0.3% and 0.06% can not than on.Referring to Fig. 3 d.
B, for each site on the reference genome, collect the base of comparison all order-checking fragments on this site, and write down its base type, sequencing quality mark and the sequence coordinate on the order-checking fragment, from four-dimensional probability matrix, find the probability that four kinds of bases are observed the order-checking base.
The probability that c, basis are observed each independent base from each true genotype is long-pending, obtains the likelihood probability of the range gene type in corresponding site on the testing gene group.
Wherein observe the probability of all bases that cover this site from each true genotype, long-pending for the probability of observing each independent base, find and observe in the four-dimensional probability matrix that the probability of each independent base can set up from step a from each true genotype.Like this, just obtained the likelihood probability that each potential possible genotype obtains the observation base in this site.
For the monoploid genome, its true genotypic possibility has four kinds of A, T, C, G, and for the diploid gene group, its true genotypic possibility has 10 kinds, and wherein homozygous genotype is 4 kinds: AA, CC, GG, TT; 6 kinds on heterozygous genes type: AC, AG, AT, CG, CT, GT.
The Illumina sequencing technologies comes the calibrating quality mark by each order-checking round-robin data.With actual mispairing rate certain difference is arranged through the massfraction after the calibration, and this difference be along with the order-checking fragment coordinate difference and up and down the fluctuation (Fig. 4 a), in embodiments of the present invention, the mark that the correction of massfraction is remained after proofreading and correct by comparison information and original series information or through Illumina gets.
Except sequencing error impacted massfraction along with the order-checking round-robin constantly increases, order-checking instrument itself had also influenced massfraction to the detection of base.The Illumina sequencing technologies utilizes four kinds of bases that were labeled of laser radiation of two kinds of different frequencies, A, and C represents with a kind of laser, G, T represents with another kind of laser.So
The wrong frequency that occurs of order-checking just is higher than other mispairing.We find
The probability of mispairing exceeds 58% and 72% respectively than the situation that we simulate comparison, simultaneously
The mispairing probability approximately will reduce by 36% (Fig. 4 b).For example, massfraction is in the base of 10 (theoretical error rates 0.1), can observed mispairing between order-checking fragment and the reference sequences
Be respectively 4.62%, 5.27%, 5.29%, 4.62%, and other mispairing type only has 1.62%~2.48%.Given this result in embodiments of the present invention, has also proofreaied and correct massfraction by the mispairing type.
The another kind of sequencing quality mark bearing calibration that the embodiment of the invention provides is to filter out known SNP site earlier, again according to unique order-checking fragment of comparing with reference to genome sequence, add up under specific sequencing quality mark and the sequencing sequence coordinate, mispairing ratio between per two kinds of bases, with of the estimation of this mispairing ratio, be recorded in the four-dimensional probability matrix as the mispairing probability.Adopt this four-dimension probability matrix to proofread and correct the sequencing quality mark again, obtain the likelihood probability of the range gene type in corresponding site on the testing gene group.
Fragment 3 ' end error rate will be far above 5 ' end because the wrong accumulation of order-checking causes checking order.When adopting the Illumina sequencing technologies that the testing gene group is checked order, can provide a sequencing quality mark, but this sequencing quality mark calculates by signal intensity, can not accurately represent the generation of error rate.In order to correct this problem, assess by result a gene order-checking of Yan Di and Huang Di, two legendary rulers of remote antiquity, in embodiments of the present invention, by before statistics mispairing rate, known SNP site is foreclosed, thus interference in correction in order to avoid SNPs to cause as much as possible.
Another sequencing quality mark bearing calibration that the embodiment of the invention provides, a plurality of identical order-checking fragments are arranged in its order-checking fragment on comparing the reference genome, and when the location is identical on the reference genome, a plurality of herein fragments may be the redundancies that the PCR reaction causes in the order-checking test, therefore after the step b of above-mentioned sequencing quality mark bearing calibration, also comprise the steps:
When article one order-checking fragment is compared with reference to genomic certain site, to from four-dimensional probability matrix, find four kinds of bases observe the probability of order-checking base long-pending carry out probability normalization after, obtain the likelihood probability of the range gene type in corresponding site on the testing gene group.
When second order-checking fragment is compared with reference to genomic above-mentioned site, after will from four-dimensional probability matrix, finding four kinds of bases and observing long-pending behind the probability * correction factor of order-checking base and carry out probability normalization, obtain the likelihood probability of the range gene type in corresponding site on the testing gene group.Wherein the value of correction factor is generally 0.9.
When N bar order-checking fragment is compared with reference to genomic above-mentioned site, then will from four-dimensional probability matrix, find four kinds of bases observe the order-checking base probability * correction factor (N-1) after long-pending carry out probability normalization after, obtain the likelihood probability of the range gene type in corresponding site on the testing gene group, N is a positive integer.
Because each locational candidate's allelotype D can observe out from the order-checking fragment of comparing fully with reference sequences.The possibility of supposing every kind of base type Ti is P (D|Ti), and influencing P (D|Ti) principal element has four, is respectively the position on base type, sequencing quality, the order-checking fragment and the number of times of appearance.And in embodiments of the present invention, when calculating likelihood probability, all considered the influence that above-mentioned factor detects SNP, thereby improved the accuracy that SNP detects.
Because DNA sets up in the process of library and used pcr amplification, the initial segment of the DNA of given very few number can produce a large amount of copies, therefore can obtain the dna fragmentation of a large amount of consistent length.Yet the segment of these massive duplications has a significant impact the randomness in the order-checking process, and the order-checking degree of depth in some zone is not very desirable.This also may cause various easily and the appearance of the mistake obscured of heterozygosis allele site.Particularly the damage of DNA is by the redundant sequence coverage that may bring after the pcr amplification, and these are repeatably wrong to be difficult to and pleomorphism site distinguishes.Therefore, be provided with the point penalty rule that repeats at amplification in embodiments of the present invention.If DNA library and order-checking process all are at random, the distribution of the starting point of sequence also should be to obey Poisson distribution so.The degree of depth of a gene order-checking of Yan Di and Huang Di, two legendary rulers of remote antiquity is 36 times, and using the order-checking fragment length is 35bp, and the position on about 0.39% the chromosome has the fragments of the order-checking above 6 initial; Yet this ratio should be only to have 0.07% in theory.Therefore, in embodiments of the present invention, can reduce the influence that the order-checking of common reference position fragment is arranged according to an experimental formula.As use Illumina 1MBeadChip that sample of Yan Di and Huang Di, two legendary rulers of remote antiquity is done Genotyping, and checked the site of isozygotying.In theory, meet Poisson distribution through adjusting the wrong frequency that occurs of back order-checking, referring to Fig. 5.
Test above method with individual genome 36X sequencing data of Yan Di and Huang Di, two legendary rulers of remote antiquity, on Illumina 1MBeadChip, used same dna sample that concensus sequence is carried out Genotyping.Suppose that all Genotypings all are correct, discern and owe to discern classify Genotyping and the different site of order-checking by crossing.What is called is owed identification and looked for a heterozygosis site exactly less in the allele search procedure, and identification just is meant and has looked for an incorrect base type more excessively.
In step S203, according to this likelihood probability be that the default prior probability of every kind of genotype calculates every kind of genotypic posterior probability, the genotype that posterior probability is the highest is defined as the most possible correct genotype in this site of testing gene group, obtains the concensus sequence of testing gene group.Its detailed process does not repeat them here as mentioned above.
In step S204, by in the concensus sequence that detects the testing gene group with the inconsistent site of reference genome sequence, detect the pleomorphism site in the mononucleotide in the testing gene group.
What Fig. 6 showed is the coverage of concensus sequence on reference sequences, the coverage in Illumina 1M BeadChip somatotype site and the error rate of concensus sequence.If there is not quality to filter, the coverage of whole genome will be lower than the coverage in Genotyping site.Because Genotyping site and genomic feature are differentiated.Increase the quality limitations of Q0~Q40, the coverage in Genotyping site has small minimizing: drop to 98.75% from 98.98%; But when critical value establish higher the time, the coverage in Genotyping site reduces just apparent in viewly.This can explain that the result is low much with respect to other site massfraction with the prior probability in SNP site.Cross the probability discerning and owe to discern along with the increase of quality threshold (is owed identification: 0.046%Q0~0.024%Q20, crossed identification: 0.096%Q0~0.067%Q20) and continuous minimizing.Obtaining a quality threshold according to these data is Q20, with the balance coverage with owe to discern the identification probability.In embodiments of the present invention, used five other filtration steps with removing insecure part in some consensus sequences: 1) require minimum two the order-checking fragments of monoploid, minimum 4 the order-checking fragments of amphiploid chromosome cover; 2) overall depth comprises the covering of the order-checking fragment that repeats to compare, must be less than 100; 3) repeat number that is listed on the genome of local sequence is less than 2; 4) at least one two-way order-checking fragment is supported; 5) SNPs is at interval at least more than the 5bp.
Remove in some consensus sequences after insecure part by adopting above-mentioned filtration step, for the monoploid chromosome x, we check order to the male genome that only contains an X chromosome, thus the concensus sequence of X chromosome to search with the monoploid genome be identical.There are four kinds of different possible genotype in each site.In 37933 Illumina 1M BeadChip Genotyping sites of X chromosome, 99.59% site can well be covered, and Genotyping and order-checking have 99.96% common point, sees also table 1.2.On the X chromosome of concensus sequence in the reference genome 88.07% coverage is arranged.The chromosomal region of covering not yet in effect is highly to repeat, and almost do not have can unique comparison the order-checking fragment.Y chromosome mainly is made up of repetitive sequence and reference sequences is assembled to such an extent that be not fine, so we do not discuss to it.
For the amphiploid autosome, in order to estimate the accuracy that concensus sequence and SNP detect, we contrast concensus sequence and NCBI reference sequences that all autosomes of Yan Di and Huang Di, two legendary rulers of remote antiquity assemble, find that concensus sequence has 92.25% covering at the autosome of reference sequences, on Illumina 1M BeadChip Genotyping site, 99.22% covering is arranged, wherein nearly 99.92% site is consistent, sees also table 1.2.Homozygous gene somatotype site has 0.062% quilt order-checking to think heterozygote.
Table 1.2
Be the identification accuracy of assessment this method, cross at part and use traditional Sanger sequencing technologies to measure again after the SNPs that discerns adopts pcr amplification SNP.In 57 tests, 49 (86.0%) individual base types are consistent with the result that chip draws.
In sum, the SNP detection method that the embodiment of the invention provides and the result of gene typing chips have very high consistance, and for the site of crossing identification, GA sequencing technologies degree of accuracy is higher.
The above only is preferred embodiment of the present invention, not in order to restriction the present invention, all any modifications of being done within the spirit and principles in the present invention, is equal to and replaces and improvement etc., all should be included within protection scope of the present invention.