CN103114150A

CN103114150A - Single nucleotide polymorphism site identification method based on digestion library-establishing and sequencing and Bayesian statistics

Info

Publication number: CN103114150A
Application number: CN2013100775091A
Authority: CN
Inventors: 陶晔; 钱刚; 郑泽群; 胡秋萍
Original assignee: SHANGHAI MAJORBIO PHARM TECHNOLOGY Co Ltd
Current assignee: SHANGHAI MAJORBIO PHARM TECHNOLOGY Co Ltd
Priority date: 2013-03-11
Filing date: 2013-03-11
Publication date: 2013-05-22
Anticipated expiration: 2033-03-11
Also published as: CN103114150B

Abstract

The invention discloses a method for identifying single nucleotide polymorphism (SNP) sites based on restriction-digestion library construction, sequencing, and Bayesian statistics. The method processes RAD (restriction-site associated DNA) sequencing data, searches for candidate SNPs on RAD sequencing fragments, and assesses the reliability of each SNP with a bioinformatic analysis based on Bayesian statistics. It can be applied to both model and non-model organisms, which removes the limitation that many species lack a reference sequence and reduces sequencing cost. It also addresses the current lack of a reliable statistical method for SNP identification from RAD data, so that the accuracy of the resulting SNP sites is greatly improved.

Description

Method for identifying single nucleotide polymorphism sites based on restriction-digestion library sequencing and Bayesian statistics

Technical field

The present invention relates to a method for identifying single nucleotide polymorphism (SNP) sites based on restriction-digestion library sequencing and Bayesian statistics. Specifically, it applies a dedicated Bayesian statistical test to SNP sites obtained from single-end or paired-end sequencing of restriction-digestion libraries, so that SNP genotypes are identified accurately. The method gives SNP testing a reliable statistical basis even when no reference genome sequence is available. It belongs to the field of bioinformatics and is of particular value for research on non-model organisms that lack a reference sequence and for studies of genotyping accuracy.

Background technology

A single nucleotide polymorphism (SNP) is a variation of a single nucleotide in the genome. SNPs are very numerous: in the human genome, which is rich in polymorphism, there is on average about one SNP per 1,000 bases. SNPs are ideal markers for many kinds of molecular biology research, such as genetic map construction, genotyping, molecular marker analysis, disease prediction, and medication guidance.

Second-generation DNA sequencing is a low-cost, high-throughput sequencing technology whose basic principle is sequencing by synthesis. Taking Solexa sequencing as an example, the DNA is first randomly fragmented by a physical method, and adapters carrying the amplification primer sequences are then added to both ends of each fragment. During sequencing, DNA polymerase synthesizes the complementary strand of the fragment, and the base sequence is read by detecting the fluorescent signal carried by each newly incorporated base, which yields the sequence of the fragment.

Second-generation sequencing has been widely used in many areas of bioscience, particularly in studying polymorphism among individuals of a species. The traditional way to find SNP markers is to sequence an individual to obtain short reads, then map those reads back to a reference sequence with short-read alignment software to obtain the SNP information of the sequenced individual. Common pipelines (the general procedure is shown in Fig. 1) include: mapping reads to the reference with BWA and processing the alignments with SAMtools to find SNP sites [1,2]; or mapping reads to the reference with SOAP and processing the alignments with SOAPsnp to find SNP sites [3,4]. For species with a reference sequence, SNP markers can be found easily, but non-model organisms mostly have no reference sequence, and without one the traditional approach runs into a technical bottleneck.

RAD sequencing uses a new library construction approach (restriction-digestion library construction); the procedure is shown in Fig. 2. The DNA is cut at specific sites with a restriction enzyme, the digested DNA is randomly fragmented by a physical method, DNA molecules of a specific length are selected by agarose gel separation, and specific amplification adapters and sequencing adapters are added to the ends of the selected DNA. The resulting library is then loaded on the instrument for high-throughput sequencing.

RAD sequencing is well known in the art; see, for example, references [5,6,7,8]. RAD sequencing has been used successfully to identify SNP sites in many fields, but before the present invention the methods used generally relied on empirical thresholds for screening and filtering. For example, in reference [6] a site is judged heterozygous when the ratio of the depths of its two bases (lower-depth base : higher-depth base) lies in the range 0.25-0.75, and homozygous when the ratio is below 0.1. This approach has no statistical basis and is strongly affected by external factors such as total sequencing volume, so the accuracy of the resulting SNP genotypes cannot be guaranteed. Reference [9] improves on the empirical threshold approach by using maximum likelihood to correct the genotype at each site, but its main difficulty is determining the error rate used in the statistical model.
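For comparison with the Bayesian approach, the empirical threshold rule quoted from reference [6] can be written in a few lines. This is only an illustrative sketch: the function name `call_by_threshold` and the label "undetermined" for ratios outside both ranges are assumptions, not part of the cited work.

```python
def call_by_threshold(depth_major: int, depth_minor: int) -> str:
    """Classify a site from the ratio lower-depth base : higher-depth base.

    Ratio in [0.25, 0.75] -> heterozygous; ratio < 0.1 -> homozygous.
    Any other ratio is reported as "undetermined" (an assumption of this sketch).
    """
    if depth_major <= 0 or depth_minor > depth_major:
        raise ValueError("depth_major must be positive and >= depth_minor")
    ratio = depth_minor / depth_major
    if 0.25 <= ratio <= 0.75:
        return "heterozygous"
    if ratio < 0.1:
        return "homozygous"
    return "undetermined"
```

The rule depends only on a depth ratio, so it carries no measure of confidence: a site covered by 4 reads and one covered by 400 reads are treated alike.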

References:

1. Li, H. and R. Durbin, Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 2009. 25(14): p. 1754-60.

2. Li, H., et al., The Sequence Alignment/Map format and SAMtools. Bioinformatics, 2009. 25(16): p. 2078-9.

3. Li, R., et al., SNP detection for massively parallel whole-genome resequencing. Genome Res, 2009. 19(6): p. 1124-32.

4. Li, R., et al., SOAP: short oligonucleotide alignment program. Bioinformatics, 2008. 24(5): p. 713-4.

5. Houston, R.D., et al., Characterisation of QTL-linked and genome-wide restriction site-associated DNA (RAD) markers in farmed Atlantic salmon. BMC Genomics, 2012. 13(1): p. 244.

6. Scaglione, D., et al., RAD tag sequencing as a source of SNP markers in Cynara cardunculus L. BMC Genomics, 2012. 13(1): p. 3.

7. Davey, J.W., et al., Special features of RAD Sequencing data: implications for genotyping. Mol Ecol, 2012.

8. Dasmahapatra, K.K., et al., Butterfly genome reveals promiscuous exchange of mimicry adaptations among species. Nature, 2012.

9. Hohenlohe, P.A., et al., Population genomics of parallel adaptation in threespine stickleback using sequenced RAD tags. PLoS Genet. 6(2): p. e1000862.

Summary of the invention

The object of the present invention is to provide a method for identifying SNP sites based on restriction-digestion library sequencing and Bayesian statistics: a technical solution that processes sequencing data obtained by restriction-digestion library sequencing (RAD sequencing), searches for SNP sites within an individual or between individuals, and subjects them to a statistical test.

The object of the invention is achieved by the following technical solution:

A method for identifying single nucleotide polymorphism sites based on restriction-digestion library sequencing and Bayesian statistics, comprising the following steps:

1) After the RAD high-throughput sequencing results are obtained, filter the reads from the enzyme-cut end to remove unqualified reads.

The RAD high-throughput sequencing technology may be Illumina GA sequencing or any other existing high-throughput sequencing technology.

Unqualified reads are reads in which the number of bases with sequencing quality below a predetermined low-quality threshold exceeds 50% of the total number of bases in the read, and reads that lack the restriction-cut signature.

2) From the enzyme-cut end reads of each sequenced individual's genome, use sequence identity to generate the stacks of each individual, and calculate the sequencing depth of each stack. For example, the sequence of the enzyme-cut end of each filtered read is used as the key of a hash table, and the hash value points to a linked list that stores the sequences of the other end; the sequencing depth is calculated at the same time. This process can be implemented in any programming language.
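The hash-based stack construction described in step 2) might look like the following minimal sketch. The function name `build_stacks`, the input layout (pairs of enzyme-cut end sequence and other-end sequence), and the use of a Python list in place of a linked list are assumptions made for illustration.

```python
from collections import defaultdict


def build_stacks(read_pairs):
    """Group reads whose enzyme-cut end sequences are identical.

    read_pairs: iterable of (enzyme_end_seq, other_end_seq) tuples;
                other_end_seq may be None for single-end data.
    Returns a dict: enzyme_end_seq -> {"depth": int, "mates": [str, ...]}.
    """
    stacks = defaultdict(lambda: {"depth": 0, "mates": []})
    for enzyme_end, other_end in read_pairs:
        stack = stacks[enzyme_end]          # the hash key is the enzyme-cut end sequence
        stack["depth"] += 1                 # sequencing depth = number of identical reads
        if other_end is not None:
            stack["mates"].append(other_end)  # keep the other-end sequences with the stack
    return dict(stacks)
```

For example, `build_stacks([("AATTCGA", "TTT"), ("AATTCGA", "GGG"), ("AATTCCA", None)])` yields two stacks, the first with depth 2 and the second with depth 1.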

3) Compare all stacks within an individual pairwise, and cluster the stacks to determine candidate heterozygous SNP sites within the individual.

For the clustering result of step 3), a cluster containing only one stack indicates that there is no heterozygous site on the enzyme-cut end fragment, and a cluster containing exactly two stacks indicates that there is a heterozygous site on the enzyme-cut end fragment.
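The pairwise comparison and clustering of step 3) could be sketched as follows. The text does not state the similarity criterion, so this sketch assumes that two stacks of equal length belong to the same cluster when they differ at no more than `max_mismatch` positions (default 1) and that clusters are formed by single linkage; both choices, and the function names, are hypothetical.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def cluster_stacks(stack_seqs, max_mismatch=1):
    """Single-linkage clustering of stack sequences (assumed criterion)."""
    seqs = list(stack_seqs)
    parent = list(range(len(seqs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(len(seqs)), 2):
        if len(seqs[i]) == len(seqs[j]) and hamming(seqs[i], seqs[j]) <= max_mismatch:
            parent[find(i)] = find(j)       # merge the two clusters

    clusters = {}
    for i, s in enumerate(seqs):
        clusters.setdefault(find(i), []).append(s)
    return list(clusters.values())


def candidate_het_sites(stack_seqs, max_mismatch=1):
    """Keep only two-stack clusters and report the positions where they differ."""
    out = []
    for cluster in cluster_stacks(stack_seqs, max_mismatch):
        if len(cluster) == 2:
            a, b = cluster
            out.append((a, b, [k for k, (x, y) in enumerate(zip(a, b)) if x != y]))
    return out
```

Single-stack clusters yield no candidate, two-stack clusters yield a candidate heterozygous site, and larger clusters are dropped, which matches the within-individual rule stated in the claims.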

4) Compare all stacks from different individuals pairwise, and cluster the stacks to determine candidate SNP sites between individuals.

For the clustering result of step 4), a cluster containing only one stack indicates that there is no SNP site between the two individuals, and a cluster containing two stacks indicates that there is an SNP site between the individuals.

Because Bayesian statistics are applied downstream, no depth filtering is applied to any stack or cluster at this stage. This is one of the advantages of the method, since it retains as many SNP sites as possible.

5) Use Bayesian statistics to analyze the depth of each genotype at each candidate SNP site and assess the accuracy of the candidate SNP. Because there is no reference sequence and no prior probability for the occurrence of each base at each site, the error rate of the observed bases at each site cannot be obtained directly. The Bayesian method therefore steps exhaustively through all plausible error rates and selects, as the genotype of the SNP site, the case that gives the highest genotype probability. The formulas and calculation steps are as follows:

Each candidate SNP site may carry up to four possible bases, that is, any one or more of A, T, C, and G. The base type with the highest frequency (greatest depth) is defined as G1, with depth N1; the remaining bases are defined in decreasing order of depth, giving genotypes Gi (i = 1, 2, 3, 4) with depths Ni (i = 1, 2, 3, 4). Biologically, a given species generally shows only two base types at an SNP site. For example, if the sequencing data show A or T at the site (N_A ≥ N_T), the site must be either homozygous or heterozygous. The Bayesian method therefore considers only these two kinds of genotype. Given an error rate ε, a base belonging to the genotype is read with probability 1 - 0.75ε (homozygous) or 0.5 - 0.25ε (heterozygous, per allele), and each other base with probability 0.25ε, so that the four base probabilities sum to one. The conditional probabilities of the observed depths are:

P(N_{1}, N_{2}, N_{3}, N_{4} | G_{ii}, \epsilon) = \frac{N!}{N_{1}! N_{2}! N_{3}! N_{4}!} (1 - 0.75\epsilon)^{N_{i}} (0.25\epsilon)^{N - N_{i}}

P(N_{1}, N_{2}, N_{3}, N_{4} | G_{ij}, \epsilon) = \frac{N!}{N_{1}! N_{2}! N_{3}! N_{4}!} (0.5 - 0.25\epsilon)^{N_{i} + N_{j}} (0.25\epsilon)^{N - N_{i} - N_{j}}

where N = N1 + N2 + N3 + N4.

The posterior probability of each genotype is:

Because there is no reference sequence and no earlier research data, the error rate ε cannot be fixed at a single value, although references [10,11] report that the Illumina GA sequencing error rate is about 1%. Here ε is varied from 0.01% to 5% in steps of 0.01%. The final posterior probability is:

P(G_{ij})_{Final} = \max_{\epsilon} P(G_{ij}, \epsilon), \quad i, j \in \{1, 2, 3, 4\}, \quad \epsilon \in [0.01\%, 5\%]

If P(G_{ij})_{Final} is not less than 0.95, the genotype of the SNP site is taken to be ij; otherwise the site is recorded as undetermined (missing data).
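The error-rate scan and the 0.95 cut-off can be illustrated with a short sketch. It rests on several assumptions that are not stated in the text: only the homozygous genotype of the deepest base (G11) and the heterozygous genotype of the two deepest bases (G12) are compared; the two are given equal prior probability, since the posterior formula is not reproduced here; each base outside the genotype is assigned probability 0.25ε so that the four base probabilities sum to one; and the genotype with the larger maximal posterior is the one reported. The function name `call_genotype` is hypothetical. The multinomial coefficient is the same for both genotypes and cancels in the posterior, so it is omitted.

```python
import math


def call_genotype(depths, threshold=0.95):
    """Call a genotype at one candidate SNP site.

    depths: dict mapping base -> read depth, e.g. {"C": 9, "T": 3}.
    Returns (genotype, posterior, epsilon); genotype is "missing" when the
    best posterior stays below the threshold.
    """
    ranked = sorted(depths.items(), key=lambda kv: kv[1], reverse=True)
    ranked += [("-", 0)] * (4 - len(ranked))        # pad to four base types
    (b1, n1), (b2, n2) = ranked[0], ranked[1]
    n = sum(d for _, d in ranked)
    if n == 0:
        return "missing", 0.0, None

    best = {"hom": (0.0, None), "het": (0.0, None)}
    for k in range(1, 501):                         # epsilon = 0.01% ... 5%, step 0.01%
        eps = k / 10000
        log_err = math.log(0.25 * eps)              # assumed per-base error probability
        log_hom = n1 * math.log(1 - 0.75 * eps) + (n - n1) * log_err
        log_het = (n1 + n2) * math.log(0.5 - 0.25 * eps) + (n - n1 - n2) * log_err
        m = max(log_hom, log_het)                   # subtract the max for numerical stability
        w_hom, w_het = math.exp(log_hom - m), math.exp(log_het - m)
        post = {"hom": w_hom / (w_hom + w_het), "het": w_het / (w_hom + w_het)}
        for g in best:                              # keep the maximal posterior per genotype
            if post[g] > best[g][0]:
                best[g] = (post[g], eps)

    kind = max(best, key=lambda g: best[g][0])
    p, eps = best[kind]
    if p < threshold:
        return "missing", p, eps
    return (b1 + b1 if kind == "hom" else b1 + b2), p, eps
```

Under these assumptions a site with depths C|9:T|3 is called "CT" and a site with C|9 alone is called "CC", while a site covered by only three reads of a single base stays below 0.95 and is reported as missing. Low-depth sites are therefore handled by the posterior itself, without a separate depth filter.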

The technical solution of the present invention uses bioinformatic analysis to process RAD (restriction-site associated DNA) sequencing data, find SNP site information on RAD sequencing fragments, and identify SNP genotypes with a Bayesian method. This overcomes the bottleneck that non-model organisms lack reference sequences and yields accurate results at reduced cost. The invention is the first to introduce Bayesian statistics into the identification of SNP genotypes; compared with the earlier empirical threshold methods, the statistical basis is significantly improved and accuracy rises accordingly.

References:

10. Li, Y., et al., State of the art de novo assembly of human genomes from massively parallel sequencing data. Hum Genomics, 2010. 4(4): p. 271-7.

11. Xie, W., et al., Parent-independent genotyping for constructing an ultrahigh-density linkage map based on population sequencing. Proc Natl Acad Sci U S A. 107(23): p. 10578-83.

Description of drawings

Fig. 1 is a schematic diagram of the traditional procedure for finding SNP sites.

Fig. 2 is a schematic diagram of the RAD sequencing procedure. In the figure: (A) genomic DNA is digested with a restriction enzyme and P1 adapters are added, each P1 adapter containing a different sequence tag; (B) samples carrying different P1 adapters are pooled and fragmented; (C) the P2 adapter is added; (D) RAD tags are enriched by amplification.

Fig. 3 is an illustration of RAD sequencing.

Fig. 4 illustrates the clustering process that generates stacks; the example in the figure uses the EcoRI restriction enzyme.

Fig. 5 is a schematic diagram of the enzyme-cut end sequence information stored in a stack.

Fig. 6 is a schematic flow chart of the search for SNP sites from stacks within an individual and between individuals.

Fig. 7 is an illustration of candidate SNP base types and depths, showing the base types and depths at 20 candidate SNP sites in each of 15 individuals. "C|9" means that C was read 9 times at the site, and "C|9:T|3" means that C was read 9 times and T was read 3 times.

Fig. 8 is an illustration of the SNP genotype results after Bayesian analysis. In the figure, "a" and "b" denote two different homozygous genotypes and "h" denotes the heterozygous genotype; for example, "a" denotes AA, "b" denotes CC, and "h" denotes AC. In "x1:x2:x3:x4", x1 is the posterior probability of the homozygous genotype, x2 is the error rate at which that posterior probability is greatest, x3 is the posterior probability of the heterozygous genotype, and x4 is the error rate at which that posterior probability is greatest.

Fig. 9 shows the missing data after applying the empirical threshold method and the Bayesian method.

Fig. 10 shows the statistics of the sites at which the empirical threshold method and the Bayesian method differ.

Fig. 11 shows the Sanger sequencing verification of randomly selected sites at which the empirical threshold method and the Bayesian method differ.
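The per-site notation described for Fig. 7 ("C|9:T|3") is simple to parse into a base-to-depth mapping. The sketch below is illustrative only; the function name `parse_site` is hypothetical, and how missing sites are encoded in the actual data is not stated, so that case is not handled.

```python
def parse_site(cell: str) -> dict:
    """Parse a Fig. 7 style entry such as "C|9:T|3" into {"C": 9, "T": 3}."""
    depths = {}
    for item in cell.split(":"):        # entries for different bases are separated by ":"
        base, depth = item.split("|")   # each entry is "<base>|<depth>"
        depths[base] = int(depth)
    return depths
```

For example, `parse_site("C|9:T|3")` gives `{"C": 9, "T": 3}`, which is the form a genotype-calling routine would take as input.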

Embodiment

The technical features of the present invention are further described below with reference to the drawings and a specific embodiment.

As shown in Fig. 2, RAD sequencing differs from conventional high-throughput sequencing in that the genome must be completely digested with a restriction enzyme before adapters are added, after which RAD-specific adapters are added; the remaining library construction steps are the same as in conventional library construction. Fig. 3 illustrates sequencing of the RAD enzyme-cut end. The example in Fig. 3 uses the restriction enzyme EcoRI, which recognizes the palindromic sequence "G^AATTC" on the DNA molecule and cuts between G and A. The digested DNA is broken into short fragments by a physical method, adapters are added to both ends of the DNA fragments that contain the enzyme-cut end, and the fragments are enriched by PCR and sequenced single-end. The read length is generally 100 nt, and may also be 50 nt.

The method for identifying single nucleotide polymorphism sites based on restriction-digestion library sequencing and Bayesian statistics comprises the following steps:

1) After the RAD high-throughput sequencing results are obtained, filter the RAD paired-end reads to remove unqualified reads.

The low-quality threshold depends on the specific sequencing technology and sequencing environment; for example, a single base is considered low quality when its sequencing quality is below 20. A read is considered unqualified if the number of undetermined bases in it (such as N in Illumina GA results) exceeds 10% of the total number of bases in the read. Apart from the sample adapter sequence, reads are compared against exogenous sequences introduced by other experimental steps, such as various adapter sequences; a read containing an exogenous sequence is considered unqualified. For enzyme-cut end reads, a read whose first few bases are not the enzyme-cut end sequence is filtered out (for the restriction enzyme EcoRI, the whole read is discarded if it does not start with "AATTC").
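The filtering criteria of step 1) can be combined into a single predicate, as in the following sketch. The function name `passes_filter`, its parameter names, and the representation of base qualities as a list of integer scores are assumptions; the default values (quality 20, 50% low-quality bases, 10% N, "AATTC" for EcoRI) are the examples given in the text.

```python
def passes_filter(seq, quals, site="AATTC", min_q=20,
                  max_lowq_frac=0.5, max_n_frac=0.1, adapters=()):
    """Return True if an enzyme-cut end read passes all quality criteria.

    seq:      read sequence (str)
    quals:    per-base quality scores (sequence of int, same length as seq)
    adapters: exogenous sequences (e.g. adapters) that must not occur in the read
    """
    if not seq or len(seq) != len(quals):
        return False
    low_quality = sum(q < min_q for q in quals)
    if low_quality > max_lowq_frac * len(seq):      # too many low-quality bases
        return False
    if seq.count("N") > max_n_frac * len(seq):      # too many undetermined bases
        return False
    if any(a in seq for a in adapters):             # exogenous sequence present
        return False
    return seq.startswith(site)                     # must begin with the enzyme-cut end
```

With these defaults, `passes_filter("AATTCGGA", [30] * 8)` is accepted, while the same read with five of its eight bases below quality 20, or a read that does not begin with "AATTC", is rejected.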

2) From the enzyme-cut end reads of each sequenced individual's genome, use sequence identity to generate the stacks of each individual. The procedure is shown in Fig. 4. The enzyme-cut end sequence information in a stack can be stored as in Fig. 5, where the first column is the enzyme-cut end sequence; the second column is the number of times that sequence was sequenced, that is, the sequencing depth; and the third column is the ID of the stack, which uniquely identifies it.

3) Compare the stacks within an individual. If clustering does not produce the situation shown in Fig. 6, there is no heterozygous SNP on these sequences within the individual; if it does, there is a heterozygous SNP within the individual.

4) When stacks from different individuals are compared pairwise, the situation in Fig. 6(a) indicates a homozygous SNP between the two individuals, and the situation in Fig. 6(b) indicates a heterozygous SNP between the two individuals.

5) After stack clustering and comparison are complete, the base types and corresponding depths at each candidate SNP site are obtained; Fig. 7 summarizes this information.

6) Analyze the base types and depths with the Bayesian method to obtain the final SNP genotype results, as in Fig. 8.

Example data:

A project to construct a genetic map from a bottle gourd F2 population, comprising 139 F2 plants and the two parents; all 141 individuals were sequenced by RAD-PE. (Note: the offspring of the cross between the male and female parents are the F1, and the offspring of F1 selfing are the F2. Although RAD-PE sequencing was used, only the enzyme-cut end sequences were analyzed.)

Source of material: Zhejiang Academy of Agricultural Science.

Example workflow:

The sequencing data obtained from RAD-PE sequencing of the two parents were filtered by sequencing quality value, N content, and presence of the enzyme-cut end sequence to remove unqualified reads. Statistics for the resulting valid data are shown in Table 1.

Table 1: Valid-data statistics for bottle gourd RAD sequencing

Sample	Data used (bp)	Sample	Data used (bp)	Sample	Data used (bp)
Male parent	585,377,540	F2-46	3,800,775	F2-93	1,556,104
Female parent	423,794,746	F2-47	2,522,407	F2-94	1,651,259
F2-1	3,114,771	F2-48	4,636,152	F2-95	3,213,147
F2-2	2,302,730	F2-49	3,737,623	F2-96	2,202,354
F2-3	537,822	F2-50	647,499	F2-97	1,956,440
F2-4	1,650,925	F2-51	3,678,334	F2-98	1,112,431
F2-5	2,824,708	F2-52	2,153,996	F2-99	1,086,168
F2-6	579,177	F2-53	7,029,889	F2-100	1,705,836
F2-7	2,805,093	F2-54	2,315,687	F2-101	2,311,919
F2-8	2,234,442	F2-55	4,116,520	F2-102	1,445,671
F2-9	1,814,510	F2-56	1,554,335	F2-103	5,292,536
F2-10	2,063,581	F2-57	1,912,949	F2-104	371,736
F2-11	437,393	F2-58	4,513,981	F2-105	5,528,190
F2-12	1,114,627	F2-59	4,660,158	F2-106	2,286,908
F2-13	292,168	F2-60	2,963,600	F2-107	2,977,378
F2-14	1,981,379	F2-61	1,912,346	F2-108	1,113,885
F2-15	1,710,808	F2-62	2,198,112	F2-109	2,358,577
F2-16	1,666,317	F2-63	2,149,228	F2-110	2,014,988
F2-17	3,837,185	F2-64	2,907,400	F2-111	5,021,837
F2-18	2,705,794	F2-65	2,565,399	F2-112	1,687,183
F2-19	1,641,718	F2-66	1,802,757	F2-113	1,454,774
F2-20	4,181,837	F2-67	6,136,789	F2-114	1,187,993
F2-21	2,167,926	F2-68	5,106,060	F2-115	917,204
F2-22	8,967	F2-69	5,492,357	F2-116	673,176
F2-23	1,936,761	F2-70	4,925,717	F2-117	903,357
F2-24	4,907,028	F2-71	2,016,103	F2-118	1,252,469
F2-25	2,641,269	F2-72	4,495,767	F2-119	1,066,660
F2-26	1,344,809	F2-73	957,643	F2-120	83,426
F2-27	2,184,764	F2-74	3,193,347	F2-121	624,005
F2-28	2,312,351	F2-75	30,335	F2-122	4,246,910
F2-29	1,318,322	F2-76	2,067,906	F2-123	824,013
F2-30	1,830,247	F2-77	200,856	F2-124	3,322,863
F2-31	358,911	F2-78	6,978,303	F2-125	89,336
F2-32	1,450,039	F2-79	5,309,200	F2-126	367,005
F2-33	1,767,194	F2-80	3,081,537	F2-127	1,707,758
F2-34	1,587,589	F2-81	3,071,091	F2-128	2,385,919
F2-35	1,008,970	F2-82	1,223,914	F2-129	2,786,068
F2-36	2,877,974	F2-83	5,586,662	F2-130	890,661
F2-37	1,268,211	F2-84	1,880,660	F2-131	1,980,472
F2-38	6,0??,973	F2-85	2,620,672	F2-132	3,920,370
F2-39	3,219,262	F2-86	5,992,662	F2-133	500,349
F2-40	2,241,103	F2-87	5,636,602	F2-134	2,150,765
F2-41	3,610,730	F2-88	544,490	F2-135	1,115,366
F2-42	1,641,063	F2-89	4,897,907	F2-136	1,306,934
F2-43	2,382,923	F2-90	2,355,593	F2-137	616,728
F2-44	2,598,428	F2-91	2,506,271	F2-138	949,308
F2-45	692,675	F2-92	841,285	F2-139	1,014,677

From the enzyme-cut end reads of each sequenced individual's genome, sequence identity was used to generate the stacks of the two parents separately. 3,913 candidate SNP markers were obtained for the F2 population; the base types and corresponding depths resemble Fig. 7 (the data volume is too large to show). The candidate SNP markers were genotyped with both the commonly used empirical threshold method (depth not less than 6, heterozygosity criterion 0.25-0.75) and the Bayesian method. Missing-data statistics for the two sets of results, shown in Fig. 9, indicate that the two methods differ, but not significantly. When the genotype at each site was analyzed, however, 111,005 sites were called homozygous by the Bayesian method but heterozygous by the empirical threshold method, and 79,401 sites were called heterozygous by the Bayesian method but homozygous by the empirical threshold method. These two kinds of site are referred to as uncertain sites. One hundred randomly selected uncertain sites were sequenced by the Sanger method; the Bayesian call was correct at 77% of them, significantly more than the empirical threshold method (three examples are shown in Fig. 11).

These results show that the Bayesian method is a reliable statistical tool for SNP identification from RAD sequencing data.

Claims

1. A method for identifying single nucleotide polymorphism sites based on restriction-digestion library sequencing and Bayesian statistics, comprising the following steps:

1) after the RAD high-throughput sequencing results are obtained, filtering the RAD reads to remove unqualified reads;

2) using sequence identity to generate the sequence stacks of each individual, and calculating the sequencing depth of each stack;

3) comparing all stacks within an individual pairwise, and clustering the stacks to determine candidate heterozygous SNP sites within the individual;

4) comparing all stacks from different individuals pairwise, and clustering the stacks to determine candidate SNP sites between individuals;

5) using Bayesian statistics to analyze the depth of each genotype at each candidate SNP site and assess the accuracy of the candidate SNP, for use in subsequent population analysis, experiments, or similar work.

2. The method for identifying single nucleotide polymorphism sites based on restriction-digestion library sequencing and Bayesian statistics according to claim 1, characterized in that: in step 1), the RAD high-throughput sequencing technology is Illumina GA sequencing; the default sequencing type is single-end sequencing, and for paired-end sequencing the single nucleotide polymorphism site analysis is performed only on the enzyme-cut end.

3. The method for identifying single nucleotide polymorphism sites based on restriction-digestion library sequencing and Bayesian statistics according to claim 1, characterized in that: in step 1), the unqualified reads are reads in which the number of bases with sequencing quality below a predetermined low-quality threshold exceeds 50% of the total number of bases in the read, and reads whose start lacks the restriction-cut signature sequence.

4. The method for identifying single nucleotide polymorphism sites based on restriction-digestion library sequencing and Bayesian statistics according to claim 1, characterized in that: in step 3), when the stacks are compared with one another, only clusters of two different stacks are allowed, and clusters containing more than two stacks are filtered out.

5. The method for identifying single nucleotide polymorphism sites based on restriction-digestion library sequencing and Bayesian statistics according to claim 1, characterized in that: in step 4), all stacks obtained from the comparison are retained, and no processing is applied to clusters containing more than two stacks.