CN104217135A - Optimized overlapping hybrid sequencing method - Google Patents


Info

Publication number
CN104217135A
Authority
CN
China
Prior art keywords
order
checking
mixing
depth
sequencing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410462490.7A
Other languages
Chinese (zh)
Inventor
孙啸
曹唱唱
李成
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University
Priority to CN201410462490.7A
Publication of CN104217135A
Legal status: Pending


Abstract

The invention discloses an optimized overlapping hybrid (pooled) sequencing method comprising the following steps. First, based on the general rule that sequencing depth follows a negative binomial distribution and sequencing errors follow a binomial distribution, a depth model for pooled sequencing is proposed, and the optimal pooled sequencing depth is calculated from this model, so that sequencing cost is reduced by eliminating redundant depth. Second, a grouped overlapping pooled sequencing method based on the distribution probability of rare mutations is proposed; compared with direct sequencing, the grouping strategy greatly reduces the amount of sequencing data required and increases the efficiency of pooled sequencing. Third, a sequencing cost model is established, and the optimal overlapping pooled sequencing scheme for screening rare mutation carriers is chosen on the basis of this model. The method reduces the sequencing cost of screening for rare mutation carriers to the greatest extent.

Description

An optimized overlapping pooled sequencing method
Technical field
The invention belongs to the field of gene sequencing, and in particular relates to an optimized overlapping pooled sequencing method.
Background art
Using high-throughput DNA sequencing to analyze the relationship between genetic variation and human disease is an important method in biomedical research, and screening for and detecting rare DNA mutations is a current research focus. To find rare mutations in the human genome and explore their relationship with disease, DNA samples from a large number of individuals must be sequenced and analyzed. To improve sequencing efficiency and make full use of the capacity of existing sequencers, multiple samples need to be mixed and sequenced together, which is known as pooled sequencing.
The key problem in pooled sequencing is how to separate, in the sequencing results, the DNA reads that come from different samples, so as to identify the carriers of a rare mutation (the positive samples). The conventional approach attaches a unique DNA barcode to each sample before sequencing; after sequencing, the barcode on each read identifies the sample it came from, and the sequencing results then show which samples are positive. Another approach mixes the samples into different pools in an overlapping manner and sequences each pool separately; the positive samples are then determined from the pattern in which each sample appears across the pools (the overlapping pooling pattern) together with the sequencing result of each pool.
In this overlapping pooled sequencing method, samples are mixed according to a combinatorial rule and then sequenced. Overlapping pooled sequencing can identify the carriers of a rare mutation among a large number of samples with very few pooled sequencing runs, which reduces both the workload of sequencing library preparation and the total cost of sequencing.
However, existing overlapping sequencing methods leave the following questions open. What sequencing depth both guarantees accurate identification of positive samples and minimizes sequencing cost? How many pools are needed? How should each sample be assigned to the pools in an overlapping manner? How should the best sequencing scheme be selected?
Summary of the invention
Object of the invention: to provide an optimized overlapping pooled sequencing method that solves the above problems of the prior art, optimizes the sequencing process, and improves sequencing efficiency.
Technical solution: an optimized overlapping pooled sequencing method, comprising the following steps:
Step 1: calculate the optimal sequencing depth according to a pooled sequencing depth model, divide the large number of samples into groups for overlapping pooled sequencing, and select the best sequencing scheme according to a sequencing cost model;
wherein the optimal sequencing depth is the minimum sequencing depth that satisfies the false-positive and false-negative error requirements, calculated on the general rule that sequencing depth follows a negative binomial distribution and sequencing errors follow a binomial distribution;
Grouped overlapping pooled sequencing: the large sample set is divided into several groups, the possible number of rare mutation carriers in each group is calculated from the known probability of the rare mutation, and each group then undergoes independent overlapping pooled sequencing;
Establishing a reasonable sequencing cost model: the costs of both library preparation and sequencing data are considered, the cost of each overlapping pooled sequencing scheme is calculated according to the cost model, and the optimal overlapping pooled sequencing scheme is selected;
Step 2: use the above method to carry out a high-throughput sequencing experiment that screens for rare mutation carriers among a large number of samples.
The calculation model for the optimal sequencing depth is as follows.
Assume that the sequencing depth follows the negative binomial distribution
$$P(N_r) = \mathrm{NB}\left(N_r;\ \frac{D}{r-1},\ \frac{1}{r}\right)$$
where D is the average sequencing depth, N_r is the number of times a given position on the genome is sequenced, r is the parameter of the negative binomial distribution, which depends on the sequencing platform and the sequencing target, and NB denotes the negative binomial distribution:
$$\mathrm{NB}(k;\ m,\ p) = \binom{k+m-1}{k}\, p^{k} (1-p)^{m}, \qquad k = 0, 1, 2, \ldots$$
That is, the probability of observing k successes in a series of independent trials that continues until the m-th failure, where m is the number of failures and p is the probability of success in each trial;
Assume further that sequencing errors follow the binomial distribution
$$P(E \mid N_r) = \mathrm{Bin}(E;\ N_r,\ p_{error})$$
where E is the number of sequencing errors, p_error is the average sequencing error rate, and Bin denotes the binomial distribution:
$$\mathrm{Bin}(k;\ m,\ p) = \binom{m}{k}\, p^{k} (1-p)^{m-k}, \qquad k = 0, 1, 2, \ldots, m$$
That is, the probability of exactly k successes in m independent trials, where p is the probability of success in each trial;
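The two probability mass functions above can be evaluated directly. The following is a minimal Python sketch, not part of the patent: the function names `nb_pmf` and `bin_pmf` and the numerical values of D, r and p_error are illustrative assumptions. It follows the formulas exactly as printed and uses the log-gamma function because the first NB parameter, D/(r-1), is generally not an integer.

```python
import math

def nb_pmf(k: int, m: float, p: float) -> float:
    """NB(k; m, p) = C(k+m-1, k) * p^k * (1-p)^m, as written in the text.

    m may be non-integer, so the binomial coefficient is evaluated as
    Gamma(k+m) / (Gamma(k+1) * Gamma(m)) through log-gamma.
    """
    log_coeff = math.lgamma(k + m) - math.lgamma(k + 1) - math.lgamma(m)
    return math.exp(log_coeff + k * math.log(p) + m * math.log(1.0 - p))

def bin_pmf(k: int, m: int, p: float) -> float:
    """Bin(k; m, p) = C(m, k) * p^k * (1-p)^(m-k)."""
    if k < 0 or k > m:
        return 0.0
    return math.comb(m, k) * p**k * (1.0 - p) ** (m - k)

# Hypothetical illustration values (assumptions, not taken from the source).
D, r, p_error = 30.0, 2.0, 0.01

# Distribution of the per-position depth N_r, truncated at 400 reads.
depth_probs = [nb_pmf(n, D / (r - 1.0), 1.0 / r) for n in range(400)]
total = sum(depth_probs)

# Probability of exactly one sequencing error among 30 reads.
one_error = bin_pmf(1, 30, p_error)
```

The value r = 2 is chosen here because p = 1 - p = 1/2 in that case, so the result does not depend on which success/failure convention is intended for NB.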
Let the observation threshold be T: if at least T reads carrying the rare mutation are observed, the pooled sample is judged to contain a sample carrying the rare mutation; otherwise the pooled sample is judged to consist entirely of normal samples. On this basis, the probabilities of a false-positive error F_P and a false-negative error F_N when judging a pool are:
$$F\_P = \sum_{N_r = T}^{\infty} \mathrm{NB}\left(N_r;\ \frac{D}{r-1},\ \frac{1}{r}\right) \sum_{E = T}^{N_r} \mathrm{Bin}(E;\ N_r,\ p_{error})$$
where D is the pooled sequencing depth, N_r is the number of times a given position on the genome is sequenced, E is the number of sequencing errors, p_error is the average sequencing error rate, and r is the parameter of the negative binomial distribution.
$$F\_N = \sum_{O = 0}^{T-1} \sum_{x = 0}^{O} \left[ \sum_{i = x}^{\infty} \mathrm{NB}\left(i;\ \frac{(1-p)D}{r-1},\ \frac{1}{r}\right) \mathrm{Bin}(x;\ i,\ p_{error}) \right] \times \left[ \sum_{j = O-x}^{\infty} \mathrm{NB}\left(j;\ \frac{pD}{r-1},\ \frac{1}{r}\right) \mathrm{Bin}(j-O+x;\ j,\ p_{error}) \right]$$
where p is the proportion of chromosomes in the pool that carry the rare mutation, O is the observed number of reads carrying the mutation, x is the number of mutation-carrying reads that come from normal individuals, i and j are the numbers of reads that come from normal individuals and from mutation carriers respectively, D is the pooled sequencing depth, and p_error is the average sequencing error rate.
Given that the pool misjudgment rate tolerated by the overlapping pooled sequencing design is α, the optimal pooled sequencing depth D_optimal is set as
$$D_{optimal} = \min\{\, D \mid F\_N(D, T) \le \alpha \ \&\ F\_P(D, T) \le \alpha,\ T \in [1, D] \,\}$$
and the corresponding observation threshold T is calculated as
$$T = \min\{\, T \mid F\_N(D_{optimal}, T) \le \alpha \ \&\ F\_P(D_{optimal}, T) \le \alpha \,\}.$$
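The search for D_optimal and T can be sketched numerically. The Python below is an illustration under stated assumptions rather than the patent's implementation: the function names, the restriction of D to integers, the truncation of the infinite sums at `n_max`, and the example inputs (p = 0.5, r = 2) are all choices made for this sketch. The NB formula is used as printed; r = 2 is used in the example because the result then does not depend on the success/failure convention of NB. F_P is computed as the complement 1 - P(E <= T-1), which equals the double sum in the text.

```python
import math

def nb_pmf(k, m, p):
    """NB(k; m, p) = C(k+m-1, k) p^k (1-p)^m, with real-valued m via log-gamma."""
    log_c = math.lgamma(k + m) - math.lgamma(k + 1) - math.lgamma(m)
    return math.exp(log_c + k * math.log(p) + m * math.log(1.0 - p))

def bin_pmf(k, m, p):
    """Bin(k; m, p) = C(m, k) p^k (1-p)^(m-k)."""
    if k < 0 or k > m:
        return 0.0
    return math.comb(m, k) * p**k * (1.0 - p) ** (m - k)

def mixed_pmf(count, m, r, rate, n_max):
    """Sum over n of NB(n; m, 1/r) * Bin(count; n, rate), truncated at n_max."""
    return sum(nb_pmf(n, m, 1.0 / r) * bin_pmf(count, n, rate)
               for n in range(count, n_max + 1))

def error_rates(D, T, p, p_error, r):
    """Return (F_P, F_N) for depth D and threshold T; needs 0 < p < 1 and r > 1."""
    m_all, q = D / (r - 1.0), 1.0 / r
    # Truncate the infinite sums well beyond the mean of NB(.; m_all, q).
    mean, sd = m_all * q / (1.0 - q), math.sqrt(m_all * q) / (1.0 - q)
    n_max = int(mean + 12.0 * sd + 30.0)
    # F_P = P(E >= T) in a carrier-free pool, computed as 1 - P(E <= T-1).
    f_p = 1.0 - sum(mixed_pmf(e, m_all, r, p_error, n_max) for e in range(T))
    # x: mutant-looking reads from normal chromosomes (sequencing errors).
    normal = [mixed_pmf(x, (1.0 - p) * m_all, r, p_error, n_max) for x in range(T)]
    # y = O - x: carrier reads read correctly; Bin(j-y; j, p_error) = Bin(y; j, 1-p_error).
    mutant = [mixed_pmf(y, p * m_all, r, 1.0 - p_error, n_max) for y in range(T)]
    f_n = sum(normal[x] * mutant[o - x] for o in range(T) for x in range(o + 1))
    return f_p, f_n

def optimal_depth(p, p_error, alpha, r=2.0, d_max=200):
    """Smallest integer D, with its smallest T in [1, D], meeting both error bounds."""
    for D in range(1, d_max + 1):
        for T in range(1, D + 1):
            f_p, f_n = error_rates(D, T, p, p_error, r)
            if f_p <= alpha and f_n <= alpha:
                return D, T
            if f_n > alpha:  # F_N only grows with T, so a larger T cannot help
                break
    return None

if __name__ == "__main__":
    # Hypothetical inputs: one heterozygous diploid carrier alone in a pool (p = 0.5).
    print(optimal_depth(p=0.5, p_error=0.01, alpha=0.01))
```

Scanning T upward and stopping as soon as F_N exceeds α keeps the search cheap, because F_N is a sum of non-negative terms over O < T and therefore never decreases as T grows.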
The grouped overlapping pooled sequencing is carried out as follows.
The samples are divided into B groups, the possible number of rare mutation carriers in each group is calculated from the hypergeometric or the binomial distribution, and an overlapping pooled sequencing scheme is designed independently for each group. The probability p_B that the number of rare mutation carriers in a group does not exceed d_B can be calculated with either of the following two formulas:
$$p_B = \sum_{i=0}^{d_B} \binom{n_B}{i}\, p_v^{\,i} (1-p_v)^{\,n_B - i}$$
$$p_B = \sum_{i=0}^{d_B} \frac{\binom{d}{i}\binom{n-d}{n_B-i}}{\binom{n}{n_B}}$$
where i is a summation index, n is the total number of samples, n_B is the number of samples in each group, d is the total number of rare mutation carriers, p_v is the frequency of rare mutation carriers in the population, and d_B is the upper limit on the number of mutation carriers in each group.
Assuming that the B groups are independent of one another, the probability that no group contains more than d_B rare mutation carriers is p_B^B. When p_B^B exceeds a chosen threshold, every group can be regarded as containing at most d_B rare mutation carriers. Then, for each group of n_B samples of which at most d_B are mutation carriers, an overlapping pooling scheme is designed independently and sequenced.
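The two formulas for p_B and the criterion on p_B^B can be sketched as follows. This is an illustrative Python sketch: the function names are hypothetical, and the example values (200 samples, 2 carriers, 3 groups of 67 samples, d_B = 1, p_v = d/n) are assumptions chosen for illustration, not a configuration stated in the source.

```python
from math import comb

def p_group_binomial(n_b, d_b, p_v):
    """P(at most d_b carriers among n_b samples) for carrier frequency p_v."""
    return sum(comb(n_b, i) * p_v**i * (1 - p_v) ** (n_b - i)
               for i in range(d_b + 1))

def p_group_hypergeometric(n, d, n_b, d_b):
    """P(at most d_b of the d carriers fall into a group of n_b out of n samples)."""
    favourable = sum(comb(d, i) * comb(n - d, n_b - i) for i in range(d_b + 1))
    return favourable / comb(n, n_b)

# Hypothetical illustration values (assumptions, see the lead-in).
n, d, B, n_b, d_b = 200, 2, 3, 67, 1

p_hyper = p_group_hypergeometric(n, d, n_b, d_b)
p_binom = p_group_binomial(n_b, d_b, d / n)

# Probability that all B groups respect the limit, assuming independent groups.
p_all = p_hyper ** B
print(f"p_B (hypergeometric) = {p_hyper:.4f}, p_B (binomial) = {p_binom:.4f}")
print(f"p_B^B = {p_all:.4f}")
```

The hypergeometric form applies when the total number of carriers d is known; the binomial form applies when only the population frequency p_v is known.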
The sequencing cost model is
$$C = t P_l + N_d P_d$$
where t is the number of pooled sequencing runs (that is, the number of library preparations, which equals the number of pools), P_l is the cost of preparing one library, N_d is the data volume, and P_d is the cost of producing one unit of data. The data volume N_d depends on the sequencing depth and on the size of the sequenced region:
$$N_d = \sum_{i=1}^{t} D_i \times R$$
where D_i is the average sequencing depth of pool i, R is the length of the sequenced region, and i is a summation index. Different overlapping pooled sequencing schemes require different numbers of pools and different data volumes; the cost of each scheme is calculated with this cost model, and the scheme with the lowest cost is selected as the optimal overlapping pooled sequencing scheme.
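The cost model is simple enough to evaluate in a few lines. In the Python sketch below, the function name `sequencing_cost` and the example scheme (12 pools at 100x over a 30 Mb region) are hypothetical; the prices of $500 per library and $5,300 per 100 Gb are the ones used in the patent's worked example.

```python
def sequencing_cost(depths, region_bases, library_price, price_per_base):
    """Return (C, N_d) for C = t*P_l + N_d*P_d with N_d = sum(D_i) * R."""
    t = len(depths)                      # one library per pool
    n_d = sum(depths) * region_bases     # total number of bases sequenced
    return t * library_price + n_d * price_per_base, n_d

# Hypothetical scheme: 12 pools, each sequenced at 100x, over a 30 Mb region.
cost, n_d = sequencing_cost([100] * 12, 30e6, 500.0, 5300.0 / 100e9)
print(f"N_d = {n_d / 1e9:.1f} Gb, C = ${cost:,.0f}")
```

Comparing candidate schemes then amounts to calling the function once per scheme and keeping the one with the smallest C.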
Beneficial effects: by setting the optimal sequencing depth for pooled sequencing and calculating the threshold on the number of reads carrying the rare mutation, the invention ensures that rare mutations are reliably observed in pooled samples while controlling cost and avoiding excess data output. In addition, grouping a large number of samples before designing the overlapping pooled sequencing significantly reduces the sequencing data required and further lowers the cost of overlapping pooled sequencing. Finally, the invention selects the optimal overlapping pooled sequencing scheme according to the sequencing cost model, solving the practical problem with the fewest sequencing experiments.
Brief description of the drawings
Fig. 1 is a schematic diagram of overlapping pooled sequencing; in the figure, 1 to 6 are samples, and A, B, C and D are pools.
Fig. 2 is the matrix representation of the overlapping pooling.
Fig. 3 is a schematic diagram of part of the sequencing steps of the invention.
Fig. 4a and Fig. 4b are schematic diagrams of the optimal sequencing depth required for pooled sequencing and of the positive-pool threshold (sequencing error rate 0.01; tolerated pool misjudgment rate 0.01).
Fig. 5 is a schematic diagram of the optimal sequencing depth for pooled sequencing of 40 diploid samples under different sequencing error rates.
Detailed description of the embodiments
Samples are encoded by the way they are pooled: different samples are mixed into the same pool and sequenced together, each sample is mixed into at least two pools, and the samples contained in different pools partially overlap. After pooling, a sequencing library is constructed for each pool and sequenced on a sequencer to obtain the sequencing data. Each pool is then judged, from its pooled sequencing result, to contain or not contain a rare mutation carrier, and all samples carrying the rare mutation (the positive samples) are identified from the pooling pattern of each sample.
As shown in Fig. 1, 6 samples are mixed into 4 pools, with 3 samples in each pool, so that each sample is sequenced twice. After sequencing, the rare mutation is found in pools B and C. Because sample 3 is the only sample that was mixed into both pool B and pool C, sample 3 can be inferred to be the rare mutation carrier.
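The inference in this example can be sketched in code by writing the pooling pattern as a 0/1 matrix with one row per pool and one column per sample. The matrix below is a hypothetical layout that satisfies the constraints of the example (4 pools, 3 samples per pool, each sample in 2 pools, sample 3 in pools B and C); the actual layout of the figures may differ. The function name `decode` is likewise illustrative.

```python
# Rows = pools A-D, columns = samples 1-6; a 1 means the sample is in the pool.
# Hypothetical layout consistent with the example, not copied from the figures.
POOLS = ["A", "B", "C", "D"]
MATRIX = [
    [1, 1, 0, 1, 0, 0],  # pool A: samples 1, 2, 4
    [1, 0, 1, 0, 1, 0],  # pool B: samples 1, 3, 5
    [0, 1, 1, 0, 0, 1],  # pool C: samples 2, 3, 6
    [0, 0, 0, 1, 1, 1],  # pool D: samples 4, 5, 6
]

def decode(matrix, pools, positive_pools):
    """Return the samples (1-based) whose pools are all positive."""
    positive = set(positive_pools)
    carriers = []
    for s in range(len(matrix[0])):
        member_pools = [pools[i] for i, row in enumerate(matrix) if row[s] == 1]
        if member_pools and all(name in positive for name in member_pools):
            carriers.append(s + 1)
    return carriers

print(decode(MATRIX, POOLS, ["B", "C"]))
```

A sample is reported only if every pool it was mixed into tested positive, which is why pools B and C together single out sample 3 in this layout.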
The pooling pattern in this example can be represented by the matrix shown in Fig. 2, in which each row represents a pool and each column represents a sample; an element equal to 1 means that the pool of that row contains the sample of that column, and 0 means that it does not. A widely used method in current overlapping pooled sequencing is to pool the samples according to a disjunct matrix from group testing theory. The steps of the optimized overlapping pooled sequencing method of the invention are shown schematically in Fig. 3. Given the total number of samples n, the upper limit d on the number of mutation carriers, the tolerated pool misjudgment rate α and the length R of the sequenced region, the pooled sequencing experiment is designed first and the optimal depth for pooled sequencing is calculated. The samples are then grouped, and an overlapping pooled sequencing scheme is built for each group from a disjunct matrix of group testing theory. The scheme with the lowest cost under the cost model is selected as the optimal overlapping pooled sequencing scheme. Finally, the samples are pooled and sequenced according to the optimal scheme. The three key steps are implemented as follows.
1. Determining the optimal sequencing depth
When designing a high-throughput sequencing experiment, a reasonable sequencing depth must be determined first. If the depth is too low, the mutation at the target site may go undetected and the purpose of the experiment is not achieved; if the depth is too high, the cost of the sequencing experiment increases. Determining the optimal depth is especially important when designing an experiment that detects rare mutations among a large number of samples: provided that rare mutation carriers are still judged accurately, the lowest possible depth should be chosen to make the experiment more economical.
The invention first calculates the optimal sequencing depth required for pooled sequencing. Assume that the sequencing depth follows the negative binomial distribution
$$P(N_r) = \mathrm{NB}\left(N_r;\ \frac{D}{r-1},\ \frac{1}{r}\right)$$
where D is the average sequencing depth, N_r is the number of times a given position on the genome is sequenced, r is the parameter of the negative binomial distribution, which depends on the sequencing platform and the sequencing target, and NB denotes the negative binomial distribution:
$$\mathrm{NB}(k;\ m,\ p) = \binom{k+m-1}{k}\, p^{k} (1-p)^{m}, \qquad k = 0, 1, 2, \ldots$$
That is, the probability of observing k successes in a series of independent trials that continues until the m-th failure, where m is the number of failures and p is the probability of success in each trial.
Assume further that sequencing errors follow the binomial distribution
$$P(E \mid N_r) = \mathrm{Bin}(E;\ N_r,\ p_{error})$$
where E is the number of sequencing errors, p_error is the average sequencing error rate, and Bin denotes the binomial distribution:
$$\mathrm{Bin}(k;\ m,\ p) = \binom{m}{k}\, p^{k} (1-p)^{m-k}, \qquad k = 0, 1, 2, \ldots, m$$
That is, the probability of exactly k successes in m independent trials, where p is the probability of success in each trial.
Let the observation threshold be T: if at least T reads carrying the mutation are observed, the pool is judged to contain a sample carrying the mutation; otherwise the pool is judged to consist entirely of normal samples. On this basis, the probabilities of a false-positive error (F_P) and a false-negative error (F_N) in pooled sequencing are:
$$F\_P = \sum_{N_r = T}^{\infty} \mathrm{NB}\left(N_r;\ \frac{D}{r-1},\ \frac{1}{r}\right) \sum_{E = T}^{N_r} \mathrm{Bin}(E;\ N_r,\ p_{error})$$
where D is the pooled sequencing depth, N_r is the number of times a given position on the genome is sequenced, and E is the number of sequencing errors.
$$F\_N = \sum_{O = 0}^{T-1} \sum_{x = 0}^{O} \left[ \sum_{i = x}^{\infty} \mathrm{NB}\left(i;\ \frac{(1-p)D}{r-1},\ \frac{1}{r}\right) \mathrm{Bin}(x;\ i,\ p_{error}) \right] \times \left[ \sum_{j = O-x}^{\infty} \mathrm{NB}\left(j;\ \frac{pD}{r-1},\ \frac{1}{r}\right) \mathrm{Bin}(j-O+x;\ j,\ p_{error}) \right]$$
where p is the proportion of chromosomes in a pool that carry the rare mutation, O is the observed number of reads carrying the mutation, x is the number of mutation-carrying reads that come from normal individuals, i and j are the numbers of reads that come from normal individuals and from mutation carriers respectively, D is the pooled sequencing depth, and p_error is the average sequencing error rate.
Given that the pool misjudgment rate tolerated by the overlapping pooled sequencing design is α, the optimal pooled sequencing depth D_optimal is set as
$$D_{optimal} = \min\{\, D \mid F\_N(D, T) \le \alpha \ \&\ F\_P(D, T) \le \alpha,\ T \in [1, D] \,\}$$
and the corresponding observation threshold T is calculated as
$$T = \min\{\, T \mid F\_N(D_{optimal}, T) \le \alpha \ \&\ F\_P(D_{optimal}, T) \le \alpha \,\}$$
The optimal sequencing depth and the observation threshold increase rapidly with the number of pooled samples (see Fig. 4), and the higher the average sequencing error rate, the higher the optimal pooled sequencing depth (see Fig. 5).
2. Grouping and pooling the samples
The large sample set is divided into several groups, and an overlapping pooled sequencing scheme is designed independently for each group to further reduce the sequencing data volume.
We propose a grouped overlapping pooled sequencing method: a large number of samples is divided into B groups, the possible number of mutation carriers in each group is calculated from the hypergeometric or the binomial distribution given below, and each group undergoes independent overlapping pooled sequencing. The probability p_B that the number of mutation carriers in a group does not exceed d_B can be calculated with either of the two formulas below. Assuming that the groups are independent of one another, the probability that no group contains more than d_B mutation carriers is p_B^B; when p_B^B exceeds a threshold β (for example 0.7), each group can be regarded as containing at most d_B mutation carriers.
$$p_B = \sum_{i=0}^{d_B} \binom{n_B}{i}\, p_v^{\,i} (1-p_v)^{\,n_B - i}$$
$$p_B = \sum_{i=0}^{d_B} \frac{\binom{d}{i}\binom{n-d}{n_B-i}}{\binom{n}{n_B}}$$
where n is the total number of samples, n_B is the number of samples in each group, d is the total number of mutation carriers, and p_v is the frequency of mutation carriers in the population.
After grouping, an overlapping pooling scheme is designed independently for each group, and the group is sequenced. Suppose each group contains n_B samples, of which at most d_B are mutation carriers; an overlapping pooled sequencing scheme is designed independently for each group according to a disjunct matrix from group testing theory. The key to designing an overlapping pooled sequencing scheme is to generate a disjunct matrix and pool the samples according to it. Take any d+1 columns of the pooling matrix; if none of them is covered by the union of the other d columns, the matrix is called a d-disjunct matrix. A d-disjunct matrix can identify up to d positive samples among the samples. Several methods for designing disjunct matrices exist; we use a layered design. In a disjunct matrix obtained by layered design, the pools can be divided into several layers, and the pools of each layer together cover every sample exactly once.
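The d-disjunct property described above can be checked by brute force on small matrices. The Python sketch below is illustrative only: the function name `is_d_disjunct` and the 4-pool, 6-sample test matrix are assumptions made for the example, and the exhaustive check is practical only for small designs.

```python
from itertools import combinations

def is_d_disjunct(matrix, d):
    """True if no column is covered by the union of any d other columns."""
    n_pools, n_samples = len(matrix), len(matrix[0])
    # Represent each column (sample) by the set of pools that contain it.
    cols = [{i for i in range(n_pools) if matrix[i][j]} for j in range(n_samples)]
    for j in range(n_samples):
        others = [c for k, c in enumerate(cols) if k != j]
        for group in combinations(others, d):
            if cols[j] <= set().union(*group):
                return False
    return True

# Hypothetical pooling matrix: 4 pools (rows), 6 samples (columns),
# every sample placed in a distinct pair of pools.
MATRIX = [
    [1, 1, 0, 1, 0, 0],
    [1, 0, 1, 0, 1, 0],
    [0, 1, 1, 0, 0, 1],
    [0, 0, 0, 1, 1, 1],
]

print(is_d_disjunct(MATRIX, 1), is_d_disjunct(MATRIX, 2))
```

In this matrix no column is contained in another, so one carrier can always be identified, but the pools of sample 1 are covered by those of samples 2 and 3 together, so two carriers cannot be resolved in general.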
After grouping, each group contains fewer samples, so each pool also contains fewer samples. Because the optimal sequencing depth is positively correlated with the number of samples in a pool, the pooled sequencing depth in the grouped overlapping pooled sequencing method is lower, and the total data volume required is therefore significantly reduced.
3. Determining the cost of the overall sequencing experiment scheme
The invention constructs the following cost model for overlapping pooled sequencing:
$$C = t P_l + N_d P_d$$
where t is the number of pooled sequencing runs (that is, the number of library preparations, which equals the number of pools), P_l is the cost of preparing one library, N_d is the data volume, and P_d is the cost of producing one unit of data. The data volume N_d depends on the sequencing depth and on the size of the sequenced region:
$$N_d = \sum_{i=1}^{t} D_i \times R$$
where D_i is the average sequencing depth of pool i and R is the length of the sequenced region.
According to this cost model, which accounts for the costs of both library preparation and sequencing data, the best overlapping pooled sequencing scheme is selected, together with the best number of groups for grouped overlapping pooled sequencing.
Worked example:
Two mutation carriers are to be identified among 200 diploid samples, and the sequenced region is set to 30 Mb of the genome (Mb = megabase; this is consistent with the total length of the human exonic regions). The sequencing error rate (p_error) is first set to 0.01, and the pool misjudgment rate (α) tolerated by the overlapping pooled sequencing design is set to 0.01. We calculate the optimal pooled sequencing depth from the sequencing depth model and calculate the positive-pool threshold; when the number of reads containing the mutation reaches this threshold, the pool is judged to be positive (Table 1). A pool containing 1 diploid sample represents the individual sequencing depth required by the non-pooled sequencing strategy.
Table 1. Optimal sequencing depth and positive-pool threshold for pooled sequencing
Next, disjunct matrices are used to build the pooling schemes, with the tolerated pool misjudgment rate (α) set to 0.01. For disjunct matrices obtained by shifted transversal design, the number of pools and the number of samples per pool differ with the design parameters (q, k), where k is the number of layers in the layered design and q is the number of pools in each layer. Using the optimal sequencing depth results, we calculate the data volume required by the overlapping pooled sequencing scheme under each set of design parameters. Assuming that constructing one sequencing library costs $500 and that producing sequencing data costs $5,300 per 100 Gb (Gb = gigabase), the cost of each design is calculated with the cost model (Table 2). Scheme 4, which has the lowest cost, is finally selected as the optimal overlapping pooled sequencing design. By comparison, the non-pooled sequencing strategy requires 200 libraries and 162 Gb of data, for a total cost of $108,586. The cost of the optimal overlapping pooled sequencing scheme is only 66% of that of the non-pooled strategy.
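The non-pooled figures quoted above can be checked against the cost model C = t P_l + N_d P_d. The derivation below is a sketch of that arithmetic. The per-sample depth D is inferred from the quoted data volume under the assumption that all 200 samples are sequenced at the same depth, and the dollar amount for the optimal scheme is only approximate, because the 66% ratio is a rounded value.

```latex
\begin{align*}
C_{\text{non-pooled}} &= t\,P_l + N_d\,P_d \\
 &= 200 \times \$500 + 162\ \text{Gb} \times \frac{\$5300}{100\ \text{Gb}} \\
 &= \$100{,}000 + \$8{,}586 = \$108{,}586, \\[4pt]
N_d &= \sum_{i=1}^{200} D_i \times R = 200 \times D \times 30\ \text{Mb} = 162\ \text{Gb}
 \;\Rightarrow\; D = 27, \\[4pt]
C_{\text{optimal}} &\approx 0.66 \times \$108{,}586 \approx \$71{,}700 .
\end{align*}
```

Library preparation accounts for most of the non-pooled cost, which is why reducing the number of libraries through pooling has such a large effect.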
Table 2. Overlapping pooled sequencing schemes for identifying 2 mutation carriers among 200 diploid samples. The pool misjudgment rate (α) tolerated by the designed schemes is set to 0.01. In the table, q and k are the input parameters of the shifted transversal design.
With the grouped overlapping pooled sequencing method, the 200 samples are divided into several groups, the possible number of mutation carriers in each group is calculated from the grouping model, an overlapping pooled sequencing scheme is designed independently for each group, and the best sequencing scheme for each group is selected according to the cost model. On this basis, the number of pools and the data volume required by grouped overlapping pooled sequencing are calculated, the total cost is then calculated with the cost model, and the best number of groups is selected accordingly. In this example, when p_B^B exceeds 0.7, each group is regarded as containing at most d_B mutation carriers. The cost of identifying the mutation carriers is lowest when the samples are divided into 3 groups. A group number of 1 means that overlapping pooled sequencing is carried out directly, without grouping; by comparison, grouped overlapping pooled sequencing requires significantly less data, which directly lowers the total cost. Overlapping pooled sequencing in 3 groups costs 80% of overlapping pooled sequencing without grouping (see Table 3).
Table 3. Grouped overlapping pooled sequencing designs for identifying 2 mutation carriers among 200 diploid samples.
The invention first constructs a depth model for pooled sequencing and a cost model for overlapping pooled sequencing, and selects the design parameters of overlapping pooled sequencing on the basis of these two models to minimize the sequencing cost. Because of the constraints imposed by the sequencing depth and by the number of pooled samples, we propose the grouped overlapping pooled sequencing method, in which a large sample set is divided into several groups, an overlapping pooled sequencing scheme is designed independently for each group, and the optimal number of groups is selected according to the cost model to reduce cost as far as possible. Table 4 shows that, compared with the non-pooled sequencing strategy, the optimized overlapping pooled sequencing method significantly reduces the number of sequencing runs, and that the grouped overlapping pooled sequencing method further reduces the data volume required and significantly lowers the total sequencing cost.
Table 4. Comparison of three schemes for identifying 2 mutation carriers among 200 diploid samples
In summary, the sequencing depth model proposed by the invention is used to select the optimal depth for pooled sequencing and to reduce the sequencing data volume; the model rests on the assumptions that sequencing depth follows a negative binomial distribution and that sequencing errors follow a binomial distribution. Second, a grouped overlapping pooled sequencing method is proposed in which a large sample set is divided into several groups; the possible number of rare mutation carriers in each group is calculated from the hypergeometric or the binomial distribution, and an overlapping pooled sequencing scheme is designed independently for each group accordingly, which further reduces the sequencing data volume.
In addition, because an actual pooled sequencing run incurs costs from both library preparation and sequencing data production, the invention constructs a cost model for overlapping pooled sequencing and selects the optimal scheme accordingly, so as to minimize the experimental cost of screening for rare mutation carriers and improve the efficiency of overlapping pooled sequencing. Finally, the samples are pooled according to the optimal scheme, and a sequencing library is constructed for each pool and sequenced. Each pool is judged, from the pooled sequencing result, to contain or not contain a rare mutation carrier, and the samples carrying the rare mutation are identified from the pooling pattern of each sample.
The preferred embodiments of the invention have been described in detail above. The invention is not, however, limited to the specific details of those embodiments; within the scope of its technical concept, various equivalent modifications may be made to its technical solution, and all such modifications fall within its scope of protection.
It should also be noted that the specific technical features described in the above embodiments may be combined in any suitable manner, provided they do not contradict one another. To avoid unnecessary repetition, the possible combinations are not described separately.
Furthermore, the different embodiments of the invention may be combined in any way; provided the combination does not depart from the idea of the invention, it should likewise be regarded as content disclosed by the invention.

Claims (4)

1. An optimized overlapping pooled sequencing method, characterized in that it comprises the following steps:
Step 1: calculate the optimal sequencing depth according to a pooled sequencing depth model, divide the samples into groups for overlapping pooled sequencing, and select the best sequencing scheme according to a sequencing cost model;
wherein the optimal sequencing depth is the minimum sequencing depth that satisfies the false-positive and false-negative error requirements, calculated on the basis that sequencing depth follows a negative binomial distribution and sequencing errors follow a binomial distribution;
Grouped overlapping pooled sequencing: the large sample set is divided into several groups, the possible number of rare mutation carriers in each group is calculated from the known probability of the rare mutation, and each group then undergoes independent overlapping pooled sequencing;
Establishing a reasonable sequencing cost model: the costs of both library preparation and sequencing data are considered, the cost of each overlapping pooled sequencing scheme is calculated according to the cost model, and the optimal overlapping pooled sequencing scheme is selected;
Step 2: use the above method to carry out a high-throughput sequencing experiment that screens for rare mutation carriers among a large number of samples.
2. The optimized overlapping hybrid sequencing method as claimed in claim 1, characterized in that the optimal sequencing depth is calculated with the following model:
The sequencing depth is assumed to follow the negative binomial distribution:
$$P(N_r) = \mathrm{NB}\left(N_r;\ \frac{D}{r-1},\ \frac{1}{r}\right)$$
where $D$ is the average sequencing depth, $N_r$ is the number of times a given position on the genome is sequenced, $r$ is the parameter of the negative binomial distribution, which depends on the sequencing platform and the sequencing target, and NB denotes the negative binomial distribution.
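As a quick check on this parameterization, assume the common convention in which $\mathrm{NB}(k;\ n,\ q)$ is the probability of $k$ failures before the $n$-th success with success probability $q$ (the text does not state its convention, so this is an assumption). With $n = D/(r-1)$ and $q = 1/r$, the mean and variance of the coverage are:

$$\mathrm{E}[N_r] = \frac{n(1-q)}{q} = \frac{D}{r-1}\cdot\frac{r-1}{r}\cdot r = D, \qquad \mathrm{Var}[N_r] = \frac{n(1-q)}{q^2} = \frac{D}{r-1}\cdot\frac{r-1}{r}\cdot r^2 = rD$$

Under that assumption the mean coverage equals $D$, and $r > 1$ acts as the variance-to-mean ratio, so larger $r$ means more uneven coverage.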
At the same time, sequencing errors are assumed to follow the binomial distribution:
$$P(E \mid N_r) = \mathrm{Bin}(E;\ N_r,\ p_{error})$$
where $E$ is the number of sequencing errors, $p_{error}$ is the average sequencing error rate, and Bin denotes the binomial distribution.
An observation threshold $T$ is set: if at least $T$ sequenced fragments carrying the rare mutation are observed, the pool is considered to contain a sample carrying the rare mutation; otherwise the pool is considered to consist entirely of normal samples. On this basis, the probabilities of a false positive error F_P and a false negative error F_N when judging a pool are constructed as follows:
$$F\_P = \sum_{N_r=T}^{\infty} \mathrm{NB}\left(N_r;\ \frac{D}{r-1},\ \frac{1}{r}\right) \sum_{E=T}^{N_r} \mathrm{Bin}(E;\ N_r,\ p_{error})$$
where $D$ is the hybrid sequencing depth, $N_r$ is the number of times a given position on the genome is sequenced, $E$ is the number of sequencing errors, $p_{error}$ is the average sequencing error rate, and $r$ is the parameter of the negative binomial distribution;
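The false positive probability can be evaluated numerically. The following is a minimal sketch, not the inventors' implementation: it assumes that NB(k; n, q) counts failures before the n-th success with success probability q, and it truncates the infinite sum over N_r at a cut-off far above the mean. The function names, the `tail` parameter, and the example values are hypothetical.

```python
from math import exp, lgamma, log


def nb_pmf(k: int, n: float, q: float) -> float:
    """NB(k; n, q): k failures before the n-th success, real n > 0, 0 < q < 1."""
    return exp(lgamma(k + n) - lgamma(n) - lgamma(k + 1)
               + n * log(q) + k * log(1 - q))


def binom_pmf(e: int, m: int, pe: float) -> float:
    """Bin(e; m, pe): e errors among m reads, 0 < pe < 1."""
    if e < 0 or e > m:
        return 0.0
    log_c = lgamma(m + 1) - lgamma(e + 1) - lgamma(m - e + 1)
    return exp(log_c + e * log(pe) + (m - e) * log(1 - pe))


def false_positive(D: float, T: int, r: float, pe: float, tail: int = 10) -> float:
    """F_P: at least T erroneous reads in a pool without carriers (requires r > 1)."""
    n_max = int(tail * D) + 50  # assumed cut-off for the infinite sum over N_r
    total = 0.0
    for n_r in range(T, n_max + 1):
        # Inner sum over E = T..N_r, written as one minus the lower tail.
        at_least_t = 1.0 - sum(binom_pmf(e, n_r, pe) for e in range(T))
        total += nb_pmf(n_r, D / (r - 1), 1 / r) * max(at_least_t, 0.0)
    return total


# Hypothetical values: depth 100, threshold 5, r = 2, 1% error rate.
print(false_positive(D=100, T=5, r=2.0, pe=0.01))
```

Raising the threshold T lowers F_P for a fixed depth, which is the trade-off the optimal depth search exploits.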
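$$F\_N = \sum_{O=0}^{T-1} \sum_{x=0}^{O} \sum_{i=x}^{\infty} \mathrm{NB}\left(i;\ \frac{(1-p)D}{r-1},\ \frac{1}{r}\right) \mathrm{Bin}(x;\ i,\ p_{error}) \times \sum_{j=O-x}^{\infty} \mathrm{NB}\left(j;\ \frac{pD}{r-1},\ \frac{1}{r}\right) \mathrm{Bin}(j-O+x;\ j,\ p_{error})$$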
where $p$ is the proportion of chromosomes in the pool that carry the rare mutation, $O$ is the observed number of sequenced fragments carrying the mutation, $x$ is the number of mutation-carrying sequenced fragments that come from normal individuals, $i$ and $j$ are the numbers of sequenced fragments that come from normal individuals and from mutation-carrying individuals respectively, $D$ is the hybrid sequencing depth, and $p_{error}$ is the average sequencing error rate;
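The false negative probability factorizes into a term for reads from normal chromosomes and a term for reads from mutant chromosomes, which makes it straightforward to compute. The sketch below follows the formula term by term under the same assumed conventions (NB counts failures before the n-th success; infinite sums are truncated). All names and the example values are hypothetical.

```python
from math import exp, lgamma, log


def nb_pmf(k: int, n: float, q: float) -> float:
    """NB(k; n, q): k failures before the n-th success, real n > 0, 0 < q < 1."""
    return exp(lgamma(k + n) - lgamma(n) - lgamma(k + 1)
               + n * log(q) + k * log(1 - q))


def binom_pmf(e: int, m: int, pe: float) -> float:
    """Bin(e; m, pe): e errors among m reads, 0 < pe < 1."""
    if e < 0 or e > m:
        return 0.0
    log_c = lgamma(m + 1) - lgamma(e + 1) - lgamma(m - e + 1)
    return exp(log_c + e * log(pe) + (m - e) * log(1 - pe))


def false_negative(D: float, T: int, p: float, r: float, pe: float,
                   tail: int = 10) -> float:
    """F_N: fewer than T mutant-looking reads in a pool with carriers (0 < p < 1, r > 1)."""
    n_max = int(tail * D) + 50  # assumed cut-off for the infinite sums
    q = 1 / r
    n_normal = (1 - p) * D / (r - 1)
    n_mutant = p * D / (r - 1)
    # a[x]: x of the i reads from normal chromosomes are miscalled as mutant.
    a = [sum(nb_pmf(i, n_normal, q) * binom_pmf(x, i, pe)
             for i in range(x, n_max + 1)) for x in range(T)]
    # b[m]: m of the j reads from mutant chromosomes are called correctly,
    # i.e. j - m of them are miscalled.
    b = [sum(nb_pmf(j, n_mutant, q) * binom_pmf(j - m, j, pe)
             for j in range(m, n_max + 1)) for m in range(T)]
    # Sum over observed counts O < T and over the split O = x + (O - x).
    return sum(a[x] * b[o - x] for o in range(T) for x in range(o + 1))


# Hypothetical pool: one heterozygous carrier among 10 diploid samples, p = 1/20.
print(false_negative(D=100, T=5, p=0.05, r=2.0, pe=0.01))
```

Precomputing the two inner sums once per value of x and O - x avoids recomputing them for every pair (O, x).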
Given that the allowable pool misjudgment rate in overlapping hybrid sequencing is $\alpha$, the optimal hybrid sequencing depth $D_{optimal}$ is set as follows:
$$D_{optimal} = \min\{D \mid F\_N(D,T) \le \alpha \ \&\ F\_P(D,T) \le \alpha,\ T \in [1, D]\}$$
and the corresponding observation threshold $T$ is calculated as:
$$T = \min\{T \mid F\_N(D_{optimal}, T) \le \alpha \ \&\ F\_P(D_{optimal}, T) \le \alpha\}.$$
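The two minimizations amount to a search over depth and threshold. The sketch below is a hedged illustration: `optimal_depth` accepts any pair of error functions, searches integer depths only, and relies on the assumption that F_N never decreases as T grows. The stand-in functions `toy_fp` and `toy_fn` are deliberately simplified (coverage fixed at exactly D, no negative binomial spread) and are not the claimed model; all names and constants are hypothetical.

```python
from math import comb
from typing import Callable, Optional, Tuple

ErrorFn = Callable[[int, int], float]  # (D, T) -> error probability


def optimal_depth(fp: ErrorFn, fn: ErrorFn, alpha: float,
                  max_depth: int) -> Optional[Tuple[int, int]]:
    """Smallest integer D, and smallest T in [1, D] for it, with both errors <= alpha."""
    for depth in range(1, max_depth + 1):
        for threshold in range(1, depth + 1):
            if fn(depth, threshold) > alpha:
                break  # F_N only grows with T, so larger thresholds cannot help
            if fp(depth, threshold) <= alpha:
                return depth, threshold
    return None


def binom_cdf(k: int, n: int, q: float) -> float:
    """P(X <= k) for X ~ Bin(n, q)."""
    return sum(comb(n, i) * q**i * (1 - q)**(n - i) for i in range(k + 1))


P_ERROR, P_MUT = 0.01, 0.05  # toy error rate and mutant chromosome fraction


def toy_fp(D: int, T: int) -> float:
    """Toy F_P: at least T erroneous reads among exactly D reads."""
    return 1.0 - binom_cdf(T - 1, D, P_ERROR)


def toy_fn(D: int, T: int) -> float:
    """Toy F_N: fewer than T mutant-looking reads among exactly D reads."""
    q = P_MUT * (1 - P_ERROR) + (1 - P_MUT) * P_ERROR
    return binom_cdf(T - 1, D, q)


print(optimal_depth(toy_fp, toy_fn, alpha=0.05, max_depth=2000))
```

Any implementations of F_P and F_N with the signature (D, T) can be passed in place of the toy functions.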
3. The optimized overlapping hybrid sequencing method as claimed in claim 1, characterized in that the grouped overlapping hybrid sequencing is as follows:
The samples are divided into $B$ groups, the possible number of rare mutation carriers in each group is calculated from the hypergeometric or binomial distribution, and an overlapping hybrid sequencing scheme is designed independently for each group. The probability $p_B$ that the number of rare mutation carriers in a group does not exceed $d_B$ can be calculated from either of the following two formulas (binomial and hypergeometric, respectively):
$$p_B = \sum_{i=0}^{d_B} \binom{n_B}{i} p_v^{\,i} (1-p_v)^{n_B-i}$$
$$p_B = \sum_{i=0}^{d_B} \frac{\binom{d}{i}\binom{n-d}{n_B-i}}{\binom{n}{n_B}}$$
where $i$ is a summation index, $n$ is the total number of samples, $n_B$ is the number of samples in each group, $d$ is the total number of rare mutation carriers, $p_v$ is the frequency of rare mutation carriers in the population, and $d_B$ is the upper limit on the number of mutation carriers in each group;
Assuming that the $B$ groups are mutually independent, the probability that every group contains at most $d_B$ rare mutation carriers is $p_B^{B}$. When $p_B^{B}$ exceeds a certain threshold, all groups can be considered to contain at most $d_B$ rare mutation carriers. Then, for each group of $n_B$ samples of which at most $d_B$ are mutation carriers, an overlapping hybrid scheme is designed independently and sequenced.
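Both formulas and the group-independence step are easy to evaluate directly. The following is a small sketch using only the standard library; the function names and the example numbers (1000 samples in 10 groups, carrier frequency 1%, at most 3 carriers per group) are hypothetical.

```python
from math import comb


def p_group_binomial(n_b: int, d_b: int, p_v: float) -> float:
    """P(at most d_b carriers in a group of n_b), carrier frequency p_v."""
    return sum(comb(n_b, i) * p_v**i * (1 - p_v)**(n_b - i)
               for i in range(d_b + 1))


def p_group_hypergeom(n: int, d: int, n_b: int, d_b: int) -> float:
    """P(at most d_b carriers in a group of n_b drawn from n samples with d carriers)."""
    # math.comb returns 0 when the lower index exceeds the upper one.
    hits = sum(comb(d, i) * comb(n - d, n_b - i)
               for i in range(min(d_b, n_b) + 1))
    return hits / comb(n, n_b)


def p_all_groups(p_b: float, groups: int) -> float:
    """P(every one of `groups` independent groups has at most d_b carriers)."""
    return p_b ** groups


# Hypothetical screen: 1000 samples, 10 groups of 100, at most 3 carriers per group.
p_b = p_group_binomial(n_b=100, d_b=3, p_v=0.01)
print(p_b, p_all_groups(p_b, 10))
print(p_group_hypergeom(n=1000, d=10, n_b=100, d_b=3))
```

If the joint probability falls below the chosen threshold, either the per-group bound can be raised or the groups can be made smaller before the pooling scheme is designed.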
4. The optimized overlapping hybrid sequencing method as claimed in claim 1, characterized in that the sequencing cost model is:
$$C = t P_l + N_d P_d$$
where $t$ is the number of hybrid sequencing runs (that is, the number of library preparations, which is also the number of pools), $P_l$ is the library preparation cost, $N_d$ is the data volume, and $P_d$ is the data production cost; the data volume $N_d$ depends on the sequencing depth and the size of the sequenced region:
$$N_d = \sum_{i=1}^{t} D_i \times R$$
$D_i$ is the average sequencing depth of the $i$-th pool, $R$ is the length of the sequenced region, and $i$ is a summation index. Different overlapping hybrid sequencing schemes require different numbers of pools and different data volumes; the cost of each scheme is calculated with this cost model, and the scheme with the lowest cost is selected as the optimal overlapping hybrid sequencing scheme.
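The cost comparison can be sketched in a few lines. The example assumes that P_l is a price per library and P_d a price per unit of data (here, per sequenced base); the `Scheme` class, the function names, and all prices and depths are hypothetical illustration values.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Scheme:
    name: str
    pool_depths: List[float]  # D_i for each pool, so t = len(pool_depths)


def scheme_cost(scheme: Scheme, region_length: float,
                library_price: float, data_price: float) -> float:
    """C = t * P_l + N_d * P_d, with N_d = sum_i D_i * R."""
    t = len(scheme.pool_depths)
    n_d = sum(depth * region_length for depth in scheme.pool_depths)
    return t * library_price + n_d * data_price


def cheapest_scheme(schemes: List[Scheme], region_length: float,
                    library_price: float, data_price: float) -> Scheme:
    """Return the scheme with the lowest total cost."""
    return min(schemes, key=lambda s: scheme_cost(
        s, region_length, library_price, data_price))


# Two hypothetical schemes: many shallow pools versus fewer, deeper pools.
schemes = [
    Scheme("A", [100.0] * 10),
    Scheme("B", [200.0] * 6),
]
best = cheapest_scheme(schemes, region_length=1000,
                       library_price=50.0, data_price=0.0001)
print(best.name, scheme_cost(best, 1000, 50.0, 0.0001))
```

With these illustrative prices, scheme B is cheaper (about 420 cost units against about 600 for scheme A), because saving four library preparations outweighs the extra sequencing data.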
CN201410462490.7A 2014-09-11 2014-09-11 Optimized overlapping hybrid sequencing method Pending CN104217135A (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20141217