CN104217135A

CN104217135A - Optimized overlapping hybrid sequencing method

Info

Publication number: CN104217135A
Application number: CN201410462490.7A
Authority: CN
Inventors: 孙啸; 曹唱唱; 李成
Original assignee: Southeast University
Current assignee: Southeast University
Priority date: 2014-09-11
Filing date: 2014-09-11
Publication date: 2014-12-17

Abstract

The invention discloses an optimized overlapping hybrid (pooled) sequencing method comprising the following steps. First, based on the general rule that sequencing depth follows a negative binomial distribution and sequencing errors follow a binomial distribution, a depth model of hybrid sequencing is proposed, and the optimal hybrid sequencing depth is calculated from the model, so that sequencing cost is reduced by eliminating redundant depth. Second, a grouped overlapping hybrid sequencing method based on the probability distribution of rare mutations is proposed; compared with direct sequencing, the grouping strategy greatly reduces the data volume required and increases the efficiency of hybrid sequencing. Third, a sequencing cost model is established, and the optimal overlapping hybrid sequencing scheme for screening rare-mutation carriers is chosen on the basis of that model. The method minimizes the sequencing cost of screening for rare-mutation carriers.

Description

An optimized overlapping pooled sequencing method

Technical field

The invention belongs to the field of gene sequencing and relates in particular to an optimized overlapping pooled sequencing method (overlapping hybrid sequencing).

Background

Using high-throughput DNA sequencing to analyze the relationship between genetic variation and human disease is an important method in biomedical research, and screening for and detecting rare DNA mutations is a current research focus. To find rare mutations in the human genome and explore their relationship with disease, DNA samples from a large number of individuals must be sequenced and analyzed. To improve sequencing efficiency and make full use of the capacity of existing sequencers, multiple samples are mixed and sequenced together, which is known as pooled sequencing.

The key problem in pooled sequencing is how to separate, in the sequencing results, the sequenced DNA fragments (reads) that originate from different samples, so as to determine the carriers of a rare mutation (the positive samples). The conventional approach attaches a unique DNA barcode to each sample before sequencing; afterwards, the barcode on each read identifies the sample it came from, and positive samples are called from the sequencing results. The alternative is to mix the samples into different pools in an overlapping fashion, sequence each pool separately, and determine the positive samples from the pattern of pools in which each sample appears (the overlapping pooling pattern) together with the sequencing result of each pool.

In this overlapping pooled sequencing method, samples are mixed according to a combinatorial rule and then sequenced. Overlapping pooled sequencing can identify the carriers of a rare mutation among a large number of samples with very few pooled sequencing runs, reducing both the workload of sequencing-library preparation and the total cost of sequencing.

However, existing overlapping sequencing methods leave several questions open. What sequencing depth both guarantees accurate calling of positive samples and minimizes sequencing cost? How many pools are needed? How should each sample be assigned to the pools in an overlapping fashion? How should the best sequencing scheme be selected?

Summary of the invention

Object of the invention: to provide an optimized overlapping pooled sequencing method that solves the above problems of the prior art, optimizes the sequencing procedure, and improves sequencing efficiency.

Technical solution: an optimized overlapping pooled sequencing method comprising the following steps:

Step 1: calculate the optimal sequencing depth from the pooled-sequencing depth model, sequence the large sample set by grouped overlapping pooling, and select the best sequencing scheme according to the sequencing cost model.

The optimal sequencing depth is the minimum sequencing depth that satisfies the false-positive and false-negative error requirements, calculated under the general rule that sequencing depth follows a negative binomial distribution and sequencing errors follow a binomial distribution.

Grouped overlapping pooled sequencing: the large sample set is divided into several groups, the likely number of rare-mutation carriers in each group is calculated from the known probability of the rare mutation, and each group then undergoes independent overlapping pooled sequencing.

Establishing a reasonable sequencing cost model: the costs of library preparation and of sequencing data are both considered; the cost of each overlapping pooled sequencing scheme is calculated from the cost model, and the optimal overlapping pooled sequencing scheme is selected.

Step 2: use the above method to carry out high-throughput sequencing experiments that screen for rare-mutation carriers in a large number of samples.

The model for calculating the optimal sequencing depth is as follows:

The sequencing depth is assumed to follow the negative binomial distribution:

P(N_{r}) = NB\left(N_{r}; \frac{D}{r - 1}, \frac{1}{r}\right)

where D is the average sequencing depth, N_r is the number of times a given position on the genome is sequenced, r is the parameter of the negative binomial distribution, which depends on the sequencing platform and the sequencing target, and NB denotes the negative binomial distribution:

NB(k; m, p) = \binom{k + m - 1}{k} p^{k} (1 - p)^{m}, \quad k = 0, 1, 2, \ldots

that is, the probability of k successes in a series of independent trials that continues until m failures have occurred, where m is the number of failures and p is the probability of success in each trial.
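The parameter choice (D/(r − 1), 1/r) can be checked numerically. The sketch below is not part of the patent: it assumes the reading in which 1/r is the per-trial probability of the stopping event, because that is the reading under which the mean of N_r equals the average depth D and the variance equals rD. Under the formula exactly as printed, with p = 1/r as the success probability, the mean would instead be D/(r − 1)^2. The function names and the values D = 30, r = 3 are illustrative.

```python
import math

def nb_pmf(k, size, q):
    """P(K = k) for a negative binomial with real-valued size.

    Assumed reading: q is the per-trial probability of the stopping event,
    so pmf(k) = C(k + size - 1, k) * (1 - q)**k * q**size.
    """
    log_coef = math.lgamma(k + size) - math.lgamma(k + 1) - math.lgamma(size)
    return math.exp(log_coef + k * math.log1p(-q) + size * math.log(q))

def depth_moments(D, r, k_max=5000):
    """Numerically sum the mean and variance of NB(D / (r - 1), 1 / r)."""
    size, q = D / (r - 1.0), 1.0 / r
    probs = [nb_pmf(k, size, q) for k in range(k_max)]
    mean = sum(k * p for k, p in enumerate(probs))
    var = sum((k - mean) ** 2 * p for k, p in enumerate(probs))
    return mean, var

mean, var = depth_moments(D=30.0, r=3.0)
print(mean, var)  # close to D = 30 and r * D = 90
```

Read this way, r acts as the variance-to-mean ratio of the coverage, so larger r means more uneven coverage at the same average depth.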

Sequencing errors are likewise assumed to follow the binomial distribution:

P(E \mid N_{r}) = Bin(E; N_{r}, p_{error})

where E is the number of sequencing errors, p_error is the average sequencing error rate, and Bin denotes the binomial distribution:

Bin(k; m, p) = \binom{m}{k} p^{k} (1 - p)^{m - k}, \quad k = 0, 1, 2, \ldots, m

that is, the probability of exactly k successes in m independent trials, where p is the probability of success in each trial.

Let the observation threshold be T: if at least T reads carrying the rare mutation are observed, the pooled sample is deemed to contain a sample carrying the rare mutation; otherwise the pooled sample is deemed to consist entirely of normal samples. On this basis, the probabilities of a false-positive error F_P and a false-negative error F_N in calling a pool are:

F_P = \sum_{N_{r} = T}^{\infty} NB\left(N_{r}; \frac{D}{r - 1}, \frac{1}{r}\right) \sum_{E = T}^{N_{r}} Bin(E; N_{r}, p_{error})

where D is the pooled sequencing depth, N_r is the number of times a given position on the genome is sequenced, E is the number of sequencing errors, p_error is the average sequencing error rate, and r is the parameter of the negative binomial distribution.

F_N = \sum_{O = 0}^{T - 1} \sum_{x = 0}^{O} \left\{ \left[ \sum_{i = x}^{\infty} NB\left(i; \frac{(1 - p) D}{r - 1}, \frac{1}{r}\right) Bin(x; i, p_{error}) \right] \times \left[ \sum_{j = O - x}^{\infty} NB\left(j; \frac{p D}{r - 1}, \frac{1}{r}\right) Bin(j - O + x; j, p_{error}) \right] \right\}

where p is the fraction of chromosomes in the pool that carry the rare mutation, O is the number of observed reads carrying the mutation, x is the number of mutation-carrying reads that originate from normal individuals, i and j are the numbers of reads originating from normal individuals and from mutation carriers respectively, D is the pooled sequencing depth, and p_error is the average sequencing error rate.

Given that the tolerable pool misclassification rate of the overlapping pooled sequencing design is α, the optimal pooled sequencing depth D_optimal is defined as:

D_{optimal} = \min \{ D \mid F_N(D, T) \le \alpha \wedge F_P(D, T) \le \alpha, T \in [1, D] \}

and the corresponding observation threshold T is:

T = \min \{ T \mid F_N(D_{optimal}, T) \le \alpha \wedge F_P(D_{optimal}, T) \le \alpha \}
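The depth model above can be turned into a small search routine. The following is a minimal sketch under stated assumptions, not the patent's implementation: the negative binomial is read so that its mean is D (size D/(r − 1), stopping probability 1/r), the infinite sums are truncated at a hypothetical cut-off `n_max`, D is searched over integers only, and the values r = 2 and p = 1/8 (one heterozygous carrier among 4 diploid samples) are illustrative. All function names are made up for this example.

```python
import math

def nb_pmf(k, size, q):
    """Negative binomial pmf with real-valued size; mean = size * (1 - q) / q (assumed reading)."""
    log_c = math.lgamma(k + size) - math.lgamma(k + 1) - math.lgamma(size)
    return math.exp(log_c + k * math.log1p(-q) + size * math.log(q))

def bin_pmf(k, n, p):
    """Binomial pmf Bin(k; n, p)."""
    if k < 0 or k > n:
        return 0.0
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)

def called_mutant_pmf(mean_depth, r, p_call, t, n_max):
    """P(c reads show the mutation), c = 0..t-1, for NB depth and per-read call rate p_call."""
    size, q = mean_depth / (r - 1.0), 1.0 / r
    depth = [nb_pmf(n, size, q) for n in range(n_max + 1)]
    return [sum(depth[n] * bin_pmf(c, n, p_call) for n in range(c, n_max + 1))
            for c in range(t)]

def false_positive(D, T, r, p_error, n_max):
    """F_P: an all-normal pool still shows at least T mutant reads through errors."""
    return 1.0 - sum(called_mutant_pmf(D, r, p_error, T, n_max))

def false_negative(D, T, p, r, p_error, n_max):
    """F_N: a pool with mutant chromosome fraction p shows fewer than T mutant reads."""
    from_normal = called_mutant_pmf((1.0 - p) * D, r, p_error, T, n_max)
    # Bin(j - y; j, p_error) equals Bin(y; j, 1 - p_error): y mutant reads read correctly.
    from_mutant = called_mutant_pmf(p * D, r, 1.0 - p_error, T, n_max)
    return sum(from_normal[x] * from_mutant[o - x]
               for o in range(T) for x in range(o + 1))

def optimal_depth(p, r, p_error, alpha, d_max=500):
    """Smallest integer D, with its smallest T, such that F_P <= alpha and F_N <= alpha."""
    for D in range(1, d_max + 1):
        n_max = int(D + 12 * math.sqrt(r * D) + 30)  # truncation of the infinite sums
        for T in range(1, D + 1):
            if false_positive(D, T, r, p_error, n_max) <= alpha:
                if false_negative(D, T, p, r, p_error, n_max) <= alpha:
                    return D, T
                break  # raising T further can only increase F_N
    return None

print(optimal_depth(p=1 / 8, r=2.0, p_error=0.01, alpha=0.01))
```

The inner loop relies on F_P decreasing and F_N increasing as T grows, so for each D only the smallest T that meets the false-positive bound needs to be tested against the false-negative bound.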

The grouped overlapping pooled sequencing is as follows:

The samples are divided into B groups, the likely number of rare-mutation carriers in each group is calculated from the hypergeometric or binomial distribution, and an overlapping pooled sequencing scheme is designed independently for each group. The probability p_B that the number of rare-mutation carriers in a group does not exceed d_B can be calculated from either of the following two formulas:

p_{B} = \sum_{i = 0}^{d_{B}} \binom{n_{B}}{i} p_{v}^{i} (1 - p_{v})^{n_{B} - i}

p_{B} = \sum_{i = 0}^{d_{B}} \frac{\binom{d}{i} \binom{n - d}{n_{B} - i}}{\binom{n}{n_{B}}}

where i is a summation index, n is the total number of samples, n_B is the number of samples in each group, d is the total number of rare-mutation carriers, p_v is the frequency of rare-mutation carriers in the population, and d_B is the upper limit on the number of carriers in each group.

Assuming the B groups are mutually independent, the probability that every group contains at most d_B rare-mutation carriers is p_B raised to the power B; when this value exceeds a given threshold, every group can be taken to contain at most d_B rare-mutation carriers. Then, for each group of n_B samples of which at most d_B are carriers, an overlapping pooling scheme is designed independently and sequenced.
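The two formulas for p_B translate directly into code. The sketch below uses only the Python standard library; the function names and the example values (200 samples, 2 carriers, 4 groups of 50, carrier frequency 0.01) are illustrative assumptions, not figures from the patent's tables.

```python
import math

def p_at_most_binomial(n_b, d_b, p_v):
    """P(a group of n_b samples holds at most d_b carriers), carrier frequency p_v."""
    return sum(math.comb(n_b, i) * p_v ** i * (1.0 - p_v) ** (n_b - i)
               for i in range(min(d_b, n_b) + 1))

def p_at_most_hypergeom(n, d, n_b, d_b):
    """Same probability when exactly d of the n samples are carriers."""
    favourable = sum(math.comb(d, i) * math.comb(n - d, n_b - i)
                     for i in range(min(d_b, n_b) + 1))
    return favourable / math.comb(n, n_b)

def all_groups_within_bound(p_b, groups):
    """p_B ** B: probability that every group respects the bound, assuming independent groups."""
    return p_b ** groups

# Illustrative values: 200 samples, 2 carriers, 4 groups of 50, at most 1 carrier per group.
p_b = p_at_most_hypergeom(n=200, d=2, n_b=50, d_b=1)
print(p_b, all_groups_within_bound(p_b, groups=4))
print(p_at_most_binomial(n_b=50, d_b=1, p_v=0.01))
```

The hypergeometric form fits the case where the total number of carriers d is known; the binomial form fits the case where only the population frequency p_v is known.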

The sequencing cost model is:

C = t P_{l} + N_{d} P_{d}

where t is the number of pooled sequencing runs (equal to the number of library preparations and to the number of pools), P_l is the cost of preparing one library, N_d is the data volume, and P_d is the cost per unit of data produced. The data volume N_d depends on the sequencing depth and on the size of the sequenced region:

N_{d} = \sum_{i = 1}^{t} D_{i} \times R

where D_i is the average sequencing depth of pool i, R is the length of the sequenced region, and i is a summation index. Different overlapping pooled sequencing schemes require different numbers of pools and different data volumes; the cost of each scheme is calculated from this cost model, and the scheme with the lowest cost is selected as the optimal overlapping pooled sequencing scheme.
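A short sketch of the cost model follows. It assumes one library per pool and a data price expressed per base; the two candidate schemes, their pool counts, and their depths are invented purely for illustration, and the function names are hypothetical.

```python
def data_volume(depths, region_length):
    """N_d = sum_i D_i * R, in bases."""
    return sum(depth * region_length for depth in depths)

def scheme_cost(depths, region_length, library_price, price_per_base):
    """C = t * P_l + N_d * P_d, with one library per pool (t = len(depths))."""
    return (len(depths) * library_price
            + data_volume(depths, region_length) * price_per_base)

def cheapest_scheme(schemes, region_length, library_price, price_per_base):
    """Return (name, cost) of the lowest-cost scheme; schemes maps name -> list of pool depths."""
    costs = {name: scheme_cost(depths, region_length, library_price, price_per_base)
             for name, depths in schemes.items()}
    best = min(costs, key=costs.get)
    return best, costs[best]

# Hypothetical candidate schemes (pool counts and depths are made up for illustration).
schemes = {
    "few deep pools": [400.0] * 20,
    "many shallow pools": [150.0] * 45,
}
print(cheapest_scheme(schemes, region_length=30e6, library_price=500.0,
                      price_per_base=5300.0 / 100e9))
```

Because the library term grows with the number of pools and the data term grows with depth, the cheapest scheme depends on the ratio of the two prices rather than on either quantity alone.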

Beneficial effects: by setting the optimal sequencing depth for pooled sequencing and calculating the threshold number of reads carrying the rare mutation, the invention ensures that rare mutations are reliably observed in pooled samples while controlling cost by avoiding excess data output. In addition, dividing a large number of samples into groups and then designing the overlapping pooled sequencing for each group markedly reduces the sequencing data required, further lowering the cost of overlapping pooled sequencing. Finally, the invention selects the optimal overlapping pooled sequencing scheme according to the sequencing cost model, solving the practical problem with the least sequencing experimentation.

Brief description of the drawings

Fig. 1 is a schematic diagram of overlapping pooled sequencing; in the figure, 1 to 6 are samples, and A, B, C and D are pools.

Fig. 2 is the matrix representation of the overlapping pooling.

Fig. 3 is a schematic diagram of part of the sequencing steps of the invention.

Fig. 4a and Fig. 4b show the optimal sequencing depth and the positive-pool threshold required for pooled sequencing (sequencing error rate 0.01; tolerable pool misclassification rate 0.01).

Fig. 5 shows the optimal sequencing depth for pooled sequencing of 40 diploid samples under different sequencing error rates.

Detailed description of the embodiments

Samples are encoded by the way they are pooled: different samples are mixed into the same pool and sequenced together, each sample is mixed into at least two pools, and different pools partly overlap in the samples they contain. After pooling, a sequencing library is built for each pool and run on a sequencer to obtain sequencing data. Each pool is then judged from its sequencing result to contain or not contain a rare-mutation carrier, and all samples carrying the rare mutation (the positive samples) are identified from the pooling pattern of each sample.

As shown in Fig. 1, 6 samples are mixed into 4 pools, with 3 samples in each pool, so that each sample is sequenced twice. After sequencing, the rare mutation is found in pools B and C. Since sample 3 is the only sample that was mixed into both pool B and pool C, sample 3 can be inferred to be the rare-mutation carrier.
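The inference in this example can be sketched in a few lines. The pool membership below is a hypothetical layout chosen to satisfy the constraints stated in the text (4 pools of 3 samples, each sample in 2 pools, sample 3 the only one shared by pools B and C); it is not taken from Fig. 1 itself. The decoding rule, which discards every sample that appears in a negative pool, is a common group-testing decoder used here as an assumption.

```python
# Hypothetical pool layout consistent with the description of Fig. 1.
POOLS = {
    "A": {1, 2, 4},
    "B": {1, 3, 5},
    "C": {2, 3, 6},
    "D": {4, 5, 6},
}

def decode(pools, positive_pools):
    """Return the samples that never appear in a negative pool."""
    samples = set().union(*pools.values())
    in_negative_pool = set().union(
        *(members for name, members in pools.items() if name not in positive_pools)
    )
    return sorted(samples - in_negative_pool)

print(decode(POOLS, positive_pools={"B", "C"}))  # prints [3]
```

Pools A and D are negative, which clears samples 1, 2, 4, 5 and 6 and leaves sample 3 as the only candidate.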

The pooling pattern in this example can be represented by the matrix shown in Fig. 2, in which each row represents a pool and each column represents a sample; an element equal to 1 means that the pool of that row contains the sample of that column, and 0 means that it does not. A method widely used in current overlapping pooled sequencing is to mix the samples according to a disjunct matrix from group testing theory.

The steps of the optimized overlapping pooled sequencing method of the invention are outlined in Fig. 3. Given the total number of samples n, the upper limit d on the number of carriers, the tolerable pool misclassification rate α, and the length R of the sequenced region, the pooled sequencing experiment is designed first: the optimal pooled sequencing depth is calculated, the samples are divided into groups, a disjunct matrix from group testing theory is used to build an overlapping pooled sequencing scheme for each group, and the scheme with the lowest cost under the cost model is selected as the optimal overlapping pooled sequencing scheme. Finally, the samples are pooled and sequenced according to the optimal scheme. The three key steps are implemented as follows.

1. Determining the optimal sequencing depth

When designing a high-throughput sequencing experiment, a reasonable sequencing depth must be determined first. If the depth is too low, the mutation at the target site may go undetected and the experiment fails its purpose; if the depth is too high, the cost of the sequencing experiment rises. When designing a sequencing experiment that detects rare mutations in a large number of samples, determining the optimal depth is particularly important: the lowest depth that still allows rare-mutation carriers to be called accurately should be chosen, so that the experiment is as economical as possible.

The invention first calculates the optimal sequencing depth required for pooled sequencing. The sequencing depth is assumed to follow the negative binomial distribution:

P(N_{r}) = NB\left(N_{r}; \frac{D}{r - 1}, \frac{1}{r}\right)

NB(k; m, p) = \binom{k + m - 1}{k} p^{k} (1 - p)^{m}, \quad k = 0, 1, 2, \ldots

that is, the probability of k successes in a series of independent trials that continues until m failures have occurred, where m is the number of failures and p is the probability of success in each trial.

Sequencing errors are assumed to follow the binomial distribution:

P(E \mid N_{r}) = Bin(E; N_{r}, p_{error})

Bin(k; m, p) = \binom{m}{k} p^{k} (1 - p)^{m - k}, \quad k = 0, 1, 2, \ldots, m

that is, the probability of exactly k successes in m independent trials, where p is the probability of success in each trial.

Let the observation threshold be T: if at least T reads carrying the mutation are observed, the pool is deemed to contain a sample carrying the mutation; otherwise the pool is deemed to consist entirely of normal samples. On this basis, the probabilities of a false-positive error (F_P) and a false-negative error (F_N) in pooled sequencing are:

F_P = \sum_{N_{r} = T}^{\infty} NB\left(N_{r}; \frac{D}{r - 1}, \frac{1}{r}\right) \sum_{E = T}^{N_{r}} Bin(E; N_{r}, p_{error})

where D is the pooled sequencing depth, N_r is the number of times a given position on the genome is sequenced, and E is the number of sequencing errors.

F_N = \sum_{O = 0}^{T - 1} \sum_{x = 0}^{O} \left\{ \left[ \sum_{i = x}^{\infty} NB\left(i; \frac{(1 - p) D}{r - 1}, \frac{1}{r}\right) Bin(x; i, p_{error}) \right] \times \left[ \sum_{j = O - x}^{\infty} NB\left(j; \frac{p D}{r - 1}, \frac{1}{r}\right) Bin(j - O + x; j, p_{error}) \right] \right\}

where p is the fraction of chromosomes in a pool that carry the rare mutation, O is the number of observed reads carrying the mutation, x is the number of mutation-carrying reads that originate from normal individuals, i and j are the numbers of reads originating from normal individuals and from mutation carriers respectively, D is the pooled sequencing depth, and p_error is the average sequencing error rate. Given that the tolerable pool misclassification rate of the overlapping pooled sequencing design is α, the optimal pooled sequencing depth D_optimal is defined as:

D_{optimal} = \min \{ D \mid F_N(D, T) \le \alpha \wedge F_P(D, T) \le \alpha, T \in [1, D] \}

and the corresponding observation threshold T is:

T = \min \{ T \mid F_N(D_{optimal}, T) \le \alpha \wedge F_P(D_{optimal}, T) \le \alpha \}

The optimal sequencing depth and the observation threshold rise rapidly as the number of pooled samples increases (see Fig. 4), and the higher the average sequencing error rate, the higher the optimal pooled sequencing depth (see Fig. 5).

2. Grouping and pooling of samples

The large sample set is divided into several groups, and an overlapping pooled sequencing scheme is designed independently for each group to further reduce the sequencing data volume.

We propose a grouped overlapping pooled sequencing method: the samples are divided into B groups, the likely number of carriers in each group is calculated from the hypergeometric or binomial distribution given below, and each group undergoes independent overlapping pooled sequencing. The probability p_B that the number of carriers in a group does not exceed d_B can be calculated from either of the following two formulas. Assuming the groups are mutually independent, the probability that every group contains at most d_B carriers is p_B raised to the power B; when this value exceeds a threshold β (for example 0.7), each group can be taken to contain at most d_B carriers.

p_{B} = \sum_{i = 0}^{d_{B}} \binom{n_{B}}{i} p_{v}^{i} (1 - p_{v})^{n_{B} - i}

p_{B} = \sum_{i = 0}^{d_{B}} \frac{\binom{d}{i} \binom{n - d}{n_{B} - i}}{\binom{n}{n_{B}}}

where n is the total number of samples, n_B is the number of samples in each group, d is the total number of carriers, and p_v is the frequency of carriers in the population.

After grouping, an overlapping pooling scheme is designed independently for each group and sequenced. Suppose each group contains n_B samples, of which at most d_B are carriers; an overlapping pooled sequencing scheme is designed for each group independently from a disjunct matrix of group testing theory. The key to designing an overlapping pooled sequencing scheme is to generate the disjunct matrix and to pool the samples according to it. Take any d + 1 columns of the sample pooling matrix; if none of them is covered by the other d columns, the pooling matrix is called a d-disjunct matrix. A d-disjunct matrix can identify up to d positive samples among the samples. Several methods for designing disjunct matrices exist; we adopt a layered design. In a disjunct matrix obtained by layered design, the pools can be divided into several layers, and the pools of each layer together cover every sample exactly once.
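The d-disjunct property and the layer structure can both be checked mechanically. The sketch below builds a layered pooling matrix with a polynomial-based construction, used here as a stand-in: it is not claimed to be the exact design used in the patent. It assumes q is prime and k ≤ q; each sample is treated as a polynomial over GF(q) whose coefficients are the base-q digits of the sample index, and in layer j the sample joins the pool numbered by the polynomial's value at j. With samples indexed below q^2 (degree at most 1), two samples share at most one pool, so k = 3 layers suffice for a 2-disjunct matrix. All names are hypothetical.

```python
from itertools import combinations

def layered_design(n_samples, q, k):
    """Rows = pools (k layers of q pools each), columns = samples."""
    matrix = [[0] * n_samples for _ in range(q * k)]
    for s in range(n_samples):
        digits, v = [], s
        while True:  # base-q digits of the sample index
            digits.append(v % q)
            v //= q
            if v == 0:
                break
        for j in range(k):
            value = sum(c * pow(j, e, q) for e, c in enumerate(digits)) % q
            matrix[j * q + value][s] = 1  # pool `value` of layer j
    return matrix

def is_d_disjunct(matrix, d):
    """True if no column is covered by the union of any d other columns."""
    n_cols = len(matrix[0])
    cols = [frozenset(r for r, row in enumerate(matrix) if row[c])
            for c in range(n_cols)]
    for c in range(n_cols):
        others = [cols[j] for j in range(n_cols) if j != c]
        for group in combinations(others, d):
            if cols[c] <= frozenset().union(*group):
                return False
    return True

design = layered_design(n_samples=25, q=5, k=3)
print(len(design), is_d_disjunct(design, 2), is_d_disjunct(design, 3))
```

The exhaustive check is exponential in d and is meant only for small designs like this one; it confirms the definition rather than offering a practical test for large matrices.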

After grouping, each group contains fewer samples, so each pool also contains fewer samples. Since the optimal sequencing depth is positively correlated with the number of samples in a pool, the pooled sequencing depth in the grouped overlapping pooled sequencing method is lower, and the total data volume required drops markedly.

3. Determining the cost of the overall sequencing experiment

The invention constructs the following cost model of overlapping pooled sequencing:

C = t P_{l} + N_{d} P_{d}

where t is the number of pooled sequencing runs (equal to the number of library preparations and to the number of pools), P_l is the cost of preparing one library, N_d is the data volume, and P_d is the cost per unit of data produced. The data volume N_d depends on the sequencing depth and on the size of the sequenced region:

N_{d} = \sum_{i = 1}^{t} D_{i} \times R

where D_i is the average sequencing depth of pool i and R is the length of the sequenced region.

Based on the cost model of overlapping pooled sequencing, the costs of library preparation and of sequencing data are considered together to select the best overlapping pooled sequencing scheme and, for grouped overlapping pooled sequencing, the best number of groups.

Case study:

The task is to identify 2 carriers among 200 diploid samples, with a sequenced region of 30 Mb of the genome (Mb = megabase; comparable to the total length of the human exonic regions). The sequencing error rate (p_error) is first set to 0.01 and the tolerable pool misclassification rate (α) of the overlapping pooled sequencing design to 0.01. We calculate the optimal pooled sequencing depth from the sequencing depth model, together with the positive-pool threshold: when the number of reads containing the mutation reaches this threshold, the pool is called positive (Table 1). A pool containing 1 diploid sample corresponds to the individual sequencing depth required by the non-pooled sequencing strategy.

Table 1. Optimal sequencing depth and positive-pool threshold for pooled sequencing

Next, disjunct matrices are used to build the pooling scheme, with the tolerable pool misclassification rate (α) set to 0.01. For disjunct matrices obtained by the shifted transversal (layered) design, the number of pools and the number of samples per pool differ for different design parameters (q, k), where k is the number of layers in the layered design and q is the number of pools in each layer. From the optimal sequencing depths, we calculate the data volume required by the overlapping pooled sequencing scheme under each set of design parameters. Assuming a unit price of $500 for building a sequencing library and a data production cost of $5300 per 100 Gb (Gb = gigabase), the cost of each design is calculated from the cost model (Table 2). Scheme 4, which has the lowest cost, is selected as the optimal overlapping pooled sequencing design. By comparison, the non-pooled sequencing strategy requires 200 libraries and 162 Gb of data, for a total cost of $108,586. The cost of the optimal overlapping pooled sequencing scheme is only 66% of the cost of the non-pooled strategy.

Table 2. Overlapping pooled sequencing schemes for identifying 2 carriers among 200 diploid samples. The tolerable pool misclassification rate (α) of the designed schemes is set to 0.01. In the table, q and k are the input parameters of the shifted transversal design.
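The baseline figures quoted for the non-pooled strategy can be checked against the cost model. The worked arithmetic below assumes one library per sample (t = 200) and the same depth D for every individual; the depth of 27 is inferred from the quoted data volume rather than stated in the text, and the last line is approximate because the 66% figure is rounded.

```latex
\begin{aligned}
C_{\text{non-pooled}} &= t P_{l} + N_{d} P_{d}
  = 200 \times \$500 + 162\ \text{Gb} \times \frac{\$5300}{100\ \text{Gb}} \\
  &= \$100{,}000 + \$8{,}586 = \$108{,}586, \\
N_{d} &= \sum_{i = 1}^{t} D_{i} \times R = 200 \times D \times 30\ \text{Mb} = 162\ \text{Gb}
  \;\Rightarrow\; D = 27, \\
C_{\text{optimal}} &\approx 0.66 \times \$108{,}586 \approx \$71{,}700 .
\end{aligned}
```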

With the grouped overlapping pooled sequencing method, the 200 samples are divided into several groups, the likely number of carriers in each group is calculated from the grouping model, an overlapping pooled sequencing scheme is designed independently for each group, and the best sequencing scheme for each group is selected according to the cost model. On this basis, the number of pools and the data volume required by grouped overlapping pooled sequencing are calculated, the total cost follows from the cost model, and the best number of groups is selected accordingly. In this example, when p_B raised to the power B exceeds 0.7, each group is taken to contain at most d_B carriers. With 3 groups, the cost of identifying the carriers is lowest. A group count of 1 means that overlapping pooled sequencing is carried out directly without grouping; by comparison, the markedly smaller data volume required by grouped overlapping pooled sequencing directly lowers the total cost. The cost of overlapping pooled sequencing in 3 groups is 80% of the cost of direct overlapping pooled sequencing (see Table 3).

Table 3. Grouped overlapping pooled sequencing designs for identifying 2 carriers among 200 diploid samples.
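How the per-group carrier bound d_B follows from the 0.7 threshold in this case can be sketched as below. The code applies the hypergeometric formula with n = 200 and d = 2; taking the group size as the ceiling of n divided by the number of groups is an assumption, since the text does not say how uneven groups are handled, and the function names are hypothetical.

```python
import math

def p_at_most(n, d, n_b, d_b):
    """Hypergeometric P(a group of n_b samples holds at most d_b of the d carriers)."""
    favourable = sum(math.comb(d, i) * math.comb(n - d, n_b - i)
                     for i in range(min(d_b, n_b) + 1))
    return favourable / math.comb(n, n_b)

def smallest_bound(n, d, groups, beta):
    """Smallest d_B with p_B ** groups > beta, for groups of size ceil(n / groups)."""
    n_b = math.ceil(n / groups)
    for d_b in range(d + 1):
        if p_at_most(n, d, n_b, d_b) ** groups > beta:
            return n_b, d_b
    return n_b, d

for groups in range(1, 6):
    print(groups, smallest_bound(n=200, d=2, groups=groups, beta=0.7))
```

Under these assumptions the bound stays at d_B = 2 for 1 or 2 groups and drops to d_B = 1 for 3, 4 and 5 groups, which is in line with the cost minimum at 3 groups reported in the text.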

The invention first constructs the depth model of pooled sequencing and the cost model of overlapping pooled sequencing, and selects the design parameters of overlapping pooled sequencing from these two models so as to minimize sequencing cost. Because of the constraints imposed by sequencing depth and by the number of pooled samples, we propose the grouped overlapping pooled sequencing method: the large sample set is divided into several groups, an overlapping pooled sequencing scheme is designed independently for each group, and the optimal number of groups is selected according to the cost model to reduce cost as far as possible. Table 4 shows that, compared with the non-pooled sequencing strategy, the optimized overlapping pooled sequencing method markedly reduces the number of sequencing runs, and the grouped overlapping pooled sequencing method further reduces the data volume required and markedly lowers the total cost of sequencing.

Table 4. Comparison of three schemes for identifying 2 carriers among 200 diploid samples

In summary, the sequencing depth model proposed by the invention is used to select the optimal depth for pooled sequencing and to reduce the sequencing data volume; the model rests on the assumptions that sequencing depth follows a negative binomial distribution and that sequencing errors follow a binomial distribution. Second, a grouped overlapping pooled sequencing method is proposed in which the large sample set is divided into several groups. The likely number of rare-mutation carriers in each group is calculated from the hypergeometric or binomial distribution, and an overlapping pooled sequencing scheme is designed independently for each group accordingly, further reducing the sequencing data volume.

In addition, since an actual pooled sequencing run incurs costs from both library preparation and sequencing data production, the invention constructs a cost model of overlapping pooled sequencing and selects the optimal overlapping pooled sequencing scheme accordingly, so as to minimize the experimental cost of screening for rare-mutation carriers and to improve the efficiency of overlapping pooled sequencing. Finally, the samples are pooled according to the optimal overlapping pooled sequencing scheme, and a sequencing library is built for each pool and sequenced. Each pool is judged from the pooled sequencing results to contain or not contain a rare-mutation carrier, and the samples carrying the rare mutation are identified from the pooling pattern of each sample.

The preferred embodiments of the invention have been described in detail above; however, the invention is not limited to the specific details of these embodiments. Within the scope of the technical concept of the invention, various equivalent modifications may be made to the technical solution, and all such equivalent modifications fall within the scope of protection of the invention.

It should further be noted that the specific technical features described in the above embodiments may be combined in any suitable manner, provided they do not contradict one another. To avoid unnecessary repetition, the possible combinations are not described separately.

Moreover, the different embodiments of the invention may also be combined arbitrarily; as long as a combination does not depart from the idea of the invention, it should likewise be regarded as part of the disclosure of the invention.

Claims

1. An optimized overlapping pooled sequencing method, characterized by comprising the following steps:

Step 1: calculate the optimal sequencing depth from the pooled-sequencing depth model, sequence the samples by grouped overlapping pooling, and select the best sequencing scheme according to the sequencing cost model;

wherein the optimal sequencing depth is the minimum sequencing depth that satisfies the false-positive and false-negative error requirements, calculated on the basis that sequencing depth follows a negative binomial distribution and sequencing errors follow a binomial distribution;

2. The optimized overlapping pooled sequencing method according to claim 1, characterized in that the model for calculating the optimal sequencing depth is as follows:

P(N_{r}) = NB\left(N_{r}; \frac{D}{r - 1}, \frac{1}{r}\right)

where D is the average sequencing depth, N_r is the number of times a given position on the genome is sequenced, r is the parameter of the negative binomial distribution, which depends on the sequencing platform and the sequencing target, and NB denotes the negative binomial distribution,

P(E \mid N_{r}) = Bin(E; N_{r}, p_{error})

where E is the number of sequencing errors, p_error is the average sequencing error rate, and Bin denotes the binomial distribution,

F_P = \sum_{N_{r} = T}^{\infty} NB\left(N_{r}; \frac{D}{r - 1}, \frac{1}{r}\right) \sum_{E = T}^{N_{r}} Bin(E; N_{r}, p_{error})

where D is the pooled sequencing depth, N_r is the number of times a given position on the genome is sequenced, E is the number of sequencing errors, p_error is the average sequencing error rate, and r is the parameter of the negative binomial distribution;

F_N = \sum_{O = 0}^{T - 1} \sum_{x = 0}^{O} \left\{ \left[ \sum_{i = x}^{\infty} NB\left(i; \frac{(1 - p) D}{r - 1}, \frac{1}{r}\right) Bin(x; i, p_{error}) \right] \times \left[ \sum_{j = O - x}^{\infty} NB\left(j; \frac{p D}{r - 1}, \frac{1}{r}\right) Bin(j - O + x; j, p_{error}) \right] \right\}

where p is the fraction of chromosomes in the pool that carry the rare mutation, O is the number of observed reads carrying the mutation, x is the number of mutation-carrying reads that originate from normal individuals, i and j are the numbers of reads originating from normal individuals and from mutation carriers respectively, D is the pooled sequencing depth, and p_error is the average sequencing error rate;

D_{optimal} = \min \{ D \mid F_N(D, T) \le \alpha \wedge F_P(D, T) \le \alpha, T \in [1, D] \}

and the corresponding observation threshold T is:

T = \min \{ T \mid F_N(D_{optimal}, T) \le \alpha \wedge F_P(D_{optimal}, T) \le \alpha \}

3. The optimized overlapping pooled sequencing method according to claim 1, characterized in that the grouped overlapping pooled sequencing is as follows:

p_{B} = \sum_{i = 0}^{d_{B}} \binom{n_{B}}{i} p_{v}^{i} (1 - p_{v})^{n_{B} - i}

p_{B} = \sum_{i = 0}^{d_{B}} \frac{\binom{d}{i} \binom{n - d}{n_{B} - i}}{\binom{n}{n_{B}}}

where i is a summation index, n is the total number of samples, n_B is the number of samples in each group, d is the total number of rare-mutation carriers, p_v is the frequency of rare-mutation carriers in the population, and d_B is the upper limit on the number of carriers in each group;

4. The optimized overlapping pooled sequencing method according to claim 1, characterized in that the sequencing cost model is:

C = t P_{l} + N_{d} P_{d}

N_{d} = \sum_{i = 1}^{t} D_{i} \times R