CN109439729A - Adapter, adapter mixture and related method for detecting low-frequency variants

Info

Publication number: CN109439729A
Application number: CN201811608440.XA
Authority: CN
Inventors: 黄炳顶; 章扬; 任勇哲; 王丹丹; 戴珩; 史耀舟
Original assignee: Shanghai Whale Boat Gene Technology Co Ltd
Current assignee: Xukang Medical Science & Technology Suzhou Co ltd
Priority date: 2018-12-27
Filing date: 2018-12-27
Publication date: 2019-03-08

Abstract

The present invention relates to an adapter for detecting low-frequency variants. The adapter comprises two complementary single DNA strands. One strand, the P5 strand, comprises in order from the 5' end to the 3' end: a sequence that partially overlaps the upstream amplification primer; a sequence that binds the upstream sequencing primer; a molecular tag drawn from a set of specific nucleotide sequences; and one overhanging base T. The other strand, the P7 strand, comprises three parts in order from the 5' end to the 3' end: a molecular tag that is the reverse complement of the molecular tag in the P5 strand; a sequence that binds the downstream sequencing primer; and a sequence that binds the downstream amplification primer. The invention further relates to an adapter mixture and a related method for detecting low-frequency variants. Preparing libraries from samples containing low-frequency variants and single-strand damage with the adapter mixture of the invention, followed by high-throughput sequencing and the bioinformatics analysis workflow and algorithm disclosed herein, effectively improves the accuracy of variant detection.

Description

Adapter, adapter mixture and related method for detecting low-frequency variants

Technical field

The present invention relates to the field of bioinformatics, more particularly to genetic testing, and in particular to an adapter suitable for detecting low-frequency somatic variants and single-strand damage, and to a method of using it.

Background art

Compared with first-generation sequencing, second-generation (next-generation) sequencing can sequence millions or even more than ten billion sequences simultaneously and in parallel. This has greatly reduced sequencing cost and driven rapid adoption in scientific research, forensics, clinical practice and other fields. For example, the cell-free DNA in maternal plasma carries fetal genetic information, so low-depth whole-genome sequencing of a pregnant woman's plasma cell-free DNA (cfDNA) can detect fetal chromosomal abnormalities, and non-invasive prenatal screening has rapidly driven the growth of the genetic testing industry. With the announcement of the Precision Medicine Initiative in the United States, the domestic tumor gene testing industry is also developing quickly. Studies in recent years have shown that apoptotic or necrotic tumor cells release small intracellular DNA fragments into the circulation; this DNA is circulating tumor DNA (ctDNA). Compared with the traditional way of obtaining tumor specimens by surgery or tissue biopsy, ctDNA detection can overcome the spatial and temporal heterogeneity of tumors and is easy to repeat, which makes it the mainstream direction of liquid biopsy today. However, whereas fetal DNA makes up a substantial fraction of maternal plasma cfDNA (4% or more at 12 weeks of gestation), ctDNA makes up only a very small fraction of total cfDNA in the plasma of tumor patients: depending on cancer type and disease stage, it accounts for only 0.1% to 1% in most cases. ctDNA detection therefore requires higher sensitivity and specificity. In current second-generation sequencing workflows, pre-library preparation, hybrid capture and sequencing inevitably introduce amplification and sequencing errors, as well as damage caused by long hybridization times, so that low-frequency mutations cannot be distinguished from background noise, leading to false positives or false negatives.

To improve the error-correction capability of sequencing, researchers have proposed several new methods, which currently fall into two main classes: the single-strand circularization method and the molecular tag method.

1. Single-strand circularization method

This method was proposed by AccuraGen and is named Firefly technology. Its main principle is as follows: double-stranded cfDNA with a fragment size of about 170 bp is first denatured into single-stranded DNA, and the single-stranded DNA is then ligated into circles. Unidirectional rolling circle amplification (RCA) with a single-sided target-gene-specific primer ensures that each DNA fragment is copied several times in tandem. P5/P7 adapters are then introduced and paired-end (PE) 150 bp sequencing is performed, so that each insert is sequenced at least twice, and the repeated reads confirm whether a detected variant is real. The advantages of this method are that rolling circle amplification always copies the original molecule, so errors do not accumulate; that target regions are enriched by multiplex PCR, so no probes need to be synthesized for capture, which simplifies the procedure; and that sequencing cost is lower than with the molecular tag method. The disadvantages are that the result depends on the efficiency of single-strand circularization and of multiplex PCR, that the number of primer pairs is limited, that primer size is difficult to control, and that single-strand damage in a double-stranded template cannot be identified.

2. Molecular tag method

Molecular tagging with unique molecular identifiers (Unique Molecular Identifier, UMI) is a widely used method. Its principle is to add a distinctive tag sequence to each original DNA template before the library is PCR-amplified and sequenced. During data analysis, the multiple fragments amplified from the same DNA template are identified by their tag sequence, and the distribution of a detected variant across those fragments shows which variants are false positives caused by random errors in PCR amplification, hybrid capture and sequencing, and which are variants truly carried by the patient, thereby improving detection sensitivity and specificity.

Depending on where the molecular tag is placed, molecular tags are divided into single-strand molecular tags and duplex (double-strand) molecular tags.

A single-strand molecular tag can only label single-stranded DNA molecules, or label the two strands of a double-stranded DNA separately; it cannot label both strands of a duplex together. It is generally used for single-stranded DNA library construction, or when the molecular tag sits on one of the single-stranded arms of a Y-shaped adapter. Its advantage is that false positives are greatly reduced with a relatively small amount of sequencing. Its disadvantage is that the information from the complementary strand of the original double-stranded template cannot be used for further error correction. If a PCR error occurs in an early cycle of exponential amplification, or if formalin-fixed, paraffin-embedded sample DNA (FFPE DNA) contains single-strand damage, a single-strand molecular tag alone cannot detect it, and duplex molecular tagging is required.

Duplex molecular tagging was proposed by Michael W et al. in an article published in 2012. In that design the end of the double-stranded Y-shaped adapter carries 12 random nucleotides (N) as the molecular tag, followed by 4 nucleotides of known sequence that serve as an identification sequence for the molecular tag, followed by one overhanging base A. The adapter is joined by TA ligation to double-stranded DNA molecules that have a T added at their ends, so that each double-stranded DNA molecule receives a unique molecular tag at each end. Different original templates can thus be distinguished, and the base pairing between the sense and antisense strands can be used for further error correction. In 2014 Michael W et al. improved the method by giving the adapter an overhanging base T, which suits the current mainstream library construction methods. However, the method involves multi-step enzymatic reactions and several purification steps, adapter preparation is cumbersome, the final adapter concentration is difficult to quantify accurately, the quality control steps place high demands on experimental conditions, and the success rate of adapter preparation is low, all of which have limited the application and adoption of duplex molecular tagging.

Summary of the invention

The object of the present invention is to overcome the shortcomings of the prior art described above by providing a Y-shaped adapter with a new duplex molecular tag. The duplex molecular tag sequences are a set of specific nucleotide sequences, and preparing adapters that carry them is very simple: multiple adapter pairs, each containing a molecular tag of specific nucleotide sequence, are annealed separately and then mixed in equal proportions. Preparing libraries from samples containing low-frequency variants and single-strand damage with this adapter mixture, followed by high-throughput sequencing and the bioinformatics analysis workflow and algorithm disclosed herein, effectively improves the accuracy of variant detection.

To achieve the above object, one aspect of the present invention provides an adapter for detecting low-frequency variants, constituted as follows:

The adapter comprises two complementary single DNA strands. One strand, the P5 strand, comprises in order from the 5' end to the 3' end: a sequence that partially overlaps the upstream amplification primer; a sequence that binds the upstream sequencing primer; a molecular tag drawn from a set of specific nucleotide sequences; and one overhanging base T. The other strand, the P7 strand, comprises three parts in order from the 5' end to the 3' end: a molecular tag that is the reverse complement of the molecular tag in the P5 strand; a sequence that binds the downstream sequencing primer; and a sequence that binds the downstream amplification primer.

Preferably, in the P5 strand, the sequence that partially overlaps the upstream amplification primer and the sequence that binds the upstream sequencing primer may partially overlap, and the 3' end carries a thio (phosphorothioate) modification; in the P7 strand, the sequence that binds the downstream sequencing primer and the sequence that binds the downstream amplification primer may partially or even completely overlap, and the 5' end of the strand is phosphorylated.

Preferably, the upstream amplification primer and the downstream amplification primer each include a sample index.

Preferably, the adapter is a Y-shaped truncated adapter, and the length of the molecular tag is 3 to 12 bp.

Preferably, the P5 strand has the nucleotide sequence shown in SEQ ID NO:3, the P7 strand has the nucleotide sequence shown in SEQ ID NO:4, the upstream amplification primer has the nucleotide sequence shown in SEQ ID NO:1, and the downstream amplification primer has the nucleotide sequence shown in SEQ ID NO:2.

The present invention also provides an adapter mixture comprising at least eight of the said adapters mixed in proportion.

Preferably, at any given position viewed down the set of duplex molecular tags in the adapter mixture (vertically), all four bases are present; preferably, the vertical ratio of the four bases A:T:G:C across the set of molecular tags is close to 1:1:1:1. Viewed along each molecular tag (horizontally), runs of 4 or more identical bases are avoided; preferably, runs of 2 or more G bases are avoided at the start of the molecular tag.

Preferably, the duplex molecular tags of any two adapters in the mixture differ from each other at 3 or more nucleotide positions; preferably, the duplex molecular tags of the adapters in the mixture are not all of the same length; preferably, the adapters are mixed in equal proportions, or the mixing proportions are readjusted according to the proportions actually measured in sequencing data.

The present invention also provides a method for detecting low-frequency somatic variants, comprising the following steps:

(1) Synthesize the P5 strand and the P7 strand of each adapter pair in the adapter mixture separately, anneal them to form Y-shaped adapters, and mix these in proportion to form the adapter mixture;

(2) Ligate the adapter mixture carrying duplex molecular tags to a sample of cell-free DNA fragments to obtain ligation products, and PCR-amplify the ligation products with upstream and downstream amplification primers carrying sample indexes to obtain amplification products;

(3) Perform hybrid capture on the target fragments in the amplification products to obtain a targeted capture library, perform high-depth paired-end sequencing on the targeted capture library, and split the data by sample according to the sample indexes at both ends;

(4) Perform quality control on the sequencing data by removing low-quality bases, low-quality reads and contaminating adapter sequences, and at the same time error-correct the overlapping part of each read pair according to base quality, to obtain clean data;

(5) Align the reads to the reference genome and, at each alignment position, group the read pairs that have the same molecular tag sequence, the same CIGAR string and the same alignment orientation into one read pair family;

(6) For each read pair family, determine the single-strand consensus sequence (SSCS) by an exact calculation based on Bayes' theorem and recalculate the base quality values, reducing sequencing errors;

(7) For each SSCS generated, find the SSCS whose molecular tag sequence is complementary to it and generate a duplex consensus sequence (DSCS), while retaining the SSCSs that cannot form a DSCS; move the alignment position to the next base and repeat steps (5) to (7);

(8) Realign the final consensus sequences to the reference genome, call variants to obtain an initial variant set, annotate this set, and filter it against specific errors, population databases and coding regions to obtain the final true and reliable low-frequency somatic mutations.

Preferably, in step (1), the annealing condition is either 95 °C for 5 min followed by slow cooling to 25 °C at 0.02 °C/s, or 95 °C for 5 min, after which the PCR instrument is switched off and the reaction is left to stand until it cools to room temperature;

Preferably, in step (2), the total input of the cell-free DNA fragment sample is 20 to 33 ng;

Preferably, in step (3), the sequencing depth is 10,000x to 30,000x; the sample indexes at both ends are unique dual indexes (UDI), that is, the index sequences at both ends differ between all samples;

Preferably, in step (5), the threshold on read pair family size (read cluster size) is lowered to 2, which generates more read pair families; in addition, a read pair family that contains only one read pair but can form a DSCS with another SSCS is also used to generate a DSCS; the final output fastq file retains both the SSCS and the DSCS sequences together with their base quality values;

Preferably, in step (6), the prior probability under Bayes' theorem is determined as follows: if the observed base is identical to the candidate true base, the probability is $1 - 10^{-q/10}$; otherwise it is $10^{-q/10}/3$, where q is the base quality value. This distribution is written $p(b, b_i, q_i)$. For each of the 4 possible bases, the posterior probability is calculated according to formula one below. For each base position of the SSCS, the base/quality pairs $(b_i, q_i)$ from the read pair family are used to calculate the probability that the consensus base I is b ($b \in \{A, C, G, T\}$); the base with the highest probability is taken as the true base, which determines the true base at every position. At the same time, according to formula two below, the base quality is recalculated from the posterior probability of the true base, giving the error-corrected consensus reads.

$q_c = -10 \log_{10}\left(1 - P[I = b_c \mid \{(b_i, q_i)\}]\right)$ (formula two);
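By way of illustration only, the per-position calculation described for step (6) can be sketched in Python as follows. Formula one is not reproduced in this text, so the sketch assumes the usual form with a uniform prior over the four bases, in which the posterior of each candidate base is the product of $p(b, b_i, q_i)$ over the family, normalized over the four candidates. The function names, the `max_q` cap and the clamping of the error probability at 0.75 are assumptions of this sketch, not features stated in the disclosure.

```python
import math

BASES = "ACGT"

def likelihood(candidate, observed, q):
    """p(b, b_i, q_i): probability of the observation given the candidate true base."""
    # Assumption: clamp the error probability so that q = 0 stays uninformative
    # instead of producing log(0).
    err = min(10 ** (-q / 10), 0.75)
    return 1 - err if observed == candidate else err / 3

def consensus_base(observations, max_q=60):
    """observations: (base, phred quality) pairs of one read pair family at one position.

    Returns the most probable base and its recalculated quality (formula two).
    """
    # Sum log-likelihoods so that large families do not underflow.
    log_like = {
        b: sum(math.log(likelihood(b, obs, q)) for obs, q in observations)
        for b in BASES
    }
    top = max(log_like.values())
    weights = {b: math.exp(v - top) for b, v in log_like.items()}
    total = sum(weights.values())
    best = max(BASES, key=lambda b: weights[b])
    # 1 - posterior(best), computed as the summed posterior of the other bases.
    error = sum(w for b, w in weights.items() if b != best) / total
    q_c = max_q if error <= 0 else min(max_q, -10 * math.log10(error))
    return best, q_c
```

For a family with two A observations at quality 30 and one C at quality 20, the sketch returns A with a recalculated quality of roughly 45; a family with a single observation keeps its original quality.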

Preferably, in step (7), when two SSCSs are merged into a DSCS, a base is kept if the two SSCSs agree at that position; otherwise the base at that position is changed to N.
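The merging rule of step (7) can be sketched as follows. The sketch assumes that both SSCSs are plain strings of equal length that have already been oriented to the same reference strand; the function name is hypothetical.

```python
def merge_to_dscs(sscs_a, sscs_b):
    """Merge two strand consensus sequences into a duplex consensus sequence.

    Assumes both inputs are already oriented to the reference and position-matched.
    """
    if len(sscs_a) != len(sscs_b):
        raise ValueError("SSCS sequences must have the same length")
    # Keep a base only where both strands agree; otherwise mask it with N.
    return "".join(a if a == b else "N" for a, b in zip(sscs_a, sscs_b))
```

For example, `merge_to_dscs("ACGT", "ACTT")` yields `"ACNT"`.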

Brief description of the drawings

Fig. 1 shows a structural schematic of the adapter sequences and amplification primers of the present invention.

Fig. 2 shows the Agilent 2100 Bioanalyzer quality-control trace of the annealed product of Duplex Adapter #10.

Fig. 3 shows the Agilent 2100 Bioanalyzer quality-control trace of the adapter mixture.

Fig. 4 shows the flow chart of the consensus sequence generation process.

Fig. 5 shows the detection sensitivity at different sequencing depths and lower detection limits.

Fig. 6 shows the detection specificity at different sequencing depths and lower detection limits.

Specific embodiments

To describe the technical content of the present invention more clearly, it is further described below with reference to specific embodiments.

The present invention provides a Y-shaped adapter with a duplex molecular tag. The duplex molecular tag sequences are a set of specific nucleotide sequences, and preparing adapters that carry them is very simple: multiple adapter pairs, each containing a molecular tag of specific nucleotide sequence, are annealed separately and then mixed in equal proportions. Preparing libraries from samples containing low-frequency variants and single-strand damage with this adapter mixture, followed by high-throughput sequencing and the bioinformatics analysis workflow and algorithm disclosed herein, effectively improves the accuracy of variant detection.

The Y-shaped adapter mixture with duplex molecular tags provided by the invention is an equal-proportion mixture of a set of adapters. Each adapter in the mixture is formed by annealing two single DNA strands, one named the P5 strand and the other the P7 strand.

As shown in Fig. 1, the P5 strand consists of four functional parts in order from the 5' end to the 3' end: a sequence S2 that partially overlaps the upstream amplification primer S1, a sequence S3 that binds the upstream sequencing primer, a molecular tag S4 drawn from the set of specific nucleotide sequences, and one overhanging base T. The 3' end of the strand carries a thio (phosphorothioate) modification. Sequences S2 and S3 may partially or even completely overlap.

As shown in Fig. 1, the P7 strand consists of three functional parts in order from the 5' end to the 3' end: a sequence S5 that is the reverse complement of the molecular tag S4 (with a phosphorylated 5' end), a sequence S6 that binds the downstream sequencing primer, and a sequence S8 that partially binds the downstream amplification primer S7. Sequences S6 and S8 may partially or even completely overlap.

The S3+S4 sequence of the P5 strand and the S5+S6 sequence of the P7 strand are partially reverse-complementary, so the two strands form a Y-shaped adapter after annealing. The full-length P5 strand S9 and the full-length P7 strand S10 are synthesized separately and then annealed, and the annealed adapters are mixed in equal proportions to form the adapter mixture. Within the mixture, the adapters differ only in their duplex molecular tags, that is, S4 and S5; all other sequences are identical. The duplex molecular tags in the mixture are specific nucleotide sequences rather than random nucleotide sequences, so, unlike the prior art, no identification sequence for the molecular tag needs to be added next to the tag sequence.

Note that, for cost reasons, the adapter of the invention does not include the sample index used to distinguish different samples; it is a truncated adapter. The sample indexes are introduced during PCR amplification by the upstream amplification primer S1 and the downstream amplification primer S7. The upstream and downstream amplification primers also contain the sequences complementary to the sequences on the sequencing flow cell that are needed for cluster generation, so the truncated adapter of the invention is used together with the upstream and downstream amplification primers.

The sequence of the upstream amplification primer S1 is as shown in SEQ ID NO:1:

5'-AATGATACGGCGACCACCGAGATCTACACNNNNNNNNACACTCTTTCCCTACACGAC-3';

The sequence of the downstream amplification primer S7 is as shown in SEQ ID NO:2:

5'-CAAGCAGAAGACGGCATACGAGATNNNNNNNNGTGACTGGAGTTCAGACGTGT-3'.

Here the two "NNNNNNNN" sequences are the sample indexes at the P5 end and the P7 end of the library, respectively. Each is a combination of 8 nucleotides; the sample index sequences at the P5 end and the P7 end differ from each other, and the sample indexes of different samples also differ. Sample indexes distinguish the different samples sequenced in the same run. Because some sequencer models are prone to crosstalk between samples, it is necessary to add dual sample indexes to each sample.

The full-length P5 strand S9 has the sequence shown in SEQ ID NO:3:

5'-ACACTCTTTCCCTACACGACGCTCTTCCGATCTNNNNNNT-3';

The full-length P7 strand S10 has the sequence shown in SEQ ID NO:4:

5'-NNNNNNGATCGGAAGAGCACACGTCTGAACTCCAGTCAC-3'.

Here "NNNNNN" is the molecular tag sequence, 3 to 12 nucleotides long; for each adapter pair it is a specific sequence rather than a random sequence. Within one adapter pair in the mixture, the molecular tag sequences of the P5 strand and the P7 strand are reverse complements of each other and pair completely when annealed into a duplex molecular tag. The 3' end of the P5 strand carries a thio (phosphorothioate) modification and the 5' end of the P7 strand is phosphorylated.
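As a minimal sketch of how the two strands of one adapter pair follow from a chosen molecular tag, the fixed parts of SEQ ID NO:3 and SEQ ID NO:4 can be combined with the tag and its reverse complement. The constant and function names are hypothetical, and the chemical modifications (3' thio modification, 5' phosphate) are only noted in comments because they are specified at synthesis time rather than in the base sequence.

```python
# Fixed parts taken from SEQ ID NO:3 and SEQ ID NO:4 with the tag positions removed.
P5_PREFIX = "ACACTCTTTCCCTACACGACGCTCTTCCGATCT"
P7_SUFFIX = "GATCGGAAGAGCACACGTCTGAACTCCAGTCAC"

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]

def build_adapter(tag):
    """Return (P5 strand, P7 strand), both written 5' to 3', for one molecular tag."""
    if not 3 <= len(tag) <= 12 or set(tag) - set("ACGT"):
        raise ValueError("tag must be 3 to 12 bases of A, C, G or T")
    p5 = P5_PREFIX + tag + "T"                     # overhanging T; 3' end thio-modified
    p7 = reverse_complement(tag) + P7_SUFFIX       # 5' end phosphorylated
    return p5, p7
```

Calling `build_adapter("ACGGTA")` gives a P5 strand ending in `ACGGTAT` and a P7 strand starting with `TACCGT`.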

The duplex molecular tag is 3 to 12 bp long, so in theory $4^3 + 4^4 + 4^5 + \dots + 4^{11} + 4^{12}$ different duplex molecular tags are possible. Preferably, the set of duplex molecular tags is color-balanced, meaning that every channel detects signal in every sequencing cycle; in other words, at any given position across the duplex molecular tags in the adapter mixture, all four bases are present. Most high-throughput sequencers use two lasers. Some are four-channel instruments that detect the four nucleotides with four different optical channels; others are two-channel instruments in which A and C each carry one fluorophore, T carries both, and G carries none.

Preferably, base complexity within the set of duplex molecular tags is balanced. Viewed vertically, the ratio of the four bases A:T:G:C in each set of molecular tags is close to 1:1:1:1. Viewed horizontally, runs of 4 or more identical bases are avoided within each molecular tag. For two-channel sequencers in particular, runs of 2 or more G bases are avoided at the start of the molecular tag; otherwise, when sequencing reaches these bases, the unbalanced base composition impairs the software's processing of the sequencing signal and the bases cannot be identified accurately.
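The design rules above lend themselves to an automated check of a candidate tag set. The following sketch is an assumption about how such a check might be written, not part of the disclosed method: it reports positions at which one of the four bases is missing, tags containing a run of 4 or more identical bases, and tags that start with GG. It does not test how close the base ratio is to 1:1:1:1.

```python
import re

def check_tag_set(tags):
    """Return a list of human-readable rule violations for a set of molecular tags."""
    problems = []
    # Vertical rule: every position should show all four bases across the set.
    for pos in range(max(len(t) for t in tags)):
        column = [t[pos] for t in tags if len(t) > pos]
        missing = set("ACGT") - set(column)
        if missing:
            problems.append(f"position {pos + 1}: missing {''.join(sorted(missing))}")
    # Horizontal rules: homopolymer runs and a leading GG within each tag.
    for t in tags:
        if re.search(r"(.)\1{3,}", t):
            problems.append(f"{t}: run of 4 or more identical bases")
        if t.startswith("GG"):
            problems.append(f"{t}: starts with GG")
    return problems
```

A set such as `["ACGT", "CGTA", "GTAC", "TACG"]` passes all three checks and returns an empty list.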

Preferably, the edit distance between any two duplex molecular tags in the adapter mixture is not less than 3, that is, the duplex molecular tags differ from each other at 3 or more nucleotide positions, so that at least 3 sequencing errors must occur before one molecular tag is mistaken for another.
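The edit-distance requirement can be verified for a candidate tag set with a standard Levenshtein distance. The sketch below is illustrative only; the function names and the choice of Levenshtein distance (which also counts insertions and deletions, relevant because tags may differ in length) are assumptions.

```python
from itertools import combinations

def edit_distance(a, b):
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution or match
        prev = cur
    return prev[-1]

def too_close_pairs(tags, min_distance=3):
    """Return every pair of tags whose edit distance is below min_distance."""
    return [(a, b) for a, b in combinations(tags, 2)
            if edit_distance(a, b) < min_distance]
```

An empty result from `too_close_pairs` means that every pair of tags in the set is at least 3 edits apart.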

Preferably, the adapter mixture contains no fewer than 8 duplex molecular tags, which gives at least 8 x 8 = 64 combinations in paired-end sequencing. Because the probability that two genomic DNA fragments break at the same start and end positions on the reference genome is very low, this small number of combinations is sufficient to tell whether reads come from the same original template molecule.

Preferably, although the adapters in the mixture are mixed in equal proportions, the ligation reaction has sequence preferences, so the proportions of the molecular tags actually measured are not equal; the mixing proportions of the adapters then need to be readjusted according to the proportions actually measured in the sequencing data.
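The text does not state how the mixing proportions are readjusted. One simple possibility, shown here purely as an assumption, is to scale each adapter's share inversely to its observed read fraction, which presumes that the observed fraction responds roughly linearly to the input share. The function name is hypothetical.

```python
def rebalance(current, observed):
    """Suggest new mixing fractions that push observed tag usage toward equality.

    current:  tag -> fraction of the adapter in the present mixture
    observed: tag -> fraction of reads carrying that tag
    Assumes observed usage scales linearly with the mixing fraction.
    """
    if any(observed[tag] <= 0 for tag in current):
        raise ValueError("every tag needs a positive observed fraction")
    target = 1 / len(current)
    raw = {tag: current[tag] * target / observed[tag] for tag in current}
    total = sum(raw.values())
    return {tag: value / total for tag, value in raw.items()}
```

With two adapters mixed 50:50 and observed at 60:40, the sketch suggests a new mixture of 40:60.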

Preferably, the duplex molecular tags in the adapter mixture are not all the same length. Every molecular tag ends in the single overhanging base T, so if all tags had the same length, the base read in the cycle immediately after the tag would be T in every read; this severe base imbalance would lower sequencing data quality.

The present invention provides a method for detecting low-frequency somatic variants in tumor blood samples using duplex molecular tag coding, comprising the following steps:

(1) Synthesize the P5 strand and the P7 strand of each adapter pair in the adapter mixture according to the adapter features described above, and dilute them to a defined concentration with annealing buffer; mix the P5 strand and the P7 strand at a 1:1 molar ratio and anneal them to form Y-shaped double-stranded adapters;

(2) Mix the annealed adapter pairs in equimolar amounts to form the adapter mixture, and dilute it to the working concentration;

(3) Take a defined amount of cfDNA extracted from a tumor blood sample and ligate it to the adapters carrying duplex molecular tags at a defined ratio to obtain ligation products;

(4) PCR-amplify the ligation products with upstream and downstream amplification primers carrying sample indexes to obtain amplification products;

(5) Perform hybrid capture on the target fragments in the amplification products with probes to obtain a targeted capture library;

(6) Perform high-depth paired-end sequencing on the targeted capture library and split the data by sample according to the sample indexes at both ends;

(7) First perform quality control on the demultiplexed sequencing data of each sample by removing low-quality bases, low-quality reads and contaminating adapter sequences, and at the same time error-correct the overlapping part of each read pair according to base quality, to obtain clean data;

(8) Align the above reads to the reference genome and, at each alignment position, group the read pairs that have the same molecular tag sequence, the same CIGAR string and the same alignment orientation into one read pair family;

(9) For each read pair family, determine the base sequence using Bayes' theorem and generate one SSCS. The prior probability is determined as follows: if the observed base is identical to the candidate true base, the probability is $1 - 10^{-q/10}$; otherwise it is $10^{-q/10}/3$, where q is the base quality value. This distribution is written $p(b, b_i, q_i)$. For each of the 4 possible bases, the posterior probability is calculated according to formula one below. For each base position of the SSCS, the base/quality pairs $(b_i, q_i)$ from the read pair family are used to calculate the probability that the consensus base I is b ($b \in \{A, C, G, T\}$); the base with the highest probability is taken as the true base, which determines the true base at every position. At the same time, according to formula two, the base quality is recalculated from the posterior probability of the true base, giving the error-corrected consensus read pair.

$q_c = -10 \log_{10}\left(1 - P[I = b_c \mid \{(b_i, q_i)\}]\right)$ (formula two)

(10) For each SSCS, find the SSCS whose molecular tag sequence is complementary to it and generate a DSCS, while retaining the SSCSs that cannot form a DSCS. Move the alignment position to the next base and repeat steps (8) to (10);

(11) Realign the final consensus sequences to the reference genome and call variants to obtain an initial variant set;

(12) Annotate the variant set and filter it against specific errors, population databases and coding regions to obtain the final true and reliable low-frequency somatic mutations.

In step (1), the annealing buffer contains Tris, EDTA, NaCl and the like. The annealing condition is either 95 °C for 5 min followed by slow cooling to 25 °C at 0.02 °C/s, or 95 °C for 5 min, after which the PCR instrument is switched off and the reaction is left to stand until it cools to room temperature.

In step (3), the cfDNA extraction kit is the QIAamp Circulating Nucleic Acid Kit (Qiagen); the total amount of cfDNA is 20 ng to 33 ng, that is, 6,000 to 10,000 haploid genome copies; the ratio of adapters to cfDNA fragments is 100:1 to 200:1; the ligation reaction is followed by a purification step using Agencourt AMPure XP magnetic beads (Beckman Coulter).
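The correspondence between input mass and haploid genome copies can be checked with a short calculation. The sketch assumes the commonly used approximation of about 3.3 pg of DNA per haploid human genome; this figure is not stated in the text, and the names are hypothetical.

```python
# Assumption: one haploid human genome weighs about 3.3 pg.
HAPLOID_GENOME_PG = 3.3

def haploid_copies(mass_ng):
    """Approximate number of haploid genome copies in a given mass of human DNA."""
    return mass_ng * 1000 / HAPLOID_GENOME_PG  # 1 ng = 1000 pg
```

Under this assumption, 20 ng corresponds to about 6,000 copies and 33 ng to about 10,000 copies, consistent with the range given above.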

In step (4), the nucleotide sequence of the upstream amplification primer is as shown in SEQ ID NO:1 and that of the downstream amplification primer is as shown in SEQ ID NO:2; 5 to 10 PCR cycles are used, with the cycle number kept as low as possible while still yielding enough amplification product.

In step (5), the probes are biotin-labeled; they may be DNA probes or RNA probes; the probe length is 50 to 120 nt; the total amount of amplification product put into hybrid capture is 500 to 750 ng.

In step (6), the paired-end read length is 2 x 75 bp or 2 x 150 bp; the sequencing depth is 10,000x to 30,000x; the sample indexes at both ends are unique dual indexes (UDI), that is, the index sequences at both ends differ between all samples.
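Splitting data by unique dual indexes can be sketched as below. The sketch assumes exact index matching and a sample sheet in which no index sequence is shared between samples; the function and parameter names are hypothetical, and production demultiplexers typically also tolerate a limited number of mismatches.

```python
def assign_sample(p7_index, p5_index, sample_sheet):
    """Return the sample whose index pair matches at both ends, else None.

    sample_sheet: sample name -> (P7-end index, P5-end index); with unique dual
    indexes no index sequence is shared between samples.
    """
    by_p7 = {pair[0]: name for name, pair in sample_sheet.items()}
    by_p5 = {pair[1]: name for name, pair in sample_sheet.items()}
    s7 = by_p7.get(p7_index)
    s5 = by_p5.get(p5_index)
    # A read whose two indexes point to different samples is discarded, which is
    # how unique dual indexes guard against crosstalk between samples.
    if s7 is not None and s7 == s5:
        return s7
    return None
```

A read carrying the P7-end index of one sample and the P5-end index of another is left unassigned rather than attributed to either sample.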

In step (8), the conventional method requires each read pair family to contain at least 3 read pairs to count as an effective read pair family and be used to generate an SSCS, and only the DSCS formed from two SSCSs with complementary molecular tag sequences is retained for variant detection, so data utilization is low. In the present invention, a read pair family containing 2 read pairs is already an effective read pair family. In addition, a read pair family that contains only 1 read pair but can complement another SSCS to form a DSCS is also retained as valid data, which greatly improves data utilization.
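The grouping and retention rules can be sketched as follows. Several points are assumptions of this sketch rather than statements of the disclosure: read pairs are represented as dictionaries with the keys shown; the tag of a read pair is the ordered pair of tags read at its two ends; the partner strand of a family has the two tags swapped and the opposite orientation; and a single-member family is rescued only when its partner family itself meets the size threshold.

```python
from collections import defaultdict

def group_families(read_pairs):
    """Group read pairs by position, tag pair, CIGAR string and orientation."""
    families = defaultdict(list)
    for rp in read_pairs:
        key = (rp["chrom"], rp["pos"], rp["tag"], rp["cigar"], rp["strand"])
        families[key].append(rp)
    return families

def partner_key(key):
    """Key of the family expected from the complementary strand (assumed model)."""
    chrom, pos, (tag1, tag2), cigar, strand = key
    return (chrom, pos, (tag2, tag1), cigar, "-" if strand == "+" else "+")

def usable_families(families, min_size=2):
    """Keep families of at least min_size, plus singletons with a usable partner."""
    keep = {}
    for key, members in families.items():
        partner = families.get(partner_key(key), [])
        if len(members) >= min_size or len(partner) >= min_size:
            keep[key] = members
    return keep
```

With `min_size=2`, a singleton family whose complementary-strand family has two members is retained, whereas a singleton with no partner is dropped.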

In step (9), when an SSCS is generated for a read pair family, the conventional method determines the base at each position by majority rule: the fraction of each base (A, T, G, C) at the position is calculated, and if the fraction of one base exceeds 70% it is taken as the true base at that position, with the higher of the base quality values among them used as the final base quality. That method is relatively simple. The present invention instead uses Bayes' theorem to calculate the probability that each base is the true base, takes the base with the highest probability as the true base, and calculates the base quality value from this probability, making the bases of the consensus sequence more accurate and reliable.
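For contrast with the Bayesian calculation, the conventional majority rule described above can be sketched as follows. The text does not say what the conventional method reports when no base exceeds 70%; returning N with quality 0 in that case is an assumption of this sketch, as are the function and parameter names.

```python
from collections import Counter

def majority_base(observations, min_fraction=0.7):
    """Conventional majority-rule consensus for one position.

    observations: (base, phred quality) pairs from one read pair family.
    """
    counts = Counter(base for base, _ in observations)
    base, n = counts.most_common(1)[0]
    if n / len(observations) <= min_fraction:
        return "N", 0  # assumption: no call when no base exceeds the threshold
    # The highest quality among the supporting observations becomes the final quality.
    quality = max(q for b, q in observations if b == base)
    return base, quality
```

Unlike the Bayesian approach, this rule ignores the qualities of dissenting observations, so a confident minority base has no influence on the reported quality.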

The main advantages of the present invention include:

The connector of the invention contains a double-stranded molecular label. Therefore, when the technique of the invention is applied to low-frequency mutation detection, the sense and antisense strands of the original template can be used, unlike in circularization tandem-repeat consensus methods and single-stranded molecular label methods, to further correct early amplification errors and single-strand damage errors introduced by hybrid capture;

The double-stranded molecular labels in the connector of the invention are a set of specific nucleotide sequences rather than random nucleotide sequences, so the connector is simple and economical to prepare: only annealing and mixing are required, without the multi-step enzymatic reactions and purification generally required in the prior art;

The double-stranded molecular labels in the connector of the invention are a set of specific nucleotide sequences. A minimum of 8 molecular labels in the connector mixture gives 8x8=64 combinations across the two ends, which is enough to determine whether sequencing reads with the same start and end positions on the reference genome come from the same original template molecule, without needing the 4¹²x4¹²=2.8e14 combinations of the prior art;

The connector of the invention needs no recognition sequence for the molecular label, and the molecular label is shorter than the 12-nucleotide sequence of the prior art, so sequencing with the technique of the invention effectively increases the usable read length and reduces sequencing cost;

The present invention reasonably lowers the read pair family threshold to 2, and also makes use of read pair families that contain only a single read pair but can form a DSCS with another SSCS, which greatly improves the utilization of raw sequencing data;

The present invention uses Bayes' theorem to calculate the probability of each base at each position, chooses the base with the highest probability as the consensus base, and recalculates the base quality value from that probability, which effectively reduces random sequencer errors and errors introduced during PCR amplification.
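The combination counts quoted among the advantages above can be checked with a few lines of arithmetic. The variable names are illustrative only; the inputs (8 labels per end, 12-nucleotide random labels) are the figures given in the text.

```python
# Number of label combinations across the two ends of a fragment.
specific_labels = 8        # minimum number of labels in the connector mixture
random_label_length = 12   # random-label length attributed to the prior art

specific_combinations = specific_labels ** 2
random_combinations = (4 ** random_label_length) ** 2

print(specific_combinations)          # 64
print(f"{random_combinations:.1e}")   # 2.8e+14
```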

It should be understood that, within the scope of the present invention, the above technical features and the technical features specifically described below (e.g. in the embodiments) can be combined with one another to form new or preferred technical solutions. For reasons of space, they are not described one by one here.

The present invention replaces the existing random nucleotide sequences with specific nucleotide sequences and optimizes those specific sequences. Using this technique together with the independently developed bioinformatic analysis algorithm, low-frequency somatic variants in tumor blood samples can be detected more effectively. Specific embodiments of the invention are described in further detail below with reference to the drawings and embodiments. It should be understood that these embodiments only illustrate the invention and are not intended to limit its scope.

Embodiment 1: Preparation of connectors containing 7 bp + 8 bp double-stranded molecular labels

Design of the P5 chain and P7 chain sequences of the connectors with double-stranded molecular labels: the connector mixture is designed to contain 16 molecular labels, listed in Table 1 below. Of the 16 molecular labels, 7 are 7 nucleotides long and the remaining 9 are 8 nucleotides long, so that the two lengths are staggered by 1 nt. Across the first 7 nucleotide positions of the molecular label set, the base ratio at each position (read column-wise) is A:G:T:C=1:1:1:1, and the molecular labels differ from one another at at least 3 nucleotide positions.

Table 1
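The two design constraints stated above (balanced base composition at each position and at least 3 differing positions between any two labels) can be checked programmatically. The sketch below is illustrative only: the four 4-nt labels are toy sequences invented for the example, not the labels of Table 1, and comparing labels of different lengths over their common prefix is an assumption.

```python
from collections import Counter
from itertools import combinations


def prefix_hamming(a: str, b: str) -> int:
    """Number of differing positions over the common prefix of two labels (assumption)."""
    return sum(x != y for x, y in zip(a, b))


def min_pairwise_distance(labels) -> int:
    """Smallest prefix Hamming distance between any two labels of the set."""
    return min(prefix_hamming(a, b) for a, b in combinations(labels, 2))


def columns_balanced(labels, n_positions: int) -> bool:
    """True if A, C, G and T occur equally often at each of the first n_positions."""
    for i in range(n_positions):
        counts = Counter(label[i] for label in labels)
        if len({counts[base] for base in "ACGT"}) != 1:
            return False
    return True


toy_labels = ["ACGT", "CGTA", "GTAC", "TACG"]  # hypothetical, for illustration only
print(columns_balanced(toy_labels, 4))      # True
print(min_pairwise_distance(toy_labels))    # 4
```

For a real 16-label set, `columns_balanced(labels, 7)` and `min_pairwise_distance(labels) >= 3` would express the two conditions from the paragraph above.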

The tubes of synthesized oligonucleotides were centrifuged at 12,000 g for 1 minute to bring the dry powder to the bottom. The caps were opened carefully, the dry powder was dissolved to 250 μM in Low TE Buffer (10 mM Tris-HCl (pH 8.0), 0.1 mM EDTA), and the tubes were mixed by vortexing and placed in a 4 °C refrigerator overnight. 5x Annealing Buffer was prepared according to Table 2.

Table 2

Each pair of connector strands was mixed according to the system in Table 3 below, giving a final connector concentration of 100 μM.

Table 3

The PCR tubes were placed on a GeneAmp 9700 PCR instrument (Applied Biosystems) and incubated at 95 °C for 5 minutes; the PCR instrument was then switched off directly and the tubes were taken out after standing for 1 hour.

1 μL was taken from each tube of annealed product, diluted, and quality-checked on an Agilent 2100 Bioanalyzer. Taking connector #10 as an example, the peak profile of the annealed product is shown in Figure 2. Because the connector is Y-shaped, the peak appears somewhat larger than the expected fragment size.

Equal volumes from the 16 tubes of annealed product were mixed; 1 μL of the mixture was diluted and quality-checked on an Agilent 2100 Bioanalyzer, and the peak profile of the annealed product is shown in Figure 3.

The prepared connectors with double-stranded molecular labels were diluted to working concentration, divided into small aliquots and stored frozen at -20 °C until use, avoiding repeated freeze-thaw cycles.

Embodiment 2: Detection of low-frequency somatic variants in reference standards and tumor blood samples

Preparation of standard DNA: the Horizon Discovery standards HD701 and HD753 were diluted by different factors with DNA from the normal cell line NA18536; the mixtures at the different dilution factors and their expected variant frequencies are shown in Table 4 and Table 5. The DNA mixtures were sheared with a Covaris S220 ultrasonicator to a main peak of about 170 bp, similar to the main peak size of cfDNA.

Table 4

Table 5

Preparation of cfDNA: cfDNA was extracted with the QIAamp Circulating Nucleic Acid Kit (Qiagen) from plasma separated from patient whole blood.

Pre-library preparation: taking the KAPA Hyper Prep Kit (Roche) as an example, 33 ng each of plasma cfDNA and sheared standard DNA was used for pre-library preparation with the kit. After end repair and A-tailing, the connectors prepared in Embodiment 1 were ligated; after magnetic bead purification, PCR amplification was carried out, with the sample labels at both ends introduced by the upstream and downstream primers, whose sequences are shown in SEQ ID NO:1 and SEQ ID NO:2.

Targeted capture library preparation: a large panel of DNA probes synthesized by IDT (Integrated DNA Technologies) is taken as an example; the panel covers all variant sites in Table 5 and Table 6. The procedure was as follows: Human Cot-1 DNA and Adapter Blocker were added to 500 ng of pre-library, which was then hybridized with the DNA probes at 65 °C for 4-16 hours; the hybridization mixture was then added to M270 magnetic beads and incubated at 65 °C for 45 minutes so that the streptavidin-coated M270 beads bound the biotin-labeled probes fully; the beads were then washed several times with wash buffers of different ionic strengths and temperatures to remove non-target fragments not bound to probes. The target fragments captured by the DNA probes were amplified by PCR and purified with magnetic beads to give the finished targeted capture library.

After passing quality inspection, the targeted capture library was sequenced at 2x75 bp or 2x150 bp on an Illumina sequencer; the sequencing depth of the raw data was 10,000x-30,000x, and the data were split by sample using the sample labels at both ends.

For the sequencing data, fastp was first used to remove low-quality bases, contaminating connector sequences and low-quality reads. Where R1 and R2 overlapped, the overlapping part was error-corrected according to base quality. Quality control metrics such as total data volume, mapping rate, on-target rate and coverage depth were computed with a C++ program.
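The idea of correcting the R1/R2 overlap by base quality can be illustrated with a deliberately simplified sketch. The rule used here (at a mismatch, both reads take the base with the higher quality; ties are left unchanged) is an assumption made for illustration; fastp applies its own criteria, which are not reproduced here. The function name and input layout are hypothetical.

```python
def correct_overlap(seq1, qual1, seq2, qual2):
    """Correct mismatches in an R1/R2 overlap using base quality.

    seq1, seq2: the overlapping segments, already in the same orientation
    (R2 reverse-complemented) and of equal length.
    qual1, qual2: the matching lists of Phred quality scores.
    """
    if not (len(seq1) == len(seq2) == len(qual1) == len(qual2)):
        raise ValueError("overlap segments and qualities must have equal length")
    out1, out2 = list(seq1), list(seq2)
    for i, (a, b) in enumerate(zip(seq1, seq2)):
        if a == b:
            continue
        if qual1[i] > qual2[i]:
            out2[i] = a   # trust R1 at this position
        elif qual2[i] > qual1[i]:
            out1[i] = b   # trust R2 at this position
        # equal qualities: leave both bases unchanged
    return "".join(out1), "".join(out2)


print(correct_overlap("ACGT", [30, 30, 30, 30], "ACTT", [30, 30, 10, 30]))
```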

Reads were first aligned using BWA, and read pair families were determined from the alignment position, molecular label sequence, CIGAR string and alignment direction. The probability of each base at each position of the consensus sequence was calculated according to Bayes' theorem to determine the true sequence, and a DSCS was then generated from two SSCSs with complementary molecular label sequences.

The error-corrected sequences were re-aligned, variants were called and the results annotated, and the final variant set was obtained after filtering.

HD701 Mix1 and Mix2 were used to calculate detection sensitivity (PPA) and specificity (PPV) at LOD (limit of detection) values of 0.001, 0.002 and 0.005; the results are shown in Table 6. M1 denotes Mix1 and M2 denotes Mix2. The number after the sample name is the limit of detection; for example, HD701M1_0.001 means LOD=0.001. TP is the abbreviation of true positive, ignore is the number of sites whose variant frequency is below the limit of detection, FP is the abbreviation of false positive, and FN is the abbreviation of false negative. The method shows good sensitivity and specificity.
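The grouping and merging steps just described can be sketched as below. The `ReadPair` record and the helper names are hypothetical and stand in for whatever the actual pipeline reads from the BWA alignments; the merge rule (keep a base when both SSCSs agree, otherwise write N) is the one given for DSCS generation in the claims.

```python
from collections import defaultdict
from typing import NamedTuple


class ReadPair(NamedTuple):
    """Hypothetical minimal record for one aligned read pair."""
    position: int   # alignment start on the reference
    label: str      # molecular label sequence
    cigar: str      # CIGAR string
    strand: str     # alignment direction, "+" or "-"
    name: str       # read pair identifier


def group_families(pairs):
    """Group read pairs sharing position, molecular label, CIGAR and direction."""
    families = defaultdict(list)
    for pair in pairs:
        families[(pair.position, pair.label, pair.cigar, pair.strand)].append(pair)
    return dict(families)


def merge_to_dscs(sscs_a: str, sscs_b: str) -> str:
    """Merge two complementary-label SSCSs: keep agreeing bases, otherwise write N."""
    if len(sscs_a) != len(sscs_b):
        raise ValueError("SSCS sequences must have equal length")
    return "".join(a if a == b else "N" for a, b in zip(sscs_a, sscs_b))


pairs = [
    ReadPair(100, "ACG", "76M", "+", "r1"),
    ReadPair(100, "ACG", "76M", "+", "r2"),
    ReadPair(100, "TTG", "76M", "+", "r3"),
]
families = group_families(pairs)
print({key: len(members) for key, members in families.items()})
print(merge_to_dscs("ACGT", "ACTT"))  # ACNT
```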

Table 6

sample	totalSNP	TP	ignore	FP	FN	PPA	PPV
HD701M1_0.001	251	247	0	14	4	98.41%	94.64%
HD701M2_0.001	251	247	0	9	4	98.41%	96.48%
HD701M1_0.002	251	247	0	9	4	98.41%	96.48%
HD701M2_0.002	251	246	1	5	4	98.40%	98.01%
HD701M1_0.005	251	244	4	6	3	98.79%	97.60%
HD701M2_0.005	251	239	9	5	3	98.76%	97.95%
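The PPA and PPV columns of Table 6 are consistent with PPA = TP/(TP+FN) and PPV = TP/(TP+FP), with the ignored sites left out of the PPA denominator. The short check below recomputes them from the counts in the table; the function names are illustrative.

```python
def ppa(tp: int, fn: int) -> float:
    """Positive percent agreement (reported as sensitivity), in percent."""
    return 100 * tp / (tp + fn)


def ppv(tp: int, fp: int) -> float:
    """Positive predictive value (reported as specificity in the text), in percent."""
    return 100 * tp / (tp + fp)


# (TP, ignore, FP, FN) copied from Table 6.
table_6 = {
    "HD701M1_0.001": (247, 0, 14, 4),
    "HD701M2_0.001": (247, 0, 9, 4),
    "HD701M1_0.002": (247, 0, 9, 4),
    "HD701M2_0.002": (246, 1, 5, 4),
    "HD701M1_0.005": (244, 4, 6, 3),
    "HD701M2_0.005": (239, 9, 5, 3),
}

for sample, (tp, ignored, fp, fn) in table_6.items():
    assert tp + ignored + fn == 251  # every site is TP, FN or ignored
    print(f"{sample}\tPPA={ppa(tp, fn):.2f}%\tPPV={ppv(tp, fp):.2f}%")
```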

By lowering the read pair family size threshold and retaining read pair families that contain a single read pair but can form a DSCS with another SSCS, this method greatly improves data utilization. As shown in Table 7, compared with the original duplex method (which needs at least three read pairs to form a read pair family and retains only DSCS sequences), this method increases the amount of valid data available for variant detection (6.526 G), and both coverage depth (1954.61) and sensitivity are improved. Compared with using only a single-ended UMI (70.06%), the specificity of this method is substantially higher (94.64%), while good detection sensitivity (98.41%) is still achieved, demonstrating the detection advantage of this method.

Table 7

method	data_size(G)	reads	ontarget	mean_cov	PPA	PPV
raw data	52.579	350,531,218	-	-	-	-
Original duplex	1.104	8,095,816	94.3239	383.88	95.62%	98.36%
sinotools	6.526	48,896,444	84.5106	1954.61	98.41%	94.64%
single	7.974	59,591,152	85.8695	2448.31	98.80%	70.06%
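The gain in data utilization can be quantified directly from Table 7. The calculation below uses only the figures in the table; the row labeled sinotools carries the values that the text attributes to this method, and the variable names are illustrative.

```python
raw_gb = 52.579  # raw data size from Table 7, in G

# data_size(G) and mean_cov copied from Table 7.
rows = {
    "Original duplex": {"data_gb": 1.104, "mean_cov": 383.88},
    "sinotools": {"data_gb": 6.526, "mean_cov": 1954.61},
}

# Share of the raw data that remains usable for variant detection, in percent.
retained_pct = {name: 100 * row["data_gb"] / raw_gb for name, row in rows.items()}

# Fold change of this method over the original duplex method.
fold_data = rows["sinotools"]["data_gb"] / rows["Original duplex"]["data_gb"]
fold_cov = rows["sinotools"]["mean_cov"] / rows["Original duplex"]["mean_cov"]

for name, pct in retained_pct.items():
    print(f"{name}: {pct:.1f}% of raw data retained")
print(f"valid data: {fold_data:.2f}x, mean coverage: {fold_cov:.2f}x")
```

On these figures the method retains about 12.4% of the raw data against about 2.1% for the original duplex method, i.e. roughly 5.9 times more valid data and 5.1 times the mean coverage.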

Six low-frequency sites among the detected mutations were selected for ddPCR verification. As shown in Table 8, 5 of them gave positive results, with relatively consistent frequencies. For the remaining site, the variant frequency was at the ddPCR limit of detection; ddPCR detected a weak positive signal that could not be confirmed, so a negative result was reported. This verification shows that the method is highly consistent with ddPCR and can accurately detect low-frequency variants.

Table 8

Gene	amino acid	This method	ddPCR
EGFR	p.T790M	0.3%	0.33%
EGFR	p.T790M	2.3%	1.82%
EGFR	p.T790M	0.1%	-
EGFR	p.L858R	0.3%	0.89%
EGFR	p.L858R	2.3%	4.60%
KRAS	p.G12D	2%	1.80%

A downsampling experiment was carried out to simulate how detection sensitivity and specificity change with sequencing depth. At LOD=0.005, detection sensitivity reaches its optimal level once the sequencing depth reaches 1300X; at LOD=0.001 or LOD=0.002, it reaches its optimal level once the sequencing depth reaches 1800X.

In this specification, the invention has been described with reference to specific embodiments. It is evident, however, that various modifications and changes can still be made without departing from the spirit and scope of the invention. Accordingly, the description and drawings are to be regarded as illustrative rather than restrictive.
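The text does not say how the downsampling was performed. One common approach, shown here purely as an assumed sketch, is to keep each read pair at random with probability equal to the ratio of target depth to current depth; the function name, its parameters and the fixed seed are hypothetical.

```python
import random


def downsample(read_pair_ids, current_depth: float, target_depth: float, seed: int = 0):
    """Randomly keep read pairs so that the expected depth drops to target_depth.

    Each read pair is kept independently with probability target/current;
    if the target is not below the current depth, everything is kept.
    """
    fraction = min(1.0, target_depth / current_depth)
    rng = random.Random(seed)  # fixed seed so a simulation run can be repeated
    return [rid for rid in read_pair_ids if rng.random() < fraction]


ids = list(range(10000))
subset = downsample(ids, current_depth=2600, target_depth=1300, seed=1)
print(len(subset) / len(ids))  # close to 0.5
```

Running the variant-calling pipeline on subsets drawn at several target depths would then give the sensitivity and specificity curves that the experiment describes.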

Sequence listing

<110>Shanghai Jing Zhou Gene Tech. Company Limited

<120> Connector and connector mixture for detecting low-frequency variation, and corresponding method

<141> 2018-12-27

<160> 4

<170> SIPOSequenceListing 1.0

<210> 1

<211> 57

<212> DNA

<213> Artificial Sequence

<400> 1

aatgatacgg cgaccaccga gatctacacn nnnnnnnaca ctctttccct acacgac 57

<210> 2

<211> 53

<212> DNA

<213> Artificial Sequence

<400> 2

caagcagaag acggcatacg agatnnnnnn nngtgactgg agttcagacg tgt 53

<210> 3

<211> 40

<212> DNA

<213> Artificial Sequence

<400> 3

acactctttc cctacacgac gctcttccga tctnnnnnnt 40

<210> 4

<211> 39

<212> DNA

<213> Artificial Sequence

<400> 4

nnnnnngatc ggaagagcac acgtctgaac tccagtcac 39
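The 40-nt and 39-nt sequences above show the n positions where the molecular label sits in the two connector strands. The sketch below fills those positions for one label: the 40-nt strand receives the label followed by the overhanging T, and the 39-nt strand receives the reverse complement of the label. The constant names and the example label ACGTCA are hypothetical; the fixed parts are copied from the listing.

```python
# Fixed parts copied from the 40-nt and 39-nt sequences in the listing (upper-cased).
P5_FIXED = "ACACTCTTTCCCTACACGACGCTCTTCCGATCT"
P7_FIXED = "GATCGGAAGAGCACACGTCTGAACTCCAGTCAC"

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def build_connector(label: str):
    """Return (P5 chain, P7 chain) for one molecular label.

    P5 chain: fixed part + label + 1 overhanging T.
    P7 chain: reverse complement of the label + fixed part.
    """
    p5_chain = P5_FIXED + label + "T"
    p7_chain = reverse_complement(label) + P7_FIXED
    return p5_chain, p7_chain


p5_chain, p7_chain = build_connector("ACGTCA")  # hypothetical 6-nt label
print(p5_chain, len(p5_chain))  # length 40
print(p7_chain, len(p7_chain))  # length 39
```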

Claims

1. A connector for detecting low-frequency variation, characterized in that the connector comprises two complementary DNA single strands, wherein one strand, the P5 chain, comprises in order from the 5' end to the 3' end: a sequence that partially overlaps the upstream amplification primer and a sequence that binds the upstream sequencing primer, a molecular label drawn from a set of specific nucleotide sequences, and 1 overhanging base T; and the other strand, the P7 chain, comprises in order from the 5' end to the 3' end three parts: a molecular label that is the reverse complement of the molecular label in the P5 chain, a sequence that binds the downstream sequencing primer, and a sequence that binds the downstream amplification primer.

2. The connector for detecting low-frequency variation according to claim 1, characterized in that, in the P5 chain, the sequence that partially overlaps the upstream amplification primer and the sequence that binds the upstream sequencing primer may partially overlap each other, and the 3' end carries a phosphorothioate modification; in the P7 chain, the sequence that binds the downstream sequencing primer and the sequence that binds the downstream amplification primer may partially or even completely overlap each other, and the 5' end is phosphorylated.

3. The connector for detecting low-frequency variation according to claim 1, characterized in that the upstream amplification primer and the downstream amplification primer both comprise a sample label.

4. The connector for detecting low-frequency variation according to claim 1, characterized in that the connector is a truncated Y-shaped connector, and the length of the molecular label is 3~12 bp.

5. The connector for detecting low-frequency variation according to claim 1, characterized in that the P5 chain has the nucleotide sequence shown in SEQ ID NO:3, the P7 chain has the nucleotide sequence shown in SEQ ID NO:4, the upstream amplification primer has the nucleotide sequence shown in SEQ ID NO:1, and the downstream amplification primer has the nucleotide sequence shown in SEQ ID NO:2.

6. A connector mixture, characterized in that the connector mixture comprises at least 8 kinds of the connector according to claim 1, mixed in proportion.

7. The connector mixture according to claim 6, characterized in that all four bases are present at each position (read column-wise) across the double-stranded molecular labels in the connector mixture; preferably, column-wise, the ratio of the four bases A:T:G:C across the molecular labels is close to 1:1:1:1; row-wise, runs of 4 or more consecutive identical bases are avoided within each molecular label; and preferably, 2 or more consecutive G bases are avoided at the start of a molecular label.

8. The connector mixture according to claim 6, characterized in that the double-stranded molecular labels of the connectors in the connector mixture differ from one another at 3 or more nucleotide positions; preferably, the lengths of the double-stranded molecular labels of the connectors in the connector mixture are not all identical; preferably, the connectors in the connector mixture are mixed in equal proportions, or the mixing proportions of the connectors are readjusted according to the proportions actually measured in sequencing data.

9. A method for detecting low-frequency somatic variation, characterized by comprising the following steps:

(1) separately synthesizing the P5 chain and the P7 chain of each connector in the connector mixture according to any one of claims 6 to 8, annealing them to form Y-shaped connectors, and mixing these in proportion to form the connector mixture;

(2) ligating the connector mixture carrying double-stranded molecular labels to a cell-free DNA fragment sample to obtain ligation products, and amplifying the ligation products by PCR with upstream and downstream amplification primers carrying sample labels to obtain amplification products;

(3) performing hybrid capture on the target fragments in the amplification products to obtain a targeted capture library, performing high-depth paired-end sequencing of the targeted capture library, and splitting the data by sample according to the sample labels at both ends;

(4) performing quality control on the sequencing data by removing low-quality bases, low-quality reads and contaminating connector sequences, and error-correcting the overlapping part of each read pair according to base quality, to obtain clean data;

(5) aligning the reads to the reference genome and, at each alignment position, grouping the read pairs that have the same molecular label sequence, the same CIGAR string and the same alignment direction into one read pair family;

(6) for each read pair family, calculating and determining the single-strand consensus sequence (SSCS) according to Bayes' theorem and recalculating the base quality values, thereby reducing sequencing errors;

(7) for each SSCS generated, finding the SSCS whose molecular label sequence is complementary and generating the double-strand consensus sequence (DSCS), while retaining the SSCSs that cannot form a DSCS; then moving the alignment position to the next base and repeating steps (5)~(7);

(8) aligning the final consensus sequences to the reference genome, performing variant detection to obtain an initial variant set, annotating this variant set, and filtering by alignment errors, population databases and coding regions to obtain the final, reliable low-frequency somatic mutations.

10. The method for detecting low-frequency somatic variation according to claim 9, characterized in that, in step (1), the conditions of the annealing reaction are: after 95 °C for 5 min, cooling slowly to 25 °C at a rate of 0.02 °C/sec; or, after 95 °C for 5 min, switching off the PCR instrument and leaving it to stand until the temperature falls to room temperature;

Preferably, in step (3), the sequencing depth is 10,000x-30,000x, and the sample labels at both ends are UDIs, whose sample-label sequences at both ends all differ between samples;

Preferably, in step (5), the read pair family size threshold is lowered to 2 so that more read pair families are generated, and read pair families that contain only one read pair but can form a DSCS with another SSCS are also used to generate DSCSs; in the final output fastq file, both SSCS and DSCS sequences are retained together with their base quality values;

Preferably, in step (6), for each read pair family at the same alignment position, only read pair families containing 2 or more read pairs are further used to generate an SSCS. According to Bayes' theorem, the prior probability is determined as follows: if the observed base is identical to the candidate true base, the prior probability is $1-10^{-q/10}$; otherwise it is $10^{-q/10}/3$, where q is the base quality value; this distribution is denoted $p(b, b_i, q_i)$. For each of the 4 possible bases, the posterior probability is calculated according to Formula 1 below: for each base position of the SSCS, the base/quality value pairs $(b_i, q_i)$ of the read pair family are used to calculate the probability that the consensus base I is b ($b \in \{A, C, G, T\}$); the base with the highest probability is the true base, which determines the true base at each position. The base quality is then recalculated from the posterior probability of the true base according to Formula 2 below, giving the error-corrected consensus reads;

$q_c = -10\log_{10}\left(1 - P\left[I = b_c \mid \{(b_i, q_i)\}\right]\right)$ (Formula 2);

Preferably, in step (7), when two SSCSs generate a DSCS, the base is retained if the bases at the corresponding position are identical; otherwise the base at that position is changed to N.