CN106021986B - Consensus sequence degeneracy algorithm for ultra-low-frequency mutation molecules

Info

Publication number: CN106021986B
Application number: CN201610348484.8A
Authority: CN
Inventors: 曾华萍; 宋卓; 袁梦兮
Original assignee: Human And Future Biotechnology (changsha) Co Ltd
Current assignee: Human And Future Biotechnology (changsha) Co Ltd
Priority date: 2016-05-24
Filing date: 2016-05-24
Publication date: 2019-04-09
Anticipated expiration: 2036-05-24
Also published as: CN106021986A

Abstract

The invention discloses a method for determining the consensus sequence of a sequencing read group. The method comprises: (1) subjecting the sequencing read groups to a first filtering based on their basic alignment status, to obtain the sequencing read groups that pass the first filtering; (2) for each sequencing read group that passes the first filtering, determining the consensus sequence of the group; and (3) for each sequencing read group that passes the first filtering, determining the shared tag sequence of the group by steps similar to those of (2). The method effectively determines the consensus sequence of a read group obtained by repeatedly sequencing the same DNA molecule, thereby enabling accurate quantification of DNA molecules and mutations while eliminating the effect of sequencing errors and the like on the result and ensuring its accuracy.

Description

Consensus sequence degeneracy algorithm for ultra-low-frequency mutation molecules

Technical field

The present invention relates to the field of sequencing technology, in particular to a consensus sequence degeneracy algorithm for ultra-low-frequency mutation molecules, and more specifically to a method for determining the consensus sequence of a sequencing read group.

Background art

With the rapid development of second-generation sequencing and the falling cost of sequencing, second-generation sequencing is being used ever more widely in detection and research across many fields. Compared with whole-genome sequencing, target-region sequencing greatly reduces both sequencing cost and data complexity, so that the target regions of interest reach very high sequencing coverage at lower cost. This makes it possible to detect low-frequency mutations in cancer.

Among target-region sequencing approaches, PCR amplification of the target region with specific primers is widely used because it is simple and fast to perform and requires only a small amount of DNA. However, primer-amplification sequencing inevitably suffers from severe amplification bias, as well as from the various errors introduced by amplification and sequencing. On the one hand, these problems directly affect the accuracy of quantification, because read counts in the sequencing data no longer represent the number of original DNA fragments; on the other hand, they affect the accuracy of the analysis by introducing a large number of false positives. In tumor mutation research these problems are especially prominent, because tumors are highly heterogeneous and carry many low-frequency mutations.

Thus, current primer-amplification sequencing still leaves room for improvement.

Summary of the invention

The present invention aims to solve at least one of the technical problems existing in the prior art. To this end, one object of the invention is to propose a method for determining the consensus sequence of a sequencing read group, so as to quantify DNA molecules and mutations accurately while eliminating the effect of sequencing errors and the like on the result and ensuring its accuracy.

It should be noted that the present invention was completed on the basis of the following work by the inventors:

To address the above problems of primer-amplification sequencing, researchers have introduced molecular tags: a unique tag sequence that can represent a given DNA molecule is attached to the original DNA molecule. Different DNA molecules carry different molecular tags, so each DNA molecule can be identified accurately by its molecular tag sequence. Molecular tags allow DNA molecules and mutations to be quantified accurately, and can also reduce or even eliminate errors caused by amplification, sequencing and the like.

For second-generation sequencing data with molecular tags, data processing requires grouping the reads by their molecular tags: reads with the same start and end positions and the same molecular tag are placed in one group, which is taken to be the set of duplicates generated from a single DNA fragment by PCR amplification. For each group, the final consensus sequence (herein also called the "shared sequence") is then determined; it is the sequence of the original DNA molecule corresponding to that group. Finally, these consensus sequences are used for downstream analyses such as mutation detection.

However, because the tagged molecular templates are PCR-amplified in the experiment, each molecular template yields a group of identical daughter molecules; errors are then inevitably introduced during the experiment and sequencing, so the final fastq data consist of molecular templates sequenced many times over, with the copies carrying a small number of errors. In view of this, the inventors set out to cluster reads originating from the same molecular template into groups, based on the molecular tag and the sequence of the read itself (its alignment position on the genome) and with sequencing errors taken into account, thereby obtaining sequencing read groups; and then, for each clustered sequencing read group, to obtain its consensus sequence.

Accordingly, in a first aspect, the present invention provides a method for determining the consensus sequence of a sequencing read group. According to embodiments of the invention, the method comprises:

(1) subjecting the sequencing read groups to a first filtering based on their basic alignment status, to obtain the sequencing read groups that pass the first filtering, the criteria of the first filtering being as follows:

(a) excluding sequencing read groups whose two ends align to different chromosomes of the reference sequence;

(b) excluding sequencing read groups whose insert size falls outside a preset range; and

(c) excluding sequencing read groups in which the start position of the sequencing reads differs from the start position of the amplification primer;

(2) for each sequencing read group that passes the first filtering, determining the consensus sequence of the sequencing read group by the following steps:

(i) at a given position, traversing every sequencing read in the group and counting the depth of each of the four bases A, T, C and G;

(ii) selecting the base with a significant depth advantage as the base type at that position, and deriving the quality value at that position from, among other things, the depth of that base type;

(iii) repeating steps (i) and (ii) for all positions to determine the consensus sequence;

(3) for each sequencing read group that passes the first filtering, determining the shared tag sequence of the sequencing read group by steps similar to those of (2):

(A) traversing the tag sequence of every sequencing read in the group and counting the depth of each tag sequence; and

(B) selecting the tag sequence with a significant depth advantage as the shared tag sequence of the sequencing read group.

The method of the invention effectively determines the consensus sequence of a sequencing read group obtained by repeatedly sequencing the same DNA molecule, thereby enabling accurate quantification of DNA molecules and mutations while eliminating the effect of sequencing errors and the like on the result and ensuring its accuracy.

According to embodiments of the invention, the sequencing read groups are obtained by clustering the sequencing reads produced by sequencing; each sequencing read group is a group of reads with similar sequences and similar molecular tags, which are very likely multiple copies (i.e. duplicates) generated from the same molecular template by amplification.

According to embodiments of the invention, "similar sequences" means sequences that align to the same start and end positions on the genome.

According to some specific examples of the invention, the preset range in (b) is 30 to 400 bp.

According to embodiments of the invention, (ii) further comprises:

(A') at the given position, sorting the four bases A, T, C and G by depth to obtain the maximum depth and the second depth, together with their corresponding base types;

(B') based on the maximum depth and the second depth, determining the consensus base type at that position and the corresponding quality value.

According to embodiments of the invention, (B') comprises:

determining a parameter C, where C = (maximum depth - second depth) / maximum depth;

if C is not less than a specified threshold, selecting the base with the maximum depth as the consensus base type at that position, with quality value Q = 20 + (max*C^2)/2, where max is the maximum depth and Q is set to 40 when Q > 40; if C is less than the specified threshold, determining that the consensus base at that position is the uncertain base N, with quality value Q = 2.

According to some specific examples of the invention, the specified threshold is 0.65.
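As a worked illustration of the parameter C and the quality value Q with the threshold 0.65, consider three positions. The depth values are made up for the example and do not come from the patent.

```latex
\begin{align*}
\text{depths } A{=}18,\ G{=}2: \quad & C = \frac{18-2}{18} \approx 0.889 \ge 0.65, \quad
  Q = 20 + \frac{18 \times 0.889^{2}}{2} \approx 27.1 \ \Rightarrow\ \text{base } A \\
\text{depths } A{=}6,\ G{=}4: \quad & C = \frac{6-4}{6} \approx 0.333 < 0.65
  \ \Rightarrow\ \text{base } N,\ Q = 2 \\
\text{depth } A{=}50 \text{ only}: \quad & C = \frac{50-0}{50} = 1, \quad
  Q = 20 + \frac{50 \times 1^{2}}{2} = 45 > 40 \ \Rightarrow\ Q = 40
\end{align*}
```

The third case shows why the cap matters: a deep, unanimous position would otherwise receive a quality value above 40.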

According to embodiments of the invention, if the number of uncertain bases N in the consensus sequence exceeds 5, determination of the consensus sequence fails and the sequencing read group is filtered out.
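The per-position call and the check on the number of N bases might be combined as in the following minimal Python sketch. The function name `consensus_for_group`, the representation of reads as equal-length strings, and the choice to return `None` for a filtered group are illustrative assumptions; the threshold 0.65, the quality formula and the limit of 5 uncertain bases are taken from the description.

```python
from collections import Counter

def consensus_for_group(reads, threshold=0.65, max_n=5):
    """Return (consensus, qualities) for one read group, or None if filtered.

    `reads` is a list of equal-length base strings (an assumption made for
    this sketch; real reads would first be placed on common coordinates).
    """
    consensus, qualities = [], []
    for column in zip(*reads):                      # one column per position
        counts = Counter(b for b in column if b in "ATCG")
        depths = sorted(counts.values(), reverse=True)
        max_depth = depths[0] if depths else 0
        sec_depth = depths[1] if len(depths) > 1 else 0
        c = (max_depth - sec_depth) / max_depth if max_depth else 0.0
        if max_depth and c >= threshold:
            base = counts.most_common(1)[0][0]      # base with the maximum depth
            q = min(20 + (max_depth * c ** 2) / 2, 40)
        else:
            base, q = "N", 2                        # uncertain base
        consensus.append(base)
        qualities.append(q)
    if consensus.count("N") > max_n:                # more than 5 N: filter the group
        return None
    return "".join(consensus), qualities
```

A call such as `consensus_for_group(["ACGT"] * 9 + ["ACGA"])` keeps the majority base at the last position, because the single discordant read leaves C well above the threshold.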

According to embodiments of the invention, (B) further comprises determining the shared tag sequence by the following steps:

(A'') sorting the tag sequences by depth to obtain the maximum depth and the second depth;

(B'') determining a parameter C, where C = (maximum depth - second depth) / maximum depth;

if C is not less than a specified threshold, selecting the tag sequence with the maximum depth as the shared tag sequence of the sequencing read group; if C is less than the specified threshold, determination of the shared tag sequence fails and the sequencing read group is filtered out. In this way, either the shared tag sequence is determined effectively or the sequencing read group is filtered out.
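The tag decision can be sketched in the same way as the base call. In this hypothetical Python helper, `shared_tag` and its `None` return value for a failed group are assumptions made for illustration; the default threshold of 0.65 assumes the same value is used for tags as for bases, which the text suggests but does not state outright.

```python
from collections import Counter

def shared_tag(tags, threshold=0.65):
    """Return the shared tag of a read group, or None if it cannot be determined.

    `tags` holds one tag sequence per read in the group.
    """
    ranked = Counter(tags).most_common()        # tags sorted by depth, highest first
    if not ranked:
        return None                             # empty group: nothing to decide
    top_tag, max_depth = ranked[0]
    sec_depth = ranked[1][1] if len(ranked) > 1 else 0
    c = (max_depth - sec_depth) / max_depth
    return top_tag if c >= threshold else None  # None means: filter the group
```

With nine reads carrying one tag and one read carrying a variant, C is 8/9 and the majority tag is accepted; with a 5 to 4 split, C is 0.2 and the group is rejected.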

Additional aspects and advantages of the invention will be set forth in part in the following description, and in part will become apparent from that description or be learned through practice of the invention.

Brief description of the drawings

The above and/or additional aspects and advantages of the invention will become apparent and readily understood from the following description of embodiments taken in conjunction with the accompanying drawing, in which:

Fig. 1 shows a flow diagram of the method for determining the consensus sequence of a sequencing read group according to an embodiment of the present invention.

Detailed description of the embodiments

Embodiments of the invention are described in detail below. The embodiments described below are exemplary; they serve only to explain the invention and are not to be construed as limiting it.

Method for determining the consensus sequence of a sequencing read group

In a first aspect, the present invention provides a method for determining the consensus sequence of a sequencing read group. According to embodiments of the invention, the method comprises steps (1) to (3) as set out in the Summary of the invention above.

The method of the invention thus effectively determines the consensus sequence of a sequencing read group obtained by repeatedly sequencing the same DNA molecule, thereby enabling accurate quantification of DNA molecules and mutations while eliminating the effect of sequencing errors and the like on the result and ensuring its accuracy.

According to embodiments of the invention, the sequencing read groups are obtained by clustering the sequencing reads produced by sequencing; each sequencing read group is a group of reads with similar sequences and similar molecular tags, generated from the same molecular template by amplification and repeated sequencing.

According to embodiments of the invention, the preset range can be determined according to the actual target sequencing region. According to some specific examples of the invention, the preset range in (b) is 30 to 400 bp.

According to embodiments of the invention, (ii) further comprises sub-steps (A') and (B'), and (B') comprises determining the parameter C, as set out in the Summary of the invention above.

According to embodiments of the invention, the specified threshold can be set according to practical needs. According to some specific examples of the invention, the specified threshold is 0.65.

According to other embodiments of the invention, and with reference to Fig. 1, the method of the invention for determining the consensus sequence of a sequencing read group comprises the following steps:

1. Filtering

After the sequencing reads have been clustered into sequencing read groups (read groups), these groups are filtered according to the following conditions:

a) Filter out read groups whose two ends align to different chromosomes;

b) Filter out read groups with an insert size < 30 bp or > 400 bp;

Since cfDNA fragments are mainly around 166 bp and 330 bp in size, the insert size should not exceed 400 bp; and since amplification primers are generally longer than 20 bp, the insert size should be no less than 30 bp.

c) Filter out read groups in which the start position of the reads is not at the start position of an amplification primer;

Since the reads are amplification products of the amplification primers, the start position of a read should be the start position of a primer.
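The three filter conditions above can be sketched as a single predicate. The `ReadGroup` fields and the `primer_starts` set of (chromosome, position) pairs are hypothetical data structures introduced only for this illustration; the 30 to 400 bp bounds come from the text.

```python
from dataclasses import dataclass

@dataclass
class ReadGroup:
    chrom_r1: str      # chromosome the first end aligns to
    chrom_r2: str      # chromosome the second end aligns to
    insert_size: int   # insert size in bp
    start: int         # start position of the reads in the group

def passes_first_filter(group, primer_starts, min_insert=30, max_insert=400):
    """Return True if the read group survives conditions a), b) and c)."""
    if group.chrom_r1 != group.chrom_r2:                      # a) ends on different chromosomes
        return False
    if not (min_insert <= group.insert_size <= max_insert):   # b) insert size out of range
        return False
    if (group.chrom_r1, group.start) not in primer_starts:    # c) start is not a primer start
        return False
    return True
```

Keeping the primer start positions in a set makes condition c) a constant-time lookup, which matters when many read groups are checked against a fixed primer panel.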

2. Determining the consensus sequence

Basic principle:

The reads in each sequencing read group are generated from the same molecular template, so in principle the reads within a group should have the same sequence and the same barcode. However, because errors are unavoidable in the experiment and during sequencing, the reads in a group contain some errors. Determining the consensus sequence is precisely the process of removing these errors to recover the true sequence of the molecular template.

Processing steps:

a) For each position of the read, perform the following operations:

i. Count the depth of each of the four bases A, T, C and G;

ii. Sort the depths of the four bases from high to low to obtain max, sec, third and fourth;

iii. Compute the coefficient C = (max-sec)/max. If C >= 0.65, the base with depth max is taken as the consensus base at this position, with quality Q = 20 + (max*C^2)/2, set to 40 when Q > 40. If C < 0.65, the base at this position is considered uncertain: the consensus sequence has N at this position, with quality value Q = 2.

After these operations have been applied to every base of the read, the consensus sequence of the group and the corresponding quality values are obtained; the consensus sequence may, however, contain some uncertain bases, written N.

b) If the number of uncertain bases in the whole read exceeds 5, filter out the group; otherwise proceed to the check in step c);

c) Count the depth of each barcode (molecular tag) in the group and, by the same method as above, decide whether the barcode of the group can be determined. If not, filter out the group; if so, retain the group, at which point the final consensus sequence, its quality values and its barcode sequence have all been obtained.

Sequencing read groups

As noted above, the sequencing read groups of the invention are obtained by clustering sequencing reads, each of which carries a tag sequence. For ease of understanding, one method of clustering sequencing reads is set out below.

According to embodiments of the invention, sequencing reads can be clustered into sequencing read groups by the following steps:

(1) aligning the sequencing reads to a reference sequence, determining the positions of both ends of each sequencing read, and assigning sequencing reads whose end positions coincide to the same first-level group;

(2) further dividing the sequencing reads of each first-level group into second-level groups according to their tag sequences, so that sequencing reads with similar molecular tag sequences fall into the same second-level group.
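Step (1) amounts to bucketing reads by their alignment coordinates. A minimal sketch, assuming each aligned read is available as a dict with the hypothetical keys `chrom`, `start`, `end` and `tag` (the patent does not prescribe any particular data structure):

```python
from collections import defaultdict

def level_one_groups(reads):
    """Group aligned reads by (chromosome, start, end).

    Reads whose two end positions coincide land in the same first-level group.
    """
    groups = defaultdict(list)
    for read in reads:
        key = (read["chrom"], read["start"], read["end"])
        groups[key].append(read)
    return dict(groups)
```

Each resulting list can then be subdivided by tag sequence to form the second-level groups of step (2).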

According to embodiments of the invention, step (2) comprises in detail:

(a) determining the depth of each tag in the first-level group;

(b) sorting the tags by depth from high to low;

(c) performing the following for each tag in turn, from highest to lowest depth:

if the tag differs from an existing seed tag sequence by no more than a specified number of mismatches, assigning the sequencing reads carrying that tag to the subgroup of that seed tag;

if the tag differs from the existing seed tag sequences by more than the specified number of mismatches, making the tag a new seed tag and assigning the sequencing reads carrying it to the corresponding seed tag subgroup.

After this second-level grouping, all sequencing reads have been divided into a number of second-level groups, which constitute the final grouping result.
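The seed-based second-level grouping can be sketched as a greedy pass over tags in order of decreasing depth. The function names and the returned dictionary are illustrative assumptions. The text does not say what happens when a tag is within the mismatch limit of more than one seed; this sketch assigns it to the earliest created (deepest) matching seed, which is an assumption.

```python
from collections import Counter

def mismatches(a, b):
    """Number of differing positions between two equal-length tags."""
    return sum(x != y for x, y in zip(a, b))

def cluster_tags(tags, max_mismatch=1):
    """Split the tags of one first-level group into seed-tag subgroups.

    Returns {seed_tag: [tags assigned to that seed]}, one entry per read.
    """
    depth = Counter(tags)
    subgroups = {}                                  # seed tag -> member tags
    for tag, n in depth.most_common():              # highest depth first
        for seed in subgroups:                      # seeds in order of creation
            if mismatches(tag, seed) <= max_mismatch:
                subgroups[seed].extend([tag] * n)   # join an existing seed subgroup
                break
        else:
            subgroups[tag] = [tag] * n              # no seed close enough: new seed
    return subgroups
```

Because tags are visited from highest to lowest depth, every seed is the deepest tag of its subgroup, and low-depth tags that differ from it by a single base are absorbed as probable sequencing errors.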

The clustering result for the sequencing reads is thus reliable, which enables accurate quantification of DNA molecules and lays a solid foundation for later accurate detection of ultra-low-frequency mutations using the consensus sequences.

According to embodiments of the invention, the seed tag in (c) is the tag sequence with the highest depth in a second-level group; it is regarded as the true tag sequence of that group, while the group also contains some lower-depth tag sequences that carry errors. The clustering result for the sequencing reads is therefore reliable, and the results of subsequent sequencing analysis are accurate.

According to embodiments of the invention, the specified number of mismatches in (c) is determined by the sequencing platform used. When an Illumina sequencing platform is used, sequencing errors are mainly mismatches, so an 8 bp molecular tag tolerates 1 mismatch; that is, the specified number of mismatches is 1. The clustering result is therefore reliable, and the results of subsequent sequencing analysis are accurate.

The solution of the invention is explained below with reference to an example. Those skilled in the art will understand that the following example serves only to illustrate the invention and should not be taken as limiting its scope. Where specific techniques or conditions are not indicated in the example, the techniques or conditions described in the literature of the art, or the product instructions, were followed. Reagents and instruments for which no manufacturer is indicated are conventional, commercially available products, obtainable for example from Illumina.

Example 1:

This example uses two (human) samples in which 8 known mutation sites (shown in Table 3 below) are present at mutation frequencies of 1% and 0.1%, respectively. DNA molecules were tagged with 8 bp random molecular tags; each sample then underwent primer amplification targeting each known mutation site using AmpliTaq 360 Master Mix; finally, each amplification product was sequenced (75PE) on an Illumina NS500 sequencing platform.

The sequencing reads obtained were then clustered by the clustering method described above to obtain sequencing read groups, and the consensus sequences of these sequencing read groups were determined by the method of the invention, in the following steps:

1. Filtering

After the sequencing reads have been clustered into sequencing read groups (read groups), these groups are filtered according to the following conditions:

a) Filter out read groups whose two ends align to different chromosomes;

b) Filter out read groups with an insert size < 30 bp or > 400 bp;

2. Determining the consensus sequence

Basic principle:

Processing steps:

a) For each position of the read, perform the following operations:

i. Count the depth of each of the four bases A, T, C and G;

c) Count the depth of each barcode (i.e. molecular tag) in the group and, by the same method as above, decide whether the barcode of the group can be determined. If not, filter out the group; if so, retain the group, at which point the final consensus sequence, its quality values and its barcode sequence have all been obtained.

Meanwhile counting each classification filtering situation such as the following table 1:

Table 1

In Table 1, the first column is the sample name: planAS01 is the sample with a mutation frequency of 0.1% and planAS1 is the sample with a mutation frequency of 1%. The second column is the total number of reads; the third column is the proportion of reads aligned to different chromosomes; the fourth column is the proportion of reads whose insert size is outside the preset range; the fifth column is the proportion of reads whose start position is not at a primer position; the sixth column is the proportion of reads in read groups with more than 5 uncertain bases or an undeterminable barcode sequence; the seventh column is the proportion of reads in read groups for which a consensus sequence and a consensus barcode sequence were obtained normally.

Further, the number of read groups (i.e. the number of consensus sequences) and the number of reads within the read groups are as follows:

Table 2

SampleID	ConsensesSeqs	Reads
planAS01	12996	4,684,288
planAS1	13435	4,642,052

In Table 2, the first column is the sample name: planAS01 is the sample with a mutation frequency of 0.1% and planAS1 is the sample with a mutation frequency of 1%. The second column is the number of consensus sequences; the third column is the number of reads in the read groups.
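As a quick check on what Table 2 implies, the average number of reads behind each consensus sequence can be computed directly from the two columns. The snippet below only performs that division; the dictionary layout and the function name are illustrative assumptions, and the numbers are copied from Table 2.

```python
# Values copied from Table 2.
table2 = {
    "planAS01": {"consensus_seqs": 12996, "reads": 4_684_288},
    "planAS1": {"consensus_seqs": 13435, "reads": 4_642_052},
}

def mean_reads_per_group(row):
    """Average number of reads per read group (consensus sequence)."""
    return row["reads"] / row["consensus_seqs"]

for sample, row in table2.items():
    print(f"{sample}: {mean_reads_per_group(row):.1f} reads per consensus sequence")
```

Both samples come out at a few hundred reads per read group, which is the depth available to the per-position base call in each group on average.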

After the consensus sequences were obtained, they were aligned to the human reference genome (hg19), and mutation detection was performed on the alignment results. The detection results are shown in Table 3 below:

Table 3

In Table 3, the first column is the chromosome number, the second column is the position of the mutation site on the chromosome, the third column is the gene name, the fourth column is the orientation of the gene on the chromosome, the fifth column gives the specific CDS and protein mutation information, the sixth column is the detection result for the sample with a mutation frequency of 0.1%, and the seventh column is the detection result for the sample with a mutation frequency of 1% (YES means detected, NO means not detected).

In summary, this example uses molecular tagging combined with tag-based clustering into groups and determination of the consensus sequence of each read group. With only about 5M reads of sequencing, it accurately detected all mutations with a frequency of 1% and 6 of the mutations with a frequency as low as 0.1%; the other 2 mutations at 0.1% could also be detected with a larger amount of sequencing data.

Current techniques for detecting low-frequency mutations, such as ARMS and digital PCR, can detect mutations down to 0.1%, but they suffer from low throughput and high cost and can only detect known mutation sites, while ordinary second-generation sequencing can only detect mutations down to a frequency of 2%. The above results of this example show that the invention, building on molecular tagging combined with tag-based clustering and consensus sequence determination for each read group, overcomes the shortcomings of techniques such as ARMS and digital PCR while successfully detecting mutations with a frequency as low as 0.1%.

In this specification, descriptions referring to "one embodiment", "some embodiments", "an example", "a specific example", "some examples" and the like mean that a specific feature, structure, material or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the invention. Such terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

Although embodiments of the invention have been shown and described, those of ordinary skill in the art will understand that various changes, modifications, substitutions and variations may be made to these embodiments without departing from the principle and spirit of the invention, the scope of which is defined by the claims and their equivalents.

Claims

1. A method for determining the consensus sequence of a sequencing read group, the method comprising:

(1) subjecting the sequencing read groups to a first filtering based on their basic alignment status, to obtain the sequencing read groups that pass the first filtering, the criteria of the first filtering being as follows:

(2) for each sequencing read group that passes the first filtering, determining the consensus sequence of the sequencing read group by the following steps:

(ii) selecting the base with a significant depth advantage as the base type at the given position, and deriving the quality value at that position from the depth of that base type;

(3) for each sequencing read group that passes the first filtering, determining the shared tag sequence of the sequencing read group by steps similar to those of (2):

(B) selecting the tag sequence with a significant depth advantage as the shared tag sequence of the sequencing read group;

wherein (ii) further comprises:

(A') at the given position, sorting the four bases A, T, C and G by depth to obtain the maximum depth and the second depth, together with their corresponding base types;

(B') based on the maximum depth and the second depth, determining the consensus base type at that position and the corresponding quality value;

(B') comprises:

if the parameter C is not less than a specified threshold, selecting the base with the maximum depth as the consensus base type at that position, with quality value Q = 20 + (max*C^2)/2, Q being set to 40 when Q > 40; if the parameter C is less than the specified threshold, determining that the consensus base at that position is the uncertain base N, with quality value Q = 2;

max denotes the maximum depth.

2. The method according to claim 1, wherein the sequencing read groups are obtained by clustering the sequencing reads produced by sequencing, and each sequencing read group is a group of reads with similar sequences and similar molecular tags, being multiple copies generated from the same molecular template by amplification,

wherein the read groups with similar sequences and similar molecular tags are determined as follows:

1) aligning the sequencing reads to a reference sequence, determining the positions of both ends of each sequencing read, and assigning sequencing reads whose end positions coincide to the same first-level group;

2) further dividing the sequencing reads of each first-level group into second-level groups according to their tag sequences, so that sequencing reads with similar molecular tag sequences fall into the same second-level group;

step 2) comprises:

a) determining the depth of each tag in the first-level group;

b) sorting the tags by depth from high to low;

c) performing the following for each tag in turn, from highest to lowest depth:

if the tag differs from an existing seed tag sequence by no more than a specified number of mismatches, assigning the sequencing reads carrying that tag to the subgroup of that seed tag;

if the tag differs from the existing seed tag sequences by more than the specified number of mismatches, making the tag a new seed tag and assigning the sequencing reads carrying it to the corresponding seed tag subgroup;

the seed tag in c) is the tag sequence with the highest depth in the second-level group; after the above second-level grouping, all sequencing reads have been divided into a number of second-level groups, and these second-level groups are said read groups with similar sequences and similar molecular tags.

3. The method according to claim 1, wherein the preset range in (b) is 30 to 400 bp.

4. The method according to claim 1, wherein the specified threshold is 0.65.

5. The method according to claim 1, wherein, if the number of uncertain bases N in the consensus sequence exceeds 5, determination of the consensus sequence fails and the sequencing read group is filtered out.