CN106021986B - Ultralow frequency mutating molecule consensus sequence degeneracy algorithm - Google Patents
- Publication number
- CN106021986B (application number CN201610348484.8A)
- Authority
- CN
- China
- Prior art keywords
- group
- label
- sequencing
- sequencing read
- depth
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- C—CHEMISTRY; METALLURGY
- C12—BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
- C12Q—MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
- C12Q1/00—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
- C12Q1/68—Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
- C12Q1/6869—Methods for sequencing
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
Abstract
The invention discloses a method for determining the consensus sequence of a sequencing read group. The method comprises: (1) subjecting the sequencing read groups to a first filtering according to their basic alignment status, to obtain the sequencing read groups that pass the first filtering; (2) for each sequencing read group that passes the first filtering, determining the consensus sequence of the read group; and (3) for each sequencing read group that passes the first filtering, determining the consensus tag sequence of the read group by steps similar to those of (2). The method effectively determines the consensus sequence of a read group obtained by repeatedly sequencing the same DNA molecule. It thereby enables accurate quantification of DNA molecules and mutations while eliminating the influence of sequencing errors and the like on the results, ensuring their accuracy.
Description
Technical field
The present invention relates to the field of sequencing technology, in particular to a consensus-sequence degeneracy algorithm for ultra-low-frequency mutant molecules, and specifically to a method for determining the consensus sequence of a sequencing read group.
Background art
With the rapid development of second-generation (next-generation) sequencing and the falling cost of sequencing, second-generation sequencing is being used ever more widely for detection and research in many fields. Compared with whole-genome sequencing, target-region sequencing greatly reduces sequencing cost and data complexity, so that the target regions of interest reach very high sequencing coverage at lower cost. This makes it possible to detect low-frequency mutations in cancer.
Among target-region sequencing approaches, PCR amplification of the target region with specific primers is widely used because it is simple and fast and requires only a small amount of DNA. However, primer-amplification sequencing inevitably suffers from severe amplification bias, and amplification and sequencing also introduce various errors. On the one hand, these problems directly affect quantitative accuracy, because read counts in the sequencing data do not represent the number of original DNA fragments. On the other hand, they impair the accuracy of the analysis by introducing a large number of false positives. In tumor mutation research these problems are especially pronounced, because tumors are highly heterogeneous and carry many low-frequency mutations.
Thus, current primer-amplification sequencing still needs improvement.
Summary of the invention
The present invention aims to solve at least one of the technical problems existing in the prior art. To this end, one object of the invention is to propose a method for determining the consensus sequence of a sequencing read group, so as to quantify DNA molecules and mutations accurately while eliminating the influence of sequencing errors and the like on the results, ensuring their accuracy.
It should be noted that the present invention was completed on the basis of the following work by the inventors:
To address the above problems of primer-amplification sequencing, researchers have introduced molecular tags: a unique tag sequence that represents the DNA molecule is ligated to the original DNA molecule. Different DNA molecules carry different molecular tags, so a DNA molecule can be identified accurately by its molecular tag sequence. Molecular tags allow DNA molecules and mutations to be quantified accurately, and they also reduce or even eliminate errors caused by amplification, sequencing and the like.
For second-generation sequencing data with molecular tags, data processing requires grouping the reads by molecular tag: reads with the same start and end positions and the same molecular tag are placed in one group and are considered to be multiple duplicates generated from the same DNA fragment by PCR amplification. Then, for each group, its final consensus sequence is found (herein, the "shared sequence" is also called the "consensus sequence"), which is the sequence of the original DNA molecule corresponding to that group. Finally, these consensus sequences are used for downstream analysis such as mutation detection.
However, because the tagged molecular templates are PCR-amplified in the experiment, each molecular template produces a set of identical daughter molecules; during the experiment and sequencing, some errors are then unavoidably introduced, so that the final fastq data consist of repeated sequencing of molecular templates containing a small number of errors. In view of this, the inventors set out to cluster reads from the same molecular template into groups according to the molecular tag and the sequence of the read itself (its alignment position on the genome), while taking sequencing errors into account, thereby obtaining sequencing read groups; and further, for the read groups obtained by clustering, to obtain the consensus sequence of each sequencing read group.
Accordingly, in a first aspect, the present invention provides a method for determining the consensus sequence of a sequencing read group. According to an embodiment of the invention, the method comprises:
(1) subjecting the sequencing read groups to a first filtering according to their basic alignment status, to obtain the sequencing read groups that pass the first filtering, the criteria of the first filtering being as follows:
(a) exclude read groups whose paired ends align to different chromosomes of the reference sequence;
(b) exclude read groups whose insert size is outside a preset range; and
(c) exclude read groups in which the start position of the sequencing read differs from the start position of the amplification primer;
(2) for each sequencing read group that passes the first filtering, determining the consensus sequence of the read group by the following steps:
(i) at a given position, traverse every sequencing read in the group and count the depth of each of the four bases A, T, C and G;
(ii) select the base with a significant depth advantage as the base type at the given position, and obtain the quality value at that position from the depth of that base type and related quantities;
(iii) repeat steps (i) and (ii) for all positions to determine the consensus sequence;
(3) for each sequencing read group that passes the first filtering, determining the consensus tag sequence of the read group by steps similar to those of (2):
(A) traverse the tag sequence of every sequencing read in the group and count the depth of each tag sequence; and
(B) select the tag sequence with a significant depth advantage as the consensus tag sequence of the read group.
The method of the invention effectively determines the consensus sequence of a read group obtained by repeatedly sequencing the same DNA molecule. It thereby enables accurate quantification of DNA molecules and mutations while eliminating the influence of sequencing errors and the like on the results, ensuring their accuracy.
According to an embodiment of the invention, the sequencing read groups are obtained by clustering the sequencing reads produced by sequencing. A sequencing read group is a group of reads with similar sequences and similar molecular tags, which are very likely multiple copies (i.e. duplicates) generated from the same molecular template by amplification.
According to an embodiment of the invention, "similar sequences" means that the reads align to the same start and end positions on the genome.
According to some specific examples of the invention, the preset range in (b) is 30 to 400 bp.
According to an embodiment of the invention, (ii) further comprises:
(A') at the given position, sorting the four bases A, T, C and G by depth to obtain the maximum depth, the second depth and the corresponding base types;
(B') determining the consensus base type at the given position and the corresponding quality value from the maximum depth and the second depth.
According to an embodiment of the invention, (B') comprises:
determining a parameter C, where C = (maximum depth - second depth) / maximum depth;
if C is not less than a specified threshold, selecting the base with the maximum depth as the consensus base type at the given position, with quality value Q = 20 + (max*C^2)/2, where max is the maximum depth and Q is set to 40 when Q > 40; if C is less than the specified threshold, the consensus base at the given position is the undetermined base N, with quality value Q = 2.
According to some specific examples of the invention, the specified threshold is 0.65.
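The per-position rule above can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation: the function name `call_position`, its input format (one string holding the bases observed at a single position across the group) and the handling of a column with no A/T/C/G at all are assumptions made for the example.

```python
from collections import Counter

def call_position(bases, threshold=0.65):
    """Return (consensus_base, quality) for one position of a read group.

    `bases` holds the base observed at this position in every read of the
    group (hypothetical input format chosen for this sketch).
    """
    counts = Counter(b for b in bases if b in "ATCG")
    # Sort the four depths from high to low; bases never seen count as 0.
    depths = sorted(((counts.get(b, 0), b) for b in "ATCG"), reverse=True)
    (max_depth, max_base), (sec_depth, _) = depths[0], depths[1]
    if max_depth == 0:
        return "N", 2  # assumption: no usable base is treated as undetermined
    c = (max_depth - sec_depth) / max_depth
    if c < threshold:
        return "N", 2  # no base has a significant depth advantage
    q = 20 + (max_depth * c ** 2) / 2
    return max_base, min(q, 40)  # Q is capped at 40
```

For example, nine reads showing A and one showing C give C = 8/9, so A is called; two A and one C give C = 0.5, so the position becomes N with Q = 2.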
According to an embodiment of the invention, if the consensus sequence contains more than 5 undetermined bases N, the consensus sequence is considered to have failed and the sequencing read group is filtered out.
According to an embodiment of the invention, (B) further comprises determining the consensus tag sequence by the following steps:
(A'') sorting the tag sequences by depth to obtain the maximum depth and the second depth;
(B'') determining a parameter C, where C = (maximum depth - second depth) / maximum depth;
if C is not less than a specified threshold, selecting the tag sequence with the maximum depth as the consensus tag sequence of the sequencing read group; if C is less than the specified threshold, the consensus tag sequence is considered to have failed and the sequencing read group is filtered out. In this way, the consensus tag sequence can be determined effectively, or the sequencing read group filtered out.
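The tag-level decision can be sketched in the same way. The function name `consensus_tag` is hypothetical, and reusing 0.65 as the threshold for tags is an assumption: the text only says "a specified threshold" here.

```python
from collections import Counter

def consensus_tag(tags, threshold=0.65):
    """Return the consensus tag of a read group, or None if it cannot be determined.

    `tags` is the list of tag sequences carried by the reads of one group.
    The 0.65 default is an assumption borrowed from the base-level example.
    """
    ranked = Counter(tags).most_common()  # tags sorted by depth, high to low
    if not ranked:
        return None
    max_depth = ranked[0][1]
    sec_depth = ranked[1][1] if len(ranked) > 1 else 0
    c = (max_depth - sec_depth) / max_depth
    return ranked[0][0] if c >= threshold else None
```

A return value of `None` corresponds to the case in which the read group is filtered out.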
Additional aspects and advantages of the invention will be set forth in part in the description that follows, and in part will become apparent from that description or be learned through practice of the invention.
Brief description of the drawings
The above and/or additional aspects and advantages of the invention will become apparent and readily understood from the following description of the embodiments in conjunction with the accompanying drawings, in which:
Fig. 1 shows a flow diagram of the method for determining the consensus sequence of a sequencing read group according to an embodiment of the invention.
Detailed description
Embodiments of the invention are described in detail below. The embodiments described below are exemplary; they serve only to explain the invention and are not to be construed as limiting it.
Method for determining the consensus sequence of a sequencing read group
In a first aspect, the present invention provides a method for determining the consensus sequence of a sequencing read group. According to an embodiment of the invention, the method comprises:
(1) subjecting the sequencing read groups to a first filtering according to their basic alignment status, to obtain the sequencing read groups that pass the first filtering, the criteria of the first filtering being as follows:
(a) exclude read groups whose paired ends align to different chromosomes of the reference sequence;
(b) exclude read groups whose insert size is outside a preset range; and
(c) exclude read groups in which the start position of the sequencing read differs from the start position of the amplification primer;
(2) for each sequencing read group that passes the first filtering, determining the consensus sequence of the read group by the following steps:
(i) at a given position, traverse every sequencing read in the group and count the depth of each of the four bases A, T, C and G;
(ii) select the base with a significant depth advantage as the base type at the given position, and obtain the quality value at that position from the depth of that base type and related quantities;
(iii) repeat steps (i) and (ii) for all positions to determine the consensus sequence;
(3) for each sequencing read group that passes the first filtering, determining the consensus tag sequence of the read group by steps similar to those of (2):
(A) traverse the tag sequence of every sequencing read in the group and count the depth of each tag sequence; and
(B) select the tag sequence with a significant depth advantage as the consensus tag sequence of the read group.
The method of the invention thus effectively determines the consensus sequence of a read group obtained by repeatedly sequencing the same DNA molecule. It thereby enables accurate quantification of DNA molecules and mutations while eliminating the influence of sequencing errors and the like on the results, ensuring their accuracy.
According to an embodiment of the invention, the sequencing read groups are obtained by clustering the sequencing reads produced by sequencing. A sequencing read group is a group of reads with similar sequences and similar molecular tags, generated by amplification and repeated sequencing of the same molecular template.
According to an embodiment of the invention, "similar sequences" means that the reads align to the same start and end positions on the genome.
According to an embodiment of the invention, the preset range can be determined from the extent of the actual target sequencing region. According to some specific examples of the invention, the preset range in (b) is 30 to 400 bp.
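The three criteria of the first filtering can be sketched as a single predicate. The `GroupAlignment` record and its field names are hypothetical, introduced only for this illustration; the patent does not specify how the alignment summary of a read group is represented.

```python
from dataclasses import dataclass

@dataclass
class GroupAlignment:
    """Hypothetical alignment summary of one sequencing read group."""
    chrom_r1: str      # chromosome the first end aligns to
    chrom_r2: str      # chromosome the second end aligns to
    insert_size: int   # insert size in bp
    read_start: int    # start position of the read
    primer_start: int  # start position of the amplification primer

def passes_first_filter(g, min_insert=30, max_insert=400):
    """Apply criteria (a), (b) and (c); the 30-400 bp range is the example preset range."""
    if g.chrom_r1 != g.chrom_r2:                         # (a) ends on different chromosomes
        return False
    if not (min_insert <= g.insert_size <= max_insert):  # (b) insert size out of range
        return False
    if g.read_start != g.primer_start:                   # (c) read does not start at the primer
        return False
    return True
```

Keeping the range as parameters reflects the statement that the preset range depends on the target sequencing region.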
According to an embodiment of the invention, (ii) further comprises:
(A') at the given position, sorting the four bases A, T, C and G by depth to obtain the maximum depth, the second depth and the corresponding base types;
(B') determining the consensus base type at the given position and the corresponding quality value from the maximum depth and the second depth.
According to an embodiment of the invention, (B') comprises:
determining a parameter C, where C = (maximum depth - second depth) / maximum depth;
if C is not less than a specified threshold, selecting the base with the maximum depth as the consensus base type at the given position, with quality value Q = 20 + (max*C^2)/2, where max is the maximum depth and Q is set to 40 when Q > 40; if C is less than the specified threshold, the consensus base at the given position is the undetermined base N, with quality value Q = 2.
According to an embodiment of the invention, the specified threshold can be determined according to practical requirements. According to some specific examples of the invention, the specified threshold is 0.65.
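To make the rule concrete, here are three worked cases with the threshold 0.65. The depths are made-up numbers chosen for illustration, not data from the patent; max and sec denote the maximum and second depths.

```latex
\begin{align*}
\text{max}=30,\ \text{sec}=2:&\quad C=\tfrac{30-2}{30}\approx 0.933\ \ge 0.65,
  \quad Q = 20+\tfrac{30\times 0.933^{2}}{2}\approx 33.1 \\
\text{max}=50,\ \text{sec}=0:&\quad C=\tfrac{50-0}{50}=1\ \ge 0.65,
  \quad Q = 20+\tfrac{50\times 1^{2}}{2}=45 > 40 \ \Rightarrow\ Q=40 \\
\text{max}=10,\ \text{sec}=4:&\quad C=\tfrac{10-4}{10}=0.6 < 0.65
  \ \Rightarrow\ \text{base } N,\ Q=2
\end{align*}
```

The first case yields a confident call below the cap, the second hits the cap of 40, and the third fails the threshold and is reported as an undetermined base.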
According to an embodiment of the invention, if the consensus sequence contains more than 5 undetermined bases N, the consensus sequence is considered to have failed and the sequencing read group is filtered out.
According to an embodiment of the invention, (B) further comprises determining the consensus tag sequence by the following steps:
(A'') sorting the tag sequences by depth to obtain the maximum depth and the second depth;
(B'') determining a parameter C, where C = (maximum depth - second depth) / maximum depth;
if C is not less than a specified threshold, selecting the tag sequence with the maximum depth as the consensus tag sequence of the sequencing read group; if C is less than the specified threshold, the consensus tag sequence is considered to have failed and the sequencing read group is filtered out. In this way, the consensus tag sequence can be determined effectively, or the sequencing read group filtered out.
According to other embodiments of the invention, referring to Fig. 1, the method of the invention for determining the consensus sequence of a sequencing read group comprises the following steps:
1. Filtering
After the sequencing reads have been clustered into sequencing read groups, the read groups are filtered according to the following conditions:
a) Read groups whose paired ends align to different chromosomes are filtered out.
b) Read groups with an insert size < 30 or > 400 are filtered out.
Since cfDNA fragment sizes are mainly around 166 bp and 330 bp, the insert size should not exceed 400 bp; and since amplification primers are generally longer than 20 bp, the insert size should be at least 30 bp.
c) Read groups in which the read start position is not at the amplification primer start position are filtered out.
Because the reads are amplification products of the amplification primers, the start position of a read should be the start position of a primer.
2. Determining the consensus sequence
Basic principle:
The reads in each sequencing read group are generated from the same molecular template, so in principle the reads within a group should have the same sequence and the same barcode. However, because errors are unavoidable during the experiment and sequencing, the reads in a group contain some errors. Determining the consensus sequence is precisely the process of excluding these errors to recover the true sequence of the molecular template.
Processing steps:
a) For each read position, perform the following operations:
i. Count the depth of each of the four bases A, T, C and G.
ii. Sort the depths of the four bases from high to low to obtain max, sec, third and fourth.
iii. Calculate the coefficient C = (max - sec)/max. If C >= 0.65, the base with depth max is taken as the consensus base at this position, with quality Q = 20 + (max*C^2)/2, where Q is set to 40 when Q > 40. If C < 0.65, the base at this position is considered undetermined: the consensus sequence has N at this position, with quality value Q = 2.
After these operations have been carried out for every base position of the reads, the consensus sequence of the group and the corresponding quality values are obtained; the consensus sequence may still contain some undetermined bases (N).
b) If the consensus sequence contains more than 5 undetermined bases, the group is filtered out; if it contains no more than 5, proceed to step c).
c) Count the depth of each barcode (molecular tag) in the group and, using the same method as above, judge whether the barcode of the group can be determined. If it cannot, the group is filtered out. If it can, the group is retained, and its final consensus sequence, quality values and barcode sequence have all been obtained.
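Steps a) to c) can be chained into one routine per read group, sketched below. The names `dominant` and `process_group` are hypothetical, and the sketch assumes that all reads in a group have the same length and are already aligned column by column, and that the barcode check uses the same 0.65 threshold as the base check.

```python
from collections import Counter

THRESHOLD = 0.65  # coefficient threshold from the text
MAX_N = 5         # a group with more undetermined bases than this is filtered out

def dominant(items):
    """Return (most frequent item, its depth, coefficient C); (None, 0, 0.0) if empty."""
    ranked = Counter(items).most_common(2)
    if not ranked:
        return None, 0, 0.0
    max_depth = ranked[0][1]
    sec_depth = ranked[1][1] if len(ranked) > 1 else 0
    return ranked[0][0], max_depth, (max_depth - sec_depth) / max_depth

def process_group(reads, barcodes):
    """Return (consensus, qualities, barcode) for one read group, or None if filtered."""
    seq, quals = [], []
    for column in zip(*reads):  # step a): one column per read position
        base, depth, c = dominant(b for b in column if b in "ATCG")
        if base is None or c < THRESHOLD:
            seq.append("N")
            quals.append(2)
        else:
            seq.append(base)
            quals.append(min(20 + depth * c * c / 2, 40))
    if seq.count("N") > MAX_N:  # step b): too many undetermined bases
        return None
    barcode, _, c = dominant(barcodes)  # step c): same rule applied to barcodes
    if barcode is None or c < THRESHOLD:
        return None
    return "".join(seq), quals, barcode
```

A group that returns `None` is one that would be filtered out; otherwise the tuple carries the three outputs named in step c).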
Sequencing read groups
As noted above, the sequencing read groups of the invention are obtained by clustering sequencing reads, and the sequencing reads carry tag sequences. To aid understanding, one method of clustering sequencing reads is set out below.
According to an embodiment of the invention, the sequencing reads can be clustered by the following steps to obtain sequencing read groups:
(1) align a plurality of sequencing reads to the reference sequence, determine the two end positions of each sequencing read, and assign sequencing reads whose two end positions are identical to the same primary group;
(2) further divide the sequencing reads belonging to the same primary group into secondary groups according to their tag sequences, so that sequencing reads with similar molecular tag sequences fall into the same secondary group.
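Step (1) amounts to grouping reads by their aligned end positions, which can be sketched as below. Representing a read as a dictionary with `chrom`, `start` and `end` keys is an assumption made for the example, as is including the chromosome in the grouping key.

```python
from collections import defaultdict

def primary_groups(reads):
    """Group aligned reads whose two end positions coincide.

    Each read is a dict with hypothetical keys 'chrom', 'start' and 'end'.
    """
    groups = defaultdict(list)
    for read in reads:
        key = (read["chrom"], read["start"], read["end"])
        groups[key].append(read)
    return dict(groups)
```

Each value of the returned dictionary is one primary group, ready to be split by tag sequence in step (2).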
According to an embodiment of the invention, step (2) comprises the following detailed steps:
(a) determine the depth of each tag in the primary group;
(b) sort the tags by depth from high to low;
(c) for each tag in turn, from the highest depth to the lowest, carry out the following:
if the number of mismatches between the tag and an existing seed tag sequence does not exceed the specified number of mismatches, assign the sequencing reads carrying that tag to the subgroup of that seed tag;
if the number of mismatches between the tag and every existing seed tag sequence exceeds the specified number of mismatches, make the tag a new seed tag and assign the sequencing reads carrying that tag to the corresponding seed tag subgroup.
After this secondary grouping, all sequencing reads have been divided into a number of secondary groups, and these secondary groups are the final grouping result.
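The seed-based secondary grouping can be sketched as a greedy pass over the tags in order of decreasing depth. The names `mismatches` and `cluster_tags` are hypothetical. Two points are assumptions not settled by the text: a tag that matches several seeds is assigned to the deepest one, and tags of unequal length are treated as too different to match.

```python
from collections import Counter

def mismatches(a, b):
    """Count mismatching positions between two tags (Hamming distance)."""
    if len(a) != len(b):
        return max(len(a), len(b))  # assumption: unequal lengths never match
    return sum(x != y for x, y in zip(a, b))

def cluster_tags(tags, max_mismatch=1):
    """Map each seed tag to the list of observed tags assigned to its subgroup.

    `tags` holds the tag of every read in one primary group.
    """
    seeds = {}  # insertion order follows decreasing depth
    for tag, depth in Counter(tags).most_common():
        for seed in seeds:
            if mismatches(tag, seed) <= max_mismatch:
                seeds[seed].extend([tag] * depth)  # join an existing seed subgroup
                break
        else:
            seeds[tag] = [tag] * depth  # no seed is close enough: new seed
    return seeds
```

Because tags are visited from the highest depth downward, each seed is the deepest tag of its subgroup.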
The clustering result for the sequencing reads is therefore reliable, which enables accurate quantification of DNA molecules and lays a solid foundation for accurate ultra-low-frequency mutation detection using the consensus sequences.
According to an embodiment of the invention, the seed tag in (c) is the tag sequence with the highest depth in a secondary group and is considered the true tag sequence of that group; the group also contains some lower-depth tag sequences that carry errors. The clustering result is therefore reliable, and the subsequent sequencing analysis is accurate.
According to an embodiment of the invention, the specified number of mismatches in (c) is determined by the sequencing platform used. When an Illumina sequencing platform is used, the dominant sequencing errors are mismatches, so an 8 bp molecular tag is allowed 1 mismatch, i.e. the specified number of mismatches is 1. The clustering result is therefore reliable, and the subsequent sequencing analysis is accurate.
The solution of the invention is explained below with reference to an example. Those skilled in the art will understand that the following example serves only to illustrate the invention and should not be taken as limiting its scope. Where specific techniques or conditions are not stated in the example, they follow the techniques or conditions described in the literature in the art or the product instructions. Reagents or instruments for which no manufacturer is stated are conventional, commercially available products, for example obtainable from Illumina.
Example 1:
This example used two human samples in which 8 known mutation sites (shown in Table 3 below) have mutation frequencies of 1% and 0.1%, respectively. The DNA molecules were labeled with 8 bp random molecular tags; AmpliTaq 360 Master Mix was then used to perform primer amplification of each known mutation site in each sample; finally, each amplification product was sequenced in 75PE mode on an Illumina NS500 sequencing platform.
The sequencing reads obtained were then clustered by the read-clustering method described above to obtain sequencing read groups, and the consensus sequences of these read groups were determined by the method of the invention, in the following steps:
1. Filtering
After the sequencing reads have been clustered into sequencing read groups, the read groups are filtered according to the following conditions:
a) Read groups whose paired ends align to different chromosomes are filtered out.
b) Read groups with an insert size < 30 or > 400 are filtered out.
Since cfDNA fragment sizes are mainly around 166 bp and 330 bp, the insert size should not exceed 400 bp; and since amplification primers are generally longer than 20 bp, the insert size should be at least 30 bp.
c) Read groups in which the read start position is not at the amplification primer start position are filtered out.
Because the reads are amplification products of the amplification primers, the start position of a read should be the start position of a primer.
2. Determining the consensus sequence
Basic principle:
The reads in each sequencing read group are generated from the same molecular template, so in principle the reads within a group should have the same sequence and the same barcode. However, because errors are unavoidable during the experiment and sequencing, the reads in a group contain some errors. Determining the consensus sequence is precisely the process of excluding these errors to recover the true sequence of the molecular template.
Processing steps:
a) For each read position, perform the following operations:
i. Count the depth of each of the four bases A, T, C and G.
ii. Sort the depths of the four bases from high to low to obtain max, sec, third and fourth.
iii. Calculate the coefficient C = (max - sec)/max. If C >= 0.65, the base with depth max is taken as the consensus base at this position, with quality Q = 20 + (max*C^2)/2, where Q is set to 40 when Q > 40. If C < 0.65, the base at this position is considered undetermined: the consensus sequence has N at this position, with quality value Q = 2.
After these operations have been carried out for every base position of the reads, the consensus sequence of the group and the corresponding quality values are obtained; the consensus sequence may still contain some undetermined bases (N).
b) If the consensus sequence contains more than 5 undetermined bases, the group is filtered out; if it contains no more than 5, proceed to step c).
c) Count the depth of each barcode (i.e. molecular tag) in the group and, using the same method as above, judge whether the barcode of the group can be determined. If it cannot, the group is filtered out. If it can, the group is retained, and its final consensus sequence, quality values and barcode sequence have all been obtained.
Meanwhile counting each classification filtering situation such as the following table 1:
Table 1
First row is sample names in table 1, and planAS01 is the sample that the frequency of mutation is 0.1%, and planAS1 is mutation frequency
The sample that rate is 1%;Secondary series is total reads number;Third column are to compare the ratio shared by the reads of different chromosomes;The
Four column are ratios shared by the reads of Insert Fragment size not within the predefined range;5th column are initial positions not in primer position
Ratio shared by the reads set;6th column are that uncertain base number is greater than in the uncertain read group of 5 or barcode sequence
Ratio shared by reads;7th column are in the read group for can normally obtain consensus sequence and consistency barcode sequence
Ratio shared by reads.
Further, the number of read groups (i.e. the number of consensus sequences) and the number of reads in those read groups are as follows:
Table 2
SampleID | ConsensesSeqs | Reads |
planAS01 | 12996 | 4,684,288 |
planAS1 | 13435 | 4,642,052 |
In Table 2, the first column is the sample name: planAS01 is the sample with a mutation frequency of 0.1%, and planAS1 is the sample with a mutation frequency of 1%. The second column is the number of consensus sequences. The third column is the number of reads in the read groups.
After the consensus sequences were obtained, they were aligned to the human reference genome (hg19), and mutation detection was performed on the alignment results. The detection results are given in Table 3 below:
Table 3
In Table 3, the first column is the chromosome number, the second column is the position of the mutation site on the chromosome, the third column is the gene name, the fourth column is the orientation of the gene on the chromosome, the fifth column is the specific CDS and protein mutation information, the sixth column is the detection result for the sample with a mutation frequency of 0.1%, and the seventh column is the detection result for the sample with a mutation frequency of 1% (YES means detected, NO means not detected).
In summary, this example used molecular tagging combined with tag-based clustering and the consensus sequences of the read groups. With only about 5M reads of sequencing, it accurately detected all of the mutations with a frequency of 1% and 6 of the mutations with a frequency as low as 0.1%; the other two 0.1% mutations could also be detected if the amount of sequencing data were increased.
Current technologies for detecting low-frequency mutations, such as ARMS and digital PCR, can detect mutations down to 0.1%, but they have low throughput and high cost and can only detect known mutation sites, while ordinary second-generation sequencing can only detect mutation frequencies of 2%. The results of this example show that the invention, building on molecular tagging and combining tag-based clustering with read-group consensus sequences, overcomes the shortcomings of ARMS, digital PCR and similar technologies while successfully detecting mutations with frequencies as low as 0.1%.
In the description of this specification, reference to the terms "one embodiment", "some embodiments", "example", "specific example" or "some examples" means that a specific feature, structure, material or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, such terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Although embodiments of the present invention have been shown and described, those of ordinary skill in the art will understand that various changes, modifications, substitutions and variations may be made to these embodiments without departing from the principle and purpose of the present invention; the scope of the present invention is defined by the claims and their equivalents.
Claims (5)
1. A method for determining the consensus sequence of a sequencing read group, comprising:
(1) subjecting the sequencing read groups to a first filtering according to basic alignment status, so as to obtain sequencing read groups that pass the first filtering, the criteria of the first filtering being as follows:
(a) excluding sequencing read groups whose two ends match different chromosomes of the reference sequence;
(b) excluding sequencing read groups whose insert fragment is outside a preset range; and
(c) excluding sequencing read groups in which the start position of the sequencing read differs from the start position of the amplification primer;
(2) for each sequencing read group that passes the first filtering, determining the consensus sequence of the sequencing read group by the following steps:
(i) at a predetermined position, traversing each sequencing read in the group and counting the respective depths of the four bases A, T, C and G;
(ii) selecting the base with a significant depth advantage as the base type of the predetermined position, and obtaining the quality value of the predetermined position from the depth of that base type;
(iii) repeating steps (i) and (ii) for all positions, so as to determine the consensus sequence;
(3) for each sequencing read group that passes the first filtering, determining the shared label sequence of the sequencing read group by steps similar to those of (2):
(A) traversing the label sequence of each sequencing read in the group and counting the depth of each label sequence; and
(B) selecting the label sequence with a significant depth advantage as the shared label sequence of the sequencing read group;
wherein (ii) further comprises:
(A') at the predetermined position, sorting the four bases A, T, C and G by depth, so as to obtain the maximum depth and the second depth and their corresponding base types;
(B') determining, based on the maximum depth and the second depth, the consensus base type of the predetermined position and the corresponding quality value;
wherein (B') comprises:
determining a parameter C, where C = (maximum depth - second depth) / maximum depth;
if parameter C is not less than a specified threshold, selecting the base with the maximum depth as the consensus base type of the predetermined position, with the quality value of that base Q = 20 + (max*C^2)/2, Q being set to 40 when Q > 40; if parameter C is less than the specified threshold, determining that the consensus base type of the predetermined position is the uncertain base N, with corresponding quality value Q = 2;
where max denotes the maximum depth.
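The per-position rule of claim 1 can be illustrated with a short Python sketch. It is only an illustration under stated assumptions, not the patented implementation: the function names are hypothetical, reads are assumed to be already aligned and of equal length, the default threshold of 0.65 is taken from claim 4, and a position with no A/T/C/G observations is arbitrarily called N. The claim does not spell out the "significant depth advantage" test for label sequences in step (3), so `shared_label` simply reuses the parameter C rule as an assumption.

```python
from collections import Counter


def consensus_base(bases, threshold=0.65):
    """Call one consensus base and its quality value from the bases of a read group."""
    counts = Counter(b for b in bases if b in "ATCG")
    if not counts:
        # Assumption: a position without any A/T/C/G observation is called N.
        return "N", 2
    ranked = counts.most_common()  # sorted by depth, highest first
    top_base, max_depth = ranked[0]
    second_depth = ranked[1][1] if len(ranked) > 1 else 0
    c = (max_depth - second_depth) / max_depth  # parameter C
    if c < threshold:
        return "N", 2  # uncertain base, Q = 2
    q = 20 + (max_depth * c ** 2) / 2  # Q = 20 + (max*C^2)/2
    return top_base, min(q, 40)  # Q is capped at 40


def consensus_sequence(reads, threshold=0.65):
    """Repeat the per-position call over all positions of equal-length reads."""
    calls = [consensus_base(column, threshold) for column in zip(*reads)]
    sequence = "".join(base for base, _ in calls)
    qualities = [q for _, q in calls]
    return sequence, qualities


def shared_label(labels, threshold=0.65):
    """Pick the dominant label of a group; reusing the C rule here is an assumption."""
    ranked = Counter(labels).most_common(2)
    if not ranked:
        return None
    max_depth = ranked[0][1]
    second_depth = ranked[1][1] if len(ranked) > 1 else 0
    c = (max_depth - second_depth) / max_depth
    return ranked[0][0] if c >= threshold else None
```

For example, with nine reads showing A and one showing T at a position, C = 8/9, which is above 0.65, so the call is A with Q = 20 + 9*(8/9)^2/2; with two A and one T, C = 0.5 and the position is called N with Q = 2.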
2. The method according to claim 1, wherein the sequencing read groups are obtained by clustering the sequencing reads obtained by sequencing, each sequencing read group being a read group with similar sequences and similar molecular labels, that is, multiple copies generated by amplification from the same molecular template,
wherein the read groups with similar sequences and similar molecular labels are determined as follows:
1) aligning multiple sequencing reads to the reference sequence, determining the positions of the two ends of each sequencing read, and assigning sequencing reads with identical end positions to the same first-level group;
2) further dividing the sequencing reads belonging to the same first-level group into second-level groups according to their label sequences, sequencing reads with similar molecular label sequences being placed in the same second-level group;
wherein 2) comprises:
a) determining the depth of each label in the first-level group;
b) sorting the labels by depth from high to low;
c) performing the following steps for each label in turn, in order of depth from high to low:
if the mismatch between the label and an existing Seed label sequence does not exceed the specified number of mismatches, assigning the sequencing reads carrying the label to that Seed label subgroup;
if the mismatch between the label and the existing Seed label sequences exceeds the specified number of mismatches, selecting the label as a new Seed label and assigning the sequencing reads carrying the label to the corresponding Seed label subgroup;
the Seed label in c) refers to the label sequence with the highest depth in the second-level group; after the above second-level grouping, all sequencing reads are divided into several second-level groups, and these second-level groups are the read groups with similar sequences and similar molecular labels.
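The two-level grouping of claim 2 can be sketched in Python as follows. This is a minimal illustration under several assumptions that the claim does not fix: mismatches are counted as Hamming distance, the specified mismatch number defaults to 1, a label that matches several Seed labels joins the deepest one, and reads are supplied as hypothetical `(read_id, start, end, label)` tuples taken from an existing alignment.

```python
from collections import Counter, defaultdict


def hamming(a, b):
    """Number of mismatching positions; labels of unequal length never match."""
    if len(a) != len(b):
        return max(len(a), len(b))
    return sum(x != y for x, y in zip(a, b))


def cluster_reads(reads, max_mismatch=1):
    """Group reads by end positions, then by label similarity to a Seed label.

    reads: iterable of (read_id, start, end, label) tuples (hypothetical layout).
    Returns a dict mapping (start, end, seed_label) to a list of read ids.
    """
    # Step 1): first-level groups share both end positions.
    level_one = defaultdict(list)
    for read_id, start, end, label in reads:
        level_one[(start, end)].append((read_id, label))

    groups = {}
    for (start, end), members in level_one.items():
        # Steps a) and b): label depths, visited from deepest to shallowest.
        depth = Counter(label for _, label in members)
        seeds = []       # Seed labels, in order of creation (deepest first)
        assignment = {}  # label -> Seed label of its second-level group
        for label, _ in depth.most_common():
            # Step c): join an existing Seed label subgroup or found a new one.
            seed = next((s for s in seeds if hamming(label, s) <= max_mismatch), None)
            if seed is None:
                seed = label
                seeds.append(seed)
            assignment[label] = seed
        for read_id, label in members:
            groups.setdefault((start, end, assignment[label]), []).append(read_id)
    return groups
```

Because labels are visited in order of decreasing depth, each Seed label is by construction the deepest label of its second-level group, which matches the definition of the Seed label given in the claim.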
3. The method according to claim 1, wherein the preset range in (b) is 30 to 400 bp.
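The first filtering of claim 1, with the 30 to 400 bp insert range of claim 3, might look like the sketch below. The `ReadPair` record and its field names are hypothetical, the range bounds are treated as inclusive (an assumption), and for simplicity the criteria are applied to a single paired-end record rather than to a whole read group.

```python
from dataclasses import dataclass


@dataclass
class ReadPair:
    """Hypothetical summary of one aligned paired-end read."""
    chrom1: str        # chromosome matched by the first end
    chrom2: str        # chromosome matched by the second end
    insert_size: int   # insert fragment length in bp
    start: int         # start position of the sequencing read
    primer_start: int  # start position of the amplification primer


def passes_first_filter(pair, min_insert=30, max_insert=400):
    """Apply criteria (a) to (c) of the first filtering."""
    if pair.chrom1 != pair.chrom2:
        return False  # (a) the two ends match different chromosomes
    if not (min_insert <= pair.insert_size <= max_insert):
        return False  # (b) insert fragment outside the preset range
    if pair.start != pair.primer_start:
        return False  # (c) read start differs from primer start
    return True
```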
4. The method according to claim 1, wherein the specified threshold is 0.65.
5. The method according to claim 1, wherein, if the number of uncertain bases N in the consensus sequence exceeds 5, determination of the consensus sequence is deemed to have failed and the sequencing read group is filtered out.
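The acceptance rule of claim 5 amounts to counting the uncertain bases in a finished consensus sequence. A minimal sketch, with a hypothetical function name:

```python
def keep_consensus(consensus, max_n=5):
    """Return False when the consensus contains more than max_n uncertain bases N."""
    return consensus.count("N") <= max_n
```

A read group whose consensus fails this check would be discarded before alignment and mutation detection.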
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610348484.8A CN106021986B (en) | 2016-05-24 | 2016-05-24 | Ultralow frequency mutating molecule consensus sequence degeneracy algorithm |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610348484.8A CN106021986B (en) | 2016-05-24 | 2016-05-24 | Ultralow frequency mutating molecule consensus sequence degeneracy algorithm |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106021986A CN106021986A (en) | 2016-10-12 |
CN106021986B true CN106021986B (en) | 2019-04-09 |
Family
ID=57094507
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610348484.8A Active CN106021986B (en) | 2016-05-24 | 2016-05-24 | Ultralow frequency mutating molecule consensus sequence degeneracy algorithm |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106021986B (en) |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107523563A (en) * | 2017-09-08 | 2017-12-29 | 杭州和壹基因科技有限公司 | A kind of Bioinformatics method for Circulating tumor DNA analysis |
CN107563151B (en) * | 2017-09-18 | 2020-09-22 | 杭州和壹基因科技有限公司 | Error correction method for genome sequence assembled by PacBio sequencing data |
WO2019074972A1 (en) * | 2017-10-10 | 2019-04-18 | Memorial Sloan Kettering Cancer Center | System and methods for primer extraction and clonality detection |
CN108154010B (en) * | 2017-12-26 | 2018-10-19 | 东莞博奥木华基因科技有限公司 | A kind of ctDNA low frequencies mutation sequencing data analysis method and device |
US11600360B2 (en) * | 2018-08-20 | 2023-03-07 | Microsoft Technology Licensing, Llc | Trace reconstruction from reads with indeterminant errors |
CN116469462B (en) * | 2023-03-20 | 2024-09-20 | 重庆邮电大学 | Ultra-low frequency DNA mutation identification method and device based on double sequencing |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103805689B (en) * | 2012-11-15 | 2015-08-19 | 深圳华大基因科技服务有限公司 | A kind of sex chromosome with heterotype sequence assembling method of feature based kmer and application thereof |
CN105046105A (en) * | 2015-07-09 | 2015-11-11 | 天津诺禾医学检验所有限公司 | Haplotype map of chromosome span, and construction method thereof |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CA2526810A1 (en) * | 2003-05-23 | 2005-04-21 | Cold Spring Harbor Laboratory | Virtual representations of nucleotide sequences |
Also Published As
Publication number | Publication date |
---|---|
CN106021986A (en) | 2016-10-12 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |