CN107895104A - Method and apparatus for assessing and verifying the sequence assembly result of third-generation sequencing - Google Patents

Info

Publication number
CN107895104A
Authority
CN
China
Prior art keywords
sequence
generations
sequencing
result
assembling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201711114931.4A
Other languages
Chinese (zh)
Other versions
CN107895104B (en)
Inventor
邓天全
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BGI Technology Solutions Co Ltd
Original Assignee
BGI Technology Solutions Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BGI Technology Solutions Co Ltd filed Critical BGI Technology Solutions Co Ltd
Priority to CN201711114931.4A priority Critical patent/CN107895104B/en
Publication of CN107895104A publication Critical patent/CN107895104A/en
Application granted granted Critical
Publication of CN107895104B publication Critical patent/CN107895104B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids

Landscapes

  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Biophysics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Medicines Containing Antibodies Or Antigens For Use As Internal Diagnostic Agents (AREA)

Abstract

The invention discloses a method and apparatus for assessing and verifying the sequence assembly result of third-generation sequencing. The method provided by the invention comprises: aligning second-generation reads to the third-generation assembly result; selecting low-coverage-depth regions and extending them to obtain extended sequences; aligning third-generation reads to the extended sequences; computing base coverage depth statistics; and marking the assembly result. The invention can screen out and mark the regions of a third-generation assembly whose assembly quality is not high. In subsequent studies of the species, these marks act as a warning whenever such low-quality regions need to be used, and they provide a quick screening tool for later improvement. The invention can also demonstrate the accuracy and quality of the third-generation assembly result and can help improve its accuracy.

Description

Method and apparatus for assessing and verifying the sequence assembly result of third-generation sequencing
Technical field
The invention belongs to the field of genome sequencing and relates to a method and apparatus for assessing and verifying the sequence assembly result of third-generation sequencing.
Background art
A contig is a gap-free stretch of sequence assembled by joining reads through their overlapping regions. A scaffold is an ordered arrangement of contigs, positioned using paired-end information, with gaps in between. To compute N50, the assembled contigs or scaffolds are sorted from longest to shortest and their lengths are accumulated; when the cumulative length first exceeds 50% of the total assembled length, the length of the last contig or scaffold added is the N50. N50 is an important measure of the continuity and completeness of an assembly. N70 and N90 are computed in the same way, with the percentage changed to 70% or 90%.
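The N50 definition above can be sketched in a few lines of Python. This is a minimal illustration, not part of the patent: the function name `nx` and the example lengths are hypothetical, and the sketch treats a cumulative length exactly equal to the target as reaching it, which is the common convention.

```python
def nx(lengths, fraction=0.5):
    """Return the Nx value of a list of contig or scaffold lengths.

    fraction=0.5 gives N50, 0.7 gives N70, 0.9 gives N90.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    target = fraction * sum(lengths)
    running = 0
    # Walk from the longest sequence to the shortest, accumulating length.
    for length in sorted(lengths, reverse=True):
        running += length
        # Assumption: an exact tie counts as reaching the target (>=).
        if running >= target:
            return length


contigs = [80, 70, 50, 40, 30, 20]  # hypothetical contig lengths, total 290
print(nx(contigs, 0.5), nx(contigs, 0.7), nx(contigs, 0.9))  # 70 40 30
```

Here the two longest contigs already add up to 150, which passes half of 290, so the N50 is the length of the second contig, 70.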
Because second-generation sequencing is limited by read length (typically 50 bp to 300 bp), neither of the two assembly algorithms in use, OLC and DBG, can span long repeat regions, so the assembly breaks wherever such repeats are encountered. Although large-insert libraries of graded sizes (e.g. 2k, 5k, 10k, 20k, 40k) can be used to join two contigs into a scaffold through paired-end alignment relationships, the contig N50 is still not long (typically 1k to 70k).
Third-generation sequencing, here PacBio single-molecule real-time (SMRT) sequencing, produces very long reads (average read length typically 8k to 13k). It can therefore produce high-quality assemblies of complex genomic regions such as highly repetitive sequences, transposon regions and highly variable regions, giving a longer contig N50 and scaffold N50 and a more complete and accurate assembly. As the cost of third-generation sequencing falls, more and more genome projects use third-generation assembly. The main third-generation assemblers at present are PBCR, Falcon, MECAT, CANU and HGAP; all of them include self-correction of reads followed by assembly of the corrected reads. Because the average error rate of third-generation reads is as high as 15%, these tools must first perform self-correction and then assemble the corrected reads to obtain an assembly result. Since that result may still contain single-base errors or structural variations, it is subsequently polished with the raw third-generation reads and then corrected with second-generation reads using Pilon, which yields the final third-generation assembly. The main workflow of third-generation assembly is shown in Fig. 1.
After an assembly result is obtained, its quality can be evaluated by several methods, for example:
(1) BAC/Fosmid sequences from the same individual (or from the same species) are aligned to the genome sequence to check coverage of the euchromatic regions of the genome. In Fig. 2, the upper track is a Fosmid sequence and the lower track is our assembled sequence; they align very well, which shows that this Fosmid sequence has been assembled, and assembled well.
(2) Existing EST sequences are aligned to the genome sequence to check coverage of the gene regions.
(3) Single-base coverage depth assessment: second-generation reads are aligned to the third-generation assembly result and the coverage depth of each base of the assembly is counted. In Fig. 3, the average second-generation coverage depth is 80X, the X axis shows coverage depth intervals, and the Y axis shows the proportion of bases in each interval. In this figure, the lower the proportion of bases with coverage depth below 10X, the higher the single-base assembly quality.
(4) GC content distribution analysis. In Fig. 4, the abscissa is GC content and the ordinate is mean depth. Second-generation reads are aligned to the third-generation assembly result, the coverage depth of each base is counted, and values are computed in non-overlapping 10 kb windows. From this figure we can analyse the GC content of the species and judge whether the sample is contaminated with exogenous DNA. The figure also shows the assembly quality of particular regions. Fig. 4(B) shows a normal GC-depth distribution, whereas Fig. 4(A) shows some regions of low coverage depth. There are two possible causes of this phenomenon. First, third-generation coverage depth may be low in these regions, so that the assembly contains base errors or deletions that remained uncorrected even after third-generation polishing and second-generation Pilon correction. Second, the region may be assembled accurately, with high third-generation coverage, but have low second-generation coverage, either because sequencing errors prevent reads from aligning to the region or because the region was not sequenced or was sequenced only sparsely.
Summary of the invention
To determine effectively which cause is responsible when regions of a third-generation assembly result have low second-generation coverage depth, the invention provides a method and apparatus for assessing and verifying the sequence assembly result of third-generation sequencing.
The method provided by the invention for assessing the sequence assembly result of third-generation sequencing generally comprises the following steps:
(1) Align the second-generation sequencing reads of a sample to the third-generation sequence assembly result of the same sample.
(2) Based on the alignment result of step (1), select from the third-generation sequence assembly result the regions with low average coverage depth in the second-generation reads, then extend each selected region within the third-generation sequence assembly result to obtain a number of extended sequences.
(3) Align the third-generation sequencing reads to each extended sequence obtained in step (2) separately.
(4) Based on the alignment result of step (3), compute the average coverage depth in the third-generation reads of each region selected in step (2) (i.e. each second-generation low-coverage-depth region).
(5) Based on the statistics of step (4), determine whether the assembly quality of each region selected in step (2) is high or low, and thereby assess the sequence assembly result of third-generation sequencing.
Specifically, the method comprises the following steps:
(1) Align the second-generation sequencing reads of a sample to the third-generation sequence assembly result of the same sample (alignment can be done with software such as bwa or SOAPAligner); count the coverage depth, in the second-generation reads, of each base of the third-generation assembly result (this can be done with SOAPCoverage); then, using a window of 1-5 kb (e.g. 1 kb), calculate the average second-generation coverage depth of each window region of the third-generation assembly result.
(2) Based on the result of step (1), select from the third-generation assembly result all window regions with low average second-generation coverage depth, then extend each selected window region by 10-40 kb (e.g. 30 kb) in each direction within the third-generation assembly result to obtain a number of extended sequences.
(3) Align the third-generation sequencing reads to each extended sequence obtained in step (2) separately (bwa can be used as the alignment software).
(4) Based on the alignment result of step (3), compute the average third-generation coverage depth of each window region selected in step (2) (i.e. each original 1-5 kb second-generation low-coverage-depth region).
(5) Based on the statistics of step (4), mark the assembly quality of each window region selected in step (2) (i.e. each original 1-5 kb second-generation low-coverage-depth region) as follows, and thereby assess the overall assembly quality of the third-generation sequence assembly result: if the average third-generation coverage depth of a selected window region A is 5X or less, mark window region A as a "region of relatively low assembly quality"; if the average third-generation coverage depth of a selected window region B is more than 5X, mark window region B as a "region of relatively high assembly quality".
In step (5) of the method, the overall assembly quality of the third-generation sequence assembly result is assessed as follows: the larger the ratio of the number of regions marked "region of relatively high assembly quality" to the total number of regions marked either "region of relatively low assembly quality" or "region of relatively high assembly quality", the higher the overall assembly quality of the third-generation sequence assembly result.
Among the window regions selected in step (2), if 2 or more consecutive window regions are all marked "region of relatively low assembly quality", they are merged and recorded as one "region of relatively low assembly quality"; if 2 or more consecutive window regions are all marked "region of relatively high assembly quality", they are merged and recorded as one "region of relatively high assembly quality".
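The marking and merging rules above can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the patent's implementation: the function names and the tuple layout are hypothetical, coordinates are assumed to be 0-based and half-open, and "consecutive" is interpreted as windows that touch end to start on the same scaffold.

```python
LOW = "region of relatively low assembly quality"
HIGH = "region of relatively high assembly quality"


def label_window(tgs_mean_depth, cutoff=5.0):
    """Mark a window from its mean third-generation depth (5X cutoff)."""
    return LOW if tgs_mean_depth <= cutoff else HIGH


def merge_labeled_windows(windows, cutoff=5.0):
    """Label windows and merge adjacent windows that share a label.

    windows: iterable of (scaffold, start, end, tgs_mean_depth) tuples.
    Returns a list of (scaffold, start, end, label) tuples.
    """
    merged = []
    for scaffold, start, end, depth in sorted(windows):
        label = label_window(depth, cutoff)
        if merged:
            prev_scaffold, prev_start, prev_end, prev_label = merged[-1]
            # Assumption: "consecutive" means touching on the same scaffold.
            if (prev_scaffold, prev_end, prev_label) == (scaffold, start, label):
                merged[-1] = (scaffold, prev_start, end, label)
                continue
        merged.append((scaffold, start, end, label))
    return merged


example = [
    ("s1", 0, 1000, 2.0),
    ("s1", 1000, 2000, 4.5),
    ("s1", 2000, 3000, 12.0),
    ("s1", 5000, 6000, 9.0),
    ("s2", 0, 1000, 5.0),
]
for region in merge_labeled_windows(example):
    print(region)
```

In this hypothetical input the first two windows on `s1` merge into one low-quality region, while the two high-quality windows on `s1` stay separate because they are not adjacent.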
In step (1) of the method, the second-generation sequencing reads are the raw data from second-generation high-throughput sequencing after filtering (removal of adapters and low-quality bases). In one embodiment of the invention, the sample is a maize genome, and the second-generation reads are the raw reads from 250PE sequencing of the maize genome on the HiSeq2500 platform after adapters and low-quality bases have been filtered out.
Further, in the invention, filtering raw data to valid data takes three steps:
1) Filter adapters: if a sequencing read matches 50% or more of the adapter sequence, delete the whole read;
2) Filter low-quality data: if bases with a quality value below 20 make up a set fraction of the read or more (the fraction being chosen from 10%-50%, e.g. 20%), delete the whole read;
3) Remove N: if the N content of a sequencing read makes up a set fraction of the read or more (the fraction being chosen from 1%-10%, e.g. 2%), delete the whole read. Here N denotes a base that could not be determined during sequencing.
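The three filtering rules can be illustrated with a small Python sketch. It is a simplification under stated assumptions: the function name `keep_read` and its parameters are hypothetical, qualities are assumed to be a list of integer Phred scores, and adapter matching is reduced to finding the longest exact adapter prefix inside the read, whereas real adapter trimmers tolerate mismatches.

```python
def keep_read(seq, quals, adapter,
              adapter_frac=0.5, lowq_frac=0.2, n_frac=0.02, min_q=20):
    """Return True if a read passes the three filters, False otherwise."""
    if not seq or len(seq) != len(quals):
        return False

    # 1) Adapter filter: longest adapter prefix found anywhere in the read
    #    (assumption: exact matching only, a deliberate simplification).
    longest = 0
    for k in range(len(adapter), 0, -1):
        if adapter[:k] in seq:
            longest = k
            break
    if longest / len(adapter) >= adapter_frac:
        return False

    # 2) Low-quality filter: fraction of bases with quality below min_q.
    low_quality = sum(1 for q in quals if q < min_q)
    if low_quality / len(seq) >= lowq_frac:
        return False

    # 3) N filter: fraction of undetermined bases.
    if seq.upper().count("N") / len(seq) >= n_frac:
        return False

    return True


adapter = "AGATCGGAAGAGC"  # illustrative adapter sequence
clean = "ACGT" * 12
print(keep_read(clean, [30] * 48, adapter))                        # True
print(keep_read("ACGT" * 10 + adapter[:8], [30] * 48, adapter))    # False
```

The default fractions mirror the specific values named in the text (50%, 20%, 2%); they are exposed as parameters because the text allows a range for the last two.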
In step (3) of the method, the third-generation sequencing reads are either uncorrected reads or self-corrected reads. There are two kinds of uncorrected third-generation reads: subreads obtained from the PacBio RSII instrument, and reads obtained by converting the bam-format data from the Sequel instrument to fasta format. Self-corrected reads are the reads obtained by self-correction of the raw third-generation data. In one embodiment of the invention, the sample is a maize genome, and the third-generation reads are the self-corrected reads derived from the raw PacBio single-molecule real-time (SMRT) sequencing reads of the maize genome.
In step (2) of the method, "low average coverage depth" means that the average coverage depth is below a "low-depth threshold", which is any one of the following:
(a1) when the average coverage depth of second-generation sequencing is 30X, the "low-depth threshold" is 3X;
(a2) when the average coverage depth of second-generation sequencing is more than 30X and at most 50X, the "low-depth threshold" is 4-5X;
(a3) when the average coverage depth of second-generation sequencing is more than 50X and at most 70X, the "low-depth threshold" is 6-8X;
(a4) when the average coverage depth of second-generation sequencing is more than 70X, the "low-depth threshold" is 9-10X.
In one embodiment of the invention, the sample is a maize genome, the average coverage depth of second-generation sequencing is 60X, and the "low-depth threshold" is 6X.
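The threshold rules (a1) to (a4) amount to a simple lookup, sketched below in Python. The function name is hypothetical, and raising an error for depths below 30X is an assumption based on 30X being the minimum recommended second-generation depth; the text does not define a threshold for lower depths.

```python
def low_depth_threshold_range(ngs_mean_depth):
    """Return the recommended (min, max) low-depth threshold in X."""
    if ngs_mean_depth < 30:
        # Assumption: no threshold is defined below the 30X minimum.
        raise ValueError("second-generation depth below 30X is not covered")
    if ngs_mean_depth == 30:
        return (3, 3)
    if ngs_mean_depth <= 50:
        return (4, 5)
    if ngs_mean_depth <= 70:
        return (6, 8)
    return (9, 10)


low, high = low_depth_threshold_range(60)
print(low, high)  # 6 8
```

For the 60X maize data of the embodiment, the recommended range is 6-8X, and the embodiment uses 6X.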
In step (2) of the method, regarding "extending each selected window region (i.e. each original 1 kb second-generation low-coverage-depth region) by 30 kb in each direction within the third-generation sequence assembly result": if less than 30 kb is available before or after the selected window region within its scaffold, the extension stops where the existing sequence ends. That is, when 30 kb can be added on both sides, each extended region is 61 kb long in total. When less than 30 kb is available towards the start of the scaffold, only the available part is taken; for example, if the window is the 5-6k region, only 0-5k is taken upstream and 6-36k is taken downstream, giving 1-36 kb. When too little sequence remains towards the end of the scaffold, the extension runs from the window end to the last base. Because third-generation reads range in length from a few hundred bp to tens of kb, whereas these low-depth regions may be as short as a few tens of bp, the invention extends each low-depth region by 30 kb on each side to ensure that third-generation reads can be aligned to these regions.
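The extension rule, including truncation at scaffold ends, can be sketched as a clamped interval calculation. The function name is hypothetical, and the sketch assumes 0-based, half-open coordinates, so the 5-6k window of the example is written as (5000, 6000).

```python
def extend_window(win_start, win_end, scaffold_len, flank=30_000):
    """Extend a window by `flank` bases on each side, clamped to the scaffold.

    Coordinates are 0-based and half-open (an assumption of this sketch).
    Returns (ext_start, ext_end).
    """
    if not 0 <= win_start < win_end <= scaffold_len:
        raise ValueError("window must lie inside the scaffold")
    ext_start = max(0, win_start - flank)         # stop at the scaffold start
    ext_end = min(scaffold_len, win_end + flank)  # stop at the last base
    return ext_start, ext_end


print(extend_window(5_000, 6_000, 100_000))    # (0, 36000)
print(extend_window(50_000, 51_000, 100_000))  # (20000, 81000)
print(extend_window(90_000, 91_000, 100_000))  # (60000, 100000)
```

The second call shows the full case, where the extended region is 61 kb long; the first and third show truncation at the start and end of a hypothetical 100 kb scaffold.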
In step (1) of the method, the average data volume of second-generation sequencing is at least 30X of the genome size, preferably at least 50X. In one embodiment of the invention, the sample is a maize genome, and the average data volume of second-generation sequencing is 60X of the maize genome size.
In step (1) of the method, the number of mismatched bases allowed during alignment is preferably 2 or fewer.
In the method, the third-generation sequencing is PacBio single-molecule real-time (SMRT) sequencing, and the average data volume of third-generation sequencing is preferably at least 50X of the genome size. In one embodiment of the invention, the sample is a maize genome, and the average data volume of third-generation sequencing is 80X of the maize genome size.
The system (apparatus) provided by the invention for assessing the sequence assembly result of third-generation sequencing comprises data processing device A, data processing device B, data processing device C, data processing device D and data processing device E.
Data processing device A contains module a1, module a2 and module a3. Module a1 aligns the second-generation sequencing reads to the third-generation sequence assembly result. Module a2 uses the alignment result of module a1 to count the coverage depth, in the second-generation reads, of each base of the third-generation sequence assembly result. Module a3 uses the statistics of module a2 and a window of 1-5 kb (e.g. 1 kb) to calculate the average second-generation coverage depth of each window region of the third-generation sequence assembly result.
In the invention, module a1 is specifically the bwa software or the SOAPAligner software, and module a2 is specifically the SOAPCoverage software.
Data processing device B contains module b1 and module b2. Module b1 uses the result obtained by data processing device A to select, from the third-generation sequence assembly result, all window regions with low average second-generation coverage depth. Module b2 extends each window region selected by module b1 by 10-40 kb (e.g. 30 kb) in each direction within the third-generation sequence assembly result, to obtain a number of extended sequences.
Data processing device C contains module c1. Module c1 aligns the third-generation sequencing reads to each extended sequence obtained by data processing device B separately.
In the invention, module c1 is specifically the bwa software.
Data processing device D contains module d1. Module d1 uses the alignment result obtained by data processing device C to compute the average third-generation coverage depth of each window region selected by module b1 of data processing device B (i.e. each original 1 kb second-generation low-coverage-depth region).
Data processing device E contains module e1 and module e2. Module e1 uses the statistics obtained by data processing device D to mark the assembly quality of each window region selected by module b1 of data processing device B (i.e. each original 1 kb second-generation low-coverage-depth region) as follows: if the average third-generation coverage depth of a window region A is 5X or less, window region A is marked as a "region of relatively low assembly quality"; if the average third-generation coverage depth of a window region B is more than 5X, window region B is marked as a "region of relatively high assembly quality". Module e2 uses the marks of module e1 to calculate the ratio of the number of regions marked "region of relatively high assembly quality" to the total number of regions marked either "region of relatively low assembly quality" or "region of relatively high assembly quality". Among the window regions selected by module b1, if 2 or more consecutive window regions are all marked "region of relatively low assembly quality", they are merged and recorded as one "region of relatively low assembly quality"; if 2 or more consecutive window regions are all marked "region of relatively high assembly quality", they are merged and recorded as one "region of relatively high assembly quality".
Use of the method or the system for any of the following also falls within the scope of protection of the invention:
(A) assessing the sequence assembly result of third-generation sequencing;
(B) screening and marking regions of poor quality in the sequence assembly result of third-generation sequencing.
Experiments show that the method provided by the invention for assessing and verifying the sequence assembly result of third-generation sequencing can successfully screen out and mark the regions of a third-generation assembly result whose assembly quality is not high. In subsequent studies of the species, these marks act as a warning whenever such low-quality regions need to be used, and they provide a quick screening tool for later improvement. The method can also demonstrate the accuracy and quality of the third-generation assembly result and can help improve its accuracy.
Brief description of the drawings
Fig. 1 is a flow chart of third-generation sequence assembly and of error correction of the third-generation assembly result using third-generation reads and second-generation reads respectively.
Fig. 2 shows validation by alignment of a Fosmid sequence to the assembly result.
Fig. 3 shows the distribution of second-generation coverage depth over the assembly result.
Fig. 4 shows the distribution of GC content against second-generation coverage depth for the assembly result.
Fig. 5 is a flow chart of how the invention uses third-generation coverage depth to assess and verify second-generation low-coverage-depth regions in a third-generation assembly result.
Detailed description of embodiments
Unless otherwise specified, the experimental methods used in the following embodiments are conventional methods.
Unless otherwise specified, the materials, reagents and the like used in the following embodiments are commercially available.
Fig. 5 shows a flow chart of one embodiment in which the invention uses third-generation coverage depth to assess and verify second-generation low-coverage-depth regions in a third-generation assembly result.
As shown in Fig. 5, step 202 aligns the second-generation reads to the third-generation assembly result and counts the coverage depth of each base of the assembly. It is generally recommended that the average second-generation data volume be at least 50X of the genome, with a minimum of 30X. The alignment at this stage can be done with software such as bwa or SOAPAligner. The number of mismatched bases allowed is generally 2 or fewer. The coverage depth of each site of the third-generation assembly result is then counted, which can be done with SOAPCoverage.
Step 204, selection and extension of low-coverage-depth regions: using a 1 kb window, select the regions with low average coverage depth (for the low-depth threshold, see the recommendations in Table 1) and extend each by 30 kb in each direction to obtain the extended sequences. Because third-generation reads range in length from a few hundred bp to tens of kb, whereas these low-coverage regions may be as short as a few tens of bp, each low-depth region is extended by 30 kb on each side to ensure that third-generation reads can be aligned to these regions. When less than 30 kb is available towards the start of the scaffold, only the available part is taken; for example, if the window is the 5-6k region, only 0-5k is taken upstream and 6-36k is taken downstream, giving 1-36 kb. When too little sequence remains towards the end of the scaffold, the extension runs from the window end to the last base.
Table 1. Recommended low-depth thresholds
Second-generation sequencing depth        Low-depth threshold
30X                                       3X
More than 30X and at most 50X             4-5X
More than 50X and at most 70X             6-8X
More than 70X                             9-10X
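Window selection with a threshold from Table 1 can be sketched as follows. This is a minimal Python illustration with hypothetical function names; it assumes per-base second-generation depths for one scaffold are already available as a list, and it keeps a shorter final window rather than dropping it. The `inclusive` flag exists because the general description selects windows below the threshold while the maize example selects windows at or below 6X.

```python
def window_means(depths, window=1000):
    """Mean depth of consecutive non-overlapping windows.

    Returns (start, end, mean) tuples with 0-based, half-open coordinates.
    """
    means = []
    for start in range(0, len(depths), window):
        chunk = depths[start:start + window]
        means.append((start, start + len(chunk), sum(chunk) / len(chunk)))
    return means


def select_low_windows(depths, threshold, window=1000, inclusive=False):
    """Return (start, end) of windows whose mean depth is low."""
    selected = []
    for start, end, mean in window_means(depths, window):
        is_low = mean <= threshold if inclusive else mean < threshold
        if is_low:
            selected.append((start, end))
    return selected


# Toy scaffold of 12 bases with a window of 4 bases, for readability.
depths = [60] * 4 + [2] * 4 + [6] * 4
print(select_low_windows(depths, threshold=6, window=4))                  # [(4, 8)]
print(select_low_windows(depths, threshold=6, window=4, inclusive=True))  # [(4, 8), (8, 12)]
```

In practice the window would be 1 kb and the depth list would come from a per-base coverage tool; the toy sizes here only keep the example easy to check by hand.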
Step 206, alignment of third-generation reads to the extended sequences: the third-generation reads are aligned to the extended sequences. The third-generation reads here can be uncorrected reads or self-corrected reads. bwa can be used as the alignment software. The recommended data volume of third-generation reads is at least 50X of the genome size.
Step 208, base coverage depth statistics: obtain the average third-generation base coverage depth of each original 1 kb second-generation low-coverage-depth region.
Step 210, marking the assembly result: among the regions with low or no second-generation coverage, those whose average third-generation coverage depth is 5X or less are marked as regions of low assembly quality, and those whose average third-generation coverage depth is more than 5X are marked as regions of high assembly quality. If 2 or more consecutive 1 kb low-depth regions all have a third-generation depth that does not exceed 5X, they are merged into one region. Likewise, if 2 or more consecutive regions are all marked "region of relatively high assembly quality", they are merged and recorded as one "region of relatively high assembly quality". The higher the proportion of regions marked as high assembly quality among all selected window regions, the higher the overall assembly quality of the third-generation sequence assembly result.
Embodiment 1. Application of the method of the invention to the maize genome
(1) Third-generation sequence assembly
The third-generation assembler FALCON was used to assemble 80X of third-generation maize genome data (PacBio single-molecule real-time (SMRT) sequencing results). The assembly result was polished with the raw third-generation data, and the polished assembly was then further corrected with 60X of second-generation data to obtain the final assembly result of the maize genome.
(2) Alignment of second-generation reads to the third-generation assembly result
SOAPAligner was used to align the 60X PE250 second-generation reads to the third-generation assembly result, allowing at most 2 bp of mismatch per read, and SOAPCoverage was used to count the second-generation coverage depth of each site of the assembly result. The second-generation reads used were the filtered reads, i.e. with adapters and low-quality bases removed. Raw data was filtered to valid data in three steps: 1) filter adapters: if a sequencing read matches 50% or more of the adapter sequence, delete the whole read; 2) filter low-quality data: if bases with a quality value below 20 make up 20% or more of a sequencing read, delete the whole read; 3) remove N: if the N content of a sequencing read makes up 2% or more of the read, delete the whole read.
(3) Selection and extension of low-coverage-depth regions
Using a 1 kb window, the 1 kb window regions of the third-generation assembly result whose average second-generation coverage depth was 6X or less were selected, and each was extended by 30 kb in each direction. As shown in Table 2, there were 88 such regions in total. When less than 30 kb was available towards the start of the scaffold, only the available part was taken; for example, if the window is the 5-6k region, only 0-5k is taken upstream and 6-36k is taken downstream, giving 1-36 kb. When too little sequence remained towards the end of the scaffold, the extension ran from the window end to the last base.
(4) Alignment of third-generation reads to the extended sequences
The 80X self-corrected third-generation reads were aligned to the extended sequences with bwa.
(5) Base coverage depth statistics
The average third-generation base coverage depth of each original 1 kb second-generation low-coverage-depth region was obtained.
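One detail in step (5) is that the third-generation depth is measured along the extended sequence, while the value of interest covers only the original 1 kb window inside it. A minimal Python sketch of that coordinate shift follows; the function and argument names are hypothetical, coordinates are assumed to be 0-based and half-open, and the per-base depth list is assumed to come from whatever coverage tool is used.

```python
def window_mean_in_extension(ext_depths, ext_start, win_start, win_end):
    """Mean third-generation depth over the original window.

    ext_depths: per-base depth along the extended sequence, where index 0
                corresponds to scaffold position ext_start.
    win_start, win_end: original window in scaffold coordinates.
    """
    lo = win_start - ext_start  # window start within the extended sequence
    hi = win_end - ext_start    # window end within the extended sequence
    if not 0 <= lo < hi <= len(ext_depths):
        raise ValueError("window is not contained in the extended sequence")
    segment = ext_depths[lo:hi]
    return sum(segment) / len(segment)


# Toy example: 5-base flanks at depth 20 around a 2-base window at depth 4.
ext_depths = [20] * 5 + [4] * 2 + [20] * 5
print(window_mean_in_extension(ext_depths, ext_start=100,
                               win_start=105, win_end=107))  # 4.0
```

Averaging over the whole extended sequence instead would give a misleadingly high depth here, because the flanks dominate the window.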
(6) result queue is assembled
As shown in table 2, mark of the average overburden depth of three generations less than or equal to 5X is relatively low region, altogether 35 It is individual;The average overburden depth of three generations is more than 5X labeled as the high region of assembling quality, 53 altogether.It can be marked well by the present invention Remember that three generations assembles in result the low region of two generation overburden depths which is high assembling quality region, which is poor region.
Table 2. Third-generation coverage depth in the 1 kb regions of low second-generation depth
Regions with second-generation depth ≤ 6X | Regions with third-generation coverage > 5X | Regions with third-generation coverage ≤ 5X
88 | 53 | 35
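Steps (5) and (6) can be sketched as follows: average the third-generation depth over the original 1 kb window inside its extended sequence, label the window against the 5X cutoff, and compute the share of high-quality regions. The function names, the 0-based half-open coordinates, and the list-based depth representation are assumptions made for illustration.

```python
from typing import List


def window_mean_in_extension(ext_depth: List[int], ext_start: int,
                             win_start: int, win_end: int) -> float:
    """Mean third-generation depth over the original window.

    `ext_depth` holds the per-base depth of the extended sequence, whose
    first base sits at scaffold position `ext_start`.
    """
    values = ext_depth[win_start - ext_start:win_end - ext_start]
    return sum(values) / len(values)


def label_region(mean_depth: float, cutoff: float = 5.0) -> str:
    """Mark a region 'low' if its mean depth is <= cutoff, otherwise 'high'."""
    return "low" if mean_depth <= cutoff else "high"


def high_quality_ratio(labels: List[str]) -> float:
    """Fraction of marked regions that were labelled 'high'."""
    return labels.count("high") / len(labels)


# Counts taken from Table 2: 53 high-quality and 35 low-quality regions.
labels = ["high"] * 53 + ["low"] * 35
print(round(high_quality_ratio(labels), 3))
```

For the counts in Table 2 the ratio is 53/88, so roughly 60% of the regions with low second-generation depth are still well supported by third-generation reads.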
In summary, the invention can screen out and mark the regions of a third-generation assembly whose quality is not high. In subsequent research on the species, the marks serve as a warning whenever these low-quality regions need to be used, and they provide a quick screening means for later improvement. The method can also verify the accuracy and quality of a third-generation assembly, and so helps to improve the accuracy of the assembly result.

Claims (10)

1. A method for evaluating the sequence assembly result of third-generation sequencing, comprising the following steps:
(1) aligning the second-generation sequencing sequences of the same sample to the sequence assembly result of the third-generation sequencing;
(2) according to the alignment result of step (1), selecting from the sequence assembly result of the third-generation sequencing the regions with low average coverage depth in the second-generation sequencing sequences, and then extending each selected region within the sequence assembly result of the third-generation sequencing, thereby obtaining several extended sequences;
(3) aligning the third-generation sequencing sequences separately to each extended sequence obtained in step (2);
(4) according to the alignment result of step (3), calculating the average coverage depth in the third-generation sequencing sequences of each region selected in step (2);
(5) according to the statistics of step (4), determining whether the assembly quality of each region selected in step (2) is high or low, thereby evaluating the sequence assembly result of the third-generation sequencing.
2. The method according to claim 1, characterized in that the method comprises the following steps:
(1) aligning the second-generation sequencing sequences of the same sample to the sequence assembly result of the third-generation sequencing, counting the coverage depth in the second-generation sequencing sequences of each base in the sequence assembly result of the third-generation sequencing, and then, using 1-5 kb as the window, calculating the average coverage depth in the second-generation sequencing sequences of each window region in the sequence assembly result of the third-generation sequencing;
(2) according to the result of step (1), selecting from the sequence assembly result of the third-generation sequencing all window regions with low average coverage depth in the second-generation sequencing sequences, and then extending each selected window region by 10-40 kb in each direction within the sequence assembly result of the third-generation sequencing, thereby obtaining several extended sequences;
(3) aligning the third-generation sequencing sequences separately to each extended sequence obtained in step (2);
(4) according to the alignment result of step (3), calculating the average coverage depth in the third-generation sequencing sequences of each window region selected in step (2);
(5) according to the statistics of step (4), marking the assembly quality of each window region selected in step (2) as high or low as follows, and thereby evaluating the overall assembly quality of the sequence assembly result of the third-generation sequencing: if the average coverage depth in the third-generation sequencing sequences of a window region A selected in step (2) is less than or equal to 5X, the window region A is marked as a "region of relatively low assembly quality"; if the average coverage depth in the third-generation sequencing sequences of a window region B selected in step (2) is greater than 5X, the window region B is marked as a "region of relatively high assembly quality".
3. The method according to claim 2, characterized in that in step (5) the overall assembly quality of the sequence assembly result of the third-generation sequencing is evaluated as follows: the larger the ratio of the number of marked "regions of relatively high assembly quality" to the total number of "regions of relatively low assembly quality" and "regions of relatively high assembly quality", the higher the overall assembly quality of the sequence assembly result of the third-generation sequencing;
if 2 or more consecutive window regions selected in step (2) are all marked as "regions of relatively low assembly quality", they are merged and counted as one "region of relatively low assembly quality"; if 2 or more consecutive window regions are all marked as "regions of relatively high assembly quality", they are merged and counted as one "region of relatively high assembly quality".
4. The method according to any one of claims 1-3, characterized in that in step (1) the second-generation sequencing sequences are the sequences obtained by filtering the raw data of second-generation high-throughput sequencing; and/or
in step (3) the third-generation sequencing sequences are uncorrected sequences or self-corrected sequences.
5. The method according to any one of claims 1-4, characterized in that in step (2) a low average coverage depth means an average coverage depth below a "low-depth threshold", which is any of the following:
(a1) when the average coverage depth of the second-generation sequencing is 30X, the "low-depth threshold" is 3X;
(a2) when the average coverage depth of the second-generation sequencing is greater than 30X and less than or equal to 50X, the "low-depth threshold" is 4-5X;
(a3) when the average coverage depth of the second-generation sequencing is greater than 50X and less than or equal to 70X, the "low-depth threshold" is 6-8X;
(a4) when the average coverage depth of the second-generation sequencing is greater than 70X, the "low-depth threshold" is 9-10X.
6. The method according to any one of claims 1-5, characterized in that in step (1) the average data volume of the second-generation sequencing is at least 30X of the genome size;
specifically, the average data volume of the second-generation sequencing is at least 50X of the genome size.
7. The method according to any one of claims 1-6, characterized in that in step (1) the number of mismatched bases allowed in the alignment is less than or equal to 2.
8. The method according to any one of claims 1-7, characterized in that the average data volume of the third-generation sequencing is at least 50X of the genome size.
9. A system for evaluating the sequence assembly result of third-generation sequencing, comprising data processing device A, data processing device B, data processing device C, data processing device D and data processing device E;
module a1, module a2 and module a3 are provided in the data processing device A; the module a1 can align second-generation sequencing sequences to the sequence assembly result of the third-generation sequencing; the module a2 can, according to the alignment result of the module a1, count the coverage depth in the second-generation sequencing sequences of each base in the sequence assembly result of the third-generation sequencing; the module a3 can, according to the statistics of the module a2 and using 1-5 kb as the window, calculate the average coverage depth in the second-generation sequencing sequences of each window region in the sequence assembly result of the third-generation sequencing;
module b1 and module b2 are provided in the data processing device B; the module b1 can, according to the result obtained by the data processing device A, select from the sequence assembly result of the third-generation sequencing all window regions with low average coverage depth in the second-generation sequencing sequences; the module b2 can extend each window region selected by the module b1 by 10-40 kb in each direction within the sequence assembly result of the third-generation sequencing, thereby obtaining several extended sequences;
module c1 is provided in the data processing device C; the module c1 can align the third-generation sequencing sequences separately to each extended sequence obtained by the data processing device B;
module d1 is provided in the data processing device D; the module d1 can, according to the alignment result obtained by the data processing device C, calculate the average coverage depth in the third-generation sequencing sequences of each window region selected by the module b1 in the data processing device B;
module e1 and module e2 are provided in the data processing device E; the module e1 can, according to the statistics obtained by the data processing device D, mark the assembly quality of each window region selected by the module b1 in the data processing device B as high or low as follows: if the average coverage depth of a window region A in the third-generation sequencing sequences is less than or equal to 5X, the window region A is marked as a "region of relatively low assembly quality"; if the average coverage depth of a window region B in the third-generation sequencing sequences is greater than 5X, the window region B is marked as a "region of relatively high assembly quality"; the module e2 can, according to the marking result of the module e1, count the regions and calculate the ratio of the number of "regions of relatively high assembly quality" among the window regions selected by the module b1 in the data processing device B to the total number of "regions of relatively low assembly quality" and "regions of relatively high assembly quality"; wherein, among the window regions selected by the module b1, if 2 or more consecutive window regions are all marked as "regions of relatively low assembly quality", they are merged and counted as one "region of relatively low assembly quality"; if 2 or more consecutive window regions are all marked as "regions of relatively high assembly quality", they are merged and counted as one "region of relatively high assembly quality".
10. Use of the method according to any one of claims 1-8 or the system according to claim 9 in any of the following:
(A) evaluating the sequence assembly result of third-generation sequencing;
(B) screening and marking regions of poor quality in the sequence assembly result of third-generation sequencing.
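Two of the rules in the claims lend themselves to a short sketch: the choice of the "low-depth threshold" from the second-generation depth (claim 5) and the merging of consecutive windows that carry the same mark (claim 3). The code below is only an illustration. The function names are hypothetical, the threshold is returned as the permitted (minimum, maximum) range rather than a single value, depths below 30X are rejected because claim 6 requires at least 30X, and "consecutive" is assumed to mean that one window ends exactly where the next begins.

```python
from typing import List, Tuple

Region = Tuple[int, int, str]  # (start, end, label), 0-based half-open


def low_depth_threshold_range(avg_depth: float) -> Tuple[int, int]:
    """Permitted low-depth threshold range (in X) for a second-generation depth."""
    if avg_depth < 30:
        raise ValueError("second-generation depth below 30X is not covered")
    if avg_depth == 30:
        return (3, 3)
    if avg_depth <= 50:
        return (4, 5)
    if avg_depth <= 70:
        return (6, 8)
    return (9, 10)


def merge_consecutive(regions: List[Region]) -> List[Region]:
    """Merge adjacent windows that carry the same label into one region."""
    merged: List[Region] = []
    for start, end, label in sorted(regions):
        if merged and merged[-1][2] == label and merged[-1][1] == start:
            # Same mark and directly adjacent: grow the previous region.
            merged[-1] = (merged[-1][0], end, label)
        else:
            merged.append((start, end, label))
    return merged
```

For the 60X second-generation data of the embodiment, the permitted range is 6-8X, which is consistent with the 6X value used there.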
CN201711114931.4A 2017-11-13 2017-11-13 Method and device for evaluating and verifying sequence assembly result of third-generation sequencing Active CN107895104B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711114931.4A CN107895104B (en) 2017-11-13 2017-11-13 Method and device for evaluating and verifying sequence assembly result of third-generation sequencing

Publications (2)

Publication Number Publication Date
CN107895104A true CN107895104A (en) 2018-04-10
CN107895104B CN107895104B (en) 2020-07-07

Family

ID=61805218

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711114931.4A Active CN107895104B (en) 2017-11-13 2017-11-13 Method and device for evaluating and verifying sequence assembly result of third-generation sequencing

Country Status (1)

Country Link
CN (1) CN107895104B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant