WO2019233427A1 - Genome assembly method for constructing ultralong continuous dna sequence - Google Patents

Genome assembly method for constructing ultralong continuous dna sequence Download PDF

Info

Publication number
WO2019233427A1
WO2019233427A1 PCT/CN2019/090053 CN2019090053W WO2019233427A1 WO 2019233427 A1 WO2019233427 A1 WO 2019233427A1 CN 2019090053 W CN2019090053 W CN 2019090053W WO 2019233427 A1 WO2019233427 A1 WO 2019233427A1
Authority
WO
WIPO (PCT)
Prior art keywords
sequence
sequences
anchor
fragment
pathway
Prior art date
Application number
PCT/CN2019/090053
Other languages
French (fr)
Chinese (zh)
Inventor
梁承志
杜会龙
Original Assignee
中国科学院遗传与发育生物学研究所
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中国科学院遗传与发育生物学研究所 filed Critical 中国科学院遗传与发育生物学研究所
Publication of WO2019233427A1 publication Critical patent/WO2019233427A1/en

Links

Images

Classifications

    • CCHEMISTRY; METALLURGY
    • C12BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12NMICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
    • C12N15/00Mutation or genetic engineering; DNA or RNA concerning genetic engineering, vectors, e.g. plasmids, or their isolation, preparation or purification; Use of hosts therefor
    • C12N15/09Recombinant DNA-technology
    • C12N15/10Processes for the isolation, preparation or purification of DNA or RNA
    • C12N15/102Mutagenizing nucleic acids
    • C12N15/1027Mutagenizing nucleic acids by DNA shuffling, e.g. RSR, STEP, RPR

Definitions

  • the invention relates to a genome assembly method for constructing ultra-long continuous DNA sequences, and belongs to the technical field of genome assembly.
  • the sequencer generates random reads (Read) by sequencing the genomic fragments.
  • the distribution of these Reads across the genome is random.
  • the process of genome assembly is to arrange and connect these Reads in the correct order, and assemble them into contigs of contiguous bases, and finally restore the entire chromosome and the entire genome sequence.
  • This assembly process generally includes three steps: the assembly of continuous fragments (Contig), the assembly of discontinuous non-continuous fragments (Scaffold), and the completion of gaps (GF).
  • the difficulty of genome assembly stems from the large number of repeated sequences present in the genome (ie, two / segment or multiple / segment sequences with similar or identical sequences). Repeats can be divided into two major categories in the genome: tandem repeats and interspersed repeats.
  • a tandem repeat is a sequence of very similar repeating units that are directly connected to the head and tail, and are generated by local repeats.
  • Typical tandem repeats include rDNA, centromeric repeats, and the like.
  • Interspersed repeats are non-locally repeated sequences distributed at different locations in the genome. In some repeats, there are both tandem and non-tandem repeats. These regions are long and form complex repeats.
  • Reads derived from different copies of the repeat sequence have sequence similarity. At present, the length N50 of single molecule sequencing Read is generally greater than 10-15kb, and the longest is more than 100kb. If it is a repeating sequence plus the single test sequence at both ends is covered by a single Read, then there is no assembly problem in this area. What needs to be solved now is the assembly of repeating sequences that exceed the Read average or N50 length.
  • OLC Overlap-Layout-Consensus
  • SG String Graph
  • the OLC method can also be described succinctly with SG, collectively referred to as the SG-type method.
  • Common software for existing SG methods includes PBcR (Berlin et al. 2015, Nat. Biotechnol. 33, 623–30), CANU (Koren et al. 2017, Genome et Res.
  • the key in the SG method is to use the method of transitive reduction (Transitive reduction) (TU) to remove redundant reads (all reads that are particularly similar are compressed into one). That is, after constructing an overlay graph of all Reads, the number of edges in and out of many nodes is reduced to one using TU. This leaves no branches on many paths.
  • Transitive reduction Transitive reduction
  • a Read node has a degree of overlapping edges greater than 1 in the simplified diagram, it is called a cross node, and the other nodes are internal nodes.
  • a path without crossing nodes can form a Contig, which can be further compressed together in the SG.
  • the cross node represents the connection between the single test sequence region and the repeated sequence region (Read on this node includes each part of the two types of sequences); the sequencer will make errors when detecting the Read sequence, causing its measurement Read sequences with sequencing errors, such as base insertions, deletions, mutations, or chimeras derived from sequences at different positions, may cause additional cross-node sequences.
  • the single test area is reduced to a single path formed by a series of Reads, which are connected together to form a single test sequence Contig; and a repeating sequence can also be compressed into a single path formed by a series of Reads to form Repeating sequence Contig. Because errors must be tolerated during sequence comparison, the Reads from different duplicate sequence copies will be compressed together, and the duplicate sequences of different copies will become one, so it cannot be distinguished. However, due to the existence of cross nodes, the formed repeat sequence Contig is broken at the compressed start and end positions, resulting in fragmentation of the assembled Contig, which in turn makes it impossible to truly restore the entire original genome sequence.
  • the purpose of the present invention is to provide a genome assembly method for constructing an ultra-long continuous DNA sequence, which can effectively solve the problems existing in the prior art, especially in the prior art, compressing similar multi-segment repeating sequences into a string of Read.
  • a single path because Reads from different duplicate sequence copies will be compressed together, causing the duplicate sequences of different copies to become one, it cannot be distinguished; and because of the existence of cross nodes, the formed repeat sequence Contig is at the starting point of compression Disconnect from the end position, causing fragmentation of the assembled Contig.
  • the present invention adopts the following technical solution: a method for assembling a genome for constructing an ultra-long continuous DNA sequence, including the following steps:
  • the known DNA sequences include anchor sequence fragments (that is, sequence fragments used for anchoring, It can include multiple types, such as a specific segment or segments of a specific sequence segment intercepted from a DNA sequence, and / or a specific segment or segments of a specific sequence segment that have been assembled, and / or selected from a random sequencing Read sequence One or several specific Read sequences, etc.) and random sequencing Read sequences; said anchor sequence fragments include at least two; said pairwise comparison of all known DNA sequences, including all Anchor sequence fragments are compared pairwise with all random sequencing Read sequences, and all random sequencing Read sequences are compared pairwise;
  • any anchor sequence fragment starting from a free end (such as Es) of any anchor sequence fragment, extending the anchor sequence fragment with a random sequencing Read sequence that overlaps with it to form one or more extended sequences; then these The extended sequence uses the same method to continue the extension by using random sequencing Read sequences.
  • Each sequence is extended multiple times until it encounters a random sequencing Read sequence that can be compared to the end of another different anchor sequence fragment.
  • the extension of one end of the starting anchor sequence segment ends, and one or more pathway sequences connecting one end of the starting anchor sequence segment to another end of the different end anchor sequence segment are obtained.
  • the pathway sequence forms sequence set A (that is, the pathway sequence in sequence set A connects the end of the starting anchor sequence segment Es to one or more different end anchoring sequence segment ends Ee1, ..., Eek);
  • a maximum of one sequence is selected as a valid linking sequence (from one start anchor There may be no effective linking sequence at the end of a given sequence fragment);
  • any two anchor sequence fragments are not completely the same, so that the occurrence of conflicting ends can be avoided as much as possible.
  • a sequence has two ends, and each end can be defined as a sequence of a specific length (for example, 1-50 kb). Then, a sequence of a specific length (for example, 1-50 kb) corresponding to the end is the terminal sequence.
  • similar end sequences can be removed by sequence alignment (for example, the identity is> 98%), and new available ends can be generated after the sequence is shortened.
  • the method before selecting a candidate extended sequence, further includes: setting a global sequence similarity minimum threshold SImin; for any sequence X, first determining the sequence similarity of the sequence overlapping with it in the overlapping area Whether the value is greater than or equal to the minimum threshold SImin, and if so, use these overlapping sequences to extend sequence X, otherwise give up selecting these overlapping sequences to extend sequence X, thereby eliminating noise interference, improving the efficiency and speed of data processing, and improving The accuracy of the results.
  • SImin global sequence similarity minimum threshold SImin
  • the sequence identity value is used as the sequencing read sequence accuracy value ⁇ at the genome-wide level; the estimated sequencing accuracy value is used to set the minimum overlap screening threshold, which can improve the accuracy of this value setting and reduce background noise. Improved the accuracy of the results and the speed of calculations.
  • step S2 when the sequence ends (a sequence has two ends, each end can be defined as a sequence of a specific length (such as 1-50 kb)) is extended
  • a sequence of a specific length such as 1-50 kb
  • the probability of is determined by its overlap score or extension score (for example, the probability can be: the score of this sequence / the sum of the scores of all sequences that can be used for extension); each extension method is a greedy algorithm, if there is only one In the case of channels, it is not guaranteed to be correct, so combining multiple extension methods to obtain multiple channels can increase the probability of finding the correct result.
  • the extension score for X2 can also be calculated in the same way); In general, the higher the overlap score, the more likely this overlap region originates from the same position on the genome;
  • Including the length of the extended sequence in the extension score can help Prefer long sequences in order to find longer overlap regions and higher overlap scores in subsequent extensions. Note that because DNA sequences are double-stranded complementary, only single strands can be used in sequence comparison, so sequences overlap Regions can be created on both chains. Through chain adjustment, the two can be unified without redundancy or contradiction.
  • step S3 includes:
  • each sub-set includes one or more pathway sequences;
  • the pathway sequence found can be randomly connected to one or more different end-anchor sequences
  • the Read sequence generated in the same region of the genome has the highest similarity and overlap score, and when extended, the high-scoring sequence is preferentially selected, resulting in the most connected pathways reaching the end of the anchor sequence at the correct endpoint; not based on the end Grouping, high error rate, if the highest number of effective paths is not selected, the error rate is high; if there is a conflict, it means The two end up too similar
  • the assembled sequences are all partially complete fragments, which increases the length of the assembled sequence.
  • most of the assembled fragments are local fragments.
  • the long sequences contain more complete and easier genes. Arranged on the chromosome, it is easier to find the colinearity and structural variation between fragments.
  • the method of the present invention can also be used to judge the adjacent relationship between two anchored sequences. Or the distance between two adjacent anchoring sequences.
  • step S32 includes:
  • each channel sequence subset Ai into one or more groups Ai1, ... Aig, where 1 ⁇ i ⁇ k;
  • the groups in the sequence sub-set Ai are arranged according to the sequence length from left to right and from short to long, starting from the group with the highest length frequency peak and comparing with the first group with the longer channel sequence on the right side. If the total number of effective pathway sequences in the left group is higher than the total number of effective pathway sequences in the right group by a certain proportion (for example, more than twice), then set the representative sequence of the left group as the representative sequence of the sequence sub-set Ai to find The process of the representative sequence of the sequence sub-set Ai stops; otherwise, temporarily set the representative sequence of the right group as the representative sequence of the sequence sub-set Ai, then set this group as the left group, compare it with its first right group, and repeat the above Process until the total number of effective pathway sequences in the right group is lower than the total number of effective pathway sequences in the left group by a certain percentage (for example, less than 50%); thus, the representative sequence of the sequence sub-set Ai and the connected sequence are determined.
  • the correct path sequence can be found in the case of multiple repeating units (without grouping, the complex sequence of multiple repeating units cannot be solved), thereby achieving the alignment of Assembly of complex repeating regions containing multiple repeating units. If there are no multiple repeating units in a repeating sequence, no grouping is required, and only one group will be automatically generated when grouping. Because there are multiple repeating units, it is easy to form a path with the wrong number of repeating units when extended. In this way, the group with the highest length frequency or a group to the right of it represents the correct path set. The number of effective paths in this group cannot be too large. Low (too low is most likely background noise). The method of the present invention selects this group with a relatively low number of effective paths, thereby increasing the probability of finding the correct path.
  • step S321 grouping is performed in the following manner:
  • S3213 Compare the total number of sequences contained in each window with the total number of sequences corresponding to the two windows directly adjacent to it; if the value is larger than both sides, the window is a peak window, and if the value is smaller than both sides, The window is a valley window; if there is no window on one side, the total number of sequences on this side is set to 0;
  • the key to grouping is to separate multiple peaks according to the bottom value. If the distribution of the path length is narrow, for example, the distance between the shortest path and the longest path is less than 10kb, it is not necessary to group and treat them as a group. ; So when grouping according to the window method, it is not necessary to set each window too small, and of course it is not necessary to set it to 1kb.
  • step S33 at most one sequence is selected as a valid linking sequence connecting the end of the initial anchor sequence fragment (such as Es) to the end of the other end anchor sequence fragment:
  • the sequence is the final representative sequence of sequence set A, that is, a valid linking sequence connecting the end of the initial anchor sequence fragment (such as Es) to the end of the other end anchor sequence fragment is obtained.
  • the representative sequence corresponding to the end determined by the solution method is selected as a valid linking sequence connecting the end of the starting anchor sequence segment to another end anchor sequence segment; if this conflict cannot be resolved , If no valid linking sequence is found at the end of this anchoring sequence segment and any other anchoring sequence segment, go to step S2.
  • the method for resolving conflicts includes: resolving conflicts at the ends of anchor sequence fragments located on different chromosomes according to the information of chromosome grouping; Sequence information to resolve conflicts at the ends of anchoring sequence fragments; or in the construction of ultra-long continuous DNA sequences, one of the two end anchoring sequence fragments that will cause conflicts at the ends of an initial anchoring sequence fragment has been used In the connection of other anchor sequence fragments, the conflict at the ends of the corresponding initial anchor sequence fragments is also resolved accordingly.
  • the data based on these conflict resolution methods are relatively easy to obtain, so that conflicts can be resolved quickly.
  • the chromosomal grouping data has Hi-C or genetic maps
  • the neighboring information is the BioNano genome optical map, or 10xGenomics data. Minority conflicts can be resolved based on their own information.
  • the present invention has the following advantages:
  • all known DNA sequences are compared pair by pair to find similar overlapping regions between each pair of sequences, and then start from a free end (such as Es) of any one anchor sequence fragment, followed by It has overlapping random sequencing Read sequences to extend the anchor sequence fragment to form one or more extended sequences; and then use the same method for these extended sequences to continue the extension using random sequencing Read sequences to extend each sequence. Repeat several times until you encounter a random sequencing Read sequence that can be compared to the end of another different anchor sequence fragment, then the extension ends from one end of the initial anchor sequence fragment, and one end connected to the initial anchor sequence fragment is obtained.
  • One or more pathway sequences at the end of another or more different end-anchor sequence fragments said one or more pathway sequences forming sequence set A; one is selected according to the pathway sequences in said sequence set A
  • the sequence is used as an effective linking sequence connecting the end of the initial anchoring sequence fragment to the end of the other end anchoring sequence fragment; using said effective linking sequence Connect the initial anchor sequence fragment and the corresponding end anchor sequence fragment; use the ligated sequence fragment as a new anchor sequence fragment or record the free ends of the remaining anchor sequence fragments, and repeat the above steps continuously to form a super Long continuous DNA sequence; by using the method of the present invention to construct an ultra-long continuous DNA sequence, it is more beneficial to restore the entire chromosome and the entire genome sequence;
  • the present invention forms a random Read sequence into a pathway sequence, and then processes the pathway sequence, thereby greatly improving the accuracy of genome assembly.
  • the existing technology only performs processing at the Read level and compresses similar Read sequences, so that the Read sequences from different repeating regions are compressed into one, so that the originally different repeating regions cannot be separated; and due to various For reasons such as sequencing errors, correction errors, etc., errors at the Read level can easily cause compression errors; while at the channel level, although there are errors, the length of the sequence of the channel is longer than the length of the Read sequence, making it easier to distinguish .
  • the present invention can be further compared with the existing SG method. Improve the accuracy of genome assembly;
  • sequence identity values or overlap scores or extension scores by setting sequence identity values or overlap scores or extension scores, according to the sequence identity values or overlap scores or extension scores, copies of different repeat sequences (each repeat from the same repeat sequence family in the genome) (Sequence similarity is greater than 85%, especially similarity> 98%) Reads are separated as much as possible, and the Reads from each repeated sequence source are assembled into an independent Contig, and connected to the anchor sequence fragments at its two ends. , Eventually forming ultra-long continuous DNA sequences, while improving the accuracy of genome assembly;
  • the genome assembly method of the present invention can also be used for sequence filling of blank regions in a genomic sequence (the sequences at both ends of the blank region are used as anchor sequence fragments, and the final effective linking sequence is obtained by the method of the present invention);
  • the genomic assembly method of the present invention can realize the assembly of genomic regions of repeated sequence regions, as well as the assembly of single-copy sequence regions;
  • the inventors tested with a high-quality rice genome (assembled genome size 390.3Mb, estimated true size does not exceed 394Mb) (Du et al, 2017).
  • the results of assembly using the existing OLC-based SG-type assembly methods are: the total genome size is 402.5Mb; the Contig N50 size is 1.3Mb.
  • Contig N50 After assembly using the method of the present invention, under the condition that the total genome size is slightly reduced (399.2Mb) (the genome becomes smaller because of the redundancy in the result of OLC-based SG-type assembly), the size of Contig N50 increases It reached 13.2Mb; in the case of using BioNano (sequence adjacent information) genome optical map to resolve the conflict, Contig N50 was further increased to 14.4Mb, and the entire chromosome 8 sequence was assembled into a Contig. After filtering out non-rice sequences, the entire genome is 391.6Mb and contains 40 Contigs, which is slightly larger than the original assembled reference genome, mainly due to the increase in centromeric repeat sequences.
  • the chromosome 8 sequence assembled by the method of the present invention contains a sequence of about 387 kb missing from the previous reference genome (that is, assembly using the existing method, there are many fragments, the genome is incomplete, and many repeated sequences are missed, which is difficult.
  • Arranged onto chromosomes complex regions cannot be assembled, as shown in Figures 10a, 10b and 10c).
  • the present invention can easily assemble 7 of them (as shown in Figure 8e, Figure 10b, Figure 10c, Figure 10e, Figure 10f, Figure 10g). , And shown in Figure 10h).
  • the region assembled by the method of the present invention in the rice genome was subjected to quality inspection using other second-generation sequencing short sequences. It was found that 97.21% of the short sequences can be aligned to the genome, and 99.56% of the assembled sequences of the present invention can be covered by the second-generation short sequences. It shows that the assembled sequences of the present invention are all correct.
  • HX1_FALCON FALCON (SG software)
  • Contig N50 increased from 8.3 Mb to 54.4 Mb.
  • the longest Contig increased from 38Mb to 109.8Mb.
  • FIG. 12f in the human reference genome GRCH38, a plurality of blank regions are not assembled in HX1_FALCON, but have been filled in the assembly result of the present invention.
  • FIG. 12c and FIG. 12e As shown, the human reference genome itself is gapped, but after the assembly by the method of the present invention, the gap is filled; in addition, it can be seen from FIG. 12d that the fragment assembled by Falcon is a chimera; and the present invention The assembly method builds the correct sequence.
  • the genome assembly method of the present invention can construct a longer continuous DNA sequence; (2) the present invention can be used to sequence fill a blank area in the genome sequence; (3) the method of the present invention Complex repeat sequence regions can be assembled; (4) The accuracy of genome assembly by the genome assembly method of the present invention is high.
  • FIG. 1 is a flowchart of a method according to an embodiment of the present invention
  • FIG. 2 is a flowchart of a method for obtaining a representative sequence of a subset of pathway sequences and the number of effective pathway sequences of the subset;
  • FIG. 3 is a schematic diagram of a grouping method of a path sequence sub-set
  • FIG. 4 is a schematic diagram of a method for selecting an effective ligation sequence connecting the end of an initial anchoring sequence fragment (such as Es) to the end of another end anchoring sequence fragment;
  • Figure 5 is a schematic diagram of duplicate sequence copies derived from different regions; two similar repeat sequence fragments R1, R2 on the genome, the upper half of the sequence aligned with it are locally generated Read sequences, the sequence consistency is very high , And the lower half of the sequence aligned with it is another sequencing Read sequence source from another copy of the repeat sequence, with unpaired dangling ends and base mutations;
  • FIG. 6 is a schematic diagram of overlapping sequences and the corresponding overlapping diagram of FIG. 5; a, two overlapping sequences, OL, an overlapping sequence portion; OH, an unpaired hanging sequence portion at the end; EL, an extended sequence portion; b, A pathway composed of overlapping sequences; c, the connection diagram of the overlapping sequences in Fig. 5 (overlapping diagram); C1-C4 are anchor sequence fragments; R1 and R2 are repeating sequences; each sequence has two ends; U, Single test sequence, UR, single test sequence and boundary sequence of repeating region;
  • Figure 7 is a schematic diagram and an example of selecting a valid sequence from multiple pathway sequences and resolving conflicts; c, a schematic diagram of a connecting pathway sequence from the end of a start anchor sequence fragment to the end of a plurality of end anchor sequence fragments; d, showing The number of effective pathway sequences at the ends of different anchoring sequence fragments; e, showing conflicting ends; f, showing the relationship between the sequences in e; g, using the adjacent relationship between the sequences (BioNano optical atlas) to solve conflict in e;
  • Figure 8 shows an example of a complex repeat region containing multiple repeat units; ab, the optical spectrum of the BioNano genome shows that this region contains two repeat units. In each pair of comparison bars, the lower bar is the optical spectrum, and the upper bar The bar is the reference genome. A comparison of the two shows that a sequence is missing from the reference genome; cd, the frequency distribution of the length of the pathway sequence, which corresponds to the two sequences in a / b; e, which shows in b / d
  • the sequence structure of cns1 and cns2 are two representative sequences of different lengths, where the line segments at both ends represent the anchoring sequence, and the box and triangle in the middle represent the assembled sequence;
  • FIG. 9 is a schematic diagram showing the results of processing using two different methods of the background technology and the present invention for two repeated sequences in the same repeated sequence family;
  • Figure 10 is an example of rice genome assembly results
  • ac is a schematic diagram showing the comparison results of rice genome assembly using the existing method and the method of the present invention.
  • the lower bar is the BioNano optical spectrum.
  • the horizontal bar is the reference genome;
  • the upper pair represents the schematic diagram of the result assembled using the existing method, and the lower pair represents the schematic diagram of the result assembled using the method of the present invention;
  • e / f / g / h respectively, represents multiple complex repeats Frequency distribution map of the channel sequence length in the sequence region;
  • Figure 11 is an example of maize genome assembly; bd.
  • the lower bar is the BioNano optical spectrum
  • the upper bar is the reference genome
  • the upper part shows the pair using the existing software PBcR (based on the SG method).
  • the results of the assembly of the corn genome are shown in the lower part.
  • the assembly results of the same region using the method of the present invention are shown.
  • the assembly result of the invention is corrected; d, the position error in the PBcR sequence is corrected in the assembly result of the invention;
  • Figure 12 is an example of human genome assembly results
  • c / e the upper horizontal bar represents the human reference genome (with gaps), and the lower horizontal bar represents the results of assembly by the method of the present invention (no gaps);
  • d in Falcon It is a chimera, and the result of assembly of the present invention does not include the chimera;
  • f Falcon results have multiple blank areas not assembled, but have been filled in the assembly results of the present invention (ie, HERA Contig629).
  • An embodiment of the present invention a method for assembling a genome to construct an ultra-long continuous DNA sequence, as shown in FIG. 1, includes the following steps:
  • the known DNA sequences include anchor sequence fragments (that is, sequence fragments used for anchoring, For example, one or more specific sequence fragments intercepted from the DNA sequence, and / or one or more specific sequence fragments that have been assembled, and / or one or more selected from the random sequencing Read sequence Read sequence, etc.) and random read sequence (in order to improve the accuracy of the Read sequence, the Read sequence can be corrected first, or the original random sequence Read sequence can be used without correction; the correction method includes using The sequencing Read sequence obtained by other sequencing platforms has a low sequencing error rate for correction, and also includes the use of other Read sequences in this set for correction.
  • anchor sequence fragments that is, sequence fragments used for anchoring, For example, one or more specific sequence fragments intercepted from the DNA sequence, and / or one or more specific sequence fragments that have been assembled, and / or one or more selected from the random sequencing Read sequence Read sequence, etc.
  • random read sequence in order to improve the accuracy of the Read sequence, the Read sequence can be corrected first, or the original random sequence Read sequence
  • the random sequencing Read sequence here can be partially assembled. Short Contig sequences (and assembled long Contig sequences are used as anchor sequence fragments, and the length can be divided into 50kb boundaries, for example) instead ); Said anchor sequence fragments include at least two; said pairwise comparison of all known DNA sequences, including pairwise comparison of all anchor sequence fragments and all random sequencing Read sequences, and All random sequencing Read sequences are compared pairwise; in specific implementation, an overlapping graph can be constructed, which is an undirected simple graph constructed by the nodes representing a known sequence and the sequence overlap between them as an edge.
  • Each known sequence is represented by two nodes, each node representing the end of a sequence fragment, and the two nodes are connected by an undirected edge (here called a coupling edge); in this overlapping graph If there is a non-coupling edge connection between two nodes, it means that there is overlap between the two ends, and one of them can be used to extend the other.
  • a coupling edge When traversing the path in the graph, there is a basic requirement: When entering a node, it must come out through the coupling edge of that node (that is, after reaching an end node of a known sequence, it cannot come out from the end nodes of other sequences connected to the same end node, but must come from the same sequence. The other end node comes out to ensure the linear extension of the sequence); In this figure, identifying whether there is a connection between the two ends of two different anchoring sequences can be achieved by deep search or breadth search;
  • each sequence is extended multiple times until it encounters a random sequencing Read that can be connected to the end node of another different anchor sequence fragment
  • the extension ends from the end of the starting anchor sequence segment to obtain one or more pathway sequences connecting the end of the starting anchor sequence segment to the end of another or more different end anchor sequence segments ( As shown in the example of FIG. 7c), the one or more pathway sequences form sequence set A (that is, the pathway sequences in sequence set A connect the end of the starting anchor sequence segment Es to one or more other endpoints.
  • Anchor sequence fragment ends Ee1, ..., Eek);
  • a maximum of one sequence is selected as a valid linking sequence (from one starting anchor) connecting the end of the initial anchor sequence fragment (such as Es) to the end of another end anchor sequence fragment. There may be no effective linking sequence at the end of a given sequence fragment);
  • one or more specific Read sequences selected from the randomly sequenced Read sequences can be selected by one of the following methods:
  • the method may further include: setting a global sequence similarity minimum threshold SImin; for any sequence X, first determine which overlaps it with Whether the sequence similarity value of the sequence in the overlapping region is greater than or equal to the minimum threshold SImin, and if so, these overlapping sequences are selected to extend the sequence X, otherwise these overlapping sequences are not selected to extend the sequence X.
  • the sequencing read sequence accuracy value ⁇ at the genome level is calculated by: taking the known overlapping sequences with the highest overlap score for each sequence, taking the maximum number of average sequencing depths; calculating the average sequence consistency of all overlapping regions
  • the sex value is used as the sequencing read sequence accuracy value ⁇ at the genome-wide level.
  • the minimum global sequence similarity threshold SImin may also be set empirically or arbitrarily. For example, a corrected random Read sequence is used as the extension sequence, and an assembled Contig is used as the anchor sequence. Therefore, in the implementation, a fixed Simin value of 97% can be used, and the effect is good enough. Because generally the accuracy of the read sequence after random sequencing is about 99%.
  • step S2 when the end of the sequence (a sequence has two ends, each end can be defined as a sequence of a specific length (such as 1-50kb), each time One step can select the sequence with the highest overlap score; or the sequence with the highest extension score; or randomly select a sequence; or a combination of any two or three of the above methods; where a sequence is randomly selected, any one of the sequences is selected
  • the probability is determined according to its overlap score or extension score (for example, the probability may be: the score of this sequence / the sum of the scores of all sequences that can be used as extension).
  • extension length can be considered, but the effect is not necessarily good.
  • Step S3 includes:
  • the sequence set A is divided into one or more channel sequence sub-sets A1, A2, ..., Ak according to different end-anchor sequence fragments (where all the channel sequences in the sub-set A1 are connected to each other).
  • the ends of the initial anchor sequence segment are Es and the ends of the end anchor sequence segment are Ee1, and so on for other subsets.
  • Each subset includes one or more pathway sequences.
  • the sequence set A may also be divided into one or more channel sequence sub-sets A1, A2, ..., Ak (where all the paths in the sub-set A1 are based on different end-anchor sequence fragments).
  • the sequences are connected to the end of the starting anchor sequence fragment such as Es and the end of the anchor anchor sequence fragment such as Ee1, and other subsets, and so on), each subset includes one or more pathway sequences; then directly select the largest subset Any channel sequence with the highest frequency of mid-length is used as a valid linking sequence connecting the end of the initial anchor sequence fragment (such as Es) to the end of the corresponding end anchor sequence fragment; if the largest subset is more than one, randomly select one The largest sub-collection.
  • the ends of all anchoring sequences can be used as nodes and the representative sequences between nodes as edges to construct an undirected
  • the connection graph uses the number of effective paths as the length of the edges.
  • Step S32 may include (as shown in FIG. 2):
  • each channel sequence subset Ai into one or more groups Ai1, ... Aig, where 1 ⁇ i ⁇ k;
  • the groups in the sequence sub-set Ai are arranged according to the sequence length from left to right and from short to long, starting from the group with the highest length frequency peak and comparing with the first group with the longer channel sequence on the right side. If the total number of effective pathway sequences in the left group is higher than the total number of effective pathway sequences in the right group by a certain proportion (for example, more than twice), then set the representative sequence of the left group as the representative sequence of the sequence sub-set Ai to find The process of the representative sequence of the sequence sub-set Ai stops; otherwise, temporarily set the representative sequence of the right group as the representative sequence of the sequence sub-set Ai, then set this group as the left group, compare it with its first right group, and repeat the above Process until the total number of effective pathway sequences in the right group is lower than the total number of effective pathway sequences in the left group by a certain percentage (for example, less than 50%); thus, the representative sequence of the sequence sub-set Ai and the connected sequence are determined.
  • a certain proportion for example, more than twice
  • grouping can be performed in the following ways (as shown in Figure 3):
  • each window is compared with the total number of sequences corresponding to the two windows immediately adjacent to it; if the value is larger than both sides, the window is a peak window, and if the value is smaller than both sides, The window is a valley window; if there is no window on one side, the total number of sequences on this side is set to 0;
  • the key to grouping is to separate multiple peaks according to the bottom value. If the distribution of the path length is narrow, for example, the distance between the shortest path and the longest path is less than 10kb, it is not necessary to group and treat them as a group. ; So when grouping according to the window method, it is not necessary to set each window too small, and of course it is not necessary to set it to 1kb.
  • the ratio of the minimum length frequency in the valley window to the maximum length frequency in the peak window is very important. If there are too many groups, it will cause background interference. If there are too few groups, it will not find the correct path.
  • step S33 a maximum of one sequence is selected as a valid linking sequence connecting the ends of the initial anchoring sequence fragment (such as Es) to the end of the other end anchoring sequence fragment (as shown in FIG. 4):
  • the representative sequence corresponding to the end determined by the solution method is selected as the effective linking sequence connecting the end of the starting anchor sequence fragment to another end anchor sequence fragment (as shown in Figure 7g. The adjacent relationship between them resolves the conflict at the end of the sequence as shown in Figure 7e); if this conflict cannot be resolved, then no valid link sequence is found at the end of this anchor sequence segment and any other anchor sequence segment, go to step S2.
  • connection graph can be used to judge. If the end of a starting anchor sequence is connected to the ends of multiple different end anchor sequences, and the longest two sides have the same length (the number of effective paths), then this is explained.
  • the ends of the starting anchor sequence are connected to two other end points by similar repeats; the ends of the non-conflicting ends can be connected from large to small according to the number of pathways; the conflicting ends are not used to connect to other End until this conflict can be resolved.
  • the method for resolving conflicts includes: resolving conflicts at the ends of anchor sequence fragments located on different chromosomes according to the information of chromosome grouping; or resolving conflicts at the ends of anchor sequence fragments based on known adjacent sequence information; or In the process of constructing ultra-long continuous DNA sequences, one of the two end-point anchoring sequence fragments that cause a conflict between the ends of an initial anchoring sequence fragment has already been used in the connection of other anchoring sequence fragments. Conflicts at the ends of the initial anchor sequence fragments are also resolved accordingly.
  • FIG. 8 shows an example of a complex repeat region containing multiple repeat units. As can be seen from Figures 8a and 8b, the genomic optical spectrum shows that this region contains two repeat units, while the reference genome sequence contains only one repeat unit; FIG. 8c and FIG. 8d are the corresponding frequency distribution diagrams of the length of the channel sequence obtained by using the above method of the present invention, respectively, corresponding to the two sequences in FIG. 8a and FIG. 8b; FIG.
  • sequence structure in the present invention two representative sequences of different lengths of cns1 and cns2 obtained by the method of the present invention, wherein the line segment represents the anchor sequence, and the middle box and triangle represent the assembled sequence.
  • the invention can realize the assembly of complex sequences containing multiple repeating units, but the sequences assembled by the existing methods are incomplete and some will be missed.
  • Figure 5 The two R12 and R2 sequences in the genome have high sequence identity, but there are still some base differences that cause them to be inconsistent; the upper half of the aligned sequence is a locally generated Read sequence, followed by the The differences between the compared sequences are small; the aligned sequences in the lower half have unpaired dangling ends, which are sequencing Read sequences from different copies of the repeat sequence, which are significantly different from the compared sequences), Figure 6 (An overlay (partial) example, corresponding to the sequence region and sequencing read in Figure 5; C1-C4 are anchor sequence fragments; R1 and R2 are repeat sequences; each sequence has two ends; U: single test shell As shown in the sequence, UR, single-column, and boundary sequences of repeating regions), there are differences in the repeating sequence copies (the repeating sequence copies are similar sequences belonging to the same repeating sequence family) derived from different regions.
  • the present invention can assemble the Read sequences derived from the above-mentioned differential repeat sequence copies to form separate Contig
  • the most critical difference between the genomic assembly method of the present invention and the existing SG assembly method is that the present invention completely processes a repeat sequence region, while the SG method processes the repeat region at the Read level. Similar Read After compression, different repeating regions are compressed into one, so that the repeating regions that are different from each other cannot be separated. For various reasons, such as sequencing errors, correction errors, and so on, errors at the Read level can easily lead to compression errors, and at the channel level, although there are errors, the representative sequence of the channel is longer than Read and is easier to distinguish. At the pathway level, even if the sequence difference between the two pathways is only 1% -2%, it is possible to distinguish them, and those greater than 2% can be easily separated. The overlap between Reads already contains possible sequence differences.
  • the average read rate of raw reads produced by single-molecule sequencing is between 10% and 15%. After self-correction, the average error rate of Read is greatly reduced, for example, the error rate of many Reads can be reduced to less than 1%.
  • the reads produced by the read will not have a similarity higher than 95% after correction (in a few cases, more similar sequences will be generated due to correction errors), so It is easy to distinguish in the sequence alignment and assembly process, so it is basically not a problem when using existing software for assembly.
  • the present invention needs to process mainly those sequences with a similarity greater than 95%.
  • Ni Congtig
  • Nj Contig
  • the purpose of the present invention is to determine which Nj is adjacent to Ni by comparing the number and quality of the paths from Ni to Nj. If only one pathway is compared, it is more susceptible to accidental factors.
  • the present invention needs to find a representative sequence to represent these pathways and use it as a potential sequence connecting two Contigs.
  • the representative sequence can also be one of these channels that takes the length of the highest occurrence frequency as a reference sequence, compares all other sequences, and then calculates the consensus sequence of these sequences. If there is a consensus sequence between the two Contigs, if the path sequence aligned to the consensus sequence accounts for more than 50% of the total, it is confirmed that there is a valid consensus sequence between the two Contigs, and the number of connections is higher than The number of pathway sequences. If it is less than 50%, the repeat sequence between the two Contigs is considered too complicated, and the number of connections can be set to zero.
  • a repeating region containing multiple repeating units because a repeating unit can be crossed or repeated during extension, resulting in a shorter or longer representative sequence, showing a regular distribution of multiple length frequency peaks.
  • the two sequences shown in FIG. 9 (a) indicate that in the original genomic sequence, the sequence A is connected to the D sequence through one copy of a repeat sequence, and the C sequence is connected to another through the repeat sequence.
  • the copy (that is, the same repeat family as the above-mentioned repeat sequence) is linked to sequence B.
  • two copies of the repeating sequence shown in Fig. 9 (a) are compressed into one, as shown in Fig. 9 (b).
  • the present invention assembles similar repeat sequences into two different contig sequences, and uses the method of the present invention to find the starting point from the end of a certain sequence to the end of all unknown sequences.
  • the representative sequences of all the pathway sequences of the sequence are then found from these representative sequences to find the most correctly connected pathway from the end, that is, the method of the present invention can correctly determine that the sequence A passes the repeated sequence (ie
  • the final effective linking sequence to be found in the present invention) is linked to sequence D, and sequence C is linked to sequence B through a similar sequence of the repeat sequence (also the final effective linking sequence to be found in the present invention), so that it can be correctly
  • the entire original genomic sequence is reduced, and most of the repeating sequences or similar repeating sequences can be assembled by using the method of the present invention, and finally an ultra-long continuous DNA sequence is formed.
  • a single copy region can also be assembled using the method of the present invention, because it is not known whether the sequence is derived from a repeat sequence or a single copy sequence before assembly.
  • the invention provides a genome assembly method for constructing an ultra-long continuous DNA sequence.
  • the method of the present invention includes: S1, finding an overlapping region between each pair of known DNA sequences; S2, starting from a free end of any anchor sequence segment, extending it with a Read sequence that overlaps with it, and cycling Repeatedly, until a Read sequence that can be compared to the end of another different anchor sequence fragment is obtained, and one or more pathway sequences are obtained; S3, at most one of all pathway sequences is selected as the connection starting anchor sequence fragment.
  • the method of the invention is beneficial for restoring the entire chromosome and the entire genome sequence, further improving the accuracy of genome assembly, and can realize the genome assembly of the repeated sequence region, as well as the assembly of a single copy sequence region.
  • the assembly of the unit's complex and repeating regions has good economic value and application prospects.

Abstract

A genome assembly method for constructing an ultralong continuous DNA sequence, comprising: S1, finding an overlapping region between each pair of known DNA sequences; S2, starting from a free end of any anchor sequence fragment, extending the anchor sequence fragment with a read sequence overlapping therewith, repeating the extension a plurality of times until a read sequence that can be aligned to an end of another different anchor sequence fragment is found, so as to obtain one or a plurality of pathway sequences; S3, selecting at most one of all the pathway sequences as an effective joining sequence to join the end of the initial anchor sequence fragment to the end of the other ending anchor sequence fragment; S4, using the effective joining sequence to join the initial and corresponding ending anchor sequence fragment; using the joined anchor sequence fragment as a new anchor sequence fragment, or recording a free end of a remaining anchor sequence fragment, and proceeding to S2; and repeating steps S2-S4 to finally form an ultralong continuous DNA sequence.

Description

一种构建超长连续DNA序列的基因组组装方法Genome assembly method for constructing ultra-long continuous DNA sequence
交叉引用cross reference
本申请要求2018年6月8日提交的专利名称为“一种构建超长连续DNA序列的基因组组装方法”的第201810588945.8号中国专利申请的优先权,其全部公开内容通过引用整体并入本文。This application claims priority from Chinese Patent Application No. 201810588945.8, entitled “A Method for Constructing a Genome Assembly of Ultra-Long Continuous DNA Sequences”, filed on June 8, 2018, the entire disclosure of which is incorporated herein by reference in its entirety.
技术领域Technical field
本发明涉及一种构建超长连续DNA序列的基因组组装方法,属于基因组组装技术领域。The invention relates to a genome assembly method for constructing ultra-long continuous DNA sequences, and belongs to the technical field of genome assembly.
背景技术Background technique
测序仪通过对基因组片段的测序产生了随机的读出序列(Read,读序)。这些Read在基因组上的分布是随机的。基因组组装的过程就是把这些Read按照正确的顺序排列和连接,组装成碱基连续的DNA片段(Contig),最终复原整条染色体及整个基因组的序列。这个组装的过程一般包括三步:连续片段(Contig)的组装,有缺口的非连续片段(Scaffold)的组装,缺口的补齐(GF)。基因组组装的困难来源于基因组存在的大量重复序列(即序列相似或一样的两个/段或多个/段序列)。重复序列在基因组中可分为两个大类:串联重复序列和散布重复序列。串联重复是一组头尾直接相连的非常相似的重复单位组成的序列,通过局部重复产生。典型的串联重复序列包括rDNA、着丝粒重复序列等。散布重复序列是分布于基因组中不同位置的非局部重复序列。在有些重复序列中,串联重复和非串联重复序列都有,这些区域很长,形成复杂重复序列。测序产生的来源于不同重复序列拷贝的Read具有序列上的相似性。目前单分子测序Read的长度N50一般大于10-15kb,最长达到了100kb以上。若是一个重复序列加上其两端的单考贝序列一起被一条Read全部覆盖,则这个区域不存在组装的问题。而当前需要解决的是超出了Read平均或N50长度的重复序列的组装问题。The sequencer generates random reads (Read) by sequencing the genomic fragments. The distribution of these Reads across the genome is random. The process of genome assembly is to arrange and connect these Reads in the correct order, and assemble them into contigs of contiguous bases, and finally restore the entire chromosome and the entire genome sequence. This assembly process generally includes three steps: the assembly of continuous fragments (Contig), the assembly of discontinuous non-continuous fragments (Scaffold), and the completion of gaps (GF). The difficulty of genome assembly stems from the large number of repeated sequences present in the genome (ie, two / segment or multiple / segment sequences with similar or identical sequences). Repeats can be divided into two major categories in the genome: tandem repeats and interspersed repeats. A tandem repeat is a sequence of very similar repeating units that are directly connected to the head and tail, and are generated by local repeats. Typical tandem repeats include rDNA, centromeric repeats, and the like. Interspersed repeats are non-locally repeated sequences distributed at different locations in the genome. In some repeats, there are both tandem and non-tandem repeats. These regions are long and form complex repeats. Reads derived from different copies of the repeat sequence have sequence similarity. At present, the length N50 of single molecule sequencing Read is generally greater than 10-15kb, and the longest is more than 100kb. If it is a repeating sequence plus the single test sequence at both ends is covered by a single Read, then there is no assembly problem in this area. What needs to be solved now is the assembly of repeating sequences that exceed the Read average or N50 length.
对于长单分子测序数据现在最常用的基因组组装方法采用了基于Overlap-Layout-Consensus(OLC)(Myers et al.2000,Science 287,2196–2204)或String Graph(SG)(Myers 2005,Bioinformatics 21 Suppl 2,ii79-85)的策略。 OLC方法也可以简洁地用SG来描述,统称为SG类方法。现有的SG类方法常用软件包括PBcR(Berlin et al.2015,Nat.Biotechnol.33,623–30)、CANU(Koren et al.2017,Genome Res.27,722–736)、FALCON(Chin et al.2016,Nat.Methods 13,1050–1054)、MECAT(Xiao et al.2017,Nat.Methods.doi:10.1038/nmeth.4432)等。SG方法中的关键是利用可传递性简化路径(Transitive reduction,TU)的方法来去除多余的Read(所有的序列特别相似的Read被压缩成一条)。即在构建所有Read的重叠图后,利用TU将很多节点的进出边数简化到一条。这样在很多路径上就会没有分支。若是一个Read节点在简化后图中的重叠边度数大于1,则称之为交叉节点,其他的节点为内部节点。没有交叉节点的一个通路就可以形成一条Contig,在SG中可被进一步压缩到一起。交叉节点代表了单考贝序列区域和重复序列区域的连接之处(这个节点上的Read包括了两种类型序列的各一部分);测序仪在测出Read序列的时候会犯错误,导致其测出带有测序错误的Read序列,这些序列错误包括碱基的插入、缺失、变异,或是来源于不同位置的序列的嵌合体,这些错误也可能导致额外的交叉节点序列。由于测序错误的存在,导致没有一个统一的标准来区分Read序列之间的差异到底是由测序错误引起,还是来源于重复序列的不同拷贝而引起的。在这个路径简化的过程中,单考贝区域被简化成一长串Read形成的单一路径,连接到一起形成单考贝序列Contig;而一段重复序列也可以被压缩成一串Read形成的单一路径,形成重复序列Contig。由于序列比较时要允许错误,导致来源于不同重复序列拷贝的Read会被压缩到一起,也导致不同拷贝的重复序列变成一个,因而不能区分开。但是由于交叉节点的存在,形成的重复序列Contig在被压缩的起点和终点位置断开,导致组装出的Contig的碎片化,进而导致无法真正复原整个原始基因组序列。For long single-molecule sequencing data, the most commonly used genome assembly method now uses Overlap-Layout-Consensus (OLC) (Myers, et al. 2000, Science 287, 2196-2204) or String Graph (SG) (Myers 2005, Bioinformatics 21). Suppl 2, ii79-85). The OLC method can also be described succinctly with SG, collectively referred to as the SG-type method. Common software for existing SG methods includes PBcR (Berlin et al. 2015, Nat. Biotechnol. 33, 623–30), CANU (Koren et al. 2017, Genome et Res. 27, 722–736), and FALCON (Chin et et al. 2016, Nat.Methods 13,1050–1054), MECAT (Xiao et al. 2017, Nat.Methods.doi: 10.1038 / nmeth.4432), etc. The key in the SG method is to use the method of transitive reduction (Transitive reduction) (TU) to remove redundant reads (all reads that are particularly similar are compressed into one). That is, after constructing an overlay graph of all Reads, the number of edges in and out of many nodes is reduced to one using TU. This leaves no branches on many paths. If a Read node has a degree of overlapping edges greater than 1 in the simplified diagram, it is called a cross node, and the other nodes are internal nodes. A path without crossing nodes can form a Contig, which can be further compressed together in the SG. The cross node represents the connection between the single test sequence region and the repeated sequence region (Read on this node includes each part of the two types of sequences); the sequencer will make errors when detecting the Read sequence, causing its measurement Read sequences with sequencing errors, such as base insertions, deletions, mutations, or chimeras derived from sequences at different positions, may cause additional cross-node sequences. Due to the existence of sequencing errors, there is no unified standard to distinguish whether the differences between Read sequences are caused by sequencing errors or caused by different copies of repeated sequences. In the process of path simplification, the single test area is reduced to a single path formed by a series of Reads, which are connected together to form a single test sequence Contig; and a repeating sequence can also be compressed into a single path formed by a series of Reads to form Repeating sequence Contig. Because errors must be tolerated during sequence comparison, the Reads from different duplicate sequence copies will be compressed together, and the duplicate sequences of different copies will become one, so it cannot be distinguished. However, due to the existence of cross nodes, the formed repeat sequence Contig is broken at the compressed start and end positions, resulting in fragmentation of the assembled Contig, which in turn makes it impossible to truly restore the entire original genome sequence.
发明内容Summary of the Invention
本发明的目的在于,提供一种构建超长连续DNA序列的基因组组装方法,它可以有效解决现有技术中存在的问题,尤其是现有技术中将相似的多段重复序列压缩成一串Read形成的单一路径;由于来源于不同重复序列拷贝的Read会被压缩到一起,导致不同拷贝的重复序列变成一个,因而不 能区分开;而且由于交叉节点的存在,形成的重复序列Contig在被压缩的起点和终点位置断开,导致组装出的Contig的碎片化的问题。The purpose of the present invention is to provide a genome assembly method for constructing an ultra-long continuous DNA sequence, which can effectively solve the problems existing in the prior art, especially in the prior art, compressing similar multi-segment repeating sequences into a string of Read. A single path; because Reads from different duplicate sequence copies will be compressed together, causing the duplicate sequences of different copies to become one, it cannot be distinguished; and because of the existence of cross nodes, the formed repeat sequence Contig is at the starting point of compression Disconnect from the end position, causing fragmentation of the assembled Contig.
为解决上述技术问题,本发明采用如下的技术方案:一种构建超长连续DNA序列的基因组组装方法,包括以下步骤:In order to solve the above technical problem, the present invention adopts the following technical solution: a method for assembling a genome for constructing an ultra-long continuous DNA sequence, including the following steps:
S1,将所有的已知DNA序列进行两两比较,找出每对序列之间相似的重叠区域;其中,所述的已知DNA序列包括锚定序列片段(即用于锚定的序列片段,可包括多种类型,比如从DNA序列上截取的某一段或几段特定的序列片段,和/或已经组装好的某一段或几段特定的序列片段,和/或从随机测序Read序列中选出的某一个或几个特定的Read序列等)和随机测序Read序列;所述的锚定序列片段至少包括两个;所述的将所有的已知DNA序列进行两两比较,包括将所有的锚定序列片段与所有的随机测序Read序列进行两两比较,以及将所有的随机测序Read序列进行两两比较;S1. Perform a pairwise comparison of all known DNA sequences to find similar overlapping regions between each pair of sequences; wherein the known DNA sequences include anchor sequence fragments (that is, sequence fragments used for anchoring, It can include multiple types, such as a specific segment or segments of a specific sequence segment intercepted from a DNA sequence, and / or a specific segment or segments of a specific sequence segment that have been assembled, and / or selected from a random sequencing Read sequence One or several specific Read sequences, etc.) and random sequencing Read sequences; said anchor sequence fragments include at least two; said pairwise comparison of all known DNA sequences, including all Anchor sequence fragments are compared pairwise with all random sequencing Read sequences, and all random sequencing Read sequences are compared pairwise;
S2,从任意一个锚定序列片段的一个自由末端(如Es)开始,用跟其有重叠的随机测序Read序列对该锚定序列片段进行延伸,形成一到多个延长的序列;再对这些延长的序列采用同样的方法利用随机测序Read序列继续进行延伸,每个序列的延伸循环多次,直至遇到能够比对到另一个不同的锚定序列片段末端的随机测序Read序列,则从起始锚定序列片段一端开始的延伸结束,获得连接起始锚定序列片段的一端到另一个或多个不同的终点锚定序列片段末端的一个或多个通路序列,所述的一个或多个通路序列形成序列集合A(即序列集合A中的通路序列连接了起始锚定序列片段末端Es到其它一或多个不同的终点锚定序列片段末端Ee1,…,Eek);S2, starting from a free end (such as Es) of any anchor sequence fragment, extending the anchor sequence fragment with a random sequencing Read sequence that overlaps with it to form one or more extended sequences; then these The extended sequence uses the same method to continue the extension by using random sequencing Read sequences. Each sequence is extended multiple times until it encounters a random sequencing Read sequence that can be compared to the end of another different anchor sequence fragment. The extension of one end of the starting anchor sequence segment ends, and one or more pathway sequences connecting one end of the starting anchor sequence segment to another end of the different end anchor sequence segment are obtained. The pathway sequence forms sequence set A (that is, the pathway sequence in sequence set A connects the end of the starting anchor sequence segment Es to one or more different end anchoring sequence segment ends Ee1, ..., Eek);
S3,根据所述的序列集合A中的通路序列,选择最多一条序列作为连接起始锚定序列片段末端(如Es)到另一个终点锚定序列片段末端的有效连接序列(从一个起始锚定序列片段末端开始可以不存在有效连接序列);S3. According to the pathway sequence in the sequence set A, a maximum of one sequence is selected as a valid linking sequence (from one start anchor There may be no effective linking sequence at the end of a given sequence fragment);
S4,利用所述的有效连接序列连接起始锚定序列片段(如末端Es)和相应的终点锚定序列片段;将连接后的序列片段作为新的锚定序列片段或记录剩余的锚定序列片段的自由末端,转到S2;不断重复步骤S2-S4,从而最终形成超长连续的DNA序列。S4. Use the effective linking sequence to connect the starting anchor sequence fragment (such as end Es) and the corresponding end anchor sequence fragment; use the linked sequence fragment as a new anchor sequence fragment or record the remaining anchor sequence The free end of the fragment is transferred to S2; steps S2-S4 are continuously repeated, so as to finally form an ultra-long continuous DNA sequence.
本发明中,任意两个锚定序列片段都不完全相同,从而可以尽量避免冲突末端的出现。一个序列有两个末端,每个末端可定义为一段特定长度 (比如1-50kb)的序列,那么所述末端对应的一段特定长度(比如1-50kb)的序列即为末端序列。实际操作中,可通过序列比对的方式去除相似的末端序列(比如一致性>98%),序列缩短后产生新的可用的末端。In the present invention, any two anchor sequence fragments are not completely the same, so that the occurrence of conflicting ends can be avoided as much as possible. A sequence has two ends, and each end can be defined as a sequence of a specific length (for example, 1-50 kb). Then, a sequence of a specific length (for example, 1-50 kb) corresponding to the end is the terminal sequence. In actual operation, similar end sequences can be removed by sequence alignment (for example, the identity is> 98%), and new available ends can be generated after the sequence is shortened.
优选的,步骤S2中,在选择候选延伸序列之前,还包括:设定一个全局序列相似性最低阈值SImin;对任一序列X来说,首先判断跟其重叠的序列在重叠区域的序列相似性值是否大于等于所述最低阈值SImin,如果是,则选用这些重叠序列来延伸序列X,否则放弃选用这些重叠序列来延伸序列X,从而可以去除噪音干扰,提高数据处理的效率和速度,并提高了结果的准确性。Preferably, in step S2, before selecting a candidate extended sequence, the method further includes: setting a global sequence similarity minimum threshold SImin; for any sequence X, first determining the sequence similarity of the sequence overlapping with it in the overlapping area Whether the value is greater than or equal to the minimum threshold SImin, and if so, use these overlapping sequences to extend sequence X, otherwise give up selecting these overlapping sequences to extend sequence X, thereby eliminating noise interference, improving the efficiency and speed of data processing, and improving The accuracy of the results.
优选的,所述的全局序列相似性最低阈值SImin参考全基因组水平上的测序Read序列准确率值α进行设定(比如设定SImin=1-(1-α)*3),其中,所述的全基因组水平上的测序Read序列准确率值α通过以下方式计算获得:取已知的每条序列的具有最高重叠分数的重叠序列,最多取平均测序深度的条数;计算所有重叠区域的平均序列一致性值,作为全基因组水平上的测序Read序列准确率值α;利用估算出的测序精确度值来设定最低重叠的筛选阈值,可以提高此值设定的准确性,减少背景噪音,提高了结果的准确性和运算速度。Preferably, the minimum global sequence similarity threshold value SImin is set with reference to the sequencing read sequence accuracy value α at the whole genome level (for example, SImin = 1- (1-α) * 3), wherein, the The sequencing-read sequence accuracy value α at the genome-wide level is obtained by calculating: taking the known overlapping sequences with the highest overlap score for each sequence, taking at most the number of average sequencing depths; calculating the average of all overlapping regions The sequence identity value is used as the sequencing read sequence accuracy value α at the genome-wide level; the estimated sequencing accuracy value is used to set the minimum overlap screening threshold, which can improve the accuracy of this value setting and reduce background noise. Improved the accuracy of the results and the speed of calculations.
前述的构建超长连续DNA序列的基因组组装方法中,步骤S2中,对序列末端(一个序列有两个末端,每个末端可定义为一段特定长度(比如1-50kb)的序列)进行延伸时,每一步都选择重叠分数最高的序列;或者延伸分数最高的序列;或者随机选择一个序列;或者为上述任意两种或上述三种方式的结合;其中,随机选择序列时,任意一个序列被选中的概率根据其重叠分数或延伸分数确定(比如所述的概率可以是:这条序列的分数/可以用作延伸的所有序列的分数总和);每一种延伸方式都是贪心算法,若是只有一条通路的情况下,并不能保证就是正确的,所以结合多种延伸方式得到多条通路可以提高找到正确结果的概率。In the foregoing genome assembly method for constructing an ultra-long continuous DNA sequence, in step S2, when the sequence ends (a sequence has two ends, each end can be defined as a sequence of a specific length (such as 1-50 kb)) is extended For each step, select the sequence with the highest overlap score; or the sequence with the highest extension score; or randomly select a sequence; or a combination of any two or three of the above methods; where a sequence is selected randomly, any sequence is selected The probability of is determined by its overlap score or extension score (for example, the probability can be: the score of this sequence / the sum of the scores of all sequences that can be used for extension); each extension method is a greedy algorithm, if there is only one In the case of channels, it is not guaranteed to be correct, so combining multiple extension methods to obtain multiple channels can increase the probability of finding the correct result.
通过以上方法,从而在对序列进行延伸时,除了第一步外,每一步都只选择一个序列,而非利用所有的序列都去延伸,从而保证可以在有限或较短的时间内延伸出更长的连续的DNA序列。而如果每一步都要用所有的序列进行延伸,那么随着延伸次数的增长,总的序列数就会呈指数式增长, 最终导致进行对所有序列的延伸变得不具有可行性;第一步的多个Read延伸保证最后能产生多条序列,增加了最后结果中包含正确结果的概率。Through the above method, when extending the sequence, except for the first step, only one sequence is selected at each step, instead of using all the sequences to extend, thereby ensuring that it can be extended in a limited or short period of time. Long continuous DNA sequence. And if every step needs to be extended with all the sequences, as the number of extensions increases, the total number of sequences will increase exponentially, and eventually it will not be feasible to extend all sequences; Multiple Read extensions guarantee that multiple sequences can be generated at the end, increasing the probability that the final result contains the correct result.
上述的构建超长连续DNA序列的基因组组装方法中,对于末端重叠的两条序列X1和X2(其中一条序列没有被另一条序列完全覆盖),其重叠区域的重叠分数OS为:OS=(OL1+OL2)*SI/2;其中,OL1,OL2分别为序列X1和X2中其重叠区域的长度,SI为序列X1和X2之间的重叠区域的序列一致性值;X2对X1的延伸分数ES2为:ES2=OS+EL2/2-(OH1+OH2)/2,其中OH1,OH2分别是两条序列末端未配对悬空(Overhang)区域的长度,EL2是X2对X1的延伸长度(类似,X1对X2的延伸分数也可以用同样的方式计算出来);一般来说,重叠分数越高,则此重叠区域是来源于基因组上同一个位置的可能性越大;序列末端的未配对悬空部分少数是由于测序错误,多数是由于重复序列的不同拷贝造成,因此分数中减去这个值增加了找到正确序列的概率;延伸序列的长度也很重要,在延伸分数中包括了延伸序列的长度可以帮助优先选择长的序列,以便在后续的延伸中找到更长的重叠区域及更高的重叠分数;注意由于DNA序列是双链互补的,但在序列比较时,只能用单链,所以序列重叠区域在两条链上都可以产生,通过链的调整,可以把二者统一起来,不会产生冗余或矛盾。In the above-mentioned genome assembly method for constructing ultra-long continuous DNA sequences, for the two sequences X1 and X2 whose ends overlap (one of which is not completely covered by the other sequence), the overlap score OS of the overlap region is: OS = (OL1 + OL2) * SI / 2; where OL1 and OL2 are the lengths of the overlapping regions in the sequences X1 and X2, SI is the sequence identity value of the overlapping regions between the sequences X1 and X2; the extension score of X2 to X1 is ES2 ES2 = OS + EL2 / 2- (OH1 + OH2) / 2, where OH1 and OH2 are the lengths of the unpaired overhang regions at the ends of the two sequences, and EL2 is the extension of X2 to X1 (similarly, X1 The extension score for X2 can also be calculated in the same way); In general, the higher the overlap score, the more likely this overlap region originates from the same position on the genome; the unpaired overhangs at the end of the sequence are few It is due to sequencing errors, most of which are caused by different copies of repeated sequences, so subtracting this value from the score increases the probability of finding the correct sequence; the length of the extended sequence is also important. Including the length of the extended sequence in the extension score can help Prefer long sequences in order to find longer overlap regions and higher overlap scores in subsequent extensions. Note that because DNA sequences are double-stranded complementary, only single strands can be used in sequence comparison, so sequences overlap Regions can be created on both chains. Through chain adjustment, the two can be unified without redundancy or contradiction.
发明人经研究发现,序列之间重叠区域的产生,有两种方式:1、来源于基因组上的同一个位置,这些序列的一致性往往很高,但由于测序错误,导致序列的一致性不是100%;2、来源于重复序列的不同拷贝,但这些序列的一致性往往较低。The inventors found through research that there are two ways to generate overlapping regions between sequences: 1. Originating from the same position on the genome, the consistency of these sequences is often high, but due to sequencing errors, the consistency of the sequences is not 100%; 2. Derived from different copies of repeated sequences, but the identity of these sequences is often low.
优选的,步骤S3包括:Preferably, step S3 includes:
S31,将所述的由起始锚定序列片段末端如Es开始的通路序列集合A按照终点锚定序列片段的不同,划分为一或多个通路序列子集合A1,A2,…,Ak(其中,子集合A1中所有的通路序列都连接到终点锚定序列片段末端Ee1,其它的子集合以此类推),每个子集合中包括一条或多条通路序列;S31. Divide the pathway sequence set A starting from the end of the initial anchor sequence segment, such as Es, into one or more pathway sequence subsets A1, A2, ..., Ak (wherein , All the pathway sequences in the sub-set A1 are connected to the end-anchor sequence fragment end Ee1, and the other sub-sets and so on), each sub-set includes one or more pathway sequences;
S32,根据每个通路序列子集合Ai中的通路序列获得一条序列作为这个子集合的代表性序列并计算这个子集合的有效通路序列数目,其中,1≤i≤k;S32. Obtain a sequence as a representative sequence of the subset according to the channel sequence in each of the channel sequence subsets Ai and calculate the number of valid channel sequences of the subset, where 1 ≦ i ≦ k;
S33,在所有的通路序列子集合的代表性序列中,选择最多一条作为连接起始锚定序列片段末端(如Es)到另一个终点锚定序列片段末端的有效连接序列。S33. Among the representative sequences of all the pathway sequence subsets, a maximum of one is selected as a valid linking sequence connecting the end of the initial anchor sequence fragment (such as Es) to the end of the other end anchor sequence fragment.
通过以上方法,从而可以快速、准确的找出从一个起始锚定序列片段的一个自由末端开始,连接到一个或多个终点锚定序列片段一端的所有通路序列中最正确的那条(在多个连接到起始锚定序列末端的终点锚定序列末端中,只有一个终点序列末端是正确的,其他的终点序列末端都是背景噪音),然后利用正确的那条通路序列连接起始锚定序列片段和相应的终点锚定序列片段,从而提高了基因组组装的准确率;从一个起始锚定序列末端开始,找到的通路序列可以随机连接到一个或多个不同的终点锚定序列末端,由于在基因组上同一个区域产生的Read序列相似性及重叠分数最高,而在延伸时,高分数的序列被优先选择,因而导致到达正确终点锚定序列末端的连接通路会最多;不根据末端分组,错误率高,不选有效通路数目最高的,错误率高;有冲突,说明到达两个终点的序列太过相似,一致性值太高,不解决冲突,容易选错。另外,通过上述方法,组装出的序列都是局部完整的片段,提高了组装序列的长度,而现有方法,组装出的多是局部的碎片;而长的序列包含的基因更完整,更容易排列到染色体上,更容易发现片段之间的共线性及结构变异;另外,在不需要输出序列的时候,本发明的这个方法还可以用于判断两个锚定序列之间的相邻关系,或是两个相邻锚定序列之间的距离。Through the above method, it is possible to quickly and accurately find the most correct one of all pathway sequences starting from a free end of an initial anchor sequence fragment and connected to one end of one or more end anchor sequence fragments (in the Of the multiple end anchoring sequence ends connected to the end of the starting anchor sequence, only one end sequence end is correct, and the other end sequence ends are background noise), and then use the correct path sequence to connect the start anchor Sequence fragments and corresponding end-anchor sequence fragments to improve the accuracy of genome assembly; starting from the end of a starting anchor sequence, the pathway sequence found can be randomly connected to one or more different end-anchor sequences As the Read sequence generated in the same region of the genome has the highest similarity and overlap score, and when extended, the high-scoring sequence is preferentially selected, resulting in the most connected pathways reaching the end of the anchor sequence at the correct endpoint; not based on the end Grouping, high error rate, if the highest number of effective paths is not selected, the error rate is high; if there is a conflict, it means The two end up too similar sequence, consistency value is too high, not conflict resolution, easy wrong. In addition, through the above methods, the assembled sequences are all partially complete fragments, which increases the length of the assembled sequence. In the existing methods, most of the assembled fragments are local fragments. The long sequences contain more complete and easier genes. Arranged on the chromosome, it is easier to find the colinearity and structural variation between fragments. In addition, when the output sequence is not needed, the method of the present invention can also be used to judge the adjacent relationship between two anchored sequences. Or the distance between two adjacent anchoring sequences.
优选的,步骤S32包括:Preferably, step S32 includes:
S321,将每个通路序列子集合Ai分成一或多个组Ai1,…Aig,其中1≤i≤k;S321. Divide each channel sequence subset Ai into one or more groups Ai1, ... Aig, where 1 ≦ i ≦ k;
S322,从每个组中选出序列长度的出现频率为最高值及小于最高值一定范围的序列(比如选出序列长度的出现频率最高值到最高值一半的所有序列),形成序列集合Bi1,…,Big,其中序列集合Bi1与Ai1对应,其他集合以此类推;S322. From each group, select sequences with the highest occurrence frequency of the sequence length and a range smaller than the highest value (for example, select all sequences with the highest occurrence frequency of the sequence length and half the highest value) to form a sequence set Bi1, …, Big, where the sequence set Bi1 corresponds to Ai1, and the other sets can be deduced by analogy;
S323,将序列集合Bij中的所有序列进行两两比较,若是两条序列之间短的序列被覆盖超过一定比例(比如90%以上),则此两条序列被认为是相似性序列;选出所有能比对到最多条相似性序列的序列;若有多条序列 的相似性序列的数目最高且相同,则选择序列长度的出现频率为最高的任意一条序列,作为Aij的代表性序列,并记录Bij中与此代表性序列相似的序列数目作为序列组Aij的有效通路序列数目;其中,1<=j<=g;S323: Compare all sequences in the sequence set Bij pair by pair. If the short sequence between the two sequences is covered by more than a certain ratio (such as more than 90%), the two sequences are considered as similar sequences; All sequences that can match up to the most similar sequences; if there are multiple sequences with the highest number of similar sequences and the same, select any sequence with the highest frequency of sequence length as the representative sequence of Aij, and Record the number of sequences similar to this representative sequence in Bij as the number of effective pathway sequences in sequence group Aij; where 1 <= j <= g;
S324,将序列子集合Ai中的各组按照序列长度从左到右、从短到长进行排列,从具有最高长度频率峰值的一个组开始,跟其右边通路序列更长的第一个组比较,若是左边组中总的有效通路序列数目高出右边组中总的有效通路序列数目一定比例(比如两倍以上),则设左边组的代表性序列为序列子集合Ai的代表性序列,寻找序列子集合Ai代表性序列的过程停止;否则,暂设右边组的代表性序列为序列子集合Ai的代表性序列,然后设此组为左边组,跟其第一个右边组比较,重复上述过程,直到右边组中总的有效通路序列数目低于左边组中总的有效通路序列数目一定比例(比如50%以下)为止;从而确定了序列子集合Ai的代表性序列和其所连接的一对相应的锚定序列片段末端(如Es和Eei)之间的有效通路序列数目(即对应序列组的有效通路序列数目,如NPsi)。S324. The groups in the sequence sub-set Ai are arranged according to the sequence length from left to right and from short to long, starting from the group with the highest length frequency peak and comparing with the first group with the longer channel sequence on the right side. If the total number of effective pathway sequences in the left group is higher than the total number of effective pathway sequences in the right group by a certain proportion (for example, more than twice), then set the representative sequence of the left group as the representative sequence of the sequence sub-set Ai to find The process of the representative sequence of the sequence sub-set Ai stops; otherwise, temporarily set the representative sequence of the right group as the representative sequence of the sequence sub-set Ai, then set this group as the left group, compare it with its first right group, and repeat the above Process until the total number of effective pathway sequences in the right group is lower than the total number of effective pathway sequences in the left group by a certain percentage (for example, less than 50%); thus, the representative sequence of the sequence sub-set Ai and the connected sequence are determined. The number of effective pathway sequences between the corresponding anchor sequence segment ends (such as Es and Eei) (that is, the number of effective pathway sequences of the corresponding sequence group, such as NPsi).
本发明中,通过对通路序列子集合进行分组,从而可以在有多个重复单位的情况下,找出正确的通路序列(而不分组不能解决多个重复单位的复杂序列),进而实现了对包含多个重复单位的复杂重复区域的组装。若是一个重复序列内部不存在多个重复单位,则不需要进行分组,在分组时自动会只产生一个组。由于存在多个重复单位,在延伸时,容易形成具有错误重复单位数目的通路,这样,长度频率最高组或其右边的某个组代表了正确的通路集合,这个组中的有效通路数目不能太低(太低了则很可能是背景噪音)。通过本发明的方法选择这个有效通路数目不太低的组,从而提高了找到正确通路的概率。In the present invention, by grouping the path sequence sub-sets, the correct path sequence can be found in the case of multiple repeating units (without grouping, the complex sequence of multiple repeating units cannot be solved), thereby achieving the alignment of Assembly of complex repeating regions containing multiple repeating units. If there are no multiple repeating units in a repeating sequence, no grouping is required, and only one group will be automatically generated when grouping. Because there are multiple repeating units, it is easy to form a path with the wrong number of repeating units when extended. In this way, the group with the highest length frequency or a group to the right of it represents the correct path set. The number of effective paths in this group cannot be too large. Low (too low is most likely background noise). The method of the present invention selects this group with a relatively low number of effective paths, thereby increasing the probability of finding the correct path.
更优选的,步骤S321中,通过以下方式进行分组:More preferably, in step S321, grouping is performed in the following manner:
S3211,将通路序列子集合Ai中的通路序列按照从短到长、从左到右(左短右长)的顺序进行排列;S3211, arrange the pathway sequences in the pathway sequence sub-set Ai in order from short to long and left to right (left short right long);
S3212,按照相同的长度差异(如1kb)将通路序列子集合Ai中的通路序列划分成多个不重叠的小窗口,并计算每个窗口中包含的序列总数;S3212. Divide the channel sequences in the channel sequence subset Ai into multiple non-overlapping small windows according to the same length difference (such as 1 kb), and calculate the total number of sequences contained in each window;
S3213,将每个窗口包含的序列总数与其直接相邻的两个窗口对应的序列总数分别进行比较;若比两边的数值都大,则该窗口为一个高峰窗口, 若比两边的数值都小,则该窗口为一个谷底窗口;其中,若是一边没有窗口则此边的序列总数设为0;S3213: Compare the total number of sequences contained in each window with the total number of sequences corresponding to the two windows directly adjacent to it; if the value is larger than both sides, the window is a peak window, and if the value is smaller than both sides, The window is a valley window; if there is no window on one side, the total number of sequences on this side is set to 0;
S3214,计算出所有的高峰窗口和谷底窗口;若是一个谷底窗口中序列长度的出现频率最小值小于其右边最近的一个高峰窗口中序列长度的出现频率最大值的某一特定比例(比如4/5)时,则用此谷底窗口中的出现频率最低的序列长度把两边的通路序列分成两组;以此类推,通路序列子集合Ai被分成一或多个组。S3214. Calculate all the peak and valley windows; if the minimum value of the frequency of the sequence length in a valley window is less than the maximum value of the frequency of the sequence length in the nearest peak window to the right of a certain ratio (such as 4/5 ), The channel sequence on both sides is divided into two groups by using the lowest frequency sequence length in the valley window; and so on, the channel sequence subset Ai is divided into one or more groups.
分组的关键是根据谷底值将多个峰值分开,若是通路长度的分布很窄,比如最短的通路和最长的通路之间的距离<10kb,则没有必要进行分组,当作一个组处理即可;所以按照窗口法分组时,每个窗口没有必要设定的太小,当然大了也没有必要,1kb即可。The key to grouping is to separate multiple peaks according to the bottom value. If the distribution of the path length is narrow, for example, the distance between the shortest path and the longest path is less than 10kb, it is not necessary to group and treat them as a group. ; So when grouping according to the window method, it is not necessary to set each window too small, and of course it is not necessary to set it to 1kb.
优选的,步骤S33中,通过以下方式来选择最多一条序列作为连接起始锚定序列片段末端(如Es)到另一个终点锚定序列片段末端的有效连接序列:Preferably, in step S33, at most one sequence is selected as a valid linking sequence connecting the end of the initial anchor sequence fragment (such as Es) to the end of the other end anchor sequence fragment:
S331,从序列集合A的各个通路序列子集合的代表性序列(一条代表性序列连接起始锚定序列片段末端Es到一个不同的终点锚定序列片段末端Eei)所对应的有效通路序列数目中,选取有效通路序列数目的最大值与第二大值,计算相应的起始锚定序列片段末端的冲突指数为:CIs=NPs n/NPs m,其中,CIs表示相应的起始锚定序列片段末端Es的冲突指数,NPs m为从起始锚定序列片段末端Es连接到其他不同的终点锚定序列片段末端的最大的有效通路序列数目,NPs n为从起始锚定序列片段末端Es连接到其他不同的终点锚定序列片段末端的第二大的有效通路序列数目;其中,NPs m≥NPs n;若是冲突指数超出了阈值(比如0.75),则相应的起始锚定序列片段末端被称之为一个冲突末端; S331. From the representative sequences of each pathway sequence subset of sequence set A (a representative sequence connects the starting anchor sequence fragment end Es to a different end anchor sequence fragment end Eei), , Select the maximum value and the second largest value of the number of effective pathway sequences, and calculate the conflict index at the end of the corresponding initial anchor sequence fragment: CIs = NPs n / NPs m , where CIs represents the corresponding initial anchor sequence fragment Conflict index of terminal Es, NPs m is the maximum number of effective pathway sequences connected from the end of the starting anchor sequence fragment Es to other different end of the anchor sequence fragment ends, and NPs n is the link from the end of the starting anchor sequence fragment Es The second largest number of effective pathway sequences to the ends of other different end-anchor sequence fragments; where NPs m ≥ NPs n ; if the collision index exceeds a threshold (such as 0.75), the end of the corresponding initial anchor sequence fragment is Call it the end of a conflict;
S332,对于不存在冲突的锚定序列片段末端,则在其通路序列集合A的所有子集合的代表性序列中,选择相应的有效通路序列数目的最大值所对应的通路序列子集合的代表性序列作为序列集合A最终的代表性序列,即获得了连接起始锚定序列片段末端(如Es)到另一个终点锚定序列片段末端的有效连接序列;对于存在冲突的锚定序列片段末端,若是此冲突能够被解决,则选择由解决方法所决定的末端所对应的代表性序列作为连接 起始锚定序列片段末端到另一个终点锚定序列片段的有效连接序列;若是此冲突不能被解决,则没有找到此锚定序列片段末端跟其它任一锚定序列片段连接的有效连接序列,转到步骤S2。S332. For the end of the anchor sequence fragment that does not have a conflict, among the representative sequences of all the subsets of the channel sequence set A, select the representativeness of the channel sequence sub-set corresponding to the maximum value of the corresponding effective channel sequence number. The sequence is the final representative sequence of sequence set A, that is, a valid linking sequence connecting the end of the initial anchor sequence fragment (such as Es) to the end of the other end anchor sequence fragment is obtained. For the ends of the anchor sequence fragment in conflict, If this conflict can be resolved, the representative sequence corresponding to the end determined by the solution method is selected as a valid linking sequence connecting the end of the starting anchor sequence segment to another end anchor sequence segment; if this conflict cannot be resolved , If no valid linking sequence is found at the end of this anchoring sequence segment and any other anchoring sequence segment, go to step S2.
通过以上方法识别锚定序列片段的末端是否为冲突末端及进一步选择最多一条序列作为连接起始锚定序列片段末端(如Es)到另一个终点锚定序列片段末端的有效连接序列,从而避免了将冲突末端进行错误的延伸(会导致DNA序列嵌合体,即不同区域的序列被连接到了一起),保证形成更长的连续DNA序列的同时正确率也更高。By the above method, whether the end of the anchor sequence fragment is a conflicting end and further selecting at most one sequence as a valid linking sequence connecting the end of the initial anchor sequence fragment (such as Es) to the end of the other end anchor sequence fragment, thereby avoiding Extending the conflicting ends incorrectly (causing DNA sequence chimeras, that is, sequences from different regions are joined together), guarantees that a longer continuous DNA sequence is formed, and the accuracy is higher.
上述的构建超长连续DNA序列的基因组组装方法中,步骤S332中,解决冲突的方法包括:根据染色体分组的信息解决位于不同染色体上的锚定序列片段末端的冲突;或根据已知相邻的序列信息,从而解决锚定序列片段末端的冲突;或在构建超长连续DNA序列的过程中,将引起某起始锚定序列片段末端冲突的两个终点锚定序列片段末端中的一个已经用于其他锚定序列片段的连接中,则对应的起始锚定序列片段末端的冲突也相应得到解决。这些解决冲突的方法基于的数据均比较容易得到,从而可以快速解决冲突,比如染色体分组数据有Hi-C或遗传图谱,相邻信息有BioNano基因组光学图谱,或是10x Genomics数据。少数冲突可根据自有的信息来解决。In the above-mentioned genome assembly method for constructing ultra-long continuous DNA sequences, in step S332, the method for resolving conflicts includes: resolving conflicts at the ends of anchor sequence fragments located on different chromosomes according to the information of chromosome grouping; Sequence information to resolve conflicts at the ends of anchoring sequence fragments; or in the construction of ultra-long continuous DNA sequences, one of the two end anchoring sequence fragments that will cause conflicts at the ends of an initial anchoring sequence fragment has been used In the connection of other anchor sequence fragments, the conflict at the ends of the corresponding initial anchor sequence fragments is also resolved accordingly. The data based on these conflict resolution methods are relatively easy to obtain, so that conflicts can be resolved quickly. For example, the chromosomal grouping data has Hi-C or genetic maps, the neighboring information is the BioNano genome optical map, or 10xGenomics data. Minority conflicts can be resolved based on their own information.
与现有技术相比,本发明具有以下优点:Compared with the prior art, the present invention has the following advantages:
(1)本发明通过将所有的已知DNA序列进行两两比较,找出每对序列之间相似的重叠区域,然后从任意一个锚定序列片段的一个自由末端(如Es)开始,用跟其有重叠的随机测序Read序列对该锚定序列片段进行延伸,形成一到多个延长的序列;再对这些延长的序列采用同样的方法利用随机测序Read序列继续进行延伸,每个序列的延伸循环多次,直至遇到能够比对到另一个不同的锚定序列片段末端的随机测序Read序列,则从起始锚定序列片段一端开始的延伸结束,获得连接起始锚定序列片段的一端到另一个或多个不同的终点锚定序列片段末端的一个或多个通路序列,所述的一个或多个通路序列形成序列集合A;根据所述的序列集合A中的通路序列,选择一条序列作为连接起始锚定序列片段末端到另一个终点锚定序列片段末端的有效连接序列;利用所述的有效连接序列连接起始锚定序列片段和 相应的终点锚定序列片段;将连接后的序列片段作为新的锚定序列片段或记录剩余的锚定序列片段的自由末端,不断重复上述步骤,从而最终形成超长连续的DNA序列;通过利用本发明的方法构建出超长连续的DNA序列,更有利于复原整条染色体及整个基因组的序列;(1) In the present invention, all known DNA sequences are compared pair by pair to find similar overlapping regions between each pair of sequences, and then start from a free end (such as Es) of any one anchor sequence fragment, followed by It has overlapping random sequencing Read sequences to extend the anchor sequence fragment to form one or more extended sequences; and then use the same method for these extended sequences to continue the extension using random sequencing Read sequences to extend each sequence. Repeat several times until you encounter a random sequencing Read sequence that can be compared to the end of another different anchor sequence fragment, then the extension ends from one end of the initial anchor sequence fragment, and one end connected to the initial anchor sequence fragment is obtained. One or more pathway sequences at the end of another or more different end-anchor sequence fragments, said one or more pathway sequences forming sequence set A; one is selected according to the pathway sequences in said sequence set A The sequence is used as an effective linking sequence connecting the end of the initial anchoring sequence fragment to the end of the other end anchoring sequence fragment; using said effective linking sequence Connect the initial anchor sequence fragment and the corresponding end anchor sequence fragment; use the ligated sequence fragment as a new anchor sequence fragment or record the free ends of the remaining anchor sequence fragments, and repeat the above steps continuously to form a super Long continuous DNA sequence; by using the method of the present invention to construct an ultra-long continuous DNA sequence, it is more beneficial to restore the entire chromosome and the entire genome sequence;
(2)本发明将随机Read序列形成通路序列,然后对所述的通路序列进行处理,从而大大提高了基因组组装的准确率。而现有技术仅仅进行了Read水平上的处理,将相似的Read序列进行了压缩,使得来源于不同重复区域的Read序列被压缩成了一个,这样本来有差别的重复区域不能分开;而且由于种种原因,比如测序错误、校正错误等,Read水平上的错误也很容易导致压缩错误;而本发明在通路水平上,虽然也有错误,但通路的序列长度比Read序列的长度要长,更容易区分。在通路水平上,即使两个通路序列之间的差别只有1%-2%,也是有可能区分开的,大于2%的则很容易分开,因此本发明相对于现有的SG方法,可以进一步提高基因组组装的准确率;(2) The present invention forms a random Read sequence into a pathway sequence, and then processes the pathway sequence, thereby greatly improving the accuracy of genome assembly. However, the existing technology only performs processing at the Read level and compresses similar Read sequences, so that the Read sequences from different repeating regions are compressed into one, so that the originally different repeating regions cannot be separated; and due to various For reasons such as sequencing errors, correction errors, etc., errors at the Read level can easily cause compression errors; while at the channel level, although there are errors, the length of the sequence of the channel is longer than the length of the Read sequence, making it easier to distinguish . At the channel level, even if the difference between the two channel sequences is only 1% -2%, it is possible to distinguish them, and those larger than 2% are easily separated. Therefore, the present invention can be further compared with the existing SG method. Improve the accuracy of genome assembly;
(3)本发明通过设置序列一致性值或重叠分数或延伸分数,从而根据序列一致性值或重叠分数或延伸分数实现了将基因组中来源于同一个重复序列家庭的不同重复序列拷贝(各个重复序列的相似度大于85%,特别是相似度>98%)的Reads尽可能多地分开,将每个重复序列来源的Reads组装成一个独立的Contig,并跟其两端的锚定序列片段连接起来,最终形成超长连续的DNA序列,同时提高了基因组组装的准确率;(3) In the present invention, by setting sequence identity values or overlap scores or extension scores, according to the sequence identity values or overlap scores or extension scores, copies of different repeat sequences (each repeat from the same repeat sequence family in the genome) (Sequence similarity is greater than 85%, especially similarity> 98%) Reads are separated as much as possible, and the Reads from each repeated sequence source are assembled into an independent Contig, and connected to the anchor sequence fragments at its two ends. , Eventually forming ultra-long continuous DNA sequences, while improving the accuracy of genome assembly;
(4)通过本发明的方法,还实现了对包含多个重复单位的复杂重复区域的组装;(4) Through the method of the present invention, the assembly of a complex repeating region containing a plurality of repeating units is also achieved;
(5)本发明的基因组组装方法还可以用于基因组序列中空白区域的序列填充(将空白区域两端的序列作为锚定序列片段,通过本发明的方法来获得最终的有效连接序列);(5) The genome assembly method of the present invention can also be used for sequence filling of blank regions in a genomic sequence (the sequences at both ends of the blank region are used as anchor sequence fragments, and the final effective linking sequence is obtained by the method of the present invention);
(6)本发明的基因组组装方法,可以实现重复序列区域的基因组组装,也可以实现单拷贝序列区域的组装;(6) The genomic assembly method of the present invention can realize the assembly of genomic regions of repeated sequence regions, as well as the assembly of single-copy sequence regions;
(7)为了验证本发明的效果,发明人还利用本发明的方案对水稻基因组、玉米基因组及人基因组进行了基因组组装试验,具体如下:(7) In order to verify the effect of the present invention, the inventor also performed a genome assembly test on the rice genome, maize genome, and human genome using the scheme of the invention, as follows:
首先,发明人用一个高质量水稻基因组(组装基因组大小390.3Mb, 估算的真实大小不超过394Mb)(Du et al,2017)进行了测试。采用现有的基于OLC的SG类型的组装方法进行组装的结果是:总基因组大小402.5Mb;Contig N50大小1.3Mb。而采用本发明的方法组装后,在总基因组大小略有减小(399.2Mb)的情况下(基因组变小是因为基于OLC的SG类型的组装的结果中有冗余),Contig N50的大小提高到了13.2Mb;利用BioNano(序列相邻信息)基因组光学图谱解决冲突的情况下,Contig N50进一步提高到了14.4Mb,整条染色体8序列组装成了一个Contig。过滤掉非水稻序列后,整个基因组大小是391.6Mb,包含有40个Contig,比原来组装出的参考基因组稍大,主要是着丝粒重复序列的增多造成。采用本发明方法组装的8号染色体序列中包含有以前参考基因组中漏掉的一段约387kb的序列(即采用现有的方法进行组装,碎片多,基因组不全,会漏掉了很多重复序列,难以排列到染色体上,不能组装复杂区域,如图10a、图10b和图10c所示)。在对已知的14个潜在的复杂重复序列区域进行的不完全测试中,本发明可以很容易地组装其中的7个(如图8e、图10b、图10c、图10e、图10f、图10g、和图10h所示)。水稻基因组中本发明方法组装的区域利用其它的二代测序短序列进行质量检测,发现97.21%的短序列能够比对到基因组上,且本发明组装的99.56%的序列能够被二代短序列覆盖,说明本发明组装的序列都是对的。First, the inventors tested with a high-quality rice genome (assembled genome size 390.3Mb, estimated true size does not exceed 394Mb) (Du et al, 2017). The results of assembly using the existing OLC-based SG-type assembly methods are: the total genome size is 402.5Mb; the Contig N50 size is 1.3Mb. After assembly using the method of the present invention, under the condition that the total genome size is slightly reduced (399.2Mb) (the genome becomes smaller because of the redundancy in the result of OLC-based SG-type assembly), the size of Contig N50 increases It reached 13.2Mb; in the case of using BioNano (sequence adjacent information) genome optical map to resolve the conflict, Contig N50 was further increased to 14.4Mb, and the entire chromosome 8 sequence was assembled into a Contig. After filtering out non-rice sequences, the entire genome is 391.6Mb and contains 40 Contigs, which is slightly larger than the original assembled reference genome, mainly due to the increase in centromeric repeat sequences. The chromosome 8 sequence assembled by the method of the present invention contains a sequence of about 387 kb missing from the previous reference genome (that is, assembly using the existing method, there are many fragments, the genome is incomplete, and many repeated sequences are missed, which is difficult. Arranged onto chromosomes, complex regions cannot be assembled, as shown in Figures 10a, 10b and 10c). In incomplete testing of 14 known regions of potentially complex repeats, the present invention can easily assemble 7 of them (as shown in Figure 8e, Figure 10b, Figure 10c, Figure 10e, Figure 10f, Figure 10g). , And shown in Figure 10h). The region assembled by the method of the present invention in the rice genome was subjected to quality inspection using other second-generation sequencing short sequences. It was found that 97.21% of the short sequences can be aligned to the genome, and 99.56% of the assembled sequences of the present invention can be covered by the second-generation short sequences. It shows that the assembled sequences of the present invention are all correct.
其次,如图11b所示,已发表的玉米的参考基因组B73 RefGen_v4(Jiao et al,2017)是用PBcR(基于OLC的SG类)软件进行组装的,其包含有大量(总长90.55Mb)的小Contig序列没有锚定到染色体上,而染色体上的空白序列有约43Mb。对于同样的数据,采用本发明的方法进行基因组组装后,Contig N50大小从1.3Mb增加到61.2Mb,最长的Contig从7Mb增加到140Mb。定位到染色体上的长度从2075.6Mb增加到2104.2Mb(说明本发明的组装准确率更高),空白区域的总数从2,523个下降到76个,不能锚定到染色体上的序列只剩下2.8Mb。除了组装空白序列之外,经过BioNano基因组光学图谱验证发现,如图11c和图11d,本发明的组装结果中也校正了RefGen_v4中的多处错误,包括两个序列方向错误和两个位置错误(现有的方法因为序列短,无法发现这样的错误,但是采用本发明的方法进行组装,序列变长后,以前的序列因为太短而造成的错误就消失 了)。Secondly, as shown in Figure 11b, the published reference genome B73 of Maize RefGen_v4 (Jiao et al., 2017) was assembled using PBcR (OLC-based SG class) software, which contains a large number (90.55Mb total) of small The Contig sequence is not anchored to the chromosome, and the blank sequence on the chromosome is about 43Mb. For the same data, after genomic assembly using the method of the present invention, the size of Contig N50 increased from 1.3Mb to 61.2Mb, and the longest Contig increased from 7Mb to 140Mb. The length mapped to the chromosome increased from 2075.6Mb to 2104.2Mb (indicating that the assembly accuracy of the present invention is higher), the total number of blank areas decreased from 2,523 to 76, and only 2.8Mb of the sequence could not be anchored to the chromosome. . In addition to assembling blank sequences, it was found through BioNano genome optical spectrum verification that, as shown in Figures 11c and 11d, the assembly results of the present invention also corrected multiple errors in RefGen_v4, including two sequence direction errors and two position errors ( The existing method cannot detect such errors because the sequence is short, but using the method of the present invention to assemble and the sequence becomes longer, the errors caused by the previous sequence are too short to disappear).
再次,人的一个基因组HX1(Shi et al,2016)是用FALCON(SG软件)组装的(HX1_FALCON),经本发明的方法改进后,Contig N50从8.3Mb增加到54.4Mb。最长的Contig从38Mb增加到109.8Mb。经比较发现,如图12f所示,在人的参考基因组GRCH38中,有多个空白区域在HX1_FALCON中没有组装出来,但在本发明的组装结果中已得到填充;另外,由图12c和图12e所示,人的参考基因组本身是有缺口的,但是利用本发明的方法进行组装后,对所述缺口进行了填充;此外,由图12d可知,Falcon组装的片段是一个嵌合体;而本发明的组装方法则构建出了正确的序列。Third, one human genome, HX1 (Shi et al., 2016), was assembled using FALCON (SG software) (HX1_FALCON). After improvement by the method of the present invention, Contig N50 increased from 8.3 Mb to 54.4 Mb. The longest Contig increased from 38Mb to 109.8Mb. After comparison, it is found that, as shown in FIG. 12f, in the human reference genome GRCH38, a plurality of blank regions are not assembled in HX1_FALCON, but have been filled in the assembly result of the present invention. In addition, from FIG. 12c and FIG. 12e As shown, the human reference genome itself is gapped, but after the assembly by the method of the present invention, the gap is filled; in addition, it can be seen from FIG. 12d that the fragment assembled by Falcon is a chimera; and the present invention The assembly method builds the correct sequence.
通过以上试验说明了:(1)本发明的基因组组装方法可以构建更长的连续DNA序列;(2)本发明可以用于对基因组序列中的空白区域进行序列填充;(3)本发明的方法可以组装复杂重复序列区域;(4)本发明的基因组组装方法组装的基因组准确率较高。The above experiments show that: (1) the genome assembly method of the present invention can construct a longer continuous DNA sequence; (2) the present invention can be used to sequence fill a blank area in the genome sequence; (3) the method of the present invention Complex repeat sequence regions can be assembled; (4) The accuracy of genome assembly by the genome assembly method of the present invention is high.
附图说明BRIEF DESCRIPTION OF THE DRAWINGS
图1是本发明的一种实施例的方法流程图;FIG. 1 is a flowchart of a method according to an embodiment of the present invention;
图2是获取通路序列子集合的代表性序列及这个子集合的有效通路序列数目的方法流程图;2 is a flowchart of a method for obtaining a representative sequence of a subset of pathway sequences and the number of effective pathway sequences of the subset;
图3是通路序列子集合的分组方法示意图;FIG. 3 is a schematic diagram of a grouping method of a path sequence sub-set; FIG.
图4是选择连接起始锚定序列片段末端(如Es)到另一个终点锚定序列片段末端的有效连接序列的方法示意图;FIG. 4 is a schematic diagram of a method for selecting an effective ligation sequence connecting the end of an initial anchoring sequence fragment (such as Es) to the end of another end anchoring sequence fragment;
图5是来源于不同区域的重复序列拷贝的示意图;基因组上的两个相似的重复序列片段R1、R2,跟其比对的上半部分序列是本地产生的测序Read序列,序列一致性很高,而跟其比对的下半部分序列是另一个重复序列拷贝来源的测序Read序列,带有未配对悬空末端和碱基变异;Figure 5 is a schematic diagram of duplicate sequence copies derived from different regions; two similar repeat sequence fragments R1, R2 on the genome, the upper half of the sequence aligned with it are locally generated Read sequences, the sequence consistency is very high , And the lower half of the sequence aligned with it is another sequencing Read sequence source from another copy of the repeat sequence, with unpaired dangling ends and base mutations;
图6是是重叠的序列及图5对应的重叠图示意图;a,两个重叠的序列,OL,重叠的序列部分;OH,末端未配对悬空的序列部分;EL,延伸的序列部分;b,由重叠的序列组成的一条通路;c,图5中重叠序列的连接示意图(重叠图);C1-C4是锚定序列片段;R1、R2是重复序列;每个序列 有两个末端;U,单考贝序列,UR,单考贝和重复区域的边界序列;FIG. 6 is a schematic diagram of overlapping sequences and the corresponding overlapping diagram of FIG. 5; a, two overlapping sequences, OL, an overlapping sequence portion; OH, an unpaired hanging sequence portion at the end; EL, an extended sequence portion; b, A pathway composed of overlapping sequences; c, the connection diagram of the overlapping sequences in Fig. 5 (overlapping diagram); C1-C4 are anchor sequence fragments; R1 and R2 are repeating sequences; each sequence has two ends; U, Single test sequence, UR, single test sequence and boundary sequence of repeating region;
图7是从多个通路序列中选择一条有效性序列及解决冲突的示意图和举例;c,一个起始锚定序列片段末端到多个终点锚定序列片段末端的连接通路序列示意图;d,显示了不同锚定序列片段末端的有效通路序列数目;e,显示了有冲突的末端;f,显示了e中序列之间的关系;g,利用序列之间的相邻关系(BioNano光学图谱)解决e中的冲突;Figure 7 is a schematic diagram and an example of selecting a valid sequence from multiple pathway sequences and resolving conflicts; c, a schematic diagram of a connecting pathway sequence from the end of a start anchor sequence fragment to the end of a plurality of end anchor sequence fragments; d, showing The number of effective pathway sequences at the ends of different anchoring sequence fragments; e, showing conflicting ends; f, showing the relationship between the sequences in e; g, using the adjacent relationship between the sequences (BioNano optical atlas) to solve conflict in e;
图8为含有多个重复单位的复杂重复序列区域举例;a-b,BioNano基因组光学图谱显示此区域含有两个重复单位,在每对比较的横柱中,下面的横柱是光学图谱,上面的横柱是参考基因组,二者的比较显示了在参考基因组上漏掉了一段序列;c-d,通路序列长度频率分布图,分别对应了a/b中的两个序列;e,显示了b/d中的序列结构,cns1,cns2是两条不同长度的代表性序列,其中两端的线段代表了锚定序列,中间的方框和三角代表了组装出的序列;Figure 8 shows an example of a complex repeat region containing multiple repeat units; ab, the optical spectrum of the BioNano genome shows that this region contains two repeat units. In each pair of comparison bars, the lower bar is the optical spectrum, and the upper bar The bar is the reference genome. A comparison of the two shows that a sequence is missing from the reference genome; cd, the frequency distribution of the length of the pathway sequence, which corresponds to the two sequences in a / b; e, which shows in b / d The sequence structure of cns1 and cns2 are two representative sequences of different lengths, where the line segments at both ends represent the anchoring sequence, and the box and triangle in the middle represent the assembled sequence;
图9是针对同一重复序列家庭中的两个重复序列,采用背景技术和本发明两种不同的方法进行处理的结果示意图;FIG. 9 is a schematic diagram showing the results of processing using two different methods of the background technology and the present invention for two repeated sequences in the same repeated sequence family;
图10是水稻基因组组装结果举例;a-c,表示采用现有方法和本发明方法分别进行水稻基因组组装的结果对比示意图;在每对比较的横柱中,下面的横柱是BioNano光学图谱,上面的横柱是参考基因组;上部的一对代表了采用现有方法组装的结果示意图,下部的一对代表了采用本发明方法组装的结果示意图;e/f/g/h,分别表示多个复杂重复序列区域的通路序列长度频率分布图;Figure 10 is an example of rice genome assembly results; ac is a schematic diagram showing the comparison results of rice genome assembly using the existing method and the method of the present invention. In each pair of comparison bars, the lower bar is the BioNano optical spectrum. The horizontal bar is the reference genome; the upper pair represents the schematic diagram of the result assembled using the existing method, and the lower pair represents the schematic diagram of the result assembled using the method of the present invention; e / f / g / h, respectively, represents multiple complex repeats Frequency distribution map of the channel sequence length in the sequence region;
图11是玉米基因组组装举例;b-d,在每对比较的横柱中,下面的横柱是BioNano光学图谱,上面的横柱是参考基因组,上部显示了采用已有软件PBcR(基于SG方法)对玉米基因组进行组装的结果,下部显示了同一区域采用本发明方法的组装结果;b,PBcR结果中漏掉(中间缺失竖线条的区域)或多出了序列;c,PBcR序列的方向错误在本发明的组装结果中得到改正;d,PBcR序列中的位置错误在本发明的组装结果中得到改正;Figure 11 is an example of maize genome assembly; bd. In each comparison bar, the lower bar is the BioNano optical spectrum, the upper bar is the reference genome, and the upper part shows the pair using the existing software PBcR (based on the SG method). The results of the assembly of the corn genome are shown in the lower part. The assembly results of the same region using the method of the present invention are shown. The assembly result of the invention is corrected; d, the position error in the PBcR sequence is corrected in the assembly result of the invention;
图12是人的基因组组装结果举例;c/e,上部横条代表了人的参考基因组(有缺口),下部横条代表了本发明的方法进行组装的结果(没有缺口);d,Falcon中是一个嵌合体,本发明组装的结果中不包含嵌合体;f,Falcon 结果中有多个空白区域没有组装出来,但在本发明(即HERA Contig629)的组装结果中已得到填充。Figure 12 is an example of human genome assembly results; c / e, the upper horizontal bar represents the human reference genome (with gaps), and the lower horizontal bar represents the results of assembly by the method of the present invention (no gaps); d, in Falcon It is a chimera, and the result of assembly of the present invention does not include the chimera; f, Falcon results have multiple blank areas not assembled, but have been filled in the assembly results of the present invention (ie, HERA Contig629).
下面结合附图和具体实施方式对本发明作进一步的说明。The invention is further described below with reference to the drawings and specific embodiments.
具体实施方式Detailed ways
本发明的实施例:一种构建超长连续DNA序列的基因组组装方法,如图1所示,包括以下步骤:An embodiment of the present invention: a method for assembling a genome to construct an ultra-long continuous DNA sequence, as shown in FIG. 1, includes the following steps:
S1,将所有的已知DNA序列进行两两比较,找出每对序列之间相似的重叠区域;其中,所述的已知DNA序列包括锚定序列片段(即用于锚定的序列片段,比如从DNA序列上截取的某一段或几段特定的序列片段,和/或已经组装好的某一段或几段特定的序列片段,和/或从随机测序Read序列中选出的某一个或几个特定的Read序列等)和随机测序Read序列(为了提高Read序列的准确率,可以先对该Read序列进行校正,也可以不进行校正,直接采用原始的随机测序Read序列;校正的方法包括用其他测序平台得到的测序错误率很低的测序Read序列来校正,也包括利用这个集合中的其他的Read序列来校正;为了提高基因组组装的效率,这里的随机测序Read序列可以部分用组装好的短的Contig序列(而组装好的长的Contig序列则作为锚定序列片段,所述的长短比如可以以50kb为界进行划分)来替代);所述的锚定序列片段至少包括两个;所述的将所有的已知DNA序列进行两两比较,包括将所有的锚定序列片段与所有的随机测序Read序列进行两两比较,以及将所有的随机测序Read序列进行两两比较;具体实施时,可以构建一个重叠图,是由代表已知序列的节点和他们两两之间的序列重叠作为边构建的无向简单图。每条已知序列用两个节点来表示,每个节点代表了一个序列片段末端,而这两个节点之间由一条无向边(在此称之为耦合边)连接;在这个重叠图中,若是两个节点之间有非耦合边的连接,则说明此两个末端之间有重叠,其中的一个可以用来延伸另一个;在遍历图中的通路时,有一个基本要求:在任何时候,进入了一个节点,则必须通过这个节点的耦合边出来(即到达一条已知序列的一个末端节点以后,不能从连接到同一末端节点的其他序列的末端节点出来,而必须从同一序列的另一个末端节点出来,以保证序列的线性延伸);在这个图中,识 别两个不同锚定序列的两个末端之间是不是有连接可以通过深度搜索或广度搜索来实现;S1. Perform a pairwise comparison of all known DNA sequences to find similar overlapping regions between each pair of sequences; wherein the known DNA sequences include anchor sequence fragments (that is, sequence fragments used for anchoring, For example, one or more specific sequence fragments intercepted from the DNA sequence, and / or one or more specific sequence fragments that have been assembled, and / or one or more selected from the random sequencing Read sequence Read sequence, etc.) and random read sequence (in order to improve the accuracy of the Read sequence, the Read sequence can be corrected first, or the original random sequence Read sequence can be used without correction; the correction method includes using The sequencing Read sequence obtained by other sequencing platforms has a low sequencing error rate for correction, and also includes the use of other Read sequences in this set for correction. In order to improve the efficiency of genome assembly, the random sequencing Read sequence here can be partially assembled. Short Contig sequences (and assembled long Contig sequences are used as anchor sequence fragments, and the length can be divided into 50kb boundaries, for example) instead ); Said anchor sequence fragments include at least two; said pairwise comparison of all known DNA sequences, including pairwise comparison of all anchor sequence fragments and all random sequencing Read sequences, and All random sequencing Read sequences are compared pairwise; in specific implementation, an overlapping graph can be constructed, which is an undirected simple graph constructed by the nodes representing a known sequence and the sequence overlap between them as an edge. Each known sequence is represented by two nodes, each node representing the end of a sequence fragment, and the two nodes are connected by an undirected edge (here called a coupling edge); in this overlapping graph If there is a non-coupling edge connection between two nodes, it means that there is overlap between the two ends, and one of them can be used to extend the other. When traversing the path in the graph, there is a basic requirement: When entering a node, it must come out through the coupling edge of that node (that is, after reaching an end node of a known sequence, it cannot come out from the end nodes of other sequences connected to the same end node, but must come from the same sequence. The other end node comes out to ensure the linear extension of the sequence); In this figure, identifying whether there is a connection between the two ends of two different anchoring sequences can be achieved by deep search or breadth search;
S2,从任意一个锚定序列片段的一个自由末端(如Es)节点开始,用跟其有重叠的随机测序Read序列节点对该锚定序列片段进行通路延伸,形成一到多个延长的通路序列;再对这些延长的通路序列采用同样的方法利用随机测序Read序列继续进行延伸,每个序列的延伸循环多次,直至遇到能够连接到另一个不同的锚定序列片段末端节点的随机测序Read序列末端节点,则从起始锚定序列片段一端开始的延伸结束,获得连接起始锚定序列片段的一端到另一个或多个不同的终点锚定序列片段末端的一个或多个通路序列(如图7c的例子所示),所述的一个或多个通路序列形成序列集合A(即序列集合A中的通路序列连接了起始锚定序列片段末端Es到其它一或多个不同的终点锚定序列片段末端Ee1,…,Eek);S2, starting from a free end (such as Es) node of any anchor sequence fragment, and using the random sequencing Read sequence node that overlaps with it to extend the pathway of the anchor sequence fragment to form one or more extended pathway sequences ; Then use the same method for these extended pathway sequences to continue the extension using random sequencing Read sequence, each sequence is extended multiple times until it encounters a random sequencing Read that can be connected to the end node of another different anchor sequence fragment At the end of the sequence, the extension ends from the end of the starting anchor sequence segment to obtain one or more pathway sequences connecting the end of the starting anchor sequence segment to the end of another or more different end anchor sequence segments ( As shown in the example of FIG. 7c), the one or more pathway sequences form sequence set A (that is, the pathway sequences in sequence set A connect the end of the starting anchor sequence segment Es to one or more other endpoints. Anchor sequence fragment ends Ee1, ..., Eek);
S3,根据所述的序列集合A中的通路序列,选择最多一条序列作为连接起始锚定序列片段末端(如Es)到另一个终点锚定序列片段末端的有效连接序列(从一个起始锚定序列片段末端开始可以不存在有效连接序列);S3. According to the pathway sequence in the sequence set A, a maximum of one sequence is selected as a valid linking sequence (from one starting anchor) connecting the end of the initial anchor sequence fragment (such as Es) to the end of another end anchor sequence fragment. There may be no effective linking sequence at the end of a given sequence fragment);
S4,利用所述的有效连接序列连接起始锚定序列片段(如末端Es)和相应的终点锚定序列片段;将连接后的序列片段作为新的锚定序列片段或记录剩余的锚定序列片段的自由末端,转到S2;不断重复步骤S2-S4,从而最终形成超长连续的DNA序列。S4. Use the effective linking sequence to connect the starting anchor sequence fragment (such as end Es) and the corresponding end anchor sequence fragment; use the linked sequence fragment as a new anchor sequence fragment or record the remaining anchor sequence The free end of the fragment is transferred to S2; steps S2-S4 are continuously repeated, so as to finally form an ultra-long continuous DNA sequence.
对于上述的锚定序列片段,所述的从随机测序Read序列中选出的某一个或几个特定的Read序列,可通过以下方法之一进行筛选:For the above-mentioned anchor sequence fragments, one or more specific Read sequences selected from the randomly sequenced Read sequences can be selected by one of the following methods:
1)将所有的随机测序Read序列进行两两比较,若是某一个Read序列跟其他多个Read序列(比如平均测序深度的1/3以上)有重叠,但没有末端未配对悬空序列,另外也跟多个其他Read序列有重叠(比如平均测序深度的1/3以上),但在此序列的同一端有末端未配对悬空序列,说明这个序列是一个位于单考贝和重复序列边界上的序列。把能够比对到此边界序列的单考贝端的所有Read序列去重,保留一条单考贝端所有重叠区域的平均重叠分数最高的Read序列,作为锚定序列片段。1) Compare all random sequencing Read sequences in pairs. If a Read sequence overlaps with other multiple Read sequences (such as more than 1/3 of the average sequencing depth), but there is no unpaired hanging sequence at the ends, it also follows Multiple other Read sequences overlap (for example, more than 1/3 of the average sequencing depth), but there is a terminal unpaired dangling sequence at the same end of this sequence, indicating that this sequence is a sequence located on the boundary of a single test and repeat sequence. Deduplicate all Read sequences that can be compared to the single test end of this boundary sequence, and retain a Read sequence with the highest average overlap score in all overlapping regions of the single test end as the anchor sequence segment.
2)根据每个Read序列的末端重叠的Read序列数目,把高于平均深度的序列作为重复序列,把低于或等于平均深度的序列作为单考贝序列;取 一条其所有重叠区域平均重叠分数最高的单考贝序列,向两边延伸,直到遇到一个标记为重复序列的Read序列停止。将此延伸后的序列离末端最近的两个单考贝Read序列作为锚定序列片段。2) According to the number of Read sequences overlapping at the end of each Read sequence, a sequence with an average depth above is used as a repeat sequence, and a sequence below or equal to the average depth is used as a single test sequence; take an average overlap score of all its overlapping areas The highest single test sequence extends to both sides until it stops when it encounters a Read sequence marked as a repeated sequence. The two single-column Read sequences closest to the end of this extended sequence were used as anchor sequence fragments.
3)将比对到参考基因组上任意一个区域的Read序列作为锚定序列片段。3) The Read sequence aligned to any region on the reference genome is used as the anchor sequence fragment.
为了提高基因组组装的效率和准确度,步骤S2中,在选择候选延伸序列之前,还可包括:设定一个全局序列相似性最低阈值SImin;对任一序列X来说,首先判断跟其重叠的序列在重叠区域的序列相似性值是否大于等于所述最低阈值SImin,如果是,则选用这些重叠序列来延伸序列X,否则放弃选用这些重叠序列来延伸序列X。In order to improve the efficiency and accuracy of genome assembly, in step S2, before selecting a candidate extension sequence, the method may further include: setting a global sequence similarity minimum threshold SImin; for any sequence X, first determine which overlaps it with Whether the sequence similarity value of the sequence in the overlapping region is greater than or equal to the minimum threshold SImin, and if so, these overlapping sequences are selected to extend the sequence X, otherwise these overlapping sequences are not selected to extend the sequence X.
所述的全局序列相似性最低阈值SImin可参考全基因组水平上的测序Read序列准确率值α进行设定(比如设定SImin=1-(1-α)*3),其中,所述的全基因组水平上的测序Read序列准确率值α通过以下方式计算获得:取已知的每条序列的具有最高重叠分数的重叠序列,最多取平均测序深度的条数;计算所有重叠区域的平均序列一致性值,作为全基因组水平上的测序Read序列准确率值α。The minimum global sequence similarity threshold SImin can be set by referring to the sequencing read sequence accuracy value α at the whole genome level (for example, setting SImin = 1- (1-α) * 3), where the total The sequencing read sequence accuracy value α at the genome level is calculated by: taking the known overlapping sequences with the highest overlap score for each sequence, taking the maximum number of average sequencing depths; calculating the average sequence consistency of all overlapping regions The sex value is used as the sequencing read sequence accuracy value α at the genome-wide level.
具体实施时,所述的全局序列相似性最低阈值SImin也可凭经验设定或是随意设定。比如,采用校正后的随机Read序列作为延伸序列,采用组装好的Contig作为锚定序列。因此在实施时,可采用一个固定的Simin值97%,效果已足够好。因为一般校正后的随机测序Read序列的准确率在99%左右。In specific implementation, the minimum global sequence similarity threshold SImin may also be set empirically or arbitrarily. For example, a corrected random Read sequence is used as the extension sequence, and an assembled Contig is used as the anchor sequence. Therefore, in the implementation, a fixed Simin value of 97% can be used, and the effect is good enough. Because generally the accuracy of the read sequence after random sequencing is about 99%.
为了可以延伸出更长的连续的DNA序列,步骤S2中,对序列末端(一个序列有两个末端,每个末端可定义为一段特定长度(比如1-50kb)的序列)进行延伸时,每一步都可选择重叠分数最高的序列;或者延伸分数最高的序列;或者随机选择一个序列;或者为上述任意两种或上述三种方式的结合;其中,随机选择序列时,任意一个序列被选中的概率根据其重叠分数或延伸分数确定(比如所述的概率可以是:这条序列的分数/可以用作延伸的所有序列的分数总和)。In order to extend a longer continuous DNA sequence, in step S2, when the end of the sequence (a sequence has two ends, each end can be defined as a sequence of a specific length (such as 1-50kb), each time One step can select the sequence with the highest overlap score; or the sequence with the highest extension score; or randomly select a sequence; or a combination of any two or three of the above methods; where a sequence is randomly selected, any one of the sequences is selected The probability is determined according to its overlap score or extension score (for example, the probability may be: the score of this sequence / the sum of the scores of all sequences that can be used as extension).
对于末端重叠的两条序列X1和X2(其中一条序列没有被另一条序列完全覆盖),其重叠区域的重叠分数OS为:OS=(OL1+OL2)*SI/2;其中, OL1,OL2分别为序列X1和X2中其重叠区域的长度,SI为序列X1和X2之间的重叠区域的序列一致性值(Sequence Identity);序列一致性值的计算一般包括错配和插入及缺失的碱基,但若是没有经过校正的原始Read序列,计算此值时,可以是:错配的碱基数目/(配对的碱基数目+错配的碱基数目);X2对X1的延伸分数ES2为:ES2=OS+EL2/2-(OH1+OH2)/2,其中OH1,OH2分别是两条序列末端未配对悬空(Overhang)区域的长度,EL2是X2对X1的延伸长度(类似,X1对X2的延伸分数也可以用同样的方式计算出来)。For two sequences with overlapping ends X1 and X2 (one of which is not completely covered by the other sequence), the overlap score OS of the overlapping region is: OS = (OL1 + OL2) * SI / 2; where OL1 and OL2 are respectively Is the length of the overlapping region in sequences X1 and X2, SI is the sequence identity value (Sequence Identity) of the overlapping region between sequences X1 and X2; the calculation of sequence identity value generally includes mismatches and insertions and deletions of bases , But if the original Read sequence has not been corrected, when calculating this value, it can be: number of mismatched bases / (number of matched bases + number of mismatched bases); the extension score ES2 of X2 to X1 is: ES2 = OS + EL2 / 2- (OH1 + OH2) / 2, where OH1 and OH2 are the lengths of the unpaired overhang regions at the ends of the two sequences, and EL2 is the extension of X2 to X1 (similarly, X1 to X2 The elongation score can be calculated in the same way).
具体实施时,对于延伸分数的定义,也可以只考虑延伸长度,但效果不一定好。In specific implementation, for the definition of the extension score, only the extension length can be considered, but the effect is not necessarily good.
步骤S3包括:Step S3 includes:
S31,将所述的序列集合A按照终点锚定序列片段的不同,划分为一或多个通路序列子集合A1,A2,…,Ak(其中,子集合A1中所有的通路序列都连接到起始锚定序列片段末端如Es和终点锚定序列片段末端如Ee1,其它的子集合以此类推),每个子集合中包括一条或多条通路序列;S31. The sequence set A is divided into one or more channel sequence sub-sets A1, A2, ..., Ak according to different end-anchor sequence fragments (where all the channel sequences in the sub-set A1 are connected to each other). The ends of the initial anchor sequence segment are Es and the ends of the end anchor sequence segment are Ee1, and so on for other subsets. Each subset includes one or more pathway sequences.
S32,根据每个通路序列子集合Ai中的通路序列获得一条序列作为这个子集合的代表性序列并计算这个子集合的有效通路序列数目,其中,1≤i≤k;S32. Obtain a sequence as a representative sequence of the subset according to the channel sequence in each of the channel sequence subsets Ai and calculate the number of valid channel sequences of the subset, where 1 ≦ i ≦ k;
S33,在所有的通路序列子集合的代表性序列中,选择最多一条作为连接起始锚定序列片段末端(如Es)到另一个终点锚定序列片段末端的有效连接序列。S33. Among the representative sequences of all the pathway sequence subsets, a maximum of one is selected as a valid linking sequence connecting the end of the initial anchor sequence fragment (such as Es) to the end of the other end anchor sequence fragment.
具体实施时,还可以:将所述的序列集合A按照终点锚定序列片段的不同,划分为一或多个通路序列子集合A1,A2,…,Ak(其中,子集合A1中所有的通路序列都连接到起始锚定序列片段末端如Es和终点锚定序列片段末端如Ee1,其它的子集合以此类推),每个子集合中包括一条或多条通路序列;然后直接选取最大子集合中长度出现频率最高的任意一条通路序列作为连接起始锚定序列片段末端(如Es)到相应终点锚定序列片段末端的有效连接序列;如果最大的子集合有多个时,则随机选一个最大的子集合。In specific implementation, the sequence set A may also be divided into one or more channel sequence sub-sets A1, A2, ..., Ak (where all the paths in the sub-set A1 are based on different end-anchor sequence fragments). The sequences are connected to the end of the starting anchor sequence fragment such as Es and the end of the anchor anchor sequence fragment such as Ee1, and other subsets, and so on), each subset includes one or more pathway sequences; then directly select the largest subset Any channel sequence with the highest frequency of mid-length is used as a valid linking sequence connecting the end of the initial anchor sequence fragment (such as Es) to the end of the corresponding end anchor sequence fragment; if the largest subset is more than one, randomly select one The largest sub-collection.
在所有的有连接通路的锚定序列末端对之间的有效通路序列数目都确定了后,可以用所有的锚定序列末端作为节点,用节点之间的代表性序列 作为边,构建一个无向连接图,用有效通路数作为边的长度。After the number of effective pathway sequences between the end pairs of all anchoring sequences with connected pathways is determined, the ends of all anchoring sequences can be used as nodes and the representative sequences between nodes as edges to construct an undirected The connection graph uses the number of effective paths as the length of the edges.
步骤S32可包括(如图2所示):Step S32 may include (as shown in FIG. 2):
S321,将每个通路序列子集合Ai分成一或多个组Ai1,…Aig,其中1≤i≤k;S321. Divide each channel sequence subset Ai into one or more groups Ai1, ... Aig, where 1 ≦ i ≦ k;
S322,从每个组中选出序列长度的出现频率为最高值及小于最高值一定范围的序列(比如选出序列长度的出现频率最高值到最高值一半的所有序列),形成序列集合Bi1,…,Big,其中序列集合Bi1与Ai1对应,其他集合以此类推;S322. From each group, select sequences with the highest occurrence frequency of the sequence length and a range smaller than the highest value (for example, select all sequences with the highest occurrence frequency of the sequence length and half the highest value) to form a sequence set Bi1, …, Big, where the sequence set Bi1 corresponds to Ai1, and the other sets can be deduced by analogy;
S323,将序列集合Bij中的所有序列进行两两比较,若是两条序列之间短的序列被覆盖超过一定比例(比如90%以上),则此两条序列被认为是相似性序列;选出所有能比对到最多条相似性序列的序列;若有多条序列的相似性序列的数目最高且相同,则选择序列长度的出现频率为最高的任意一条序列,作为Aij的代表性序列,并记录Bij中与此代表性序列相似的序列数目作为序列组Aij的有效通路序列数目;其中,1<=j<=g;S323: Compare all sequences in the sequence set Bij pair by pair. If the short sequence between the two sequences is covered by more than a certain ratio (such as more than 90%), the two sequences are considered as similar sequences; All sequences that can match up to the most similar sequences; if there are multiple sequences with the highest number of similar sequences and the same, select any sequence with the highest frequency of sequence length as the representative sequence of Aij, and Record the number of sequences similar to this representative sequence in Bij as the number of effective pathway sequences in sequence group Aij; where 1 <= j <= g;
S324,将序列子集合Ai中的各组按照序列长度从左到右、从短到长进行排列,从具有最高长度频率峰值的一个组开始,跟其右边通路序列更长的第一个组比较,若是左边组中总的有效通路序列数目高出右边组中总的有效通路序列数目一定比例(比如两倍以上),则设左边组的代表性序列为序列子集合Ai的代表性序列,寻找序列子集合Ai代表性序列的过程停止;否则,暂设右边组的代表性序列为序列子集合Ai的代表性序列,然后设此组为左边组,跟其第一个右边组比较,重复上述过程,直到右边组中总的有效通路序列数目低于左边组中总的有效通路序列数目一定比例(比如50%以下)为止;从而确定了序列子集合Ai的代表性序列和其所连接的一对相应的锚定序列片段末端(如Es和Eei)之间的有效通路序列数目(即对应序列组的有效通路序列数目,如NPsi,如图7d的举例所示)S324. The groups in the sequence sub-set Ai are arranged according to the sequence length from left to right and from short to long, starting from the group with the highest length frequency peak and comparing with the first group with the longer channel sequence on the right side. If the total number of effective pathway sequences in the left group is higher than the total number of effective pathway sequences in the right group by a certain proportion (for example, more than twice), then set the representative sequence of the left group as the representative sequence of the sequence sub-set Ai to find The process of the representative sequence of the sequence sub-set Ai stops; otherwise, temporarily set the representative sequence of the right group as the representative sequence of the sequence sub-set Ai, then set this group as the left group, compare it with its first right group, and repeat the above Process until the total number of effective pathway sequences in the right group is lower than the total number of effective pathway sequences in the left group by a certain percentage (for example, less than 50%); thus, the representative sequence of the sequence sub-set Ai and the connected sequence are determined. For the number of effective pathway sequences between the corresponding anchor sequence segment ends (such as Es and Eei) (that is, the number of effective pathway sequences of the corresponding sequence group, such as NPsi, as shown in the example of Figure 7d) (Shown)
在具体实施时,也可以不分组,选择长度出现频率最高的任意一条,或是长度的出现频率最高且跟别的序列的相似性最高的任意一条作为序列子集合的代表性序列;有效通路序列数可以直接选所有的通路,不看相似性。In specific implementation, it is not necessary to select any one of the most frequent occurrences of length, or any of the most frequent occurrences of the length and the highest similarity with other sequences as the representative sequence of the sequence subset; You can directly select all the channels without looking at the similarity.
本发明中,可以通过以下方式进行分组(如图3所示):In the present invention, grouping can be performed in the following ways (as shown in Figure 3):
S3211,将通路序列子集合Ai中的通路序列按照从短到长、从左到右(左短右长)的顺序进行排列;S3211, arrange the pathway sequences in the pathway sequence sub-set Ai in order from short to long and left to right (left short right long);
S3212,按照相同的长度差异(如1kb)将通路序列子集合Ai中的通路序列划分成多个不重叠的小窗口,并计算每个窗口中包含的序列总数;S3212. Divide the channel sequences in the channel sequence subset Ai into multiple non-overlapping small windows according to the same length difference (such as 1 kb), and calculate the total number of sequences contained in each window;
S3213,将每个窗口包含的序列总数与其直接相邻的两个窗口对应的序列总数分别进行比较;若比两边的数值都大,则该窗口为一个高峰窗口,若比两边的数值都小,则该窗口为一个谷底窗口;其中,若是一边没有窗口则此边的序列总数设为0;S3213: The total number of sequences contained in each window is compared with the total number of sequences corresponding to the two windows immediately adjacent to it; if the value is larger than both sides, the window is a peak window, and if the value is smaller than both sides, The window is a valley window; if there is no window on one side, the total number of sequences on this side is set to 0;
S3214,计算出所有的高峰窗口和谷底窗口;若是一个谷底窗口中序列长度的出现频率最小值小于其右边最近的一个高峰窗口中序列长度的出现频率最大值的某一特定比例(比如4/5)时,则用此谷底窗口中的出现频率最低的序列长度把两边的通路序列分成两组;以此类推,通路序列子集合Ai被分成一或多个组。S3214. Calculate all the peak and valley windows; if the minimum value of the frequency of the sequence length in a valley window is less than the maximum value of the frequency of the sequence length in the nearest peak window to the right of a certain ratio (such as 4/5 ), The channel sequence on both sides is divided into two groups by using the lowest frequency sequence length in the valley window; and so on, the channel sequence subset Ai is divided into one or more groups.
分组的关键是根据谷底值将多个峰值分开,若是通路长度的分布很窄,比如最短的通路和最长的通路之间的距离<10kb,则没有必要进行分组,当作一个组处理即可;所以按照窗口法分组时,每个窗口没有必要设定的太小,当然大了也没有必要,1kb即可。谷底窗口中长度频率最小值和高峰窗口中长度频率最高值的比例很重要,若是分组太多,会造成背景干扰,若是分组太少,会导致找不到正确的通路。The key to grouping is to separate multiple peaks according to the bottom value. If the distribution of the path length is narrow, for example, the distance between the shortest path and the longest path is less than 10kb, it is not necessary to group and treat them as a group. ; So when grouping according to the window method, it is not necessary to set each window too small, and of course it is not necessary to set it to 1kb. The ratio of the minimum length frequency in the valley window to the maximum length frequency in the peak window is very important. If there are too many groups, it will cause background interference. If there are too few groups, it will not find the correct path.
步骤S33中,通过以下方式来选择最多一条序列作为连接起始锚定序列片段末端(如Es)到另一个终点锚定序列片段末端的有效连接序列(如图4所示):In step S33, a maximum of one sequence is selected as a valid linking sequence connecting the ends of the initial anchoring sequence fragment (such as Es) to the end of the other end anchoring sequence fragment (as shown in FIG. 4):
S331,从序列集合A的各个通路序列子集合的代表性序列(一条代表性序列连接起始锚定序列片段末端Es到一个不同的终点锚定序列片段末端Eei)所对应的有效通路序列数目中,选取有效通路序列数目的最大值与第二大值,计算相应的起始锚定序列片段末端的冲突指数为:CIs=NPs n/NPs m,其中,CIs表示相应的起始锚定序列片段末端Es的冲突指数,NPs m为从起始锚定序列片段末端Es连接到其他不同的终点锚定序列片段末端的最大的有效通路序列数目,NPs n为从起始锚定序列片段末端Es连接到其他不同的终点锚定序列片段末端的第二大的有效通路序列数目;其中,NPs m≥NPs n; 若是冲突指数超出了阈值(比如0.75),则相应的起始锚定序列片段末端被称之为一个冲突末端(如图7e的例子所示的冲突末端,图7f显示了图7e中序列之间的关系); S331. From the representative sequences of each pathway sequence subset of sequence set A (a representative sequence connects the starting anchor sequence fragment end Es to a different end anchor sequence fragment end Eei), , Select the maximum value and the second largest value of the number of effective pathway sequences, and calculate the conflict index at the end of the corresponding initial anchor sequence fragment: CIs = NPs n / NPs m , where CIs represents the corresponding initial anchor sequence fragment Conflict index of terminal Es, NPs m is the maximum number of effective pathway sequences connected from the end of the starting anchor sequence fragment Es to other different end of the anchor sequence fragment ends, and NPs n is the link from the end of the starting anchor sequence fragment Es The second largest number of effective pathway sequences to the ends of other different end-anchor sequence fragments; where, NPs m ≥ NPs n ; if the collision index exceeds a threshold (such as 0.75), the end of the corresponding initial anchor sequence fragment is Call it a collision end (as shown in the example of FIG. 7e, FIG. 7f shows the relationship between the sequences in FIG. 7e);
S332,对于不存在冲突的锚定序列片段末端,则在其通路序列集合A的所有子集合的代表性序列中,选择相应的有效通路序列数目的最大值所对应的通路序列子集合的代表性序列作为序列集合A最终的代表性序列,即获得了连接起始锚定序列片段末端(如Es)到另一个终点锚定序列片段末端的有效连接序列;对于存在冲突的锚定序列片段末端,若是此冲突能够被解决,则选择由解决方法所决定的末端所对应的代表性序列作为连接起始锚定序列片段末端到另一个终点锚定序列片段的有效连接序列(如图7g即利用序列之间的相邻关系解决如图7e中序列末端的冲突);若是此冲突不能被解决,则没有找到此锚定序列片段末端跟其它任一锚定序列片段连接的有效连接序列,转到步骤S2。S332. For the end of the anchor sequence fragment that does not have a conflict, among the representative sequences of all the subsets of the channel sequence set A, select the representativeness of the channel sequence sub-set corresponding to the maximum value of the corresponding effective channel sequence number. The sequence is the final representative sequence of sequence set A, that is, a valid linking sequence connecting the end of the initial anchor sequence fragment (such as Es) to the end of the other end anchor sequence fragment is obtained. For the ends of the anchor sequence fragment in conflict, If this conflict can be resolved, the representative sequence corresponding to the end determined by the solution method is selected as the effective linking sequence connecting the end of the starting anchor sequence fragment to another end anchor sequence fragment (as shown in Figure 7g. The adjacent relationship between them resolves the conflict at the end of the sequence as shown in Figure 7e); if this conflict cannot be resolved, then no valid link sequence is found at the end of this anchor sequence segment and any other anchor sequence segment, go to step S2.
具体实施时,也可以不计算冲突值,总是选取具有最高有效通路数目的代表性序列进行连接,若是有两个最高有效通路数目一样的,则随机连接一个代表性序列。这样会造成一些嵌合体,但也有很多正确的序列。In specific implementation, it is not necessary to calculate the collision value, and always select a representative sequence with the highest number of effective paths for connection. If there are two highest effective paths, the representative sequence is randomly connected. This will result in some chimeras, but also many correct sequences.
在具体实施时,可以利用连接图进行判断,若是一个起始锚定序列末端连接到多个不同终点锚定序列的末端,其中最长的两条边长度(有效通路数)相似,则说明这个起始锚定序列末端通过相似的重复序列连接到了两个其他的终点起始锚定序列末端;非冲突的末端可以按照通路数目的多少从大到小连接起来;而冲突末端不用于连接到其他末端,直到这个冲突能够得到解决。In the specific implementation, the connection graph can be used to judge. If the end of a starting anchor sequence is connected to the ends of multiple different end anchor sequences, and the longest two sides have the same length (the number of effective paths), then this is explained. The ends of the starting anchor sequence are connected to two other end points by similar repeats; the ends of the non-conflicting ends can be connected from large to small according to the number of pathways; the conflicting ends are not used to connect to other End until this conflict can be resolved.
步骤S332中,解决冲突的方法包括:根据染色体分组的信息解决位于不同染色体上的锚定序列片段末端的冲突;或根据已知相邻的序列信息,从而解决锚定序列片段末端的冲突;或在构建超长连续DNA序列的过程中,将引起某起始锚定序列片段末端冲突的两个终点锚定序列片段末端中的一个已经用于其他锚定序列片段的连接中,则对应的起始锚定序列片段末端的冲突也相应得到解决。In step S332, the method for resolving conflicts includes: resolving conflicts at the ends of anchor sequence fragments located on different chromosomes according to the information of chromosome grouping; or resolving conflicts at the ends of anchor sequence fragments based on known adjacent sequence information; or In the process of constructing ultra-long continuous DNA sequences, one of the two end-point anchoring sequence fragments that cause a conflict between the ends of an initial anchoring sequence fragment has already been used in the connection of other anchoring sequence fragments. Conflicts at the ends of the initial anchor sequence fragments are also resolved accordingly.
本发明通过对通路序列子集合进行分组,从而可以在有多个重复单位的情况下,找出正确的通路序列(而不分组不能解决多个重复单位的复杂 序列),进而实现了对包含多个重复单位的复杂重复区域的组装。如图8所示,为含有多个重复单位的复杂重复序列区域举例;由图8a、图8b可知,基因组光学图谱显示此区域含有两个重复单位,而参考基因组序列中只含有一个重复单位;图8c和图8d,分别为采用本发明的上述方法获得的对应的通路序列长度频率分布图,分别对应了图8a和图8b中的两个序列;图8e中显示了对于图8b/图8d中的序列结构,采用本发明的方法所获得的cns1,cns2两条不同长度的代表性序列,其中的线段代表了锚定序列,中间的方框和三角代表了组装出的序列,说明了本发明可以实现对包含多个重复单位的复杂序列进行组装,而现有的方法组装出的序列不完整,会漏掉一些。By grouping the path sequence sub-sets in the present invention, the correct path sequence can be found in the case of multiple repeating units (without grouping, the complex sequence of multiple repeating units cannot be solved), thereby achieving the inclusion of multiple repeating units. Of complex repeating regions of two repeating units. Figure 8 shows an example of a complex repeat region containing multiple repeat units. As can be seen from Figures 8a and 8b, the genomic optical spectrum shows that this region contains two repeat units, while the reference genome sequence contains only one repeat unit; FIG. 8c and FIG. 8d are the corresponding frequency distribution diagrams of the length of the channel sequence obtained by using the above method of the present invention, respectively, corresponding to the two sequences in FIG. 8a and FIG. 8b; FIG. In the sequence structure in the present invention, two representative sequences of different lengths of cns1 and cns2 obtained by the method of the present invention, wherein the line segment represents the anchor sequence, and the middle box and triangle represent the assembled sequence. The invention can realize the assembly of complex sequences containing multiple repeating units, but the sequences assembled by the existing methods are incomplete and some will be missed.
本发明的一种实施例的工作原理:The working principle of an embodiment of the present invention:
如图5(基因组上的两个重复序列片段R1、R2序列一致性很高,但还是有些碱基差异导致他们不完全一致;上半部分比对的序列是本地产生的测序Read序列,跟所比较的序列之间的差异很小;下半部分比对的序列带有未配对悬空末端,是不同的重复序列拷贝来源的测序Read序列,跟所比较的序列之间差异较大)、图6(一个重叠图(部分)举例,对应了图5中的序列区域及测序Read;C1-C4是锚定序列片段;R1、R2是重复序列;每个序列有两个末端;U:单考贝序列,UR,单考贝和重复区域的边界序列)所示,来源于不同区域的重复序列拷贝(所述的重复序列拷贝为相似的序列,属于同一个重复序列家庭)是有差别的。本发明可以实现将来源于上述有差异的重复序列拷贝的Read序列分别进行组装,形成单独的Contig序列,然后再与两端的锚定序列片段进行连接,最终形成一个超长连续的DNA序列。Figure 5 (The two R12 and R2 sequences in the genome have high sequence identity, but there are still some base differences that cause them to be inconsistent; the upper half of the aligned sequence is a locally generated Read sequence, followed by the The differences between the compared sequences are small; the aligned sequences in the lower half have unpaired dangling ends, which are sequencing Read sequences from different copies of the repeat sequence, which are significantly different from the compared sequences), Figure 6 (An overlay (partial) example, corresponding to the sequence region and sequencing read in Figure 5; C1-C4 are anchor sequence fragments; R1 and R2 are repeat sequences; each sequence has two ends; U: single test shell As shown in the sequence, UR, single-column, and boundary sequences of repeating regions), there are differences in the repeating sequence copies (the repeating sequence copies are similar sequences belonging to the same repeating sequence family) derived from different regions. The present invention can assemble the Read sequences derived from the above-mentioned differential repeat sequence copies to form separate Contig sequences, and then ligate the anchor sequence fragments at both ends to form an ultra-long continuous DNA sequence.
本发明的基因组组装方法和现有的SG组装方法之间最关键的区别是本发明把一个重复序列区域完整地来处理,而SG方法是把重复区域进行了Read水平上的处理,相似的Read进行了压缩,不同的重复区域被压成了一个,这样本来有差别的重复区域不能分开。因为种种原因,比如测序错误,校正错误,等等,Read水平上的错误很容易导致压缩错误,而在通路水平上,虽然也有错误,但通路的代表性序列比Read要长,更容易区分。在通路水平上,即使两个通路之间序列差别只有1%-2%,也是有可能区分 开的,大于2%的则很容易分开。Read之间的重叠已包含了可能的序列上的差异。在延伸时,不同重复拷贝来源的Read不容易连接到一起。反过来说,若是在阈值之上,他们还连接到了一起,则说明这些Read无法区分,这两个区域就会产生冲突。若是只是部分重复区域相似,也会造成冲突,但可以通过解决冲突的方法来解决。The most critical difference between the genomic assembly method of the present invention and the existing SG assembly method is that the present invention completely processes a repeat sequence region, while the SG method processes the repeat region at the Read level. Similar Read After compression, different repeating regions are compressed into one, so that the repeating regions that are different from each other cannot be separated. For various reasons, such as sequencing errors, correction errors, and so on, errors at the Read level can easily lead to compression errors, and at the channel level, although there are errors, the representative sequence of the channel is longer than Read and is easier to distinguish. At the pathway level, even if the sequence difference between the two pathways is only 1% -2%, it is possible to distinguish them, and those greater than 2% can be easily separated. The overlap between Reads already contains possible sequence differences. When extended, Reads from different duplicate copy sources are not easily linked together. Conversely, if they are connected above the threshold, it means that these Reads cannot be distinguished, and the two areas will conflict. Conflicts can also occur if only partially overlapping regions are similar, but can be resolved by resolving conflicts.
若是两个重复序列(或是长于Read长度的某个区域)之间的相似度非常高,比如>99%,从两个地区产生的Read之间很难区分,这样就会导致连接一对Contig的Path会混杂不同来源的Reads。然而,若是在相似度低的重复序列之间,比如<97%,来源于同一地区的Read之间序列相似度高,重叠分数就高,而不同来源的Read之间相似度低,重叠分数就低,这样不同重复拷贝之间形成的通路就可以区分开来。若是限定了每个重叠的最低序列一致性,比如97%,则在寻找通路时,低于这个值的重叠会被滤掉,这样形成的通路就会只包括正确的通路,不会形成其他重复拷贝形成的通路。举例来说,对于长Read来说,Read之间的重叠可以设定一个最低值,比如1kb。若是两个重复拷贝之间的序列相似性<=99%,且序列上的差异是均匀分布的,这样在1kb的重叠区域内若是不同拷贝之间的Read就会有至少10个单核苷酸多态性位点(SNP),而同一拷贝之间的Read就会没有SNP。这些Read是很容易区分的。在实践中,由于序列之间的差异不是完全均匀分布的,或是由于测序错误没有得到全部校正,会导致误差,但总的来说,绝大多数<99%的重复序列拷贝是很容易区分的。目前单分子测序产生的原始Read的平均错误率在10%-15%。通过自我校正后的Read平均错误率大大降低,比如很多Read的错误率能降低到1%以下。对于绝大多数基因组上相似度小于95%的重复序列来说,其产生的Read在校正后相似度都不会高于95%(少数情况下由于校正错误,会产生更相似的序列),因而在序列比对及组装的过程中都很容易区分,因此利用现有软件进行组装时基本上不会造成问题。本发明需要处理的主要是这些相似度大于95%的序列。对所有的跟某个Congtig(Ni)有连接的所有Contig来说,只有一个Contig(Nj)是其相邻的。本发明的目的就是通过比较Ni到Nj的通路数量及质量来判断Ni跟哪个Nj是相邻的。若是只比较一条通路,较易受到偶然因素的影响。If the similarity between two repeated sequences (or a region longer than the Read length) is very high, such as> 99%, it is difficult to distinguish between Reads generated from the two regions, which will cause a pair of Contigs to be connected. Path will mix Reads from different sources. However, if the repeats with low similarity, such as <97%, the sequence similarity between Reads from the same region is high, the overlap score is high, and the reads from different sources are low, the overlap score is Low, so that the pathways formed between different duplicates can be distinguished. If the minimum sequence identity of each overlap is defined, such as 97%, when searching for a path, the overlaps below this value will be filtered out, so that the formed path will only include the correct path and will not form other duplications Pathways formed by copying. For example, for long Reads, the overlap between Reads can be set to a minimum value, such as 1kb. If the sequence similarity between two duplicate copies is <= 99%, and the sequence differences are evenly distributed, then there will be at least 10 single nucleotides in the 1kb overlap region if there are Reads between different copies. Polymorphic sites (SNPs), and Reads between the same copies will be free of SNPs. These Reads are easy to distinguish. In practice, because the differences between sequences are not completely evenly distributed, or because sequencing errors are not fully corrected, errors will result, but in general, the vast majority of duplicate copies of <99% are easy to distinguish of. At present, the average read rate of raw reads produced by single-molecule sequencing is between 10% and 15%. After self-correction, the average error rate of Read is greatly reduced, for example, the error rate of many Reads can be reduced to less than 1%. For most of the repeated sequences with a similarity of less than 95% on the genome, the reads produced by the read will not have a similarity higher than 95% after correction (in a few cases, more similar sequences will be generated due to correction errors), so It is easy to distinguish in the sequence alignment and assembly process, so it is basically not a problem when using existing software for assembly. The present invention needs to process mainly those sequences with a similarity greater than 95%. For all Contigs connected to a Congtig (Ni), only one Contig (Nj) is adjacent to it. The purpose of the present invention is to determine which Nj is adjacent to Ni by comparing the number and quality of the paths from Ni to Nj. If only one pathway is compared, it is more susceptible to accidental factors.
在连接两个Contig的通路序列中,并不是所有的序列都是一致的,因此,本发明需要找到一条代表性序列来代表这些通路,并作为连接两个Contig的潜在序列。代表性序列也可以是这些通路中取长度为最高出现频率中的一条作为参考序列,把其他所有的序列比对上,然后计算这些序列的共识序列。若是两个Contig之间存在一条共识序列,则若是比对到这条共识序列的通路序列占总数的50%以上,则确认这两个Contig之间存有有效共识序列,其连接数为比上通路序列的数目。若是不到50%,则认为这两个Contig之间的重复序列太复杂,其连接数可设为0。含有多个重复单位的重复区域,因为在延伸时可以跨过或是重复某个重复单位,导致代表性序列变短或变长,呈现有规律的多个长度频率峰值分布。In the pathway sequence connecting two Contigs, not all sequences are identical. Therefore, the present invention needs to find a representative sequence to represent these pathways and use it as a potential sequence connecting two Contigs. The representative sequence can also be one of these channels that takes the length of the highest occurrence frequency as a reference sequence, compares all other sequences, and then calculates the consensus sequence of these sequences. If there is a consensus sequence between the two Contigs, if the path sequence aligned to the consensus sequence accounts for more than 50% of the total, it is confirmed that there is a valid consensus sequence between the two Contigs, and the number of connections is higher than The number of pathway sequences. If it is less than 50%, the repeat sequence between the two Contigs is considered too complicated, and the number of connections can be set to zero. A repeating region containing multiple repeating units, because a repeating unit can be crossed or repeated during extension, resulting in a shorter or longer representative sequence, showing a regular distribution of multiple length frequency peaks.
如图9所示,其中图9(a)中所示的2条序列表示:在原始基因组序列中,序列A通过某重复序列的一个拷贝与D序列连接,C序列通过该重复序列的另一个拷贝(即与上述重复序列属于同一个重复序列家庭)与序列B连接。采用现有的SG方法进行组装时,图9(a)中所示的重复序列的两个拷贝(也可以说是两条相似的序列)被压缩成了一条,如图9(b)所示;注意在压缩的过程中,由于两条原始的序列之间是有差异的,压缩成一条后,很可能会跟两条都不一样,造成组装出的序列错误;根据图9(b),根本无法确定序列A通过重复序列应该与D序列连接还是应该与B序列连接,也无法确定序列C通过重复序列应该与B序列进行连接还是应该与D序列进行连接,也就是说,由于交叉节点的存在,形成的重复序列组装后的Contig在被压缩的起点和终点位置附近断开,导致组装出的Contig的碎片化,无法真正复原整个原始基因组序列。而采用本发明后,如图9(c)所示,本发明将相似的重复序列分别组装成两条不同的contig序列,通过本发明的方法寻找从某一序列的末端出发到所有未知序列末端的所有通路序列的代表性序列,然后再从这些代表性序列中找出从该末端出发连接最正确的一条通路,也即,通过本发明的方法可以正确的判断出序列A通过重复序列(即本发明中所需寻找的最终的有效连接序列)与序列D连接,序列C通过该重复序列的相似序列(也是本发明中所需寻找的最终的有效连接序列)与序列B连接,从而可以正确还原整个原始基因组序列,而且利用本发明的方法,可以将大部分的重复序列或其相似重复序 列都组装起来,最终形成超长连续的DNA序列。As shown in FIG. 9, the two sequences shown in FIG. 9 (a) indicate that in the original genomic sequence, the sequence A is connected to the D sequence through one copy of a repeat sequence, and the C sequence is connected to another through the repeat sequence. The copy (that is, the same repeat family as the above-mentioned repeat sequence) is linked to sequence B. When using the existing SG method for assembly, two copies of the repeating sequence shown in Fig. 9 (a) (also two similar sequences) are compressed into one, as shown in Fig. 9 (b). ; Note that during the compression process, due to the difference between the two original sequences, after compression into one, it is likely to be different from the two, resulting in an assembled sequence error; according to Figure 9 (b), It is impossible to determine whether sequence A should be connected to D sequence or B sequence through repeated sequences, nor whether sequence C should be connected to B sequence or D sequence through repeated sequences, that is, due to the cross node's Existing, the assembled Contig after assembly of the repeated sequence is broken near the compressed start and end positions, resulting in fragmentation of the assembled Contig, which cannot truly restore the entire original genome sequence. After adopting the present invention, as shown in FIG. 9 (c), the present invention assembles similar repeat sequences into two different contig sequences, and uses the method of the present invention to find the starting point from the end of a certain sequence to the end of all unknown sequences. The representative sequences of all the pathway sequences of the sequence are then found from these representative sequences to find the most correctly connected pathway from the end, that is, the method of the present invention can correctly determine that the sequence A passes the repeated sequence (ie The final effective linking sequence to be found in the present invention) is linked to sequence D, and sequence C is linked to sequence B through a similar sequence of the repeat sequence (also the final effective linking sequence to be found in the present invention), so that it can be correctly The entire original genomic sequence is reduced, and most of the repeating sequences or similar repeating sequences can be assembled by using the method of the present invention, and finally an ultra-long continuous DNA sequence is formed.
采用本发明的方法同样可以组装单拷贝区域,因为在组装前,并不知道该序列是来源于重复序列还是单拷贝序列。A single copy region can also be assembled using the method of the present invention, because it is not known whether the sequence is derived from a repeat sequence or a single copy sequence before assembly.
工业实用性Industrial applicability
本发明提供一种构建超长连续DNA序列的基因组组装方法。本发明方法包括:S1,找出每对已知DNA序列之间的重叠区域;S2,从任一个锚定序列片段的一个自由末端开始,用跟其有重叠的Read序列对其进行延伸,循环多次,直至遇到能够比对到另一不同的锚定序列片段末端的Read序列,获得一条或多条通路序列;S3,从所有的通路序列中选择最多一条作为连接起始锚定序列片段末端到另一个终点锚定序列片段末端的有效连接序列;S4,利用该有效连接序列连接起始和相应的终点锚定序列片段;连接后作为新的锚定序列片段或记录剩余的锚定序列片段的自由末端,转到S2;重复步骤S2-S4,最终形成超长连续的DNA序列。本发明方法有利于复原整条染色体及整个基因组的序列,进一步提高基因组组装的准确率,可以实现重复序列区域的基因组组装,也可以实现单拷贝序列区域的组装,还实现了对包含多个重复单位的复杂重复区域的组装,具有较好的经济价值和应用前景。The invention provides a genome assembly method for constructing an ultra-long continuous DNA sequence. The method of the present invention includes: S1, finding an overlapping region between each pair of known DNA sequences; S2, starting from a free end of any anchor sequence segment, extending it with a Read sequence that overlaps with it, and cycling Repeatedly, until a Read sequence that can be compared to the end of another different anchor sequence fragment is obtained, and one or more pathway sequences are obtained; S3, at most one of all pathway sequences is selected as the connection starting anchor sequence fragment. A valid ligation sequence from the end to the end of another end anchoring sequence fragment; S4. Use the effective ligation sequence to link the start and corresponding end anchoring sequence fragments; use the new anchor sequence fragment or record the remaining anchoring sequence after ligation The free end of the fragment is transferred to S2; steps S2-S4 are repeated to finally form an ultra-long continuous DNA sequence. The method of the invention is beneficial for restoring the entire chromosome and the entire genome sequence, further improving the accuracy of genome assembly, and can realize the genome assembly of the repeated sequence region, as well as the assembly of a single copy sequence region. The assembly of the unit's complex and repeating regions has good economic value and application prospects.

Claims (10)

  1. 一种构建超长连续DNA序列的基因组组装方法,其特征在于,包括以下步骤:A genome assembly method for constructing an ultra-long continuous DNA sequence, comprising the following steps:
    S1,将所有的已知DNA序列进行两两比较,找出每对序列之间相似的重叠区域;其中,所述的已知DNA序列包括锚定序列片段和随机测序Read序列;所述的锚定序列片段至少包括两个;所述的将所有的已知DNA序列进行两两比较,包括将所有的锚定序列片段与所有的随机测序Read序列进行两两比较,以及将所有的随机测序Read序列进行两两比较;S1: Perform a pairwise comparison of all known DNA sequences to find similar overlapping regions between each pair of sequences; wherein the known DNA sequences include anchor sequence fragments and random sequencing Read sequences; the anchors The predetermined sequence fragment includes at least two; the described pairwise comparison of all known DNA sequences includes a pairwise comparison of all anchor sequence fragments and all random sequencing Read sequences, and all random sequencing Read Compare the sequences one by one;
    S2,从任意一个锚定序列片段的一个自由末端开始,用跟其有重叠的随机测序Read序列对该锚定序列片段进行延伸,形成一到多个延长的序列;再对这些延长的序列采用同样的方法利用随机测序Read序列继续进行延伸,每个序列的延伸循环多次,直至遇到能够比对到另一个不同的锚定序列片段末端的随机测序Read序列,则从起始锚定序列片段一端开始的延伸结束,获得连接起始锚定序列片段的一端到另一个或多个不同的终点锚定序列片段末端的一个或多个通路序列,所述的一个或多个通路序列形成序列集合A;S2, starting from a free end of any anchor sequence fragment, extending the anchor sequence fragment with a random sequencing Read sequence that overlaps with it to form one or more extended sequences; and then using these extended sequences with The same method uses random sequencing of the Read sequence to continue the extension. Each sequence is extended multiple times until it encounters a random sequencing Read sequence that can be compared to the end of another different anchor sequence fragment. One end of the fragment is extended to obtain one or more pathway sequences connecting one end of the initial anchor sequence fragment to another or more different end anchor sequences. The one or more pathway sequences form a sequence. Collection A;
    S3,根据所述的序列集合A中的通路序列,选择最多一条序列作为连接起始锚定序列片段末端到另一个终点锚定序列片段末端的有效连接序列;S3. According to the pathway sequence in the sequence set A, a maximum of one sequence is selected as a valid linking sequence connecting the end of the starting anchor sequence fragment to the end of the other end anchor sequence fragment;
    S4,利用所述的有效连接序列连接起始锚定序列片段和相应的终点锚定序列片段;将连接后的序列片段作为新的锚定序列片段或记录剩余的锚定序列片段的自由末端,转到S2;不断重复步骤S2-S4,从而最终形成超长连续的DNA序列。S4. Use the effective linking sequence to connect the starting anchor sequence fragment and the corresponding end anchor sequence fragment; use the linked sequence fragment as a new anchor sequence fragment or record the free ends of the remaining anchor sequence fragments, Go to S2; repeat steps S2-S4 continuously, so as to finally form an ultra-long continuous DNA sequence.
  2. 根据权利要求1所述的构建超长连续DNA序列的基因组组装方法,其特征在于,步骤S2中,在选择候选延伸序列之前,还包括:设定一个全局序列相似性最低阈值SImin;对任一序列X来说,首先判断跟其重叠的序列在重叠区域的序列相似性值是否大于等于所述最低阈值SImin,如果是,则选用这些重叠序列来延伸序列X,否则放弃选用这些 重叠序列来延伸序列X。The genome assembly method for constructing an ultra-long continuous DNA sequence according to claim 1, characterized in that, in step S2, before selecting a candidate extension sequence, further comprising: setting a global sequence similarity minimum threshold SImin; For sequence X, first determine whether the sequence similarity value of the sequence that overlaps with it is greater than or equal to the minimum threshold SImin. If so, use these overlapping sequences to extend sequence X, otherwise give up using these overlapping sequences to extend Sequence X.
  3. 根据权利要求2所述的构建超长连续DNA序列的基因组组装方法,其特征在于,所述的全局序列相似性最低阈值SImin参考全基因组水平上的测序Read序列准确率值α进行设定,其中,所述的全基因组水平上的测序Read序列准确率值α通过以下方式计算获得:取已知的每条序列的具有最高重叠分数的重叠序列,最多取平均测序深度的条数;计算所有重叠区域的平均序列一致性值,作为全基因组水平上的测序Read序列准确率值α。The genome assembly method for constructing an ultra-long continuous DNA sequence according to claim 2, characterized in that the minimum global sequence similarity threshold value SImin is set with reference to the sequencing read sequence accuracy value α at the whole genome level, wherein The sequence read sequence accuracy value α at the genome-wide level is obtained by calculating the following methods: taking known overlapping sequences with the highest overlap score of each sequence, taking at most the number of average sequencing depths; calculating all overlaps The average sequence identity value of the region is used as the sequencing read sequence accuracy value α at the genome-wide level.
  4. 根据权利要求1所述的构建超长连续DNA序列的基因组组装方法,其特征在于,步骤S2中,对序列末端进行延伸时,每一步都选择重叠分数最高的序列;或者延伸分数最高的序列;或者随机选择一个序列;或者为上述任意两种或上述三种方式的结合;其中,随机选择序列时,任意一个序列被选中的概率根据其重叠分数或延伸分数确定。The genome assembly method for constructing an ultra-long continuous DNA sequence according to claim 1, characterized in that, in step S2, when the sequence ends are extended, each step selects the sequence with the highest overlap score; or the sequence with the highest extension score; Either randomly select a sequence; or a combination of any two or three of the above; wherein when a sequence is randomly selected, the probability that any one sequence is selected is determined according to its overlap score or extension score.
  5. 根据权利要求3或4所述的构建超长连续DNA序列的基因组组装方法,其特征在于,对于末端重叠的两条序列X1和X2,其重叠区域的重叠分数OS为:OS=(OL1+OL2)*SI/2;其中,OL1,OL2分别为序列X1和X2中其重叠区域的长度,SI为序列X1和X2之间的重叠区域的序列一致性值;X2对X1的延伸分数ES2为:ES2=OS+EL2/2-(OH1+OH2)/2,其中OH1,OH2分别是两条序列末端错配悬空区域的长度,EL2是X2对X1的延伸长度。The genome assembly method for constructing an ultra-long continuous DNA sequence according to claim 3 or 4, characterized in that, for two sequences X1 and X2 with overlapping ends, the overlapping fraction OS of the overlapping region is: OS = (OL1 + OL2 ) * SI / 2; where OL1 and OL2 are the lengths of the overlapping regions in the sequences X1 and X2, SI is the sequence identity value of the overlapping regions between the sequences X1 and X2; the extension score ES2 of X2 to X1 is: ES2 = OS + EL2 / 2- (OH1 + OH2) / 2, where OH1 and OH2 are the lengths of the mismatched dangling regions at the ends of the two sequences, and EL2 is the extension of X2 to X1.
  6. 根据权利要求1所述的构建超长连续DNA序列的基因组组装方法,其特征在于,步骤S3包括:The method for assembling a genome of an ultra-long continuous DNA sequence according to claim 1, wherein step S3 comprises:
    S31,将所述的序列集合A按照终点锚定序列片段的不同,划分为一或多个通路序列子集合A1,A2,…,Ak,每个子集合中包括一条或多条通路序列;S31. Divide the sequence set A into one or more path sequence sub-sets A1, A2, ..., Ak according to different end-anchor sequence fragments, and each sub-set includes one or more path sequences;
    S32,根据每个通路序列子集合Ai中的通路序列获得一条序列作为这个子集合的代表性序列并计算这个子集合的有效通路序列数目,其中,1≤i≤k;S32. Obtain a sequence as a representative sequence of the subset according to the channel sequence in each of the channel sequence subsets Ai and calculate the number of valid channel sequences of the subset, where 1 ≦ i ≦ k;
    S33,在所有的通路序列子集合的代表性序列中,选择最多一条作为 连接起始锚定序列片段末端到另一个终点锚定序列片段末端的有效连接序列。S33. Among the representative sequences of all the channel sequence sub-sets, at most one is selected as a valid linking sequence connecting the end of the start anchor sequence fragment to the end of the other end anchor sequence fragment.
  7. 根据权利要求6所述的构建超长连续DNA序列的基因组组装方法,其特征在于,步骤S32包括:The method for assembling a genome of an ultra-long continuous DNA sequence according to claim 6, wherein step S32 comprises:
    S321,将每个通路序列子集合Ai分成一或多个组Ai1,…Aig,其中1≤i≤k;S321. Divide each channel sequence subset Ai into one or more groups Ai1, ... Aig, where 1 ≦ i ≦ k;
    S322,从每个组中选出序列长度的出现频率为最高值及小于最高值一定范围的序列,形成序列集合Bi1,…,Big,其中序列集合Bi1与Ai1对应,其他集合以此类推;S322. From each group, a sequence with a highest occurrence frequency of the sequence length and a range smaller than the highest value is selected to form a sequence set Bi1, ..., Big, where the sequence set Bi1 corresponds to Ai1, and the other sets are deduced by analogy;
    S323,将序列集合Bij中的所有序列进行两两比较,若是两条序列之间短的序列被覆盖超过一定比例,则此两条序列被认为是相似性序列;选出所有能比对到最多条相似性序列的序列;若有多条序列的相似性序列的数目最高且相同,则选择序列长度的出现频率为最高的任意一条序列,作为Aij的代表性序列,并记录Bij中与此代表性序列相似的序列数目作为序列组Aij的有效通路序列数目;其中,1<=j<=g;S323: Perform a pairwise comparison of all sequences in the sequence set Bij. If the short sequence between the two sequences is covered by more than a certain ratio, the two sequences are considered as similar sequences; all the ones that can be compared to the most are selected. A sequence of similar sequences; if there are multiple sequences with the highest number of similar sequences and the same, then select any sequence with the highest frequency of sequence length as the representative sequence of Aij, and record this representative in Bij The number of sequences with similar sexual sequences is taken as the number of effective pathway sequences of the sequence group Aij; where 1 <= j <= g;
    S324,将序列子集合Ai中的各组按照序列长度从左到右、从短到长进行排列,从具有最高长度频率峰值的一个组开始,跟其右边通路序列更长的第一个组比较,若是左边组中总的有效通路序列数目高出右边组中总的有效通路序列数目一定比例,则设左边组的代表性序列为序列子集合Ai的代表性序列,寻找序列子集合Ai代表性序列的过程停止;否则,暂设右边组的代表性序列为序列子集合Ai的代表性序列,然后设此组为左边组,跟其第一个右边组比较,重复上述过程,直到右边组中总的有效通路序列数目低于左边组中总的有效通路序列数目一定比例为止;从而确定了序列子集合Ai的代表性序列和其所连接的一对相应的锚定序列片段末端之间的有效通路序列数目。S324. The groups in the sequence sub-set Ai are arranged according to the sequence length from left to right and from short to long, starting from the group with the highest length frequency peak and comparing with the first group with the longer channel sequence on the right side. If the total number of effective pathway sequences in the left group is higher than the total number of effective pathway sequences in the right group, the representative sequence of the left group is the representative sequence of the sequence sub-set Ai, and the representativeness of the sequence sub-set Ai is sought. The sequence process stops; otherwise, temporarily set the representative sequence of the right group as the representative sequence of the sequence sub-set Ai, then set this group as the left group, compare it with its first right group, and repeat the above process until the right group The total number of effective pathway sequences is lower than a certain percentage of the total number of effective pathway sequences in the left group; thus, the validity between the representative sequence of the sequence subset Ai and the ends of a pair of corresponding anchoring sequence fragments connected to it is determined. Number of pathway sequences.
  8. 根据权利要求7所述的构建超长连续DNA序列的基因组组装方法,其特征在于,步骤S321中,通过以下方式进行分组:The genome assembly method for constructing an ultra-long continuous DNA sequence according to claim 7, characterized in that, in step S321, grouping is performed in the following manner:
    S3211,将通路序列子集合Ai中的通路序列按照从短到长、从左到右的顺序进行排列;S3211, arrange the pathway sequences in the pathway sequence sub-set Ai in order from short to long and from left to right;
    S3212,按照相同的长度差异将通路序列子集合Ai中的通路序列划分成多个不重叠的小窗口,并计算每个窗口中包含的序列总数;S3212. Divide the channel sequences in the channel sequence subset Ai into multiple non-overlapping small windows according to the same length difference, and calculate the total number of sequences contained in each window;
    S3213,将每个窗口包含的序列总数与其直接相邻的两个窗口对应的序列总数分别进行比较;若比两边的数值都大,则该窗口为一个高峰窗口,若比两边的数值都小,则该窗口为一个谷底窗口;其中,若是一边没有窗口则此边的序列总数设为0;S3213: The total number of sequences contained in each window is compared with the total number of sequences corresponding to the two windows immediately adjacent to it; if the value is larger than both sides, the window is a peak window, and if the value is smaller than both sides, The window is a valley window; if there is no window on one side, the total number of sequences on this side is set to 0;
    S3214,计算出所有的高峰窗口和谷底窗口;若是一个谷底窗口中序列长度的出现频率最小值小于其右边最近的一个高峰窗口中序列长度的出现频率最大值的某一特定比例时,则用此谷底窗口中的出现频率最低的序列长度把两边的通路序列分成两组;以此类推,通路序列子集合Ai被分成一或多个组。S3214. Calculate all the peak and valley windows; if the minimum frequency of the sequence length in a valley window is less than a certain percentage of the maximum frequency of the sequence length in the nearest peak window to the right, use this The sequence length with the lowest frequency in the valley window divides the channel sequences on both sides into two groups; and so on, the channel sequence subset Ai is divided into one or more groups.
  9. 根据权利要求6所述的构建超长连续DNA序列的基因组组装方法,其特征在于,步骤S33中,通过以下方式来选择最多一条序列作为连接起始锚定序列片段末端到另一个终点锚定序列片段末端的有效连接序列:The genome assembly method for constructing an ultra-long continuous DNA sequence according to claim 6, characterized in that in step S33, at most one sequence is selected as follows to connect the end of the anchoring sequence fragment to the anchoring sequence of the other end point. A valid ligation sequence at the end of the fragment:
    S331,从序列集合A的各个通路序列子集合的代表性序列所对应的有效通路序列数目中,选取有效通路序列数目的最大值与第二大值,计算相应的起始锚定序列片段末端的冲突指数为:CIs=NPs n/NPs m,其中,CIs表示相应的起始锚定序列片段末端Es的冲突指数,NPs m为从起始锚定序列片段末端Es连接到其他不同的终点锚定序列片段末端的最大的有效通路序列数目,NPs n为从起始锚定序列片段末端Es连接到其他不同的终点锚定序列片段末端的第二大的有效通路序列数目;其中,NPs m≥NPs n;若是冲突指数超出了阈值,则相应的起始锚定序列片段末端被称之为一个冲突末端; S331. From the number of effective channel sequences corresponding to the representative sequences of each channel sequence sub-set of sequence set A, select the maximum value and the second largest value of the number of effective channel sequences to calculate the corresponding end of the starting anchor sequence fragment The collision index is: CIs = NPs n / NPs m , where CIs represents the collision index of the corresponding end of the starting anchor sequence fragment Es, and NPs m is the link from the end of the starting anchor sequence fragment Es to other different end anchors The maximum number of effective pathway sequences at the end of a sequence fragment. NPs n is the second largest number of effective pathway sequences connected from the end of the initial anchor sequence fragment Es to the ends of other different end anchor sequence fragments; where NPs m ≥NPs n ; if the collision index exceeds the threshold, the end of the corresponding starting anchor sequence segment is called a collision end;
    S332,对于不存在冲突的锚定序列片段末端,则在其通路序列集合A的所有子集合的代表性序列中,选择相应的有效通路序列数目的最大值所对应的通路序列子集合的代表性序列作为序列集合A最终的代表性序列,即获得了连接起始锚定序列片段末端到另一个终点锚定序列片段末端的有效连接序列;对于存在冲突的锚定序列片段末端,若是此冲突能够被解 决,则选择由解决方法所决定的末端所对应的代表性序列作为连接起始锚定序列片段末端到另一个终点锚定序列片段的有效连接序列;若是此冲突不能被解决,则没有找到此锚定序列片段末端跟其它任一锚定序列片段连接的有效连接序列,转到步骤S2。S332. For the end of the anchor sequence fragment that does not have a conflict, among the representative sequences of all the subsets of the channel sequence set A, select the representativeness of the channel sequence sub-set corresponding to the maximum value of the corresponding effective channel sequence number. The sequence is the final representative sequence of sequence set A, that is, a valid linking sequence connecting the end of the starting anchor sequence fragment to another end of the anchor sequence fragment is obtained. For the conflicting anchor sequence fragment ends, if the conflict can When the solution is resolved, the representative sequence corresponding to the end determined by the solution is selected as the effective linking sequence connecting the end of the starting anchor sequence fragment to another end anchor sequence fragment; if this conflict cannot be resolved, then it is not found If the end of the anchor sequence fragment is connected to any other anchor sequence fragment and is a valid linking sequence, go to step S2.
  10. 根据权利要求9所述的构建超长连续DNA序列的基因组组装方法,其特征在于,步骤S332中,解决冲突的方法包括:根据染色体分组的信息解决位于不同染色体上的锚定序列片段末端的冲突;或根据已知相邻的序列信息,从而解决锚定序列片段末端的冲突;或在构建超长连续DNA序列的过程中,将引起某起始锚定序列片段末端冲突的两个终点锚定序列片段末端中的一个已经用于其他锚定序列片段的连接中,则对应的起始锚定序列片段末端的冲突也相应得到解决。The genome assembly method for constructing an ultra-long continuous DNA sequence according to claim 9, characterized in that, in step S332, the method for resolving conflicts comprises: resolving conflicts at the ends of anchor sequence fragments located on different chromosomes according to the information of chromosome grouping ; Or resolve conflicts at the ends of anchoring sequence fragments based on known adjacent sequence information; or in the process of constructing ultra-long continuous DNA sequences, two end anchors that will cause conflicts at the ends of an initial anchoring sequence fragment One of the ends of the sequence fragment has been used in the connection of other anchor sequence fragments, and the conflicts at the ends of the corresponding initial anchor sequence fragments have been resolved accordingly.
PCT/CN2019/090053 2018-06-08 2019-06-05 Genome assembly method for constructing ultralong continuous dna sequence WO2019233427A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810588945.8 2018-06-08
CN201810588945.8A CN108753765B (en) 2018-06-08 2018-06-08 Genome assembly method for constructing ultra-long continuous DNA sequence

Publications (1)

Publication Number Publication Date
WO2019233427A1 true WO2019233427A1 (en) 2019-12-12

Family

ID=63999576

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/090053 WO2019233427A1 (en) 2018-06-08 2019-06-05 Genome assembly method for constructing ultralong continuous dna sequence

Country Status (2)

Country Link
CN (1) CN108753765B (en)
WO (1) WO2019233427A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108753765B (en) * 2018-06-08 2020-12-08 中国科学院遗传与发育生物学研究所 Genome assembly method for constructing ultra-long continuous DNA sequence

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2010075570A2 (en) * 2008-12-24 2010-07-01 New York University Methods, computer-accessible medium, and systems for score-driven whole-genome shotgun sequence assemble
US20120295260A1 (en) * 2008-11-07 2012-11-22 Industrial Technology Research Institute Methods for accurate sequence data and modified base position determination
CN104239750A (en) * 2014-08-25 2014-12-24 北京百迈客生物科技有限公司 High-throughput sequencing data-based genome de novo assembly method
CN105303068A (en) * 2015-10-27 2016-02-03 华中农业大学 Reference genome and de novo assembly combination based next-generation sequencing data assembly method
CN105986008A (en) * 2015-01-27 2016-10-05 深圳华大基因科技有限公司 CNV detection method and CNV detection apparatus
CN108753765A (en) * 2018-06-08 2018-11-06 中国科学院遗传与发育生物学研究所 A kind of genome assemble method of structure overlength continuous DNA sequence

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015094844A1 (en) * 2013-12-18 2015-06-25 Pacific Bioscences Inc. String graph assembly for polyploid genomes
CN104017883B (en) * 2014-06-18 2015-11-18 深圳华大基因科技服务有限公司 The method and system of assembling genome sequence
US10839939B2 (en) * 2014-06-26 2020-11-17 10X Genomics, Inc. Processes and systems for nucleic acid sequence assembly
WO2016205767A1 (en) * 2015-06-18 2016-12-22 Pacific Biosciences Of California, Inc String graph assembly for polyploid genomes
CN107133493B (en) * 2016-02-26 2020-01-14 中国科学院数学与系统科学研究院 Method for assembling genome sequence, method for detecting structural variation and corresponding system
CN109817280B (en) * 2016-04-06 2023-04-14 晶能生物技术(上海)有限公司 Sequencing data assembling method
US20180060484A1 (en) * 2016-08-23 2018-03-01 Pacific Biosciences Of California, Inc. Extending assembly contigs by analyzing local assembly sub-graph topology and connections
CA3040138A1 (en) * 2016-10-11 2018-04-19 Giorgio Zoia Method and system for selective access of stored or transmitted bioinformatics data
CN107273716B (en) * 2017-05-03 2020-04-28 武汉菲沙基因信息有限公司 Method for assembling framework based on long segments
CN107895104B (en) * 2017-11-13 2020-07-07 深圳华大基因科技服务有限公司 Method and device for evaluating and verifying sequence assembly result of third-generation sequencing

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120295260A1 (en) * 2008-11-07 2012-11-22 Industrial Technology Research Institute Methods for accurate sequence data and modified base position determination
WO2010075570A2 (en) * 2008-12-24 2010-07-01 New York University Methods, computer-accessible medium, and systems for score-driven whole-genome shotgun sequence assemble
CN104239750A (en) * 2014-08-25 2014-12-24 北京百迈客生物科技有限公司 High-throughput sequencing data-based genome de novo assembly method
CN105986008A (en) * 2015-01-27 2016-10-05 深圳华大基因科技有限公司 CNV detection method and CNV detection apparatus
CN105303068A (en) * 2015-10-27 2016-02-03 华中农业大学 Reference genome and de novo assembly combination based next-generation sequencing data assembly method
CN108753765A (en) * 2018-06-08 2018-11-06 中国科学院遗传与发育生物学研究所 A kind of genome assemble method of structure overlength continuous DNA sequence

Also Published As

Publication number Publication date
CN108753765A (en) 2018-11-06
CN108753765B (en) 2020-12-08

Similar Documents

Publication Publication Date Title
CN109234267B (en) Genome assembly method
CN105303068B (en) It is a kind of to assemble based on reference gene group and from the beginning two generation sequencing data assemble methods being combined
Xue et al. L_RNA_scaffolder: scaffolding genomes with transcripts
CN104200133B (en) A kind of genome De novo sequence assembly methods based on reading and range distribution
WO2019233427A1 (en) Genome assembly method for constructing ultralong continuous dna sequence
CN104164479A (en) Heterozygous genome processing method
CN104239750A (en) High-throughput sequencing data-based genome de novo assembly method
US20140121991A1 (en) System and method for aligning genome sequence
KR101930253B1 (en) Apparatus and method constructing consensus reference genome map
CN102789553B (en) Method and device for assembling genomes by utilizing long transcriptome sequencing result
US20190078155A1 (en) Method for determining nucleotide sequence
US20140121983A1 (en) System and method for aligning genome sequence
CN103805689A (en) Characteristic kmer based metatypic chromosomal sequence assembly method and application thereof
CN112397148A (en) Sequence comparison method, sequence correction method and device thereof
CN116130001A (en) Third-generation sequence comparison algorithm based on k-mer positioning
CN111192635B (en) Analysis method for circular RNA identification and expression quantification
US20170270243A1 (en) Method for finding associated positions of bases of a read on a reference genome
CN113098526B (en) DNA self-index interval decompression method
CN111583065A (en) Power load data prediction method and device
CN115602244B (en) Genome variation detection method based on sequence alignment skeleton
TWI420007B (en) System and method of assembling dna reads
CN110544510B (en) Contig integration method based on adjacent algebraic model and quality grade evaluation
US20140121988A1 (en) System and method for aligning genome sequence considering repeats
CN115602246B (en) Sequence alignment method based on group genome
Fertin et al. DExTaR: Detection of exact tandem repeats based on the de Bruijn graph

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19814908

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19814908

Country of ref document: EP

Kind code of ref document: A1