WO2019233427A1

WO2019233427A1 - Genome assembly method for constructing ultralong continuous DNA sequence

Info

Publication number: WO2019233427A1
Application number: PCT/CN2019/090053
Authority: WO
Inventors: 梁承志; 杜会龙
Original assignee: 中国科学院遗传与发育生物学研究所
Priority date: 2018-06-08
Filing date: 2019-06-05
Publication date: 2019-12-12
Also published as: CN108753765A; CN108753765B

Abstract

A genome assembly method for constructing an ultralong continuous DNA sequence, comprising: S1, finding an overlapping region between each pair of known DNA sequences; S2, starting from a free end of any anchor sequence fragment, extending the anchor sequence fragment with a read sequence overlapping therewith, repeating the extension a plurality of times until a read sequence that can be aligned to an end of another different anchor sequence fragment is found, so as to obtain one or a plurality of pathway sequences; S3, selecting at most one of all the pathway sequences as an effective joining sequence to join the end of the initial anchor sequence fragment to the end of the other ending anchor sequence fragment; S4, using the effective joining sequence to join the initial and corresponding ending anchor sequence fragment; using the joined anchor sequence fragment as a new anchor sequence fragment, or recording a free end of a remaining anchor sequence fragment, and proceeding to S2; and repeating steps S2-S4 to finally form an ultralong continuous DNA sequence.

Description

Genome assembly method for constructing ultra-long continuous DNA sequence

Cross reference

This application claims priority from Chinese Patent Application No. 201810588945.8, entitled “A Method for Constructing a Genome Assembly of Ultra-Long Continuous DNA Sequences”, filed on June 8, 2018, the entire disclosure of which is incorporated herein by reference in its entirety.

Technical field

The invention relates to a genome assembly method for constructing ultra-long continuous DNA sequences, and belongs to the technical field of genome assembly.

Background art

A sequencer generates random reads (Reads) by sequencing fragments of the genome, and these Reads are distributed randomly across the genome. Genome assembly is the process of arranging and connecting the Reads in the correct order, assembling them into contigs of contiguous bases, and finally restoring the sequence of each entire chromosome and of the entire genome. The assembly process generally includes three steps: assembly of continuous fragments (Contig), assembly of discontinuous fragments (Scaffold), and gap filling (GF). The difficulty of genome assembly stems from the large number of repeated sequences present in the genome (i.e., two or more segments with similar or identical sequences). Repeats in the genome fall into two major categories: tandem repeats and interspersed repeats. A tandem repeat is a series of very similar repeating units connected directly head to tail, generated by local duplication; typical tandem repeats include rDNA and centromeric repeats. Interspersed repeats are non-locally repeated sequences distributed at different locations in the genome. Some repeats contain both tandem and non-tandem repeats; these regions are long and form complex repeats. Reads derived from different copies of a repeated sequence are similar in sequence. At present, the N50 length of single-molecule sequencing Reads is generally greater than 10-15 kb, and the longest Reads exceed 100 kb. If a repeated sequence together with the single-copy sequences at both of its ends is covered by a single Read, the region poses no assembly problem. What remains to be solved is the assembly of repeated sequences longer than the average or N50 Read length.

For long single-molecule sequencing data, the most commonly used genome assembly methods are Overlap-Layout-Consensus (OLC) (Myers et al. 2000, Science 287, 2196-2204) and String Graph (SG) (Myers 2005, Bioinformatics 21 Suppl 2, ii79-85). The OLC method can also be described succinctly in terms of SG, so the two are collectively referred to here as SG-type methods. Common software implementing SG-type methods includes PBcR (Berlin et al. 2015, Nat. Biotechnol. 33, 623–30), CANU (Koren et al. 2017, Genome Res. 27, 722–736), FALCON (Chin et al. 2016, Nat. Methods 13, 1050–1054), and MECAT (Xiao et al. 2017, Nat. Methods, doi: 10.1038/nmeth.4432). The key to the SG method is the use of transitive reduction (TU) to remove redundant Reads (all Reads that are highly similar are compressed into one). That is, after an overlap graph of all Reads is constructed, TU reduces the number of incoming and outgoing edges of many nodes to one, leaving many paths without branches. If a Read node still has more than one incoming or outgoing overlap edge in the simplified graph, it is called a cross node; all other nodes are internal nodes. A path without cross nodes can form a Contig, which can be further compressed in the SG. A cross node represents the junction between a single-copy sequence region and a repeated sequence region (the Read at this node contains part of each of the two types of sequence). The sequencer also makes errors when determining Read sequences, such as base insertions, deletions, substitutions, or chimeras derived from sequences at different positions, and Reads carrying such errors may give rise to additional cross nodes. Because of these sequencing errors, there is no unified standard for deciding whether a difference between Read sequences is caused by a sequencing error or by different copies of a repeated sequence.
During path simplification, a single-copy region is reduced to a single path formed by a series of Reads, which are joined to form a single-copy sequence Contig; a repeated sequence can likewise be compressed into a single path formed by a series of Reads, forming a repeated sequence Contig. Because errors must be tolerated during sequence comparison, Reads from different copies of a repeated sequence are compressed together, so that the different copies collapse into one and can no longer be distinguished. Moreover, because of the cross nodes, the resulting repeated sequence Contig is broken at the start and end positions of the compression, which fragments the assembled Contigs and in turn makes it impossible to truly restore the entire original genome sequence.

Summary of the Invention

The purpose of the present invention is to provide a genome assembly method for constructing an ultra-long continuous DNA sequence that effectively solves the problems of the prior art. In particular, the prior art compresses multiple similar repeated sequences into a single path formed by a series of Reads; because Reads from different copies of a repeated sequence are compressed together, the different copies collapse into one and cannot be distinguished; and because of the cross nodes, the resulting repeated sequence Contig is broken at the start and end positions of the compression, causing fragmentation of the assembled Contigs.

In order to solve the above technical problem, the present invention adopts the following technical solution: a method for assembling a genome for constructing an ultra-long continuous DNA sequence, including the following steps:

S1. Compare all known DNA sequences pairwise to find the similar overlapping regions between each pair of sequences; wherein the known DNA sequences include anchor sequence fragments and random sequencing Read sequences. Anchor sequence fragments are sequence fragments used for anchoring and can be of several types, such as one or more specific sequence segments intercepted from a DNA sequence, and/or one or more specific sequence segments that have already been assembled, and/or one or more specific Read sequences selected from the random sequencing Read sequences. There are at least two anchor sequence fragments. The pairwise comparison of all known DNA sequences includes comparing all anchor sequence fragments pairwise with all random sequencing Read sequences, and comparing all random sequencing Read sequences pairwise with each other;

S2. Starting from a free end (such as Es) of any anchor sequence fragment, extend the anchor sequence fragment with the random sequencing Read sequences that overlap with it to form one or more extended sequences; then continue to extend these extended sequences in the same way with random sequencing Read sequences. Each sequence is extended multiple times until a random sequencing Read sequence is encountered that can be aligned to the end of another, different anchor sequence fragment, at which point the extension from that end of the starting anchor sequence fragment stops. This yields one or more pathway sequences connecting the end of the starting anchor sequence fragment to the ends of one or more different ending anchor sequence fragments. These pathway sequences form sequence set A (that is, the pathway sequences in sequence set A connect the starting anchor sequence fragment end Es to one or more different ending anchor sequence fragment ends Ee1, ..., Eek);

S3. From the pathway sequences in sequence set A, select at most one sequence as the valid linking sequence connecting the end of the starting anchor sequence fragment to the end of another ending anchor sequence fragment (for a given starting anchor sequence fragment end, there may be no valid linking sequence);

S4. Use the valid linking sequence to connect the starting anchor sequence fragment (at end Es, for example) to the corresponding ending anchor sequence fragment; take the joined sequence fragment as a new anchor sequence fragment, or record the free ends of the remaining anchor sequence fragments, and return to S2; repeat steps S2-S4 until an ultra-long continuous DNA sequence is finally formed.

In the present invention, no two anchor sequence fragments are completely identical, so that conflicting ends are avoided as far as possible. A sequence has two ends, and each end can be defined as a sequence of a specific length (for example, 1-50 kb); the sequence of that length at an end is the terminal sequence. In practice, similar terminal sequences (for example, with identity > 98%) can be detected by sequence alignment and removed, and new usable ends are generated once the sequence has been shortened.

Preferably, in step S2, before a candidate extension sequence is selected, the method further includes setting a global minimum sequence similarity threshold SImin. For any sequence X, it is first determined whether the sequence similarity, within the overlapping region, of each sequence overlapping X is greater than or equal to SImin; if so, the overlapping sequence is used to extend X, and otherwise it is discarded. This eliminates noise, improves the efficiency and speed of data processing, and improves the accuracy of the results.

Preferably, the global minimum sequence similarity threshold SImin is set with reference to the genome-wide sequencing Read accuracy value α (for example, SImin = 1 - (1 - α) × 3). The value α is obtained as follows: for each sequence, take the known overlapping sequences with the highest overlap scores, up to a number equal to the average sequencing depth; then calculate the average sequence identity over all of these overlapping regions and use it as α. Setting the minimum overlap screening threshold from an estimated sequencing accuracy makes the threshold more accurate, reduces background noise, and improves both the accuracy of the results and the speed of calculation.
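The estimate of α and the example threshold formula can be sketched as follows. This is only an illustration under stated assumptions: the function names, the dictionary layout of the overlaps, and the default factor of 3 (taken from the example formula) are hypothetical and not part of the disclosure.

```python
def estimate_alpha(overlaps_by_read, depth):
    """Estimate the genome-wide Read accuracy value alpha.

    overlaps_by_read: hypothetical layout, read id -> list of
        (overlap_score, identity) pairs for the sequences overlapping it.
    depth: average sequencing depth; at most this many top-scoring
        overlaps are kept per read.
    """
    identities = []
    for hits in overlaps_by_read.values():
        best = sorted(hits, key=lambda h: h[0], reverse=True)[:depth]
        identities.extend(identity for _score, identity in best)
    if not identities:
        raise ValueError("no overlaps supplied")
    return sum(identities) / len(identities)


def similarity_threshold(alpha, factor=3.0):
    """Example formula from the text: SImin = 1 - (1 - alpha) * 3."""
    return 1.0 - (1.0 - alpha) * factor


def passes_threshold(identity, si_min):
    """An overlap is kept for extension only if identity >= SImin."""
    return identity >= si_min
```

With α = 0.9 the example formula gives a threshold of about 0.7, so overlaps below roughly 70% identity would be discarded before extension.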

In the foregoing genome assembly method for constructing an ultra-long continuous DNA sequence, in step S2, at each step in which a sequence end is extended (a sequence has two ends, and each end can be defined as a sequence of a specific length, such as 1-50 kb), one of the following is selected: the sequence with the highest overlap score; or the sequence with the highest extension score; or a randomly chosen sequence; or a combination of any two or all three of these methods. When a sequence is chosen randomly, the probability of choosing a given sequence is determined by its overlap score or extension score (for example, the probability can be the score of that sequence divided by the sum of the scores of all sequences available for extension). Each extension method is a greedy algorithm, and a single pathway is not guaranteed to be correct, so combining several extension methods to obtain several pathways increases the probability of finding the correct result.
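The three selection rules can be illustrated with a short sketch. The function name, the tuple layout of the candidates, and the choice of the overlap score as the random weight are assumptions made for this example; the text allows the extension score to serve as the weight as well.

```python
import random


def choose_extension(candidates, mode="overlap", rng=None):
    """Pick one candidate for the next extension step.

    candidates: list of (seq_id, overlap_score, extension_score) tuples
        (hypothetical layout).
    mode: "overlap" (highest overlap score), "extension" (highest
        extension score), or "random" (probability proportional to the
        overlap score, i.e. score / sum of all scores).
    """
    if not candidates:
        raise ValueError("no candidates to extend with")
    if mode == "overlap":
        return max(candidates, key=lambda c: c[1])[0]
    if mode == "extension":
        return max(candidates, key=lambda c: c[2])[0]
    if mode == "random":
        rng = rng or random.Random()
        weights = [c[1] for c in candidates]
        # Assumption: weights are non-negative with a positive sum.
        if min(weights) < 0 or sum(weights) <= 0:
            raise ValueError("weights must be non-negative and not all zero")
        ids = [c[0] for c in candidates]
        return rng.choices(ids, weights=weights, k=1)[0]
    raise ValueError("unknown mode: " + mode)
```

Running the same start end once per mode yields several pathways, which is one way to realize the combination of methods described above.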

With the above method, when a sequence is extended, only one sequence is selected at each step except the first, rather than extending with all overlapping sequences, which ensures that a long continuous DNA sequence can be obtained within a limited or short time. If every step were extended with all overlapping sequences, the total number of sequences would grow exponentially with the number of extension steps, and extending all of them would eventually become infeasible. Extending with multiple Reads at the first step guarantees that multiple sequences are produced at the end, increasing the probability that the final result set contains the correct result.
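A minimal sketch of this extension strategy is given below: every overlapping Read is tried at the first step, and a single best-scoring Read is followed at each later step. The graph representation (`successors`, `reaches`), the function name, and the step limit are hypothetical simplifications; real pathway sequences would be built from aligned bases rather than Read identifiers.

```python
def find_pathways(start, successors, reaches, max_steps=1000):
    """Greedy extension from one anchor end.

    start: id of the starting anchor end (e.g. "Es").
    successors: id -> list of (next_read_id, score) overlaps that extend it.
    reaches: read id -> id of the ending anchor end that the read aligns to.
    Returns a list of (ending_anchor_end, [read ids along the pathway]).
    """
    pathways = []
    # First step: branch over every overlapping Read.
    for first, _score in successors.get(start, []):
        path, seen, current = [first], {start, first}, first
        # Later steps: follow only the single highest-scoring Read.
        for _ in range(max_steps):
            if current in reaches:
                pathways.append((reaches[current], path))
                break
            options = [(nxt, s) for nxt, s in successors.get(current, [])
                       if nxt not in seen]
            if not options:
                break  # dead end: this branch yields no pathway
            current = max(options, key=lambda o: o[1])[0]
            seen.add(current)
            path.append(current)
    return pathways
```

Because each branch follows one Read per step, the work grows linearly with pathway length instead of exponentially.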

In the above-mentioned genome assembly method for constructing ultra-long continuous DNA sequences, for two sequences X1 and X2 whose ends overlap (where neither sequence is completely covered by the other), the overlap score OS of the overlap region is OS = (OL1 + OL2) × SI / 2, where OL1 and OL2 are the lengths of the overlap region in X1 and X2 respectively, and SI is the sequence identity of the overlap region between X1 and X2. The extension score of X2 with respect to X1 is ES2 = OS + EL2 / 2 - (OH1 + OH2) / 2, where OH1 and OH2 are the lengths of the unpaired overhang regions at the ends of the two sequences, and EL2 is the length by which X2 extends X1 (the extension score of X1 with respect to X2 can be calculated in the same way). In general, the higher the overlap score, the more likely it is that the overlap region originates from the same position on the genome. Unpaired overhangs at the sequence ends are only rarely due to sequencing errors; most are caused by different copies of repeated sequences, so subtracting their length from the score increases the probability of finding the correct sequence. The length of the extension also matters: including it in the extension score favors long sequences, which makes longer overlap regions and higher overlap scores more likely in subsequent extensions. Note that DNA is double-stranded and complementary while sequence comparison works on single strands, so overlap regions may be found on either strand; after strand adjustment the two cases can be unified without redundancy or contradiction.
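The two scores translate directly into code. In the sketch below the function and parameter names are illustrative; lengths are in bases and SI is a fraction between 0 and 1.

```python
def overlap_score(ol1, ol2, si):
    """OS = (OL1 + OL2) * SI / 2."""
    return (ol1 + ol2) * si / 2


def extension_score(ol1, ol2, si, el2, oh1, oh2):
    """ES2 = OS + EL2 / 2 - (OH1 + OH2) / 2.

    el2: length by which X2 extends X1.
    oh1, oh2: lengths of the unpaired overhangs at the two sequence ends.
    """
    return overlap_score(ol1, ol2, si) + el2 / 2 - (oh1 + oh2) / 2
```

For example, an overlap of 1000 and 1020 bases at 95% identity scores about 959.5; if X2 adds 5000 bases and the overhangs are 100 and 50 bases, the extension score is about 3384.5. Long overhangs pull the score down, which penalizes overlaps that probably come from different repeat copies.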

The inventors found through research that overlapping regions between sequences arise in two ways: 1. the sequences originate from the same position on the genome, in which case their identity is usually high, although sequencing errors keep it below 100%; 2. the sequences are derived from different copies of a repeated sequence, in which case their identity is usually lower.

Preferably, step S3 includes:

S31. Divide the pathway sequence set A, which starts from the end of the starting anchor sequence fragment (such as Es), into one or more pathway sequence subsets A1, A2, ..., Ak according to the ending anchor sequence fragment end reached (all pathway sequences in subset A1 connect to the ending anchor sequence fragment end Ee1, and so on for the other subsets); each subset includes one or more pathway sequences;

S32. For each pathway sequence subset Ai, where 1 ≤ i ≤ k, obtain one sequence from the pathway sequences in the subset as the representative sequence of the subset, and calculate the number of effective pathway sequences of the subset;

S33. From the representative sequences of all the pathway sequence subsets, select at most one as the valid linking sequence connecting the end of the starting anchor sequence fragment (such as Es) to the end of another ending anchor sequence fragment.
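Step S31 amounts to grouping pathway sequences by the ending anchor end they reach. A minimal sketch, assuming each pathway is available as a hypothetical (ending_end, sequence) pair:

```python
from collections import defaultdict


def partition_by_end(pathways):
    """Split sequence set A into subsets A1..Ak keyed by ending anchor end.

    pathways: iterable of (ending_end_id, pathway_sequence) pairs.
    Returns a dict: ending_end_id -> list of pathway sequences.
    """
    subsets = defaultdict(list)
    for end_id, sequence in pathways:
        subsets[end_id].append(sequence)
    return dict(subsets)
```

Each resulting list would then be reduced to one representative sequence and an effective pathway count (S32) before the final choice in S33.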

Through the above method, the most likely correct sequence can be found quickly and accurately among all pathway sequences that start from a free end of a starting anchor sequence fragment and connect to the ends of one or more ending anchor sequence fragments (of the several ending anchor sequence ends connected to the starting anchor sequence end, only one is correct, and the others are background noise). The correct pathway sequence is then used to join the starting anchor sequence fragment to the corresponding ending anchor sequence fragment, improving the accuracy of genome assembly. Starting from the end of a starting anchor sequence, the pathway sequences found may connect to one or more different ending anchor sequence ends. Because Reads generated from the same region of the genome have the highest similarity and overlap scores, and high-scoring sequences are preferred during extension, the largest number of pathways reach the correct ending anchor sequence end. Without grouping by end, the error rate is high; if the subset with the highest number of effective pathways is not selected, the error rate is also high; and a conflict indicates that the two ending sequences are too similar (their identity is too high), so that joining without resolving the conflict easily leads to errors. In addition, the sequences assembled by the above method are all locally complete fragments, which increases the length of the assembled sequences, whereas most fragments assembled by existing methods are only local fragments. Long sequences contain more complete genes, are easier to place on chromosomes, and make it easier to find collinearity and structural variation between fragments. Furthermore, when no output sequence is needed, the method of the present invention can also be used to determine whether two anchor sequences are adjacent, or to determine the distance between two adjacent anchor sequences.

Preferably, step S32 includes:

S321. Divide each pathway sequence subset Ai into one or more groups Ai1, ..., Aig, where 1 ≤ i ≤ k;

S322. From each group, select the sequences whose lengths occur with the highest frequency, together with those whose length frequency falls within a certain range below the highest value (for example, all sequences whose length frequency lies between half the highest value and the highest value), to form sequence sets Bi1, ..., Big, where sequence set Bi1 corresponds to Ai1, and so on for the other sets;

S323. Compare all sequences in sequence set Bij pairwise, where 1 ≤ j ≤ g. If the shorter of two sequences is covered by the other over more than a certain proportion of its length (such as more than 90%), the two sequences are considered similar. Find the sequence that has the largest number of similar sequences; if several sequences tie for the largest number, select any of them whose sequence length has the highest frequency as the representative sequence of Aij. Record the number of sequences in Bij that are similar to this representative sequence as the number of effective pathway sequences of sequence group Aij;

S324. Arrange the groups in sequence subset Ai from left to right by sequence length, from short to long. Starting with the group that has the highest length-frequency peak as the left group, compare it with the first group to its right (the next group with longer pathway sequences). If the total number of effective pathway sequences in the left group exceeds that of the right group by a certain proportion (for example, is more than twice as large), take the representative sequence of the left group as the representative sequence of subset Ai and stop the search; otherwise, provisionally take the representative sequence of the right group as the representative sequence of Ai, treat this group as the new left group, compare it with the first group to its right, and repeat until the total number of effective pathway sequences in the right group is lower than that of the left group by the required proportion (for example, less than 50% of it). This determines the representative sequence of subset Ai and the number of effective pathway sequences connecting the corresponding anchor sequence fragment ends (such as Es and Eei), that is, the number of effective pathway sequences of the corresponding sequence group, denoted for example NPsi.
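Steps S323 and S324 can be sketched as two small functions. This is an illustration under stated assumptions: the similarity test is passed in as a callable rather than computed from alignments, ties in S323 are broken by first occurrence instead of by length frequency, the count excludes the representative itself, and all names are hypothetical.

```python
def pick_representative(sequences, similar):
    """S323 sketch: choose the sequence with the most similar partners.

    sequences: the members of one set Bij.
    similar: callable (a, b) -> bool, e.g. "the shorter sequence is
        covered over more than 90% of its length".
    Returns (index of the representative, number of similar sequences).
    """
    if not sequences:
        raise ValueError("empty sequence set")
    best_index, best_count = 0, -1
    for i, a in enumerate(sequences):
        count = sum(1 for j, b in enumerate(sequences)
                    if j != i and similar(a, b))
        if count > best_count:
            best_index, best_count = i, count
    return best_index, best_count


def choose_group(effective_counts, start, ratio=2.0):
    """S324 sketch: walk right from the group with the highest peak.

    effective_counts: effective pathway counts of the groups, ordered
        from short to long pathway length.
    start: index of the group with the highest length-frequency peak.
    ratio: the left group wins once its count exceeds ratio * right count.
    Returns the index of the group whose representative stands for Ai.
    """
    left = start
    for right in range(start + 1, len(effective_counts)):
        if effective_counts[left] > ratio * effective_counts[right]:
            break  # the right group looks like background noise
        left = right  # provisionally accept the longer group
    return left
```

For counts [3, 20, 12, 4] with the peak at index 1, the walk moves to index 2 (20 is not more than twice 12) and stops there (12 is more than twice 4).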

In the present invention, grouping the pathway sequence subsets allows the correct pathway sequence to be found when multiple repeating units are present (without grouping, complex sequences with multiple repeating units cannot be resolved), thereby achieving the assembly of complex repeat regions containing multiple repeating units. If a repeated sequence does not contain multiple repeating units, no grouping is needed, and the grouping procedure automatically produces a single group. When multiple repeating units are present, extension easily produces pathways with the wrong number of repeating units. In that case, the group with the highest length frequency, or a group to its right, represents the correct set of pathways, and the number of effective pathways in that group cannot be too low (a group with too few effective pathways is most likely background noise). The method of the present invention selects the group whose number of effective pathways is not relatively low, thereby increasing the probability of finding the correct pathway.

More preferably, in step S321, grouping is performed in the following manner:

S3211. Arrange the pathway sequences in the pathway sequence subset Ai by length from short to long, left to right (short on the left, long on the right);

S3212. Divide the pathway sequences in the pathway sequence subset Ai into multiple non-overlapping small windows of equal length span (such as 1 kb), and count the number of sequences contained in each window;

S3213. Compare the number of sequences in each window with the numbers in the two windows directly adjacent to it; if the value is larger than on both sides, the window is a peak window, and if the value is smaller than on both sides, the window is a valley window; if there is no window on one side, the number of sequences on that side is taken to be 0;

S3214. Identify all peak and valley windows; if the minimum length frequency in a valley window is less than a certain ratio (such as 4/5) of the maximum length frequency in the nearest peak window to its right, divide the pathway sequences on the two sides into two groups at the sequence length with the lowest frequency in the valley window; proceeding in this way, the pathway sequence subset Ai is divided into one or more groups.
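The window-based grouping in S3211-S3214 can be sketched as follows. Several simplifications are assumptions of this example: lengths are integers in bases, windows are aligned to multiples of the window size, whole-window counts stand in for the per-length frequencies, and the cut is placed at the midpoint of the valley window. Because peaks and valleys are defined by strict comparison, a plateau of equal counts is neither, exactly as in the literal rule.

```python
def group_by_length(lengths, window=1000, ratio=0.8):
    """Split pathway lengths into groups separated by deep valleys."""
    if not lengths:
        return []
    ordered = sorted(lengths)                      # S3211
    base = ordered[0] // window
    n = ordered[-1] // window - base + 1
    counts = [0] * n                               # S3212
    for x in ordered:
        counts[x // window - base] += 1

    def count_at(i):
        # A missing neighbour counts as 0 (S3213).
        return counts[i] if 0 <= i < n else 0

    peaks = [i for i in range(n)
             if counts[i] > count_at(i - 1) and counts[i] > count_at(i + 1)]
    valleys = [i for i in range(n)
               if counts[i] < count_at(i - 1) and counts[i] < count_at(i + 1)]

    cuts = []                                      # S3214
    for v in valleys:
        right_peaks = [p for p in peaks if p > v]
        if right_peaks and counts[v] < ratio * counts[right_peaks[0]]:
            cuts.append((base + v) * window + window // 2)

    groups, current = [], []
    for x in ordered:
        while cuts and x >= cuts[0]:
            cuts.pop(0)
            if current:
                groups.append(current)
                current = []
        current.append(x)
    groups.append(current)
    return groups
```

A subset with a single length peak comes back as one group, which matches the statement that no grouping is needed when there are no multiple repeating units.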

The key to grouping is to separate multiple peaks according to the valley values. If the distribution of pathway lengths is narrow, for example if the shortest and longest pathways differ by less than 10 kb, grouping is unnecessary and all pathways can be treated as one group. Therefore, when grouping by the window method, the windows need not be very small, nor do they have to be exactly 1 kb.

Preferably, in step S33, at most one sequence is selected as the valid linking sequence connecting the end of the starting anchor sequence fragment (such as Es) to the end of another ending anchor sequence fragment, as follows:

S331. From the representative sequences of the pathway sequence subsets of sequence set A (each representative sequence connects the starting anchor sequence fragment end Es to a different ending anchor sequence fragment end Eei), take the largest and second-largest numbers of effective pathway sequences and calculate the conflict index of the corresponding starting anchor sequence fragment end: CIs = NPs_n / NPs_m, where CIs is the conflict index of the starting anchor sequence fragment end Es, NPs_m is the largest number of effective pathway sequences connecting Es to any other ending anchor sequence fragment end, and NPs_n is the second-largest such number, so that NPs_m ≥ NPs_n; if the conflict index exceeds a threshold (such as 0.75), the corresponding starting anchor sequence fragment end is called a conflicting end;

S332. For an anchor sequence fragment end without a conflict, select, from the representative sequences of all subsets of pathway sequence set A, the representative sequence of the subset with the largest number of effective pathway sequences as the final representative sequence of sequence set A; this is the valid linking sequence connecting the end of the starting anchor sequence fragment (such as Es) to the end of another ending anchor sequence fragment. For an anchor sequence fragment end with a conflict, if the conflict can be resolved, the representative sequence corresponding to the end determined by the resolution method is selected as the valid linking sequence connecting the end of the starting anchor sequence fragment to another ending anchor sequence fragment; if the conflict cannot be resolved, no valid linking sequence is considered to exist between this anchor sequence fragment end and any other anchor sequence fragment end, and the method returns to step S2.
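The conflict index and the selection rule for non-conflicting ends can be sketched as below. The function names and the dictionary layout are hypothetical; treating a start end with only one reachable ending end as having a second-largest count of 0 is an assumption of this example, and conflict resolution itself is left out.

```python
def conflict_index(effective_counts):
    """CIs = NPs_n / NPs_m (second-largest over largest effective count)."""
    ordered = sorted(effective_counts, reverse=True)
    if not ordered or ordered[0] <= 0:
        raise ValueError("need at least one positive effective count")
    second = ordered[1] if len(ordered) > 1 else 0
    return second / ordered[0]


def select_ending_end(counts_by_end, threshold=0.75):
    """Return (chosen ending end or None, conflict index).

    counts_by_end: ending anchor end id -> number of effective pathway
        sequences of the corresponding subset.
    None signals a conflicting end that needs external resolution.
    """
    ci = conflict_index(list(counts_by_end.values()))
    if ci > threshold:
        return None, ci
    best_end = max(counts_by_end, key=counts_by_end.get)
    return best_end, ci
```

With counts of 40 and 10 the index is 0.25 and the first end is chosen; with 40 and 36 the index is 0.9, so the start end is flagged as conflicting and no link is made until the conflict is resolved.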

By the above method, it is determined whether the end of an anchor sequence fragment is a conflicting end, and at most one sequence is then selected as the valid linking sequence connecting the end of the starting anchor sequence fragment (such as Es) to the end of another ending anchor sequence fragment. This avoids extending conflicting ends incorrectly (which would create DNA sequence chimeras, that is, sequences from different regions joined together) and ensures that a longer continuous DNA sequence is formed with higher accuracy.

In the above-mentioned genome assembly method for constructing ultra-long continuous DNA sequences, in step S332, conflicts are resolved by one of the following: resolving conflicts between anchor sequence fragment ends located on different chromosomes according to chromosome grouping information; or resolving conflicts at anchor sequence fragment ends according to sequence adjacency information; or, during construction of the ultra-long continuous DNA sequence, when one of the two ending anchor sequence fragments that cause a conflict at a starting anchor sequence fragment end has already been used to join other anchor sequence fragments, the conflict at that starting end is resolved accordingly. The data on which these methods rely are relatively easy to obtain, so conflicts can be resolved quickly. For example, chromosome grouping data can come from Hi-C or genetic maps, and adjacency information from BioNano genome optical maps or 10x Genomics data. A small number of conflicts can be resolved from the assembly's own information.

Compared with the prior art, the present invention has the following advantages:

(1) In the present invention, all known DNA sequences are compared pairwise to find the similar overlapping regions between each pair of sequences. Then, starting from a free end (such as Es) of any anchor sequence fragment, the anchor sequence fragment is extended with the random sequencing Read sequences that overlap with it to form one or more extended sequences, and these extended sequences are extended further in the same way with random sequencing Read sequences. Each sequence is extended multiple times until a random sequencing Read sequence is encountered that can be aligned to the end of another, different anchor sequence fragment, at which point the extension from that end of the starting anchor sequence fragment stops. This yields one or more pathway sequences connecting the end of the starting anchor sequence fragment to the ends of one or more different ending anchor sequence fragments, and these pathway sequences form sequence set A. From the pathway sequences in sequence set A, one sequence is selected as the valid linking sequence connecting the end of the starting anchor sequence fragment to the end of another ending anchor sequence fragment, and it is used to join the starting anchor sequence fragment to the corresponding ending anchor sequence fragment. The joined sequence fragment is taken as a new anchor sequence fragment, or the free ends of the remaining anchor sequence fragments are recorded, and the above steps are repeated to form an ultra-long continuous DNA sequence. Constructing ultra-long continuous DNA sequences by the method of the present invention makes it easier to restore entire chromosomes and the entire genome sequence;

(2) The present invention assembles random Read sequences into pathway sequences and then processes the pathway sequences, which greatly improves the accuracy of genome assembly. The prior art works only at the Read level and compresses similar Read sequences, so that Read sequences from different repeat regions are compressed into one and the originally different repeat regions can no longer be separated; in addition, for reasons such as sequencing errors and correction errors, errors at the Read level easily cause compression errors. At the pathway level errors are still present, but a pathway sequence is much longer than a Read sequence, which makes sequences easier to distinguish. At the pathway level, two pathway sequences can be distinguished even if they differ by only 1%-2%, and those differing by more than 2% are easily separated. Therefore, compared with the existing SG method, the present invention further improves the accuracy of genome assembly;

(3) In the present invention, sequence identity values, overlap scores or extension scores are set and used to separate, as far as possible, the Reads derived from different copies of a repeated sequence (the repeats within one repeat family in the genome have sequence similarity greater than 85%, and in particular > 98%); the Reads from each repeat copy are assembled into an independent Contig and connected to the anchor sequence fragments at its two ends, eventually forming an ultra-long continuous DNA sequence while improving the accuracy of genome assembly;

(4) Through the method of the present invention, the assembly of a complex repeating region containing a plurality of repeating units is also achieved;

(5) The genome assembly method of the present invention can also be used for sequence filling of blank regions in a genomic sequence (the sequences at both ends of the blank region are used as anchor sequence fragments, and the final effective linking sequence is obtained by the method of the present invention);

(6) The genome assembly method of the present invention can assemble repeat sequence regions of the genome as well as single-copy sequence regions;

(7) In order to verify the effect of the present invention, the inventor also performed a genome assembly test on the rice genome, maize genome, and human genome using the scheme of the invention, as follows:

First, the inventors tested the method on a high-quality rice genome (assembled genome size 390.3Mb; the estimated true size does not exceed 394Mb) (Du et al., 2017). Assembly with the existing OLC-based SG-type methods gave a total genome size of 402.5Mb and a Contig N50 of 1.3Mb. After assembly with the method of the present invention, the total genome size decreased slightly to 399.2Mb (the genome becomes smaller because the OLC-based SG-type assembly result contains redundancy), while the Contig N50 increased to 13.2Mb. When a BioNano genome optical map (sequence adjacency information) was used to resolve conflicts, the Contig N50 increased further to 14.4Mb, and the entire sequence of chromosome 8 was assembled into a single Contig. After non-rice sequences were filtered out, the whole genome was 391.6Mb in 40 Contigs, slightly larger than the original assembled reference genome, mainly because of additional centromeric repeat sequences. The chromosome 8 sequence assembled by the method of the present invention contains about 387kb of sequence missing from the previous reference genome (with the existing methods, the assembly is fragmented, the genome is incomplete, many repeat sequences are missed and are difficult to place on chromosomes, and complex regions cannot be assembled, as shown in Figures 10a, 10b and 10c). In a non-exhaustive test of 14 known regions of potentially complex repeats, the present invention readily assembled 7 of them (as shown in Figure 8e, Figure 10b, Figure 10c, Figure 10e, Figure 10f, Figure 10g, and Figure 10h). The regions of the rice genome assembled by the method of the present invention were quality-checked with independent second-generation short sequencing reads. It was found that 97.21% of the short sequences can be aligned to the genome and that 99.56% of the sequence assembled by the present invention is covered by the second-generation short sequences, indicating that the sequences assembled by the present invention are correct.

Second, as shown in Figure 11b, the published maize B73 reference genome RefGen_v4 (Jiao et al., 2017) was assembled with the PBcR software (OLC-based SG class); it contains a large number of small Contig sequences (90.55Mb in total) that are not anchored to chromosomes, and about 43Mb of blank sequence on the chromosomes. With the same data, after genome assembly by the method of the present invention, the Contig N50 increased from 1.3Mb to 61.2Mb and the longest Contig increased from 7Mb to 140Mb. The length mapped to chromosomes increased from 2075.6Mb to 2104.2Mb (indicating that the assembly accuracy of the present invention is higher), the total number of blank regions decreased from 2,523 to 76, and only 2.8Mb of sequence could not be anchored to chromosomes. Besides filling blank sequences, verification against the BioNano genome optical map showed that, as illustrated in Figures 11c and 11d, the assembly of the present invention also corrected multiple errors in RefGen_v4, including two sequence orientation errors and two position errors (the existing method cannot detect such errors because its sequences are short; once the sequences are assembled by the method of the present invention and become longer, the errors caused by overly short sequences disappear).

Third, a human genome, HX1 (Shi et al., 2016), had been assembled with FALCON (SG software) (HX1_FALCON). After improvement by the method of the present invention, the Contig N50 increased from 8.3Mb to 54.4Mb and the longest Contig increased from 38Mb to 109.8Mb. Comparison against the human reference genome GRCh38 shows that, as in FIG. 12f, multiple regions that are blank (unassembled) in HX1_FALCON have been filled in the assembly result of the present invention. In addition, FIG. 12c and FIG. 12e show that the human reference genome itself contains gaps that are filled after assembly by the method of the present invention, and FIG. 12d shows that a fragment assembled by FALCON is a chimera, whereas the assembly method of the present invention builds the correct sequence.

The above experiments show that: (1) the genome assembly method of the present invention can construct longer continuous DNA sequences; (2) the present invention can be used to fill blank regions in a genome sequence; (3) the method of the present invention can assemble complex repeat sequence regions; and (4) the genome assembly method of the present invention achieves high assembly accuracy.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a flowchart of a method according to an embodiment of the present invention;

FIG. 2 is a flowchart of a method for obtaining the representative sequence of a subset of pathway sequences and the number of effective pathway sequences of that subset;

FIG. 3 is a schematic diagram of a method for grouping a pathway sequence subset;

FIG. 4 is a schematic diagram of a method for selecting an effective ligation sequence connecting the end of an initial anchoring sequence fragment (such as Es) to the end of another end anchoring sequence fragment;

Figure 5 is a schematic diagram of repeat sequence copies derived from different regions: R1 and R2 are two similar repeat sequence fragments in the genome; the sequences aligned in the upper half are Read sequences generated locally, with very high sequence identity, and the sequences aligned in the lower half are sequencing Read sequences from the other copy of the repeat sequence, with unpaired dangling ends and base mutations;

FIG. 6 is a schematic diagram of overlapping sequences and the overlap graph corresponding to FIG. 5: a, two overlapping sequences, where OL is the overlapping sequence portion, OH is the unpaired overhanging sequence portion at the end, and EL is the extending sequence portion; b, a pathway composed of overlapping sequences; c, the connection graph (overlap graph) of the overlapping sequences in FIG. 5, where C1-C4 are anchor sequence fragments, R1 and R2 are repeat sequences, each sequence has two ends, U is a single-copy sequence, and UR is a boundary sequence between a single-copy sequence and a repeat region;

Figure 7 is a schematic diagram and an example of selecting an effective sequence from multiple pathway sequences and resolving conflicts: c, a schematic diagram of the pathway sequences connecting the end of a starting anchor sequence fragment to the ends of multiple end anchor sequence fragments; d, the number of effective pathway sequences to the ends of different anchor sequence fragments; e, conflicting ends; f, the relationship between the sequences in e; g, use of the adjacency relationship between sequences (BioNano optical map) to resolve the conflict in e;

Figure 8 shows an example of a complex repeat region containing multiple repeat units: a-b, the BioNano genome optical map shows that this region contains two repeat units; in each pair of comparison bars the lower bar is the optical map and the upper bar is the reference genome, and comparing the two shows that a sequence is missing from the reference genome; c-d, frequency distributions of pathway sequence length, corresponding to the two sequences in a/b; e, the sequence structure of cns1 and cns2, the two representative sequences of different lengths in b/d, where the line segments at both ends represent the anchor sequences and the box and triangle in the middle represent the assembled sequence;

FIG. 9 is a schematic diagram showing the results of processing using two different methods of the background technology and the present invention for two repeated sequences in the same repeated sequence family;

Figure 10 is an example of rice genome assembly results: a-c, schematic comparisons of rice genome assembly by the existing method and by the method of the present invention; in each pair of comparison bars the lower bar is the BioNano optical map and the upper bar is the reference genome; the upper pair shows the result assembled by the existing method and the lower pair shows the result assembled by the method of the present invention; e/f/g/h, frequency distributions of pathway sequence length in several complex repeat sequence regions;

Figure 11 is an example of maize genome assembly: b-d, in each pair of comparison bars the lower bar is the BioNano optical map and the upper bar is the reference genome; the upper part shows the result of assembling the maize genome with the existing software PBcR (based on the SG method), and the lower part shows the result of assembling the same region with the method of the present invention; c, a sequence orientation error in the PBcR sequence is corrected in the assembly result of the invention; d, a position error in the PBcR sequence is corrected in the assembly result of the invention;

Figure 12 is an example of human genome assembly results: c/e, the upper horizontal bar represents the human reference genome (with gaps) and the lower horizontal bar represents the result assembled by the method of the present invention (no gaps); d, the Falcon result is a chimera, whereas the assembly result of the present invention does not contain the chimera; f, the Falcon result has multiple unassembled blank regions that have been filled in the assembly result of the present invention (i.e., HERA Contig629).

The invention is further described below with reference to the drawings and specific embodiments.

Detailed description of the embodiments

An embodiment of the present invention: a method for assembling a genome to construct an ultra-long continuous DNA sequence, as shown in FIG. 1, includes the following steps:

S1. Perform a pairwise comparison of all known DNA sequences to find similar overlapping regions between each pair of sequences. The known DNA sequences include anchor sequence fragments and random sequencing Read sequences. Anchor sequence fragments are sequence fragments used for anchoring, for example one or more specific sequence fragments taken from a DNA sequence, and/or one or more specific sequence fragments that have already been assembled, and/or one or more specific Read sequences selected from the random sequencing Read sequences; there are at least two anchor sequence fragments. To improve the accuracy of the Read sequences, they can be corrected first, or the original random sequencing Read sequences can be used without correction; correction methods include correcting with sequencing Read sequences of low error rate obtained from other sequencing platforms, and correcting with the other Read sequences in the same set. To improve the efficiency of genome assembly, the random sequencing Read sequences here can be replaced by partially assembled short Contig sequences, with the assembled long Contig sequences used as anchor sequence fragments (the boundary between short and long can be set at 50kb, for example). The pairwise comparison of all known DNA sequences includes pairwise comparison of all anchor sequence fragments with all random sequencing Read sequences, and pairwise comparison among all random sequencing Read sequences. In specific implementation, an overlap graph can be constructed: an undirected simple graph in which nodes represent the known sequences and edges represent the sequence overlaps between them. Each known sequence is represented by two nodes, each node representing one end of the sequence fragment, and the two nodes are connected by an undirected edge (here called a coupling edge). If two nodes in this overlap graph are connected by a non-coupling edge, the two ends overlap, and either one can be used to extend the other. Traversing a path in the graph is subject to one basic requirement: after entering a node, the path must leave through the coupling edge of that node (that is, after reaching one end node of a known sequence, the path cannot leave through the end nodes of other sequences connected to that same end node, but must leave from the other end node of the same sequence, which ensures linear extension of the sequence). In this graph, whether the ends of two different anchor sequences are connected can be determined by depth-first search or breadth-first search;

S2. Starting from a free end node (such as Es) of any anchor sequence fragment, extend the pathway of the anchor sequence fragment with the random sequencing Read sequence nodes that overlap it, forming one or more extended pathway sequences; then continue to extend these pathway sequences in the same way with random sequencing Read sequences. Each sequence is extended repeatedly until it reaches a random sequencing Read sequence that can be connected to the end node of another, different anchor sequence fragment, at which point the extension from that end of the starting anchor sequence fragment stops. This yields one or more pathway sequences connecting the end of the starting anchor sequence fragment to the ends of one or more different end anchor sequence fragments (as shown in the example of FIG. 7c). These pathway sequences form sequence set A (that is, the pathway sequences in sequence set A connect the starting anchor sequence fragment end Es to one or more end anchor sequence fragment ends Ee1, ..., Eek);

S3. From the pathway sequences in sequence set A, select at most one sequence as the effective linking sequence connecting the end of the starting anchor sequence fragment (such as Es) to the end of another end anchor sequence fragment (a given starting anchor sequence fragment end may have no effective linking sequence);
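As a non-limiting illustration, the overlap graph of step S1 and the pathway search of step S2 may be sketched in Python as follows. The function names, the ("sequence id", "L"/"R") node encoding, and the max_steps limit are hypothetical choices made for this sketch and are not part of the original disclosure; the coupling edge is represented implicitly by jumping to the opposite end of the same sequence.

```python
from collections import defaultdict

def build_overlap_graph(overlaps):
    """overlaps: iterable of ((seq_a, end_a), (seq_b, end_b)) pairs, each
    meaning that the two named sequence ends overlap (a non-coupling edge)."""
    graph = defaultdict(set)
    for a, b in overlaps:
        graph[a].add(b)
        graph[b].add(a)
    return graph

def other_end(node):
    """Cross the coupling edge: move to the opposite end of the same sequence."""
    seq, end = node
    return (seq, "R" if end == "L" else "L")

def find_pathways(graph, start, anchors, max_steps=50):
    """Depth-first search from the free anchor end `start`.
    Each returned pathway is the list of end nodes visited, finishing at
    an end of a different anchor sequence fragment."""
    pathways = []

    def extend(node, path, used):
        if len(path) > max_steps:          # assumed safety limit
            return
        for nxt in sorted(graph[node]):
            seq = nxt[0]
            if seq in used:                # never reuse a sequence in one pathway
                continue
            if seq in anchors:             # reached another anchor end: stop here
                pathways.append(path + [nxt])
                continue
            # Entered a Read at `nxt`; it must be left through its other end.
            exit_node = other_end(nxt)
            extend(exit_node, path + [nxt, exit_node], used | {seq})

    extend(start, [start], {start[0]})
    return pathways

# Toy data: anchors C1 and C2 joined through Reads r1 and r2.
toy_graph = build_overlap_graph([
    (("C1", "R"), ("r1", "L")),
    (("r1", "R"), ("r2", "L")),
    (("r1", "R"), ("C2", "L")),
    (("r2", "R"), ("C2", "L")),
])
set_a = find_pathways(toy_graph, ("C1", "R"), anchors={"C1", "C2"})
```

In this toy example set_a plays the role of sequence set A: it holds one pathway through r1 alone and one through r1 and r2, both ending at the left end of C2.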

When the anchor sequence fragments are one or more specific Read sequences selected from the random sequencing Read sequences, they can be selected by one of the following methods:

1) Compare all random sequencing Read sequences pairwise. If a Read sequence overlaps multiple other Read sequences (for example, more than 1/3 of the average sequencing depth) with no unpaired overhanging sequence at the ends, and also overlaps multiple other Read sequences (for example, more than 1/3 of the average sequencing depth) with an unpaired overhanging sequence at the same end of this sequence, this indicates that the sequence lies on the boundary between a single-copy sequence and a repeat sequence. Remove redundancy among all Read sequences that can be aligned to the single-copy end of this boundary sequence, and retain as the anchor sequence fragment the Read sequence with the highest average overlap score over all overlapping regions at the single-copy end.

2) According to the number of Read sequences overlapping the ends of each Read sequence, treat a sequence whose overlap count is above the average depth as a repeat sequence and a sequence whose overlap count is at or below the average depth as a single-copy sequence. Take the single-copy sequence with the highest average overlap score over all its overlapping regions and extend it in both directions, stopping when a Read sequence marked as a repeat sequence is encountered. The two single-copy Read sequences closest to the ends of this extended sequence are used as anchor sequence fragments.

3) The Read sequence aligned to any region on the reference genome is used as the anchor sequence fragment.
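As a non-limiting illustration, the depth-based labelling used in method 2) may be sketched in Python as follows. The function names and the dictionary inputs are hypothetical; the sketch covers only the labelling of Reads as repeat or non-repeat (single-copy) and the choice of the starting Read, not the subsequent extension in both directions.

```python
def classify_reads(end_overlap_counts, avg_depth):
    """Label each Read by the number of Reads overlapping its ends:
    above the average depth -> 'repeat', otherwise -> 'single-copy'."""
    return {
        read_id: "repeat" if count > avg_depth else "single-copy"
        for read_id, count in end_overlap_counts.items()
    }

def best_single_copy_seed(labels, avg_overlap_scores):
    """Return the single-copy Read with the highest average overlap score,
    or None if no Read is labelled single-copy."""
    candidates = [r for r, label in labels.items() if label == "single-copy"]
    if not candidates:
        return None
    return max(candidates, key=lambda r: avg_overlap_scores[r])

# Toy data (invented for illustration only).
labels = classify_reads({"r1": 12, "r2": 30, "r3": 20}, avg_depth=20)
seed = best_single_copy_seed(labels, {"r1": 900.0, "r2": 1500.0, "r3": 950.0})
```

Here r2 is labelled a repeat even though it has the highest score, so the seed is chosen among r1 and r3 only.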

To improve the efficiency and accuracy of genome assembly, step S2 may further include, before candidate extension sequences are selected, setting a global minimum sequence similarity threshold SImin. For any sequence X, first determine whether the sequence similarity in the overlapping region between X and each sequence overlapping it is greater than or equal to SImin; overlapping sequences that meet the threshold are selected to extend X, and those that do not are not.

The global minimum sequence similarity threshold SImin can be set with reference to the genome-wide accuracy α of the sequencing Read sequences (for example, SImin = 1 - (1 - α) * 3). The genome-wide accuracy α is calculated as follows: for each sequence, take its overlapping sequences with the highest overlap scores, up to a number equal to the average sequencing depth; the average sequence identity over all these overlapping regions is used as α.

In specific implementation, SImin may also be set empirically or arbitrarily. For example, when corrected random Read sequences are used as the extension sequences and assembled Contigs are used as the anchor sequences, a fixed SImin value of 97% works well enough in practice, because the accuracy of corrected random sequencing Read sequences is generally about 99%.
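As a non-limiting illustration, the threshold rule SImin = 1 - (1 - α) * 3 and the filtering of overlaps against it may be sketched in Python as follows. The function names and the (overlap score, identity) tuple layout are hypothetical choices for this sketch.

```python
def estimate_alpha(overlaps_by_seq, avg_depth):
    """Estimate the genome-wide Read accuracy alpha.
    overlaps_by_seq maps a sequence id to a list of (overlap_score, identity)
    tuples; for each sequence only the avg_depth best-scoring overlaps count."""
    kept = []
    for overlaps in overlaps_by_seq.values():
        top = sorted(overlaps, key=lambda o: o[0], reverse=True)[:avg_depth]
        kept.extend(identity for _, identity in top)
    if not kept:
        raise ValueError("no overlaps available to estimate alpha")
    return sum(kept) / len(kept)

def si_min(alpha, factor=3):
    """Global minimum similarity threshold: SImin = 1 - (1 - alpha) * factor."""
    return 1 - (1 - alpha) * factor

def usable_overlaps(overlaps, threshold):
    """Keep only the overlaps whose identity reaches the threshold."""
    return [o for o in overlaps if o[1] >= threshold]

# With a Read accuracy of about 99%, the rule gives a threshold of about 97%.
threshold = si_min(0.99)
```

This is consistent with the fixed value of 97% mentioned above for corrected Reads of about 99% accuracy.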

To obtain a longer continuous DNA sequence, when a sequence end is extended in step S2 (a sequence has two ends, and each end can be defined as a terminal stretch of a specific length, such as 1-50kb), each extension step can select the sequence with the highest overlap score, or the sequence with the highest extension score, or a randomly selected sequence, or a combination of any two or three of these methods. When a sequence is selected randomly, the probability of selecting any one sequence is determined by its overlap score or extension score (for example, the probability may be the score of this sequence divided by the sum of the scores of all sequences that can be used for the extension).
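As a non-limiting illustration, the two selection modes described above (highest score, or random selection with probability proportional to score) may be sketched in Python as follows. The function names and the (sequence id, score) input layout are hypothetical, and scores are assumed to be positive.

```python
import random

def pick_best(candidates):
    """Deterministic choice: the candidate with the highest score."""
    return max(candidates, key=lambda c: c[1])[0]

def pick_extension(candidates, rng=random):
    """Random choice in which each candidate is picked with probability
    score / (sum of the scores of all candidates).
    candidates: list of (seq_id, score) pairs with positive scores."""
    ids = [seq_id for seq_id, _ in candidates]
    weights = [score for _, score in candidates]
    return rng.choices(ids, weights=weights, k=1)[0]

# Toy candidates (invented): r2 is three times as likely to be drawn as r1.
candidates = [("r1", 1000.0), ("r2", 3000.0)]
chosen = pick_extension(candidates, random.Random(0))
```

Passing a seeded random.Random instance, as in the last line, makes a run reproducible.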

For two sequences X1 and X2 with overlapping ends (neither completely covered by the other), the overlap score OS of the overlapping region is OS = (OL1 + OL2) * SI / 2, where OL1 and OL2 are the lengths of the overlapping region in sequences X1 and X2 respectively, and SI is the sequence identity of the overlapping region between X1 and X2. The calculation of sequence identity generally counts mismatches, insertions, and deletions of bases, but if the original Read sequences have not been corrected, it can be computed as: number of matched bases / (number of matched bases + number of mismatched bases). The extension score ES2 of X2 extending X1 is ES2 = OS + EL2 / 2 - (OH1 + OH2) / 2, where OH1 and OH2 are the lengths of the unpaired overhang regions at the ends of the two sequences and EL2 is the length by which X2 extends X1 (the extension score of X1 extending X2 is calculated in the same way).

In specific implementation, the extension score may also be defined using the extension length alone, but the results are not necessarily good.
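As a non-limiting illustration, the overlap score OS and the extension score ES2 defined above may be computed in Python as follows. The function and parameter names are hypothetical, and the numbers in the example are invented.

```python
def overlap_score(ol1, ol2, si):
    """OS = (OL1 + OL2) * SI / 2, with the overlap lengths in bases and
    SI the sequence identity of the overlapping region (0..1)."""
    return (ol1 + ol2) * si / 2

def extension_score(ol1, ol2, si, el2, oh1, oh2):
    """ES2 = OS + EL2 / 2 - (OH1 + OH2) / 2, where EL2 is the length by
    which X2 extends X1 and OH1, OH2 are the unpaired overhang lengths."""
    return overlap_score(ol1, ol2, si) + el2 / 2 - (oh1 + oh2) / 2

# Invented example: a 1 kb overlap at 98% identity, 5 kb of new sequence,
# and short unpaired overhangs of 20 and 30 bases.
os_value = overlap_score(1000, 1010, 0.98)
es_value = extension_score(1000, 1010, 0.98, el2=5000, oh1=20, oh2=30)
```

The overhang term penalizes overlaps whose ends do not pair, which are typical of Reads coming from a different repeat copy.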

Step S3 includes:

S31. The sequence set A is divided into one or more pathway sequence subsets A1, A2, ..., Ak according to the different end anchor sequence fragments (all pathway sequences in subset A1 connect the starting anchor sequence fragment end Es to the end anchor sequence fragment end Ee1, and so on for the other subsets). Each subset includes one or more pathway sequences.

In specific implementation, the sequence set A may also be divided into one or more pathway sequence subsets A1, A2, ..., Ak according to the different end anchor sequence fragments (all pathway sequences in subset A1 connect the starting anchor sequence fragment end, such as Es, to the end anchor sequence fragment end, such as Ee1, and so on for the other subsets), each subset including one or more pathway sequences; then any pathway sequence of the most frequent length in the largest subset is selected directly as the effective linking sequence connecting the end of the starting anchor sequence fragment (such as Es) to the end of the corresponding end anchor sequence fragment. If there is more than one largest subset, one of them is selected at random.
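As a non-limiting illustration, this simplified selection (largest subset, then any pathway of the most frequent length) may be sketched in Python as follows. The function name and the dictionary layout are hypothetical, and pathway sequences are represented as plain strings for brevity.

```python
import random
from collections import Counter

def pick_linking_pathway(subsets, rng=random):
    """subsets maps an end anchor end (e.g. 'Ee1') to its list of pathway
    sequences. Returns (end, pathway), or None if there are no subsets."""
    if not subsets:
        return None
    largest = max(len(paths) for paths in subsets.values())
    tied = sorted(end for end, paths in subsets.items() if len(paths) == largest)
    end = rng.choice(tied)                 # random pick among the largest subsets
    lengths = Counter(len(p) for p in subsets[end])
    modal_length = lengths.most_common(1)[0][0]
    pathway = next(p for p in subsets[end] if len(p) == modal_length)
    return end, pathway

# Toy subsets (invented): Ee1 is reached by three pathways, Ee2 by one.
result = pick_linking_pathway({
    "Ee1": ["ACGT", "ACGA", "ACGTT"],
    "Ee2": ["AC"],
})
```

In the toy data, length 4 occurs twice in the Ee1 subset, so the first pathway of that length is returned.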

After the number of effective pathway sequences has been determined for every pair of anchor sequence ends connected by pathways, an undirected connection graph can be constructed with the ends of all anchor sequences as nodes and the representative sequences between nodes as edges, using the number of effective pathways as the length of each edge.

Step S32 may include (as shown in FIG. 2):

S324. The groups in the sequence subset Ai are arranged from left to right in order of sequence length, from short to long. Starting from the group with the highest length frequency peak, compare it with the first group to its right, which has longer pathway sequences. If the total number of effective pathway sequences in the left group exceeds that in the right group by a certain proportion (for example, more than twice), set the representative sequence of the left group as the representative sequence of the sequence subset Ai and stop the search. Otherwise, temporarily set the representative sequence of the right group as the representative sequence of the sequence subset Ai, treat this group as the new left group, compare it with the first group to its right, and repeat the process until the total number of effective pathway sequences in the right group is lower than that in the left group by the required proportion (for example, less than 50%). This determines the representative sequence of the sequence subset Ai and the number of effective pathway sequences between the corresponding anchor sequence fragment ends it connects (such as Es and Eei), that is, the number of effective pathway sequences of the corresponding sequence group, such as NPsi (as shown in the example of FIG. 7d).
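As a non-limiting illustration, the left-to-right comparison of step S324 may be sketched in Python as follows. The function name and the dictionary keys 'rep', 'count', and 'peak' are hypothetical; the groups, their representative sequences, and their effective pathway counts are assumed to have been produced by the preceding grouping steps, which are not reproduced here.

```python
def choose_representative(groups, ratio=2.0):
    """groups: list of dicts ordered by pathway length, short to long, with
    'rep'   = representative sequence of the group,
    'count' = number of effective pathway sequences in the group,
    'peak'  = height of the group's length frequency peak.
    Returns (representative sequence, number of effective pathway sequences)."""
    # Start from the group with the highest length frequency peak.
    i = max(range(len(groups)), key=lambda k: groups[k]["peak"])
    while i + 1 < len(groups):
        left, right = groups[i], groups[i + 1]
        if left["count"] > ratio * right["count"]:
            break                          # left group clearly dominates: stop
        i += 1                             # otherwise the right group takes over
    return groups[i]["rep"], groups[i]["count"]

# Toy groups (invented), ordered from the shortest to the longest pathways.
groups = [
    {"rep": "cns1", "count": 40, "peak": 12},
    {"rep": "cns2", "count": 30, "peak": 9},
    {"rep": "cns3", "count": 5, "peak": 2},
]
rep, np_si = choose_representative(groups)
```

With the example ratio of 2, the walk moves from cns1 to cns2 (40 is not more than twice 30) and stops there (30 is more than twice 5), so a longer pathway is preferred unless the shorter group clearly dominates.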

In specific implementation, it is not necessary to select, as the representative sequence of a sequence subset, any sequence of the most frequent length, or any sequence of the most frequent length that also has the highest similarity to the other sequences; all pathways can be used directly without considering similarity.

In the present invention, grouping can be performed in the following ways (as shown in Figure 3):

S3213. The total number of sequences contained in each window is compared with the totals of the two windows immediately adjacent to it: if the value is larger than on both sides, the window is a peak window; if the value is smaller than on both sides, the window is a valley window; if there is no window on one side, the total for that side is set to 0;

The key to grouping is to separate multiple peaks at the valley values. If the distribution of pathway lengths is narrow, for example if the shortest and longest pathways differ by less than 10kb, grouping is unnecessary and all pathways can be treated as one group. When grouping by the window method, therefore, the windows need not be very small, nor do they have to be set to 1kb. The ratio of the minimum length frequency in a valley window to the maximum length frequency in a peak window is very important: too many groups cause background interference, and too few groups prevent the correct pathway from being found.
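As a non-limiting illustration, the peak and valley labelling of step S3213 may be sketched in Python as follows. The function name and the label strings are hypothetical; the input is the list of per-window sequence totals ordered by pathway length.

```python
def classify_windows(counts):
    """Label each length window as 'peak', 'valley', or 'none' by comparing
    its sequence total with both neighbours; a missing neighbour counts as 0."""
    labels = []
    n = len(counts)
    for i, value in enumerate(counts):
        left = counts[i - 1] if i > 0 else 0
        right = counts[i + 1] if i + 1 < n else 0
        if value > left and value > right:
            labels.append("peak")
        elif value < left and value < right:
            labels.append("valley")
        else:
            labels.append("none")
    return labels

# Toy histogram (invented): two peaks separated by one valley window.
window_counts = [2, 9, 3, 1, 4, 8, 2]
window_labels = classify_windows(window_counts)
```

In the toy histogram the valley at index 3 separates the two peaks at indexes 1 and 5, which would then be placed in different groups.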

In step S33, at most one sequence is selected as the effective linking sequence connecting the end of the starting anchor sequence fragment (such as Es) to the end of another end anchor sequence fragment (as shown in FIG. 4):

S331. Among the representative sequences of the pathway sequence subsets of sequence set A (each representative sequence connects the starting anchor sequence fragment end Es to a different end anchor sequence fragment end Eei), select the largest and second-largest numbers of effective pathway sequences and calculate the conflict index of the corresponding starting anchor sequence fragment end: CIs = NPs_n / NPs_m, where CIs is the conflict index of the starting anchor sequence fragment end Es, NPs_m is the largest number of effective pathway sequences connecting Es to the ends of the different end anchor sequence fragments, NPs_n is the second-largest such number, and NPs_m ≥ NPs_n. If the conflict index exceeds a threshold (such as 0.75), the corresponding starting anchor sequence fragment end is called a conflict end (as shown in the example of FIG. 7e; FIG. 7f shows the relationship between the sequences in FIG. 7e);

S332. For an anchor sequence fragment end without conflict, select, among the representative sequences of all pathway sequence subsets of sequence set A, the representative sequence of the subset with the largest number of effective pathway sequences as the final representative sequence of sequence set A; this is the effective linking sequence connecting the end of the starting anchor sequence fragment (such as Es) to the end of another end anchor sequence fragment. For an anchor sequence fragment end with a conflict, if the conflict can be resolved, the representative sequence leading to the end determined by the resolution method is selected as the effective linking sequence connecting the end of the starting anchor sequence fragment to another end anchor sequence fragment (FIG. 7g shows the adjacency relationship between sequences being used to resolve the conflict at the sequence end shown in FIG. 7e); if the conflict cannot be resolved, no effective linking sequence is found between this anchor sequence fragment end and any other anchor sequence fragment, and the method returns to step S2.
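As a non-limiting illustration, the conflict index of step S331 and the conflict-free selection of step S332 may be sketched in Python as follows. The function names are hypothetical, conflict resolution itself is not modelled, and returning a conflict index of 0.0 when only one end anchor end is reachable is an assumption of this sketch.

```python
def conflict_index(np_by_end):
    """CIs = NPs_n / NPs_m: second-largest over largest number of effective
    pathway sequences from one starting end Es to the different end anchor ends.
    np_by_end maps an end anchor end (e.g. 'Ee1') to its NPs value."""
    ordered = sorted(np_by_end.values(), reverse=True)
    if not ordered or ordered[0] <= 0:
        raise ValueError("at least one positive pathway count is required")
    if len(ordered) == 1:
        return 0.0                         # assumption: a single target cannot conflict
    return ordered[1] / ordered[0]

def select_end(np_by_end, threshold=0.75):
    """Return the end anchor end with the most effective pathway sequences,
    or None when Es is a conflict end (index above the threshold)."""
    if conflict_index(np_by_end) > threshold:
        return None                        # left unconnected until the conflict is resolved
    return max(np_by_end, key=np_by_end.get)

# Toy counts (invented).
clear_case = select_end({"Ee1": 40, "Ee2": 10})      # index 0.25
conflict_case = select_end({"Ee1": 40, "Ee2": 36})   # index 0.9
```

In the second call the two candidate ends are supported almost equally, so no end is chosen and the end would wait for one of the conflict resolution methods.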

In specific implementation, the conflict index need not be calculated; the representative sequence with the highest number of effective pathways can always be selected for the connection, and if two representative sequences tie for the highest number of effective pathways, one of them is connected at random. This produces some chimeras, but also many correct sequences.

In specific implementation, the connection graph can also be used for this judgment. If the end of a starting anchor sequence is connected to the ends of multiple different end anchor sequences and the two longest edges have the same length (number of effective pathways), the end of the starting anchor sequence is connected to two other ends through similar repeat sequences. Non-conflicting ends can be connected in descending order of the number of pathways; conflicting ends are not connected to other ends until the conflict has been resolved.

In step S332, the methods for resolving conflicts include: resolving conflicts between anchor sequence fragment ends located on different chromosomes according to chromosome grouping information; or resolving conflicts at anchor sequence fragment ends according to known sequence adjacency information; or, during construction of the ultra-long continuous DNA sequence, when one of the two end anchor sequence fragments causing a conflict at a starting anchor sequence fragment end has already been used in connecting other anchor sequence fragments, the conflict at that starting anchor sequence fragment end is resolved accordingly.

By grouping the pathway sequence subsets, the present invention can find the correct pathway sequence when multiple repeat units are present (without grouping, complex sequences with multiple repeat units cannot be resolved), and thereby assemble complex repeat regions containing multiple repeat units. Figure 8 shows an example of such a region. As can be seen from FIG. 8a and FIG. 8b, the genome optical map shows that this region contains two repeat units, while the reference genome sequence contains only one; FIG. 8c and FIG. 8d are the frequency distributions of pathway sequence length obtained by the above method of the present invention, corresponding to the two sequences in FIG. 8a and FIG. 8b respectively; FIG. 8e shows the sequence structure of cns1 and cns2, two representative sequences of different lengths obtained by the method of the present invention, where the line segments represent the anchor sequences and the box and triangle in the middle represent the assembled sequence. The present invention can thus assemble complex sequences containing multiple repeat units, whereas the sequences assembled by the existing methods are incomplete, with some parts missing.

The working principle of an embodiment of the present invention:

As shown in Figure 5 (the two sequences R1 and R2 in the genome have high sequence identity but still differ at some bases; the aligned sequences in the upper half are Read sequences generated locally, which differ little from the sequence they are aligned to; the aligned sequences in the lower half have unpaired dangling ends and are sequencing Read sequences from a different copy of the repeat sequence, which differ markedly from the sequence they are aligned to) and Figure 6 (a partial example of an overlap graph, corresponding to the sequence region and sequencing Reads in Figure 5; C1-C4 are anchor sequence fragments; R1 and R2 are repeat sequences; each sequence has two ends; U, single-copy sequence; UR, boundary sequence between a single-copy sequence and a repeat region), repeat sequence copies derived from different regions (similar sequences belonging to the same repeat sequence family) differ from one another. The present invention can assemble the Read sequences derived from each of these differing repeat sequence copies into separate Contig sequences, and then join them to the anchor sequence fragments at both ends to form an ultra-long continuous DNA sequence.

The most important difference between the genome assembly method of the present invention and the existing SG assembly method is that the present invention processes a repeat sequence region as a whole, while the SG method processes repeat regions at the Read level: similar Reads are compressed, so different repeat regions are compressed into one and repeat regions that differ from each other can no longer be separated. For various reasons, such as sequencing errors and correction errors, errors at the Read level easily lead to compression errors; at the pathway level errors are still present, but the representative sequence of a pathway is longer than a Read and is easier to distinguish. At the pathway level, even if two pathways differ in sequence by only 1%-2%, they can possibly be distinguished, and those differing by more than 2% are easily separated. The overlaps between Reads already reflect possible sequence differences, so during extension Reads from different repeat copies are not easily linked together. Conversely, if such Reads are connected at an identity above the threshold, they cannot be distinguished and the two regions will conflict. Conflicts can also arise when only part of the overlapping region is similar, but these can be handled by the conflict resolution methods.

If the similarity between two repeat sequences (or between regions longer than the Read length) is very high, for example >99%, it is difficult to distinguish the Reads generated from the two regions, and the pathways connecting a pair of Contigs will mix Reads from different sources. However, if the similarity between the repeats is lower, for example <97%, the sequence similarity between Reads from the same region is high and their overlap score is high, while the similarity between Reads from different sources is low and their overlap score is low, so that the pathways formed from different repeat copies can be distinguished. If a minimum sequence identity is defined for each overlap, for example 97%, overlaps below this value are filtered out when searching for pathways, so that the pathways formed include only the correct pathway and not pathways formed through other repeat copies. For example, for long Reads, a minimum overlap length between Reads can be set, such as 1 kb. If the sequence similarity between two repeat copies is <=99% and the sequence differences are evenly distributed, a 1 kb overlap between Reads from different copies will contain at least 10 single nucleotide polymorphism (SNP) sites, while an overlap between Reads from the same copy will contain none, so these Reads are easy to distinguish. In practice, errors do occur because sequence differences are not completely evenly distributed or because sequencing errors are not fully corrected, but in general the vast majority of repeat copies with a similarity of <99% are easy to distinguish.

At present, the average error rate of raw Reads produced by single-molecule sequencing is between 10% and 15%. After self-correction, the average error rate of Reads is greatly reduced; for example, the error rate of many Reads can be reduced to less than 1%. For most repeat sequences in the genome with a similarity of less than 95%, the Reads they produce will not have a similarity higher than 95% after correction (in a few cases, correction errors produce more similar sequences), so they are easy to distinguish during sequence alignment and assembly and are basically not a problem when assembling with existing software. The present invention mainly needs to handle sequences with a similarity greater than 95%. Among all the Contigs connected to a Contig (Ni), only one Contig (Nj) is truly adjacent to it. The purpose of the present invention is to determine which Nj is adjacent to Ni by comparing the number and quality of the pathways from Ni to Nj. If only a single pathway is compared, the result is more susceptible to chance factors.
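The 1 kb example and the minimum-identity filter described above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the `Overlap` record, the function names, and the default thresholds (97% identity, 1 kb length) are hypothetical and are not prescribed by the method, and differences between copies are assumed to be evenly distributed, as in the example in the text.

```python
from dataclasses import dataclass


@dataclass
class Overlap:
    """Hypothetical record for one Read-to-Read overlap."""
    read_a: str
    read_b: str
    length: int      # overlap length in bases
    identity: float  # sequence identity of the overlap, in [0, 1]


def expected_differences(overlap_len: int, copy_similarity: float) -> float:
    """Expected differing sites in an overlap between Reads from two copies.

    Assumes the differences between the copies are evenly distributed.
    """
    return overlap_len * (1.0 - copy_similarity)


def filter_overlaps(overlaps, min_identity=0.97, min_length=1000):
    """Keep only overlaps that reach the minimum identity and length."""
    return [o for o in overlaps
            if o.identity >= min_identity and o.length >= min_length]


overlaps = [
    Overlap("read1", "read2", 1500, 0.995),  # likely the same repeat copy
    Overlap("read1", "read3", 1200, 0.960),  # likely a different copy
]
kept = filter_overlaps(overlaps)
print([o.read_b for o in kept])
print(round(expected_differences(1000, 0.99), 6))
```

With these illustrative values, only the high-identity overlap survives the filter, which is the behaviour the text relies on to keep pathways from crossing between repeat copies.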

The pathway sequences connecting two Contigs are not all identical. Therefore, the present invention needs to find a representative sequence to represent these pathways and use it as the potential sequence connecting the two Contigs. The representative sequence can be obtained by taking one of the pathway sequences whose length has the highest frequency of occurrence as a reference sequence, aligning all other sequences to it, and then calculating the consensus sequence of these sequences. If the pathway sequences aligned to the consensus sequence account for more than 50% of the total, it is confirmed that there is a valid consensus sequence between the two Contigs, and the number of connections is taken as the number of pathway sequences. If they account for less than 50%, the repeat sequence between the two Contigs is considered too complicated, and the number of connections can be set to zero. For a repeat region containing multiple repeat units, a repeat unit may be skipped or traversed more than once during extension, resulting in shorter or longer representative sequences, so the sequence lengths show a regular distribution with multiple frequency peaks.
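The choice of a reference pathway and the 50% confirmation rule can be sketched in Python as follows. This is a minimal sketch, not the patented implementation: the function names are hypothetical, the alignment and consensus steps are left out (only their result, the number of aligned pathways, is passed in), ties between equally frequent lengths are broken in favour of the shorter length, and it is an assumption that the reported number of connections is the number of pathways aligned to the consensus.

```python
from collections import Counter


def pick_reference(pathways):
    """Return a pathway whose length is the most frequent length.

    Ties between lengths are broken in favour of the shorter length
    (an arbitrary choice made for this sketch).
    """
    counts = Counter(len(p) for p in pathways)
    modal_len = max(counts, key=lambda n: (counts[n], -n))
    return next(p for p in pathways if len(p) == modal_len)


def connection_count(n_aligned, n_total, min_fraction=0.5):
    """Number of connections supported by a consensus sequence.

    Returns 0 when the pathways aligned to the consensus do not exceed
    min_fraction of all pathways (repeat considered too complex).
    """
    if n_total == 0:
        return 0
    return n_aligned if n_aligned / n_total > min_fraction else 0


pathways = ["ACGT", "ACGA", "AC", "ACGTT"]
print(pick_reference(pathways))
print(connection_count(6, 10), connection_count(5, 10))
```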

As shown in FIG. 9, the two sequences in FIG. 9(a) indicate that, in the original genomic sequence, sequence A is connected to sequence D through one copy of a repeat sequence, and sequence C is connected to sequence B through another copy of the repeat sequence (that is, a copy belonging to the same repeat family). When the existing SG method is used for assembly, the two copies of the repeat sequence shown in FIG. 9(a) (that is, two similar sequences) are compressed into one, as shown in FIG. 9(b). Note that, because the two original sequences differ, the compressed sequence is likely to differ from both of them, resulting in errors in the assembled sequence. From FIG. 9(b) it is impossible to determine whether sequence A should be connected through the repeat sequence to sequence D or to sequence B, or whether sequence C should be connected through the repeat sequence to sequence B or to sequence D. That is, because of the cross node, the assembled Contigs are broken near the start and end positions of the compressed repeat sequence, which fragments the assembly and prevents the entire original genome sequence from being truly restored. With the present invention, as shown in FIG. 9(c), similar repeat sequences are assembled into two different Contig sequences. The method of the present invention finds the representative sequences of all pathway sequences that start from the end of a given sequence and reach the ends of all other sequences, and then selects from these representative sequences the pathway most likely to be the correct connection from that end. In other words, the method of the present invention can correctly determine that sequence A is linked to sequence D through the repeat sequence (the final effective linking sequence sought by the present invention), and that sequence C is linked to sequence B through a similar copy of the repeat sequence (also a final effective linking sequence sought by the present invention), so that the entire original genomic sequence can be correctly restored. Most repeat sequences or similar repeat sequences can be assembled with the method of the present invention, finally forming an ultra-long continuous DNA sequence.

A single copy region can also be assembled using the method of the present invention, because it is not known whether the sequence is derived from a repeat sequence or a single copy sequence before assembly.

Industrial applicability

The invention provides a genome assembly method for constructing an ultra-long continuous DNA sequence. The method of the present invention includes: S1, finding an overlapping region between each pair of known DNA sequences; S2, starting from a free end of any anchor sequence fragment, extending it with a Read sequence that overlaps with it, and repeating the extension a plurality of times until a Read sequence that can be aligned to the end of another, different anchor sequence fragment is found, so as to obtain one or more pathway sequences; S3, selecting at most one of all the pathway sequences as the effective linking sequence connecting the end of the starting anchor sequence fragment to the end of another, end anchor sequence fragment; S4, using the effective linking sequence to link the starting anchor sequence fragment and the corresponding end anchor sequence fragment, using the linked anchor sequence fragment as a new anchor sequence fragment or recording the free ends of the remaining anchor sequence fragments, and proceeding to S2; steps S2-S4 are repeated to finally form an ultra-long continuous DNA sequence. The method of the invention is beneficial for restoring entire chromosome and entire genome sequences, further improves the accuracy of genome assembly, and enables the assembly of repeat sequence regions, including complex repeat regions, as well as single-copy sequence regions. It therefore has good economic value and application prospects.

Claims

A genome assembly method for constructing an ultra-long continuous DNA sequence, comprising the following steps:

S1. Perform a pairwise comparison of all known DNA sequences to find similar overlapping regions between each pair of sequences; wherein the known DNA sequences include anchor sequence fragments and random sequencing Read sequences; there are at least two anchor sequence fragments; and the pairwise comparison of all known DNA sequences includes pairwise comparison between all anchor sequence fragments and all random sequencing Read sequences, and pairwise comparison among all random sequencing Read sequences;

S2. Starting from a free end of any anchor sequence fragment, extend the anchor sequence fragment with the random sequencing Read sequences that overlap with it to form one or more extended sequences; then continue to extend these extended sequences with random sequencing Read sequences in the same way; each sequence is extended multiple times until a random sequencing Read sequence that can be aligned to the end of another, different anchor sequence fragment is encountered; extending from one end of the starting anchor sequence fragment in this way yields one or more pathway sequences connecting that end of the starting anchor sequence fragment to one or more different end anchor sequence fragments; the one or more pathway sequences form a sequence set A;

S3. From the pathway sequences in the sequence set A, select at most one sequence as the effective linking sequence connecting the end of the starting anchor sequence fragment to the end of another, end anchor sequence fragment;

S4. Use the effective linking sequence to connect the starting anchor sequence fragment and the corresponding end anchor sequence fragment; use the linked sequence fragment as a new anchor sequence fragment, or record the free ends of the remaining anchor sequence fragments, and go to S2; repeat steps S2-S4 so as to finally form an ultra-long continuous DNA sequence.
The genome assembly method for constructing an ultra-long continuous DNA sequence according to claim 1, characterized in that step S2 further comprises, before a candidate extension sequence is selected: setting a global minimum sequence similarity threshold SImin; for a sequence X, first determining whether the sequence similarity value of each sequence that overlaps with it is greater than or equal to the minimum threshold SImin; if so, using these overlapping sequences to extend sequence X; otherwise, not using these overlapping sequences to extend sequence X.
The genome assembly method for constructing an ultra-long continuous DNA sequence according to claim 2, characterized in that the global minimum sequence similarity threshold SImin is set with reference to the sequencing Read sequence accuracy value α at the whole-genome level, wherein α is calculated as follows: for each sequence, take the known overlapping sequences with the highest overlap scores, up to a number equal to the average sequencing depth; the average sequence identity value of all these overlapping regions is used as the sequencing Read sequence accuracy value α at the whole-genome level.
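The calculation of the accuracy value α described above can be sketched in Python as follows. This is an illustrative sketch only: the function name and the input layout (a mapping from sequence id to a list of `(overlap_score, identity)` pairs) are hypothetical, and using an unweighted mean over the selected overlaps is an assumption.

```python
def estimate_alpha(overlaps_by_seq, depth):
    """Estimate the genome-wide Read accuracy value alpha.

    overlaps_by_seq maps a sequence id to (overlap_score, identity) pairs.
    For each sequence, the overlaps with the highest scores are kept,
    at most `depth` of them (the average sequencing depth), and the mean
    identity over all kept overlaps is returned.
    """
    identities = []
    for pairs in overlaps_by_seq.values():
        top = sorted(pairs, key=lambda p: p[0], reverse=True)[:depth]
        identities.extend(identity for _, identity in top)
    return sum(identities) / len(identities) if identities else 0.0


example = {
    "read1": [(900, 0.98), (500, 0.90), (800, 0.96)],
    "read2": [(700, 0.99)],
}
alpha = estimate_alpha(example, depth=2)
print(round(alpha, 4))
```

SImin could then be chosen with reference to this value; how far below α it is set is left open here, as in the text.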
The genome assembly method for constructing an ultra-long continuous DNA sequence according to claim 1, characterized in that, in step S2, when a sequence end is extended, each step selects the sequence with the highest overlap score, or the sequence with the highest extension score, or a randomly selected sequence, or a combination of any two or three of the above; wherein, when a sequence is randomly selected, the probability that any one sequence is selected is determined according to its overlap score or extension score.
The genome assembly method for constructing an ultra-long continuous DNA sequence according to claim 3 or 4, characterized in that, for two sequences X1 and X2 with overlapping ends, the overlap score OS of the overlapping region is: OS = (OL1 + OL2) * SI / 2, where OL1 and OL2 are the lengths of the overlapping region in sequences X1 and X2 respectively, and SI is the sequence identity value of the overlapping region between sequences X1 and X2; the extension score ES2 of X2 with respect to X1 is: ES2 = OS + EL2 / 2 - (OH1 + OH2) / 2, where OH1 and OH2 are the lengths of the unaligned dangling regions at the ends of the two sequences, and EL2 is the length by which X2 extends X1.
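The two scores defined above translate directly into code. The following Python sketch follows the formulas as written; the function and parameter names are hypothetical, and the numeric values in the example are made up for illustration.

```python
def overlap_score(ol1, ol2, si):
    """OS = (OL1 + OL2) * SI / 2.

    ol1, ol2: lengths of the overlapping region in X1 and X2.
    si: sequence identity of the overlapping region, in [0, 1].
    """
    return (ol1 + ol2) * si / 2


def extension_score(ol1, ol2, si, el2, oh1, oh2):
    """ES2 = OS + EL2 / 2 - (OH1 + OH2) / 2.

    el2: length by which X2 extends X1.
    oh1, oh2: lengths of the unaligned dangling regions at the two ends.
    """
    return overlap_score(ol1, ol2, si) + el2 / 2 - (oh1 + oh2) / 2


# Illustrative values only: a ~1 kb overlap at 98% identity,
# a 4 kb extension, and short dangling ends of 20 and 30 bases.
print(round(overlap_score(1000, 1010, 0.98), 2))
print(round(extension_score(1000, 1010, 0.98, 4000, 20, 30), 2))
```

Longer dangling ends lower the extension score, so a Read from a different repeat copy, which typically leaves unaligned ends, ranks below a Read from the same copy with a comparable overlap.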
The genome assembly method for constructing an ultra-long continuous DNA sequence according to claim 1, wherein step S3 comprises:

S31. Divide the sequence set A into one or more pathway sequence subsets A1, A2, ..., Ak according to the different end anchor sequence fragments, each subset including one or more pathway sequences;

S32. From the pathway sequences in each pathway sequence subset Ai, obtain one sequence as the representative sequence of the subset and calculate the number of effective pathway sequences of the subset, where 1 ≤ i ≤ k;

S33. Among the representative sequences of all the pathway sequence subsets, select at most one as the effective linking sequence connecting the end of the starting anchor sequence fragment to the end of another, end anchor sequence fragment.
The genome assembly method for constructing an ultra-long continuous DNA sequence according to claim 6, wherein step S32 comprises:

S321. Divide each pathway sequence subset Ai into one or more groups Ai1, ..., Aig, where 1 ≤ i ≤ k;

S322. From each group, select the sequences whose length has the highest frequency of occurrence, together with the sequences whose length frequency falls within a certain range below the highest value, to form sequence sets Bi1, ..., Big, where the sequence set Bi1 corresponds to Ai1, and so on for the other sets;

S323. Perform a pairwise comparison of all sequences in the sequence set Bij; if the shorter of two sequences is covered by the alignment over more than a certain proportion of its length, the two sequences are considered similar sequences; select the sequence that has the largest number of similar sequences as the representative sequence of Aij; if several sequences tie for the largest number of similar sequences, select any one of them whose sequence length has the highest frequency as the representative sequence of Aij; record the number of sequences in Bij that are similar to this representative sequence as the number of effective pathway sequences of the group Aij; where 1 ≤ j ≤ g;

S324. Arrange the groups in the subset Ai from left to right in order of sequence length from short to long; starting from the group with the highest length frequency peak, compare it with the first group to its right, which has longer pathway sequences; if the total number of effective pathway sequences in the left group is higher than that in the right group, the representative sequence of the left group is the representative sequence of the subset Ai, and the search for the representative sequence of Ai stops; otherwise, temporarily set the representative sequence of the right group as the representative sequence of Ai, treat this group as the left group, compare it with the first group to its right, and repeat the above process until the total number of effective pathway sequences in the right group is lower than a certain percentage of that in the left group; in this way, the representative sequence of the subset Ai and the number of effective pathway sequences between the pair of anchor sequence fragment ends it connects are determined.
The genome assembly method for constructing an ultra-long continuous DNA sequence according to claim 7, characterized in that, in step S321, grouping is performed in the following manner:

S3211. Arrange the pathway sequences in the pathway sequence subset Ai from left to right in order of length from short to long;

S3212. Divide the pathway sequences in the subset Ai into multiple non-overlapping small windows of equal length span, and count the total number of sequences contained in each window;

S3213. Compare the total number of sequences in each window with the totals of the two windows immediately adjacent to it; if the value is larger than those on both sides, the window is a peak window; if the value is smaller than those on both sides, the window is a valley window; if there is no window on one side, the total number of sequences on that side is taken as 0;

S3214. Find all the peak windows and valley windows; if the minimum sequence length frequency in a valley window is less than a certain percentage of the maximum sequence length frequency in the nearest peak window to its right, use the sequence length with the lowest frequency in this valley window to divide the pathway sequences on its two sides into two groups; proceeding in this way, the pathway sequence subset Ai is divided into one or more groups.
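The window-based grouping in steps S3211 to S3214 can be sketched in Python as follows. This is a simplified sketch under stated assumptions, not the claimed procedure itself: it works on pathway lengths only, compares whole-window totals instead of per-length frequencies inside each window, and cuts at the start of a qualifying valley window instead of at the least frequent length inside it. The function names, the window size of 100, and the ratio of 0.5 are hypothetical.

```python
def window_counts(lengths, window):
    """S3212: count lengths in consecutive, non-overlapping windows."""
    lo = min(lengths)
    counts = [0] * ((max(lengths) - lo) // window + 1)
    for length in lengths:
        counts[(length - lo) // window] += 1
    return lo, counts


def peaks_and_valleys(counts):
    """S3213: classify windows; a missing neighbour counts as 0."""
    peaks, valleys = [], []
    for i, c in enumerate(counts):
        left = counts[i - 1] if i > 0 else 0
        right = counts[i + 1] if i + 1 < len(counts) else 0
        if c > left and c > right:
            peaks.append(i)
        elif c < left and c < right:
            valleys.append(i)
    return peaks, valleys


def split_groups(lengths, window=100, ratio=0.5):
    """S3211 and S3214: sort lengths, then split at qualifying valleys."""
    if not lengths:
        return []
    lengths = sorted(lengths)
    lo, counts = window_counts(lengths, window)
    peaks, valleys = peaks_and_valleys(counts)
    cuts = []
    for v in valleys:
        right_peaks = [p for p in peaks if p > v]
        if right_peaks and counts[v] < ratio * counts[right_peaks[0]]:
            cuts.append(lo + v * window)  # simplified cut position
    groups = [[]]
    for length in lengths:
        while cuts and length >= cuts[0]:
            cuts.pop(0)
            if groups[-1]:
                groups.append([])
        groups[-1].append(length)
    return groups


lengths = [1000] * 5 + [1150] + [1200] * 6
print([len(g) for g in split_groups(lengths)])
```

In this example the middle window holds a single pathway between two well-populated windows, so it qualifies as a valley and the lengths fall into two groups.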
The genome assembly method for constructing an ultra-long continuous DNA sequence according to claim 6, characterized in that, in step S33, at most one sequence is selected as the effective linking sequence connecting the end of the starting anchor sequence fragment to the end of another, end anchor sequence fragment in the following manner:

S331. From the numbers of effective pathway sequences corresponding to the representative sequences of the pathway sequence subsets of sequence set A, take the maximum value and the second largest value, and calculate the conflict index of the corresponding end of the starting anchor sequence fragment as: CIs = NPs_n / NPs_m, where CIs is the conflict index of the end Es of the starting anchor sequence fragment, NPs_m is the maximum number of effective pathway sequences connecting the end Es to the ends of different end anchor sequence fragments, and NPs_n is the second largest number of effective pathway sequences connecting the end Es to the ends of different end anchor sequence fragments, with NPs_m ≥ NPs_n; if the conflict index exceeds a threshold, the corresponding end of the starting anchor sequence fragment is called a conflicting end;

S332. For an anchor sequence fragment end without a conflict, among the representative sequences of all the pathway sequence subsets of sequence set A, select the representative sequence of the subset with the maximum number of effective pathway sequences as the final representative sequence of sequence set A, that is, the effective linking sequence connecting the end of the starting anchor sequence fragment to the end of another, end anchor sequence fragment; for a conflicting anchor sequence fragment end, if the conflict can be resolved, select the representative sequence corresponding to the end determined by the resolution as the effective linking sequence connecting the end of the starting anchor sequence fragment to another end anchor sequence fragment; if the conflict cannot be resolved, no effective linking sequence connecting this anchor sequence fragment end to any other anchor sequence fragment is found, and the method goes to step S2.
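The conflict index of S331 and the selection rule of S332 can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function names, the end labels, and the threshold of 0.5 are hypothetical, the index is taken as 0 when there is only one candidate end, and conflict resolution is not attempted (a conflicting end simply yields no link).

```python
def conflict_index(path_counts):
    """CIs = NPs_n / NPs_m: second largest over largest pathway count.

    Returns 0.0 when there are no pathways or a single candidate end.
    """
    ranked = sorted(path_counts, reverse=True)
    if not ranked or ranked[0] == 0:
        return 0.0
    second = ranked[1] if len(ranked) > 1 else 0
    return second / ranked[0]


def select_end(counts_by_end, threshold=0.5):
    """Pick the end anchor to link to, or None for a conflicting end.

    counts_by_end maps a candidate end anchor to its number of
    effective pathway sequences.
    """
    if not counts_by_end:
        return None
    if conflict_index(counts_by_end.values()) > threshold:
        return None  # conflicting end; resolution is not modelled here
    best = max(counts_by_end, key=counts_by_end.get)
    return best if counts_by_end[best] > 0 else None


print(select_end({"C2_left": 20, "C3_right": 3}))
print(select_end({"C2_left": 10, "C3_right": 9}))
```

In the first call one candidate clearly dominates and is selected; in the second the two candidates have nearly equal support, so the end is treated as conflicting and left unlinked until the conflict is resolved by other information.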
The genome assembly method for constructing an ultra-long continuous DNA sequence according to claim 9, characterized in that, in step S332, the method for resolving conflicts comprises: resolving conflicts at anchor sequence fragment ends located on different chromosomes according to chromosome grouping information; or resolving conflicts at anchor sequence fragment ends according to known adjacent sequence information; or, during construction of the ultra-long continuous DNA sequence, when one of the two end anchor sequence fragment ends that cause a conflict at a starting anchor sequence fragment end has already been used to connect another anchor sequence fragment, the conflict at the corresponding starting anchor sequence fragment end is thereby resolved.