Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are illustrative and intended to be illustrative of the invention and are not to be construed as limiting the invention.
In a first aspect of the invention, the invention proposes a method of assembling a sequence of separate long fragments, see fig. 1, the method comprising:
(a) obtaining a read set through sequencing, and recording sequencing holes corresponding to reads in the read set, wherein one sequencing hole comprises at least one long fragment sequence;
(b) extending a plurality of seed sequences in parallel by using the reads and sequencing wells corresponding to the reads to obtain a plurality of sequence contigs, the plurality of seed sequences being determined by known sequences;
(c) and constructing a framework sequence based on the reads, the sequence contigs and the sequencing holes corresponding to the reads contained in the sequence contigs so as to obtain an assembly result for separating the long fragment sequences.
According to an embodiment of the invention, the seed sequence is obtained based on a genomic reference sequence by the following steps:
in the genomic reference sequence, interrupting by N; and
the broken reference sequence is truncated by a predetermined length to obtain the seed sequence.
The size of the predetermined length is not particularly limited, and in general, the predetermined length is not smaller than the size of the insert so that reads from the same insert can be located on the same seed sequence. According to an embodiment of the invention, the sequencing is double-ended sequencing, and the predetermined length is at least twice the length of the sequencing library insert in the sequencing, facilitating the positioning of paired reads onto the same seed sequence. According to an embodiment of the invention, the set of reads comprises paired reads, the seed sequence being obtained based on the reads in the following steps: (1) the read is cut into a plurality of Kmers in a sliding mode, an index RKI of the Kmers to the read is constructed, and the index RKI is used for accessing the corresponding read through the Kmers; (2) extracting a pair of paired reads without a high frequency Kmer from the set of reads; (3) respectively determining all the reads corresponding to the Kmers of two reads in the pair of reads in the step (2) by using the index RKI to obtain a first read group and a second read group; (4) determining sequencing wells corresponding to the first read group and the second read group in the step (3) respectively to obtain a first sequencing well set and a second sequencing well set; (5) determining the intersection of the first sequencing well set and the second sequencing well set in (4), and if the size of the intersection is not significantly different from the expected number of valid sequencing wells of the base, determining the paired reads in (2) as the seed sequence, wherein the expected number of valid sequencing wells of the base is determined by the nucleic acid amount of the sample. The high frequency is referred to as a high frequency Kmer, and the number of occurrences of a certain Kmer is higher than the average number of occurrences of the Kmer, i.e., the high frequency Kmer is considered. In more cases, the inventors define as high frequency Kmer, Kmer as desired, occurring a number of times that is much higher than the average number of times that Kmer occurs, said "much higher" being at least 10 times, 20 times, 30 times, 40 times, 50 times, or 60 times the average number of times that Kmer occurs. The Kmer frequency expectation is related to the sequencing depth, the read length and the Kmer size, according to one embodiment of the present invention, for K19, i.e. 19mer, 100bp read length, 100X sequencing depth, the Kmer expectation frequency is 100 (sequencing depth) { [100 (read length) -19(Kmer size) +1]/100 (read length) } ═ 82, and at an average frequency of 82, kmers with frequencies ranging from 3000 to 5000 are set as high frequency kmers.
The term "no significant difference" may mean no difference in statistical sense or in general. According to an embodiment of the present invention, (5): determining the paired reads in (2) as the seed sequence if the size of the intersection is between half and twice the expected number of valid sequencing wells for the base.
The seed sequence is generated by constructing it from only known sequences, such as reference sequences, reads obtained by sequencing, assembly fragments, etc., and by combining the information from known sequences, RKI and sequencing wells, and by de novo assembly of species that are completely free of reference sequences.
According to an embodiment of the present invention, (b) includes: (i) the read is cut into a plurality of Kmers in a sliding mode, an index RKI of the Kmers to the read is constructed, and the index RKI is used for accessing the corresponding read through the Kmers; (ii) extending the plurality of seed sequences in parallel based on the reads and their corresponding indices RKI to obtain the plurality of sequence contigs.
According to an embodiment of the invention, the RKI is obtained by: performing sliding cutting on the read reads into a plurality of Kmers; and constructing a hash taking the Kmer as a key value, wherein the hash forms the RKI, and the hash records the frequency of the Kmer, the read segment to which the Kmer belongs, and the position and the direction of the Kmer on the read segment.
According to an embodiment of the invention, the seed sequence is extended by repeating the following steps: selecting a seed sequence suitable for extension; mapping the reads to the seed sequence to obtain an extended sequence; performing base-by-base alignment treatment on reads positioned at the end of the extended sequence; and if the consensus process fails, performing a heterozygosity identification, a phasing process, and/or resolving the repeat sequence.
According to an embodiment of the invention, the seed sequence suitable for extension is selected by: sliding cutting the seed sequence into Kmer; obtaining a read corresponding to the Kmer through the RKI; aligning the corresponding reads to the seed sequence; determining the coverage condition of the sequencing holes on the seed sequence based on the sequencing holes corresponding to the corresponding reads; and determining a seed sequence suitable for extension based on the coverage condition.
According to an embodiment of the invention, the reads are localized to the seed sequence by: sliding cutting the seed sequence into Kmer; obtaining a read corresponding to the Kmer through the RKI; and positioning the reads corresponding to the Kmer to the seed sequence, and comparing the reads one by one.
According to embodiments of the present invention, during the consensus process, if the set of valid sequencing wells of an extended site is evenly assigned by different base types at that site, it is judged that heterozygosity exists.
According to an embodiment of the present invention, after determining the presence of heterozygosity, the extension sequence is divided into a plurality of strands to be respectively extended.
According to an embodiment of the present invention, the read set includes a plurality of pairs of paired reads, a distance between two reads of the pair of paired reads is L, and during the unification process, if a distance between a read of the pair of paired reads positioned downstream in the extension direction and the corresponding read is not L, it is determined that a position at which the read positioned downstream in the extension direction is positioned is a start point of the repeat sequence.
It should be noted that the numerical values referred to in the present invention are mostly of statistical significance, and thus, unless otherwise specified, any numerical value expressed in a precise manner may represent a range, for example, an interval of plus or minus 10% of the numerical value; or the values are generally normally distributed, and values expressed in a precise manner include intervals of positive and negative standard deviations of the values. The distance L between two reads in a pair of reads is said to correspond to the length of the insert. Generally, in the experimental stage, the size of the constructed sequencing library is constant, i.e. the size of the inserted fragment is a fixed value, theoretically, the outer end distance between the paired reads obtained after the double-end sequencing is the fixed value, and in the data obtained after the actual sequencing, the distance between the paired reads is in normal distribution. In the process of determining and analyzing the repeated sequence in this example, the inventor sets the L as the size of the insert in the test stage, and it can be understood by those skilled in the art that the L is the value between the positive and negative standard deviations of the sizes of the inserts in any test stage or the interval between the positive and negative standard deviations of the sizes of the inserts in the test stage.
According to an embodiment of the present invention, the end point of the repeating sequence is a position where a read positioned downstream in the extending direction of the pair of reads having a distance L from the corresponding read is positioned, or a collision site when extending.
According to an embodiment of the present invention, whether the repeated sequence is a tandem repeated sequence is determined by: performing sliding cuts on reads mapped to the repeated sequence to obtain Kmers; judging whether the Kmer obtained by each read through sliding cutting has a repeated Kmer, if not, judging that no tandem repeat sequence exists, and if so, judging that the tandem repeat sequence exists.
According to an embodiment of the present invention, if the repeated sequence is a tandem repeated sequence, the tandem repeated sequence is resolved by: determining the length of the tandem repeat sequence; the reads containing the tandem repeat end point are positionally adjusted in an end point alignment to identify the base at the position of the collision.
According to an embodiment of the present invention, if the repeated sequence is not a tandem repeated sequence, and there are the following cases, the repeated sequence is judged to be a large double-repeated sequence by: the repeat sequence is greater than L in length, or reads of the pair of reads that correspond to reads positioned downstream of the repeat sequence are also positioned on the repeat sequence.
According to an embodiment of the present invention, if the repeated sequence is a large double-repeated sequence, the large double-repeated sequence is resolved by: and comparing the difference between the number of the effective sequencing holes corresponding to the upstream repetitive sequence and the number of the effective sequencing holes corresponding to the downstream repetitive sequence in the large double-repetitive sequence with the expected value of the number of the effective sequencing holes of the base, and solving the conflict on the large double-repetitive sequence according to the difference. For example, if two repeats in a large double-repeat sequence are relatively far apart, the conflicting bases at the upstream will contain significantly more effective sequencing wells (EW) than at the downstream, and it will be apparent when the defined number difference is greater than half the expected EW number, the bases can be identified by such a comparison to resolve the conflict; if the difference between the upstream and downstream EW numbers is less than or equal to half the desired EW number, a Helper Contig (HC) can be constructed using the upstream arm corresponding to the wrong downstream arm within one inset length from the start of the repeat region, comparing the EW set on the HC to the EW set of upstream colliding bases versus the difference between it and the downstream colliding bases.
According to an embodiment of the present invention, if the repeated sequence is not a large double complex sequence, and there are the following cases, the repeated sequence is determined to be a small double complex sequence: the length of the repeated sequence is less than L.
According to an embodiment of the invention, the small double repeat sequence is resolved by at least one of: (p) determining the base at the collision site by comparing the proximity of the distance between pairs of reads supporting two colliding bases to the mean, using the mean of the distances between pairs of reads supporting each base as the expected position of the collision site; (k) constructing an auxiliary contig using reads corresponding to reads of the paired reads that are positioned to the most upstream of the non-repetitive sequence in the extended sequence standard deviation range, and determining bases at the collision sites using reads of the paired reads that are corresponding to reads positioned to the downstream of the auxiliary contig.
According to an embodiment of the present invention, if the small double-repeated sequence cannot be resolved, i.e. the conflict cannot be resolved, the extension of the seed sequence is terminated.
Referring to fig. 2, according to an embodiment of the present invention, (c) includes: (iii) establishing a merged connection relationship between the sequence contigs based on the reads; (iv) constructing a primary backbone sequence based on the sequence contigs and sequencing wells corresponding to the reads comprised by the sequence contigs; (v) obtaining an assembly result separating long fragment sequences by merging the sequence contigs to construct the backbone sequence after mutually examining the merged linkage relationship between the plurality of sequence contigs and the primary backbone sequence.
The method of any of the above embodiments of the invention has at least one of the following four characteristics and advantages:
and (3) iterative extension: the extension of the seed sequence is an iterative process, and the part of the extension is used as an extension base of the next iteration, so that the seed sequence can be continuously extended.
Linear extension: unlike other graph-based assembly algorithms, the linear extension method makes the structure encountered during assembly simpler, the logical relationship is relatively clear, and the classification processing is easy, while the graph algorithm gathers the information of all the repeated areas together, the amount of the involved information is more, and the generated structure is more complex. That is, the solution of the graph is a one-time solution, as long as one repeat region is solved, all the genomic regions associated therewith (which all have the same or similar repeat sequences) are solved at the same time, while the linear method only solves the repeat of the current region at one time, does not solve the other regions at the same time, and is solved again when the repeat is extended to another region of the genome. The problem that the relevant information is recorded after one iteration is solved, so that the information encountered in the next iteration is simpler, and the extra calculation loss is reduced. In addition, the linear extension mode makes the extension of each seed sequence independent, and the mode is easier for calculation parallelization.
Multipath extension: since the objective of the algorithm is to construct polyploid genome, extension of seed sequence will use well information to perform multi-path joint extension of multiple haploids, whenever heterozygous region is encountered, phasing operation will be performed in real time in combination with the situation of previously assembled heterozygous region, i.e. phasing is also linear, and throughout each module of the whole assembly system (e.g. contig combination, skeleton sequence construction). In order to save resource consumption, the extension of the homozygous region still adopts single-path extension, when the heterozygous region is encountered, the extension is divided into multiple paths to carry out extension, and after the heterozygous region is extended, the extension is combined again to form a single path to continue extension.
And (3) global information judgment: if an algorithm only uses the related information under the current judgment, and does not consider the selection branch in the subsequent global scope, an error is easy to occur or the global optimal solution cannot be reached (such as Dijkstra greedy algorithm in the single-source shortest path calculation). For genome assembly, an assembly error is often caused by wrong selection by a greedy algorithm, so that the algorithm does not adopt the greedy algorithm to solve the conflict during extension and uses global information to perform correlation judgment. Because the length (100 Kbp) of the LFR is long, all information (Kmer, read, mate-pair and well information) provided by the position is considered during path selection, wherein the range involved in judgment of the well information can be about 100Kbp, if large-scale repetition does not exist in large quantity, the algorithm is basically consistent with the global property, and when the extension meets the complex condition, the algorithm strictly selects to terminate the extension rather than take a higher value for extension, so that a greedy judgment mode is avoided. It is particularly noted that, since all well information at the position is considered in the extension, the algorithm has stronger global property compared with the conventional hierarchical assembly algorithm, and the requirement of the sequencing depth can be greatly reduced, namely, each well does not require a particularly high sequencing depth, so that a great deal of resources are saved, namely, the cost and the time are saved.
It will be understood by those skilled in the art that all or part of the steps of the method of one aspect of the present invention described above may be implemented by instructing the relevant hardware through a program, which may be stored in a computer-readable storage medium, and the storage medium may include: read-only memory, random access memory, magnetic or optical disk, and the like.
The process is written into an executable program, and the executable program comprises the following steps: first, seeds and reads are Read, and the reads are constructed as the Index (Read Kmer Index, RKI) of the Kmer to the reads, so that the target reads can be quickly accessed through the Kmer during comparison. Then seeds are lengthened as much as possible, and since the extension between seed and seed is independent, this step can be accelerated by parallel operation. The extended seed is a sequence contig (contig), and at this time, a framework sequence (scaffold) is pre-constructed by beginning through well information, and only the context between contigs is established, and the scaffold is not immediately constructed. Also, by having used the reads and pairs information to establish a merged connection relationship between contigs, these contigs are not immediately merged. Then, the relation of combining between contigs and the relation in the pre-constructed scaffold are mutually checked, the relations are simplified and conflicts are solved, then the combining of contigs and the construction of scaffold are carried out, and finally, the assembly result is output.
In yet another aspect, the present invention provides an apparatus for assembling a sequence of separate long fragments for performing the method of any of the embodiments of the present invention described above. Referring to fig. 3, the apparatus includes: the input module is used for obtaining a read set through sequencing and recording sequencing holes corresponding to reads in the read set, wherein one sequencing hole comprises at least one long fragment sequence; an extension module that extends a plurality of seed sequences in parallel using the reads and the sequencing wells corresponding to the reads to obtain a plurality of sequence contigs, the plurality of seed sequences being determined by known sequences; and the framework sequence construction module is used for constructing a framework sequence based on the reading segments, the sequence contig and the sequencing holes corresponding to the reading segments contained in the sequence contig so as to obtain an assembly result of separating the long fragment sequences.
Those skilled in the art will understand that the processing steps or means employed in any of the above embodiments of the present invention can be implemented by corresponding functional modules or included sub-modules of the apparatus. The above description of the technical features and advantages of the method of the invention applies equally to this device of the invention.
According to an embodiment of the invention, the seed sequence is obtained based on a genomic reference sequence by the following steps: in the genomic reference sequence, interrupting by N; and truncating the broken reference sequence according to a preset length so as to obtain the seed sequence.
According to an embodiment of the invention, the predetermined length is not less than the length of the sequencing library insert in the sequencing.
According to an embodiment of the invention, the set of reads comprises paired reads, the seed sequence being obtained based on the reads in the following steps: (1) the read is cut into a plurality of Kmers in a sliding mode, an index RKI of the Kmers to the read is constructed, and the index RKI is used for accessing the corresponding read through the Kmers; (2) extracting a pair of paired reads without a high frequency Kmer from the set of reads; (3) respectively determining all the reads corresponding to the Kmers of two reads in the pair of reads in the step (2) by using the index RKI to obtain a first read group and a second read group; (4) determining sequencing wells corresponding to the first read group and the second read group in the step (3) respectively to obtain a first sequencing well set and a second sequencing well set; (5) determining the intersection of the first sequencing well set and the second sequencing well set in (4), and if the size of the intersection is not significantly different from the expected number of valid sequencing wells of the base, determining the paired reads in (2) as the seed sequence. The expected number of valid sequencing wells for the base is determined by the amount of nucleic acid in the sample.
According to an embodiment of the present invention, in (5) of the above embodiment: determining the paired reads in (2) as the seed sequence if the size of the intersection is between half and twice the expected number of valid sequencing wells for the base.
According to the embodiment of the present invention, the apparatus further includes an index constructing module, connected to the extending module, configured to slidably cut the read into a plurality of kmers, construct an index RKI of the Kmer to the read, and configured to access the corresponding read through the Kmer; the following are then implemented with the extension module: extending the plurality of seed sequences in parallel based on the reads and their corresponding indices RKI to obtain the plurality of sequence contigs.
According to an embodiment of the invention, the RKI is obtained by: performing sliding cutting on the read to obtain a plurality of Kmers; and constructing a hash taking the Kmer as a key value, wherein the hash forms the RKI, and the hash records the frequency of the Kmer, the read segment to which the Kmer belongs, and the position and the direction of the Kmer on the read segment.
According to an embodiment of the invention, the seed sequence is extended by repeating the following steps: selecting a seed sequence suitable for extension; mapping the reads to the seed sequence to obtain an extended sequence; performing base-by-base alignment treatment on reads positioned at the end of the extended sequence; and if the consensus process fails, performing a heterozygosity identification, a phasing process, and/or resolving the repeat sequence.
According to an embodiment of the invention, the seed sequence suitable for extension is selected by: sliding cutting the seed sequence into Kmer; obtaining a read corresponding to the Kmer through the RKI; aligning the corresponding reads to the seed sequence; determining the coverage condition of the sequencing holes on the seed sequence based on the sequencing holes corresponding to the corresponding reads; and determining a seed sequence suitable for extension based on the coverage condition.
According to an embodiment of the invention, the reads are localized to the seed sequence by: sliding cutting the seed sequence into Kmer; obtaining a read corresponding to the Kmer through the RKI; and positioning the reads corresponding to the Kmer to the seed sequence, and comparing the reads one by one.
According to embodiments of the present invention, during the consensus process, if the set of valid sequencing wells of an extended site is evenly assigned by different base types at that site, it is judged that heterozygosity exists.
According to an embodiment of the present invention, after determining the presence of heterozygosity, the extension sequence is divided into a plurality of strands to be respectively extended.
According to an embodiment of the present invention, the read set includes a plurality of pairs of paired reads, a distance between two reads of the pair of paired reads is L, and during the unification process, if a distance between a read of the pair of paired reads positioned downstream in the extension direction and the corresponding read is not L, it is determined that a position at which the read positioned downstream in the extension direction is positioned is a start point of the repeat sequence.
According to an embodiment of the present invention, the end point of the repeating sequence is a position where a read positioned downstream in the extending direction of the pair of reads having a distance L from the corresponding read is positioned, or a collision site when extending.
According to an embodiment of the present invention, whether the repeated sequence is a tandem repeated sequence is determined by: performing sliding cuts on reads mapped to the repeated sequence to obtain Kmers; judging whether the Kmer obtained by each read through sliding cutting has a repeated Kmer, if not, judging that no tandem repeat sequence exists, and if so, judging that the tandem repeat sequence exists.
According to an embodiment of the present invention, if the repeated sequence is a tandem repeated sequence, the tandem repeated sequence is resolved by: determining the length of the tandem repeat sequence; and adjusting the position of the read segment containing the end point of the tandem repeat sequence in an end point alignment mode.
According to an embodiment of the present invention, if the repeated sequence is not a tandem repeated sequence, and there are the following cases, the repeated sequence is determined to be a large double-repeated sequence: the repeat sequence is greater than L in length, or reads of the pair of reads that correspond to reads positioned downstream of the repeat sequence are also positioned on the repeat sequence.
According to an embodiment of the invention, the large double-repeat sequence is resolved by: and comparing the difference between the number of the effective sequencing holes corresponding to the upstream repetitive sequence and the number of the effective sequencing holes corresponding to the downstream repetitive sequence in the large double-repetitive sequence with the expected value of the number of the effective sequencing holes of the base, and solving the conflict on the large double-repetitive sequence according to the difference.
According to an embodiment of the present invention, if the repeated sequence is not a large double complex sequence, and there are the following cases, the repeated sequence is determined to be a small double complex sequence: the length of the repeated sequence is less than L.
According to an embodiment of the present invention, if the repeated sequence is a small double repeated sequence, the small double repeated sequence is resolved by at least one of: (p) determining the base at the collision site by comparing the proximity of the distance between pairs of reads supporting two colliding bases to the mean, using the mean of the distances between pairs of reads supporting each base as the expected position of the collision site; (k) constructing a helper contig using reads corresponding to reads in pairs of reads that are located to the most upstream of the non-repeated sequence in the extended sequence standard deviation range, updating the distance mean by adding these reads data using reads corresponding to reads in pairs of reads that are located to the downstream of the helper contig, and determining the base at the collision site also by comparing the proximity of the mean to the distance between pairs of reads that support two colliding bases.
According to an embodiment of the present invention, if the small double complex sequence cannot be resolved, the extension of the seed sequence is terminated.
Referring to fig. 4, according to an embodiment of the present invention, the skeleton sequence building module includes a primary skeleton sequence building module, configured to establish a merged connection relationship between the sequence contigs based on the reads; the device also comprises a merging connection relation establishing module which is used for establishing a primary framework sequence based on the sequence contig and the sequencing holes corresponding to the reads contained in the sequence contig; the method also comprises an assembling module, which is used for constructing the framework sequence by combining the sequence contigs after mutually checking the combined connection relation among the sequence contigs and the primary framework sequence, and obtaining an assembling result of separating long fragment sequences.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
For ease of understanding, the assembly method according to the preferred embodiment of the present invention is analyzed in detail step by step.
The algorithm completely stores the reads data which continuously need to participate in comparison in the extension process into the memory in a binary mode until the whole program is terminated in order to achieve the purpose of efficient calculation. The information on the quality values provided in sequencing (e.g., the quality values in the Fastq format) that are not recorded and used during assembly are designed to be removed or corrected for quality value outlier bases and reads in a pre-assembly data preprocessing step, i.e., the read sequences are considered normal in the sequencing quality values and assembled regardless of differences between the different quality value bases.
After reading the Reads and recording the information of the wells to which the Reads belong, sliding and cutting the Reads into Kmer to construct a Reads Kmer Index (RKI), which is hash (hash) with the Kmer as a key and records the frequency of the Kmer, the set of the Reads to which the Reads belong, the position and the direction of the Kmer on the Reads. The hash is characterized by extremely fast search speed and only O (1) time complexity, so the core function of the index is to quickly access its corresponding sets of reads through the Kmer to determine which reads are to be located on the seed for detailed alignment.
The RKI structure needs to record information of a large number of reads, and further, since one read comprises a plurality of Kmers, the read will appear in entries of the plurality of Kmers, which causes the memory consumption to be larger, for a human genome, the memory consumption under 100X sequencing quantity is about 3TB, and compared with the memory consumption under 1TB of the traditional assembly technology, the memory required by the algorithm is larger. The data structure can be further optimized, which can reduce the cost and resource bottleneck of assembly. The optimization can be performed by omitting or limiting the information of the ultra-high frequency kmers in the related reads to reduce the memory overhead, because the effect of the kmers on assembly is small, the time is wasted when the kmers are used for comparing and positioning the reads, and the efficiency is extremely low.
The data structure design related to RKI is obviously different from the design in some DBG algorithms, the graphs of relationships among Kmers are constructed by reading reads from files, the reads do not need to be stored in a memory, and the reads relationships corresponding to the Kmers do not need to be recorded in detail, because the representation of a genome is converted from a pile of disordered reads into a group of logically-associated Kmers, a large amount of arrangement and compression are carried out on the data, and then the construction of the genome only needs to operate on the Kmer graphs, and does not need to directly operate the reads. The algorithm does not construct an overall graph, and reads need to be randomly accessed by the Kmer continuously in the seed extension (genome construction) process to obtain extension (assembly) materials.
Finally, relevant parameters for experimental purposes also need to be read in, such as the size of the inserts of pairs of reads in the library construction, the number of wells to isolate the LFR, and the number of cells to be put in, and the number of ploidies (obtained by optical means or informatics analysis) of the target genome. But the size of the insert and the number of the input cells can be obtained by training statistics in the initialization of the seed sequence, and the ploidy number of the target genome can also be obtained by hybrid recognition calculation. The different methods for obtaining information have certain errors, and the actual assembly needs to be combined, analyzed and applied.
After seed is selected, initialization and validity determination are performed before extension is started. And initializing to obtain relevant information on the seed, wherein the information is used for later legality judgment and seed extension materials. And the validity determination is used to discard seed that is on the extended region and that is not suitable for extension.
Firstly, cutting a seed into a Kmer by sliding, obtaining reads corresponding to the Kmer by RKI, then finely comparing the reads with the seed, and determining the coverage condition of the seed by the well by using the well information corresponding to the reads.
A Well that covers substantially the entire seed is defined as a "valid Well (EW)" and other wells that do not yet cover enough will be recorded as candidates and updated in later extensions. In the assembly of long fragment separating sequences, each site on the genome is covered by multiple LFRs, the number of the covered sites is related to the number of cells and the number of ploidy, as the extension progresses, the LFR assembly covering the extension region changes transiently, and the old LFR is replaced by the new LFR, just like a relay race. For example, if 10 diploid cells are used to construct the LFR, the number of wells per site is expected to be 10 cells, 40 diploid DNA duplexes, if the LFR length is 100 kbps, the average spacing between adjacent LFRs is 100 kbps/40 2.5 kbps, i.e., the expected number of wells in the 2.5 kbps range is 40, and if the calculated range is expanded, the number of wells will increase by one for each 2.5 kbps increase.
In seed extension, the set of EWs is identified by their well coverage of the extended (or initialized) tile. When an LFR just starts to cover the extended seed, the coverage degree of the LFR on the extended seed continuously rises along with the extension of the seed until the LFR is completely covered, and the well corresponding to the LFR which basically completely covers the seed can be used as the EW for auxiliary extension; as the seed continues to extend forward, the LFR will gradually exit the region where the seed extends, and its coverage will continuously decrease until it is 0, and when the LFR covers the seed obviously insufficiently, the corresponding well will no longer be used as EW for assisting the extension. This produces a transitional change in the EW constellation as it extends.
The EW set updated continuously with the extension guides the filtering of reads during assembly, and reads only belonging to the region near the current seed can be separated from numerous reads in the whole genome, so that some complex regions in the whole genome range are degenerated into simple regions in the LFR length range, and the simple regions are easy to assemble, namely the core of the assembly mode of separating long fragment sequences.
The obvious difference between this assembly method that considers a collection of multiple wells simultaneously in extension and the traditional two-stage assembly method that considers a single well separately (i.e., each well is assembled separately and then the assembly results are combined for secondary assembly) is the use of EW. The method can expand the depth information required for assembly from the accumulated depth of a single well to the integrated depth of the joint whole EW set, reduces the sequencing depth required by the single well and only needs low coverage. Taking 10 diploid cells (40 copies) as an example, previous secondary assembly methods required a sequencing depth of 40 × 50X-2,000X, where 50X is the sequencing depth required for routine genome assembly; while the present algorithm only requires a sequencing depth of 40 × 2.5 ═ 100X, where 2.5X is the sequencing depth required for an EW to be identified without the tolerant partial region being sequenced, the cumulative depth of the entire set of EWs when extended is 100X, i.e., 2.5X is used to identify the EW and 100X is cumulative for extension. This approach greatly reduces sequencing costs and saves time.
In addition, transient changes in EW can also be used to determine the positional relationship between contigs, creating extremely long framework sequences.
According to the embodiment of the invention, when a seed is initialized and extended, reads need to be aligned to the seed to provide materials for obtaining information (such as paired information of well and reads) and extending sequences, and it is impractical to try to align all reads, which consumes a lot of time, so that the reads can be finely aligned only when the reads hit on the seed are initially screened by using RKI, and the method is as follows:
1. sliding cutting the seed sequence to obtain Kmers, and obtaining reads corresponding to the Kmers through RKI. The whole seed sequence is cleaved upon seed initiation, while only the sequence at the contig end is cleaved during extension. For efficiency reasons, very high frequency kmers (one Kmer for very many reads) are not used to obtain reads. Specifically, for reads acquisition during extension, if the ends are all found to be very high frequency Kmers, only the endmost one Kmer is taken for reads localization.
2. The reads are positioned on the seed through the position records of the Kmer on the seed and the reads, and are compared base by base, so that certain tolerance is provided for the substitution type error in sequencing, but the error of the indel type cannot be tolerated without a gap module.
The purpose of reads filtering is to remove reads that are incorrectly located due to duplication. According to the algorithm, firstly, reads which are wrongly positioned due to the repetition of the whole genome level are filtered through EW information, and then, right arm reads which are wrongly positioned due to the simple repetition inside the LFR are filtered through mate-pair information (the left arm reads are not filtered). It should be noted that the left arm reads in the duplicate region will not participate in the validity determination of its corresponding right arm reads, since it may also locate errors itself, and the complex type of duplicates will be handled by the duplicate resolution module.
4. In order to obtain an extended sequence, it is necessary to make the reads located at the ends of contigs uniform base by base. Similar to reads comparison, only substitution type sequencing errors are tolerated in the consistency process, and indel type tolerance is not available. Because the mate-pair filtering of the reads can only act on the reads of the right arm, the reads of the right arm cannot generate positioning errors, and the method has obvious advantages with the reads of the left arm. Based on the characteristic, when the filtered right arm reads are sufficient in number, the module only uses the reads for consistency, if the filtered right arm reads can be combined into a sequence, the extension is completed, then the module enters a related information updating module, and if the filtered right arm reads cannot be combined into a sequence, namely, conflict occurs during consistency, the left arm reads in the region need to be combined for heterozygosis or repeated identification and solution. And when the number of the right arm reads is insufficient, the left arm reads is combined to carry out consistency.
5. During the merging process of the consistent base-by-base sites, if more than one base is found to appear in the same position in a non-low frequency (the low frequency is mainly caused by sequencing errors), collision can be caused, and the consistency of the sites is judged to be failed, which is mainly caused by hybridization and repetition. Compared with the repeated method, the heterozygous identification of the diploid genome is simpler, the characteristics are that the conflicting bases are only two, the halved EW sets are combined, and the EW sets supported by the diploid bases are basically not intersected, so the heterozygous identification is carried out after the consistency is failed.
Sequencing errors do not fit this situation compared to shuffling, and the main features of sequencing errors are low frequency and random. Because most sequencing technologies have few sequencing errors in the total sequencing quantity, only few differences can be found during consistency, even under the condition of large fluctuation of the sequencing depth, the differences caused by the sequencing errors still only occupy few components, the absolute number of the sequencing errors can change along with the fluctuation of the sequencing depth, but the proportion is not changed, namely the sequencing error rate is not changed. On the other hand, the randomness is expressed in that the sequencing errors do not deviate in different wells, and the sequencing errors occur in all the wells with the same probability, and even though the sequencing errors have certain bias on the error detection mode, the conditions are the same in all the wells. This means that the bases with sequencing errors are not concentrated on the well, and different bases support intersection sets among EW sets in the case of consistency conflict, and cannot be distinguished. It should be noted that the sequencing depth cannot be too low, since the characteristics of the sequencing error become less obvious when the overall sequencing depth is low, and misjudgment is very likely to occur.
When heterozygosity is identified, contigs will be split into two for two-way extension, and phasing will be performed in conjunction with EW conditions of the previous heterozygosity zone when splitting. In principle two heterozygous regions from the same haploid should have similar EW sets, in contrast to two heterozygous regions from different haplotypes should have no intersection between the EW sets. In this way, phasing operations of adjacent heterozygous sites can be performed, in particular, if the previous heterozygous region is too far apart (beyond the LFR length), where there is no intersection between the EW's of even two heterozygous regions on the same haploid, phasing will be judged to be a failure and a new phased block is established, while contigs themselves are still continuous and not interrupted at the same time.
The iterative seed sequence extension process is shown in fig. 5. The method is different from other phasing algorithms, only the EW relation of two adjacent sites is considered, the EW relation is not considered together by combining a plurality of sites, the phasing mode is linear, the logic is simple, and the phasing algorithm still has an improved space. Of course, the phasing method of the algorithm has certain advantages, and unlike some methods which only consider single-site hybridization, the algorithm performs phasing on all hybridization types, so that the phasing length is improved.
Like other approaches to identifying structural variations based on assembly, the present algorithm can also identify large structural variant heterozygotes, which is an advantage of the present algorithm over other approaches that deal with separating long fragment sequences to isolate haplotypes. Furthermore, since insertion-deletion heterozygous and large structural variant heterozygous do not differ in the phasing patterns in the algorithm, the algorithm can construct more complete and longer haplotypes than other phasing methods that only consider single alternative heterozygous sites, and can also obtain functions that cannot be obtained by re-sequencing methods, especially phasing large structural variants, which is of great significance for downstream analysis.
Repetitive sequences (reproducible Sequence) are identical or similar sequences occurring at different positions in the genome. It occurs abundantly in the genome, as the total length of the various types of repetitive sequences of the human genome account for almost half the genome size. Repeated sequences are always important problems influencing the assembly quality, and whether the repetition can be correctly solved is also the problem that various assembly algorithms are most concerned about and continuously try new strategies to break through. Duplicate resolution is naturally also a key module in the present algorithm, mainly dealing with complex duplicates within LFR, mainly including adjacent small-sized duplicates, large-sized duplicates, and tandem duplicates. These duplicate solutions are described below.
Different repeat types are solved by different modules, the identification of the repeat types needs the assistance of related information, the identification of a repeat area is one of important information for identifying the repeat types, the main purpose is to identify left arm reads which cannot be used for filtering, namely, mate-pair reads are all in a repeat area, because one of the characteristics of the repeat area is that the reads can be wrongly positioned at the position, and if the wrongly positioned reads are used for extension, wrong extension is often caused.
As shown in fig. 6, the repetitive region identification mainly includes identification of a start point and an end point. When the left arm corresponding to the right arm reads in the extension region is found not to be near the position one insert length upstream during extension, the start of the repeat region can be determined.
The end points of the repeated area are mainly divided into two types, and for simple repetition (the problem can be solved only by filtering wrongly positioned reads through mate-pair relation), the end point of the repeated area is the position of the lost left arm when the reads of the right arm do not appear any more; for complex repeats, the overshoot point during extension is the endpoint.
In addition, the difference regions appearing in the repeated segments will be considered as non-repeated regions, i.e. similar repeated segments will be strictly divided into a plurality of sub-repeated segments to be processed. In this case, the property identification of the region after the difference point is different, and the left arm corresponding to the right arm reads of the start point of the repeat region is in the previous repeat region but not in the non-repeat region, and a situation occurs in which the positioning of the paired reads is incorrect. At this time, the left arms reads near the length position of the insert before the extension point are checked, if the corresponding right arms participate in the extension, the currently extended region is the repeat region, and if a part of the regions is found not to participate in the extension, the regions already extend into the non-repeat region.
Because the mate-pair filtering of reads can only act on the right arm reads, the positions of the right arm reads and the left arm reads are different when the extension is performed, however, if a region exists, the left arm reads can be checked through the right arm reads, and the left arm reads can play a role in solving the duplication. This is just the concept of Helper Contig (HC).
HC is a contig that solves the problem of complex repetitions, is used only as an aid, and does not appear as a formal contig in the assembly results. The essence is to further utilize the mate-pair information of reads, and expect that HC will cross the current repeat region, appearing on the downstream non-repeat region, and use this non-repeat region to help resolve the repeat. If HC fails to cross the repeat region, it generally fails to work. The current application objects of HC are mainly divided into the following two categories:
(1) HC case 1-adjacent small double repeats:
because the length of the insertion fragment of mate-pair reads has a certain Standard Deviation (SD), if two adjacent small repeats slightly longer than the length of the reads-Kmer +1 exist, the SD deviation causes the confusion of mate-pair information during extension, the conflict occurs during consistency, and two conflicting bases are supported by EW and mate-pair reads.
For such a problem, the algorithm first uses the average value calculated by mate-pair information supporting each base as the expected position, and if the distance between the two positions is too close and the actual situation is fuzzy, the algorithm needs to construct HC assistance for recognition.
The right arm corresponding to the left arm reads at the leftmost end within the range of contig end SD length, which is not in the repeated area, is first used as the start of HC construction, and this reads is extended in a manner similar to seed. When extended to a sufficient length, the position of the colliding bases can be calculated separately by locating the left arm corresponding to the right arm reads on this HC. In essence, using HC in this case only improves the reliability of the calculated distance, and if SD is large, an error may still occur.
(2) HC case 2-large double repeat:
when the length of the repeat region is greater than the length of the insert, the left arm reads will not be available for filtering of reads, and in essence, the left arm will not be available for filtering the right arm within the repeat region as long as it is within the repeat region, whether the region between the two arms is repetitive or non-repetitive.
Large-scale repetition is long, and well information is easy to be different, so that the large-scale repetition can be solved by utilizing the well information according to different conditions, and partial conditions need to be solved by combining HC, and the large-scale repetition is mainly divided into two types:
the two repeats are far apart, in which case, although both colliding bases have EW support, bases upstream will contain significantly more EW than downstream. Conflicts can now be resolved by this comparison. In contrast, for other wells that have not been identified as EWs, it may be found that support more downstream than upstream, which may also be used to assist in conflict resolution, as shown in FIG. 7. If the above information remains ambiguous, the HC can be constructed using the left arm corresponding to the wrong right arm within one insert length from the start of the repeat region, and it is expected that the EW set on the HC will differ less from the EW set of upstream colliding bases than it does from downstream colliding bases.
The close distance between the two repeated sequences, in which case the well information will not change significantly and cannot be used for conflict resolution, but there may be some simple cases where the HC can resolve the conflict, see fig. 8:
a) when the distance between the collision site and the end point of the repeat region is less than the length of the insert, it can be found that the right arm corresponding to the left arm reads supporting the upstream base can be positioned on the HC (the construction of HC is the same as the previous large repeat, and is also constructed by using the left arm corresponding to the wrong right arm within one insert length range from the start point of the repeat region), as shown in FIG. 8 a;
b) when the distance between the collision site and the start of the downstream repeat is less than the insert length, it was found that the right arm corresponding to the left arm reads supporting the upstream base could be found within the repeat region, as shown in FIG. 8 b.
A Repeat consisting of multiple copies of a Repeat Unit joined in series is defined as a Tandem Repeat (TR), and the shortest Repeat segment in the Tandem Repeat region is a Tandem Repeat Unit (TRU). A tandem repeat unit has a plurality of different phases (e.g., ACT, CTA, TAC) equal in number and length.
The sequence at the end of the tandem repeat region and the sequence differences in the repeat region cause collisions with the sequence in the repeat region, as in the case of conventional repeats, whereas a tandem repeat region is equivalent to a plurality of conventional repeat regions linked together, unlike conventional repeats. This makes this iterative solution both identical and different from other conventional iterative solutions.
As shown in fig. 9, the positioning of reads may be shifted due to the tandem repeat, which may cause conflict due to errors in positioning of reads with repeated end points or repeated differences during the consistency. The core material that raises this problem is that the Kmer used to locate reads is a repeated Kmer. In this case, the head of the incorrectly positioned reads is behind the start of the tandem repeat area (because the non-repeat area before the start will cause the incorrectly positioned reads to fail in the fine alignment process), and the offsets of the reads due to the incorrect positioning are smaller than the SD of the length of the inserted segment, i.e. the mate-pair cannot filter out the reads with the positioning errors.
The first prerequisite to solving TR is correct identification, which in the present algorithm is confirmed by finding tandem repeat units. Since TR larger than the length of the insert also satisfies the activation condition of HC case2, the present algorithm puts the activation decision of TR before the activation decision of HC case 2.
TRUs are mainly identified by finding that kmers are used periodically. For TRUs smaller than the length of reads-Kmer +1 (i.e. TRUs appear twice or more on one read), slides and cuts reads on the collision site into kmers and searches whether multiple occurrences of kmers exist in one read, if most reads have the condition, and the repeated kmers of the reads are consistent or consistent on TRU phases, it can be determined that the current collision is caused by tandem repeat, and the de-tandem repeat module will be activated. And for TRUs with length larger than the reads-Kmer +1, scanning the Kmers in the range from the starting point of the repeated area to the conflict point, and if the Kmers appearing in a fixed period are found, judging that the conflict is caused by the tandem repeat.
After identifying the contents of the TRU, reads at the collision site can be classified into four categories: 1) reads within the TR region, which are completely covered by the TRU; 2) reads containing the start of the TR, which find the TRU only at the tail; 3) reads containing TR endpoints, which find TRUs only at the head end; 4) the reads of both the start and end of the TR are included, which is only present when the TR is smaller than the length of the reads, and now the reads of type 1 that are completely covered by the TRU are not found.
For TR with length less than the length of reads, the position of the reads containing the TR end point is adjusted by an end point alignment mode, and then the conflict can be eliminated. And for TR's larger than the length of reads, only "N" can be crossed and filled in to resolve. For TR which is longer than the reads and shorter than the reads insert length, aligning the right arm reads which are supported by the left arm in the non-repetitive area and contain the TR terminal point according to the terminal point and carrying out consistency, estimating the distance according to the reads insert length and filling in the estimated number of N; for TR larger than the length of the insert, it can be found that none of the right arm reads containing the end point of TR is supported by the left arm reads in the non-repetitive region, and at this time, these reads are directly normalized, and only one "N" is filled as a mark. It should be noted that minor differences (e.g., individual base substitutions, insertions, and deletions) in the TR region can be confused with the TR-terminus, and conflicts can arise when identified reads that contain the TR-terminus are aligned. At this time, the conflicting reads can be assembled by using a DBG method to naturally construct a difference sequence and a TR endpoint sequence in the separated TR region, and then the position of the assembled sequence is calculated by using the position of the left arm read on the previously extended non-repetitive region corresponding to the right arm read located on the assembled sequence, so that the sequence farthest from the extension point is the correct TR endpoint.
According to the embodiment of the invention, since the extension of seed is parallel, in order to avoid that the same region is repeatedly extended by a plurality of seed, the reads which participate in the extension need to be marked as 'used', when other seed initializes or extends, the extension is stopped when finding the reads, and then the connection is carried out by the contig merging module. It should be noted that reads within the duplicate area of mate-pair are not explicitly located, and to prevent false extension stops, reads of the duplicate nature will be labeled "duplicate" and will not be used as a determination of redundant extension.
According to embodiments of the present invention, when reads in a well are located on contigs, which may be mislocations due to repetition or sequencing errors, the algorithm requires that the well have relatively sufficient coverage for the extended seed region to be considered an EW, while EW with insufficient coverage will be discarded. Based on efficiency considerations, the update of the EW set is performed each time a certain extension length is reached, and only coverage conditions within this length range are considered.
After all contigs have been extended, they are combined, and the phasing between contigs is performed. Contigs that can be merged are divided into two cases:
contigs that stopped by the presence of "used" reads were found during extension. The merging point of two contigs is now calculated directly from the positions of these "used" reads on the two contigs.
Contig in which extension is stopped due to duplication, too low depth of local sequencing, and the like. At this time, whether mate-pair reads of outward reads (left arm reads in case of downstream extension and right arm reads in case of upstream extension) on the terminal non-repetitive region of a contig are positioned on other contigs is considered, and if the match-pair reads are found successfully, the distance between two contigs is estimated by the length of the insertion segment and the estimated number of 'N's is filled.
Because the basic unit of skeleton sequence construction in the algorithm is also the contigs which are extended and terminated but not combined yet, the contig combining step establishes the relationship between contigs (overlapped or connected with intervals) in actual processing, and then combines the relationship between contigs established in the skeleton sequence constructing step to carry out mutual inspection, so as to simplify the relationship and solve the conflict, and then the contig combining operation is carried out substantially.
As the extension progresses, the well/LFR has a characteristic of transition change, the characteristic can be used for determining the position relation between contigs, and the algorithm finally carries out a skeleton sequence construction (scaffold) operation on all contigs. In fact, this thinking has been reflected in the previous large-scale iterative solutions.
In the algorithm, a sequence formed by determining the context between contigs mainly through well information is defined as a framework sequence (scaffold), which is different from the framework sequence constructed through mate-pair information of reads in the traditional definition. The scaffold in the algorithm only expresses the context of contigs, but the specific distance between contigs cannot be confirmed, and the contigs are connected by only one 'N'. In the traditional sense, the scaffold not only determines the context between contigs through the mate-pair of reads, but also calculates the distance between contigs through the positions of pairs of reads respectively positioned on two contigs and the lengths of their insertion fragments, which is consistent with the method in the contig combination. Both of those approaches involve sequential Arrangement of contigs, which is an Optimal Linear Arrangement (OLA) with NP-Hard properties. It should be noted that contigs in scaffold are self-directed (four deoxynucleotides A, C, T, G are called DNA base sequences according to polydeoxyribonucleotides formed by connecting 3 ', 5' phosphodiester bonds, and are the expression forms of contigs and scaffold in assembly, the connection of deoxynucleotides has strict directionality, and 3 ', 5' phosphodiester bonds are formed between 3 '-OH of the previous deoxynucleotide and 5' -phosphate of the next deoxynucleotide to form a linear DNA macromolecule without branching, and the expression of DNA defines that 5 'end is in "+" direction towards 3' end and 3 'end is in "-" direction towards 5' end), so contigs with correct position and wrong direction can cause large assembly errors and are expressed as serious subsequence inversion errors.
Typically, a well appears among multiple adjacent contigs, and a set of adjacent contigs is obtained through the well. First, find the well with the contig as the starting point in the well to which the contig belongs, and define this contig as the starting point contig of this well. Then, other contigs containing the well are found out, the intersection of the well set contained in the well and the well set of the starting contig is calculated respectively, obviously, the contig which has larger intersection with the well set of the starting contig is closer to the starting contig, thereby the front-back position relation of a group of contigs can be determined, and then a scaffold is constructed.
For the directionality judgment of contigs in the scaffold, the well sets of the head end and the tail end of the contigs need to be considered separately, and the similarity of the well sets with the head end or the tail end of the adjacent contigs is calculated respectively. When contigs are short, transitional changes of Well are not obvious, and situations with unknown positions can occur, so after the preliminary screening of all contigs is completed, the results of combining the contigs are needed to be checked and corrected, the relations obtained by the two methods are combined, and finally the scaffold sequence is output.
While embodiments of the invention have been shown and described, it will be understood by those of ordinary skill in the art that: various changes, modifications, substitutions and alterations can be made to the embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.