CN108350495B

CN108350495B - Method and apparatus for assembling partitioned long fragment sequences

Info

Publication number: CN108350495B
Application number: CN201680063769.5A
Authority: CN
Inventors: 谢寅龙; 黄伟华; 李净净; 郭瑞东; 唐静波; 邓超
Original assignee: MGI Tech Co Ltd
Current assignee: Shenzhen Hua Made Dazhi Technology Co Ltd
Priority date: 2016-02-26
Filing date: 2016-02-26
Publication date: 2021-10-01
Anticipated expiration: 2036-02-26
Also published as: HK1254399A1; WO2017143585A1; CN108350495A

Abstract

The invention relates to a method, a device and a system for assembling partitioned long fragment sequences. A method of assembling a partitioned long fragment sequence, comprising: (a) obtaining a read set through sequencing, and recording sequencing holes corresponding to reads in the read set, wherein one sequencing hole comprises at least one long fragment sequence; (b) extending a plurality of seed sequences in parallel by using the reads and sequencing wells corresponding to the reads to obtain a plurality of sequence contigs, the plurality of seed sequences being determined by known sequences; (c) and constructing a framework sequence based on the reads, the sequence contigs and the sequencing holes corresponding to the reads contained in the sequence contigs so as to obtain an assembly result for separating the long fragment sequences.

Description

Method and apparatus for assembling partitioned long fragment sequences

Technical Field

The present invention relates to the field of biotechnology, and in particular, to methods and apparatus for assembling partitioned long fragment sequences.

Background

The complete construction of the human genome from the first time has been carried out, so far, the genome map is available, various bioinformatics analysis methods and software aiming at human genome re-sequencing emerge like spring bamboo shoots in the rain, and the method makes great contribution to the development of human diseases, medicine and health research.

Comparative Genomics (Comparative Genomics) is a discipline that compares known genes and genomic structures to understand gene function, expression mechanisms, and species evolution. The rationale is that the identity between two organisms is usually encoded by evolutionarily conserved DNA, and the related DNA fragments between the two will be the same or similar. All that is required for comparative genomics analysis is the presence of a genomic map (genomic reference sequence) and sequencing of the comparative subject. Mutation detection is an important matter in genomics, and International human Genome haplotype Project (The International HapMap Project) and Genome-wide association analysis (GWAS) are related researches based on single-nucleotide polymorphism (SNP) type mutation.

Assembly refers to the process of integrating shorter fragment sequences into longer sequences. Limited by sequencing technology, the genome cannot read the complete sequence content of a chromosome through sequencing from the beginning to the end, but the whole genome is often broken into base fragments with the length of tens to thousands, the content of the fragments is read through large-scale parallel sequencing, the fragmented fragments are analyzed and integrated by assembly, and finally, the fragmented fragments are restored into a relatively complete genome sequence. Identification of variation by assembly is a new application of assembly technology, and in fact assembly is primarily aimed at the construction of genomes. It is important that the genome be constructed and assembled in the absence of a genomic reference sequence, which is also known as "de novo assembly" (denovo assembly). The human genome reference sequence is about 3Gbp (3 x 10) in length⁹bp) which is the haploid number of base pairs, whereas humans are diploid organisms, which in practice should have a genome size of 2 x 3Gbp to 6 Gbp. The diploid genome of a human individual is derived from a haploid contributed by each of its two parents, and the differences between the two haploid sets allow the individual toHeterozygosity occurs at one or more loci on homologous chromosomes. Moreover, the human reference sequence is constructed from data from multiple individuals, which results in it being effectively a confounding chimeric haploid genome of various heterozygotes. As the genome research is advanced, the haploid human reference sequence can not meet the demand, the construction of haplotype (Haplotyping) is increasingly important, and the genome analysis based on haplotype information is also continuously appeared.

Haplotype information helps to deeply interpret the relationship between genotype and phenotype, and two individuals who are completely the same in a heterozygous set can also have different phenotypes and disease susceptibility due to the difference of haplotypes, which is of great importance for functional studies (such as specific expression of alleles) and disease studies (such as Mediterranean fever and breast cancer). The act of dividing the heterozygous information at multiple heterozygous sites to determine the haplotype is called Phasing (Phasing), which is an important operation for haplotype construction, and methods for haplotype construction are many and mainly classified into the following five categories:

1. method for applying population statistics to data of multiple unrelated individuals

2. Method for applying Mendelian inheritance to family data

3. Method for directly utilizing sequencing sequence information

4. Method by experimental operation

5. By physical separation

It is important to note that the core of the physical partitioning method is the partitioning of sequences that are fragmented into long pieces of DNA, either by means of fosmid plasmid acquisition or by direct physical partitioning using multiwell plates. After separation, further fragmentation and amplification operations (library construction) are performed, which are required for sequencing, and in order to distinguish the different separation units, different barcode sequences (barcodes) are attached to the sequences in these units, respectively. This approach divides the whole genome into many sub-parts by partitioning, and when the number of sub-parts separated is large, each sub-part contains only one haploid content from a small region on the genome. This allows the emergence of heterozygous regions at the whole genome level in homozygous form in these small regions, respectively, which is of paramount importance for the construction of haplotypes.

Each separation unit has its own unique barcode sequence, and after sequencing, its own reads (reads) belonging to each separation unit can be retrieved by identifying the barcode sequence. The separation unit of the Fosmid plasmid separation technology is called as a Fosmid plasmid pool (Fosmid pool), and each Fosmid plasmid pool comprises one or more long fragments with the length of about 37 Kbp; the perforated plate separation technique refers to its separation unit as well, each well contains a plurality of long segments, and the lengths of the long segments are different in different techniques. Regardless, the partitioning method pioneers new types of information, and the total set of reads is partitioned into multiple clusters by barcode, which is compared with the sequencing approach of Whole Genome Shotgun method (WGS), the reads are not randomly derived from any position on the Whole Genome any more, but the reads in each cluster set are from a common small region on the Genome, which is the region covered by the long fragment when partitioning. Although these small regions are derived from arbitrary positions on the whole genome, reads in each cluster are constrained and aggregated, and the information added by the aggregation becomes the key for building haplotypes.

Long Fragment Reading (LFR) technology greatly improves the complexity of library construction, so that the construction of haplotypes is reduced in both time and cost.

However, the assembly of current long fragment reading techniques remains to be improved.

Disclosure of Invention

The present invention aims to solve at least one of the above technical problems to at least some extent or to at least provide a useful commercial choice.

In a first aspect of the invention, the invention provides a method of assembling a sequence of partitioned long fragments, comprising:

(a) obtaining a read set through sequencing, and recording sequencing holes corresponding to reads in the read set, wherein one sequencing hole comprises at least one long fragment sequence;

(b) extending a plurality of seed sequences in parallel by using the reads and sequencing wells corresponding to the reads to obtain a plurality of sequence contigs, the plurality of seed sequences being determined by known sequences;

(c) and constructing a framework sequence based on the reads, the sequence contigs and the sequencing holes corresponding to the reads contained in the sequence contigs so as to obtain an assembly result for separating the long fragment sequences.

It will be understood by those skilled in the art that all or part of the steps of the method of one aspect of the present invention described above may be implemented by instructing the relevant hardware through a program, which may be stored in a computer-readable storage medium, and the storage medium may include: read-only memory, random access memory, magnetic or optical disk, and the like.

In a second aspect of the invention, the invention provides a system for assembling a sequence of partitioned long fragments, the system comprising: a data input unit for inputting data; a data output unit for outputting data; a storage unit for storing data including a computer executable program; and the processor is connected with the data input unit, the data output unit and the storage unit and is used for executing the computer executable program, and the execution of the program comprises the completion of the method.

In a third aspect of the invention, the invention provides an apparatus for assembling a sequence of partitioned long fragments, comprising:

the input module is used for obtaining a read set through sequencing and recording sequencing holes corresponding to reads in the read set, wherein one sequencing hole comprises at least one long fragment sequence;

an extension module that extends a plurality of seed sequences in parallel using the reads and the sequencing wells corresponding to the reads to obtain a plurality of sequence contigs, the plurality of seed sequences being determined by known sequences;

and the framework sequence construction module is used for constructing a framework sequence based on the reading segments, the sequence contig and the sequencing holes corresponding to the reading segments contained in the sequence contig so as to obtain an assembly result of separating the long fragment sequences.

The method and/or device of the invention described above have at least the following characteristics and advantages:

and (3) iterative extension: the extension of the seed sequence is an iterative process, and the part of the extension is used as an extension base of the next iteration, so that the seed sequence can be continuously extended.

Linear extension: unlike other graph-based assembly algorithms, the linear extension method makes the structure encountered during assembly simpler, the logical relationship is relatively clear, and the classification processing is easy, while the graph algorithm gathers the information of all the repeated areas together, the amount of the involved information is more, and the generated structure is more complex. That is, the solution of the graph is a one-time solution, as long as one repeat region is solved, all the genomic regions associated therewith (which all have the same or similar repeat sequences) are solved at the same time, while the linear method only solves the repeat of the current region at one time, does not solve the other regions at the same time, and is solved again when the repeat is extended to another region of the genome. The problem that the relevant information is recorded after one iteration is solved, so that the information encountered in the next iteration is simpler, and the extra calculation loss is reduced. In addition, the linear extension mode makes the extension of each seed sequence independent, and the mode is easier for calculation parallelization.

Multipath extension: since the objective of the algorithm is to construct polyploid genome, extension of seed sequence will use sequencing well information to perform multiplex joint extension of multiple haploids, whenever heterozygous area is encountered, phasing will be performed in real time in combination with the situation of previously assembled heterozygous area, i.e. phasing is also performed linearly and throughout the modules of the whole assembly apparatus. In order to save resource consumption, the extension of the homozygous region still adopts single-path extension, when the heterozygous region is encountered, the extension is divided into multiple paths to carry out extension, and after the heterozygous region is extended, the extension is combined again to form a single path to continue extension.

And (3) global information judgment: if an algorithm only uses the related information under the current judgment, and does not consider the selection branch in the subsequent global scope, an error is easy to occur or the global optimal solution cannot be reached (such as Dijkstra greedy algorithm in the single-source shortest path calculation). For genome assembly, an assembly error is often caused by wrong selection by a greedy algorithm, so the method does not adopt the greedy algorithm to solve the conflict during extension, and uses global information to perform correlation judgment. Because the length (100 Kbp) of the LFR is longer, all information (Kmer, reads, mate-pair and well information) provided by the position is considered during path selection, wherein the range involved in judgment of the well information can be about 100Kbp, if large-scale repetition does not exist in a large quantity, the algorithm is basically consistent with the global property, and when the extension meets a complex condition, the algorithm strictly selects to terminate the extension rather than take a higher value for extension, so that a greedy judgment mode is avoided. It is particularly noted that, since all well information at the position is considered in the extension, the algorithm has stronger global property compared with the conventional hierarchical assembly algorithm, and the requirement of the sequencing depth can be greatly reduced, namely, each well does not require a particularly high sequencing depth, so that a great deal of resources are saved, namely, the cost and the time are saved.

Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.

Drawings

The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:

FIG. 1 is a flow diagram of a method of assembling partitioned long fragment sequences in an embodiment of the invention.

Fig. 2 is a flow diagram of a method of assembling partitioned long fragment sequences in an embodiment of the invention.

FIG. 3 is a schematic diagram of an apparatus for assembling separate long fragment sequences in an embodiment of the present invention.

FIG. 4 is a schematic diagram of an apparatus for assembling separate long fragment sequences in an embodiment of the present invention.

Fig. 5 is a schematic diagram of a seed sequence iterative extension process in an embodiment of the invention.

Fig. 6 is a schematic diagram of identifying a repetition region in an embodiment of the present invention.

Fig. 7 is a schematic diagram of handling collisions of large double complex sequences with two repeat regions that are widely separated in an embodiment of the invention.

FIG. 8 is a diagram of handling collisions of large double complex sequences with two repeat regions that are separated by a small distance in an embodiment of the invention.

Fig. 9 is a schematic diagram of handling collisions of a tandem repeat sequence in an embodiment of the invention.

Detailed Description

Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are illustrative and intended to be illustrative of the invention and are not to be construed as limiting the invention.

In a first aspect of the invention, the invention proposes a method of assembling a sequence of separate long fragments, see fig. 1, the method comprising:

According to an embodiment of the invention, the seed sequence is obtained based on a genomic reference sequence by the following steps:

in the genomic reference sequence, interrupting by N; and

the broken reference sequence is truncated by a predetermined length to obtain the seed sequence.

The size of the predetermined length is not particularly limited, and in general, the predetermined length is not smaller than the size of the insert so that reads from the same insert can be located on the same seed sequence. According to an embodiment of the invention, the sequencing is double-ended sequencing, and the predetermined length is at least twice the length of the sequencing library insert in the sequencing, facilitating the positioning of paired reads onto the same seed sequence. According to an embodiment of the invention, the set of reads comprises paired reads, the seed sequence being obtained based on the reads in the following steps: (1) the read is cut into a plurality of Kmers in a sliding mode, an index RKI of the Kmers to the read is constructed, and the index RKI is used for accessing the corresponding read through the Kmers; (2) extracting a pair of paired reads without a high frequency Kmer from the set of reads; (3) respectively determining all the reads corresponding to the Kmers of two reads in the pair of reads in the step (2) by using the index RKI to obtain a first read group and a second read group; (4) determining sequencing wells corresponding to the first read group and the second read group in the step (3) respectively to obtain a first sequencing well set and a second sequencing well set; (5) determining the intersection of the first sequencing well set and the second sequencing well set in (4), and if the size of the intersection is not significantly different from the expected number of valid sequencing wells of the base, determining the paired reads in (2) as the seed sequence, wherein the expected number of valid sequencing wells of the base is determined by the nucleic acid amount of the sample. The high frequency is referred to as a high frequency Kmer, and the number of occurrences of a certain Kmer is higher than the average number of occurrences of the Kmer, i.e., the high frequency Kmer is considered. In more cases, the inventors define as high frequency Kmer, Kmer as desired, occurring a number of times that is much higher than the average number of times that Kmer occurs, said "much higher" being at least 10 times, 20 times, 30 times, 40 times, 50 times, or 60 times the average number of times that Kmer occurs. The Kmer frequency expectation is related to the sequencing depth, the read length and the Kmer size, according to one embodiment of the present invention, for K19, i.e. 19mer, 100bp read length, 100X sequencing depth, the Kmer expectation frequency is 100 (sequencing depth) { [100 (read length) -19(Kmer size) +1]/100 (read length) } ═ 82, and at an average frequency of 82, kmers with frequencies ranging from 3000 to 5000 are set as high frequency kmers.

The term "no significant difference" may mean no difference in statistical sense or in general. According to an embodiment of the present invention, (5): determining the paired reads in (2) as the seed sequence if the size of the intersection is between half and twice the expected number of valid sequencing wells for the base.

The seed sequence is generated by constructing it from only known sequences, such as reference sequences, reads obtained by sequencing, assembly fragments, etc., and by combining the information from known sequences, RKI and sequencing wells, and by de novo assembly of species that are completely free of reference sequences.

According to an embodiment of the present invention, (b) includes: (i) the read is cut into a plurality of Kmers in a sliding mode, an index RKI of the Kmers to the read is constructed, and the index RKI is used for accessing the corresponding read through the Kmers; (ii) extending the plurality of seed sequences in parallel based on the reads and their corresponding indices RKI to obtain the plurality of sequence contigs.

According to an embodiment of the invention, the RKI is obtained by: performing sliding cutting on the read reads into a plurality of Kmers; and constructing a hash taking the Kmer as a key value, wherein the hash forms the RKI, and the hash records the frequency of the Kmer, the read segment to which the Kmer belongs, and the position and the direction of the Kmer on the read segment.

According to an embodiment of the invention, the seed sequence is extended by repeating the following steps: selecting a seed sequence suitable for extension; mapping the reads to the seed sequence to obtain an extended sequence; performing base-by-base alignment treatment on reads positioned at the end of the extended sequence; and if the consensus process fails, performing a heterozygosity identification, a phasing process, and/or resolving the repeat sequence.

According to an embodiment of the invention, the seed sequence suitable for extension is selected by: sliding cutting the seed sequence into Kmer; obtaining a read corresponding to the Kmer through the RKI; aligning the corresponding reads to the seed sequence; determining the coverage condition of the sequencing holes on the seed sequence based on the sequencing holes corresponding to the corresponding reads; and determining a seed sequence suitable for extension based on the coverage condition.

According to an embodiment of the invention, the reads are localized to the seed sequence by: sliding cutting the seed sequence into Kmer; obtaining a read corresponding to the Kmer through the RKI; and positioning the reads corresponding to the Kmer to the seed sequence, and comparing the reads one by one.

According to embodiments of the present invention, during the consensus process, if the set of valid sequencing wells of an extended site is evenly assigned by different base types at that site, it is judged that heterozygosity exists.

According to an embodiment of the present invention, after determining the presence of heterozygosity, the extension sequence is divided into a plurality of strands to be respectively extended.

According to an embodiment of the present invention, the read set includes a plurality of pairs of paired reads, a distance between two reads of the pair of paired reads is L, and during the unification process, if a distance between a read of the pair of paired reads positioned downstream in the extension direction and the corresponding read is not L, it is determined that a position at which the read positioned downstream in the extension direction is positioned is a start point of the repeat sequence.

It should be noted that the numerical values referred to in the present invention are mostly of statistical significance, and thus, unless otherwise specified, any numerical value expressed in a precise manner may represent a range, for example, an interval of plus or minus 10% of the numerical value; or the values are generally normally distributed, and values expressed in a precise manner include intervals of positive and negative standard deviations of the values. The distance L between two reads in a pair of reads is said to correspond to the length of the insert. Generally, in the experimental stage, the size of the constructed sequencing library is constant, i.e. the size of the inserted fragment is a fixed value, theoretically, the outer end distance between the paired reads obtained after the double-end sequencing is the fixed value, and in the data obtained after the actual sequencing, the distance between the paired reads is in normal distribution. In the process of determining and analyzing the repeated sequence in this example, the inventor sets the L as the size of the insert in the test stage, and it can be understood by those skilled in the art that the L is the value between the positive and negative standard deviations of the sizes of the inserts in any test stage or the interval between the positive and negative standard deviations of the sizes of the inserts in the test stage.

According to an embodiment of the present invention, the end point of the repeating sequence is a position where a read positioned downstream in the extending direction of the pair of reads having a distance L from the corresponding read is positioned, or a collision site when extending.

According to an embodiment of the present invention, whether the repeated sequence is a tandem repeated sequence is determined by: performing sliding cuts on reads mapped to the repeated sequence to obtain Kmers; judging whether the Kmer obtained by each read through sliding cutting has a repeated Kmer, if not, judging that no tandem repeat sequence exists, and if so, judging that the tandem repeat sequence exists.

According to an embodiment of the present invention, if the repeated sequence is a tandem repeated sequence, the tandem repeated sequence is resolved by: determining the length of the tandem repeat sequence; the reads containing the tandem repeat end point are positionally adjusted in an end point alignment to identify the base at the position of the collision.

According to an embodiment of the present invention, if the repeated sequence is not a tandem repeated sequence, and there are the following cases, the repeated sequence is judged to be a large double-repeated sequence by: the repeat sequence is greater than L in length, or reads of the pair of reads that correspond to reads positioned downstream of the repeat sequence are also positioned on the repeat sequence.

According to an embodiment of the present invention, if the repeated sequence is a large double-repeated sequence, the large double-repeated sequence is resolved by: and comparing the difference between the number of the effective sequencing holes corresponding to the upstream repetitive sequence and the number of the effective sequencing holes corresponding to the downstream repetitive sequence in the large double-repetitive sequence with the expected value of the number of the effective sequencing holes of the base, and solving the conflict on the large double-repetitive sequence according to the difference. For example, if two repeats in a large double-repeat sequence are relatively far apart, the conflicting bases at the upstream will contain significantly more effective sequencing wells (EW) than at the downstream, and it will be apparent when the defined number difference is greater than half the expected EW number, the bases can be identified by such a comparison to resolve the conflict; if the difference between the upstream and downstream EW numbers is less than or equal to half the desired EW number, a Helper Contig (HC) can be constructed using the upstream arm corresponding to the wrong downstream arm within one inset length from the start of the repeat region, comparing the EW set on the HC to the EW set of upstream colliding bases versus the difference between it and the downstream colliding bases.

According to an embodiment of the present invention, if the repeated sequence is not a large double complex sequence, and there are the following cases, the repeated sequence is determined to be a small double complex sequence: the length of the repeated sequence is less than L.

According to an embodiment of the invention, the small double repeat sequence is resolved by at least one of: (p) determining the base at the collision site by comparing the proximity of the distance between pairs of reads supporting two colliding bases to the mean, using the mean of the distances between pairs of reads supporting each base as the expected position of the collision site; (k) constructing an auxiliary contig using reads corresponding to reads of the paired reads that are positioned to the most upstream of the non-repetitive sequence in the extended sequence standard deviation range, and determining bases at the collision sites using reads of the paired reads that are corresponding to reads positioned to the downstream of the auxiliary contig.

According to an embodiment of the present invention, if the small double-repeated sequence cannot be resolved, i.e. the conflict cannot be resolved, the extension of the seed sequence is terminated.

Referring to fig. 2, according to an embodiment of the present invention, (c) includes: (iii) establishing a merged connection relationship between the sequence contigs based on the reads; (iv) constructing a primary backbone sequence based on the sequence contigs and sequencing wells corresponding to the reads comprised by the sequence contigs; (v) obtaining an assembly result separating long fragment sequences by merging the sequence contigs to construct the backbone sequence after mutually examining the merged linkage relationship between the plurality of sequence contigs and the primary backbone sequence.

The method of any of the above embodiments of the invention has at least one of the following four characteristics and advantages:

Multipath extension: since the objective of the algorithm is to construct polyploid genome, extension of seed sequence will use well information to perform multi-path joint extension of multiple haploids, whenever heterozygous region is encountered, phasing operation will be performed in real time in combination with the situation of previously assembled heterozygous region, i.e. phasing is also linear, and throughout each module of the whole assembly system (e.g. contig combination, skeleton sequence construction). In order to save resource consumption, the extension of the homozygous region still adopts single-path extension, when the heterozygous region is encountered, the extension is divided into multiple paths to carry out extension, and after the heterozygous region is extended, the extension is combined again to form a single path to continue extension.

And (3) global information judgment: if an algorithm only uses the related information under the current judgment, and does not consider the selection branch in the subsequent global scope, an error is easy to occur or the global optimal solution cannot be reached (such as Dijkstra greedy algorithm in the single-source shortest path calculation). For genome assembly, an assembly error is often caused by wrong selection by a greedy algorithm, so that the algorithm does not adopt the greedy algorithm to solve the conflict during extension and uses global information to perform correlation judgment. Because the length (100 Kbp) of the LFR is long, all information (Kmer, read, mate-pair and well information) provided by the position is considered during path selection, wherein the range involved in judgment of the well information can be about 100Kbp, if large-scale repetition does not exist in large quantity, the algorithm is basically consistent with the global property, and when the extension meets the complex condition, the algorithm strictly selects to terminate the extension rather than take a higher value for extension, so that a greedy judgment mode is avoided. It is particularly noted that, since all well information at the position is considered in the extension, the algorithm has stronger global property compared with the conventional hierarchical assembly algorithm, and the requirement of the sequencing depth can be greatly reduced, namely, each well does not require a particularly high sequencing depth, so that a great deal of resources are saved, namely, the cost and the time are saved.

The process is written into an executable program, and the executable program comprises the following steps: first, seeds and reads are Read, and the reads are constructed as the Index (Read Kmer Index, RKI) of the Kmer to the reads, so that the target reads can be quickly accessed through the Kmer during comparison. Then seeds are lengthened as much as possible, and since the extension between seed and seed is independent, this step can be accelerated by parallel operation. The extended seed is a sequence contig (contig), and at this time, a framework sequence (scaffold) is pre-constructed by beginning through well information, and only the context between contigs is established, and the scaffold is not immediately constructed. Also, by having used the reads and pairs information to establish a merged connection relationship between contigs, these contigs are not immediately merged. Then, the relation of combining between contigs and the relation in the pre-constructed scaffold are mutually checked, the relations are simplified and conflicts are solved, then the combining of contigs and the construction of scaffold are carried out, and finally, the assembly result is output.

In yet another aspect, the present invention provides an apparatus for assembling a sequence of separate long fragments for performing the method of any of the embodiments of the present invention described above. Referring to fig. 3, the apparatus includes: the input module is used for obtaining a read set through sequencing and recording sequencing holes corresponding to reads in the read set, wherein one sequencing hole comprises at least one long fragment sequence; an extension module that extends a plurality of seed sequences in parallel using the reads and the sequencing wells corresponding to the reads to obtain a plurality of sequence contigs, the plurality of seed sequences being determined by known sequences; and the framework sequence construction module is used for constructing a framework sequence based on the reading segments, the sequence contig and the sequencing holes corresponding to the reading segments contained in the sequence contig so as to obtain an assembly result of separating the long fragment sequences.

Those skilled in the art will understand that the processing steps or means employed in any of the above embodiments of the present invention can be implemented by corresponding functional modules or included sub-modules of the apparatus. The above description of the technical features and advantages of the method of the invention applies equally to this device of the invention.

According to an embodiment of the invention, the seed sequence is obtained based on a genomic reference sequence by the following steps: in the genomic reference sequence, interrupting by N; and truncating the broken reference sequence according to a preset length so as to obtain the seed sequence.

According to an embodiment of the invention, the predetermined length is not less than the length of the sequencing library insert in the sequencing.

According to an embodiment of the invention, the set of reads comprises paired reads, the seed sequence being obtained based on the reads in the following steps: (1) the read is cut into a plurality of Kmers in a sliding mode, an index RKI of the Kmers to the read is constructed, and the index RKI is used for accessing the corresponding read through the Kmers; (2) extracting a pair of paired reads without a high frequency Kmer from the set of reads; (3) respectively determining all the reads corresponding to the Kmers of two reads in the pair of reads in the step (2) by using the index RKI to obtain a first read group and a second read group; (4) determining sequencing wells corresponding to the first read group and the second read group in the step (3) respectively to obtain a first sequencing well set and a second sequencing well set; (5) determining the intersection of the first sequencing well set and the second sequencing well set in (4), and if the size of the intersection is not significantly different from the expected number of valid sequencing wells of the base, determining the paired reads in (2) as the seed sequence. The expected number of valid sequencing wells for the base is determined by the amount of nucleic acid in the sample.

According to an embodiment of the present invention, in (5) of the above embodiment: determining the paired reads in (2) as the seed sequence if the size of the intersection is between half and twice the expected number of valid sequencing wells for the base.

According to the embodiment of the present invention, the apparatus further includes an index constructing module, connected to the extending module, configured to slidably cut the read into a plurality of kmers, construct an index RKI of the Kmer to the read, and configured to access the corresponding read through the Kmer; the following are then implemented with the extension module: extending the plurality of seed sequences in parallel based on the reads and their corresponding indices RKI to obtain the plurality of sequence contigs.

According to an embodiment of the invention, the RKI is obtained by: performing sliding cutting on the read to obtain a plurality of Kmers; and constructing a hash taking the Kmer as a key value, wherein the hash forms the RKI, and the hash records the frequency of the Kmer, the read segment to which the Kmer belongs, and the position and the direction of the Kmer on the read segment.

According to an embodiment of the present invention, if the repeated sequence is a tandem repeated sequence, the tandem repeated sequence is resolved by: determining the length of the tandem repeat sequence; and adjusting the position of the read segment containing the end point of the tandem repeat sequence in an end point alignment mode.

According to an embodiment of the present invention, if the repeated sequence is not a tandem repeated sequence, and there are the following cases, the repeated sequence is determined to be a large double-repeated sequence: the repeat sequence is greater than L in length, or reads of the pair of reads that correspond to reads positioned downstream of the repeat sequence are also positioned on the repeat sequence.

According to an embodiment of the invention, the large double-repeat sequence is resolved by: and comparing the difference between the number of the effective sequencing holes corresponding to the upstream repetitive sequence and the number of the effective sequencing holes corresponding to the downstream repetitive sequence in the large double-repetitive sequence with the expected value of the number of the effective sequencing holes of the base, and solving the conflict on the large double-repetitive sequence according to the difference.

According to an embodiment of the present invention, if the repeated sequence is a small double repeated sequence, the small double repeated sequence is resolved by at least one of: (p) determining the base at the collision site by comparing the proximity of the distance between pairs of reads supporting two colliding bases to the mean, using the mean of the distances between pairs of reads supporting each base as the expected position of the collision site; (k) constructing a helper contig using reads corresponding to reads in pairs of reads that are located to the most upstream of the non-repeated sequence in the extended sequence standard deviation range, updating the distance mean by adding these reads data using reads corresponding to reads in pairs of reads that are located to the downstream of the helper contig, and determining the base at the collision site also by comparing the proximity of the mean to the distance between pairs of reads that support two colliding bases.

According to an embodiment of the present invention, if the small double complex sequence cannot be resolved, the extension of the seed sequence is terminated.

Referring to fig. 4, according to an embodiment of the present invention, the skeleton sequence building module includes a primary skeleton sequence building module, configured to establish a merged connection relationship between the sequence contigs based on the reads; the device also comprises a merging connection relation establishing module which is used for establishing a primary framework sequence based on the sequence contig and the sequencing holes corresponding to the reads contained in the sequence contig; the method also comprises an assembling module, which is used for constructing the framework sequence by combining the sequence contigs after mutually checking the combined connection relation among the sequence contigs and the primary framework sequence, and obtaining an assembling result of separating long fragment sequences.

In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

For ease of understanding, the assembly method according to the preferred embodiment of the present invention is analyzed in detail step by step.

The algorithm completely stores the reads data which continuously need to participate in comparison in the extension process into the memory in a binary mode until the whole program is terminated in order to achieve the purpose of efficient calculation. The information on the quality values provided in sequencing (e.g., the quality values in the Fastq format) that are not recorded and used during assembly are designed to be removed or corrected for quality value outlier bases and reads in a pre-assembly data preprocessing step, i.e., the read sequences are considered normal in the sequencing quality values and assembled regardless of differences between the different quality value bases.

After reading the Reads and recording the information of the wells to which the Reads belong, sliding and cutting the Reads into Kmer to construct a Reads Kmer Index (RKI), which is hash (hash) with the Kmer as a key and records the frequency of the Kmer, the set of the Reads to which the Reads belong, the position and the direction of the Kmer on the Reads. The hash is characterized by extremely fast search speed and only O (1) time complexity, so the core function of the index is to quickly access its corresponding sets of reads through the Kmer to determine which reads are to be located on the seed for detailed alignment.

The RKI structure needs to record information of a large number of reads, and further, since one read comprises a plurality of Kmers, the read will appear in entries of the plurality of Kmers, which causes the memory consumption to be larger, for a human genome, the memory consumption under 100X sequencing quantity is about 3TB, and compared with the memory consumption under 1TB of the traditional assembly technology, the memory required by the algorithm is larger. The data structure can be further optimized, which can reduce the cost and resource bottleneck of assembly. The optimization can be performed by omitting or limiting the information of the ultra-high frequency kmers in the related reads to reduce the memory overhead, because the effect of the kmers on assembly is small, the time is wasted when the kmers are used for comparing and positioning the reads, and the efficiency is extremely low.

The data structure design related to RKI is obviously different from the design in some DBG algorithms, the graphs of relationships among Kmers are constructed by reading reads from files, the reads do not need to be stored in a memory, and the reads relationships corresponding to the Kmers do not need to be recorded in detail, because the representation of a genome is converted from a pile of disordered reads into a group of logically-associated Kmers, a large amount of arrangement and compression are carried out on the data, and then the construction of the genome only needs to operate on the Kmer graphs, and does not need to directly operate the reads. The algorithm does not construct an overall graph, and reads need to be randomly accessed by the Kmer continuously in the seed extension (genome construction) process to obtain extension (assembly) materials.

Finally, relevant parameters for experimental purposes also need to be read in, such as the size of the inserts of pairs of reads in the library construction, the number of wells to isolate the LFR, and the number of cells to be put in, and the number of ploidies (obtained by optical means or informatics analysis) of the target genome. But the size of the insert and the number of the input cells can be obtained by training statistics in the initialization of the seed sequence, and the ploidy number of the target genome can also be obtained by hybrid recognition calculation. The different methods for obtaining information have certain errors, and the actual assembly needs to be combined, analyzed and applied.

After seed is selected, initialization and validity determination are performed before extension is started. And initializing to obtain relevant information on the seed, wherein the information is used for later legality judgment and seed extension materials. And the validity determination is used to discard seed that is on the extended region and that is not suitable for extension.

Firstly, cutting a seed into a Kmer by sliding, obtaining reads corresponding to the Kmer by RKI, then finely comparing the reads with the seed, and determining the coverage condition of the seed by the well by using the well information corresponding to the reads.

A Well that covers substantially the entire seed is defined as a "valid Well (EW)" and other wells that do not yet cover enough will be recorded as candidates and updated in later extensions. In the assembly of long fragment separating sequences, each site on the genome is covered by multiple LFRs, the number of the covered sites is related to the number of cells and the number of ploidy, as the extension progresses, the LFR assembly covering the extension region changes transiently, and the old LFR is replaced by the new LFR, just like a relay race. For example, if 10 diploid cells are used to construct the LFR, the number of wells per site is expected to be 10 cells, 40 diploid DNA duplexes, if the LFR length is 100 kbps, the average spacing between adjacent LFRs is 100 kbps/40 2.5 kbps, i.e., the expected number of wells in the 2.5 kbps range is 40, and if the calculated range is expanded, the number of wells will increase by one for each 2.5 kbps increase.

In seed extension, the set of EWs is identified by their well coverage of the extended (or initialized) tile. When an LFR just starts to cover the extended seed, the coverage degree of the LFR on the extended seed continuously rises along with the extension of the seed until the LFR is completely covered, and the well corresponding to the LFR which basically completely covers the seed can be used as the EW for auxiliary extension; as the seed continues to extend forward, the LFR will gradually exit the region where the seed extends, and its coverage will continuously decrease until it is 0, and when the LFR covers the seed obviously insufficiently, the corresponding well will no longer be used as EW for assisting the extension. This produces a transitional change in the EW constellation as it extends.

The EW set updated continuously with the extension guides the filtering of reads during assembly, and reads only belonging to the region near the current seed can be separated from numerous reads in the whole genome, so that some complex regions in the whole genome range are degenerated into simple regions in the LFR length range, and the simple regions are easy to assemble, namely the core of the assembly mode of separating long fragment sequences.

The obvious difference between this assembly method that considers a collection of multiple wells simultaneously in extension and the traditional two-stage assembly method that considers a single well separately (i.e., each well is assembled separately and then the assembly results are combined for secondary assembly) is the use of EW. The method can expand the depth information required for assembly from the accumulated depth of a single well to the integrated depth of the joint whole EW set, reduces the sequencing depth required by the single well and only needs low coverage. Taking 10 diploid cells (40 copies) as an example, previous secondary assembly methods required a sequencing depth of 40 × 50X-2,000X, where 50X is the sequencing depth required for routine genome assembly; while the present algorithm only requires a sequencing depth of 40 × 2.5 ═ 100X, where 2.5X is the sequencing depth required for an EW to be identified without the tolerant partial region being sequenced, the cumulative depth of the entire set of EWs when extended is 100X, i.e., 2.5X is used to identify the EW and 100X is cumulative for extension. This approach greatly reduces sequencing costs and saves time.

In addition, transient changes in EW can also be used to determine the positional relationship between contigs, creating extremely long framework sequences.

According to the embodiment of the invention, when a seed is initialized and extended, reads need to be aligned to the seed to provide materials for obtaining information (such as paired information of well and reads) and extending sequences, and it is impractical to try to align all reads, which consumes a lot of time, so that the reads can be finely aligned only when the reads hit on the seed are initially screened by using RKI, and the method is as follows:

1. sliding cutting the seed sequence to obtain Kmers, and obtaining reads corresponding to the Kmers through RKI. The whole seed sequence is cleaved upon seed initiation, while only the sequence at the contig end is cleaved during extension. For efficiency reasons, very high frequency kmers (one Kmer for very many reads) are not used to obtain reads. Specifically, for reads acquisition during extension, if the ends are all found to be very high frequency Kmers, only the endmost one Kmer is taken for reads localization.

2. The reads are positioned on the seed through the position records of the Kmer on the seed and the reads, and are compared base by base, so that certain tolerance is provided for the substitution type error in sequencing, but the error of the indel type cannot be tolerated without a gap module.

The purpose of reads filtering is to remove reads that are incorrectly located due to duplication. According to the algorithm, firstly, reads which are wrongly positioned due to the repetition of the whole genome level are filtered through EW information, and then, right arm reads which are wrongly positioned due to the simple repetition inside the LFR are filtered through mate-pair information (the left arm reads are not filtered). It should be noted that the left arm reads in the duplicate region will not participate in the validity determination of its corresponding right arm reads, since it may also locate errors itself, and the complex type of duplicates will be handled by the duplicate resolution module.

4. In order to obtain an extended sequence, it is necessary to make the reads located at the ends of contigs uniform base by base. Similar to reads comparison, only substitution type sequencing errors are tolerated in the consistency process, and indel type tolerance is not available. Because the mate-pair filtering of the reads can only act on the reads of the right arm, the reads of the right arm cannot generate positioning errors, and the method has obvious advantages with the reads of the left arm. Based on the characteristic, when the filtered right arm reads are sufficient in number, the module only uses the reads for consistency, if the filtered right arm reads can be combined into a sequence, the extension is completed, then the module enters a related information updating module, and if the filtered right arm reads cannot be combined into a sequence, namely, conflict occurs during consistency, the left arm reads in the region need to be combined for heterozygosis or repeated identification and solution. And when the number of the right arm reads is insufficient, the left arm reads is combined to carry out consistency.

5. During the merging process of the consistent base-by-base sites, if more than one base is found to appear in the same position in a non-low frequency (the low frequency is mainly caused by sequencing errors), collision can be caused, and the consistency of the sites is judged to be failed, which is mainly caused by hybridization and repetition. Compared with the repeated method, the heterozygous identification of the diploid genome is simpler, the characteristics are that the conflicting bases are only two, the halved EW sets are combined, and the EW sets supported by the diploid bases are basically not intersected, so the heterozygous identification is carried out after the consistency is failed.

Sequencing errors do not fit this situation compared to shuffling, and the main features of sequencing errors are low frequency and random. Because most sequencing technologies have few sequencing errors in the total sequencing quantity, only few differences can be found during consistency, even under the condition of large fluctuation of the sequencing depth, the differences caused by the sequencing errors still only occupy few components, the absolute number of the sequencing errors can change along with the fluctuation of the sequencing depth, but the proportion is not changed, namely the sequencing error rate is not changed. On the other hand, the randomness is expressed in that the sequencing errors do not deviate in different wells, and the sequencing errors occur in all the wells with the same probability, and even though the sequencing errors have certain bias on the error detection mode, the conditions are the same in all the wells. This means that the bases with sequencing errors are not concentrated on the well, and different bases support intersection sets among EW sets in the case of consistency conflict, and cannot be distinguished. It should be noted that the sequencing depth cannot be too low, since the characteristics of the sequencing error become less obvious when the overall sequencing depth is low, and misjudgment is very likely to occur.

When heterozygosity is identified, contigs will be split into two for two-way extension, and phasing will be performed in conjunction with EW conditions of the previous heterozygosity zone when splitting. In principle two heterozygous regions from the same haploid should have similar EW sets, in contrast to two heterozygous regions from different haplotypes should have no intersection between the EW sets. In this way, phasing operations of adjacent heterozygous sites can be performed, in particular, if the previous heterozygous region is too far apart (beyond the LFR length), where there is no intersection between the EW's of even two heterozygous regions on the same haploid, phasing will be judged to be a failure and a new phased block is established, while contigs themselves are still continuous and not interrupted at the same time.

The iterative seed sequence extension process is shown in fig. 5. The method is different from other phasing algorithms, only the EW relation of two adjacent sites is considered, the EW relation is not considered together by combining a plurality of sites, the phasing mode is linear, the logic is simple, and the phasing algorithm still has an improved space. Of course, the phasing method of the algorithm has certain advantages, and unlike some methods which only consider single-site hybridization, the algorithm performs phasing on all hybridization types, so that the phasing length is improved.

Like other approaches to identifying structural variations based on assembly, the present algorithm can also identify large structural variant heterozygotes, which is an advantage of the present algorithm over other approaches that deal with separating long fragment sequences to isolate haplotypes. Furthermore, since insertion-deletion heterozygous and large structural variant heterozygous do not differ in the phasing patterns in the algorithm, the algorithm can construct more complete and longer haplotypes than other phasing methods that only consider single alternative heterozygous sites, and can also obtain functions that cannot be obtained by re-sequencing methods, especially phasing large structural variants, which is of great significance for downstream analysis.

Repetitive sequences (reproducible Sequence) are identical or similar sequences occurring at different positions in the genome. It occurs abundantly in the genome, as the total length of the various types of repetitive sequences of the human genome account for almost half the genome size. Repeated sequences are always important problems influencing the assembly quality, and whether the repetition can be correctly solved is also the problem that various assembly algorithms are most concerned about and continuously try new strategies to break through. Duplicate resolution is naturally also a key module in the present algorithm, mainly dealing with complex duplicates within LFR, mainly including adjacent small-sized duplicates, large-sized duplicates, and tandem duplicates. These duplicate solutions are described below.

Different repeat types are solved by different modules, the identification of the repeat types needs the assistance of related information, the identification of a repeat area is one of important information for identifying the repeat types, the main purpose is to identify left arm reads which cannot be used for filtering, namely, mate-pair reads are all in a repeat area, because one of the characteristics of the repeat area is that the reads can be wrongly positioned at the position, and if the wrongly positioned reads are used for extension, wrong extension is often caused.

As shown in fig. 6, the repetitive region identification mainly includes identification of a start point and an end point. When the left arm corresponding to the right arm reads in the extension region is found not to be near the position one insert length upstream during extension, the start of the repeat region can be determined.

The end points of the repeated area are mainly divided into two types, and for simple repetition (the problem can be solved only by filtering wrongly positioned reads through mate-pair relation), the end point of the repeated area is the position of the lost left arm when the reads of the right arm do not appear any more; for complex repeats, the overshoot point during extension is the endpoint.

In addition, the difference regions appearing in the repeated segments will be considered as non-repeated regions, i.e. similar repeated segments will be strictly divided into a plurality of sub-repeated segments to be processed. In this case, the property identification of the region after the difference point is different, and the left arm corresponding to the right arm reads of the start point of the repeat region is in the previous repeat region but not in the non-repeat region, and a situation occurs in which the positioning of the paired reads is incorrect. At this time, the left arms reads near the length position of the insert before the extension point are checked, if the corresponding right arms participate in the extension, the currently extended region is the repeat region, and if a part of the regions is found not to participate in the extension, the regions already extend into the non-repeat region.

Because the mate-pair filtering of reads can only act on the right arm reads, the positions of the right arm reads and the left arm reads are different when the extension is performed, however, if a region exists, the left arm reads can be checked through the right arm reads, and the left arm reads can play a role in solving the duplication. This is just the concept of Helper Contig (HC).

HC is a contig that solves the problem of complex repetitions, is used only as an aid, and does not appear as a formal contig in the assembly results. The essence is to further utilize the mate-pair information of reads, and expect that HC will cross the current repeat region, appearing on the downstream non-repeat region, and use this non-repeat region to help resolve the repeat. If HC fails to cross the repeat region, it generally fails to work. The current application objects of HC are mainly divided into the following two categories:

(1) HC case 1-adjacent small double repeats:

because the length of the insertion fragment of mate-pair reads has a certain Standard Deviation (SD), if two adjacent small repeats slightly longer than the length of the reads-Kmer +1 exist, the SD deviation causes the confusion of mate-pair information during extension, the conflict occurs during consistency, and two conflicting bases are supported by EW and mate-pair reads.

For such a problem, the algorithm first uses the average value calculated by mate-pair information supporting each base as the expected position, and if the distance between the two positions is too close and the actual situation is fuzzy, the algorithm needs to construct HC assistance for recognition.

The right arm corresponding to the left arm reads at the leftmost end within the range of contig end SD length, which is not in the repeated area, is first used as the start of HC construction, and this reads is extended in a manner similar to seed. When extended to a sufficient length, the position of the colliding bases can be calculated separately by locating the left arm corresponding to the right arm reads on this HC. In essence, using HC in this case only improves the reliability of the calculated distance, and if SD is large, an error may still occur.

(2) HC case 2-large double repeat:

when the length of the repeat region is greater than the length of the insert, the left arm reads will not be available for filtering of reads, and in essence, the left arm will not be available for filtering the right arm within the repeat region as long as it is within the repeat region, whether the region between the two arms is repetitive or non-repetitive.

Large-scale repetition is long, and well information is easy to be different, so that the large-scale repetition can be solved by utilizing the well information according to different conditions, and partial conditions need to be solved by combining HC, and the large-scale repetition is mainly divided into two types:

the two repeats are far apart, in which case, although both colliding bases have EW support, bases upstream will contain significantly more EW than downstream. Conflicts can now be resolved by this comparison. In contrast, for other wells that have not been identified as EWs, it may be found that support more downstream than upstream, which may also be used to assist in conflict resolution, as shown in FIG. 7. If the above information remains ambiguous, the HC can be constructed using the left arm corresponding to the wrong right arm within one insert length from the start of the repeat region, and it is expected that the EW set on the HC will differ less from the EW set of upstream colliding bases than it does from downstream colliding bases.

The close distance between the two repeated sequences, in which case the well information will not change significantly and cannot be used for conflict resolution, but there may be some simple cases where the HC can resolve the conflict, see fig. 8:

a) when the distance between the collision site and the end point of the repeat region is less than the length of the insert, it can be found that the right arm corresponding to the left arm reads supporting the upstream base can be positioned on the HC (the construction of HC is the same as the previous large repeat, and is also constructed by using the left arm corresponding to the wrong right arm within one insert length range from the start point of the repeat region), as shown in FIG. 8 a;

b) when the distance between the collision site and the start of the downstream repeat is less than the insert length, it was found that the right arm corresponding to the left arm reads supporting the upstream base could be found within the repeat region, as shown in FIG. 8 b.

A Repeat consisting of multiple copies of a Repeat Unit joined in series is defined as a Tandem Repeat (TR), and the shortest Repeat segment in the Tandem Repeat region is a Tandem Repeat Unit (TRU). A tandem repeat unit has a plurality of different phases (e.g., ACT, CTA, TAC) equal in number and length.

The sequence at the end of the tandem repeat region and the sequence differences in the repeat region cause collisions with the sequence in the repeat region, as in the case of conventional repeats, whereas a tandem repeat region is equivalent to a plurality of conventional repeat regions linked together, unlike conventional repeats. This makes this iterative solution both identical and different from other conventional iterative solutions.

As shown in fig. 9, the positioning of reads may be shifted due to the tandem repeat, which may cause conflict due to errors in positioning of reads with repeated end points or repeated differences during the consistency. The core material that raises this problem is that the Kmer used to locate reads is a repeated Kmer. In this case, the head of the incorrectly positioned reads is behind the start of the tandem repeat area (because the non-repeat area before the start will cause the incorrectly positioned reads to fail in the fine alignment process), and the offsets of the reads due to the incorrect positioning are smaller than the SD of the length of the inserted segment, i.e. the mate-pair cannot filter out the reads with the positioning errors.

The first prerequisite to solving TR is correct identification, which in the present algorithm is confirmed by finding tandem repeat units. Since TR larger than the length of the insert also satisfies the activation condition of HC case2, the present algorithm puts the activation decision of TR before the activation decision of HC case 2.

TRUs are mainly identified by finding that kmers are used periodically. For TRUs smaller than the length of reads-Kmer +1 (i.e. TRUs appear twice or more on one read), slides and cuts reads on the collision site into kmers and searches whether multiple occurrences of kmers exist in one read, if most reads have the condition, and the repeated kmers of the reads are consistent or consistent on TRU phases, it can be determined that the current collision is caused by tandem repeat, and the de-tandem repeat module will be activated. And for TRUs with length larger than the reads-Kmer +1, scanning the Kmers in the range from the starting point of the repeated area to the conflict point, and if the Kmers appearing in a fixed period are found, judging that the conflict is caused by the tandem repeat.

After identifying the contents of the TRU, reads at the collision site can be classified into four categories: 1) reads within the TR region, which are completely covered by the TRU; 2) reads containing the start of the TR, which find the TRU only at the tail; 3) reads containing TR endpoints, which find TRUs only at the head end; 4) the reads of both the start and end of the TR are included, which is only present when the TR is smaller than the length of the reads, and now the reads of type 1 that are completely covered by the TRU are not found.

For TR with length less than the length of reads, the position of the reads containing the TR end point is adjusted by an end point alignment mode, and then the conflict can be eliminated. And for TR's larger than the length of reads, only "N" can be crossed and filled in to resolve. For TR which is longer than the reads and shorter than the reads insert length, aligning the right arm reads which are supported by the left arm in the non-repetitive area and contain the TR terminal point according to the terminal point and carrying out consistency, estimating the distance according to the reads insert length and filling in the estimated number of N; for TR larger than the length of the insert, it can be found that none of the right arm reads containing the end point of TR is supported by the left arm reads in the non-repetitive region, and at this time, these reads are directly normalized, and only one "N" is filled as a mark. It should be noted that minor differences (e.g., individual base substitutions, insertions, and deletions) in the TR region can be confused with the TR-terminus, and conflicts can arise when identified reads that contain the TR-terminus are aligned. At this time, the conflicting reads can be assembled by using a DBG method to naturally construct a difference sequence and a TR endpoint sequence in the separated TR region, and then the position of the assembled sequence is calculated by using the position of the left arm read on the previously extended non-repetitive region corresponding to the right arm read located on the assembled sequence, so that the sequence farthest from the extension point is the correct TR endpoint.

According to the embodiment of the invention, since the extension of seed is parallel, in order to avoid that the same region is repeatedly extended by a plurality of seed, the reads which participate in the extension need to be marked as 'used', when other seed initializes or extends, the extension is stopped when finding the reads, and then the connection is carried out by the contig merging module. It should be noted that reads within the duplicate area of mate-pair are not explicitly located, and to prevent false extension stops, reads of the duplicate nature will be labeled "duplicate" and will not be used as a determination of redundant extension.

According to embodiments of the present invention, when reads in a well are located on contigs, which may be mislocations due to repetition or sequencing errors, the algorithm requires that the well have relatively sufficient coverage for the extended seed region to be considered an EW, while EW with insufficient coverage will be discarded. Based on efficiency considerations, the update of the EW set is performed each time a certain extension length is reached, and only coverage conditions within this length range are considered.

After all contigs have been extended, they are combined, and the phasing between contigs is performed. Contigs that can be merged are divided into two cases:

contigs that stopped by the presence of "used" reads were found during extension. The merging point of two contigs is now calculated directly from the positions of these "used" reads on the two contigs.

Contig in which extension is stopped due to duplication, too low depth of local sequencing, and the like. At this time, whether mate-pair reads of outward reads (left arm reads in case of downstream extension and right arm reads in case of upstream extension) on the terminal non-repetitive region of a contig are positioned on other contigs is considered, and if the match-pair reads are found successfully, the distance between two contigs is estimated by the length of the insertion segment and the estimated number of 'N's is filled.

Because the basic unit of skeleton sequence construction in the algorithm is also the contigs which are extended and terminated but not combined yet, the contig combining step establishes the relationship between contigs (overlapped or connected with intervals) in actual processing, and then combines the relationship between contigs established in the skeleton sequence constructing step to carry out mutual inspection, so as to simplify the relationship and solve the conflict, and then the contig combining operation is carried out substantially.

As the extension progresses, the well/LFR has a characteristic of transition change, the characteristic can be used for determining the position relation between contigs, and the algorithm finally carries out a skeleton sequence construction (scaffold) operation on all contigs. In fact, this thinking has been reflected in the previous large-scale iterative solutions.

In the algorithm, a sequence formed by determining the context between contigs mainly through well information is defined as a framework sequence (scaffold), which is different from the framework sequence constructed through mate-pair information of reads in the traditional definition. The scaffold in the algorithm only expresses the context of contigs, but the specific distance between contigs cannot be confirmed, and the contigs are connected by only one 'N'. In the traditional sense, the scaffold not only determines the context between contigs through the mate-pair of reads, but also calculates the distance between contigs through the positions of pairs of reads respectively positioned on two contigs and the lengths of their insertion fragments, which is consistent with the method in the contig combination. Both of those approaches involve sequential Arrangement of contigs, which is an Optimal Linear Arrangement (OLA) with NP-Hard properties. It should be noted that contigs in scaffold are self-directed (four deoxynucleotides A, C, T, G are called DNA base sequences according to polydeoxyribonucleotides formed by connecting 3 ', 5' phosphodiester bonds, and are the expression forms of contigs and scaffold in assembly, the connection of deoxynucleotides has strict directionality, and 3 ', 5' phosphodiester bonds are formed between 3 '-OH of the previous deoxynucleotide and 5' -phosphate of the next deoxynucleotide to form a linear DNA macromolecule without branching, and the expression of DNA defines that 5 'end is in "+" direction towards 3' end and 3 'end is in "-" direction towards 5' end), so contigs with correct position and wrong direction can cause large assembly errors and are expressed as serious subsequence inversion errors.

Typically, a well appears among multiple adjacent contigs, and a set of adjacent contigs is obtained through the well. First, find the well with the contig as the starting point in the well to which the contig belongs, and define this contig as the starting point contig of this well. Then, other contigs containing the well are found out, the intersection of the well set contained in the well and the well set of the starting contig is calculated respectively, obviously, the contig which has larger intersection with the well set of the starting contig is closer to the starting contig, thereby the front-back position relation of a group of contigs can be determined, and then a scaffold is constructed.

For the directionality judgment of contigs in the scaffold, the well sets of the head end and the tail end of the contigs need to be considered separately, and the similarity of the well sets with the head end or the tail end of the adjacent contigs is calculated respectively. When contigs are short, transitional changes of Well are not obvious, and situations with unknown positions can occur, so after the preliminary screening of all contigs is completed, the results of combining the contigs are needed to be checked and corrected, the relations obtained by the two methods are combined, and finally the scaffold sequence is output.

While embodiments of the invention have been shown and described, it will be understood by those of ordinary skill in the art that: various changes, modifications, substitutions and alterations can be made to the embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.

Claims

1. A method of assembling a sequence of separate long fragments, comprising:

2. The method of claim 1, wherein the seed sequence is obtained based on a genomic reference sequence by:

in the genomic reference sequence, performing disruption according to a specific distance N between sequence contigs; and

truncating the broken reference sequence according to a preset length so as to obtain the seed sequence;

the predetermined length is not less than the length of the sequencing library insert in the sequencing.

3. The method of claim 1, wherein the set of reads comprises paired reads, and wherein the seed sequence is obtained based on the reads by:

(1) the read is cut into a plurality of Kmers in a sliding mode, an index RKI of the Kmers to the read is constructed, and the index RKI is used for accessing the corresponding read through the Kmers;

(2) extracting a pair of paired reads without a high frequency Kmer from the set of reads;

(3) respectively determining all the reads corresponding to the Kmers of two reads in the pair of reads in the step (2) by using the index RKI to obtain a first read group and a second read group;

(4) determining sequencing wells corresponding to the first read group and the second read group in the step (3) respectively to obtain a first sequencing well set and a second sequencing well set;

(5) determining the intersection of the first sequencing well set and the second sequencing well set in (4), and if the size of the intersection is not significantly different from the expected number of valid sequencing wells of the base, determining the paired reads in (2) as the seed sequence;

wherein in (5):

determining the paired reads in (2) as the seed sequence if the size of the intersection is between half and twice the expected number of valid sequencing wells for the base.

4. The method of claim 1, wherein (b) comprises:

(i) the read is cut into a plurality of Kmers in a sliding mode, an index RKI of the Kmers to the read is constructed, and the index RKI is used for accessing the corresponding read through the Kmers;

(ii) extending the plurality of seed sequences in parallel based on the reads and their corresponding indices RKI to obtain the plurality of sequence contigs.

5. The method according to claim 4, wherein the RKI is obtained by:

performing sliding cutting on the read reads into a plurality of Kmers;

and constructing a hash taking the Kmer as a key value, wherein the hash forms the RKI, and the hash records the frequency of the Kmer, the read segment to which the Kmer belongs, and the position and the direction of the Kmer on the read segment.

6. The method of claim 4, wherein the seed sequence is extended by repeating the steps of:

selecting a seed sequence suitable for extension;

mapping the reads to the seed sequence to obtain an extended sequence;

performing base-by-base alignment treatment on reads positioned at the end of the extended sequence; and

if the process of consensus fails, then a heterozygosity identification, phasing process and/or resolution of the repeated sequences is performed.

7. The method of claim 6, wherein the seed sequence suitable for extension is selected by:

sliding cutting the seed sequence into Kmer;

obtaining a read corresponding to the Kmer through the RKI;

aligning the corresponding reads to the seed sequence;

determining the coverage condition of the sequencing holes on the seed sequence based on the sequencing holes corresponding to the corresponding reads; and

based on the coverage condition, a seed sequence suitable for extension is determined.

8. The method of claim 6, wherein the reads are localized to the seed sequence by:

sliding cutting the seed sequence into Kmer;

obtaining a read corresponding to the Kmer through the RKI;

and positioning the reads corresponding to the Kmer to the seed sequence, and comparing the reads one by one.

9. The method of claim 6, wherein during the consensus process, if the pool of valid sequencing wells of an extended site is evenly assigned by different base types at that site, it is determined that heterozygosity exists.

10. The method of claim 9, wherein after determining the presence of heterozygosity, the extended sequence is divided into a plurality of strands that are separately extended.

11. The method of claim 6, wherein the set of reads comprises a plurality of pairs of paired reads, a distance between two reads of a pair of paired reads being L,

during the unification process, if a distance between a read positioned downstream in the extension direction and a corresponding read in the pair of reads is not L, it is determined that the position at which the read positioned downstream in the extension direction is positioned is a start point of the repetitive sequence.

12. The method of claim 11, wherein an end point of the repeating sequence is a position at which a read positioned downstream in the extending direction of the pair of reads is positioned, a distance between the read positioned downstream in the extending direction and the corresponding read is L,

or collision sites upon extension.

13. The method of claim 12, wherein the determining whether the repeated sequence is a tandem repeated sequence is performed by:

performing sliding cuts on reads mapped to the repeated sequence to obtain Kmers;

judging whether the Kmer obtained by each read through sliding cutting has a repeated Kmer, if not, judging that no tandem repeat sequence exists, and if so, judging that the tandem repeat sequence exists.

14. The method of claim 13, wherein if the repeated sequence is a tandem repeated sequence, the tandem repeated sequence is resolved by:

(m) determining the length of the tandem repeat sequence;

(n) adjusting the position of the read containing the end point of the tandem repeat sequence in an end point alignment manner.

15. The method of claim 13, wherein if the repeated sequence is not a tandem repeated sequence and there are the following conditions, determining that the repeated sequence is a large double-repeated sequence by:

the repeat sequence is greater than L in length, or reads of the pair of reads that correspond to reads positioned downstream of the repeat sequence are also positioned on the repeat sequence.

16. The method of claim 15, wherein if the repeated sequence is a large double-repeated sequence, the large double-repeated sequence is resolved by:

and comparing the difference between the number of the effective sequencing holes corresponding to the upstream repetitive sequence and the number of the effective sequencing holes corresponding to the downstream repetitive sequence in the large double-repetitive sequence with the expected value of the number of the effective sequencing holes of the base, and solving the conflict on the large double-repetitive sequence according to the difference.

17. The method of claim 15, wherein the repeated sequence is determined to be a small double complex sequence if the repeated sequence is not a large double complex sequence and the following conditions exist:

the length of the repeated sequence is less than L.

18. The method of claim 17, wherein the small double repeat sequence is resolved by at least one of:

(p) determining the base at the collision site by comparing the proximity of the distance between pairs of reads supporting two colliding bases to the mean, using the mean of the distances between pairs of reads supporting each base as the expected position of the collision site;

(k) constructing an auxiliary contig using reads corresponding to reads of the paired reads that are positioned to the most upstream of the non-repetitive sequence in the extended sequence standard deviation range, and determining bases at the collision sites using reads of the paired reads that are corresponding to reads positioned to the downstream of the auxiliary contig.

19. The method of claim 18, wherein the extending the seed sequence is terminated if the small duplex complex sequence cannot be resolved.

20. The method of claim 1, wherein (c) comprises:

(iii) establishing a merged connection relationship between the sequence contigs based on the reads;

(iv) constructing a primary backbone sequence based on the sequence contigs and sequencing wells corresponding to the reads comprised by the sequence contigs;

(v) obtaining an assembly result separating long fragment sequences by merging the sequence contigs to construct the backbone sequence after mutually examining the merged linkage relationship between the plurality of sequence contigs and the primary backbone sequence.

21. An apparatus for assembling a sequence of partitioned long fragments, comprising:

22. The apparatus of claim 21, wherein the seed sequence is obtained based on the genomic reference sequence by:

23. The apparatus of claim 21, wherein the set of reads comprises paired reads, and wherein the seed sequence is obtained based on the reads by:

wherein in (5):

24. The apparatus of claim 21, comprising in the extension module performing:

25. The device of claim 24, wherein the RKI is obtained by:

performing sliding cutting on the read to obtain a plurality of Kmers;

26. The apparatus of claim 24, wherein the seed sequence is extended by repeating the steps of:

selecting a seed sequence suitable for extension;

mapping the reads to the seed sequence to obtain an extended sequence;

27. The apparatus of claim 26, wherein the seed sequence suitable for extension is selected by:

sliding cutting the seed sequence into Kmer;

obtaining a read corresponding to the Kmer through the RKI;

aligning the corresponding reads to the seed sequence;

28. The apparatus of claim 26, wherein the reads are localized to the seed sequence by:

sliding cutting the seed sequence into Kmer;

obtaining a read corresponding to the Kmer through the RKI;

29. The apparatus of claim 26, wherein during the consensus process, if the pool of valid sequencing wells of an extended site is evenly assigned by different base types at that site, it is determined that heterozygosity exists.

30. The device of claim 29, wherein after determining the presence of heterozygosity, the extension sequence is divided into a plurality of strands that are individually extended.

31. The apparatus of claim 26, wherein the set of reads comprises a plurality of pairs of paired reads, a distance between two reads of a pair of paired reads being L,

32. The apparatus of claim 31, wherein an end point of the repeating sequence is a position of a pair of reads located downstream in the direction of extension with a distance L from a corresponding read, or a collision site when extended.

33. The apparatus of claim 32, wherein the determination of whether the repeated sequence is a tandem repeated sequence is made by:

34. The apparatus of claim 33, wherein if the repeated sequence is a tandem repeated sequence, the tandem repeated sequence is resolved by:

(m) determining the length of the tandem repeat sequence;

35. The apparatus of claim 33, wherein the repeated sequence is determined to be a large double-repeated sequence if the repeated sequence is not a tandem repeated sequence and if:

36. The apparatus of claim 35, wherein if the repeated sequence is a large double-repeated sequence, the large double-repeated sequence is parsed by:

37. The apparatus of claim 35, wherein the repeated sequence is determined to be a small double complex sequence if the repeated sequence is not a large double complex sequence and if:

the length of the repeated sequence is less than L.

38. The apparatus of claim 37, wherein if the repeated sequence is a small double-repeated sequence, the small double-repeated sequence is resolved by at least one of:

39. The apparatus of claim 38, wherein the extension of the seed sequence is terminated if the small double complex sequence cannot be resolved.

40. The apparatus of claim 21, wherein the framework sequence construction module is configured to:

41. A computer-readable medium storing a computer-executable program, execution of which comprises performing the method of any one of claims 1-20.

42. A system for assembling a sequence of partitioned long fragments, comprising:

a data input unit for inputting data;

a data output unit for outputting data;

a storage unit for storing data including a computer executable program;

a processor coupled to the data input unit, the data output unit, and the storage unit, for executing the computer-executable program, the executing the program comprising performing the method of any of claims 1-20.