WO2017143585A1 - Method and apparatus for assembling separated long fragment sequences

Info

Publication number: WO2017143585A1
Application number: PCT/CN2016/074665
Authority: WO
Inventors: 谢寅龙; 黄伟华; 李净净; 郭瑞东; 唐静波; 邓超
Original assignee: 深圳华大基因研究院
Priority date: 2016-02-26
Filing date: 2016-02-26
Publication date: 2017-08-31
Also published as: CN108350495B; CN108350495A; HK1254399A1

Abstract

A method, apparatus, and system for assembling separated long fragment sequences. The method comprises: (a) acquiring a set of reads by sequencing, and recording the sequencing well corresponding to each read in the set, each sequencing well comprising at least one long fragment sequence; (b) using the reads and their corresponding sequencing wells to extend multiple seed sequences in parallel so as to obtain multiple sequence contigs, the seed sequences being determined from known sequences; and (c) constructing a skeleton (scaffold) sequence on the basis of the reads, the sequence contigs, and the sequencing wells corresponding to the reads contained in the sequence contigs, so as to obtain the assembly result for the separated long fragment sequences.

Description

Method and apparatus for assembling long segment sequences

Technical field

The present invention relates to the field of biotechnology, and in particular to a method and apparatus for assembling sequences of separated long fragments.

Background technique

More than ten years have passed since the first human genome was assembled to a fairly complete state. With genome maps available, a wide range of bioinformatics analysis methods and software for human genome resequencing has emerged, contributing significantly to research on human disease, medicine, and health.

Comparative genomics is a discipline that compares known genes and genome structures to understand gene function, expression mechanisms, and species evolution. Its basic principle is that characteristics shared by two organisms are usually encoded by evolutionarily conserved DNA, so the relevant DNA fragments will be identical or similar. Comparative genomic analysis requires a genome map (a genomic reference sequence) and sequencing data for the objects being compared. Variant detection is an important part of comparative genomics. The International HapMap Project and genome-wide association studies (GWAS) both base their research on one type of variation, the single-nucleotide polymorphism (SNP).

Assembly is the process of integrating shorter fragment sequences into longer sequences. Because of the limits of sequencing technology, a genome cannot be read in full by sequencing each chromosome from end to end. Instead, the whole genome is usually broken into fragments of tens to thousands of bases, whose contents are read by massively parallel sequencing; assembly then analyzes and integrates these fragments and finally reconstructs a relatively complete genomic sequence. Identifying variants through assembly is a newer application of assembly techniques; the main purpose of assembly is in fact to build the genome. When no genomic reference sequence exists, constructing the genome is a process that starts from scratch, and assembly is especially important; this type of assembly is also called "de novo assembly". The human genome reference sequence is about 3 Gbp (3×10⁹ bp) long, which is the number of base pairs in the haploid genome. Humans are diploid organisms, so the actual genome size is 2 × 3 Gbp = 6 Gbp. The diploid genome of a human individual consists of one haploid set contributed by each parent, and the differences between the two haploid sets mean that, at one or more sites on homologous chromosomes, the individual carries different alleles; this is heterozygosity. Moreover, the human reference sequence was constructed from data from multiple individuals, so it is in effect a virtual haploid genome with heterozygosity mixed in. As genomic research deepens, the haploid human reference sequence is increasingly unable to meet demand, the construction of haplotypes (haplotyping) is increasingly important, and genomic analyses based on haplotype information are emerging.

Haplotype information helps to interpret the relationship between genotype and phenotype. Two individuals with the same set of heterozygous sites may still differ in phenotype and disease susceptibility depending on their haplotypes. Haplotype information is therefore important both for basic research (such as allele-specific expression) and for disease research (such as Mediterranean fever and breast cancer). The operation of partitioning the alleles at multiple heterozygous sites to determine the haplotypes is called phasing, and it is a key operation in constructing haplotypes. There are many methods for constructing haplotypes, which fall mainly into the following five categories:

1. Applying population statistics to data from multiple unrelated individuals

2. Applying Mendelian inheritance to family data

3. Using sequencing read information directly

4. Experimental methods

5. Physical separation

It is important to note that the core of the physical separation method is to separate the long DNA fragments into which the genome has been broken, either by means of fosmid plasmids or by direct physical separation in multiwell plates. After separation, the further fragmentation and amplification operations (library construction) required for sequencing are carried out, and the sequences in each separation unit are tagged with a different barcode so that the units can be distinguished. In this way, the whole genome is divided by separation into many sub-portions. When the number of sub-portions is large, each sub-portion contains only one haploid copy of any small region of the genome. Regions that are heterozygous at the genome-wide level therefore appear in homozygous form within these small regions, which is of great importance for the construction of haplotypes.

Each separation unit has its own unique barcode sequence, so that after sequencing the reads belonging to each unit can be retrieved by identifying the barcode. In fosmid separation technology the separation unit is called a fosmid pool, and each fosmid pool contains one or more long fragments of about 37 Kbp; in multiwell plate separation technology the separation unit is called a well, and each well contains multiple long fragments whose length varies from technology to technology. In either case, the separation method introduces a new type of information: the total collection of reads is divided by barcode into multiple groups. Compared with whole genome shotgun (WGS) sequencing, the reads are no longer drawn at random from any location on the genome; instead, the reads in each group come from a common set of small regions of the genome, namely the regions covered by the long fragments at the time of separation. Although these small regions can still come from any position on the whole genome, the reads in each group are constrained and clustered, and this added clustering information is the key to constructing haplotypes.

Long Fragment Read (LFR) technology greatly reduces the complexity of library construction, lowering both the time and the cost of haplotype construction.

However, assembly methods for current long fragment read data remain to be improved.

Summary of the invention

The present invention is directed to solving at least some of the above technical problems or at least providing a useful commercial choice.

In a first aspect, the invention provides a method of assembling separated long fragment sequences, comprising:

(a) obtaining a set of reads by sequencing, and recording the sequencing well corresponding to each read in the set, each sequencing well comprising at least one long fragment sequence;

(b) using the reads and their corresponding sequencing wells to extend a plurality of seed sequences in parallel so as to obtain a plurality of sequence contigs, wherein the seed sequences are determined from known sequences;

(c) constructing a skeleton sequence based on the reads, the sequence contigs, and the sequencing wells corresponding to the reads contained in the sequence contigs, so as to obtain the assembly result for the separated long fragment sequences.

It will be understood by those skilled in the art that all or part of the steps of the method of the present invention may be carried out by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, which may include read-only memory, random access memory, a magnetic disk, an optical disc, and the like.

In a second aspect, the invention provides a system for assembling separated long fragment sequences, the system comprising: a data input unit for inputting data; a data output unit for outputting data; a storage unit for storing data, including a computer-executable program; and a processor coupled to the data input unit, the data output unit, and the storage unit for executing the computer-executable program, wherein executing the program comprises performing the above method.

In a third aspect, the invention provides an apparatus for assembling separated long fragment sequences, comprising:

an input module, which obtains a set of reads by sequencing and records the sequencing well corresponding to each read in the set, each sequencing well comprising at least one long fragment sequence;

an extension module, which uses the reads and their corresponding sequencing wells to extend a plurality of seed sequences in parallel so as to obtain a plurality of sequence contigs, wherein the seed sequences are determined from known sequences;

a skeleton sequence construction module, which constructs a skeleton sequence based on the reads, the sequence contigs, and the sequencing wells corresponding to the reads contained in the sequence contigs, so as to obtain the assembly result for the separated long fragment sequences.

The above described method and/or apparatus of the present invention has at least the following features and advantages:

Iterative extension: The extension of a seed sequence is an iterative process. The sequence obtained in one round of extension serves as the base for the next round, so that the seed sequence is extended continuously.

Linear extension: Unlike graph-based assembly algorithms, linear extension keeps the structures encountered during assembly simple, with logical relationships that are comparatively clear and easy to classify. A graph algorithm gathers the information of all copies of a repeat region in one place, which involves more information and a more complex structure. In other words, the graph approach solves everything at once: as soon as one repeat region is resolved, all the genomic regions associated with it (those with the same or similar repeat sequences) are resolved at the same time. The linear method resolves only the repeat in the current region and does not resolve the other regions simultaneously; the repeat is resolved again when the extension reaches it in another region of the genome. The relevant information is recorded after a repeat has been resolved, which makes the next encounter with that repeat easier to resolve and reduces the extra computational cost. In addition, linear extension makes the extension of each seed sequence independent of the others, which makes the computation easier to parallelize.

Multiple extension: Because the goal of the algorithm is to construct a polyploid genome, the extension of a seed sequence uses the well information to extend the haplotypes along multiple paths. Whenever a heterozygous region is encountered, phasing is performed immediately, taking into account the heterozygous regions assembled so far; that is, phasing is also linear and runs through the various modules of the assembly apparatus. To save resources, homozygous regions are still extended along a single path; when a heterozygous region is encountered the extension splits into multiple paths, and once the heterozygous region has been passed the paths are merged back into a single path that continues to extend.

Global information decision: If an algorithm uses only the information relevant to the current decision and ignores the branch choices available globally, as a greedy algorithm does (Dijkstra's single-source shortest-path algorithm is a familiar example of a greedy algorithm), it is prone to errors or may fail to reach the globally optimal solution. In genome assembly, greedy algorithms often cause assembly errors through wrong choices. This method therefore does not use a greedy algorithm to resolve conflicts during extension, but uses global information to make the relevant decisions. Because LFR fragments are long (~100 Kbp), all the information available at a position (Kmers, reads, mate pairs, and well information) is considered in path selection, and the well information extends the range over which a decision is made to about 100 Kbp. Provided that large repeats are not abundant, the algorithm is essentially global. When the extension encounters a complicated situation, the algorithm strictly chooses to terminate the extension rather than extend along the higher-scoring option, which avoids greedy decisions. It is worth noting in particular that, because all the well information at a position is considered during extension, the algorithm is more global than conventional hierarchical assembly algorithms and can greatly reduce the sequencing depth required; that is, no well needs an especially high sequencing depth, which saves considerable resources in both cost and time.

The additional aspects and advantages of the invention will be set forth in part in the description which follows.

DRAWINGS

The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of the embodiments in conjunction with the drawings, in which:

Figure 1 is a flow chart of a method of assembling separated long fragment sequences in an embodiment of the present invention.

Figure 2 is a flow chart of a method of assembling separated long fragment sequences in an embodiment of the present invention.

Figure 3 is a block diagram of an apparatus for assembling separated long fragment sequences in an embodiment of the present invention.

Figure 4 is a schematic structural view of an apparatus for assembling separated long fragment sequences in an embodiment of the present invention.

Figure 5 is a schematic diagram of the iterative extension of a seed sequence in an embodiment of the present invention.

Figure 6 is a schematic diagram of identifying a repeat region in an embodiment of the present invention.

Figure 7 is a schematic illustration of conflict handling for a large double repeat sequence whose two repeat regions are relatively large, in an embodiment of the present invention.

Figure 8 is a schematic illustration of conflict handling for a large double repeat sequence whose two repeat regions are relatively small, in an embodiment of the present invention.

Figure 9 is a schematic illustration of conflict handling for tandem repeats in an embodiment of the present invention.

Detailed description

The embodiments of the present invention are described in detail below, and the examples of the embodiments are illustrated in the drawings, wherein the same or similar reference numerals are used to refer to the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the drawings are intended to be illustrative of the invention and are not to be construed as limiting.

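In a first aspect, the invention proposes a method of assembling separated long fragment sequences. Referring to Figure 1, the method comprises: (a) obtaining a set of reads by sequencing and recording the sequencing well corresponding to each read in the set, each sequencing well comprising at least one long fragment sequence; (b) using the reads and their corresponding sequencing wells to extend a plurality of seed sequences in parallel so as to obtain a plurality of sequence contigs, the seed sequences being determined from known sequences; and (c) constructing a skeleton sequence based on the reads, the sequence contigs, and the sequencing wells corresponding to the reads contained in the sequence contigs, so as to obtain the assembly result for the separated long fragment sequences.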

According to an embodiment of the invention, the seed sequences are obtained from a genomic reference sequence by the following steps:

breaking the genomic reference sequence at its N bases;

cutting the resulting pieces to a predetermined length to obtain the seed sequences.
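The two steps above can be sketched as follows. This is a minimal illustration only: the function name `seeds_from_reference` and the decision to keep a final chunk that is shorter than the predetermined length are assumptions, not details taken from the source.

```python
import re


def seeds_from_reference(reference, seed_length):
    """Break a reference at runs of N, then cut each piece into seed-sized chunks.

    Assumption: a trailing chunk shorter than seed_length is kept as a seed.
    """
    if seed_length <= 0:
        raise ValueError("seed_length must be positive")
    seeds = []
    # Step 1: break the reference wherever unknown bases (N or n) occur.
    for piece in re.split(r"[Nn]+", reference):
        # Step 2: cut each N-free piece into consecutive chunks of seed_length.
        for start in range(0, len(piece), seed_length):
            seeds.append(piece[start:start + seed_length])
    return seeds


if __name__ == "__main__":
    print(seeds_from_reference("ACGTNNACGTAC", 4))
```

In practice the chunk length would be chosen relative to the library insert size, as discussed in the text that follows.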

The predetermined length is not particularly limited. Generally, it is not less than the insert size, so that the reads from the same insert can be located on the same seed sequence. According to an embodiment of the invention, the sequencing is paired-end sequencing and the predetermined length is at least twice the insert length of the sequencing library, which helps paired reads to be located on the same seed sequence.

According to an embodiment of the invention, the set of reads comprises paired reads, and the seed sequences are obtained from the reads by the following steps: (1) cutting the reads into a plurality of Kmers with a sliding window and constructing a Kmer-to-read index, the RKI, through which the reads corresponding to a Kmer can be accessed; (2) extracting from the read set a pair of paired reads that contains no high-frequency Kmer; (3) using the RKI, determining all the reads that correspond to the Kmers of each of the two reads of the pair in (2), to obtain a first read group and a second read group; (4) determining the sequencing wells corresponding to the first read group and the second read group in (3), to obtain a first sequencing well set and a second sequencing well set; (5) determining the intersection of the first and second sequencing well sets in (4) and, if the size of the intersection is not significantly different from the expected number of effective sequencing wells per base, taking the paired reads in (2) as a seed sequence. The expected number of effective sequencing wells per base is determined by the amount of nucleic acid in the sample.

"High frequency" is relative to the average frequency: a Kmer that occurs more often than the average Kmer is regarded as a high-frequency Kmer. More often, the inventors define as high-frequency, as needed, a Kmer whose number of occurrences is much higher than the average, where "much higher" means at least 10, 20, 30, 40, 50, or 60 times the average number of Kmer occurrences. The expected Kmer frequency depends on the sequencing depth, the read length, and the Kmer size. According to one embodiment of the invention, for K = 19 (19mers), a read length of 100 bp, and a sequencing depth of 100X, the expected Kmer frequency is 100 (sequencing depth) × {[100 (read length) - 19 (Kmer size) + 1] / 100 (read length)} = 82. With an average frequency of 82, a Kmer with a frequency of 3000 to 5000 is set as a high-frequency Kmer.
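The expected Kmer frequency described above is a one-line calculation. The sketch below reproduces the worked example (depth 100X, read length 100 bp, K = 19) and adds a fold-based high-frequency test. The function names and the default fold of 10 are illustrative assumptions; the source lists 10 to 60 times the average as possible thresholds.

```python
def expected_kmer_frequency(depth, read_length, k):
    """Expected Kmer frequency: depth * (read_length - k + 1) / read_length."""
    return depth * (read_length - k + 1) / read_length


def is_high_frequency(count, average, fold=10):
    """Flag a Kmer as high-frequency when its count is at least fold times the average.

    Assumption: fold=10 is only one of the multiples mentioned in the text.
    """
    return count >= fold * average


if __name__ == "__main__":
    average = expected_kmer_frequency(depth=100, read_length=100, k=19)
    print(average)
    print(is_high_frequency(3000, average))
```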

"Not significantly different" may mean that there is no statistically significant difference, or more loosely that there is no substantial difference. According to an embodiment of the present invention, in (5), if the size of the intersection is between half and twice the expected number of effective sequencing wells per base, the paired reads in (2) are taken as a seed sequence.
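Steps (3) to (5) of the read-based seed selection can be sketched as set operations. This is an illustration under stated assumptions: `kmer_to_reads` stands in for the RKI as a plain dictionary from Kmer to read identifiers, `read_to_well` maps each read identifier to its sequencing well, and both names are hypothetical.

```python
def wells_for_read(read, k, kmer_to_reads, read_to_well):
    """Collect the wells of every read that shares at least one Kmer with `read`."""
    wells = set()
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        for read_id in kmer_to_reads.get(kmer, ()):
            wells.add(read_to_well[read_id])
    return wells


def is_seed_pair(first_wells, second_wells, expected_wells):
    """Accept the pair when the shared well count lies between half and twice the expectation."""
    shared = len(first_wells & second_wells)
    return expected_wells / 2 <= shared <= 2 * expected_wells


if __name__ == "__main__":
    index = {"ACG": ["r1", "r2"], "CGT": ["r2"]}
    wells = {"r1": 1, "r2": 2}
    print(wells_for_read("ACGT", 3, index, wells))
    print(is_seed_pair({1, 2, 3, 4}, {2, 3, 4, 5}, expected_wells=4))
```

An intersection far below the expectation suggests the two reads do not come from the same locus, while one far above it suggests a repeated region; both cases are rejected by the half-to-twice window.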

Of the two ways of generating seed sequences described above, the former is built only from known sequences (such as a reference sequence, reads obtained by sequencing, or assembled fragments), whereas the latter combines known sequences, the RKI, and the sequencing wells, and can therefore be applied to de novo assembly of species that have no reference sequence at all.

According to an embodiment of the invention, (b) comprises: (i) cutting the reads into a plurality of Kmers with a sliding window and constructing a Kmer-to-read index RKI, through which the reads corresponding to a Kmer can be accessed; and (ii) extending the plurality of seed sequences in parallel based on the reads and the RKI to obtain the plurality of sequence contigs.

According to an embodiment of the present invention, the RKI is obtained by cutting the reads into a plurality of Kmers with a sliding window and constructing a hash keyed by Kmer. The hash constitutes the RKI and records, for each Kmer, its frequency, the associated reads, and the position and direction of the Kmer on each read.
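A hash of this kind might be built as in the sketch below. The source states only that the hash records frequency, associated reads, position, and direction; storing each Kmer under the lexicographically smaller of itself and its reverse complement (a "canonical" key) is an assumption made here so that direction has a concrete meaning, and the names `build_rki` and `reverse_complement` are hypothetical.

```python
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_rki(reads, k):
    """Build a Kmer index from a dict of read_id -> sequence.

    Each entry holds the Kmer frequency and a list of (read_id, position, strand).
    Assumption: Kmers are keyed by their canonical form.
    """
    rki = defaultdict(lambda: {"count": 0, "hits": []})
    for read_id, seq in reads.items():
        for pos in range(len(seq) - k + 1):
            kmer = seq[pos:pos + k]
            rc = reverse_complement(kmer)
            # "+" means the read carries the key as written, "-" its reverse complement.
            key, strand = (rc, "-") if rc < kmer else (kmer, "+")
            entry = rki[key]
            entry["count"] += 1
            entry["hits"].append((read_id, pos, strand))
    return dict(rki)


if __name__ == "__main__":
    index = build_rki({"r1": "ACGTA", "r2": "TACGT"}, k=3)
    for kmer, entry in sorted(index.items()):
        print(kmer, entry["count"], entry["hits"])
```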

According to an embodiment of the invention, a seed sequence is extended by repeating the following steps: selecting a seed sequence suitable for extension; locating reads on the seed sequence to obtain an extension sequence; performing base-by-base consistency processing on the reads located at the end of the extension sequence; and, if the consistency processing fails, performing heterozygosity identification, phasing, and/or resolution of repeat sequences.
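The iterative loop can be illustrated with a deliberately simplified toy. The sketch below extends a seed one base at a time and stops as soon as the reads disagree on the next base. It is not the method of the invention: it locates reads by exact substring search instead of the RKI, ignores well information, and simply terminates where the real method would attempt heterozygosity identification, phasing, or repeat resolution. All names are hypothetical.

```python
def next_consistent_base(contig, reads, k):
    """Return the next base if every read containing the last k bases agrees on it."""
    anchor = contig[-k:]
    candidates = set()
    for read in reads:
        start = read.find(anchor)
        while start != -1:
            nxt = start + k
            if nxt < len(read):
                candidates.add(read[nxt])
            start = read.find(anchor, start + 1)
    # A single candidate means the reads are consistent at this position.
    return candidates.pop() if len(candidates) == 1 else None


def extend_seed(seed, reads, k, max_steps=10000):
    """Extend the seed base by base; each round builds on the previous result."""
    contig = seed
    for _ in range(max_steps):  # guard against endless extension through repeats
        base = next_consistent_base(contig, reads, k)
        if base is None:
            break  # no support or a conflict: stop instead of guessing
        contig += base
    return contig


if __name__ == "__main__":
    print(extend_seed("ACGT", ["ACGTAC", "GTACGG", "ACGGTT"], k=4))
    print(extend_seed("ACGT", ["ACGTA", "ACGTC"], k=4))
```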

According to an embodiment of the invention, a seed sequence suitable for extension is selected by: cutting the seed sequence into Kmers with a sliding window; obtaining the reads corresponding to the Kmers through the RKI; aligning those reads with the seed sequence; determining, from the sequencing wells corresponding to those reads, how the seed sequence is covered by sequencing wells; and determining from that coverage whether the seed sequence is suitable for extension.

According to an embodiment of the invention, reads are located on the seed sequence by: cutting the seed sequence into Kmers with a sliding window; obtaining the reads corresponding to the Kmers through the RKI; and mapping those reads to the seed sequence and aligning them base by base.

According to an embodiment of the present invention, during the consistency processing, if the effective sequencing wells at the extension site are evenly divided among the different base types observed at that site, the site is judged to be heterozygous.
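One way to make "evenly divided" concrete is to require each candidate base to be supported by a minimum share of the effective wells. The sketch below does this; the 30% cutoff, the treatment of weakly supported bases as noise, and the function name `classify_site` are all assumptions introduced for illustration, since the source gives no numeric criterion.

```python
def classify_site(wells_by_base, min_fraction=0.3):
    """Classify an extension site from the wells supporting each observed base.

    wells_by_base maps a base to the set of effective well ids supporting it.
    Returns a (label, bases) tuple.
    """
    counts = {base: len(wells) for base, wells in wells_by_base.items() if wells}
    total = sum(counts.values())
    if total == 0:
        return ("unsupported", [])
    # Bases whose share of wells reaches the cutoff count as real alleles.
    major = sorted(b for b, c in counts.items() if c / total >= min_fraction)
    if len(major) >= 2:
        return ("heterozygous", major)
    if len(major) == 1:
        return ("homozygous", major)  # minor bases are treated as noise
    return ("conflict", sorted(counts))


if __name__ == "__main__":
    print(classify_site({"A": {1, 2, 3, 4}, "G": {5, 6, 7, 8}}))
    print(classify_site({"A": set(range(9)), "G": {10}}))
```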

According to an embodiment of the present invention, once a site is judged to be heterozygous, the extension sequence is split into multiple branches, each of which is extended.

According to an embodiment of the invention, the set of reads comprises a plurality of read pairs, and the distance between the two reads of a pair is L. During consistency processing, if the distance between the read of a pair that lies downstream in the extension direction and its mate is not L, the position of the downstream read is determined to be the start point of a repeat sequence.

It should be noted that most of the numerical values in the present invention are statistical in nature. Therefore, unless otherwise stated, any value given as an exact number may represent a range, for example the interval of plus or minus 10% around the value; or, where the values are normally distributed, the interval of plus or minus the standard deviation around the value. The distance L between the two reads of a pair corresponds to the insert length. In the experimental stage, the sequencing library is generally constructed with a set size, that is, the insert size is a fixed value, and in theory the distance between the outer ends of the paired reads obtained by paired-end sequencing equals that value. In real sequencing data, the distance between paired reads is normally distributed. In the repeat identification and resolution of this example, the inventors set L to the insert size of the experimental stage. Those skilled in the art will understand that L may also be set to any value within plus or minus the standard deviation of that insert size, or to that interval itself, and the repeat sequences encountered during extension can still be identified and resolved by the repeat-resolution method of the present invention.
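The repeat start test, with L treated as an interval rather than an exact value, can be sketched as follows. The input format (a list of downstream-read positions with their observed pair distances), the assumption that positions increase in the extension direction, and the name `find_repeat_start` are illustrative choices, not details from the source.

```python
def find_repeat_start(pairs, insert_size, tolerance):
    """Return the first downstream-read position whose pair distance is not L.

    pairs: iterable of (downstream_position, observed_distance).
    A distance counts as L when it lies within insert_size +/- tolerance, where
    tolerance could be one standard deviation or 10% of the insert size.
    Returns None when every pair is consistent with L.
    """
    abnormal = [
        position
        for position, distance in pairs
        if abs(distance - insert_size) > tolerance
    ]
    return min(abnormal) if abnormal else None


if __name__ == "__main__":
    observed = [(100, 500), (150, 510), (220, 1800), (260, 1750)]
    print(find_repeat_start(observed, insert_size=500, tolerance=50))
```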

According to an embodiment of the invention, the end point of the repeat sequence is either the position lying at distance L from the mate of the read that is located downstream in the extension direction, or the conflict site.

According to an embodiment of the present invention, whether the repeat sequence is a tandem repeat is determined by: cutting the reads located on the repeat sequence into Kmers with a sliding window; and determining whether any Kmer is repeated among the Kmers obtained from each read. If no repeated Kmer is present, it is judged that there is no tandem repeat; if a repeated Kmer is present, it is judged that there is a tandem repeat.
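The tandem repeat test reduces to checking whether any Kmer occurs more than once within a single read. A minimal sketch is given below; reporting a tandem repeat as soon as any one read shows a repeated Kmer is an interpretation of the text, and the function names are hypothetical.

```python
def has_repeated_kmer(read, k):
    """Return True if some Kmer occurs more than once within the read."""
    seen = set()
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        if kmer in seen:
            return True
        seen.add(kmer)
    return False


def is_tandem_repeat(reads_on_repeat, k):
    """Judge a tandem repeat if any read located on the repeat has a repeated Kmer."""
    return any(has_repeated_kmer(read, k) for read in reads_on_repeat)


if __name__ == "__main__":
    print(has_repeated_kmer("ACGACGACG", 3))
    print(is_tandem_repeat(["ACGTTGCA"], 3))
```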

According to an embodiment of the present invention, if the repeat sequence is a tandem repeat, it is resolved by: determining the length of the tandem repeat; and using the reads that contain the end point of the tandem repeat to adjust the position of the end point, so as to determine the base at the conflict site.

According to an embodiment of the present invention, if the repeat sequence is not a tandem repeat and either of the following conditions holds, the repeat sequence is judged to be a large double repeat: the length of the repeat sequence is greater than L, or the mate of a read located on the downstream part of the repeat sequence is itself located on the repeat sequence.

According to an embodiment of the present invention, if the repeat sequence is a large double repeat, it is resolved by taking the difference between the number of effective sequencing wells (EW) corresponding to the upstream repeat copy and the number corresponding to the downstream repeat copy, comparing that difference with the expected number of effective sequencing wells per base, and resolving the conflict on the large double repeat according to the size of the difference. For example, if the two copies in a large double repeat are relatively far apart, the upstream conflicting base is supported by more EWs than the downstream one; when the difference exceeds half of the expected EW number, the base can be determined directly from this comparison and the conflict is resolved. If the difference between the upstream and downstream EW numbers is less than or equal to half of the expected EW number, a helper contig (HC) can be constructed from the upstream arms corresponding to the wrongly placed downstream arms that lie within one insert length of the start of the repeat region, and the EW set on the HC is then compared with the EW sets of the upstream and downstream conflicting bases.
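The first branch of this decision, comparing effective well counts against half of the expected number, can be sketched as below. Returning `None` stands for the case in which a helper contig would be needed, which is not implemented here. Treating the comparison symmetrically (so that a downstream majority is also reported) and the name `resolve_by_well_counts` are assumptions.

```python
def resolve_by_well_counts(upstream_wells, downstream_wells, expected_wells):
    """Pick the better supported copy when the well counts differ clearly.

    Returns "upstream" or "downstream" when the counts differ by more than half
    of the expected number of effective wells, otherwise None.
    """
    difference = len(upstream_wells) - len(downstream_wells)
    if abs(difference) > expected_wells / 2:
        return "upstream" if difference > 0 else "downstream"
    return None  # too close to call: a helper contig would be built instead


if __name__ == "__main__":
    print(resolve_by_well_counts(set(range(10)), {11, 12, 13}, expected_wells=10))
    print(resolve_by_well_counts(set(range(6)), {7, 8, 9, 10}, expected_wells=10))
```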

According to an embodiment of the present invention, if the repeat sequence is not a large double repeat and the following condition holds, the repeat sequence is judged to be a small double repeat: the length of the repeat sequence is less than L.
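Taken together, the tandem, large double, and small double tests form a simple decision cascade, sketched below. The function name, the boolean inputs, and the "unclassified" fallback for a repeat whose length equals L exactly are illustrative assumptions.

```python
def classify_repeat(repeat_length, insert_size, is_tandem, mate_on_repeat):
    """Classify a repeat in the order the text describes.

    is_tandem: result of the repeated-Kmer test on reads located on the repeat.
    mate_on_repeat: True if the mate of a read on the downstream part of the
    repeat is itself located on the repeat.
    """
    if is_tandem:
        return "tandem"
    if repeat_length > insert_size or mate_on_repeat:
        return "large_double"
    if repeat_length < insert_size:
        return "small_double"
    return "unclassified"  # length equal to L with no mate on the repeat


if __name__ == "__main__":
    print(classify_repeat(800, 500, is_tandem=False, mate_on_repeat=False))
    print(classify_repeat(200, 500, is_tandem=False, mate_on_repeat=False))
```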

According to an embodiment of the invention, the small double repeat is resolved by at least one of: (p) taking the mean of the distances between the paired reads supporting each base as the expected position of the conflict site, and comparing how close the distances between the paired reads supporting each of the two conflicting bases are to that mean, so as to determine the base at the conflict site; (k) constructing a helper contig from the mates of the most upstream reads, within the standard deviation range, that are located on the non-repeat part of the extension sequence, and using the mates of the reads located downstream of the helper contig to determine the base at the conflict site.
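Option (p) can be read as choosing the conflicting base whose supporting read pairs have distances closest to the expected value. The sketch below implements that reading, using the insert size as the expected value and returning `None` when the two candidates are not clearly separated. This interpretation, the `min_margin` parameter, and the function name are assumptions rather than details confirmed by the source.

```python
from statistics import mean


def resolve_small_double_repeat(distances_by_base, insert_size, min_margin=0.0):
    """Choose the base whose supporting pair distances are closest to the insert size.

    distances_by_base maps each conflicting base to the observed distances of the
    read pairs supporting it. Returns None when fewer than two bases have support
    or when the best base does not beat the runner-up by more than min_margin.
    """
    deviations = {
        base: abs(mean(distances) - insert_size)
        for base, distances in distances_by_base.items()
        if distances
    }
    if len(deviations) < 2:
        return None
    ranked = sorted(deviations.items(), key=lambda item: item[1])
    (best_base, best_dev), (_, second_dev) = ranked[0], ranked[1]
    if second_dev - best_dev <= min_margin:
        return None  # unresolved: the caller would terminate the extension
    return best_base


if __name__ == "__main__":
    support = {"A": [495, 505, 500], "C": [380, 400, 390]}
    print(resolve_small_double_repeat(support, insert_size=500))
```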

According to an embodiment of the invention, if the small double repeat cannot be resolved, the conflict is treated as unresolvable and the extension of the seed sequence is terminated.

Referring to Figure 2, according to an embodiment of the present invention, (c) includes: (iii) establishing merge connection relationships between the sequence contigs based on the reads; (iv) based on the sequence contigs and the [...] sequence contigs, constructing the skeleton sequence to obtain the assembly result for the separated long fragment sequences.

The method of any of the above embodiments of the present invention has at least one of the following four characteristics and advantages:

Linear extension: Unlike graph-based assembly algorithms, linear extension keeps the structures encountered during assembly simple, with logical relationships that are comparatively clear and easy to classify. A graph algorithm gathers the information of all copies of a repeat region in one place, which involves more information and a more complex structure. In other words, the graph approach solves everything at once: as soon as one repeat region is resolved, all the genomic regions associated with it (those with the same or similar repeat sequences) are resolved at the same time. The linear method resolves only the repeat in the current region and does not resolve the other regions simultaneously; the repeat is resolved again when the extension reaches it in another region of the genome. The relevant information is recorded after a repeat has been resolved, which makes the next encounter with that repeat easier to resolve and reduces the extra computational cost. In addition, linear extension makes the extension of each seed sequence independent of the others, which makes the computation easier to parallelize.

Multiple extension: Because the goal of the algorithm is to construct a polyploid genome, the extension of a seed sequence uses the well information to extend the haplotypes along multiple paths. Whenever a heterozygous region is encountered, phasing is performed immediately, taking into account the heterozygous regions assembled so far; that is, phasing is also linear and runs through the various modules of the assembly system (for example contig merging and skeleton sequence construction). To save resources, homozygous regions are still extended along a single path; when a heterozygous region is encountered the extension splits into multiple paths, and once the heterozygous region has been passed the paths are merged back into a single path that continues to extend.

Global information decision: If an algorithm uses only the information relevant to the current decision and ignores the branch choices available globally, as a greedy algorithm does (Dijkstra's single-source shortest-path algorithm is a familiar example of a greedy algorithm), it is prone to errors or may fail to reach the globally optimal solution. In genome assembly, greedy algorithms often cause assembly errors through wrong choices. This algorithm therefore does not use a greedy approach to resolve conflicts during extension, but uses global information to make the relevant decisions. Because LFR fragments are long (~100 Kbp), all the information available at a position (Kmers, reads, mate pairs, and well information) is considered in path selection, and the well information extends the range over which a decision is made to about 100 Kbp. Provided that large repeats are not abundant, the algorithm is essentially global. When the extension encounters a complicated situation, the algorithm strictly chooses to terminate the extension rather than extend along the higher-scoring option, which avoids greedy decisions. It is worth noting in particular that, because all the well information at a position is considered during extension, the algorithm is more global than conventional hierarchical assembly algorithms and can greatly reduce the sequencing depth required; that is, no well needs an especially high sequencing depth, which saves considerable resources in both cost and time.

The above process is implemented as an executable program. The program first reads the seeds and the reads and builds a Kmer index of the reads (Read Kmer Index, RKI), so that the corresponding reads can be accessed quickly through a Kmer. The seeds are then extended as far as possible. Because the extension of each seed is independent of the others, this step can be accelerated by parallel computation. The extended seeds are the sequence contigs. At this point the well information is used to pre-build the skeleton sequence (scaffold): only the ordering relationship between contigs is established, and the scaffold is not constructed immediately. Similarly, the reads and paired-read information are used to establish merge-connection relationships between contigs, but the contigs are not merged immediately. The contig merge relationships and the relationships in the pre-built scaffold are then checked against each other, the relationships are simplified and conflicts resolved, after which the contigs are merged, the scaffold is constructed, and finally the assembly result is output.

In yet another aspect, the present invention provides an apparatus for assembling separated long fragment sequences, for performing the method of any of the above-described embodiments of the present invention. Referring to FIG. 3, the apparatus includes: an input module, which obtains a read set by sequencing and records the sequencing well corresponding to each read in the read set, one sequencing well containing at least one long fragment sequence; an extension module, which uses the reads and the sequencing wells corresponding to the reads to extend a plurality of seed sequences in parallel to obtain a plurality of sequence contigs, the plurality of seed sequences being determined from known sequences; and a skeleton sequence construction module, which constructs a skeleton sequence based on the reads, the sequence contigs, and the sequencing wells corresponding to the reads contained in the sequence contigs, to obtain the assembly result of the separated long fragment sequences.

Those skilled in the art can understand that the processing steps or the methods used in any of the above embodiments of the present invention can be implemented by using corresponding functional modules of the device or submodules included. The above description of the technical features and advantages of the method of the invention applies equally to the apparatus of the invention.

According to an embodiment of the invention, the seed sequence is obtained from a genomic reference sequence according to the following steps: breaking the genomic reference sequence at N; and truncating the broken reference sequence according to a predetermined length, so as to obtain the seed sequence.

According to an embodiment of the invention, the predetermined length is not less than the length of the sequencing library insert in the sequencing.
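
The reference-based seed selection above can be sketched as follows. This is only an illustration: the function name `seeds_from_reference`, the choice to cut each N-free block into consecutive non-overlapping pieces, and the choice to drop a trailing piece shorter than the predetermined length are all assumptions, not details given in the text.

```python
import re


def seeds_from_reference(reference: str, seed_len: int) -> list[str]:
    """Break the reference at runs of N, then cut each block into seeds.

    seed_len plays the role of the "predetermined length"; per the text it
    should be no shorter than the library insert size (not enforced here).
    """
    if seed_len <= 0:
        raise ValueError("seed_len must be positive")
    seeds = []
    # Split at every run of N (upper or lower case); empty blocks are harmless.
    for block in re.split(r"[Nn]+", reference):
        # Assumption: consecutive, non-overlapping windows; short tails dropped.
        for start in range(0, len(block) - seed_len + 1, seed_len):
            seeds.append(block[start:start + seed_len])
    return seeds


example_seeds = seeds_from_reference("ACGTACNNNGGTTAAC", 3)
```

With the toy reference above, the two N-free blocks are `ACGTAC` and `GGTTAAC`, and each yields two seeds of length 3.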

According to an embodiment of the invention, the set of reads comprises paired reads, and the seed sequence is obtained from the reads according to the following steps: (1) cutting the reads into a plurality of Kmers with a sliding window and constructing an index RKI from Kmer to read, for accessing the corresponding reads through a Kmer; (2) extracting from the read set a read pair that contains no high-frequency Kmer; (3) using the index RKI, determining all reads corresponding to the Kmers of each of the two reads of the read pair in (2), to obtain a first read group and a second read group; (4) determining the sequencing wells corresponding to the first read group and the second read group in (3), to obtain a first sequencing well set and a second sequencing well set; (5) determining the intersection of the first sequencing well set and the second sequencing well set in (4), and, if the size of the intersection is not significantly different from the expected number of effective sequencing wells per base, determining the read pair in (2) to be the seed sequence. The expected number of effective sequencing wells per base is determined by the amount of nucleic acid in the sample.

According to an embodiment of the present invention, in (5) of the above embodiment, if the size of the intersection is between one half and two times the expected number of effective sequencing wells per base, the read pair in (2) is determined to be the seed sequence.
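
The acceptance test in step (5) can be illustrated with a minimal sketch. It assumes the two well sets from step (4) are already available as Python sets of well identifiers; the function name `is_seed_pair` and the inclusive bounds are illustrative choices.

```python
def is_seed_pair(first_wells: set[int], second_wells: set[int],
                 expected_wells: float) -> bool:
    """Accept a read pair as a seed when the wells shared by both reads
    number between half and twice the expected effective-well count."""
    shared = first_wells & second_wells
    # Bounds are treated as inclusive here, which is an assumption.
    return 0.5 * expected_wells <= len(shared) <= 2.0 * expected_wells


accepted = is_seed_pair({1, 2, 3, 4, 9}, {2, 3, 4, 7}, expected_wells=4)
rejected = is_seed_pair({1, 2, 3, 4}, {4, 5, 6, 7}, expected_wells=4)
```

In the first call three wells are shared, which lies within 2 to 8, so the pair is accepted; in the second only one well is shared, so it is rejected.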

According to an embodiment of the present invention, the apparatus further includes an index construction module coupled to the extension module, for cutting the reads into a plurality of Kmers with a sliding window and constructing an index RKI from Kmer to read, for accessing the corresponding reads through a Kmer; the extension module is then used to extend the plurality of seed sequences in parallel, based on the reads and their corresponding index RKI, to obtain the plurality of sequence contigs.

According to an embodiment of the present invention, the RKI is obtained by cutting the reads into a plurality of Kmers with a sliding window and constructing a hash keyed by Kmer; the hash constitutes the RKI and records the frequency of each Kmer, the associated reads, and the position and orientation of the Kmer on each read.
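
A minimal sketch of such an index is given below. The text specifies only that the hash is keyed by Kmer and records frequency, reads, position, and orientation; storing each Kmer under the lexicographically smaller of itself and its reverse complement (a "canonical" key) is an assumption made here to give orientation a concrete meaning, and the names `KmerEntry` and `build_rki` are hypothetical.

```python
from dataclasses import dataclass, field

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


@dataclass
class KmerEntry:
    frequency: int = 0
    # Each hit is (read_id, offset of the Kmer on the read, strand "+" or "-").
    hits: list = field(default_factory=list)


def build_rki(reads: dict[str, str], k: int) -> dict[str, KmerEntry]:
    """Slide a window of size k over every read and index reads by Kmer."""
    index: dict[str, KmerEntry] = {}
    for read_id, seq in reads.items():
        for offset in range(len(seq) - k + 1):
            kmer = seq[offset:offset + k]
            rc = reverse_complement(kmer)
            # Assumption: canonical key is the smaller of the Kmer and its RC.
            key, strand = (rc, "-") if rc < kmer else (kmer, "+")
            entry = index.setdefault(key, KmerEntry())
            entry.frequency += 1
            entry.hits.append((read_id, offset, strand))
    return index


rki = build_rki({"r1": "ACGTA", "r2": "CGTAC"}, k=3)
```

Looking up `rki["ACG"].hits` then returns every read containing `ACG` or its reverse complement `CGT`, together with where and in which orientation it occurs, in a single dictionary access.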

According to an embodiment of the present invention, if the repeat sequence is a tandem repeat, the tandem repeat is resolved by: determining the length of the tandem repeat; and, using reads that contain the end point of the tandem repeat, adjusting their positions by aligning them at the end point.

According to an embodiment of the present invention, if the repeat sequence is not a tandem repeat and either of the following conditions holds, the repeat sequence is determined to be a large double repeat: the length of the repeat sequence is greater than L; or, in a read pair, the mate of a read located downstream of the repeat sequence is also located on the repeat sequence.

According to an embodiment of the present invention, the large double repeat is resolved by determining the difference between the number of effective sequencing wells corresponding to the upstream repeat and the number of effective sequencing wells corresponding to the downstream repeat in the large double repeat, comparing that difference with the expected number of effective sequencing wells per base, and resolving the conflict on the large double repeat based on the magnitude of the difference.

According to an embodiment of the invention, if the repeat sequence is a small double repeat, the small double repeat is resolved by at least one of the following: (p) taking the mean distance between the paired reads supporting each base as the expected position of the conflicting site, and determining the base at the conflicting site by comparing how close the distance between the paired reads supporting each of the two conflicting bases is to that mean; (k) constructing a helper contig starting from the mate of the most upstream read that lies on non-repetitive sequence within one standard deviation of the end of the extended sequence, using the paired reads to locate the mates of the reads positioned downstream on the helper contig, updating the mean distance with the updated read data, and again determining the base at the conflicting site by comparing how close the distance between the paired reads supporting each of the two conflicting bases is to the mean.

According to an embodiment of the invention, if the small double repeat sequence cannot be resolved, the extension of the seed sequence is terminated.

Referring to FIG. 4, according to an embodiment of the present invention, the skeleton sequence construction module includes: a merge-connection-relationship establishing module, for establishing merge-connection relationships between the sequence contigs based on the reads; a primary skeleton sequence construction module, for constructing a primary skeleton sequence based on the sequence contigs and the sequencing wells corresponding to the reads contained in the sequence contigs; and an assembly module, for verifying the merge-connection relationships of the plurality of sequence contigs and the primary skeleton sequence against each other and then merging the sequence contigs to construct the skeleton sequence, thereby obtaining the assembly result of the separated long fragment sequences.

In this specification, a description referring to the terms "one embodiment", "some embodiments", "example", "specific example", or "some examples" means that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the invention. In this specification, such terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in a suitable manner in any one or more embodiments or examples.

For ease of understanding, a step-by-step detailed analysis of the assembly method in accordance with a preferred embodiment of the present invention is performed below.

The program first reads the seed data and the read data for use in subsequent steps. For efficient computation, the algorithm keeps the read data that takes part in alignment stored in binary form until the whole program terminates. The quality values provided by sequencing (such as the quality values in the Fastq format) are not recorded or used during assembly. This information is intended for the data pre-processing steps before assembly, which remove or correct bases and reads with abnormal quality values. In other words, the sequences read in are assumed to be of normal sequencing quality, and differences between bases of different quality values are ignored during assembly.

After the reads have been read in and the wells they belong to recorded, the reads are cut into Kmers with a sliding window and the Reads Kmer Index (RKI) is built. This is a hash keyed by Kmer that records the frequency of each Kmer, the set of reads containing it, and the position, orientation, and so on of the Kmer on those reads. A hash offers extremely fast lookup, with time complexity of only O(1). The core function of the index is therefore to access the corresponding set of reads quickly through a Kmer, in order to determine which reads should be positioned on the seed for fine alignment.

The RKI structure must record a large amount of information. Moreover, because one read contains multiple Kmers, the read appears under multiple Kmer entries, which increases memory consumption further. For a genome sequenced at 100X, memory consumption is about 3 TB, more than the under-1 TB consumed by traditional assembly techniques. The data structure can still be optimized further, which would reduce assembly cost and resource bottlenecks. One option is to disable or limit the recording of ultra-high-frequency Kmers to reduce memory overhead: these Kmers contribute little to assembly, and positioning reads through them wastes a great deal of time and is extremely inefficient.

The data structure of the RKI clearly differs from the design of typical DBG (de Bruijn graph) algorithms. Those algorithms build a graph of the relationships between Kmers while reading the reads from file; they do not need to keep the reads in memory, and the relationship between a Kmer and its reads does not need to be recorded in detail, because the representation of the genome has been transformed from a jumble of reads into a set of logically connected Kmers and the data has been extensively organized and compressed. Once the graph is built, only the Kmer graph needs to be operated on to construct the genome; the reads need not be manipulated directly. This algorithm, by contrast, does not construct a global graph. During seed extension (genome construction) it must continually access the reads through Kmers to obtain the material for extension (assembly).

Finally, experiment-related parameters must also be read in, such as the insert size of the paired reads in library construction, the number of wells used to separate the LFRs, the number of input cells, and the ploidy of the target genome (obtained by optical means or by informatics analysis). The insert size and the number of input cells can, however, be estimated statistically when the seed sequences are initialized, and the ploidy of the target genome can also be calculated through heterozygosity identification. Each of these ways of obtaining the information carries some error, so in actual assembly they should be analyzed and applied in combination.

After a seed is selected, it is initialized and checked for validity before extension begins. Initialization gathers the relevant information on the seed, which is used for the subsequent validity check and as material for seed extension. The validity check discards seeds that lie in an already extended region and seeds that are unsuitable for extension.

First, the seed is cut into Kmers with a sliding window and the reads corresponding to these Kmers are obtained through the RKI. These reads are then finely aligned to the seed, and the well information corresponding to them is used to determine how each well covers the seed.

A well that completely covers the entire seed is defined as an "Effective Well" (EW); other wells that cover the seed are recorded as candidates and updated in subsequent extensions. In the method of assembling separated long fragment sequences, each locus on the genome is covered by multiple LFRs, and the number covering it depends on the number of cells and the ploidy. As the extension progresses, the set of LFRs covering the extended region changes gradually, with new LFRs continually replacing old ones, like a relay race. For example, if 10 diploid cells are used to construct the LFRs, the expected number of wells at each site is 10 cells * diploid * DNA double strand = 40, and if the LFR length is 100 Kbp, the average interval between adjacent LFRs is 100 Kbp / 40 = 2.5 Kbp; that is, the expected number of wells within a 2.5 Kbp range is 40. If the range considered in the calculation is expanded, the number of wells increases by one for each additional 2.5 Kbp.
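
The arithmetic of this example can be reproduced with a short sketch. The function name `expected_well_coverage` and its parameter names are illustrative only; the numbers are those of the example above.

```python
def expected_well_coverage(cells: int, ploidy: int, lfr_length_bp: int,
                           strands: int = 2) -> tuple[int, float]:
    """Return (expected wells per site, mean spacing between adjacent LFRs)."""
    expected_wells = cells * ploidy * strands
    mean_spacing_bp = lfr_length_bp / expected_wells
    return expected_wells, mean_spacing_bp


# 10 diploid cells, double-stranded DNA, 100 Kbp LFRs.
wells_per_site, spacing_bp = expected_well_coverage(10, 2, 100_000)
```

For these inputs the sketch gives 40 wells per site and a mean spacing of 2,500 bp, matching the figures in the text.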

During seed extension, the EW set is determined by how the wells cover the extended (or initialized) block. When an LFR first overlaps the extending seed, its coverage of the seed rises as the seed extends until the seed is completely covered; the well corresponding to an LFR that completely covers the seed can be used as an EW to assist extension. As the seed continues to extend forward, the LFR gradually leaves the region into which the seed is extending, and its coverage decreases until it reaches 0. When the LFR's coverage of the seed is clearly insufficient, the corresponding well is no longer used as an EW to assist extension. This produces the gradual turnover of the EW set during extension.
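
The turnover of the EW set can be illustrated with a simplified sketch. It assumes, purely for illustration, that each well holds a single LFR whose genomic interval is known; in the method itself coverage is inferred from the reads aligned to the block, and a well may hold more than one LFR. The names `coverage_fraction` and `effective_wells` and the threshold parameter are hypothetical.

```python
def coverage_fraction(lfr: tuple[int, int], block: tuple[int, int]) -> float:
    """Fraction of the block [start, end) covered by the LFR [start, end)."""
    lfr_start, lfr_end = lfr
    block_start, block_end = block
    overlap = min(lfr_end, block_end) - max(lfr_start, block_start)
    return max(0, overlap) / (block_end - block_start)


def effective_wells(lfr_by_well: dict[int, tuple[int, int]],
                    block: tuple[int, int],
                    min_fraction: float = 1.0) -> set[int]:
    """Wells whose LFR covers at least min_fraction of the current block."""
    return {well for well, lfr in lfr_by_well.items()
            if coverage_fraction(lfr, block) >= min_fraction}


lfrs = {1: (0, 100), 2: (50, 150), 3: (120, 200)}
ew_early = effective_wells(lfrs, (60, 100))
ew_middle = effective_wells(lfrs, (90, 130))
ew_late = effective_wells(lfrs, (125, 150))
```

As the block moves downstream, well 1 drops out of the EW set and well 3 joins it, while well 2 stays effective throughout, which is the relay behavior described above.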

The EW set at the time of extension guides read filtering during assembly. It separates the reads belonging to the neighborhood of the current seed from the vast number of reads across the whole genome, so that some regions that are complex at the whole-genome scale degenerate into simple regions at the scale of an LFR length and become easy to assemble. This is the core of the separated long fragment approach.

Considering a set of multiple wells simultaneously during extension differs from the conventional two-stage assembly method, in which wells are considered singly (that is, each well is assembled separately and the results are then combined in a secondary assembly). The difference lies in the use of EWs. This approach allows the depth information needed for assembly to expand from that of a single well to the cumulative depth of the whole EW set, which reduces the sequencing depth required of each well to low coverage. Taking 10 diploid cells (40 copies) as an example, the earlier two-stage assembly mode requires a sequencing depth of 40 * 50X = 2,000X, where 50X is the sequencing depth required for conventional genome assembly. This algorithm requires a sequencing depth of only 40 * 2.5X = 100X, where 2.5X is the depth required for an EW to be identified even when part of its region is not sequenced, and the cumulative depth of the whole EW set during extension is 100X; that is, 2.5X is used to identify each EW and the accumulated 100X is used for extension. This approach greatly reduces sequencing cost and saves time.
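
The depth comparison in this example is simple arithmetic, sketched below with illustrative names; the per-well depths of 50X and 2.5X are the values stated in the text.

```python
def required_total_depth(copies: int, depth_per_well: float) -> float:
    """Total sequencing depth when every copy (well) needs depth_per_well."""
    return copies * depth_per_well


copies = 10 * 2 * 2                                      # 10 diploid cells, two strands
two_stage_depth = required_total_depth(copies, 50.0)     # per-well assembly
ew_based_depth = required_total_depth(copies, 2.5)       # EW identification only
saving_factor = two_stage_depth / ew_based_depth
```

Under these figures the EW-based approach needs one twentieth of the sequencing depth of the two-stage approach.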

In addition, the transitional changes of EW can also be used to determine the positional relationship between contigs and to construct extremely long skeleton sequences.

According to an embodiment of the present invention, both initialization and extension require reads to be aligned to the seed to supply the material from which information (such as well and read-pairing information) and the extension sequence are obtained. Attempting to align all reads is impractical and would consume a great deal of time, so the RKI is used to screen the reads, and only reads that hit a Kmer on the seed are finely aligned, as follows:

1. The seed sequence is cut into Kmers with a sliding window, and the reads corresponding to these Kmers are obtained through the RKI. The whole seed is cut at initialization; during extension only the sequence at the end of the contig is cut. For efficiency, extremely high-frequency Kmers (Kmers that correspond to a very large number of reads) are not used to obtain reads. In particular, if during extension the Kmers at the end are found to be extremely high-frequency, only the last Kmer is used to position reads.

2. The reads are positioned on the seed using the recorded positions of the Kmer on the seed and on the reads, and are then compared base by base. Substitution errors from sequencing are tolerated to some extent, but there is no gap-opening module, so indel-type errors cannot be tolerated.

3. The purpose of read filtering is to remove reads that are mislocated because of repeats. The algorithm first uses the EW information to filter out reads mislocated because of genome-wide repeats, and then uses the mate-pair information to filter out right-arm reads mislocated because of simple repeats inside the LFR (left-arm reads are not filtered). Note that left-arm reads in a repeat region do not take part in the validity check of their corresponding right-arm reads, because they too may be mislocated; complex types of repeat are handled by the repeat resolution module.

4. To obtain the extension sequence, a base-by-base consensus is built from the reads positioned at the end of the contig. As with read alignment, only substitution sequencing errors are tolerated when building the consensus; indel-type errors are not. Because mate-pair filtering can be applied only to right-arm reads, the filtered right-arm reads are free of positioning errors and have a clear advantage over left-arm reads. On this basis, when enough filtered right-arm reads are available, the module builds the consensus from these reads alone. If they merge into a single sequence, the extension is complete and the module for updating the related information is entered. If they cannot form a single sequence, that is, a conflict occurs during consensus building, the left-arm reads of the region must also be used to identify and resolve the heterozygosity or repeat. When the number of right-arm reads is insufficient, the left-arm reads must likewise be included in the consensus.

5. During the base-by-base consensus merge, if more than one base at the same position is found at non-low frequency (low-frequency bases are mainly caused by sequencing errors), a conflict arises and the consensus at that site is judged to have failed. This is mainly caused by heterozygosity and repeats. Compared with repeats, heterozygosity in a diploid genome is relatively easy to recognize: there are only two conflicting bases, the EW set is split roughly in half between them, and the EW sets supporting each base have almost no intersection. Heterozygosity identification is therefore performed first after a consensus failure.
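
The heterozygosity test in step 5 can be sketched as a check on the EW sets supporting each conflicting base. The text gives the criteria only qualitatively ("roughly in half", "almost no intersection"), so the thresholds `min_share` and `max_shared_fraction` below are hypothetical, as is the function name.

```python
def looks_heterozygous(support: dict[str, set[int]],
                       min_share: float = 0.3,
                       max_shared_fraction: float = 0.1) -> bool:
    """support maps each non-low-frequency base to the EWs whose reads carry it."""
    if len(support) != 2:
        return False  # diploid heterozygosity shows exactly two conflicting bases
    wells_a, wells_b = support.values()
    union = wells_a | wells_b
    if not union:
        return False
    shared = wells_a & wells_b
    # Each base should hold a substantial share of the EWs (hypothetical cutoff).
    balanced = min(len(wells_a), len(wells_b)) >= min_share * len(union)
    # The two supporting EW sets should barely intersect (hypothetical cutoff).
    nearly_disjoint = len(shared) <= max_shared_fraction * len(union)
    return balanced and nearly_disjoint


het_site = looks_heterozygous({"A": {1, 2, 3, 4}, "G": {5, 6, 7, 8}})
error_like = looks_heterozygous({"A": {1, 2, 3, 4, 5, 6, 7, 8}, "G": {3}})
repeat_like = looks_heterozygous({"A": {1, 2, 3, 4}, "G": {3, 4, 5, 6}})
```

A site that fails this check would be passed on to repeat identification, since, as described in the following paragraph, sequencing errors do not separate by well.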

Sequencing errors do not fit this pattern. Their main features are low frequency and randomness. With most sequencing technologies, sequencing errors account for only a very small proportion of bases, so they appear as only a small number of differences in the consensus. Even when sequencing depth fluctuates greatly, the differences caused by sequencing errors remain a very small fraction: the absolute number of sequencing errors changes with sequencing depth, but the ratio does not, that is, the sequencing error rate remains unchanged. Randomness means that sequencing errors are not biased toward particular wells; errors occur with the same probability in every well. Even if the sequencing errors themselves have some bias in error pattern, the conditions in each well are the same. Consequently, wrongly sequenced bases are not concentrated in particular wells, and when a consensus conflict arises, the EW sets supporting the different bases intersect and cannot be separated. Note that when the overall sequencing depth is low, these characteristics of sequencing errors become inconspicuous and misjudgment is very likely, so the sequencing depth cannot be too low.

Once heterozygosity is confirmed, the contig is split in two for two-way extension and is phased using the EW sets of the previous heterozygous region. In principle, two heterozygous regions on the same haplotype should have similar EW sets; conversely, the EW sets of two heterozygous regions on different haplotypes should not intersect. In this way adjacent heterozygous sites can be phased. In particular, if the previous heterozygous region is too far away (more than the LFR length), the EW sets of even two heterozygous regions on the same haplotype will not intersect; phasing is then judged to have failed and a new phased block is created, while the contig itself remains continuous and is not interrupted.
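
The adjacent-site phasing rule can be sketched as a comparison of EW set overlaps. This is a simplified illustration: the function name, the use of raw intersection sizes as the similarity measure, and the return values are assumptions.

```python
from typing import Optional


def phase_against_previous(prev: tuple[set[int], set[int]],
                           curr: tuple[set[int], set[int]]) -> Optional[str]:
    """Link the two alleles of the current heterozygous region to the previous one.

    Each tuple holds the EW sets of allele 0 and allele 1. Returns "same" if
    curr[0] lies on the haplotype of prev[0], "swapped" if it lies on that of
    prev[1], and None when phasing fails and a new phased block must start.
    """
    straight = len(prev[0] & curr[0]) + len(prev[1] & curr[1])
    crossed = len(prev[0] & curr[1]) + len(prev[1] & curr[0])
    if straight == crossed:
        return None  # no usable overlap, e.g. regions farther apart than an LFR
    return "same" if straight > crossed else "swapped"


previous = ({1, 2, 3}, {4, 5, 6})
linked = phase_against_previous(previous, ({2, 3, 7}, {5, 6, 8}))
flipped = phase_against_previous(previous, ({5, 6}, {2, 3}))
new_block = phase_against_previous(previous, ({10}, {11}))
```

In the last call the EW sets share nothing with the previous region, so phasing fails and a new phased block begins, while the contig itself would continue unbroken.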

See Figure 5 for the iterative extension of the seed sequence. Unlike other phasing algorithms, the above method considers only the EW relationship between two adjacent sites and does not consider multiple sites jointly; the phasing is linear and its logic simple, so the phasing algorithm still has room for improvement. The phasing method in this algorithm nevertheless has certain advantages: unlike methods that consider only single-site heterozygosity, the algorithm phases all types of heterozygosity, which increases the length of the phased blocks.

Like other assembly-based methods for identifying structural variation, the algorithm can also identify large heterozygous structural variants, which is an advantage of this algorithm over other haplotyping workflows based on separated long fragments. Further, because this algorithm phases insertion-deletion heterozygosity and large heterozygous structural variants in the same way, it can construct more complete and longer haplotypes than phasing methods that consider only single-substitution heterozygous sites. It also provides capabilities not available in resequencing methods, especially the phasing of large structural variations, which is significant for downstream analysis.

Repetitive sequences are identical or similar sequences that occur at different positions in the genome. They are abundant: in the human genome, for example, the various types of repeat together account for nearly half of the genome size. Repeats have always been a major factor affecting assembly quality, and whether they can be resolved correctly is a central concern of every assembly algorithm, for which new strategies are constantly being tried. Repeat resolution is naturally a key module in this algorithm. It deals mainly with complex repeats within an LFR, chiefly adjacent small double repeats, large double repeats, and tandem repeats. The resolution of each is described below.

Different types of repeat are resolved by different modules, and identifying the repeat type requires supporting information. Repeat region recognition is one important piece of that information. Its main purpose is to identify left-arm reads that cannot be used for filtering, that is, mate-pair reads lying in a repeat region, because one characteristic of a repeat region is that reads are positioned there incorrectly. Using incorrectly positioned reads for extension often leads to erroneous extension.

As shown in FIG. 6, repeat region recognition consists mainly of identifying the start point and the end point. If, during extension, the left arms corresponding to the right-arm reads in the extension region are found not to lie near the position one insert length upstream, the start point of a repeat region can be determined.

The end point of a repeat region falls into two categories. For simple repeats (where the mate-pair relationship alone suffices to filter mislocated reads), the end point is the position at which the right-arm reads of those left arms no longer appear; for complex repeats, the conflict point encountered during extension is the end point.

In addition, a region of difference within a repeat is treated as a non-repeat region; that is, similar repeats are strictly divided into several sub-repeats for processing. In this case, recognizing the nature of the region after the point of difference works differently: the left arms corresponding to the right-arm reads at the start of the repeat region lie in the previous repeat region rather than in a non-repeat region, so the paired reads may themselves be mislocated. The left-arm reads located about one insert length before the extension point are therefore checked. If their corresponding right arms participate in the extension, the region currently being extended is a repeat region; if some of them are found not to participate, the extension has entered a non-repeat region.

Because mate-pair filtering can be applied only to right-arm reads, right-arm and left-arm reads do not have equal standing during extension. However, if there is a region in which right-arm reads can be used to check left-arm reads, the left-arm reads can help resolve the repeat. This is the concept of the Helper Contig (HC).

An HC is a contig used to resolve complex repeats; it serves only an auxiliary purpose and does not appear as a formal contig in the assembly results. Its essence is to make further use of the mate-pair information of the reads: the HC is expected to cross the current repeat region and land on the downstream non-repeat region, and this non-repeat region is then used to help resolve the repeat. If the HC fails to cross the repeat region, it generally cannot help. At present the HC is applied mainly in the following two cases:

(1) HC case1-adjacent small double repeat:

Because the insert length of mate-pair reads has a certain standard deviation (SD), if there are two adjacent small repeats only slightly longer than (read length - Kmer length + 1), this SD causes the mate-pair information to become confused during extension, and a conflict arises in the consensus. Both conflicting bases are supported by EWs and by mate-pair reads.

To address this, the algorithm first takes the mean computed from the mate-pair information supporting each base as the expected position, and the base whose distance is closer is taken as the base to extend. If the two positions are too close together and the actual situation is ambiguous, an HC must be constructed to assist identification.

First, among the left arms within one SD length of the contig end that are not in the repeat region, the right arm corresponding to the leftmost (most upstream) left arm is used as the starting point for constructing the HC, and this read is extended in a manner similar to a seed. Once the HC has been extended to a sufficient length, the position of each conflicting base can be calculated from the left arms corresponding to the right-arm reads positioned on the HC. In essence, the HC in this case only improves the reliability of the distance calculation; if the SD is large, errors may still occur.
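
The distance-based decision for adjacent small double repeats can be sketched as follows. The sketch assumes that, for each conflicting base, the mate-pair distances of its supporting read pairs have already been measured; the function name, the `min_gap` ambiguity threshold, and the use of the expected insert size as the reference distance are illustrative assumptions.

```python
from statistics import mean
from typing import Optional


def pick_base_by_insert(distances: dict[str, list[float]],
                        expected_insert: float,
                        min_gap: float) -> Optional[str]:
    """Pick the conflicting base whose mean mate-pair distance is closest to
    the expected insert size, or None when the call is too close to trust."""
    deviation = {base: abs(mean(values) - expected_insert)
                 for base, values in distances.items()}
    ranked = sorted(deviation, key=deviation.get)
    if len(ranked) > 1 and deviation[ranked[1]] - deviation[ranked[0]] < min_gap:
        return None  # ambiguous: a helper contig would be needed
    return ranked[0]


clear_call = pick_base_by_insert(
    {"A": [500, 510, 490], "C": [560, 570, 550]}, expected_insert=500, min_gap=20)
ambiguous_call = pick_base_by_insert(
    {"A": [495, 505], "C": [500, 510]}, expected_insert=500, min_gap=20)
```

When the function returns `None`, the same comparison could be repeated with distances recomputed from reads positioned on the helper contig, as described above.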

(2) HC case2-large double repeat:

When the repeat region is longer than the insert, the left-arm reads cannot be used to filter reads. In essence, a left arm lying in a repeat region cannot be used to filter its right arm in a repeat region, regardless of whether the region between the two arms is repetitive or not.

Large repeats, being long, tend to contain differences. Well information can therefore be used to resolve them according to the circumstances, and in some cases an HC is also needed. There are two main types:

The two repeats are far apart. In this case, although both conflicting bases have EW support, the upstream base is supported by more EWs than the downstream one, and this comparison can be used to resolve the conflict. Conversely, among wells not yet identified as EWs, more support the downstream base than the upstream one, which can also help resolve the conflict, as shown in the figure. If this information is still ambiguous, an HC can be constructed from the left arms corresponding to the mislocated right arms within one insert length of the start of the repeat region; the EW set on the HC is expected to differ less from the EW set of the upstream conflicting base than from that of the downstream conflicting base.

The two repeats are close together. In this case the well information does not change noticeably and cannot be used to resolve the conflict, but in some simple cases an HC can resolve it; see Figures 8a and 8b:

a) When the distance between the conflict site and the end of the repeat region is less than the insert length, the right arms corresponding to the left-arm reads supporting the upstream base can be found positioned on the HC (the HC is constructed in the same way as for the previous type of large repeat, using the left arms corresponding to the mislocated right arms within one insert length of the start of the repeat region), as shown in Figure 8a;

b) When the distance between the conflict site and the start of the downstream repeat is less than the insert length, the right arms corresponding to the left-arm reads that support the upstream base can be found in the repeat region, as shown in Figure 8b.

A repeat consisting of multiple consecutive copies of a repeating unit is defined as a Tandem Repeat (TR), and the shortest repeating unit in the tandem repeat region is the Tandem Repeat Unit (TRU). A tandem repeat unit has several different phases (for example ACT, CTA, TAC), and the number of phases equals its length.
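The phases of a tandem repeat unit are its cyclic rotations. A minimal Python sketch (the function name `tru_phases` is hypothetical and not part of the described algorithm) enumerates them:

```python
def tru_phases(unit):
    """Return every cyclic rotation (phase) of a tandem repeat unit."""
    return [unit[i:] + unit[:i] for i in range(len(unit))]


print(tru_phases("ACT"))  # ['ACT', 'CTA', 'TAC']
```

Because the TRU is by definition the shortest repeating unit, its rotations are all distinct, which is why the number of phases equals its length.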

As with conflicts caused by conventional repeats, the sequence following the end of the tandem repeat region, and any differing sequence inside the repeat region, conflict with the sequence within the repeat region. Unlike a conventional repeat, however, a tandem repeat region is equivalent to a plurality of conventional repeat regions linked together. The solution for this kind of repeat therefore shares some steps with the methods for conventional repeats and differs in others.

As shown in Figure 9, a tandem repeat shifts the positioning of reads, which often produces conflicts during consensus, caused either by the repeated positioning itself or by differences within the repeat. The essence of the problem is that the Kmer used to locate the reads is itself a repeated Kmer. In this case, the head of a mislocated read lies after the start of the tandem repeat region (because the non-repetitive region before the start point causes a mislocated read to fail the fine alignment), and the offset caused by the mislocation is smaller than the SD of the insert length, so the mate-pair cannot filter out these mislocated reads. Further, if the tandem repeat is longer than the SD of the insert length, read positioning within the repeat region piles up erroneously toward the upstream side with a period of one SD: the positional deviation of the mislocated reads does not exceed the SD of the insert length, and the reads concentrate at the head of each SD unit. This means that, before the conflict is encountered, each SD-length unit in the tandem repeat region is compressed and shortened, and the reads of each following SD unit are shifted progressively forward.

Solving a TR first requires identifying it correctly. In this algorithm, a TR is confirmed by finding its tandem repeat unit. Because a TR longer than the insert length also satisfies the activation condition of HC case 2, the algorithm places the TR activation decision before the activation decision of HC case 2.

The TRU is identified mainly by discovering that a Kmer occurs periodically. For a TRU shorter than (read length - Kmer length + 1), that is, a TRU that appears two or more times on a read, the reads at the conflict site can be slid into Kmers, and some Kmer is found to occur more than once within a read. If this holds for most of the reads, and the repeated Kmers of these reads are identical or consistent with one another in TRU phase, the current conflict can be judged to be caused by a tandem repeat, and the tandem-repeat resolution module is activated. For a TRU longer than (read length - Kmer length + 1), the Kmers in the range from the start of the repeat region to the conflict site are scanned; if a Kmer appears with a fixed period, the conflict can be judged to be caused by a tandem repeat.
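The short-TRU case can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the function names are hypothetical, the `min_fraction` threshold stands in for "most reads" (the text gives no number), and the check that the repeated Kmers agree in TRU phase is omitted.

```python
from collections import Counter


def kmer_period(read, k):
    """Smallest distance between two occurrences of the same Kmer in a read, or None."""
    last_seen = {}
    best = None
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        if kmer in last_seen:
            gap = i - last_seen[kmer]
            if best is None or gap < best:
                best = gap
        last_seen[kmer] = i
    return best


def tandem_repeat_period(reads, k, min_fraction=0.5):
    """Return the candidate TRU length if most reads share a Kmer period, else None."""
    periods = [p for p in (kmer_period(r, k) for r in reads) if p is not None]
    if not periods:
        return None
    period, votes = Counter(periods).most_common(1)[0]
    # min_fraction is an assumed threshold, not a value taken from the source.
    return period if votes > min_fraction * len(reads) else None


reads = ["ACTACTACTACT", "CTACTACTACTA", "TACTACTACTAC", "ACGTTGCAAGCT"]
print(tandem_repeat_period(reads, k=4))  # 3
```

Three of the four example reads repeat a Kmer every 3 bases, so the sketch reports a candidate TRU length of 3.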

Once the TRU has been identified, the reads at the conflict site can be divided into four categories:

1) reads inside the TR region, which are completely covered by the TRU;

2) reads containing the TR start point, in which the TRU is found only at the tail end;

3) reads containing the TR end point, in which the TRU is found only at the head end;

4) reads containing both the TR start and end points; this case exists only when the TR is shorter than the read length, and no type 1 reads (reads completely covered by the TRU) are found in this case.

For a TR shorter than the read length, the conflict is eliminated by adjusting positions so that the reads containing the TR end point are aligned at that end point. A TR longer than the read length can only be handled by crossing it and filling in "N". For a TR longer than the read length but shorter than the insert length, there are right-arm reads that contain the TR end point, are supported by left arms in the non-repetitive region, and agree with one another after the end point; the distance can be estimated from the insert length of these reads, and the estimated number of "N" is filled in. For a TR longer than the insert length, the right-arm reads containing the TR end point have no left-arm support in the non-repetitive region; in this case a consensus is built from these reads directly, and a single "N" is filled in as a marker. It should be noted that small differences within the TR region (such as individual base substitutions, insertions, and deletions) can be confused with the TR end point, so that conflicts arise when a consensus is built from the reads identified as containing the TR end point. In that case, the conflicting reads can be assembled in a DBG manner, which naturally separates the difference sequences within the TR region from the TR end sequence. The right-arm reads positioned on the assembled sequences, together with the positions of their corresponding left-arm reads on the previously extended non-repetitive region, are then used to calculate the position of each assembled sequence; the sequence farthest from the extension point is the correct TR end point.
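For the case of a TR longer than the read length but shorter than the insert length, the "N" count could be estimated roughly as in the following Python sketch. The coordinate conventions are assumptions made for illustration only: `insert_len` is taken as the distance from the start of the left arm to the start of the right arm, and all names are hypothetical.

```python
from statistics import median


def estimate_n_fill(pairs, insert_len, extension_point):
    """Estimate how many "N" to place between the extension point and the TR end.

    pairs: (left_start, end_offset) tuples, where left_start is the contig
        coordinate of a left arm in the non-repetitive region and end_offset
        is the offset of the TR end point inside the paired right arm.
    insert_len: assumed start-to-start distance between the two arms.
    extension_point: contig coordinate where extension stopped.
    """
    estimates = [left_start + insert_len + end_offset - extension_point
                 for left_start, end_offset in pairs]
    # At least one "N" is kept as a marker, mirroring the long-TR case.
    return max(1, round(median(estimates)))


pairs = [(100, 20), (110, 12), (95, 24)]
gap = estimate_n_fill(pairs, insert_len=500, extension_point=450)
print(gap)  # 170
```

The median is used here only to damp the spread of insert lengths; the source does not say which estimator is applied.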

According to an embodiment of the present invention, because the seeds are extended in parallel, reads that have participated in an extension are marked as "used" to prevent the same region from being extended repeatedly by several seeds. When another seed encounters such reads during initialization or extension, it stops extending, and the contigs are later connected by the contig merge module. It should be noted that mate-pair positioning is ambiguous in repeat regions. To prevent extension from being stopped erroneously, reads of a repetitive nature are marked as "repetition" and are not used in determining redundant extension.
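The "used" bookkeeping can be sketched as follows. This is a sequential Python illustration with hypothetical names; a truly parallel implementation would need synchronized access to the shared `used` table, which is not shown.

```python
def extend_seed(seed_id, candidate_reads, used, repetitive):
    """Consume reads in order; stop at the first read already used by another seed.

    used: dict mapping read id -> id of the seed that consumed it.
    repetitive: set of read ids marked "repetition"; these never stop extension
        and are never recorded as "used".
    Returns (reads taken, (blocking read, owning seed) or None).
    """
    taken = []
    for read_id in candidate_reads:
        owner = used.get(read_id)
        if owner is not None and owner != seed_id and read_id not in repetitive:
            return taken, (read_id, owner)  # hand over to the contig merge step
        if read_id not in repetitive:
            used[read_id] = seed_id
        taken.append(read_id)
    return taken, None


used, repetitive = {}, {"r3"}
first = extend_seed("s1", ["r1", "r2", "r3"], used, repetitive)
second = extend_seed("s2", ["r5", "r3", "r2", "r6"], used, repetitive)
print(first)
print(second)
```

In the example, seed `s2` passes over the repetitive read `r3` but stops at `r2`, which `s1` has already used.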

According to an embodiment of the present invention, a read from a well that is positioned on a contig may have been positioned erroneously because of a repeat or a sequencing error. The algorithm therefore requires a well to have relatively sufficient coverage of the extended seed region before it is regarded as an EW; wells with insufficient coverage are discarded. For efficiency, the EW set is updated each time the extension reaches a certain length, and only the coverage within that length range is examined.
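A coverage-based EW filter might look like the Python sketch below. The `min_coverage` threshold and the half-open interval convention are assumptions; the source only requires "relatively sufficient" coverage.

```python
from collections import defaultdict


def effective_wells(read_hits, window_start, window_end, min_coverage=0.5):
    """Return wells whose reads cover at least min_coverage of the window.

    read_hits: iterable of (well_id, start, end) for reads positioned on the
        contig, with half-open [start, end) coordinates.
    """
    covered = defaultdict(set)
    for well_id, start, end in read_hits:
        lo, hi = max(start, window_start), min(end, window_end)
        covered[well_id].update(range(lo, hi))  # empty when the read misses the window
    window = window_end - window_start
    return {w for w, positions in covered.items()
            if len(positions) / window >= min_coverage}


hits = [("w1", 0, 60), ("w1", 50, 100), ("w2", 10, 20), ("w3", 90, 150)]
print(effective_wells(hits, 0, 100))  # {'w1'}
```

Only the most recently extended window is examined, which matches the efficiency consideration described above.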

Contigs are merged after all contigs have been extended, and contig-phased operations are also performed at this stage. Contigs that can be merged fall into two cases:

A contig whose extension was stopped because it was found to contain "used" reads. In this case, the merge point of the two contigs is calculated directly from the positions of these "used" reads on the two contigs.

A contig whose extension was stopped because of a repeat, local sequencing depth, or a similar cause. In this case, the mate-pair reads on the non-repetitive region at the end of the contig (the left-arm reads for downstream extension, the right-arm reads for upstream extension) are examined to see whether their mates are positioned on other contigs. If such a contig is found, the distance between the two contigs is estimated from the insert length and filled with the estimated number of "N".
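The two merge cases can be sketched in Python as follows. The sketch assumes both contigs are already in the same orientation and that `insert_len` is the start-to-start distance between the two arms; all function names are hypothetical.

```python
from statistics import median


def merge_by_shared_read(contig_a, pos_a, contig_b, pos_b):
    """Case 1: the same "used" read starts at pos_a on A and at pos_b on B."""
    return contig_a[:pos_a] + contig_b[pos_b:]


def merge_with_gap(contig_a, contig_b, pairs, insert_len):
    """Case 2: join A and B with an estimated run of "N".

    pairs: (left_pos_on_a, right_pos_on_b) start coordinates of mate-pair arms.
    insert_len: assumed start-to-start distance between the two arms.
    """
    gaps = [insert_len - (len(contig_a) - a) - b for a, b in pairs]
    n_count = max(1, round(median(gaps)))
    return contig_a + "N" * n_count + contig_b


a, b = "AAACCCGGG", "CCGGGTTT"
print(merge_by_shared_read(a, 4, b, 0))         # AAACCCGGGTTT
print(merge_with_gap(a, "TTTT", [(3, 2)], 12))  # AAACCCGGGNNNNTTTT
```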

Because the basic unit of skeleton sequence construction in this algorithm is also the extended but not yet merged contig, the merging step in practice first establishes the relationships between contigs (overlapping or separated by a gap) and then cross-checks them against the relationships between contigs established in the skeleton sequence step. The relationships are simplified, conflicts are resolved, and only then is the contig merge operation actually performed.

As extension proceeds, the wells/LFRs are found to change transitionally along the sequence. This feature can be used to determine the positional relationship between contigs, and the algorithm finally performs a scaffolding operation on all contigs. In fact, this idea was already reflected in the solution for large repeats described above.

In this algorithm, a scaffold is a sequence of contigs whose order is determined mainly from well information, which differs from the traditional definition of a skeleton sequence constructed from mate-pair information. A scaffold in this algorithm expresses only the order of the contigs; the specific distance between contigs cannot be determined, and adjacent contigs are connected by a single "N". A scaffold in the traditional sense not only determines the order of contigs through the mate-pairs of the reads but also calculates the distance between contigs from the positions of the paired reads on the two contigs and their insert length, which is consistent with the method used in contig merging. Whichever scaffolding method is used, ordering the contigs is an Optimal Linear Arrangement (OLA) problem, which is NP-hard. It should also be noted that each contig in a scaffold has its own orientation. (The four deoxynucleotides A, C, T, and G are joined by 3',5'-phosphodiester bonds to form a polydeoxynucleotide, which is called DNA, and the DNA base sequence is how contigs and scaffolds are represented in assembly. The linkage between deoxynucleotides is strictly directional: the 3'-OH of one deoxynucleotide and the 5'-phosphate of the next form a 3',5'-phosphodiester bond, producing a linear, unbranched DNA macromolecule. By convention, the 5'-to-3' direction is defined as the "+" direction and the 3'-to-5' direction as the "-" direction.) A contig placed at the correct position but in the wrong orientation also causes a large assembly error, which appears as a serious sub-sequence inversion error.

In general, a well appears in several adjacent contigs, so a set of adjacent contigs can be obtained through a well. First, among the wells to which a contig belongs, find a well that starts at this contig, and define the contig as the starting contig of that well. Then find the other contigs containing this well and calculate the intersection of each one's well set with the well set of the starting contig. The larger the intersection with the starting contig's well set, the closer that contig is to the starting contig. This determines the order of a set of contigs, from which a scaffold is constructed.
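The ordering rule can be sketched in Python as below. The data layout (a dict from contig name to its well set) and the function name are assumptions for illustration; contigs that share no well with the starting contig are simply left out.

```python
def order_from_start(start, contig_wells):
    """Order contigs by decreasing well-set intersection with the starting contig."""
    base = contig_wells[start]
    others = [c for c in contig_wells
              if c != start and contig_wells[c] & base]
    others.sort(key=lambda c: len(contig_wells[c] & base), reverse=True)
    return [start] + others


contig_wells = {
    "c1": {1, 2, 3, 4},
    "c2": {2, 3, 4, 5},
    "c3": {4, 5, 6},
    "c4": {9},
}
print(order_from_start("c1", contig_wells))  # ['c1', 'c2', 'c3']
```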

To determine the orientation of a contig in a scaffold, the well sets of its head end and tail end must be considered separately, and the similarity of each to the well set of the head or tail of the neighboring contigs is calculated. When a contig is short, the transitional change of wells is not obvious, and its position may then be undetermined. Therefore, after the initial scaffolding of all contigs is complete, the result must be checked and corrected against the result of the contig merge, and the relationships obtained by the two methods are combined to output the final scaffold sequence.
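One way to illustrate the head/tail comparison is the Python sketch below. The Jaccard index is used here as a stand-in similarity measure and `margin` is an assumed ambiguity threshold; the source names neither.

```python
def jaccard(a, b):
    """Jaccard similarity of two well sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def orient(prev_tail, head, tail, margin=0.1):
    """Return "+", "-", or "?" for a contig placed after a neighbor.

    prev_tail: well set at the tail of the preceding contig.
    head, tail: well sets at the two ends of the contig being oriented.
    """
    s_head, s_tail = jaccard(prev_tail, head), jaccard(prev_tail, tail)
    if abs(s_head - s_tail) < margin:
        return "?"  # e.g. a short contig whose two ends look alike
    return "+" if s_head > s_tail else "-"


print(orient({1, 2, 3, 4}, {2, 3, 4, 5}, {4, 5, 6, 7}))  # +
print(orient({1, 2, 3, 4}, {4, 5, 6, 7}, {2, 3, 4, 5}))  # -
```

A "?" result corresponds to the undetermined case that the later check against the contig merge result is meant to settle.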

While embodiments of the present invention have been shown and described, the scope of the invention is defined by the claims and their equivalents.

Claims

A method for assembling separated long fragment sequences, comprising:

(a) obtaining a set of reads by sequencing, and recording the sequencing well corresponding to each read in the set of reads, one sequencing well comprising at least one long fragment sequence;

(b) using the reads and the sequencing wells corresponding to the reads, extending a plurality of seed sequences in parallel to obtain a plurality of sequence contigs, wherein the plurality of seed sequences are determined from a known sequence;

(c) constructing a skeleton sequence based on the reads, the sequence contigs, and the sequencing wells corresponding to the reads contained in the sequence contigs, to obtain an assembly result for the separated long fragment sequences.
The method of claim 1 wherein said seed sequence is obtained based on a genomic reference sequence according to the following steps:

Breaking the genomic reference sequence at each N;

Truncating the broken reference sequence by a predetermined length to obtain the seed sequence.
The method of claim 2 wherein said predetermined length is no less than the length of the sequencing library insert in said sequencing.
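As an illustration of the reference-based seeding recited above, the following Python sketch breaks a reference at runs of N and cuts each block into consecutive pieces of a fixed length. Cutting into consecutive pieces and dropping a shorter trailing piece are assumptions about what "truncating by a predetermined length" means; the function name is hypothetical.

```python
import re


def seeds_from_reference(reference, seed_len):
    """Split a reference at runs of N, then cut each block into seed_len pieces."""
    seeds = []
    for block in re.split(r"N+", reference.upper()):
        # Trailing fragments shorter than seed_len are dropped (assumption).
        for i in range(0, len(block) - seed_len + 1, seed_len):
            seeds.append(block[i:i + seed_len])
    return seeds


print(seeds_from_reference("ACGTACGTNNNACGTA", 4))  # ['ACGT', 'ACGT', 'ACGT']
```

In practice `seed_len` would be chosen to be no less than the library insert length, as recited above.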
The method of claim 1 wherein said set of reads comprises paired reads, said seed sequence being obtained based on said reads in accordance with the following steps:

(1) sliding the read segment into a plurality of Kmers, constructing an index RKI of the Kmer to the read segment, for accessing the corresponding read segment through the Kmer;

(2) extracting a pair of read segments having no high frequency Kmer from the set of read segments;

(3) using the index RKI, determining, for each of the two reads of the paired reads in (2), all reads corresponding to its Kmers, to obtain a first read group and a second read group;

(4) determining the sequencing wells corresponding to the first read group and the second read group of (3), respectively, to obtain a first sequencing well set and a second sequencing well set;

(5) determining the intersection of the first sequencing well set and the second sequencing well set in (4), and, if the size of the intersection does not differ significantly from the expected number of effective sequencing wells per base, determining that the paired reads in (2) are the seed sequence.
The method according to claim 4, wherein in (5):

If the size of the intersection is between half and twice the expected number of effective sequencing wells per base, the paired reads in (2) are determined to be the seed sequence.
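Steps (3) to (5), with the half-to-twice criterion, can be sketched in Python as follows. The sketch assumes a simplified index that maps each Kmer to a set of read ids and a separate `read_well` table; both layouts and all names are hypothetical.

```python
def wells_hit(arm, k, rki, read_well):
    """Wells of all reads that share at least one Kmer with the given arm."""
    wells = set()
    for i in range(len(arm) - k + 1):
        for read_id in rki.get(arm[i:i + k], ()):
            wells.add(read_well[read_id])
    return wells


def is_seed_pair(left, right, k, rki, read_well, expected_wells):
    """Accept the pair when the shared well count is within [expected/2, 2*expected]."""
    shared = wells_hit(left, k, rki, read_well) & wells_hit(right, k, rki, read_well)
    return expected_wells / 2 <= len(shared) <= 2 * expected_wells


rki = {"ACG": {"r1", "r2"}, "TTA": {"r3", "r4"}, "GGC": {"r5"}}
read_well = {"r1": "w1", "r2": "w2", "r3": "w1", "r4": "w2", "r5": "w9"}
print(is_seed_pair("ACG", "TTA", 3, rki, read_well, expected_wells=2))  # True
print(is_seed_pair("ACG", "GGC", 3, rki, read_well, expected_wells=2))  # False
```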
The method of claim 1 wherein (b) comprises:

(i) sliding the read segment into a plurality of Kmers, constructing an index RKI of the Kmer to the read segment, for accessing the corresponding read segment through the Kmer;

(ii) performing parallel extension of the plurality of seed sequences based on the read segment and its corresponding index RKI to obtain the plurality of sequence contigs.
The method of claim 6 wherein said RKI is obtained by the following steps:

Sliding and cutting the reads into a plurality of Kmers;

Constructing a hash keyed by Kmer, the hash constituting the RKI and recording the frequency of each Kmer, the associated reads, and the position and direction of the Kmer on each read.
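A minimal Python sketch of such an index is shown below. Storing each Kmer under the lexicographically smaller of itself and its reverse complement, with "+" or "-" as the direction, is an assumption made for illustration; the claim does not say how direction is encoded.

```python
from collections import defaultdict

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def build_rki(reads, k):
    """Build a Kmer-keyed hash recording frequency and (read, position, direction)."""
    rki = defaultdict(lambda: {"freq": 0, "hits": []})
    for read_id, seq in reads.items():
        for pos in range(len(seq) - k + 1):
            kmer = seq[pos:pos + k]
            rc = revcomp(kmer)
            # Canonical key: the smaller of the Kmer and its reverse complement.
            key, strand = (kmer, "+") if kmer <= rc else (rc, "-")
            entry = rki[key]
            entry["freq"] += 1
            entry["hits"].append((read_id, pos, strand))
    return dict(rki)


rki_index = build_rki({"r1": "ACGTT", "r2": "AACGT"}, k=3)
print(rki_index["ACG"]["freq"])  # 4
```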
The method of claim 6 wherein said seed sequence is extended by repeating the following steps:

Selecting a seed sequence suitable for extension;

Positioning the reads on the seed sequence to obtain an extension sequence;

Performing base-by-base consensus on the reads positioned at the end of the extension sequence;

If the consensus fails, performing heterozygosity identification, phasing, and/or resolution of repeat sequences.
The method according to claim 8, wherein the seed sequence suitable for extension is selected by the following steps:

Sliding and cutting the seed sequence into Kmers;

Obtaining, through the RKI, the reads corresponding to the Kmers;

Aligning the corresponding reads with the seed sequence;

Determining the coverage of the seed sequence by sequencing wells, based on the sequencing wells corresponding to the corresponding reads;

Determining, based on the coverage, a seed sequence suitable for extension.
The method of claim 8 wherein said reads are positioned on said seed sequence by the following steps:

Sliding and cutting the seed sequence into Kmers;

Obtaining, through the RKI, the reads corresponding to the Kmers;

Positioning the reads that match the Kmers on the seed sequence and aligning them base by base.
The method according to claim 8, wherein, during the consensus, if the set of effective sequencing wells at an extended site is evenly distributed among the different base types at that site, it is judged that heterozygosity is present.
The method according to claim 11, wherein after heterozygosity is determined to be present, the extension sequence is divided into a plurality of sequences that are extended separately.
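The "evenly distributed" test can be illustrated with the Python sketch below, which compares how many wells support the two best-supported bases at a site. The `min_balance` ratio is an assumed threshold and the names are hypothetical; the claim gives no numeric criterion.

```python
from collections import defaultdict


def is_heterozygous(site_support, min_balance=0.5):
    """Judge heterozygosity from (well_id, base) observations at one site.

    Returns True when the second most supported base has at least
    min_balance times as many wells as the most supported base.
    """
    wells_by_base = defaultdict(set)
    for well_id, base in site_support:
        wells_by_base[base].add(well_id)
    counts = sorted((len(w) for w in wells_by_base.values()), reverse=True)
    if len(counts) < 2:
        return False
    return counts[1] / counts[0] >= min_balance


balanced = [("w1", "A"), ("w2", "A"), ("w3", "G"), ("w4", "G"), ("w5", "A")]
skewed = [("w1", "A"), ("w2", "A"), ("w3", "A"), ("w4", "A"), ("w5", "G")]
print(is_heterozygous(balanced), is_heterozygous(skewed))  # True False
```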
The method of claim 8 wherein said set of reads comprises a plurality of pairs of paired reads, the distance between the two reads of a pair being L,

During the consensus, if the distance between a read positioned downstream in the extension direction and its corresponding read in the pair is not L, the position of the read positioned downstream in the extension direction is determined to be the start point of a repeat sequence.
The method according to claim 13, wherein the end point of the repeat sequence is the position at which the distance between a read positioned downstream in the extension direction and its corresponding read in the pair is L,

Or is a conflict site encountered during extension.
The method according to claim 14, wherein the repeating sequence is determined to be a tandem repeat by the following steps:

Sliding and cutting the reads positioned on the repeat sequence to obtain Kmers;

Judging whether the Kmers obtained from each read contain a repeated Kmer; if no repeated Kmer is present, it is judged that there is no tandem repeat sequence, and if a repeated Kmer is present, it is judged that there is a tandem repeat sequence.
The method according to claim 15, wherein if the repeat sequence is a tandem repeat sequence, the tandem repeat sequence is resolved by the following steps:

(m) determining the length of the tandem repeat sequence;

(n) performing position adjustment by aligning the reads that contain the end point of the tandem repeat sequence at that end point.
The method according to claim 15, wherein if the repeat sequence is not a tandem repeat sequence and the following condition exists, the repeat sequence is judged to be a large double repeat sequence:

The length of the repeat sequence is greater than L, or, for a read positioned downstream of the repeat sequence, its corresponding read in the pair is also positioned on the repeat sequence.
The method according to claim 17, wherein if the repeat sequence is a large double repeat sequence, the large double repeat sequence is resolved by:

Comparing the number of effective sequencing wells corresponding to the upstream repeat sequence of the large double repeat sequence and the number of effective sequencing wells corresponding to the downstream repeat sequence against the expected number of effective sequencing wells per base, and resolving the conflict on the large double repeat sequence according to the differences.
The method according to claim 17, wherein if the repeat sequence is not a large double repeat sequence and the following conditions exist, it is determined that the repeat sequence is a small double repeat sequence:

The repeat sequence has a length less than L.
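The decision order recited above (tandem repeat first, then large double repeat, then small double repeat) can be summarized in a Python sketch. Treating "more than half of the reads contain a repeated Kmer" as the tandem test is an assumption, and the function names and the "unclassified" fallback for a repeat of length exactly L are hypothetical.

```python
def has_repeating_kmer(read, k):
    """True when some Kmer occurs more than once in the read."""
    seen = set()
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        if kmer in seen:
            return True
        seen.add(kmer)
    return False


def classify_repeat(reads_on_repeat, k, repeat_len, pair_distance, mate_on_repeat):
    """Return "tandem", "large", "small", or "unclassified".

    pair_distance: the distance L between the two reads of a pair.
    mate_on_repeat: whether the mate of a read downstream of the repeat
        is itself positioned on the repeat.
    """
    repeated = sum(has_repeating_kmer(r, k) for r in reads_on_repeat)
    if repeated * 2 > len(reads_on_repeat):  # majority rule is an assumption
        return "tandem"
    if repeat_len > pair_distance or mate_on_repeat:
        return "large"
    if repeat_len < pair_distance:
        return "small"
    return "unclassified"


print(classify_repeat(["ACTACTACT", "CTACTACTA"], 4, 9, 500, False))  # tandem
print(classify_repeat(["ACGTTGCA"], 4, 800, 500, False))              # large
print(classify_repeat(["ACGTTGCA"], 4, 200, 500, False))              # small
```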
The method according to claim 19, wherein the small double repeat sequence is resolved by at least one of the following:

(p) using the mean of the distances between the paired reads supporting each base as the expected position of the conflict site, and determining the base at the conflict site by comparing how far the distances between the paired reads supporting the two conflicting bases deviate from the mean;

(k) constructing an auxiliary contig from the reads that are paired with the reads in the most upstream non-repetitive sequence within the standard-deviation range of the extension sequence, and determining the base at the conflict site using the reads that are paired with the reads positioned downstream on the auxiliary contig.
The method according to claim 20, wherein the extension of the seed sequence is terminated if the small double repeat sequence cannot be resolved.
The method of claim 1 wherein (c) comprises:

(iii) establishing a merged connection relationship between the sequence contigs based on the read segment;

(iv) constructing a primary skeleton sequence based on the sequence contigs and the sequencing wells corresponding to the reads contained in the sequence contigs;

(v) after combining the merge connection relationships among the plurality of sequence contigs with the primary skeleton sequence, constructing the skeleton sequence by merging the sequence contigs, to obtain the assembly result for the separated long fragment sequences.
An apparatus for assembling separated long fragment sequences, comprising:

An input module, which obtains a set of reads by sequencing and records the sequencing well corresponding to each read in the set of reads, wherein one sequencing well comprises at least one long fragment sequence;

An extension module, which uses the reads and the sequencing wells corresponding to the reads to extend a plurality of seed sequences in parallel to obtain a plurality of sequence contigs, wherein the plurality of seed sequences are determined from a known sequence;

A skeleton sequence constructing module, which constructs a skeleton sequence based on the reads, the sequence contigs, and the sequencing wells corresponding to the reads contained in the sequence contigs, to obtain an assembly result for the separated long fragment sequences.
The apparatus according to claim 23, wherein said seed sequence is obtained based on a genomic reference sequence according to the following steps:

Breaking the genomic reference sequence at each N;

Truncating the broken reference sequence by a predetermined length to obtain the seed sequence.
The apparatus according to claim 24, wherein said predetermined length is not less than the length of the sequencing library insert in said sequencing.
The apparatus of claim 23 wherein said set of reads comprises paired reads, said seed sequence being obtained based on said reads in accordance with the following steps:

(1) sliding the read segment into a plurality of Kmers, constructing an index RKI of the Kmer to the read segment, for accessing the corresponding read segment through the Kmer;

(2) extracting a pair of read segments having no high frequency Kmer from the set of read segments;

(3) using the index RKI, determining, for each of the two reads of the paired reads in (2), all reads corresponding to its Kmers, to obtain a first read group and a second read group;

(4) determining the sequencing wells corresponding to the first read group and the second read group of (3), respectively, to obtain a first sequencing well set and a second sequencing well set;

(5) determining the intersection of the first sequencing well set and the second sequencing well set in (4), and, if the size of the intersection does not differ significantly from the expected number of effective sequencing wells per base, determining that the paired reads in (2) are the seed sequence.
The apparatus according to claim 26, wherein in (5):

If the size of the intersection is between half and twice the expected number of effective sequencing wells per base, the paired reads in (2) are determined to be the seed sequence.
The apparatus according to claim 23, wherein said extension module performs:

(i) sliding the read segment into a plurality of Kmers, constructing an index RKI of the Kmer to the read segment, for accessing the corresponding read segment through the Kmer;

(ii) performing parallel extension of the plurality of seed sequences based on the read segment and its corresponding index RKI to obtain the plurality of sequence contigs.
The apparatus according to claim 28, wherein said RKI is obtained by the following steps:

Sliding and cutting the read segment into a plurality of Kmers;

Constructing a hash keyed by Kmer, the hash constituting the RKI and recording the frequency of each Kmer, the associated reads, and the position and direction of the Kmer on each read.
The apparatus of claim 28 wherein said seed sequence is extended by repeating the following steps:

Selecting a seed sequence suitable for extension;

Positioning the reads on the seed sequence to obtain an extension sequence;

Performing base-by-base consensus on the reads positioned at the end of the extension sequence;

If the consensus fails, performing heterozygosity identification, phasing, and/or resolution of repeat sequences.
The apparatus according to claim 30, wherein the seed sequence suitable for extension is selected by the following steps:

Sliding and cutting the seed sequence into Kmers;

Obtaining, through the RKI, the reads corresponding to the Kmers;

Aligning the corresponding reads with the seed sequence;

Determining the coverage of the seed sequence by sequencing wells, based on the sequencing wells corresponding to the corresponding reads;

Determining, based on the coverage, a seed sequence suitable for extension.
The apparatus of claim 30 wherein said reads are positioned on said seed sequence by the following steps:

Sliding and cutting the seed sequence into Kmers;

Obtaining, through the RKI, the reads corresponding to the Kmers;

Positioning the reads that match the Kmers on the seed sequence and aligning them base by base.
The apparatus according to claim 30, wherein, during the consensus, if the set of effective sequencing wells at an extended site is evenly distributed among the different base types at that site, it is judged that heterozygosity is present.
The apparatus according to claim 33, wherein after heterozygosity is determined to be present, the extension sequence is divided into a plurality of sequences that are extended separately.
The apparatus according to claim 30, wherein said set of reads comprises a plurality of pairs of paired reads, the distance between the two reads of a pair being L,

During the consensus, if the distance between a read positioned downstream in the extension direction and its corresponding read in the pair is not L, the position of the read positioned downstream in the extension direction is determined to be the start point of a repeat sequence.
The apparatus according to claim 35, wherein the end point of said repeat sequence is the position at which the distance between a read positioned downstream in the extension direction and its corresponding read in the pair is L, or is a conflict site encountered during extension.
The apparatus according to claim 36, wherein the repeating sequence is judged to be a tandem repeat by the following steps:

Sliding and cutting the reads positioned on the repeat sequence to obtain Kmers;

Judging whether the Kmers obtained from each read contain a repeated Kmer; if no repeated Kmer is present, it is judged that there is no tandem repeat sequence, and if a repeated Kmer is present, it is judged that there is a tandem repeat sequence.
The apparatus according to claim 37, wherein if said repeat sequence is a tandem repeat sequence, said tandem repeat sequence is resolved by the following steps:

(m) determining the length of the tandem repeat sequence;

(n) performing position adjustment by aligning the reads that contain the end point of the tandem repeat sequence at that end point.
The apparatus according to claim 37, wherein, if said repeat sequence is not a tandem repeat sequence and the following condition exists, said repeat sequence is determined to be a large double repeat sequence:

The length of the repeat sequence is greater than L, or, for a read positioned downstream of the repeat sequence, its corresponding read in the pair is also positioned on the repeat sequence.
The apparatus according to claim 39, wherein if said repeating sequence is a large double repeating sequence, said large double repeating sequence is resolved by:

Comparing the number of effective sequencing wells corresponding to the upstream repeat sequence of the large double repeat sequence and the number of effective sequencing wells corresponding to the downstream repeat sequence against the expected number of effective sequencing wells per base, and resolving the conflict on the large double repeat sequence according to the differences.
The apparatus according to claim 39, wherein, if said repeat sequence is not a large double repeat sequence and the following condition exists, said repeat sequence is determined to be a small double repeat sequence:

The repeat sequence has a length less than L.
The apparatus according to claim 41, wherein if said repeating sequence is a small double repeating sequence, said small double repeating sequence is resolved by at least one of:

(p) using the mean of the distances between the paired reads supporting each base as the expected position of the conflict site, and determining the base at the conflict site by comparing how far the distances between the paired reads supporting the two conflicting bases deviate from the mean;

(k) constructing an auxiliary contig from the reads that are paired with the reads in the most upstream non-repetitive sequence within the standard-deviation range of the extension sequence, and determining the base at the conflict site using the reads that are paired with the reads positioned downstream on the auxiliary contig.
The apparatus according to claim 42, wherein the extension of the seed sequence is terminated if the small double repeat sequence cannot be resolved.
The apparatus according to claim 23, wherein said skeleton sequence constructing module performs:

(iii) establishing a merged connection relationship between the sequence contigs based on the read segment;

(iv) constructing a primary skeleton sequence based on the sequence contigs and the sequencing wells corresponding to the reads contained in the sequence contigs;

(v) after combining the merge connection relationships among the plurality of sequence contigs with the primary skeleton sequence, constructing the skeleton sequence by merging the sequence contigs, to obtain the assembly result for the separated long fragment sequences.
A computer-readable medium storing a computer-executable program, wherein executing the program comprises performing the method of any one of claims 1-22.
A system for assembling separated long fragment sequences, comprising:

a data input unit for inputting data;

a data output unit for outputting data;

a storage unit for storing data, including a computer executable program;

And a processor coupled to the data input unit, the data output unit, and the storage unit, for executing the computer-executable program, wherein executing the program comprises completing the method of any one of claims 1-22.