WO2017143585A1 - Method and apparatus for assembling separated long fragment sequences

Publication number
WO2017143585A1
Authority
WO
WIPO (PCT)
Prior art keywords
sequence
read
reads
sequencing
kmer
Prior art date
Application number
PCT/CN2016/074665
Other languages
French (fr)
Chinese (zh)
Inventor
谢寅龙
黄伟华
李净净
郭瑞东
唐静波
邓超
Original Assignee
深圳华大基因研究院
Priority date
Filing date
Publication date
Application filed by 深圳华大基因研究院
Priority to PCT/CN2016/074665 priority Critical patent/WO2017143585A1/en
Priority to CN201680063769.5A priority patent/CN108350495B/en
Publication of WO2017143585A1 publication Critical patent/WO2017143585A1/en
Priority to HK18113476.6A priority patent/HK1254399A1/en

Classifications

    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00 ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression

Definitions

  • The present invention relates to the field of biotechnology, and in particular to a method and apparatus for assembling separated long fragment sequences.
  • Comparative genomics is a discipline that compares known gene and genome structures to understand gene function, expression mechanisms, and species evolution.
  • The basic principle is that characteristics shared by two organisms are usually encoded by evolutionarily conserved DNA, so the relevant DNA fragments will be the same or similar.
  • Comparative genomics analysis requires a genome map (a genomic reference sequence) and sequencing data for the objects being compared. Mutation detection is an important part of comparative genomics.
  • The International HapMap Project and genome-wide association studies (GWAS) both rely on single-nucleotide polymorphisms (SNPs), a type of variation widely used in related research.
  • Assembly refers to the process of integrating shorter fragment sequences into longer sequences. Because of the limits of sequencing technology, a genome cannot be read by sequencing each chromosome from beginning to end. Instead, the whole genome is broken into fragments of tens to thousands of bases, the fragments are read by massively parallel sequencing, and assembly is used to analyze and integrate them, finally restoring relatively complete genomic sequences. Identifying mutations through assembly is a newer application of assembly techniques; the main purpose of assembly remains building the genome. When no genomic reference sequence exists, building the genome starts from scratch, which is especially important. This type of assembly is called "de novo assembly".
  • The diploid genome of a human individual derives from one haploid set contributed by each parent. Differences between the two haploid sets mean that the individual carries different alleles at one or more sites on homologous chromosomes; this phenomenon is heterozygosity.
  • The human reference sequence is constructed from data from multiple individuals, which results in what is effectively a heterozygous haploid genome. As genomic research deepens, the haploid human reference sequence is increasingly unable to meet demand, haplotype construction (haplotyping) is becoming increasingly important, and genomic analyses based on haplotype information are emerging.
  • Haplotype information helps to interpret the relationship between genotype and phenotype.
  • Two individuals with the same set of heterozygous sites can still differ in phenotype and disease susceptibility depending on their haplotypes.
  • Studies such as the specific expression of alleles
  • disease research such as Mediterranean fever, breast cancer
  • Phasing is an important operation for constructing haplotypes. There are many methods for constructing haplotypes, mainly divided into the following five categories:
  • The core of the physical separation method is to separate the long DNA fragments into which the genome has been broken, either by means of fosmid plasmids or by direct physical separation in multiwell plates. After separation, the further fragmentation and amplification needed for sequencing (library construction) are carried out and, to distinguish the separation units from one another, the sequences in each unit are tagged with a different barcode. In this way the whole genome is divided into many sub-portions. When the number of sub-portions is large, each sub-portion contains only one haploid copy of any small region of the genome. Regions that are heterozygous at the genome-wide level therefore appear homozygous within these small regions, which is of great importance for constructing haplotypes.
  • Each partition unit has its own unique barcode sequence, so that after sequencing the reads belonging to each partition unit can be retrieved by identifying the barcode.
  • In fosmid separation technology the separation unit is called a fosmid pool, and each fosmid pool contains one or more long fragments of about 37 kbp. In multiwell plate separation technology the separation unit is called a well, and each well contains multiple long fragments whose length varies from technology to technology.
  • The separation method introduces a new type of information: unlike whole-genome shotgun (WGS) sequencing, the total collection of reads is no longer a single undivided set but is separated by barcode into many groups.
  • The reads in each group come from a common small region of the genome, namely the region covered by the long fragments at separation. Although these small regions may still derive from any position in the whole genome, the reads in each group are constrained and clustered. This added clustering information is the key to constructing haplotypes.
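As a minimal sketch of the barcode-based retrieval described above, the following Python groups reads by the barcode of their partition unit. The `(barcode, sequence)` tuple layout and the name `group_reads_by_barcode` are assumptions made for illustration; the source does not specify a data format.

```python
from collections import defaultdict


def group_reads_by_barcode(reads):
    """Group read sequences by the barcode of their partition unit.

    `reads` is an iterable of (barcode, sequence) tuples. This layout is
    hypothetical; a real pipeline would parse barcodes from sequencer output.
    """
    groups = defaultdict(list)
    for barcode, sequence in reads:
        groups[barcode].append(sequence)
    return dict(groups)


# Three reads from two partition units (wells or fosmid pools).
example_reads = [("BC01", "ACGT"), ("BC02", "TTGA"), ("BC01", "GGCA")]
grouped = group_reads_by_barcode(example_reads)
```

Each value in `grouped` then holds the reads of one partition unit, which is the clustering information the text identifies as the key to haplotype construction.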
  • LFR Long Fragment Reads
  • The present invention is directed to solving at least some of the above technical problems, or at least to providing a useful commercial choice.
  • The invention provides a method of assembling separated long fragment sequences, comprising:
  • The program may be stored in a computer-readable storage medium, which may include read-only memory, random-access memory, a magnetic disk, an optical disc, and the like.
  • The present invention provides a system for assembling separated long fragment sequences, the system comprising: a data input unit for inputting data; a data output unit for outputting data; a storage unit for storing data, including a computer-executable program; and a processor coupled to the data input unit, the data output unit, and the storage unit for executing the computer-executable program, execution of which comprises performing the above method.
  • The invention provides an apparatus for assembling separated long fragment sequences, comprising:
  • an extension module that uses the reads and the sequencing wells corresponding to the reads to extend a plurality of seed sequences in parallel, obtaining a plurality of sequence contigs, wherein the plurality of seed sequences are determined from known sequences;
  • a skeleton sequence construction module that constructs a skeleton sequence based on the reads, the sequence contigs, and the sequencing wells corresponding to the reads included in the sequence contigs, obtaining the assembly result for the separated long fragment sequences.
  • Iterative extension: extending a seed sequence is an iterative process in which the extended sequence serves as the basis for the next round of extension, so that the seed sequence keeps growing.
  • Linear extension: the linear extension method keeps the structures encountered during assembly simple, with logical relationships that are relatively clear and easy to classify, whereas a graph algorithm gathers the information of all repeat regions in one place, involving more information and a more complex structure. In other words, a graph resolves repeats in one pass: once one repeat region is resolved, all genomic regions associated with it (those with the same or similar repeat sequences) are resolved at the same time. The linear method resolves only the repeat in the current region and resolves it again when extension reaches the repeat in another region of the genome. Recording the relevant information after a repeat is resolved makes the next resolution easier, which reduces the extra computational cost.
  • Linear extension also makes the extension of each seed sequence independent of the others, which makes the computation easier to parallelize.
  • Seed extension uses well information to extend the haplotypes along multiple paths. Whenever a heterozygous region is encountered, phasing is carried out immediately using the previously assembled heterozygous regions; that is, phasing is also linear and runs through the various modules of the assembly apparatus.
  • Extension through a homozygous region follows a single path. When a heterozygous region is encountered, the extension splits into multiple paths, and once the heterozygous region has been extended the paths merge back into a single path.
  • The algorithm is essentially global: when extension meets a complicated situation, the algorithm strictly chooses to terminate the extension rather than extend along the higher-scoring option, thereby avoiding greedy decisions.
  • Compared with conventional hierarchical assembly algorithms, the algorithm is more global and can greatly reduce the required sequencing depth. That is, no well needs a particularly high sequencing depth, which saves substantial cost and time.
  • FIG. 1 is a flow chart of a method of assembling separated long fragment sequences in an embodiment of the present invention.
  • FIG. 2 is a flow chart of a method of assembling separated long fragment sequences in an embodiment of the present invention.
  • FIG. 3 is a block diagram of an apparatus for assembling separated long fragment sequences in an embodiment of the present invention.
  • FIG. 4 is a schematic structural view of an apparatus for assembling separated long fragment sequences in an embodiment of the present invention.
  • FIG. 5 is a schematic diagram of the iterative seed sequence extension process in an embodiment of the present invention.
  • FIG. 6 is a schematic diagram of identifying a repeat region in an embodiment of the present invention.
  • FIG. 7 is a schematic illustration of conflict handling for a large double repeat having two repeat regions that are relatively large, in an embodiment of the present invention.
  • FIG. 8 is a schematic illustration of conflict handling for a large double repeat having two repeat regions that are relatively small, in an embodiment of the present invention.
  • FIG. 9 is a schematic illustration of conflict handling for tandem repeats in an embodiment of the present invention.
  • The invention proposes a method of assembling separated long fragment sequences; see FIG. 1. The method comprises:
  • The seed sequence is obtained from the genomic reference sequence according to the following steps:
  • The interrupted reference sequence is cut into pieces of a predetermined length to obtain the seed sequences.
  • The predetermined length is not particularly limited. Generally, it is not less than the size of the insert, so that reads from the same insert can be positioned on the same seed sequence.
  • When the sequencing is paired-end sequencing, the predetermined length is at least twice the length of the sequencing library insert, which facilitates positioning both reads of a pair on the same seed sequence.
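The reference-based seed construction can be sketched as follows. The sketch assumes that "interrupted" means the reference is broken at runs of N, as the apparatus description of the same step states. The function name and the decision to keep a trailing chunk shorter than the predetermined length are illustrative assumptions.

```python
import re


def seeds_from_reference(reference, seed_length):
    """Break a reference at runs of N, then cut each piece into seeds.

    A trailing chunk shorter than `seed_length` is kept as a seed; the
    source does not say how such leftovers are handled, so this is an
    assumption.
    """
    seeds = []
    for piece in re.split(r"N+", reference.upper()):
        for start in range(0, len(piece), seed_length):
            seeds.append(piece[start:start + seed_length])
    return seeds


# Toy reference with one gap of unknown bases; real seeds would be at
# least as long as the library insert (twice as long for paired-end data).
toy_seeds = seeds_from_reference("ACGTACGTNNNNGGCCTT", 4)
```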
  • The read set comprises paired reads, and the seed sequence is obtained from the reads according to the following steps: (1) slide-cut the reads into a plurality of Kmers and construct an index from Kmers to reads, the RKI, so that the corresponding reads can be accessed by Kmer; (2) extract from the read set a read pair that contains no high-frequency Kmer; (3) using the RKI, determine all reads corresponding to the Kmers of each of the two reads of the pair in (2), obtaining a first read group and a second read group; (4) determine the sequencing wells corresponding to the first read group and the second read group in (3), obtaining a first sequencing well set and a second sequencing well set; (5) determine the intersection of the first and second sequencing well sets in (4), and if the size of the intersection does not differ significantly from the expected number of effective sequencing wells of the base, determine that the read pair in (2) is a seed sequence, the expected number of effective sequencing wells for
  • "High frequency" is relative to the average frequency: a Kmer whose number of occurrences is higher than the average number of occurrences of Kmers is considered a high-frequency Kmer.
  • As needed, the inventors define a high-frequency Kmer as one whose number of occurrences is much higher than the average, where "much higher" means at least 10, 20, 30, 40, 50, or 60 times the average number of occurrences.
  • A Kmer with a frequency of 3000 to 5000 is set as a high-frequency Kmer.
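A minimal sketch of the relative high-frequency criterion: count every Kmer across the reads and flag those whose count reaches a chosen multiple of the mean count. The function names and the default `fold=10` are illustrative assumptions (the text lists 10 to 60 as possible multiples).

```python
from collections import Counter


def slide_cut(sequence, k):
    """Return all overlapping Kmers of a sequence, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def high_frequency_kmers(reads, k, fold=10):
    """Return Kmers occurring at least `fold` times the mean Kmer count."""
    counts = Counter(kmer for read in reads for kmer in slide_cut(read, k))
    if not counts:
        return set()
    mean_count = sum(counts.values()) / len(counts)
    return {kmer for kmer, n in counts.items() if n >= fold * mean_count}


# "AA" occurs 9 times; the other three Kmers occur once each (mean = 3).
flagged = high_frequency_kmers(["A" * 10, "ACGT"], k=2, fold=2)
```

An absolute cutoff, such as the 3000 to 5000 range mentioned above, could replace the relative test by comparing `n` against a fixed number instead.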
  • The former approach constructs seeds from known sequences only, such as a reference sequence, sequencing reads, or assembled fragments. The latter constructs seeds by combining known sequences, the RKI, and sequencing wells, and can be applied to de novo assembly of species that have no reference sequence at all.
  • (b) comprises: (i) slide-cutting the reads into a plurality of Kmers and constructing an index from Kmers to reads, the RKI, for accessing the corresponding reads by Kmer; and (ii) extending the plurality of seed sequences in parallel based on the reads and their RKI to obtain the plurality of sequence contigs.
  • The RKI is obtained by slide-cutting the reads into a plurality of Kmers and constructing a hash keyed by Kmer. The hash constitutes the RKI and records, for each Kmer, its frequency, the associated reads, and the position and orientation of the Kmer on each read.
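The RKI described above maps naturally onto a dictionary keyed by Kmer. The sketch below stores the frequency together with a list of (read, position, orientation) hits. Keying by the canonical Kmer (the lexicographically smaller of a Kmer and its reverse complement) is an assumption introduced here to give "orientation" a concrete meaning; the source does not say how orientation is encoded.

```python
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string over the alphabet ACGT."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_rki(reads, k):
    """Build a Kmer -> {"freq", "hits"} index from a dict of read_id -> sequence.

    Each hit is (read_id, position, strand); strand is "+" when the read
    contains the key itself and "-" when it contains its reverse complement.
    """
    index = defaultdict(lambda: {"freq": 0, "hits": []})
    for read_id, seq in reads.items():
        for pos in range(len(seq) - k + 1):
            kmer = seq[pos:pos + k]
            rc = reverse_complement(kmer)
            key, strand = (kmer, "+") if kmer <= rc else (rc, "-")
            entry = index[key]
            entry["freq"] += 1
            entry["hits"].append((read_id, pos, strand))
    return dict(index)


rki = build_rki({"r1": "ACGTA"}, k=3)
```

Looking up `rki[kmer]["hits"]` then returns every read containing that Kmer, which is the access pattern used when positioning reads on a seed.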
  • The seed sequence is extended by repeating the following steps: selecting a seed sequence suitable for extension; positioning reads on the seed sequence to obtain an extension sequence; performing base-by-base consensus processing on the reads positioned at the end of the extension sequence; and, if consensus processing fails, performing heterozygosity recognition, phasing, and/or resolution of repeat sequences.
  • A seed sequence suitable for extension is selected by: slide-cutting the seed sequence into Kmers; obtaining the reads corresponding to the Kmers through the RKI; aligning the corresponding reads with the seed sequence; determining, from the sequencing wells of the corresponding reads, how the sequencing wells cover the seed sequence; and determining the seed sequences suitable for extension from that coverage.
  • Reads are positioned on the seed sequence by: slide-cutting the seed sequence into Kmers; obtaining the reads corresponding to the Kmers through the RKI; and mapping those reads to the seed sequence and aligning them base by base.
  • If the set of effective sequencing wells at the extension site is split evenly among the different base types observed at that site, the site is judged to be heterozygous.
  • The extension sequence is then divided into multiple paths for extension.
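The heterozygosity test can be sketched as a balance check on the wells that support each base at a site. The source only says the wells are "equally distributed" among the base types; the `min_ratio` threshold below is a hypothetical stand-in for that judgment, and the function name is illustrative.

```python
def is_heterozygous(wells_by_base, min_ratio=0.5):
    """Judge a site heterozygous when two bases have balanced well support.

    `wells_by_base` maps a base to the set of effective sequencing wells
    supporting it at this site. `min_ratio` (second-largest support divided
    by largest support) is an assumed threshold, not a value from the source.
    """
    support = sorted((len(w) for w in wells_by_base.values() if w), reverse=True)
    if len(support) < 2:
        return False
    return support[1] / support[0] >= min_ratio


balanced = is_heterozygous({"A": {1, 2, 3, 4}, "G": {5, 6, 7}})
skewed = is_heterozygous({"A": {1, 2, 3, 4, 5, 6, 7, 8}, "G": {9}})
```

A skewed split such as the second call is more consistent with sequencing error or a repeat than with two haplotypes, so it is not treated as heterozygous in this sketch.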
  • The read set comprises a plurality of read pairs, and the distance between the two reads of a pair is L. During consensus processing, if the distance between the read of a pair located downstream in the extension direction and its mate is not L, the position of the downstream read is determined to be the start point of a repeat sequence.
  • Any numerical value expressed exactly may represent a range, for example an interval of plus or minus 10% of the value; or, where the values of the population are normally distributed, the interval of the value plus or minus its standard deviation.
  • The distance L between the two reads of a pair corresponds to the length of the insert.
  • The size of the constructed sequencing library is fixed; that is, the insert size is a fixed value. In theory, the distance between the outer ends of the paired reads obtained by paired-end sequencing equals that fixed value.
  • The distance between the paired reads is normally distributed.
  • The inventors set L to the insert size of the experimental stage. Those skilled in the art will understand that L may also be set to any value within plus or minus one standard deviation of that insert size, or to that standard deviation interval itself, and repeat sequences encountered during extension can still be identified and resolved by the repeat-resolution method of the present invention.
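A minimal sketch of the repeat-start test: walk the read pairs along the extension direction and report the first downstream read whose distance to its mate falls outside the accepted interval around L. The coordinate layout of `pairs` and the explicit `tolerance` parameter (for example one standard deviation of the insert size) are assumptions for illustration.

```python
def find_repeat_start(pairs, insert_size, tolerance):
    """Return the position of the first downstream read with an abnormal pair distance.

    `pairs` is a list of (mate_position, downstream_position) tuples in
    extension-sequence coordinates (a hypothetical layout). A pair is
    abnormal when its distance differs from `insert_size` by more than
    `tolerance`. Returns None when every pair is consistent with L.
    """
    for mate_pos, downstream_pos in sorted(pairs, key=lambda p: p[1]):
        distance = downstream_pos - mate_pos
        if abs(distance - insert_size) > tolerance:
            return downstream_pos
    return None


# Insert size 500 with a tolerance of 50: the third pair spans 800 bases.
start = find_repeat_start([(0, 500), (40, 545), (100, 900)], 500, 50)
```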
  • The end point of the repeat sequence is either the position at which a read located downstream in the extension direction lies at distance L from its mate, or the conflict site.
  • Whether the repeat sequence is a tandem repeat is determined by slide-cutting the reads positioned on the repeat sequence into Kmers and checking whether the Kmers obtained from each read contain a repeated Kmer. If no repeated Kmer is present, it is judged that there is no tandem repeat; if a repeated Kmer is present, it is judged that there is a tandem repeat.
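The tandem-repeat test amounts to asking whether any single read contains the same Kmer twice. The sketch below is a direct reading of that rule; the function names and the choice of `k` are illustrative.

```python
def has_repeated_kmer(read, k):
    """Return True if some Kmer occurs more than once within one read."""
    seen = set()
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        if kmer in seen:
            return True
        seen.add(kmer)
    return False


def is_tandem_repeat(reads_on_repeat, k):
    """Judge a tandem repeat when any read positioned on the repeat has a repeated Kmer."""
    return any(has_repeated_kmer(read, k) for read in reads_on_repeat)


tandem = is_tandem_repeat(["ACGTTGCA", "ACGACGACG"], k=3)
```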
  • The tandem repeat is resolved by: determining the length of the tandem repeat; and using the reads that include the end point of the tandem repeat to adjust the position of the end point, so as to determine the base at the conflict site.
  • The repeat sequence is judged to be a large double repeat if either of the following holds: the length of the repeat sequence is greater than L, or, in a read pair, the mate of the read located downstream of the repeat sequence is also located on the repeat sequence.
  • The large double repeat is resolved by comparing the difference between the number of effective sequencing wells (EW) corresponding to the upstream repeat and the number corresponding to the downstream repeat against the expected number of effective sequencing wells of the base, and resolving the conflict on the large double repeat according to the size of that difference. For example, if the two repeats of a large double repeat are relatively far apart, the upstream conflicting base is covered by more effective sequencing wells than the downstream one, and the difference is greater than half the expected EW number, the base can be determined by this comparison and the conflict resolved. If the difference between the upstream and downstream EW numbers is less than or equal to half the expected EW number, the wrong downstream arm lying within one insert length of the start of the repeat region and its corresponding upstream arm can be used to construct a helper contig (HC), and the EW set on the HC is compared with the EW set of the upstream conflicting base and with that of the downstream conflicting base.
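The decision rule for a large double repeat can be sketched as a comparison of well counts. The sketch takes the difference as upstream minus downstream, which is one reading of the text, and the returned labels are hypothetical names for the two branches. Building and comparing the helper contig itself is not shown.

```python
def large_double_repeat_strategy(upstream_wells, downstream_wells, expected_ew):
    """Choose how to resolve a conflict on a large double repeat.

    `upstream_wells` and `downstream_wells` are the sets of effective
    sequencing wells (EW) supporting the upstream and downstream conflicting
    bases; `expected_ew` is the expected EW number per base. The signed
    difference and both return labels are assumptions for illustration.
    """
    difference = len(upstream_wells) - len(downstream_wells)
    if difference > expected_ew / 2:
        return "resolve_by_well_count"
    return "build_helper_contig"


clear_case = large_double_repeat_strategy(set(range(10)), {20, 21, 22}, expected_ew=10)
close_case = large_double_repeat_strategy(set(range(6)), set(range(20, 24)), expected_ew=10)
```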
  • If the repeat sequence is not a large double repeat and the following condition holds, the repeat sequence is determined to be a small double repeat: the length of the repeat sequence is less than L.
  • The small double repeat is resolved by at least one of: (p) taking the mean distance between the paired reads supporting each base as the expected position of the conflict site, and comparing how close the paired-read distances supporting each of the two conflicting bases are to that mean, to determine the base at the conflict site; (k) constructing a helper contig from the mates of the most upstream reads, within the standard deviation range, that are positioned on the non-repetitive part of the extension sequence, and using the mates of the reads located downstream of the helper contig to determine the base at the conflict site.
  • If the small double repeat cannot be resolved, the conflict is left unresolved and the extension of the seed sequence is terminated.
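One possible reading of option (p) is sketched below: for each conflicting base, average the pair distances of its supporting read pairs and keep the base whose average lies closest to the expected distance. Using a caller-supplied `expected_distance` as the reference value, and returning `None` on a tie to signal termination, are assumptions; the translated text is ambiguous on both points.

```python
from statistics import mean


def pick_base_by_pair_distance(distances_by_base, expected_distance):
    """Pick the conflicting base whose supporting pairs best match the expected distance.

    `distances_by_base` maps a base to the pair distances of the read pairs
    supporting it. Returns None when nothing supports any base or when the
    two best candidates are equally close (unresolved conflict).
    """
    deviation = {
        base: abs(mean(distances) - expected_distance)
        for base, distances in distances_by_base.items()
        if distances
    }
    if not deviation:
        return None
    ranked = sorted(deviation.items(), key=lambda item: item[1])
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


chosen = pick_base_by_pair_distance({"A": [498, 502, 505], "C": [430, 440]}, 500)
```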
  • (c) includes: (iii) establishing merge relationships between the sequence contigs based on the reads; (iv) based on the sequence contigs and the … sequence contigs, constructing the skeleton sequence to obtain the assembly result for the separated long fragment sequences.
  • Iterative extension: extending a seed sequence is an iterative process in which the extended sequence serves as the basis for the next round of extension, so that the seed sequence keeps growing.
  • Linear extension: unlike other, graph-based assembly algorithms, the linear extension method keeps the structures encountered during assembly simple, with logical relationships that are relatively clear and easy to classify, whereas a graph algorithm gathers the information of all repeat regions in one place, involving more information and a more complex structure. In other words, a graph resolves repeats in one pass: once one repeat region is resolved, all genomic regions associated with it (those with the same or similar repeat sequences) are resolved at the same time. The linear method resolves only the repeat in the current region and resolves it again when extension reaches the repeat in another region of the genome. Recording the relevant information after a repeat is resolved makes the next resolution easier, which reduces the extra computational cost. In addition, linear extension makes the extension of each seed sequence independent of the others, which makes the computation easier to parallelize.
  • Seed extension uses well information to extend the haplotypes along multiple paths. Whenever a heterozygous region is encountered, phasing is carried out immediately using the previously assembled heterozygous regions; that is, phasing is also linear and runs through the various modules of the assembly system (e.g., contig merging and skeleton sequence construction).
  • Extension through a homozygous region follows a single path. When a heterozygous region is encountered, the extension splits into multiple paths, and once the heterozygous region has been extended the paths merge back into a single path.
  • The algorithm is essentially global: when extension meets a complicated situation, the algorithm strictly chooses to terminate the extension rather than extend along the higher-scoring option, thereby avoiding greedy decisions.
  • Compared with conventional hierarchical assembly algorithms, the algorithm is more global and can greatly reduce the required sequencing depth. That is, no well needs a particularly high sequencing depth, which saves substantial cost and time.
  • The program may be stored in a computer-readable storage medium, which may include read-only memory, random-access memory, a magnetic disk, an optical disc, and the like.
  • The above process is written as an executable program.
  • Execution of the program proceeds as follows. First, the seeds and reads are read in, and a Kmer index of the reads (Read Kmer Index, RKI) is built so that the target reads can be accessed quickly through Kmers. The seeds are then extended as far as possible. Because the extension of each seed is independent of the others, this step can be accelerated by parallel computation.
  • An extended seed is a sequence contig. At this point the well information is used to pre-build the skeleton sequence (scaffold): only the ordering relationships between contigs are established, and the scaffold is not constructed immediately. Similarly, the reads and paired-read information are used to establish merge relationships between contigs, but the contigs are not merged immediately. The contig merge relationships and the relationships in the pre-built scaffold are then checked against each other, the relationships are simplified and conflicts resolved, the contigs are merged and the scaffold is constructed, and finally the assembly result is output.
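Because each seed is extended independently, the extension step maps onto a worker pool. The sketch below only shows that structure: `extend_seed` is a hypothetical placeholder that returns its input unchanged, not the extension procedure of the disclosure, and a thread pool is used only to keep the example self-contained (a process pool would usually suit CPU-bound extension better in Python).

```python
from concurrent.futures import ThreadPoolExecutor


def extend_seed(seed):
    """Placeholder for per-seed extension; returns the seed unchanged."""
    return seed


def extend_all(seeds, extend=extend_seed, max_workers=4):
    """Extend every seed independently and return contigs in seed order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(extend, seeds))


contigs = extend_all(["ACGT", "GGCC"])
```

Contig merging and scaffold construction would follow as separate, later stages, since the text defers both until the merge and ordering relationships have been cross-checked.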
  • The present invention provides an apparatus for assembling separated long fragment sequences, for performing the method of any of the above embodiments of the present invention.
  • The apparatus includes: an input module that obtains a read set by sequencing and records the sequencing well corresponding to each read in the read set, one sequencing well containing at least one long fragment sequence; an extension module that uses the reads and the sequencing wells corresponding to the reads to extend a plurality of seed sequences in parallel, obtaining a plurality of sequence contigs, wherein the plurality of seed sequences are determined from known sequences; and a skeleton sequence construction module that constructs a skeleton sequence based on the reads, the sequence contigs, and the sequencing wells corresponding to the reads included in the sequence contigs, obtaining the assembly result for the separated long fragment sequences.
  • The seed sequence is obtained from a genomic reference sequence according to the following steps: the genomic reference sequence is broken at its N bases; and the broken reference sequence is cut into pieces of a predetermined length to obtain the seed sequences.
  • The predetermined length is not less than the length of the sequencing library insert.
  • The read set comprises paired reads, and the seed sequence is obtained from the reads according to the following steps: (1) slide-cut the reads into a plurality of Kmers and construct an index from Kmers to reads, the RKI, so that the corresponding reads can be accessed by Kmer; (2) extract from the read set a read pair that contains no high-frequency Kmer; (3) using the RKI, determine all reads corresponding to the Kmers of each of the two reads of the pair in (2), obtaining a first read group and a second read group; (4) determine the sequencing wells corresponding to the first read group and the second read group in (3), obtaining a first sequencing well set and a second sequencing well set; (5) determine the intersection of the first and second sequencing well sets in (4); if the size of the intersection does not differ significantly from the expected number of effective sequencing wells of the base, the read pair in (2) is determined to be a seed sequence.
  • The expected number of effective sequencing wells for the base is determined by the intersection of the
  • If the size of the intersection is between half and twice the expected number of effective sequencing wells of the base, the read pair in (2) is determined to be a seed sequence.
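Step (5) reduces to a set intersection followed by a range check. In the sketch below the two well sets are assumed to have been collected already through the RKI. The function name and the inclusive bounds are illustrative assumptions.

```python
def is_seed_pair(first_well_set, second_well_set, expected_wells):
    """Accept a read pair as a seed when its shared wells match expectation.

    The pair qualifies when the number of wells shared by the two read
    groups lies between half and twice `expected_wells`. Treating both
    bounds as inclusive is an assumption.
    """
    shared = len(set(first_well_set) & set(second_well_set))
    return expected_wells / 2 <= shared <= expected_wells * 2


# Wells 4..10 are shared (7 wells) against an expectation of 8.
accepted = is_seed_pair(range(1, 11), range(4, 13), expected_wells=8)
```

Too few shared wells suggests the two reads do not come from the same long fragments, while too many suggests a repeated region, so both extremes are rejected.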
  • the method further includes an index construction module coupled to the extension module for using the The read segment is slidably cut into a plurality of Kmers, and an index RKI of the Kmer to the read segment is constructed for accessing the corresponding read segment by the Kmer; then the extension module is used to implement the following: based on the read segment and its corresponding index RKI And extending the plurality of seed sequences in parallel to obtain the plurality of sequence contigs.
  • the RKI is obtained by slidingly cutting the read into a plurality of Kmers; constructing a hash with a Kmer as a key, the hash constitutes the RKI, and The hash records the frequency of the Kmer, the associated read, and the position and orientation of the Kmer on the read.
  • the seed sequence is extended by repeating the steps of: selecting a seed sequence suitable for extension; positioning the read to the seed sequence to obtain an extension sequence; positioning the extension Reads at the end of the sequence are per-base-consistent; and if the co-processing fails, heterozygous recognition, phasing processing, and/or parsing of the repetitive sequences are performed.
  • a seed sequence suitable for extension is selected by slidingly cutting a seed sequence into a Kmer; obtaining a read corresponding to the Kmer by the RKI; and the corresponding read with the seed
  • the sequences are aligned; based on the sequencing wells corresponding to the corresponding reads, determining the coverage of the seed sequences by the sequencing wells; and determining the seed sequences suitable for extension based on the coverage conditions.
  • the reading is located to the seed sequence by sliding slashing the seed sequence into a Kmer; acquiring the Kmer-compatible read by the RKI; reading the Kmer corresponding The segments are mapped to the seed sequence and aligned on a base by base basis.
  • if the set of effective sequencing wells at the extended site is split evenly between the different base types at that site, heterozygosity is judged to be present.
  • the extension sequence is split into a plurality of sequences that are extended separately.
  • the set of reads comprises a plurality of paired reads, and the distance between the two reads of a pair is L; during the consensus step, if the distance between the read located downstream in the extension direction and its paired read is not L, the position of the downstream read is determined to be the start point of a repeat sequence.
  • the end point of the repeat sequence is either the position at which the read located downstream in the extension direction lies at distance L downstream of its paired read, or a conflict site.
  • whether the repeat sequence is a tandem repeat is determined by cutting each read located to the repeat sequence with a sliding window to obtain Kmers, and determining whether any Kmer is repeated among the Kmers cut from each read; if no Kmer is repeated, it is judged that there is no tandem repeat, and if a repeated Kmer is present, it is judged that there is a tandem repeat.
  • the tandem repeat is resolved by: determining the length of the tandem repeat; and adjusting positions by aligning, at the end point, the reads that contain the end point of the tandem repeat.
  • if the repeat sequence is not a tandem repeat and one of the following conditions holds, the repeat sequence is determined to be a large double repeat sequence: the length of the repeat sequence is greater than L, or, in a paired read, the read paired with the read located downstream on the repeat sequence is also located on the repeat.
  • the large double repeat sequence is resolved by comparing, for the upstream and downstream repeat copies, the difference between the number of effective sequencing wells corresponding to each copy and the expected number of effective sequencing wells for the base, and resolving the conflict on the large double repeat based on the magnitude of these differences.
  • if the repeat sequence is not a large double repeat sequence and the following condition holds, the repeat sequence is determined to be a small double repeat sequence: the length of the repeat sequence is less than L.
  • the small double repeat is resolved by at least one of: (p) taking the mean distance between the paired reads supporting each base as the expected position of the conflict site, and determining the base at the conflict site by comparing how close the distance between the paired reads supporting each of the two conflicting bases is to the mean; (k) constructing a helper contig from the read paired with the most upstream read on the non-repetitive sequence within the standard-deviation range of the extension sequence, using the reads paired with those located on the downstream helper contig to update the read data and thereby the distance mean, and again determining the base at the conflict site by comparing how close the distance between the paired reads supporting each of the two conflicting bases is to the mean.
  • the extension of the seed sequence is terminated.
  • the skeleton sequence construction module includes a merge connection relationship establishing module, configured to establish merge connection relationships between the sequence contigs based on the reads; a primary skeleton sequence construction module, configured to construct a primary skeleton sequence based on the sequence contigs and the sequencing wells corresponding to the reads contained in the sequence contigs; and an assembly module, configured to mutually verify the merge connection relationships and the primary skeleton sequence and then merge the sequence contigs to construct the skeleton sequence, thereby obtaining the assembly result for the separated long fragment sequences.
  • the program first reads the seed data and the reads data for subsequent steps.
  • the algorithm continuously stores the read data that needs to participate in the comparison process in binary form until the entire program terminates.
  • the quality values provided by sequencing (such as the quality values in the FASTQ format) are neither recorded nor used during assembly. This information is intended for the data pre-processing steps before assembly, which remove or correct bases and reads with abnormal quality values.
  • that is, the sequences read in are assumed to be of normal sequencing quality, and differences between bases of different quality values are ignored during assembly.
  • RKI Reads Kmer Index
  • This RKI structure requires a large amount of information to be recorded, and further, since a read contains multiple Kmers, this read will appear in multiple Kmer entries, which makes memory consumption even larger.
  • memory consumption for 100X sequencing is about 3 TB, more than traditional assembly techniques, which consume less than 1 TB.
  • the data structure can still be further optimized, which can reduce assembly costs and resource bottlenecks.
  • one optimization is to disable or limit the storage of ultra-high-frequency Kmers to reduce memory overhead, because these Kmers contribute little to assembly and waste considerable time during read positioning, which is extremely inefficient.
  • the data structure design of the RKI clearly differs from that of some DBG (de Bruijn graph) algorithms. Those algorithms construct a graph of the relationships between Kmers as they read the reads from file, and do not need to keep the reads in memory;
  • nor do they need to record in detail which reads each Kmer corresponds to, because the representation of the genome has been transformed from a jumble of reads into a set of logically associated Kmers, and the data has been extensively organized and compressed. After construction, only the Kmer graph needs to be operated on; the reads need not be manipulated directly. This algorithm, however, does not construct a whole graph. During seed extension (genome construction), it must continually access reads through Kmers to obtain material for extension (assembly).
  • experiment-related parameters also need to be read in, such as the insert size of the paired reads in library construction, the number of wells used to separate the LFRs, the number of input cells, and the ploidy of the target genome (obtained by optical means or informatics analysis).
  • the insert size and the number of input cells can be estimated statistically when the seed sequences are initialized.
  • the ploidy of the target genome can also be calculated through heterozygosity identification.
  • initialization obtains relevant information on the seed, which is used for the subsequent legitimacy determination and as material for seed extension.
  • the legitimacy determination discards seeds that lie in an already-extended region and seeds that are unsuitable for extension.
  • the seed is cut with a sliding window into Kmers, the reads corresponding to these Kmers are obtained through the RKI, these reads are then finely aligned to the seed, and the well information of these reads is used to determine how the wells cover the seed.
  • EW Effective Well
  • LFR length 100 Kbp
  • the EW set is determined by how the wells cover the extended (or initialized) region.
  • when an LFR first covers the extended seed, its coverage of the seed rises as the seed extends, until the seed is completely covered.
  • the well corresponding to an LFR that completely covers the seed can be used as an EW to assist extension; as the seed continues to extend forward, the LFR gradually leaves the region into which the seed extends, and its coverage decreases until it reaches 0.
  • the corresponding well is then no longer used as an EW to assist extension. This produces a transitional change in the EW set during extension.
  • The EW set at the time of extension guides read filtering during assembly. It separates the reads belonging to the vicinity of the current seed from the great majority of reads from the rest of the genome, so that regions that are complex at whole-genome scale degenerate into simple regions at the scale of one LFR length and become easy to assemble; this is the core of the separated long fragment approach.
  • This assembly method, which considers a set of multiple wells while extending, differs from the conventional two-stage assembly method that considers a single well (i.e., each well is assembled separately and the results are then combined for a second round of assembly).
  • the difference lies in the use of EWs.
  • This approach allows the depth information required for assembly to be extended from a single well to the cumulative depth of the entire EW set, reducing the sequencing depth required per well so that only low coverage is needed.
  • This approach greatly reduces the cost of sequencing and saves time.
  • transitional changes of EW can also be used to determine the positional relationship between contigs and to construct extremely long skeleton sequences.
  • both initialization and extension need to align reads to the seed to obtain information (such as well and read-pairing information) and material for the extension sequence. Aligning all reads is impractical and would consume a great deal of time, so the RKI is used to screen the reads: only reads with Kmers that hit the seed are finely aligned, as follows:
  • 3. Read filtering removes reads that are mislocated due to repeats.
  • the algorithm first uses the EW information to filter out reads mislocated due to genome-wide repeats, and then uses the mate-pair information to filter out right-arm reads mislocated due to simple repeats inside the LFR (left-arm reads are not filtered). Note that left-arm reads in a repeat region do not participate in the legitimacy judgment of their corresponding right-arm reads, because they may themselves be mislocated; complex types of repeats are handled by the repeat-resolution module.
  • the reads positioned at the end of the contig need to be merged into a base-by-base consensus. As in read alignment, only substitution-type sequencing errors are tolerated during consensus; indel-type errors are not. Since mate-pair filtering can only be applied to right-arm reads, the right-arm reads are free of positioning errors and have a clear advantage over the left-arm reads. Based on this, when enough filtered right-arm reads are available, the module uses only these reads for consensus. If they can be merged into one sequence, the extension is complete and the module for updating related information is entered.
  • during base-by-base site merging in the consensus step, if more than one base is found at non-low frequency at the same position (low frequency is mainly caused by sequencing errors), a conflict arises and consensus at that site is judged to have failed, which is mainly caused by heterozygosity and repeats. Compared with repeats, heterozygosity in a diploid genome is relatively simple to identify: there are only two conflicting bases, each supported by about half of the EW set, and the EW sets supporting each have almost no intersection. Heterozygosity identification is therefore performed first after a consensus failure.
  • sequencing errors do not fit this pattern.
  • the main features of sequencing errors are low frequency and randomness. Because sequencing errors account for only a very small fraction in most sequencing technologies, they show up as only a small number of differences during consensus. Even when sequencing depth fluctuates widely, the differences caused by sequencing errors remain a very small fraction: the absolute number of sequencing errors changes with sequencing depth, but the proportion does not, that is, the sequencing error rate remains unchanged.
  • their randomness shows in the fact that sequencing errors are not biased toward particular wells: errors appear with the same probability in every well. Even if the sequencing errors themselves have some bias in error pattern, the conditions are the same in each well.
  • when heterozygosity is confirmed, the contig is split into two that are extended separately, and these are phased in combination with the EW information of the previous heterozygous region.
  • two heterozygous regions from the same haploid should have similar EW sets.
  • the algorithm can also identify heterozygous large structural variants, which is its advantage over other separated long fragment pipelines for haplotype construction. Further, since this algorithm phases insertion-deletion heterozygosity and large structural variant heterozygosity in the same way, it can construct longer and more complete haplotypes than phasing methods that consider only single-substitution heterozygous sites. It also offers capabilities unavailable to resequencing methods, especially the phasing of large structural variants, which is significant for downstream analysis.
  • Repetitive sequences are identical or similar sequences that occur at different positions in the genome. They appear in large numbers; for example, the various types of repeats in the human genome together account for nearly half of its size. Repeats have always been a major factor affecting assembly quality, and whether they can be resolved correctly is a central concern of assembly algorithms, which continually try new strategies to achieve breakthroughs. Repeat resolution is naturally a key module in this algorithm. It mainly handles complex repeats within an LFR, chiefly adjacent small double repeats, large double repeats, and tandem repeats. The resolution of these repeats is described below.
  • Repeat region identification provides key information for identifying the type of repeat.
  • its main purpose is to identify left-arm reads that cannot be used for filtering, i.e., mate-pair reads located in the repeat region. A characteristic of repeat regions is that reads are incorrectly positioned there, and using incorrectly positioned reads for extension often leads to erroneous extension.
  • repeat region identification mainly consists of identifying the start point and the end point.
  • if the left arms corresponding to the right-arm reads in the extension region are not located near the position one insert length upstream, the start point of the repeat region can be determined.
  • the end points of a repeat region fall mainly into two categories.
  • the end of the repeating zone is the position where the right arm reading of the left arm is no longer present;
  • the conflict point at the time of extension is the end point.
  • difference regions appearing within repeat segments are treated as non-repeat regions; that is, similar repeat segments are strictly divided into multiple sub-repeat segments for processing.
  • the nature of the region after a difference point is identified differently.
  • the left arm corresponding to the right-arm reads at the start point of the repeat region lies in the previous repeat region rather than in a non-repeat region, so the paired reads may be positioned incorrectly. In this case, the left-arm reads located about one insert length before the extension point are checked: if their corresponding right arms participate in the extension, the current extension region is a repeat region; if some are found not to participate, extension has entered a non-repeat region.
  • since mate-pair filtering can only be applied to right-arm reads, the right-arm and left-arm reads do not have equal standing during extension. However, if there is a region in which right-arm reads can be used to check left-arm reads, the left-arm reads can help resolve the repeat. This is the concept of the Helper Contig (HC).
  • HC Helper Contig
  • HC is a contig used to resolve complex repeats; it is auxiliary only and does not appear as a formal contig in the assembly results. In essence, it makes further use of the mate-pair information of the reads: the HC is expected to cross the current repeat region and land on the downstream non-repeat region, which is then used to help resolve the repeat. If the HC fails to cross the repeat region, it generally will not help.
  • HC is applied mainly in the following two cases:
  • the algorithm first takes the mean calculated from the mate-pair information supporting each base as the expected position, and the base whose distance is closest to it is considered the one that should be extended. If the two positions are too close and the actual situation is ambiguous, an HC must be constructed to assist identification.
  • the right arm corresponding to the leftmost left arm that lies within the SD-length range of the contig end and is not in the repeat region is used as the starting point for constructing the HC, and the read is extended in a manner similar to a seed.
  • the positions of the conflicting bases can then be calculated separately from the left arms corresponding to the right-arm reads positioned on this HC.
  • the use of HC in this case only improves the reliability of the calculated distance; if the SD is large, errors may still occur.
  • the left-arm reads will not be used for read filtering. In essence, a left arm in a repeat region cannot be used to filter its right arm, regardless of whether the region between the two arms is repeated or not.
  • the two repeat sequences are far apart.
  • the bases in the upstream contain more EW than the downstream ones. This comparison can be used to resolve conflicts.
  • the HC can be constructed using the left arms corresponding to the mislocated right arms within one insert length of the start of the repeat region, and the EW set on the HC is expected to differ less from the EW set of the upstream conflicting base than from that of the downstream conflicting base.
  • it can be found that the right arms corresponding to the left-arm reads supporting the upstream base can be positioned on the HC (the HC is constructed in the same way as for the previous type of large repeat, using the left arms corresponding to the mislocated right arms within one insert length of the start of the repeat region), as shown in Figure 8a;
  • a repeat consisting of multiple consecutive copies of a repeating unit is defined as a Tandem Repeat (TR), and the shortest repeating unit in the tandem repeat region is the Tandem Repeat Unit (TRU).
  • TR Tandem Repeat
  • TRU Tandem Repeat Unit
  • a tandem repeat unit has a number of different phases (such as ACT, CTA, TAC), the number of which is equal to its length.
  • a tandem repeat region is equivalent to multiple conventional repeat regions linked together, so its resolution has both similarities to and differences from that of other conventional repeats.
  • a tandem repeat shifts the positioning of reads, which often causes consensus conflicts through repeated or divergent positioning of the reads.
  • the core of the problem is that the Kmers used to locate the reads are repeated Kmers.
  • the head of a mislocated read lies after the start of the tandem repeat region (because the non-repeat region before the start point causes a mislocated read to fail fine alignment), and the offset of the mislocated reads is smaller than the SD of the insert length, so mate-pair filtering cannot remove them. Further, if the tandem repeat is longer than the SD of the insert length, read positioning in the repeat region aggregates erroneously upstream with a period of one SD: the distance deviation of the mislocated reads is no greater than the SD of the insert length, and they concentrate at the head of each SD unit. This means that, before a conflict is encountered, each SD-length unit in the tandem repeat region is compressed and shortened, and the reads of the next SD unit are continually shifted forward.
  • TR is confirmed by finding tandem repeat units. Since a TR longer than the insert length also meets the activation condition of HC case 2, the algorithm places the TR activation decision before that of HC case 2.
  • TRU is primarily identified by discovering that Kmers recur periodically. For a TRU shorter than (read length - Kmer length + 1) (that is, the TRU appears two or more times on a read), the reads at the conflict site can be cut with a sliding window into Kmers to check whether a Kmer recurs within a read. If this holds for most reads and the recurring Kmers of these reads are identical or consistent up to TRU phase, the current conflict can be judged to be caused by a tandem repeat, and the tandem-repeat resolution module is activated. For a TRU longer than (read length - Kmer length + 1), the Kmers in the range from the start point of the repeat region to the conflict point are scanned; if Kmers recur with a fixed period, the conflict can be judged to be caused by a tandem repeat.
  • the reads at the conflict site can be divided into four categories: 1) reads inside the TR region, which are completely covered by TRUs; 2) reads containing the TR start point, in which TRUs are found only at the tail end; 3) reads containing the TR end point, in which TRUs are found only at the head end; 4) reads containing both the TR start and end points, which exist only when the TR is shorter than the read length, in which case no type 1 reads completely covered by TRUs will be found.
  • conflicts are eliminated by adjusting positions using the reads that contain the TR end point, aligned at that end point.
  • for a TR longer than the read length, the repeat can only be resolved by spanning it and filling in "N".
  • for a TR longer than the read length but shorter than the insert length, there will be right-arm reads containing the TR end point whose left arms are supported in the non-repeat region; these reads are aligned and merged into a consensus after the end point, and the insert lengths of these reads can then be relied on.
  • these conflicting reads can be assembled in a DBG manner, which naturally separates and constructs the difference sequences within the TR region and the TR end sequence; the positions of the assembled sequences are then calculated from the right-arm reads located on them and the positions of their corresponding left-arm reads on the previously extended non-repeat region, and the sequence farthest from the extension point is the correct TR end point.
  • since seeds are extended in parallel, to prevent the same region from being extended repeatedly by several seeds, reads that have participated in extension must be marked as "used"; when another seed encounters such reads during initialization or extension, it stops extending and is later connected by the contig merge module. Note that mate-pairs cannot be located unambiguously in repeat regions; to prevent extension from stopping erroneously, reads of a repetitive nature are marked as "repeat" and are not used for the redundant-extension determination.
  • when a read from a well is positioned on a contig, this may be an erroneous positioning caused by a repeat or a sequencing error, so the algorithm requires that a well have relatively sufficient coverage of the extended seed region before it is considered an EW; wells with insufficient coverage are discarded. For efficiency, the EW set is updated each time extension advances by a certain length, and only coverage within this length range is examined.
  • contigs are merged after all contigs have been extended, and contig phasing is also performed.
  • contigs that can be merged fall into two cases:
  • contigs whose extension stopped due to repeats, local sequencing depth, etc.
  • the mate-pair reads on the non-repeat region at the end of the contig are examined to see whether their mates are located on other contigs; if one is found, the distance between the two contigs is estimated from the insert length and the gap is filled with the estimated number of "N".
  • in actual processing, the merging step first establishes the relationships between contigs (overlapping or non-overlapping); these are then cross-checked against the contig relationships established in the skeleton sequence step, the relationships are simplified and conflicts resolved, and the contig merge operation itself is then performed.
  • the sequence formed by ordering contigs, with their order determined mainly by well information, is a scaffold, which differs from the traditionally defined skeleton sequence constructed from mate-pair information.
  • the scaffold in this algorithm expresses only the order of the contigs; the specific distance between contigs cannot be determined, and they are connected by a single "N".
  • the traditional scaffold not only determines the order of contigs through the mate-pairs of the reads, but also calculates the distance between contigs from the positions of the paired reads positioned on the two contigs and their insert lengths, consistent with the method used in contig merging.
  • Optimal Linear Arrangement with NP-Hard properties.
  • a contig in a scaffold also has its own orientation (the four deoxynucleotides A, C, T, and G are linked by 3',5'-phosphodiester bonds to form a polydeoxynucleotide.
  • the DNA base sequence is the representation of contigs and scaffolds in assembly.
  • the linkage between deoxynucleotides has strict directionality: the 3'-OH of one deoxynucleotide and the 5'-phosphate of the next deoxynucleotide form a 3',5'-phosphodiester bond, yielding a linear DNA macromolecule with no branching.
  • DNA defines the 5' end to the 3' end as the "+" direction and the 3' end to the 5' end.
  • a contig with the correct position but the wrong orientation also causes a large assembly error, appearing as a serious sub-sequence inversion error.
  • a well will appear in multiple adjacent contigs, through which a set of adjacent contigs can be obtained.
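
The list above describes the RKI as a hash keyed by Kmer that records each Kmer's frequency, its associated reads, and the Kmer's position and orientation on each read, and describes using it to screen reads against a seed. The following Python sketch illustrates that idea only; it is not the patent's implementation. The names `Hit`, `build_rki`, and `reads_hitting` are hypothetical, and storing each Kmer under the lexicographically smaller of itself and its reverse complement (a "canonical" key) is an assumption made here so that both strands share one entry.

```python
from collections import defaultdict
from typing import NamedTuple


class Hit(NamedTuple):
    read_id: int  # index of the read in the input list
    pos: int      # 0-based offset of the Kmer on the read
    strand: str   # "+" if the key equals the Kmer as read, "-" if it is the reverse complement


_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    """Reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmers(seq: str, k: int):
    """Yield (offset, Kmer) for every window of length k, sliding one base at a time."""
    for i in range(len(seq) - k + 1):
        yield i, seq[i:i + k]


def build_rki(reads, k):
    """Build a Kmer -> list of Hit index (assumed canonical-key layout).

    The frequency of a Kmer is simply len(index[kmer]).
    """
    index = defaultdict(list)
    for read_id, seq in enumerate(reads):
        for pos, kmer in kmers(seq, k):
            rc = revcomp(kmer)
            if kmer <= rc:
                index[kmer].append(Hit(read_id, pos, "+"))
            else:
                index[rc].append(Hit(read_id, pos, "-"))
    return index


def reads_hitting(seed, index, k):
    """Return the ids of reads sharing at least one Kmer with the seed.

    These are only candidates; the text requires a fine alignment afterwards.
    """
    hit_ids = set()
    for _, kmer in kmers(seed, k):
        key = min(kmer, revcomp(kmer))
        hit_ids.update(hit.read_id for hit in index.get(key, ()))
    return hit_ids
```

Because every read contributes one entry per window, a read appears under many keys, which is the memory cost the list attributes to the RKI; capping the length of very long hit lists would be one way to realize the suggested limit on very frequent Kmers.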

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Organic Chemistry (AREA)
  • Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Zoology (AREA)
  • Biotechnology (AREA)
  • Wood Science & Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Genetics & Genomics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Medical Informatics (AREA)
  • Analytical Chemistry (AREA)
  • Theoretical Computer Science (AREA)
  • Immunology (AREA)
  • Microbiology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biochemistry (AREA)
  • General Engineering & Computer Science (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

A method, apparatus, and system for assembling separated long fragment sequences. The method comprises: (a) acquiring a read set by sequencing and recording the sequencing well corresponding to each read in the read set, each sequencing well containing at least one long fragment sequence; (b) using the reads and their corresponding sequencing wells to extend multiple seed sequences in parallel to obtain multiple sequence contigs, the seed sequences being determined from known sequences; and (c) constructing a skeleton sequence based on the reads, the sequence contigs, and the sequencing wells corresponding to the reads contained in the sequence contigs, to obtain the assembly result for the separated long fragment sequences.

Description

对分隔长片段序列进行组装的方法和装置Method and apparatus for assembling long segment sequences 技术领域Technical field
本发明涉及生物技术领域,具体地,本发明涉及对分隔长片段序列进行组装的方法和装置。The present invention relates to the field of biotechnology, and in particular, to a method and apparatus for assembling sequences that separate long segments.
背景技术Background technique
从第一次人类基因组得到较完整的构建至今已有十数载,有了基因组图谱,各种针对人基因组重测序的生物信息学分析方法和软件如雨后春笋般涌现,对人类疾病、医学和健康研究的发展作出了重大贡献。Since the first human genome has been more completely constructed, it has been more than ten years. With genomic maps, various bioinformatics analysis methods and software for human genome resequencing have sprung up, for human diseases, medicine and health. The development of the research has made a significant contribution.
比较基因组学(Comparative Genomics)是一门对已知的基因和基因组结构进行比较以了解基因的功能、表达机理和物种进化的学科。其基本原则是两生物间相同的特性通常是由进化保守的DNA所编码的,那么两者间相关的DNA片段将是相同或相近的。比较基因组学方面的分析所需要的是基因组图谱(基因组参考序列)的存在和对比较对象的测序。变异检测则是比较基因组学中的重要内容,国际人类基因组单体型图计划(The International HapMap Project)和全基因组关联分析(Genome-wide association study,GWAS)正是基于对单核苷酸多态性(single-nucleotide polymorphism,SNP)这种类型的变异来进行相关研究的。Comparative Genomics is a discipline that compares known gene and genomic structures to understand gene function, expression mechanism, and species evolution. The basic principle is that the same characteristics between two organisms are usually encoded by evolutionarily conserved DNA, and the relevant DNA fragments will be the same or similar. What is needed for comparative genomics analysis is the presence of a genomic map (genomic reference sequence) and sequencing of the compared objects. Mutation detection is an important part of comparative genomics. The International HapMap Project and Genome-wide Association Study (GWAS) are based on single nucleotide polymorphisms. This type of variation in single-nucleotide polymorphism (SNP) is used for related research.
Assembly is the process of integrating shorter fragment sequences into longer sequences. Because of the limits of sequencing technology, a genome cannot be read out in full by sequencing each chromosome from end to end. Instead, the whole genome is usually broken into fragments tens to thousands of bases long, the fragments are read by massively parallel sequencing, and assembly is then used to analyze and integrate them and restore a relatively complete genome sequence. Identifying variants through assembly is a newer use of assembly technology; the main purpose of assembly is to construct genomes. When no genomic reference sequence exists, genome construction starts from nothing and assembly becomes especially important; this kind of assembly is called de novo assembly. The human genome reference sequence is about 3 Gbp (3 × 10^9 bp) long, which is the haploid base-pair count. Humans are diploid, so the actual genome size is 2 × 3 Gbp = 6 Gbp. The diploid genome of a human individual comes from one haploid set contributed by each parent, and the differences between the two sets give the individual different alleles at one or more loci on homologous chromosomes, a phenomenon called heterozygosity. Moreover, the human reference sequence was built from the data of multiple individuals, so it is in effect a haploid genome that is a mosaic of various heterozygous states. As genome research deepens, the haploid human reference sequence increasingly fails to meet demand, haplotype construction (haplotyping) is becoming more important, and genome analyses based on haplotype information keep appearing.
Haplotype information helps to interpret the relationship between genotype and phenotype in depth. Two individuals with exactly the same set of heterozygous sites can still differ in phenotype and disease susceptibility because their haplotypes differ. Haplotype information is therefore significant for functional studies (such as allele-specific expression) and disease studies (such as Mediterranean fever and breast cancer). Assigning the heterozygous information at multiple heterozygous sites in order to determine haplotypes is called phasing, and it is a key operation in haplotype construction. Many haplotype construction methods exist, falling mainly into the following five categories:
1. Population-statistics methods applied to data from multiple unrelated individuals
2. Mendelian-inheritance methods applied to pedigree data
3. Methods that directly use sequencing read information
4. Experimental methods
5. Physical separation methods
It should be emphasized that the core of the physical separation approach is to partition sequences that have been broken into long DNA fragments, whether through fosmid clones or through direct physical partitioning in a multiwell plate. After partitioning, the further fragmentation and amplification needed for sequencing (library construction) are carried out, and a different barcode sequence is attached to the sequences in each partition so that the partitions can be told apart. Partitioning divides the whole genome into many sub-portions. When the number of sub-portions is large, each sub-portion contains only one of the haploid copies of a small region of the genome. Regions that are heterozygous at the whole-genome level therefore appear homozygous within these small regions, which is extremely important for haplotype construction.
Because each partition has its own unique barcode sequence, the reads belonging to each partition can be recovered after sequencing by identifying the barcode. In fosmid-based partitioning the partition is called a fosmid pool, and each fosmid pool contains one or more long fragments of about 37 Kbp. In multiwell-plate partitioning the partition is called a well, and each well contains multiple long fragments whose length varies between technologies. Either way, partitioning provides a new type of information: the barcodes divide the full set of reads into many groups. In contrast to whole genome shotgun (WGS) sequencing, reads no longer originate from arbitrary positions across the whole genome; instead, the reads in each group all come from a shared set of small regions, namely the regions covered by the long fragments in that partition. These small regions still originate from arbitrary positions in the genome, but the reads within each group are constrained and clustered, and the information added by this clustering is the key to haplotype construction.
Long Fragment Reads (LFR) technology greatly reduces the complexity of library construction, lowering both the time and the cost of haplotype construction.
However, assembly methods for current long fragment read technology still need improvement.
Summary of the invention
The present invention aims to solve at least one of the above technical problems at least to some extent, or at least to provide a useful commercial alternative.
In a first aspect, the present invention provides a method for assembling separated long fragment sequences, comprising:
(a) obtaining a read set by sequencing and recording the sequencing well corresponding to each read in the read set, each sequencing well containing at least one long fragment sequence;
(b) using the reads and their corresponding sequencing wells to extend multiple seed sequences in parallel so as to obtain multiple contigs, the seed sequences being determined from known sequences;
(c) constructing a scaffold sequence on the basis of the reads, the contigs, and the sequencing wells corresponding to the reads contained in the contigs, so as to obtain the assembly result for the separated long fragment sequences.
Those skilled in the art will understand that all or some of the steps of the method of this aspect of the invention may be carried out by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, which may include read-only memory, random-access memory, a magnetic disk, an optical disc, or the like.
In a second aspect, the present invention provides a system for assembling separated long fragment sequences, the system comprising: a data input unit for inputting data; a data output unit for outputting data; a storage unit for storing data, including a computer-executable program; and a processor connected to the data input unit, the data output unit, and the storage unit and configured to execute the computer-executable program, where executing the program includes carrying out the above method.
In a third aspect, the present invention provides an apparatus for assembling separated long fragment sequences, comprising:
an input module, which obtains a read set by sequencing and records the sequencing well corresponding to each read in the read set, each sequencing well containing at least one long fragment sequence;
an extension module, which uses the reads and their corresponding sequencing wells to extend multiple seed sequences in parallel so as to obtain multiple contigs, the seed sequences being determined from known sequences;
a scaffold construction module, which constructs a scaffold sequence on the basis of the reads, the contigs, and the sequencing wells corresponding to the reads contained in the contigs, so as to obtain the assembly result for the separated long fragment sequences.
The method and/or apparatus of the present invention described above has at least the following features and advantages:
Iterative extension: extension of a seed sequence is an iterative process. The portion extended in the current iteration serves as the base for extension in the next iteration, so that the seed sequence keeps growing.
Linear extension: unlike graph-based assembly algorithms, linear extension keeps the structures encountered during assembly relatively simple, with logical relationships that are clear and easy to classify and handle. A graph algorithm gathers the information of all copies of a repeat in one place, so it involves more information and produces more complex structures. In other words, the graph approach resolves a repeat once and for all: once one repeat region is resolved, all genomic regions associated with it (which share the same or similar repeat sequences) are resolved at the same time. The linear method resolves only the repeat in the current region and does not resolve the other regions at the same time; when extension reaches another copy of the repeat elsewhere in the genome, the repeat is resolved again. After a repeat is resolved, the relevant information is recorded, so that the information encountered the next time the repeat is resolved is simpler, which reduces the extra computational cost. In addition, linear extension makes the extension of each seed sequence independent of the others, which makes the computation easier to parallelize.
Multi-path extension: because the goal of the algorithm is to construct a polyploid genome, extension of a seed sequence uses sequencing well information to extend multiple haplotypes jointly along multiple paths. Whenever a heterozygous region is encountered, phasing is performed immediately in light of the heterozygous regions already assembled; that is, phasing is also carried out linearly and runs through every module of the assembly apparatus. To save resources, homozygous regions are still extended along a single path; the extension splits into multiple paths when a heterozygous region is encountered and merges back into a single path once the heterozygous region has been extended.
Global information decisions: an algorithm that uses only the information relevant to the current decision, without considering the later choices across the global range, is prone to errors or may fail to reach the global optimum (the Dijkstra algorithm for single-source shortest paths is an example of a greedy algorithm). In genome assembly, greedy algorithms often cause misassemblies through wrong choices, so this method does not use a greedy algorithm to resolve conflicts during extension and instead makes its decisions using global information. Because LFR fragments are long (~100 Kbp), path selection takes into account all the information available at a position (Kmer, read, mate-pair, and well information), and the well information lets the decision span about 100 Kbp. Provided that large repeats are not abundant, the algorithm is essentially global. When extension encounters a complex situation, the algorithm strictly chooses to terminate extension rather than extend along the higher-scoring option, thereby avoiding a greedy decision mode. Notably, because extension considers all the well information at a position, the algorithm is more global than conventional hierarchical assembly algorithms and can substantially lower the sequencing depth requirement: no individual well needs a particularly high sequencing depth, which saves considerable cost and time.
Additional aspects and advantages of the invention will be set forth in part in the description that follows, will in part become apparent from that description, or will be learned through practice of the invention.
Brief description of the drawings
The above and/or additional aspects and advantages of the invention will become apparent and readily understood from the following description of the embodiments taken in conjunction with the drawings, in which:
Figure 1 is a flowchart of a method for assembling separated long fragment sequences in an embodiment of the invention.
Figure 2 is a flowchart of a method for assembling separated long fragment sequences in an embodiment of the invention.
Figure 3 is a schematic structural diagram of an apparatus for assembling separated long fragment sequences in an embodiment of the invention.
Figure 4 is a schematic structural diagram of an apparatus for assembling separated long fragment sequences in an embodiment of the invention.
Figure 5 is a schematic diagram of the iterative seed sequence extension process in an embodiment of the invention.
Figure 6 is a schematic diagram of identifying a repeat region in an embodiment of the invention.
Figure 7 is a schematic diagram of resolving a conflict in a large double repeat whose two repeat regions are far apart, in an embodiment of the invention.
Figure 8 is a schematic diagram of resolving a conflict in a large double repeat whose two repeat regions are close together, in an embodiment of the invention.
Figure 9 is a schematic diagram of resolving a conflict in a tandem repeat in an embodiment of the invention.
Detailed description
Embodiments of the invention are described in detail below, and examples of the embodiments are shown in the drawings, in which the same or similar reference numerals denote the same or similar elements, or elements with the same or similar functions, throughout. The embodiments described below with reference to the drawings are exemplary; they are intended to explain the invention and are not to be construed as limiting it.
In a first aspect, the present invention provides a method for assembling separated long fragment sequences. Referring to Figure 1, the method comprises:
(a) obtaining a read set by sequencing and recording the sequencing well corresponding to each read in the read set, each sequencing well containing at least one long fragment sequence;
(b) using the reads and their corresponding sequencing wells to extend multiple seed sequences in parallel so as to obtain multiple contigs, the seed sequences being determined from known sequences;
(c) constructing a scaffold sequence on the basis of the reads, the contigs, and the sequencing wells corresponding to the reads contained in the contigs, so as to obtain the assembly result for the separated long fragment sequences.
According to an embodiment of the invention, the seed sequences are obtained from a genomic reference sequence by the following steps:
breaking the genomic reference sequence at each N; and
cutting the broken reference sequence into pieces of a predetermined length to obtain the seed sequences.
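The two steps above can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the function name `seeds_from_reference` is hypothetical, `N` is assumed to be the unknown-base character in the reference, and dropping a trailing piece shorter than the predetermined length is an assumption that the text does not specify.

```python
import re
from typing import List


def seeds_from_reference(reference: str, seed_len: int) -> List[str]:
    """Break a reference at runs of N, then cut each piece into fixed-length seeds."""
    if seed_len <= 0:
        raise ValueError("seed_len must be positive")
    seeds = []
    # Step 1: break the reference wherever one or more N characters occur.
    for block in re.split(r"[Nn]+", reference):
        # Step 2: cut each N-free block into consecutive pieces of seed_len.
        # Assumption: a trailing remainder shorter than seed_len is discarded.
        for start in range(0, len(block) - seed_len + 1, seed_len):
            seeds.append(block[start:start + seed_len])
    return seeds
```

In practice `seed_len` would be chosen relative to the library insert size rather than the toy values used for testing.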
The predetermined length is not specifically limited. In general it is no smaller than the insert size, so that reads from the same insert can be mapped to the same seed sequence. According to an embodiment of the invention, the sequencing is paired-end sequencing and the predetermined length is at least twice the insert length of the sequencing library, which helps paired reads map to the same seed sequence.
According to an embodiment of the invention, the read set includes paired reads, and the seed sequences are obtained from the reads by the following steps:
(1) cutting the reads into multiple Kmers with a sliding window and constructing an index from Kmers to reads, RKI, used to access the corresponding reads through a Kmer;
(2) extracting from the read set a pair of paired reads that contains no high-frequency Kmer;
(3) using the index RKI to determine all reads corresponding to the Kmers of each of the two reads of the pair from (2), giving a first read group and a second read group;
(4) determining the sequencing wells corresponding to the first read group and the second read group from (3), giving a first well set and a second well set;
(5) determining the intersection of the first and second well sets from (4) and, if the size of the intersection does not differ significantly from the expected number of effective sequencing wells per base, taking the paired reads from (2) as a seed sequence, where the expected number of effective sequencing wells per base is determined by the amount of nucleic acid in the sample.
"High frequency" is relative to the average frequency: a Kmer that occurs more often than the average Kmer can be regarded as a high-frequency Kmer. More often, the inventors define as high-frequency those Kmers whose occurrence count is far above the average, where "far above" means at least 10, 20, 30, 40, 50, or 60 times the average count. The expected Kmer frequency depends on sequencing depth, read length, and Kmer size. According to one embodiment of the invention, for K = 19 (19-mers), 100 bp reads, and 100X sequencing depth, the expected Kmer frequency is 100 (sequencing depth) × {[100 (read length) - 19 (Kmer size) + 1] / 100 (read length)} = 82; with an average frequency of 82, Kmers with a frequency in the range 3000 to 5000 are set as high-frequency Kmers.
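The expected-frequency arithmetic and the high-frequency cutoff described above can be written out as a short sketch. The function names and the `fold` parameter are illustrative assumptions; the text gives only the formula and a list of possible multiples (10 to 60 times the average).

```python
def expected_kmer_frequency(depth: float, read_len: int, k: int) -> float:
    """Expected Kmer count: depth * (read_len - k + 1) / read_len."""
    if not 0 < k <= read_len:
        raise ValueError("k must be between 1 and read_len")
    return depth * (read_len - k + 1) / read_len


def is_high_frequency(count: int, mean_frequency: float, fold: float = 10.0) -> bool:
    """Flag a Kmer whose count is at least `fold` times the mean frequency.

    Assumption: `fold` stands for one of the multiples listed in the text
    (10, 20, ..., 60); the choice is left to the user.
    """
    return count >= fold * mean_frequency
```

With the values of the embodiment (depth 100, read length 100, K = 19), `expected_kmer_frequency(100, 100, 19)` evaluates to 82.0, matching the figure in the text.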
"No significant difference" may mean no statistically significant difference, or no obvious difference in the ordinary sense. According to an embodiment of the invention, in (5): if the size of the intersection is between half and twice the expected number of effective sequencing wells per base, the paired reads from (2) are taken as a seed sequence.
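The half-to-twice criterion of step (5) reduces to a set intersection and a range check. The sketch below is illustrative only: `is_seed_pair` is a hypothetical name, well identifiers are assumed to be strings, and treating both bounds as inclusive is an assumption.

```python
from typing import Set


def is_seed_pair(wells_first: Set[str], wells_second: Set[str],
                 expected_wells: float) -> bool:
    """Accept a read pair as a seed if the wells shared by its two read groups
    number between half and twice the expected effective-well count."""
    shared = len(wells_first & wells_second)
    # Assumption: both bounds are inclusive.
    return expected_wells / 2 <= shared <= expected_wells * 2
```

A pair whose two read groups share far too many wells is likely to sit in a repeat, and a pair sharing too few lacks well support, so both cases are rejected.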
Of the two ways of generating seed sequences described above, the former builds seeds from known sequences alone, for example a reference sequence, reads obtained by sequencing, or assembled fragments. The latter builds seeds by combining known sequences with RKI and sequencing well information, and can be applied to de novo assembly of species that have no reference sequence at all.
According to an embodiment of the invention, (b) comprises: (i) cutting the reads into multiple Kmers with a sliding window and constructing an index from Kmers to reads, RKI, used to access the corresponding reads through a Kmer; (ii) extending the multiple seed sequences in parallel on the basis of the reads and the index RKI to obtain the multiple contigs.
According to an embodiment of the invention, the RKI is obtained by the following steps: cutting the reads into multiple Kmers with a sliding window; and constructing a hash keyed by Kmer, the hash constituting the RKI and recording, for each Kmer, its frequency, the reads it belongs to, and its position and orientation on each read.
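A minimal sketch of such a Kmer-keyed index is given below. The names `build_rki` and `kmer_frequency` are hypothetical. Storing each Kmer under the lexicographically smaller of itself and its reverse complement, with a strand flag recording the orientation, is a common convention assumed here; the text says only that orientation is recorded. Reads are assumed to contain only A, C, G, and T.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

_COMPLEMENT = str.maketrans("ACGT", "TGCA")
Hit = Tuple[str, int, str]  # (read id, position on the read, strand)


def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


def build_rki(reads: Dict[str, str], k: int) -> Dict[str, List[Hit]]:
    """Map each Kmer to the reads, positions, and orientations where it occurs."""
    index = defaultdict(list)
    for read_id, seq in reads.items():
        for pos in range(len(seq) - k + 1):  # sliding window, step 1
            kmer = seq[pos:pos + k]
            rc = reverse_complement(kmer)
            # Assumption: key on the canonical (smaller) form of the Kmer.
            if kmer <= rc:
                index[kmer].append((read_id, pos, "+"))
            else:
                index[rc].append((read_id, pos, "-"))
    return dict(index)


def kmer_frequency(rki: Dict[str, List[Hit]], kmer: str) -> int:
    """Frequency of a canonical Kmer is the number of recorded hits."""
    return len(rki.get(kmer, []))
```

Keeping the hit list per Kmer makes the frequency implicit in its length, so no separate counter is needed in this sketch.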
According to an embodiment of the invention, a seed sequence is extended by repeating the following steps: selecting a seed sequence suitable for extension; mapping the reads to the seed sequence to obtain an extension sequence; performing base-by-base consensus processing on the reads mapped at the end of the extension sequence; and, if consensus processing fails, performing heterozygosity identification, phasing, and/or repeat resolution.
According to an embodiment of the invention, a seed sequence suitable for extension is selected by the following steps: cutting the seed sequence into Kmers with a sliding window; obtaining the reads corresponding to the Kmers through the RKI; aligning the corresponding reads to the seed sequence; determining, from the sequencing wells corresponding to those reads, how the sequencing wells cover the seed sequence; and determining, from that coverage, the seed sequence suitable for extension.
According to an embodiment of the invention, the reads are mapped to the seed sequence by the following steps: cutting the seed sequence into Kmers with a sliding window; obtaining the reads corresponding to the Kmers through the RKI; and mapping those reads to the seed sequence and aligning them base by base.
According to an embodiment of the invention, during consensus processing, if the set of effective sequencing wells at the site being extended is split evenly among the different base types at that site, heterozygosity is judged to be present.
According to an embodiment of the invention, after heterozygosity is determined to be present, the extension sequence is split into multiple branches that are extended separately.
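The heterozygosity test can be illustrated with a small sketch. The text says only that the effective wells are "split evenly" among base types; the `min_ratio` threshold, the comparison of the two best-supported bases, and the name `is_heterozygous` are all assumptions made for illustration.

```python
from typing import Dict, Set


def is_heterozygous(wells_by_base: Dict[str, Set[str]],
                    min_ratio: float = 0.5) -> bool:
    """Judge a site heterozygous when the two best-supported bases are backed
    by comparable numbers of effective wells.

    Assumption: "evenly split" is approximated as
    second-largest count / largest count >= min_ratio.
    """
    counts = sorted((len(w) for w in wells_by_base.values() if w), reverse=True)
    if len(counts) < 2:
        return False  # only one base is supported: no heterozygosity signal
    return counts[1] / counts[0] >= min_ratio
```

When the function returns True, the extension would branch into one path per supported base, as the preceding paragraph describes.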
According to an embodiment of the invention, the read set contains multiple pairs of paired reads, and the distance between the two reads of a pair is L. During consensus processing, if the distance between the read of a pair that maps downstream in the extension direction and its mate is not L, the position to which that downstream read maps is determined to be the start of a repeat sequence.
It should be noted that most numerical values in the present invention are statistical in nature. Unless otherwise stated, any value expressed exactly may therefore represent a range, for example the interval within plus or minus 10% of the value; or, where the quantity is normally distributed overall, the interval within plus or minus the standard deviation of the value. The distance L between the two reads of a pair corresponds to the insert length. In the experimental stage the size of the constructed sequencing library is generally fixed, that is, the insert size is a set value. In theory, the distance between the outer ends of the paired reads obtained by paired-end sequencing equals that set value, whereas in actual sequencing data the distance between paired reads is normally distributed. In the repeat identification and resolution of this example, the inventors set L to the insert size of the experimental stage. Those skilled in the art will understand that setting L to any value within plus or minus the standard deviation of the experimental insert size, or to that interval itself, also allows repeats encountered during extension to be identified and resolved with the repeat resolution method of the invention.
According to an embodiment of the invention, the end of the repeat sequence is either the position of a downstream-mapping read whose distance to its mate is L, or the conflict site encountered during extension.
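The start-of-repeat test based on abnormal pair distances can be sketched as below. The function name, the representation of a pair as two mapped coordinates, the `tolerance` parameter (standing in for the statistical range around L), and the assumption that extension proceeds toward increasing coordinates are all illustrative.

```python
from typing import List, Optional, Tuple


def find_repeat_start(pairs: List[Tuple[int, int]], insert_size: int,
                      tolerance: int) -> Optional[int]:
    """Return the most upstream downstream-read position whose pair distance
    deviates from the insert size L, or None if every pair looks normal.

    Each pair is (upstream_read_position, downstream_read_position).
    """
    abnormal = [down for up, down in pairs
                if abs((down - up) - insert_size) > tolerance]
    return min(abnormal) if abnormal else None
```

For instance, with L = 500 and a tolerance of 50, pairs mapped at (0, 500) and (100, 600) are normal, while a pair at (200, 950) marks position 950 as a candidate repeat start.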
According to an embodiment of the invention, whether the repeat sequence is a tandem repeat is judged by the following steps: cutting the reads mapped to the repeat sequence into Kmers with a sliding window; and judging whether the Kmers obtained from each read contain a repeated Kmer. If no repeated Kmer exists, no tandem repeat is judged to be present; if a repeated Kmer exists, a tandem repeat is judged to be present.
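The repeated-Kmer test can be sketched in a few lines. The function names are hypothetical, and reporting a tandem repeat as soon as any single read contains a repeated Kmer is one reading of the text, labeled here as an assumption.

```python
from typing import Iterable


def has_repeated_kmer(read: str, k: int) -> bool:
    """True if any Kmer occurs more than once within the same read."""
    seen = set()
    for pos in range(len(read) - k + 1):
        kmer = read[pos:pos + k]
        if kmer in seen:
            return True
        seen.add(kmer)
    return False


def is_tandem_repeat(reads: Iterable[str], k: int) -> bool:
    """Assumption: one read with a repeated Kmer is enough to call a tandem repeat."""
    return any(has_repeated_kmer(read, k) for read in reads)
```

A real implementation would likely require support from several reads before making the call, since a single read can contain a repeated Kmer by chance when K is small.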
According to an embodiment of the invention, if the repeat sequence is a tandem repeat, it is resolved by the following steps: determining the length of the tandem repeat; and repositioning the reads that contain the end of the tandem repeat by aligning them at that end, so as to determine the base at the conflict position.
According to an embodiment of the invention, if the repeat sequence is not a tandem repeat, it is judged to be a large double repeat when either of the following holds: the length of the repeat sequence is greater than L, or the mate of a read that maps downstream of the repeat sequence also maps onto the repeat sequence.
According to an embodiment of the invention, if the repeat sequence is a large double repeat, it is resolved as follows: the difference between the number of effective sequencing wells for the upstream repeat copy and the number for the downstream repeat copy is compared with the expected number of effective sequencing wells per base, and the conflict in the large double repeat is resolved according to the size of that difference. For example, if the two copies in a large double repeat are relatively far apart, the upstream conflicting base is supported by markedly more effective sequencing wells (EW) than the downstream one, where "markedly" is defined as a difference greater than half the expected EW count; the base can then be determined by this comparison, resolving the conflict. If the difference between the upstream and downstream EW counts is less than or equal to half the expected EW count, a helper contig (HC) can be built from the upstream arms corresponding to the erroneous downstream arms that lie within one insert length of the start of the repeat region, and the difference between the EW set on the HC and the EW set of the upstream conflicting base is then compared with its difference from the EW set of the downstream conflicting base.
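The first decision in this procedure, whether the EW counts alone settle the conflict, can be sketched as follows. The function name and the string return values are hypothetical, and the helper-contig branch is only signaled, not implemented.

```python
def resolve_by_well_counts(upstream_ew: int, downstream_ew: int,
                           expected_ew: float) -> str:
    """Decide how to resolve a conflict in a large double repeat.

    Returns "upstream" when the upstream conflicting base has markedly more
    effective wells (difference greater than half the expected EW count),
    and "helper_contig" when the counts are too close and a helper contig
    (HC) comparison is needed instead.
    """
    if upstream_ew - downstream_ew > expected_ew / 2:
        return "upstream"
    return "helper_contig"
```

Note that a difference exactly equal to half the expected EW count falls into the helper-contig branch, following the "less than or equal to" wording of the text.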
According to an embodiment of the present invention, if the repeat sequence is not a large double repeat and its length is less than L, it is judged to be a small double repeat.
According to an embodiment of the present invention, the small double repeat is resolved by at least one of the following: (p) the mean of the distances between the paired reads supporting each base is taken as the expected position of the conflict site, and the base at the conflict site is determined by comparing how close the pair distances supporting each of the two conflicting bases are to that mean; (k) a helper contig is built from the mates of the most upstream reads that map to non-repeat sequence within the standard-deviation range of the extended sequence, and the base at the conflict site is determined using the mates of the paired reads that map downstream of the helper contig.
According to an embodiment of the present invention, if the small double repeat cannot be resolved, that is, the conflict cannot be settled, extension of the seed sequence is terminated.
Referring to FIG. 2, according to an embodiment of the present invention, (c) includes: (iii) establishing, based on the reads, merge-and-join relationships between the contigs; (iv) constructing a primary scaffold based on the contigs and the wells corresponding to the reads contained in the contigs; (v) after cross-checking the merge-and-join relationships between the contigs against the primary scaffold, merging the contigs to construct the scaffold and obtain the assembly result for the separated long fragment sequences.
The method of any of the above embodiments of the present invention has at least one of the following four characteristics and advantages:
Iterative extension: extension of the seed sequence is an iterative process. The portion added in one round serves as the base for the next round, so the seed sequence can keep growing.
Linear extension: unlike graph-based assembly algorithms, linear extension keeps the structures encountered during assembly relatively simple, with logical relationships that are comparatively clear and easy to classify and handle. A graph algorithm gathers the information from all copies of a repeat region in one place, which involves more information and produces more complex structures. In other words, the graph approach resolves a repeat once and for all: as soon as one repeat region is resolved, every genomic region associated with it (all carrying the same or a similar repeat sequence) is resolved at the same time. The linear method resolves only the repeat in the current region and leaves the other regions alone; when extension later reaches the same repeat in another region of the genome, the repeat is resolved again. After a repeat has been resolved, the relevant information is recorded, so the next encounter with that repeat is simpler to handle, which reduces the extra computational cost. In addition, linear extension makes the extension of each seed sequence independent of the others, which makes the computation easier to parallelize.
Multi-way extension: because the goal of the algorithm is to construct a polyploid genome, extension of a seed sequence uses the well information to extend several haplotypes jointly along multiple paths. Whenever a heterozygous region is encountered, phasing is performed immediately, taking into account the heterozygous regions already assembled. Phasing therefore also proceeds linearly and runs through every module of the assembly system (for example contig merging and scaffold construction). To save resources, homozygous regions are still extended along a single path; extension splits into multiple paths when a heterozygous region is reached and merges back into a single path once the heterozygous region has been passed.
Decisions based on global information: an algorithm that uses only the information relevant to the current decision, without considering the choices available later across the global scope, is prone to errors or may fail to reach the global optimum (Dijkstra's algorithm for single-source shortest paths is a well-known greedy algorithm). In genome assembly, greedy algorithms often produce misassemblies because of wrong choices, so this algorithm does not use a greedy strategy to resolve conflicts during extension and instead bases its decisions on global information. Because LFRs are long (about 100 Kbp), path selection considers all the information available at the position (Kmer, read, mate-pair and well information), and the well information extends the range of the decision to roughly 100 Kbp. Provided large repeats are not abundant, the algorithm is essentially global. When extension runs into a complex situation, the algorithm strictly chooses to terminate extension rather than continue with the higher-scoring option, which avoids greedy decisions. It should be emphasized that, because all the well information at a position is considered during extension, the algorithm is more global than conventional hierarchical assembly algorithms and can substantially lower the required sequencing depth: no individual well needs a particularly high depth, which saves considerable resources in both cost and time.
Those skilled in the art will understand that all or some of the steps of the method of the above aspect of the present invention can be carried out by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, which may include read-only memory, random-access memory, a magnetic disk, an optical disc, and the like.
The above process is written as an executable program, which runs as follows. First, the seeds and reads are read in, and the reads are built into an index from Kmers to reads (Read Kmer Index, RKI) so that target reads can be reached quickly through their Kmers during alignment. The seeds are then extended as far as possible; because the extension of one seed is independent of the others, this step can be accelerated by parallel computation. An extended seed is a contig. At this point the well information is used to pre-build the scaffold: only the order relationships between contigs are established, and the scaffold itself is not yet constructed. Likewise, the information from the reads already used and from paired reads is used to establish merge-and-join relationships between contigs, again without merging the contigs immediately. Next, the merge relationships between contigs and the relationships in the pre-built scaffold are checked against each other, the relationships are simplified and conflicts resolved, the contigs are merged and the scaffold is constructed, and finally the assembly result is output.
In yet another aspect, the present invention provides an apparatus for assembling separated long fragment sequences, which is used to carry out the method of any of the above embodiments of the present invention. Referring to FIG. 3, the apparatus includes: an input module, which obtains a read set by sequencing and records the well corresponding to each read in the read set, one well containing at least one long fragment; an extension module, which uses the reads and their corresponding wells to extend a plurality of seed sequences in parallel to obtain a plurality of contigs, the seed sequences being determined from known sequences; and a scaffold construction module, which constructs a scaffold based on the reads, the contigs and the wells corresponding to the reads contained in the contigs, to obtain the assembly result for the separated long fragment sequences.
Those skilled in the art will understand that the processing steps or means used in any of the above embodiments of the present invention can be implemented by the corresponding functional modules of the apparatus or by the submodules they contain. The above description of the technical features and advantages of the method of the present invention applies equally to this apparatus.
According to an embodiment of the present invention, the seed sequence is obtained from a genomic reference sequence by the following steps: breaking the genomic reference sequence at its N positions; and truncating the broken reference sequence to a predetermined length to obtain the seed sequence.
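As an illustration of deriving seeds from a reference, the following sketch splits the reference at runs of N and cuts each N-free block into pieces of the predetermined length. The function name is hypothetical, and cutting into consecutive non-overlapping pieces while discarding a shorter remainder is an assumption; the text only states that the broken sequence is truncated to a predetermined length.

```python
import re


def seeds_from_reference(reference, seed_length):
    """Split a reference at runs of N, then cut each block into seeds.

    Assumption: blocks are cut into consecutive, non-overlapping pieces of
    exactly seed_length bases; a shorter trailing remainder is discarded.
    """
    seeds = []
    for block in re.split(r"[Nn]+", reference):
        for start in range(0, len(block) - seed_length + 1, seed_length):
            seeds.append(block[start:start + seed_length])
    return seeds
```

In practice seed_length would be chosen no smaller than the library insert size, as the next paragraph requires.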
According to an embodiment of the present invention, the predetermined length is not less than the insert size of the sequencing library used in the sequencing.
According to an embodiment of the present invention, the read set includes paired reads, and the seed sequence is obtained from the reads by the following steps: (1) cutting the reads into multiple Kmers with a sliding window and building the index RKI from Kmers to reads, which is used to reach the corresponding reads through a Kmer; (2) extracting from the read set a pair of paired reads that contains no high-frequency Kmer; (3) using the index RKI to determine all reads corresponding to the Kmers of each of the two reads of the pair from (2), obtaining a first read group and a second read group; (4) determining the wells corresponding to the first read group and the second read group from (3), obtaining a first well set and a second well set; (5) determining the intersection of the first well set and the second well set from (4) and, if the size of the intersection does not differ significantly from the expected number of effective wells per base, taking the paired reads from (2) as the seed sequence. The expected number of effective wells per base is determined by the amount of nucleic acid in the sample.
According to an embodiment of the present invention, in step (5) of the above embodiment: if the size of the intersection is between half and twice the expected number of effective wells per base, the paired reads from (2) are taken as the seed sequence.
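Steps (1) to (5) can be sketched as below. This is a simplified illustration under stated assumptions: all names are hypothetical, the index maps each Kmer only to the identifiers of the reads containing it, reverse complements are ignored, Kmer frequency is approximated by the number of distinct reads containing the Kmer, and the "half to twice" bounds are treated as inclusive.

```python
from collections import defaultdict


def kmers(seq, k):
    """All Kmers of seq obtained with a sliding window of width k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def index_reads_by_kmer(reads, k):
    """Step (1): map each Kmer to the ids of the reads that contain it."""
    index = defaultdict(set)
    for read_id, seq in reads.items():
        for kmer in kmers(seq, k):
            index[kmer].add(read_id)
    return index


def wells_hit(seq, index, read_well, k):
    """Steps (3)-(4): wells of all reads sharing a Kmer with seq."""
    return {read_well[r] for kmer in kmers(seq, k) for r in index.get(kmer, ())}


def is_seed_pair(read1, read2, index, read_well, k, expected_ew, max_kmer_freq):
    """Steps (2) and (5): accept a read pair as a seed."""
    # Step (2): reject pairs that contain a high-frequency Kmer.
    for kmer in kmers(read1, k) + kmers(read2, k):
        if len(index.get(kmer, ())) > max_kmer_freq:
            return False
    # Step (5): the shared well count must be within [expected/2, 2*expected].
    shared = wells_hit(read1, index, read_well, k) & wells_hit(read2, index, read_well, k)
    return expected_ew / 2 <= len(shared) <= 2 * expected_ew
```

The threshold max_kmer_freq is not given in the text and would have to be chosen from the observed Kmer frequency distribution.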
According to an embodiment of the present invention, the apparatus further includes an index construction module, connected to the extension module, for cutting the reads into multiple Kmers with a sliding window and building the index RKI from Kmers to reads, which is used to reach the corresponding reads through a Kmer. The extension module then extends the plurality of seed sequences in parallel, based on the reads and their index RKI, to obtain the plurality of contigs.
According to an embodiment of the present invention, the RKI is obtained by the following steps: cutting the reads into multiple Kmers with a sliding window; and building a hash keyed by Kmer. The hash constitutes the RKI and records, for each Kmer, its frequency, the reads it belongs to, and its position and orientation on those reads.
According to an embodiment of the present invention, the seed sequence is extended by repeating the following steps: selecting a seed sequence suitable for extension; mapping the reads to the seed sequence to obtain an extended sequence; performing base-by-base consensus calling on the reads mapped at the end of the extended sequence; and, if consensus calling fails, performing heterozygosity identification, phasing and/or repeat resolution.
According to an embodiment of the present invention, a seed sequence suitable for extension is selected by the following steps: cutting the seed sequence into Kmers with a sliding window; retrieving the reads corresponding to the Kmers through the RKI; aligning the corresponding reads to the seed sequence; determining, based on the wells corresponding to those reads, how the wells cover the seed sequence; and determining, based on that coverage, the seed sequences suitable for extension.
According to an embodiment of the present invention, the reads are mapped to the seed sequence by the following steps: cutting the seed sequence into Kmers with a sliding window; retrieving the reads corresponding to the Kmers through the RKI; and mapping those reads to the seed sequence and aligning them base by base.
According to an embodiment of the present invention, during consensus calling, if the set of effective wells at the site being extended is divided evenly among the different base types observed at that site, heterozygosity is judged to be present.
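A minimal sketch of this heterozygosity test is given below. The function name and the tolerance parameter are hypothetical; the text says only that the effective wells are divided evenly among the base types, so how much deviation from a perfectly even split is acceptable is an assumption.

```python
def is_heterozygous(wells_by_base, tolerance=0.2):
    """Judge heterozygosity from the EWs supporting each base at a site.

    wells_by_base maps a base ("A", "C", "G", "T") to the set of effective
    wells whose reads support that base. Assumption: the split counts as
    even when every count is within tolerance * (even share) of the share.
    """
    counts = [len(wells) for wells in wells_by_base.values() if wells]
    if len(counts) < 2:
        return False  # only one base type observed
    even_share = sum(counts) / len(counts)
    return all(abs(c - even_share) <= tolerance * even_share for c in counts)
```

A 5-versus-5 split of ten effective wells is reported as heterozygous, whereas a 9-versus-1 split is not and would more plausibly reflect a sequencing error or a repeat.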
According to an embodiment of the present invention, once heterozygosity has been determined to be present, the extended sequence is split into multiple sequences that are extended separately.
According to an embodiment of the present invention, the read set contains multiple pairs of paired reads, and the distance between the two reads of a pair is L. During consensus calling, if the distance between a read of a pair that maps downstream in the extension direction and its mate is not L, the position at which that downstream read maps is determined to be the start of a repeat sequence.
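The repeat-start rule can be sketched as follows. The function name is hypothetical, read pairs are represented by the mapped positions of their two reads, and the tolerance around L is an assumption: the text says only that the distance "is not L", whereas real insert sizes vary around their mean.

```python
def find_repeat_start(pairs, insert_size, tolerance):
    """Return the position of the first downstream read whose pair distance
    deviates from the expected insert size, or None if all pairs agree.

    pairs: iterable of (upstream_pos, downstream_pos) for mapped read pairs.
    """
    for upstream_pos, downstream_pos in sorted(pairs, key=lambda p: p[1]):
        if abs((downstream_pos - upstream_pos) - insert_size) > tolerance:
            return downstream_pos
    return None
```

Scanning the pairs in order of their downstream position means the most upstream inconsistent read defines the start of the repeat, which is one reasonable reading of the text.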
According to an embodiment of the present invention, the end of the repeat sequence is either the position at which a read mapped downstream in the extension direction, whose distance to its mate is L, maps, or the conflict site encountered during extension.
According to an embodiment of the present invention, whether the repeat sequence is a tandem repeat is determined by the following steps: cutting the reads mapped to the repeat sequence into Kmers with a sliding window; and checking whether the Kmers obtained from each read contain a repeated Kmer. If no repeated Kmer is found, it is judged that no tandem repeat is present; if a repeated Kmer is found, it is judged that a tandem repeat is present.
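The tandem-repeat check can be sketched as below. The function names are hypothetical, and treating the repeat as tandem as soon as any one read contains a repeated Kmer is an assumption; the text does not say how many reads must show the signal.

```python
def has_repeated_kmer(read, k):
    """True if any Kmer occurs more than once within a single read."""
    seen = set()
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        if kmer in seen:
            return True
        seen.add(kmer)
    return False


def is_tandem_repeat(reads_on_repeat, k):
    """Assumption: one read with a repeated Kmer is enough to call tandem."""
    return any(has_repeated_kmer(read, k) for read in reads_on_repeat)
```

The check works because a tandem repeat whose unit is shorter than the read places the same Kmer at more than one offset within that read.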
According to an embodiment of the present invention, if the repeat sequence is a tandem repeat, it is resolved by the following steps: determining the length of the tandem repeat; and adjusting the positions of the reads that contain the end of the tandem repeat by aligning them at that end.
According to an embodiment of the present invention, if the repeat sequence is not a tandem repeat, it is judged to be a large double repeat when either of the following holds: the length of the repeat sequence is greater than L, or, for a paired read one of whose reads maps downstream of the repeat sequence, the corresponding mate read also maps onto that repeat sequence.
According to an embodiment of the present invention, the large double repeat is resolved as follows: the difference between the number of effective wells corresponding to the upstream repeat and the number corresponding to the downstream repeat is compared with the expected number of effective wells per base, and the conflict on the large double repeat is resolved according to the size of that difference.
According to an embodiment of the present invention, if the repeat sequence is not a large double repeat and its length is less than L, it is judged to be a small double repeat.
According to an embodiment of the present invention, if the repeat sequence is a small double repeat, it is resolved by at least one of the following: (p) the mean of the distances between the paired reads supporting each base is taken as the expected position of the conflict site, and the base at the conflict site is determined by comparing how close the pair distances supporting each of the two conflicting bases are to that mean; (k) a helper contig is built from the mates of the most upstream reads that map to non-repeat sequence within the standard-deviation range of the extended sequence, the mates of the paired reads that map downstream of the helper contig are added to the data to update the mean distance, and the base at the conflict site is again determined by comparing how close the pair distances supporting each of the two conflicting bases are to that mean.
According to an embodiment of the present invention, if the small double repeat cannot be resolved, extension of the seed sequence is terminated.
Referring to FIG. 4, according to an embodiment of the present invention, the scaffold construction module includes a primary scaffold construction module for establishing, based on the reads, the merge-and-join relationships between the contigs; it further includes a merge-and-join relationship establishing module for constructing a primary scaffold based on the contigs and the wells corresponding to the reads contained in the contigs; and it further includes an assembly module for merging the contigs to construct the scaffold, after the merge-and-join relationships between the contigs and the primary scaffold have been cross-checked, to obtain the assembly result for the separated long fragment sequences.
In the description of this specification, reference to the terms "one embodiment", "some embodiments", "example", "specific example" or "some examples" means that a specific feature, structure, material or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, such terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
For ease of understanding, the assembly method according to a preferred embodiment of the present invention is analysed step by step in detail below.
The program first reads in the seed data and the reads data for use in the subsequent steps. For computational efficiency, the reads data, which are needed again and again for alignment during extension, are held entirely in memory in binary form until the program terminates. The quality values supplied by sequencing (such as the quality values in the Fastq format) are neither recorded nor used during assembly. They are intended for the data preprocessing step before assembly, where bases and reads with abnormal quality values are removed or corrected. The sequences read in are therefore assumed to be of normal sequencing quality, and assembly ignores differences between bases of different quality.
After the reads have been read in and the well to which each belongs has been recorded, the reads are cut into Kmers with a sliding window and the Reads Kmer Index (RKI) is built. The RKI is a hash keyed by Kmer that records the frequency of each Kmer, the set of reads containing it, and the position and orientation of the Kmer on those reads, among other information. Hash lookup is very fast, with an average time complexity of only O(1), so the core purpose of the index is to reach the set of reads corresponding to a Kmer quickly and thereby decide which reads should be mapped to the seed for fine alignment.
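A toy version of such an index can be sketched as follows. It is an illustration only: the names are hypothetical, reads are kept as plain strings rather than in binary form, and storing each Kmer under the lexicographically smaller of itself and its reverse complement (with a strand flag) is an assumption about how orientation might be recorded.

```python
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase ACGT sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_rki(reads, k):
    """Build a Kmer -> [(read_id, position, strand), ...] hash.

    Assumption: each Kmer is filed under its canonical form; strand "+"
    means the read carries the key itself, "-" its reverse complement.
    """
    rki = defaultdict(list)
    for read_id, seq in reads.items():
        for pos in range(len(seq) - k + 1):
            kmer = seq[pos:pos + k]
            rc = reverse_complement(kmer)
            if kmer <= rc:
                rki[kmer].append((read_id, pos, "+"))
            else:
                rki[rc].append((read_id, pos, "-"))
    return dict(rki)


def kmer_frequency(rki, kmer):
    """Number of recorded occurrences of a canonical Kmer."""
    return len(rki.get(kmer, []))
```

Looking up a seed Kmer then returns, in a single hash access, every read that could be mapped at that point together with the offset and strand needed to place it.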
Because this RKI structure has to record information for a very large number of reads, and because each read contains many Kmers and therefore appears in the entries of many Kmers, memory consumption is very high. For the human genome at 100X sequencing, memory use is about 3 TB, more than the under 1 TB consumed by conventional assembly techniques. The data structure can still be optimized further, which would lower the cost of assembly and ease the resource bottleneck. One possible optimization is to store no read information, or only limited read information, for ultra-high-frequency Kmers in order to reduce memory overhead: such Kmers contribute little to assembly, and using them to map reads wastes a great deal of time and is very inefficient.
The RKI design described above differs markedly from the designs used in some DBG (de Bruijn graph) algorithms. Those algorithms build a graph of the relationships between Kmers by reading the reads from file. They do not need to keep the reads in memory, nor to record in detail which reads correspond to each Kmer, because the representation of the genome has been converted from a disordered pile of reads into a set of logically connected Kmers. The data have been heavily organized and compressed, and subsequent genome construction only needs to operate on the Kmer graph without touching the reads directly. The present algorithm, by contrast, does not build a global graph. During seed extension (genome construction) it must repeatedly make random accesses to the reads through Kmers to obtain the material for extension (assembly).
Finally, the relevant experimental parameters must also be read in, such as the insert size of the paired reads in library construction, the number of wells used to separate the LFRs, the number of cells used, and the ploidy of the target genome (obtained by optical means or by informatics analysis). The insert size and the number of input cells can, however, be estimated statistically during seed initialization, and the ploidy of the target genome can be computed through heterozygosity identification. Each of these ways of obtaining the information carries some error, so in actual assembly they should be analysed and applied in combination.
After a seed has been selected, it is initialized and checked for validity before extension begins. Initialization collects the relevant information on the seed, which is used for the subsequent validity check and as material for seed extension. The validity check discards seeds that lie in regions already extended and seeds that are unsuitable for extension.
First, the seed is cut into Kmers with a sliding window, the reads corresponding to these Kmers are retrieved through the RKI and finely aligned to the seed, and the well information of these reads is used to determine how each well covers the seed.
A well that covers essentially the whole seed is defined as an effective well (EW); wells whose coverage is still insufficient are recorded as candidates and updated during subsequent extension. In separated-long-fragment assembly, every site in the genome is covered by multiple LFRs, the number of which depends on the number of cells and the ploidy. As extension advances, the set of LFRs covering the extended region changes gradually, with new LFRs continually replacing old ones, much as in a relay race. For example, if the LFRs are prepared from 10 diploid cells, the expected number of wells at each site is 10 cells × 2 (diploid) × 2 (DNA strands) = 40. If the LFR length is 100 Kbp, the average spacing between adjacent LFRs is 100 Kbp / 40 = 2.5 Kbp. That is, within a 2.5 Kbp range the expected number of wells is 40, and for every additional 2.5 Kbp included in the calculation the number of wells increases by one.
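The arithmetic of this example can be reproduced with a few helper functions. The function names are hypothetical; the formulas follow the worked example directly.

```python
def expected_wells_per_site(n_cells, ploidy, strands=2):
    """Expected number of LFRs (wells) covering any one genomic site."""
    return n_cells * ploidy * strands


def mean_lfr_spacing(lfr_length_bp, wells_per_site):
    """Average distance between the starts of adjacent LFRs."""
    return lfr_length_bp / wells_per_site


def expected_wells_in_window(window_bp, lfr_length_bp, wells_per_site):
    """Expected wells seen in a window: one more per extra spacing unit."""
    spacing = mean_lfr_spacing(lfr_length_bp, wells_per_site)
    return wells_per_site + max(0.0, window_bp / spacing - 1)
```

For 10 diploid cells and 100 Kbp LFRs these give 40 wells per site, a spacing of 2,500 bp, and 41 expected wells in a 5 Kbp window.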
During seed extension, the EW set is determined from each well's coverage of the blocks already extended (or initialized). When an LFR first begins to overlap the seed being extended, its coverage of the seed rises as the seed grows until the coverage is complete, and the well corresponding to an LFR that covers the seed essentially completely can then serve as an EW to assist extension. As the seed continues to extend, the LFR gradually leaves the region being extended and its coverage falls steadily to zero. Once the LFR's coverage of the seed is clearly insufficient, its well is no longer used as an EW to assist extension. This is what produces the gradual turnover of the EW set during extension.
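One way to picture the EW turnover is to score each well by how much of the current seed region its reads touch. The sketch below is an assumption-laden illustration: the function names, the binning of the region, and the bin_size and min_fraction thresholds are all hypothetical, since the text does not define "essentially complete" coverage numerically.

```python
def covered_fraction(read_starts, region_start, region_end, bin_size):
    """Fraction of fixed-size bins in the region hit by at least one read."""
    n_bins = max(1, -(-(region_end - region_start) // bin_size))  # ceiling
    hit = {
        (start - region_start) // bin_size
        for start in read_starts
        if region_start <= start < region_end
    }
    return len(hit) / n_bins


def effective_wells(reads_by_well, region_start, region_end,
                    bin_size=1000, min_fraction=0.8):
    """Wells whose reads cover at least min_fraction of the region's bins."""
    return {
        well
        for well, starts in reads_by_well.items()
        if covered_fraction(starts, region_start, region_end, bin_size) >= min_fraction
    }
```

Re-evaluating this set as the region advances would add wells whose LFRs are entering the region and drop those whose LFRs are leaving it.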
This EW set, continually updated as extension proceeds, guides the filtering of reads during assembly. From the very large number of reads across the whole genome it isolates only those that belong to the neighbourhood of the current seed, so that regions which are complex at the whole-genome scale reduce to simple regions at the scale of an LFR and become easy to assemble. This is the core of the separated-long-fragment assembly approach.
The clear difference between this approach, which considers a set of multiple wells simultaneously during extension, and the traditional two-stage assembly approach, which considers each well separately (that is, each well is assembled on its own and the results are then merged in a second assembly), lies precisely in the use of EWs. This method extends the depth information needed for assembly from a single well to the cumulative depth of the entire EW set, which reduces the sequencing depth required per well so that only low coverage is needed. Taking 10 diploid cells (40 copies) as an example, the former two-stage approach requires a sequencing depth of 40 * 50X = 2,000X, where 50X is the depth needed for conventional genome assembly. This algorithm requires only 40 * 2.5X = 100X, where 2.5X is the depth needed for an EW to be identified while tolerating some regions not being sequenced, and extension draws on the cumulative 100X depth of the entire EW set. In other words, 2.5X is used to identify EWs and the cumulative 100X is used for extension. This greatly reduces sequencing cost and saves time.
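The depth comparison above reduces to two multiplications. The following is a minimal sketch with hypothetical function names; the default depths of 50X and 2.5X are the values quoted in the text.

```python
def two_stage_total_depth(copies: int, per_well_depth: float = 50.0) -> float:
    """Total depth when every well must be assembled on its own."""
    return copies * per_well_depth


def pooled_ew_total_depth(copies: int, per_well_depth: float = 2.5) -> float:
    """Total depth when each well only needs enough reads to be identified as
    an EW and extension uses the pooled depth of the whole EW set."""
    return copies * per_well_depth


copies = 10 * 2 * 2  # 10 diploid cells, two strands each
saving = two_stage_total_depth(copies) / pooled_ew_total_depth(copies)
```

Under these figures the pooled approach needs one twentieth of the sequencing depth of the two-stage approach.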
In addition, the transitional change in EWs can also be used to determine the positional relationships between contigs and to build extremely long scaffolds.
According to an embodiment of the present invention, both seed initialization and extension require aligning reads to the seed to supply the material from which information (such as well and read-pair information) and the extension sequence are obtained. Attempting to align all reads is impractical because it would consume a great deal of time, so the RKI is used to pre-screen reads: only reads that hit a Kmer on the seed proceed to fine alignment. The method is as follows:
1. Slide a window along the seed sequence to cut out Kmers, and use the RKI to obtain the reads corresponding to these Kmers. At seed initialization the whole seed sequence is cut; during extension only the sequence at the contig end is cut. For efficiency, extremely high-frequency Kmers (a single Kmer corresponding to a very large number of reads) are not used to retrieve reads. As a special case, when retrieving reads during extension, if every Kmer at the end is extremely high-frequency, only the terminal Kmer is used to locate reads.
2. Locate reads on the seed using the recorded positions of the Kmer on the seed and on the reads, and compare them base by base. Substitution errors in sequencing are tolerated to some extent, but there is currently no gap-opening module, so indel errors cannot be tolerated.
3. The purpose of read filtering is to remove reads that were mislocated because of repeats. The algorithm first uses EW information to filter out reads mislocated because of genome-wide repeats, and then uses mate-pair information to filter out right-arm reads mislocated because of simple repeats within the LFR (left-arm reads are not filtered). Note that left-arm reads inside a repeat region do not take part in judging the validity of their corresponding right-arm reads, because they may themselves be mislocated; complex types of repeats are handed to the repeat-resolution module.
4. To obtain the extension sequence, the reads located at the contig end must be merged into a base-by-base consensus. As with read alignment, the consensus step tolerates only substitution sequencing errors; indels are not currently tolerated. Because mate-pair filtering can act only on right-arm reads, right-arm reads are free of mislocation after filtering, which gives them a clear advantage over left-arm reads. On this basis, when enough filtered right-arm reads are available, this module uses only those reads to build the consensus. If they merge into a single sequence, the current extension is complete and the module that updates related information is entered next. If they cannot be merged into a single sequence, that is, a conflict arises during consensus building, the left-arm reads of the region must be brought in to identify and resolve heterozygosity or repeats. When there are too few right-arm reads, left-arm reads must also be brought in for the consensus.
5. During the base-by-base merging of the consensus, if more than one base is found at the same position at non-low frequency (low frequency being caused mainly by sequencing errors), a conflict arises and the consensus at that site is judged to have failed. This is caused mainly by heterozygosity and repeats. Compared with repeats, heterozygosity in a diploid genome is relatively simple to identify: there are exactly two conflicting bases, they split the EW set roughly in half, and the EW sets supporting each base essentially do not intersect. Heterozygosity is therefore checked first after a consensus failure.
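Steps 1 and 2 of the list above can be illustrated with a minimal sketch. The index layout (Kmer to a list of read/offset pairs), the function names, and the `max_hits` cutoff are assumptions made for illustration; the source does not specify how the RKI is stored, and the special case of using only the terminal Kmer is omitted.

```python
from collections import defaultdict


def build_rki(reads: dict, k: int) -> dict:
    """Hypothetical read Kmer index: Kmer -> [(read_id, offset_in_read), ...]."""
    index = defaultdict(list)
    for read_id, seq in reads.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].append((read_id, i))
    return index


def candidate_placements(seed: str, rki: dict, k: int, max_hits: int) -> set:
    """Slide along the seed and collect (read_id, start_on_seed) candidates,
    skipping Kmers that hit more than max_hits reads (high-frequency Kmers)."""
    placements = set()
    for pos in range(len(seed) - k + 1):
        hits = rki.get(seed[pos:pos + k], [])
        if len(hits) > max_hits:
            continue
        for read_id, offset in hits:
            placements.add((read_id, pos - offset))
    return placements


def mismatches(seed: str, read: str, start: int) -> int:
    """Gap-free, base-by-base comparison over the overlapping part only;
    substitutions are counted, indels are not modelled."""
    count = 0
    for i, base in enumerate(read):
        j = start + i
        if 0 <= j < len(seed) and seed[j] != base:
            count += 1
    return count
```

A caller would keep a candidate only when `mismatches` stays below a chosen tolerance, which corresponds to tolerating substitution errors but not indels.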
Sequencing errors, in contrast to heterozygosity, do not fit this pattern; their main features are low frequency and randomness. In the great majority of sequencing technologies, sequencing errors make up only a tiny fraction of the total sequence output, so only a very small number of differences appear during consensus building. Even when sequencing depth fluctuates considerably, the differences caused by sequencing errors remain a very small component: the absolute number of errors varies with depth, but their proportion, i.e. the sequencing error rate, does not change. Their randomness shows in that sequencing errors are not biased among wells: errors occur with the same probability in every well, and even if the errors themselves are somewhat biased in error pattern, the situation is the same in every well. This means that erroneous bases do not concentrate in particular wells; when a consensus conflict occurs, the EW sets supporting the different bases intersect and cannot be separated. Note that when overall sequencing depth is low, the features of sequencing errors become indistinct and misjudgment is very likely, so the sequencing depth must not be too low.
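The distinction drawn above between a clean consensus, a heterozygous site, and another kind of conflict can be sketched as follows. All thresholds (`min_fraction`, `max_overlap`, `min_balance`) and the function name are illustrative assumptions; the source gives no numeric criteria.

```python
from collections import Counter, defaultdict


def classify_site(observations, min_fraction=0.2, max_overlap=0.1, min_balance=0.3):
    """observations: list of (base, well_id) for the reads covering one site.

    Returns (label, bases) with label in {"consensus", "heterozygous", "conflict"}.
    """
    if not observations:
        raise ValueError("no reads cover this site")
    counts = Counter(base for base, _ in observations)
    total = sum(counts.values())
    # Bases seen at non-low frequency; rare bases are treated as sequencing errors.
    major = sorted(b for b, c in counts.items() if c / total >= min_fraction)
    if len(major) == 1:
        return "consensus", major
    if len(major) != 2:
        return "conflict", major
    wells = defaultdict(set)
    for base, well in observations:
        if base in major:
            wells[base].add(well)
    a, b = wells[major[0]], wells[major[1]]
    union = a | b
    overlap = len(a & b) / len(union)            # shared wells suggest a repeat
    balance = min(len(a), len(b)) / len(union)   # roughly half and half for a het site
    if overlap <= max_overlap and balance >= min_balance:
        return "heterozygous", major
    return "conflict", major
```

Two bases supported by disjoint, balanced well sets are called heterozygous; two bases that share the same wells are left as a conflict for the repeat-resolution logic.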
Once heterozygosity is confirmed, the contig is split in two for dual-path extension, and at the split it is phased using the EW status of the previous heterozygous region. In principle, two heterozygous regions on the same haploid copy should have similar EW sets, whereas the EW sets of two heterozygous regions on different haploid copies should not intersect. Neighboring heterozygous sites can be phased in this way. As a special case, if the previous heterozygous region is too far away (farther than the LFR length), the EW sets of the two heterozygous regions will not intersect even when they lie on the same haploid copy; phasing is then judged to have failed and a new phase block is created, while the contig itself remains continuous and is not broken at that point.
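The pairwise phasing rule above can be sketched as a comparison of shared EWs between the previous and the current heterozygous region. The function name, the dictionary layout, and the `min_shared` threshold are assumptions for illustration only.

```python
def phase_with_previous(prev_haps: dict, curr_alleles: dict, min_shared: int = 2):
    """prev_haps: {haplotype_label: set of EW ids} at the previous het region.
    curr_alleles: {allele: set of EW ids} at the current het region.

    Returns {allele: haplotype_label}, or None when the EW sets do not link the
    two regions, in which case a new phase block would be started."""
    (h0, w0), (h1, w1) = prev_haps.items()
    (a, wa), (b, wb) = curr_alleles.items()
    straight = len(w0 & wa) + len(w1 & wb)
    crossed = len(w0 & wb) + len(w1 & wa)
    if max(straight, crossed) < min_shared or straight == crossed:
        return None
    if straight > crossed:
        return {a: h0, b: h1}
    return {a: h1, b: h0}
```

When the previous region lies farther away than an LFR length, no EW is shared, both scores are zero, and the sketch returns `None`, mirroring the "phasing failed, start a new block" case.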
The iterative extension of the seed sequence is shown in Figure 5. Unlike some other phasing algorithms, the method above considers only the EW relationship between two neighboring sites rather than considering multiple sites jointly, and the phasing is linear. The logic is fairly simple, and the phasing algorithm still has room for improvement. The phasing method in this algorithm nevertheless has certain advantages: unlike methods that consider only single-site heterozygosity, this algorithm phases all types of heterozygosity, which increases the phased length.
Like other assembly-based methods for identifying structural variation, this algorithm can also identify large structural-variant heterozygosity, which is its advantage over other methods that process separated long fragment sequences to resolve haplotypes. Furthermore, because insertion-deletion heterozygosity and large structural-variant heterozygosity are phased in the same way in this algorithm, it can build more complete and longer haplotypes than phasing methods that consider only single substitution-type heterozygous sites. It also provides capabilities that resequencing methods lack, in particular the ability to phase large structural variants, which is of great significance for downstream analysis.
A repetitive sequence is an identical or similar sequence that occurs at different positions in the genome. Repetitive sequences are abundant in genomes; in the human genome, for example, the various types of repetitive sequence together account for nearly half of the genome. Repetitive sequences have long been a major factor affecting assembly quality, and resolving them correctly is the problem that assembly algorithms care about most and for which new strategies are continually tried. Repeat resolution is naturally also a key module of this algorithm. It mainly handles complex repeats within an LFR, chiefly neighboring small double repeats, large double repeats, and tandem repeats. The methods for resolving these repeats are described below.
Different repeat types are handed to different modules for resolution, and identifying the repeat type requires supporting information. Repeat-region identification is one of the important pieces of information for identifying the repeat type. Its main purpose is to identify left-arm reads that cannot be used for filtering, namely those for which both reads of the mate pair lie within the repeat region. This matters because one characteristic of a repeat region is that reads are mislocated there, and using mislocated reads for extension often leads to erroneous extension.
As shown in Figure 6, repeat-region identification mainly comprises identifying the start point and the end point. During extension, when the left arms corresponding to the right-arm reads in the extension region are found not to lie near the position one insert length upstream, the start of a repeat region can be determined.
The end point of a repeat region falls into two main classes. For a simple repeat (one that can be resolved merely by filtering mislocated reads through the mate-pair relationship), the end of the repeat region is the position at which right-arm reads that have lost their left arms no longer appear. For a complex repeat, the conflict point encountered during extension is the end point.
In addition, a differing region that occurs within a repeat fragment is treated as a non-repeat region; that is, similar repeat fragments are strictly split into multiple sub-repeat fragments for processing. In this case the nature of the region after the point of difference is identified differently: the left arms corresponding to the right-arm reads at the start of the repeat region lie in the previous repeat region rather than in a non-repeat region, so the paired reads are mislocated. The left-arm reads near the position one insert length before the extension point are then examined. If all of their corresponding right arms took part in the extension, the region currently being extended is a repeat region; if some of them did not take part, extension has entered a non-repeat region.
Because mate-pair filtering can act only on right-arm reads, right-arm and left-arm reads do not have equal status during extension. Conversely, however, if there is a region in which the left-arm reads can be verified through the right-arm reads, the left-arm reads can help resolve repeats. This is the concept of the Helper Contig (HC).
An HC is a contig used when resolving complex repeats. It serves only as an aid and does not appear in the assembly result as a formal contig. In essence it makes further use of the mate-pair information of reads: the HC is expected to span the current repeat region and land on the downstream non-repeat region, which is then used to help resolve the repeat. If the HC fails to span the repeat region, it generally has no effect. HCs are currently applied to the following two classes of case:
(1) HC case1 - neighboring small double repeats:
Because the insert length of mate-pair reads has a certain standard deviation (SD), if there are two neighboring small repeats slightly longer than (read length - Kmer + 1), the SD deviation confuses the mate-pair information during extension and a conflict arises during consensus building; both conflicting bases are supported by EWs and by mate-pair reads.
For this problem, the algorithm first takes the mean computed from the mate-pair information supporting each base as the expected position; the base whose expected position is nearer is regarded as the one to extend. If the two positions are too close and the actual situation is ambiguous, an HC must be built to assist identification.
First, the right arm corresponding to the leftmost left-arm read that lies within one SD length of the contig end and is not in the repeat region is taken as the start for building the HC, and this read is extended in a seed-like manner. Once it has been extended to a sufficient length, the positions of the conflicting bases can be computed separately from the left arms corresponding to the right-arm reads located on this HC. In essence, using an HC in this case only improves the reliability of the distance calculation; if the SD is large, errors may still occur.
(2) HC case2 - large double repeats:
When the repeat region is longer than the insert length, left-arm reads cannot be used for read filtering. In fact, a left arm that lies within the repeat region cannot be used to filter a right arm within the repeat region, whether the region between the two arms is repeat or non-repeat.
Because large repeats are long, their well information is more likely to differ. Depending on the situation, well information can therefore be used to resolve them, while some cases require an HC as well. There are two main classes:
The two repeats are far apart. In this case, although both conflicting bases have EW support, the upstream base has clearly more EWs than the downstream one, and this comparison can be used to resolve the conflict. Conversely, among the other wells not yet identified as EWs, more support the downstream base than the upstream one, which can also be used to help resolve the conflict, as shown in Figure 7. If this information is still ambiguous, an HC can be built from the left arms corresponding to the erroneous right arms within one insert length of the start of the repeat region, with the expectation that the EW set on the HC differs less from the EW set of the upstream conflicting base than from that of the downstream conflicting base.
The two repeats are close together. In this case the well information does not change appreciably and cannot be used to resolve the conflict, but there may be some simple situations in which an HC can resolve it; see Figures 8a and 8b:
a) When the distance between the conflict site and the end of the repeat region is less than the insert length, the right arms corresponding to the left-arm reads that support the upstream base can be located on the HC (the HC is built as in the previous class of large repeat, from the left arms corresponding to the erroneous right arms within one insert length of the start of the repeat region), as shown in Figure 8a;
b) When the distance between the conflict site and the start of the downstream repeat is less than the insert length, the right arms corresponding to the left-arm reads that support the upstream base can be found within the repeat region, as shown in Figure 8b.
A repeat formed by multiple copies of a repeat unit joined consecutively is defined as a tandem repeat (TR), and the shortest repeat fragment in a tandem repeat region is the tandem repeat unit (TRU). A tandem repeat unit has several different phases (for example ACT, CTA, TAC), and the number of phases equals its length.
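The phases of a TRU are simply its cyclic rotations, which a short sketch makes concrete. The function names are hypothetical; `canonical_phase` is an added convenience (not described in the source) for comparing units regardless of phase.

```python
def tru_phases(unit: str) -> list:
    """All cyclic rotations (phases) of a tandem repeat unit."""
    return [unit[i:] + unit[:i] for i in range(len(unit))]


def canonical_phase(unit: str) -> str:
    """Lexicographically smallest rotation, usable as a phase-independent key."""
    return min(tru_phases(unit))


def same_tru(unit_a: str, unit_b: str) -> bool:
    """True when two units are the same TRU observed in different phases."""
    return len(unit_a) == len(unit_b) and canonical_phase(unit_a) == canonical_phase(unit_b)
```

Because a TRU is by definition the shortest repeating fragment, its rotations are all distinct, so the number of phases equals the unit length.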
As with the conflicts caused by ordinary repeats, the sequence at the end of a tandem repeat region, and any differing sequence within it, conflicts with the sequence inside the repeat region. Unlike an ordinary repeat, a tandem repeat region is equivalent to multiple ordinary repeat regions joined together. The method for resolving this kind of repeat therefore has both similarities to and differences from the methods for other ordinary repeats.
As shown in Figure 9, tandem repeats shift the positions at which reads are located, which often causes conflicts during consensus building because reads containing the repeat end or a difference within the repeat are mislocated. The root cause is that the Kmers used to locate reads are repeated Kmers. In this situation the head of every mislocated read lies after the start of the tandem repeat region (because the non-repeat region before the start causes mislocated reads to fail during fine alignment), and the offset caused by mislocation is smaller than the SD of the insert length, so mate-pair information cannot filter these mislocated reads out. Moreover, if the tandem repeat is longer than the SD of the insert length, reads in the repeat region are periodically mislocated toward the upstream side of the extension in units of one SD: the distance error of a mislocated read never exceeds the SD of the insert length, and the reads pile up at the head of each SD unit. This means that, until a conflict is encountered, every SD-length unit within the tandem repeat region is compressed and shortened, as the reads of the next SD unit are continually borrowed forward.
The first prerequisite for resolving a TR is identifying it correctly. In this algorithm a TR is confirmed by finding its tandem repeat unit. Because a TR longer than the insert length also satisfies the activation condition of HC case2, the algorithm places the TR activation test before the HC case2 activation test.
A TRU is identified mainly by finding Kmers that are used periodically. For a TRU shorter than (read length - Kmer + 1), i.e. one that occurs two or more times within a single read, the reads at the conflict site can be cut into Kmers with a sliding window and each read searched for a Kmer that occurs more than once. If most reads show this, and the repeated Kmers of these reads are identical or consistent up to TRU phase, the current conflict can be judged to be caused by a tandem repeat and the tandem-repeat resolution module is activated. For a TRU longer than (read length - Kmer + 1), the Kmers in the range from the start of the repeat region to the conflict point are scanned; if a Kmer is found to occur with a fixed period, the conflict can be judged to be caused by a tandem repeat.
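The short-TRU test above, which looks for a Kmer that recurs within a single read, can be sketched as follows. This is an illustrative simplification: it uses agreement on the recurrence period as a stand-in for "identical or phase-consistent Kmers", and the function names and the `min_fraction` threshold are assumptions.

```python
from collections import Counter


def repeated_kmer_period(read: str, k: int):
    """Smallest distance between two occurrences of the same Kmer in a read,
    or None when no Kmer recurs."""
    last_seen = {}
    best = None
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        if kmer in last_seen:
            distance = i - last_seen[kmer]
            if best is None or distance < best:
                best = distance
        last_seen[kmer] = i
    return best


def tandem_repeat_period(reads: list, k: int, min_fraction: float = 0.5):
    """Return the most common recurrence period when more than min_fraction of
    the reads contain a recurring Kmer, otherwise None."""
    periods = [repeated_kmer_period(r, k) for r in reads]
    hits = [p for p in periods if p is not None]
    if not reads or len(hits) / len(reads) <= min_fraction:
        return None
    return Counter(hits).most_common(1)[0][0]
```

The returned period is a candidate TRU length; a fuller implementation would also compare the recurring Kmers themselves across reads, as the text requires.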
Once the content of the TRU has been identified, the reads at the conflict site can be divided into four classes: 1) reads inside the TR region, which are completely covered by the TRU; 2) reads containing the TR start, in which the TRU is found only at the tail; 3) reads containing the TR end, in which the TRU is found only at the head; 4) reads containing both the TR start and the TR end, which exist only when the TR is shorter than the read length, in which case no class-1 reads completely covered by the TRU are found.
For a TR shorter than the read length, the conflict is eliminated by repositioning the reads containing the TR end so that their end points align. A TR longer than the read length can only be resolved by skipping over it and filling in "N". For a TR longer than the read length but shorter than the insert length, the right-arm reads that contain the TR end and are supported by left arms in the non-repeat region are aligned at the end point and merged into a consensus; the distance can then be estimated from the insert lengths of these reads and the estimated number of "N"s filled in. For a TR longer than the insert length, none of the right-arm reads containing the TR end is supported by a left-arm read in the non-repeat region; these reads are simply merged into a consensus and a single "N" is filled in as a marker. Note that small differences within the TR region (such as individual base substitutions, insertions, and deletions) can be confused with the TR end, so conflicts arise when the reads identified as containing the TR end are merged. In that case the conflicting reads can be assembled with a DBG approach, which naturally yields separate sequences for the differences within the TR region and for the TR end. The positions of these assembled sequences are then computed from the positions, on the already-extended non-repeat region, of the left-arm reads corresponding to the right-arm reads located on them; the sequence farthest from the extension point is the correct TR end.
According to an embodiment of the present invention, because seeds are extended in parallel, reads that have taken part in extension must be marked "used" to prevent the same region from being extended repeatedly by multiple seeds. When another seed encounters such reads during initialization or extension, it stops extending and is later handed to the contig-merging module for joining. Note that reads whose mate pairs both lie in a repeat region cannot be located unambiguously; to prevent extension from being stopped wrongly, reads of a repetitive nature are marked "repeat" and are not used to decide redundant extension.
According to an embodiment of the present invention, when reads from a well are located on a contig, the placement may be an error caused by repeats or sequencing errors. The algorithm therefore requires a well to have sufficient coverage of the already-extended seed region before it is regarded as an EW, and EWs whose coverage becomes insufficient are discarded. For efficiency, the EW set is updated only after each fixed extension length, and only the coverage within that length is examined.
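The coverage-based EW update described above can be sketched as a per-well covered fraction over the most recently extended window. The data layout (half-open read intervals per well), the function names, and the 0.8 threshold are illustrative assumptions; the source only requires "sufficient" coverage.

```python
def covered_fraction(intervals, start: int, end: int) -> float:
    """Fraction of [start, end) covered by the union of half-open intervals."""
    clipped = sorted((max(s, start), min(e, end))
                     for s, e in intervals if s < end and e > start)
    covered, cursor = 0, start
    for s, e in clipped:
        s = max(s, cursor)      # skip the part already counted
        if e > s:
            covered += e - s
            cursor = e
    return covered / (end - start)


def update_ew_set(placements_by_well: dict, start: int, end: int,
                  min_fraction: float = 0.8) -> set:
    """Wells whose reads cover enough of the window are kept (or become) EWs;
    wells below the threshold drop out of the set."""
    return {well for well, intervals in placements_by_well.items()
            if covered_fraction(intervals, start, end) >= min_fraction}
```

Calling `update_ew_set` once per fixed extension length, with `start` and `end` bounding only that stretch, corresponds to the efficiency rule in the text.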
After all contigs have finished extending, the contigs are merged, and phasing between contigs is performed at the same time. Contigs that can be merged fall into two cases:
Contigs that stopped because "used" reads were found during extension. Here the merge point of the two contigs is computed directly from the positions of these "used" reads on the two contigs.
Contigs whose extension stopped because of repeats, locally low sequencing depth, or similar causes. Here the mate-pair reads of the outward-facing reads on the non-repeat region at the contig end (left-arm reads for downstream extension, right-arm reads for upstream extension) are checked to see whether they are located on another contig. If so, the distance between the two contigs is estimated from the insert length and the estimated number of "N"s is filled in.
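The gap estimate in the second case can be sketched from the coordinates of the spanning read pairs. The coordinate convention (start of the left-arm read on the upstream contig, exclusive end of the right-arm read on the downstream contig), the function names, and the clamping of negative estimates to zero are assumptions made for this illustration.

```python
from statistics import mean


def estimate_gap(pairs, insert_size: int, left_contig_len: int) -> int:
    """pairs: (left_start, right_end) for each read pair spanning the gap.

    For one pair: (left_contig_len - left_start) + gap + right_end = insert_size.
    The per-pair estimates are averaged and rounded."""
    estimates = [insert_size - (left_contig_len - left_start) - right_end
                 for left_start, right_end in pairs]
    return round(mean(estimates))


def join_with_gap(left_contig: str, right_contig: str, gap: int) -> str:
    """Join two contigs with the estimated number of "N" characters."""
    return left_contig + "N" * max(gap, 0) + right_contig
```

A negative estimate would indicate that the contigs overlap rather than being separated by a gap; this sketch does not handle that case beyond inserting no "N".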
Because the basic units of scaffold construction in this algorithm are also the contigs that have stopped extending but have not yet been merged, in practice the contig-merging step first establishes the relationships between contigs (overlapping, or joined with a gap), then cross-checks them against the inter-contig relationships established in the scaffold-construction step, simplifies these relationships and resolves conflicts, and only then actually merges the contigs.
As extension proceeds, wells/LFRs show transitional change, a feature that can be used to determine the positional relationships between contigs. The algorithm finally performs one round of scaffold construction (scaffolding) on all contigs. This idea has in fact already appeared in the method for resolving large repeats described above.
In this algorithm, a scaffold is defined as a sequence formed by determining the order of contigs mainly from well information. This differs from the traditional definition, in which scaffolds are built from the mate-pair information of reads. A scaffold in this algorithm expresses only the order of the contigs; the specific distance between contigs cannot be determined, so adjacent contigs are joined by a single "N". A traditional scaffold, by contrast, not only determines the order of contigs from mate-pair reads but also calculates the distance between contigs from the positions at which the paired reads map on the two contigs and from their insert size, which is consistent with the method used in contig merging. Either scaffolding approach involves ordering the contigs, which is an Optimal Linear Arrangement (OLA) problem and is NP-hard. Note also that contigs within a scaffold have their own orientation. (A polydeoxynucleotide formed by the four deoxynucleotides A, C, T and G linked by 3',5'-phosphodiester bonds is called a DNA base sequence, which is the form in which contigs and scaffolds are represented in assembly. The linkage of deoxynucleotides is strictly directional: a 3',5'-phosphodiester bond forms between the 3'-OH of one deoxynucleotide and the 5'-phosphate of the next, producing an unbranched linear DNA macromolecule. By convention, the 5'-to-3' direction is the "+" direction and the 3'-to-5' direction is the "-" direction.) A contig that is correctly positioned but wrongly oriented also causes a large assembly error, which appears as a serious subsequence-inversion error.
In general, a well appears in several neighbouring contigs, so a set of neighbouring contigs can be obtained through that well. First, among the wells to which a contig belongs, find the well for which that contig is the starting point, and define the contig as the start contig of that well. Then find the other contigs that contain this well, and for each one compute the intersection of its well set with the well set of the start contig. Clearly, the larger a contig's intersection with the start contig's well set, the closer it lies to the start contig. From this the order of a group of contigs can be determined, and a scaffold can then be constructed.
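The ordering rule above can be sketched as follows. This is a simplified illustration under stated assumptions: the function names and the dictionary-based data layout are hypothetical, ties are broken by contig name purely to make the output deterministic, and orientation is ignored.

```python
def order_contigs(start, well, contig_wells):
    """Order the contigs that share `well` by closeness to the start contig.

    contig_wells: dict mapping contig name -> set of well identifiers
    (hypothetical layout). A larger intersection with the start contig's
    well set is taken to mean a smaller distance from the start contig.
    """
    start_wells = contig_wells[start]
    neighbours = [
        (len(start_wells & wells), name)
        for name, wells in contig_wells.items()
        if name != start and well in wells
    ]
    # Largest intersection first; the name is only a deterministic tie-break.
    neighbours.sort(key=lambda item: (-item[0], item[1]))
    return [start] + [name for _, name in neighbours]


def build_scaffold(order, contig_seqs):
    """Join ordered contigs with a single 'N', since distances are unknown."""
    return "N".join(contig_seqs[name] for name in order)


if __name__ == "__main__":
    wells = {
        "c1": {1, 2, 3, 4, 5},
        "c2": {3, 4, 5, 6},
        "c3": {5, 6, 7},
        "c4": {8, 9},
    }
    seqs = {"c1": "ACGT", "c2": "GGCC", "c3": "TTAA", "c4": "CCCC"}
    order = order_contigs("c1", 5, wells)
    print(order, build_scaffold(order, seqs))
```

Because the intersection size only ranks contigs relative to one start contig, the resulting order is a local estimate, which is why the text goes on to check it against the contig merging result.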
To determine the orientation of a contig within a scaffold, the well sets of its head end and tail end must be considered separately, and the similarity of each to the well sets of the head or tail ends of nearby contigs is computed. When a contig is short, the transitional change of wells is not pronounced, and its position may be ambiguous. Therefore, after the preliminary scaffolding of all contigs is complete, the result must be checked and corrected against the contig merging result, the relationships obtained by the two methods are combined, and the scaffold sequences are finally output.
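A minimal sketch of the orientation check follows. The text does not specify the similarity measure; the Jaccard index is used here as an assumption, and the function names are hypothetical. An undecidable comparison returns `None`, corresponding to the ambiguous case for short contigs.

```python
def jaccard(a, b):
    """Jaccard similarity of two well sets (an assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def orient(prev_tail_wells, head_wells, tail_wells):
    """Decide the orientation of a contig that follows another contig.

    Returns "+" if the head end is more similar to the previous contig's
    tail end, "-" if the tail end is (the contig should be flipped), and
    None when the two similarities are equal.
    """
    s_head = jaccard(prev_tail_wells, head_wells)
    s_tail = jaccard(prev_tail_wells, tail_wells)
    if s_head > s_tail:
        return "+"
    if s_tail > s_head:
        return "-"
    return None


_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the '-' direction representation of a contig sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


if __name__ == "__main__":
    print(orient({1, 2, 3, 4}, {3, 4, 5}, {7, 8, 9}))
    print(reverse_complement("AACGN"))
```

A contig for which `orient` returns "-" would be reverse-complemented before being placed in the scaffold; contigs that return `None` are candidates for the later cross-check against the contig merging result.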
Although embodiments of the present invention have been shown and described, those of ordinary skill in the art will understand that various changes, modifications, substitutions and variations may be made to these embodiments without departing from the principles and spirit of the invention; the scope of the invention is defined by the claims and their equivalents.

Claims (46)

  1. A method for assembling separated long fragment sequences, characterized by comprising:
    (a) obtaining a read set by sequencing, and recording the sequencing well corresponding to each read in the read set, wherein one sequencing well contains at least one long fragment sequence;
    (b) extending a plurality of seed sequences in parallel, using the reads and the sequencing wells corresponding to the reads, to obtain a plurality of contigs, wherein the plurality of seed sequences are determined from known sequences;
    (c) constructing scaffolds based on the reads, the contigs, and the sequencing wells corresponding to the reads contained in the contigs, to obtain an assembly result for the separated long fragment sequences.
  2. The method according to claim 1, characterized in that the seed sequences are obtained from a genomic reference sequence by the following steps:
    breaking the genomic reference sequence at N; and
    truncating the broken reference sequence to a predetermined length to obtain the seed sequences.
  3. The method according to claim 2, characterized in that the predetermined length is not less than the insert size of the sequencing library used in the sequencing.
  4. The method according to claim 1, characterized in that the read set comprises paired reads, and the seed sequences are obtained from the reads by the following steps:
    (1) cutting the reads into a plurality of Kmers with a sliding window, and constructing an index RKI from Kmers to the reads, for accessing the corresponding reads through a Kmer;
    (2) extracting from the read set a pair of paired reads that contain no high-frequency Kmer;
    (3) using the index RKI, determining all reads corresponding to the Kmers of each of the two reads of the pair in (2), to obtain a first read group and a second read group;
    (4) determining the sequencing wells corresponding to the first read group and the second read group of (3), respectively, to obtain a first sequencing well set and a second sequencing well set;
    (5) determining the intersection of the first sequencing well set and the second sequencing well set of (4), and if the size of the intersection does not differ significantly from the expected number of effective sequencing wells per base, determining the paired reads of (2) to be a seed sequence.
  5. The method according to claim 4, characterized in that, in (5):
    if the size of the intersection is between one half and twice the expected number of effective sequencing wells per base, the paired reads of (2) are determined to be a seed sequence.
  6. The method according to claim 1, characterized in that (b) comprises:
    (i) cutting the reads into a plurality of Kmers with a sliding window, and constructing an index RKI from Kmers to the reads, for accessing the corresponding reads through a Kmer;
    (ii) extending the plurality of seed sequences in parallel based on the reads and their corresponding index RKI, to obtain the plurality of contigs.
  7. The method according to claim 6, characterized in that the RKI is obtained by the following steps:
    cutting the reads into a plurality of Kmers with a sliding window;
    constructing a hash keyed by Kmer, wherein the hash constitutes the RKI and records the frequency of each Kmer, the reads to which it belongs, and the position and orientation of the Kmer on those reads.
  8. The method according to claim 6, characterized in that the seed sequences are extended by repeating the following steps:
    selecting a seed sequence suitable for extension;
    mapping the reads to the seed sequence to obtain an extension sequence;
    performing base-by-base consensus processing on the reads mapped at the end of the extension sequence; and
    if the consensus processing fails, performing heterozygosity identification, phasing, and/or repeat resolution.
  9. The method according to claim 8, characterized in that a seed sequence suitable for extension is selected by the following steps:
    cutting the seed sequence into Kmers with a sliding window;
    retrieving the reads corresponding to the Kmers through the RKI;
    aligning the corresponding reads to the seed sequence;
    determining the coverage of the seed sequence by sequencing wells, based on the sequencing wells corresponding to the corresponding reads; and
    determining, based on the coverage, a seed sequence suitable for extension.
  10. The method according to claim 8, characterized in that the reads are mapped to the seed sequence by the following steps:
    cutting the seed sequence into Kmers with a sliding window;
    retrieving the reads corresponding to the Kmers through the RKI;
    mapping the reads corresponding to the Kmers to the seed sequence and aligning them base by base.
  11. The method according to claim 8, characterized in that, during the consensus processing, if the set of effective sequencing wells at the site being extended is evenly divided among the different base types at that site, heterozygosity is determined to be present.
  12. The method according to claim 11, characterized in that, after heterozygosity is determined to be present, the extension sequence is split into a plurality of sequences that are extended separately.
  13. The method according to claim 8, characterized in that the read set comprises a plurality of pairs of paired reads, the distance between the two reads of a pair being L,
    and, during the consensus processing, if the distance between a read of a pair that is mapped downstream in the extension direction and its mate is not L, the position at which that downstream read is mapped is determined to be the start of a repeat sequence.
  14. The method according to claim 13, characterized in that the end of the repeat sequence is the position at which a read of a pair is mapped downstream in the extension direction, the distance between that downstream read and its mate being L,
    or is a conflict site encountered during extension.
  15. The method according to claim 14, characterized in that whether the repeat sequence is a tandem repeat is determined by the following steps:
    cutting the reads mapped to the repeat sequence into Kmers with a sliding window;
    determining, for each such read, whether the Kmers obtained by the sliding cut include a repeated Kmer; if no repeated Kmer is present, determining that no tandem repeat is present, and if a repeated Kmer is present, determining that a tandem repeat is present.
  16. The method according to claim 15, characterized in that, if the repeat sequence is a tandem repeat, the tandem repeat is resolved by the following steps:
    (m) determining the length of the tandem repeat;
    (n) adjusting the positions of the reads that contain the end of the tandem repeat by aligning them at that end.
  17. The method according to claim 15, characterized in that, if the repeat sequence is not a tandem repeat and the following condition holds, the repeat sequence is determined to be a large double repeat:
    the length of the repeat sequence is greater than L, or the mate of a read of a pair mapped downstream of the repeat sequence is also mapped on the repeat sequence.
  18. The method according to claim 17, characterized in that, if the repeat sequence is a large double repeat, the large double repeat is resolved as follows:
    comparing the difference between the number of effective sequencing wells corresponding to the upstream repeat copy and the number corresponding to the downstream repeat copy of the large double repeat with the expected number of effective sequencing wells per base, and resolving the conflict on the large double repeat according to the magnitude of the discrepancy.
  19. The method according to claim 17, characterized in that, if the repeat sequence is not a large double repeat and the following condition holds, the repeat sequence is determined to be a small double repeat:
    the length of the repeat sequence is less than L.
  20. The method according to claim 19, characterized in that the small double repeat is resolved by at least one of the following:
    (p) taking the mean of the distances between the paired reads supporting each base as the expected position of the conflict site, and determining the base at the conflict site by comparing how close the distances between the paired reads supporting the two conflicting bases are to that mean;
    (k) constructing an auxiliary contig from the mates of the most-upstream reads, among the paired reads, that are mapped to non-repetitive sequence within the standard-deviation range of the extension sequence, and determining the base at the conflict site using the mates of the reads of pairs that are mapped downstream of the auxiliary contig.
  21. The method according to claim 20, characterized in that, if the small double repeat cannot be resolved, extension of the seed sequence is terminated.
  22. The method according to claim 1, characterized in that (c) comprises:
    (iii) establishing merge connection relationships between the contigs based on the reads;
    (iv) constructing primary scaffolds based on the contigs and the sequencing wells corresponding to the reads contained in the contigs;
    (v) after cross-checking the merge connection relationships among the plurality of contigs against the primary scaffolds, merging the contigs to construct the scaffolds, thereby obtaining the assembly result for the separated long fragment sequences.
  23. An apparatus for assembling separated long fragment sequences, characterized by comprising:
    an input module, which obtains a read set by sequencing and records the sequencing well corresponding to each read in the read set, wherein one sequencing well contains at least one long fragment sequence;
    an extension module, which extends a plurality of seed sequences in parallel, using the reads and the sequencing wells corresponding to the reads, to obtain a plurality of contigs, wherein the plurality of seed sequences are determined from known sequences;
    a scaffold construction module, which constructs scaffolds based on the reads, the contigs, and the sequencing wells corresponding to the reads contained in the contigs, to obtain an assembly result for the separated long fragment sequences.
  24. The apparatus according to claim 23, characterized in that the seed sequences are obtained from a genomic reference sequence by the following steps:
    breaking the genomic reference sequence at N; and
    truncating the broken reference sequence to a predetermined length to obtain the seed sequences.
  25. The apparatus according to claim 24, characterized in that the predetermined length is not less than the insert size of the sequencing library used in the sequencing.
  26. The apparatus according to claim 23, characterized in that the read set comprises paired reads, and the seed sequences are obtained from the reads by the following steps:
    (1) cutting the reads into a plurality of Kmers with a sliding window, and constructing an index RKI from Kmers to the reads, for accessing the corresponding reads through a Kmer;
    (2) extracting from the read set a pair of paired reads that contain no high-frequency Kmer;
    (3) using the index RKI, determining all reads corresponding to the Kmers of each of the two reads of the pair in (2), to obtain a first read group and a second read group;
    (4) determining the sequencing wells corresponding to the first read group and the second read group of (3), respectively, to obtain a first sequencing well set and a second sequencing well set;
    (5) determining the intersection of the first sequencing well set and the second sequencing well set of (4), and if the size of the intersection does not differ significantly from the expected number of effective sequencing wells per base, determining the paired reads of (2) to be a seed sequence.
  27. The apparatus according to claim 26, characterized in that, in (5):
    if the size of the intersection is between one half and twice the expected number of effective sequencing wells per base, the paired reads of (2) are determined to be a seed sequence.
  28. The apparatus according to claim 23, characterized in that the extension module performs:
    (i) cutting the reads into a plurality of Kmers with a sliding window, and constructing an index RKI from Kmers to the reads, for accessing the corresponding reads through a Kmer;
    (ii) extending the plurality of seed sequences in parallel based on the reads and their corresponding index RKI, to obtain the plurality of contigs.
  29. The apparatus according to claim 28, characterized in that the RKI is obtained by the following steps:
    cutting the reads into a plurality of Kmers with a sliding window;
    constructing a hash keyed by Kmer, wherein the hash constitutes the RKI and records the frequency of each Kmer, the reads to which it belongs, and the position and orientation of the Kmer on those reads.
  30. The apparatus according to claim 28, characterized in that the seed sequences are extended by repeating the following steps:
    selecting a seed sequence suitable for extension;
    mapping the reads to the seed sequence to obtain an extension sequence;
    performing base-by-base consensus processing on the reads mapped at the end of the extension sequence; and
    if the consensus processing fails, performing heterozygosity identification, phasing, and/or repeat resolution.
  31. The apparatus according to claim 30, characterized in that a seed sequence suitable for extension is selected by the following steps:
    cutting the seed sequence into Kmers with a sliding window;
    retrieving the reads corresponding to the Kmers through the RKI;
    aligning the corresponding reads to the seed sequence;
    determining the coverage of the seed sequence by sequencing wells, based on the sequencing wells corresponding to the corresponding reads; and
    determining, based on the coverage, a seed sequence suitable for extension.
  32. The apparatus according to claim 30, characterized in that the reads are mapped to the seed sequence by the following steps:
    cutting the seed sequence into Kmers with a sliding window;
    retrieving the reads corresponding to the Kmers through the RKI;
    mapping the reads corresponding to the Kmers to the seed sequence and aligning them base by base.
  33. The apparatus according to claim 30, characterized in that, during the consensus processing, if the set of effective sequencing wells at the site being extended is evenly divided among the different base types at that site, heterozygosity is determined to be present.
  34. The apparatus according to claim 33, characterized in that, after heterozygosity is determined to be present, the extension sequence is split into a plurality of sequences that are extended separately.
  35. The apparatus according to claim 30, characterized in that the read set comprises a plurality of pairs of paired reads, the distance between the two reads of a pair being L,
    and, during the consensus processing, if the distance between a read of a pair that is mapped downstream in the extension direction and its mate is not L, the position at which that downstream read is mapped is determined to be the start of a repeat sequence.
  36. The apparatus according to claim 35, characterized in that the end of the repeat sequence is the position at which a read of a pair is mapped downstream in the extension direction, the distance between that downstream read and its mate being L, or is a conflict site encountered during extension.
  37. The apparatus according to claim 36, characterized in that whether the repeat sequence is a tandem repeat is determined by the following steps:
    cutting the reads mapped to the repeat sequence into Kmers with a sliding window;
    determining, for each such read, whether the Kmers obtained by the sliding cut include a repeated Kmer; if no repeated Kmer is present, determining that no tandem repeat is present, and if a repeated Kmer is present, determining that a tandem repeat is present.
  38. The apparatus according to claim 37, characterized in that, if the repeat sequence is a tandem repeat, the tandem repeat is resolved by the following steps:
    (m) determining the length of the tandem repeat;
    (n) adjusting the positions of the reads that contain the end of the tandem repeat by aligning them at that end.
  39. The apparatus according to claim 37, characterized in that, if the repeat sequence is not a tandem repeat and the following condition holds, the repeat sequence is determined to be a large double repeat:
    the length of the repeat sequence is greater than L, or the mate of a read of a pair mapped downstream of the repeat sequence is also mapped on the repeat sequence.
  40. The apparatus according to claim 39, characterized in that, if the repeat sequence is a large double repeat, the large double repeat is resolved as follows:
    comparing the difference between the number of effective sequencing wells corresponding to the upstream repeat copy and the number corresponding to the downstream repeat copy of the large double repeat with the expected number of effective sequencing wells per base, and resolving the conflict on the large double repeat according to the magnitude of the discrepancy.
  41. The apparatus according to claim 39, characterized in that, if the repeat sequence is not a large double repeat and the following condition holds, the repeat sequence is determined to be a small double repeat:
    the length of the repeat sequence is less than L.
  42. The apparatus according to claim 41, characterized in that, if the repeat sequence is a small double repeat, the small double repeat is resolved by at least one of the following:
    (p) taking the mean of the distances between the paired reads supporting each base as the expected position of the conflict site, and determining the base at the conflict site by comparing how close the distances between the paired reads supporting the two conflicting bases are to that mean;
    (k) constructing an auxiliary contig from the mates of the most-upstream reads, among the paired reads, that are mapped to non-repetitive sequence within the standard-deviation range of the extension sequence, and determining the base at the conflict site using the mates of the reads of pairs that are mapped downstream of the auxiliary contig.
  43. The apparatus according to claim 42, characterized in that, if the small double repeat cannot be resolved, extension of the seed sequence is terminated.
  44. The method according to claim 1, characterized in that the following is performed in the scaffold construction module:
    (iii) establishing merge connection relationships between the contigs based on the reads;
    (iv) constructing primary scaffolds based on the contigs and the sequencing wells corresponding to the reads contained in the contigs;
    (v) after cross-checking the merge connection relationships among the plurality of contigs against the primary scaffolds, merging the contigs to construct the scaffolds, thereby obtaining the assembly result for the separated long fragment sequences.
  45. A computer-readable medium, characterized in that it stores a computer-executable program, execution of which comprises carrying out the method of any one of claims 1-22.
  46. A system for assembling separated long fragment sequences, characterized by comprising:
    a data input unit for inputting data;
    a data output unit for outputting data;
    a storage unit for storing data, including a computer-executable program;
    a processor connected to the data input unit, the data output unit and the storage unit, for executing the computer-executable program, execution of which comprises carrying out the method of any one of claims 1-22.
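For illustration only, three of the steps recited in the claims can be sketched in code: the Kmer-to-read index RKI (claims 7 and 29), the repeated-Kmer test for tandem repeats (claims 15 and 37), and the half-to-twice seed criterion (claims 5 and 27). This is a minimal sketch under stated assumptions, not the claimed implementation: all function names and data layouts are hypothetical, and recording "-" strand positions as offsets on the reverse-complemented read is a choice made for the example.

```python
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]


def sliding_kmers(seq, k):
    """Yield (position, Kmer) for every window of length k."""
    for i in range(len(seq) - k + 1):
        yield i, seq[i:i + k]


def build_rki(reads, k):
    """Build a hash keyed by Kmer: Kmer -> list of (read_id, position, strand).

    reads: dict mapping read_id -> sequence (hypothetical layout).
    The frequency of a Kmer is the length of its list. Positions on the
    "-" strand are offsets on the reverse-complemented read (assumption).
    """
    rki = defaultdict(list)
    for read_id, seq in reads.items():
        for pos, kmer in sliding_kmers(seq, k):
            rki[kmer].append((read_id, pos, "+"))
        for pos, kmer in sliding_kmers(reverse_complement(seq), k):
            rki[kmer].append((read_id, pos, "-"))
    return rki


def has_repeated_kmer(read, k):
    """Return True if any Kmer occurs more than once within one read."""
    seen = set()
    for _, kmer in sliding_kmers(read, k):
        if kmer in seen:
            return True
        seen.add(kmer)
    return False


def is_seed(wells_first, wells_second, expected_wells):
    """Apply the half-to-twice criterion to the shared sequencing wells."""
    shared = len(wells_first & wells_second)
    return expected_wells / 2 <= shared <= 2 * expected_wells


if __name__ == "__main__":
    index = build_rki({"r1": "AAACCC"}, 3)
    print(index["AAC"], index["GTT"])
    print(has_repeated_kmer("ACGACGACG", 3))
    print(is_seed({1, 2, 3, 4}, {2, 3, 4, 5}, expected_wells=4))
```

In practice the choice of k matters for the repeated-Kmer test: a k shorter than the repeat unit makes chance repeats more likely, while a k longer than the read's repeated stretch misses the repeat.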
PCT/CN2016/074665 2016-02-26 2016-02-26 Method and apparatus for assembling separated long fragment sequences WO2017143585A1 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
PCT/CN2016/074665 WO2017143585A1 (en) 2016-02-26 2016-02-26 Method and apparatus for assembling separated long fragment sequences
CN201680063769.5A CN108350495B (en) 2016-02-26 2016-02-26 Method and apparatus for assembling partitioned long fragment sequences
HK18113476.6A HK1254399A1 (en) 2016-02-26 2018-10-21 Method and apparatus for assembling separated long fragment sequences

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2016/074665 WO2017143585A1 (en) 2016-02-26 2016-02-26 Method and apparatus for assembling separated long fragment sequences

Publications (1)

Publication Number Publication Date
WO2017143585A1 true WO2017143585A1 (en) 2017-08-31

Family

ID=59685953

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/074665 WO2017143585A1 (en) 2016-02-26 2016-02-26 Method and apparatus for assembling separated long fragment sequences

Country Status (3)

Country Link
CN (1) CN108350495B (en)
HK (1) HK1254399A1 (en)
WO (1) WO2017143585A1 (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111128303B (en) * 2018-10-31 2023-09-15 深圳华大生命科学研究院 Method and system for determining corresponding sequences in a target species based on known sequences
CN109658981B (en) * 2018-12-10 2022-10-04 海南大学 Data classification method for single cell sequencing
CN111755075B (en) * 2019-03-28 2023-09-29 深圳华大生命科学研究院 Method for filtering sequence pollution among high-throughput sequencing samples of immune repertoire
CN112825267B (en) * 2019-11-21 2024-05-14 深圳华大基因科技服务有限公司 Method for determining a collection of small nucleic acid sequences and use thereof
CN112802554B (en) * 2021-01-28 2023-09-22 中国科学院成都生物研究所 Animal mitochondrial genome assembly method based on second-generation data
CN114333989B (en) * 2021-12-31 2023-06-13 天津诺禾致源生物信息科技有限公司 Method and device for positioning characters
CN115148289B (en) * 2022-09-06 2023-01-24 安诺优达基因科技(北京)有限公司 Method and device for assembling autotetraploid gene component types and device for constructing chromosome

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102206704A (en) * 2011-03-02 2011-10-05 深圳华大基因科技有限公司 Method and device for assembling genome sequence
CN102789553A (en) * 2012-07-23 2012-11-21 中国水产科学研究院 Method and device for assembling genomes by utilizing long transcriptome sequencing result
CN104017883A (en) * 2014-06-18 2014-09-03 深圳华大基因科技服务有限公司 Method and system for assembling genomic sequence
CN104164479A (en) * 2014-04-04 2014-11-26 深圳华大基因科技服务有限公司 Heterozygous genome processing method
US20150094961A1 (en) * 2013-10-01 2015-04-02 Complete Genomics, Inc. Phasing and linking processes to identify variations in a genome
CN105189308A (en) * 2013-03-15 2015-12-23 考利达基因组股份有限公司 Multiple tagging of long DNA fragments

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111986729A (en) * 2019-05-21 2020-11-24 深圳华大基因科技服务有限公司 Method and system for optimizing framework sequence and application
CN110544510A (en) * 2019-05-31 2019-12-06 中南大学 Contig integration method based on adjacent algebraic model and quality grade evaluation
CN110544510B (en) * 2019-05-31 2023-03-24 中南大学 Contig integration method based on adjacent algebraic model and quality grade evaluation
CN110534158A (en) * 2019-08-16 2019-12-03 浪潮电子信息产业股份有限公司 Gene sequence comparison method, device, server and medium
CN110534158B (en) * 2019-08-16 2023-08-04 浪潮电子信息产业股份有限公司 Gene sequence comparison method, device, server and medium
CN111564182A (en) * 2020-05-12 2020-08-21 西藏自治区农牧科学院水产科学研究所 Method for assembling high-reconvergence Glyptosternum genus fish at chromosome level
CN111564182B (en) * 2020-05-12 2024-02-09 西藏自治区农牧科学院水产科学研究所 Method for assembling high-reconvergence Glyptosternum genus fish at chromosome level
CN112634989A (en) * 2020-12-29 2021-04-09 山东建筑大学 Double-sided genome fragment filling method and device based on fragment contig
CN113963749A (en) * 2021-09-10 2022-01-21 华南农业大学 High-throughput sequencing data automatic assembly method, system, equipment and storage medium
CN115641911A (en) * 2022-10-19 2023-01-24 哈尔滨工业大学 Method for detecting overlapping between sequences

Also Published As

Publication number Publication date
CN108350495B (en) 2021-10-01
CN108350495A (en) 2018-07-31
HK1254399A1 (en) 2019-07-19

Similar Documents

Publication Number Publication Date Title
WO2017143585A1 (en) Method and apparatus for assembling separated long fragment sequences
US20210217490A1 (en) Method, computer-accessible medium and system for base-calling and alignment
EP3304383B1 (en) De novo diploid genome assembly and haplotype sequence reconstruction
Kim et al. ECgene: genome-based EST clustering and gene modeling for alternative splicing
Laehnemann et al. Denoising DNA deep sequencing data—high-throughput sequencing errors and their correction
US20180165411A1 (en) Systems and methods for assembling fragmented information
CN104302781B (en) Method and device for detecting chromosomal structural abnormality
US20120197533A1 (en) Identifying rearrangements in a sequenced genome
JP2008547080A (en) Method for processing ditag sequences and / or genome mapping
CN110692101B (en) Method for aligning targeted nucleic acid sequencing data
EP3378001B1 (en) Methods for detecting copy-number variations in next-generation sequencing
CN106715711A (en) Method for determining the sequence of a probe and method for detecting genomic structural variation
Wildschutte et al. Discovery and characterization of Alu repeat sequences via precise local read assembly
US20210375397A1 (en) Methods and systems for determining fusion events
CN109994154A (en) Screening device for candidate disease-causing genes of single-gene recessive genetic disorders
Smart et al. A novel phylogenetic approach for de novo discovery of putative nuclear mitochondrial (pNumt) haplotypes
Sezerman et al. Bioinformatics workflows for genomic variant discovery, interpretation and prioritization
WO2019132010A1 (en) Method, apparatus and program for estimating base type in base sequence
JP5414130B2 (en) Program for judging base sequence read errors
CN111028885B (en) Method and device for detecting yak RNA editing site
US20200216888A1 (en) Method for increasing accuracy of analysis by removing primer sequence in amplicon-based next-generation sequencing
Bolognini Unraveling tandem repeat variation in personal genomes with long reads
US20170226588A1 (en) Systems and methods for dna amplification with post-sequencing data filtering and cell isolation
Lee et al. Analysis of unmapped regions associated with long deletions in Korean whole genome sequences based on short read data
Lecompte Structural variant genotyping with long read data

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16891039

Country of ref document: EP

Kind code of ref document: A1

122 Ep: pct application non-entry in european phase

Ref document number: 16891039

Country of ref document: EP

Kind code of ref document: A1