WO2017143585A1 - Method and apparatus for assembling separated long fragment sequences - Google Patents

Method and apparatus for assembling separated long fragment sequences

Info

Publication number
WO2017143585A1
Authority
WO
WIPO (PCT)
Prior art keywords
sequence
read
reads
sequencing
kmer
Prior art date
Application number
PCT/CN2016/074665
Other languages
English (en)
Chinese (zh)
Inventor
谢寅龙
黄伟华
李净净
郭瑞东
唐静波
邓超
Original Assignee
深圳华大基因研究院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳华大基因研究院 filed Critical 深圳华大基因研究院
Priority to PCT/CN2016/074665 priority Critical patent/WO2017143585A1/fr
Priority to CN201680063769.5A priority patent/CN108350495B/zh
Publication of WO2017143585A1 publication Critical patent/WO2017143585A1/fr
Priority to HK18113476.6A priority patent/HK1254399A1/zh

Classifications

    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00: ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression

Definitions

  • The present invention relates to the field of biotechnology, and in particular to a method and apparatus for assembling separated long fragment sequences.
  • Comparative Genomics is a discipline that compares known gene and genomic structures to understand gene function, expression mechanism, and species evolution.
  • The basic principle is that characteristics shared by two organisms are usually encoded by evolutionarily conserved DNA, so the relevant DNA fragments will be the same or similar.
  • Comparative genomics analysis requires a genomic map (genomic reference sequence) and sequencing data for the objects being compared. Mutation detection is an important part of comparative genomics.
  • The International HapMap Project and genome-wide association studies (GWAS) are both built on one type of variation, the single-nucleotide polymorphism (SNP).
  • Assembly refers to the process of integrating shorter fragment sequences into longer sequences. Limited by sequencing technology, a genome cannot be read by sequencing a chromosome from one end to the other. Instead, the whole genome is broken into fragments of tens to thousands of bases, whose contents are read by massively parallel sequencing; assembly then analyzes and integrates these fragments and restores them into relatively complete genomic sequences. Identifying mutations through assembly is a newer application of assembly techniques; the main purpose of assembly is to build the genome. When there is no genomic reference sequence, the genome is constructed from scratch, which is especially important. This type of assembly is also called "de novo assembly".
  • The diploid genome of a human individual is derived from one haploid set contributed by each of its two parents. The differences between the two haploid sets cause the individual to carry different alleles at one or more sites on homologous chromosomes; this phenomenon is called heterozygosity.
  • The human reference sequence is constructed from data from multiple individuals, which makes it a virtual, mixed haploid genome. As genomic research deepens, the haploid human reference sequence is increasingly unable to meet demand, the construction of haplotypes (haplotyping) is increasingly important, and genomic analyses based on haplotype information are emerging.
  • Haplotype information helps to interpret the relationship between genotype and phenotype.
  • Two individuals with the same set of heterozygous sites may still differ in phenotype and disease susceptibility, depending on their haplotypes.
  • Studies such as the specific expression of alleles
  • disease research such as Mediterranean fever, breast cancer
  • Phasing is an important operation for constructing haplotypes. There are many methods for constructing haplotypes, mainly divided into the following five categories:
  • The core of the physical separation method is to separate the long DNA fragments into which the genome has been broken, either by means of fosmid plasmids or by direct physical separation into the wells of a multiwell plate. After separation, the further fragmentation and amplification required for sequencing (library construction) are carried out and, in order to distinguish the separation units, the sequences in each unit are tagged with a different barcode. In this way the whole genome is divided into many sub-portions. When the number of sub-portions is large, each sub-portion contains only one haploid copy of any small region of the genome. Regions that are heterozygous at the genome-wide level therefore appear homozygous within these small regions, which is of great importance for the construction of haplotypes.
  • Each separation unit has its own unique barcode sequence, so that after sequencing the reads belonging to each separation unit can be retrieved by identifying the barcode.
  • In fosmid-based separation, the separation unit is called a fosmid pool; each fosmid pool contains one or more long fragments of about 37 kbp. In multiwell-plate separation, the separation unit is called a well; each well contains multiple long fragments, whose length varies from technology to technology.
  • The separation method introduces a new type of information: the barcodes divide the total collection of reads into many groups, so the reads are no longer one undifferentiated set as in whole-genome shotgun (WGS) sequencing.
  • The reads in each group come from a common set of small regions on the genome, namely the regions covered by the long fragments of that separation unit. Although these small regions may come from any position on the genome, the reads in each group are constrained and clustered. This additional grouping information is the key to constructing haplotypes.
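As a minimal sketch of how barcodes partition a read set into per-well groups, the following assumes (purely for illustration) that each read carries a fixed-length barcode as its prefix; the function name and the barcode layout are hypothetical and are not taken from the source.

```python
from collections import defaultdict

def group_reads_by_barcode(reads, barcode_len):
    """Split reads into per-well groups keyed by a leading barcode.

    Assumption: the barcode is the first `barcode_len` bases of each read.
    """
    groups = defaultdict(list)
    for seq in reads:
        barcode, insert = seq[:barcode_len], seq[barcode_len:]
        groups[barcode].append(insert)
    return dict(groups)

reads = ["AAGTACGT", "CCGGGTTA", "AAGTTTGA"]
wells = group_reads_by_barcode(reads, barcode_len=2)
```

Each key of `wells` then stands for one separation unit, and its value holds only the reads that came from the long fragments in that unit.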
  • LFR Long Fragment Reads
  • the present invention is directed to solving at least some of the above technical problems or at least providing a useful commercial choice.
  • the invention provides a method of assembling a sequence of separated long segments, comprising:
  • The program may be stored in a computer-readable storage medium, which may include a read-only memory, a random-access memory, a magnetic disk, an optical disc, or the like.
  • The present invention provides a system for assembling separated long fragment sequences, the system comprising: a data input unit for inputting data; a data output unit for outputting data; a storage unit for storing data, including a computer-executable program; and a processor, coupled to the data input unit, the data output unit and the storage unit, for executing the computer-executable program, wherein executing the program comprises performing the above method.
  • The invention provides an apparatus for assembling separated long fragment sequences, comprising:
  • An extension module, which uses the reads and the sequencing wells corresponding to the reads to extend a plurality of seed sequences in parallel to obtain a plurality of sequence contigs, wherein the plurality of seed sequences are determined from known sequences;
  • A skeleton sequence construction module, which constructs a skeleton sequence based on the reads, the sequence contigs, and the sequencing wells corresponding to the reads contained in the sequence contigs, to obtain the assembly result for the separated long fragment sequences.
  • Iterative extension: the extension of the seed sequence is an iterative process. The extended sequence serves as the base for the next iteration, so that the seed sequence keeps growing.
  • Linear extension: the linear extension method makes the structures encountered during assembly simpler, with logical relationships that are relatively clear and easy to classify. A graph algorithm, by contrast, gathers the information of all repeat regions in one place, which involves more information and a more complex structure. In other words, a graph solution is a one-time solution: once one repeat region is resolved, all genomic regions associated with it (those having the same or similar repeat sequences) are resolved at the same time. The linear method resolves only the repeat in the current region, and resolves it again when the extension reaches the same repeat in another region of the genome. Recording the relevant information after a repeat has been resolved makes the next resolution of that repeat easier, which reduces the extra computational cost.
  • The linear extension mode also makes the extension of each seed sequence independent of the others, which makes the computation easier to parallelize.
  • The extension of the seed sequence uses the well information to extend each haplotype along its own path. Whenever a heterozygous region is encountered,
  • phasing is carried out immediately, in combination with the previously assembled heterozygous regions; that is, phasing is also linear and runs through the various modules of the assembly apparatus.
  • The extension of a homozygous region still follows a single path. When a heterozygous region is encountered, the extension splits into several paths; after the heterozygous region has been extended, the paths merge back into a single path and extension continues.
  • The algorithm is essentially global: when the extension encounters a complicated situation, the algorithm strictly chooses to terminate the extension rather than continue with the higher-scoring option, thereby avoiding greedy decisions.
  • This makes the algorithm more global than conventional hierarchical assembly algorithms and can greatly reduce the sequencing depth requirement; that is, no particularly high sequencing depth is required for each well, which saves considerable resources in both cost and time.
  • FIG. 1 is a flow chart of a method of assembling a sequence of separated long segments in an embodiment of the present invention.
  • FIG. 2 is a flow chart of a method of assembling a sequence of separated long segments in an embodiment of the present invention.
  • FIG. 3 is a block diagram of an apparatus for assembling separated long fragment sequences in an embodiment of the present invention.
  • FIG. 4 is a schematic view of the structure of an apparatus for assembling separated long fragment sequences in an embodiment of the present invention.
  • FIG. 5 is a schematic diagram of a seed sequence iterative extension process in an embodiment of the present invention.
  • Fig. 6 is a schematic diagram of identifying a repeating region in an embodiment of the present invention.
  • Figure 7 is a schematic illustration of the processing of conflicts of large double repeat sequences having two repeating regions that are relatively large in an embodiment of the present invention.
  • Figure 8 is a schematic illustration of the processing of collisions of large double repeat sequences having two repeating regions that are relatively small in an embodiment of the present invention.
  • Figure 9 is a schematic illustration of the conflicting processing of tandem repeats in an embodiment of the invention.
  • the invention proposes a method of assembling a sequence of separated long segments, see Fig. 1, the method comprising:
  • The seed sequence is obtained from the genomic reference sequence by the following steps: the genomic reference sequence is broken at runs of N; and
  • each piece of the broken reference sequence is cut into segments of a predetermined length to obtain the seed sequences.
  • The predetermined length is not particularly limited. Generally, it is not less than the size of the sequencing library insert, so that reads from the same insert can be positioned on the same seed sequence.
  • Where the sequencing is paired-end sequencing, the predetermined length is at least twice the length of the sequencing library insert, which makes it easier to position both reads of a pair on the same seed sequence.
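The reference-based seed construction described above can be sketched as follows. This is an illustrative sketch only: the function name is hypothetical, the seed length is fixed at exactly twice the insert size, and trailing pieces shorter than the seed length are discarded, which is an assumption the source does not state.

```python
import re

def seeds_from_reference(reference, insert_size):
    """Break a reference at runs of N, then cut each block into fixed-length seeds."""
    seed_len = 2 * insert_size  # "at least twice the insert length"; exactly 2x here
    seeds = []
    for block in re.split(r"N+", reference.upper()):
        for start in range(0, len(block), seed_len):
            piece = block[start:start + seed_len]
            if len(piece) == seed_len:  # assumption: drop short leftovers
                seeds.append(piece)
    return seeds

example_seeds = seeds_from_reference("ACGTACGTNNNACGTAC", insert_size=2)
```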
  • The set of reads comprises paired reads, and the seed sequence is obtained from the reads by the following steps: (1) sliding-cutting the reads into a plurality of Kmers and constructing an index RKI from Kmers to reads, for accessing the corresponding reads by Kmer; (2) extracting from the read set a read pair that contains no high-frequency Kmer; (3) using the index RKI, determining all reads that share a Kmer with each of the two reads of the pair in (2), to obtain a first read group and a second read group; (4) determining the sequencing wells corresponding to the first and second read groups in (3), to obtain a first sequencing well set and a second sequencing well set; (5) determining the intersection of the first and second sequencing well sets in (4) and, if the size of the intersection is not significantly different from the expected number of effective sequencing wells for a base, determining that the read pair in (2) is the seed sequence, the expected number of effective sequencing wells for
  • "High frequency" is relative to the average frequency: a Kmer whose number of occurrences is higher than the average number of Kmer occurrences is considered a high-frequency Kmer.
  • As needed, the inventors define a high-frequency Kmer as one whose number of occurrences is far above the average, where "far above" means at least 10, 20, 30, 40, 50 or 60 times the average number of Kmer occurrences.
  • A Kmer with a frequency of 3000 to 5000 is set as a high-frequency Kmer.
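The "far above the average" rule for high-frequency Kmers can be illustrated with a short sketch. The function name and the `fold` parameter are hypothetical; `fold=10` corresponds to the lowest multiple mentioned in the text.

```python
from collections import Counter

def high_frequency_kmers(reads, k, fold=10):
    """Return the Kmers whose count is at least `fold` times the mean Kmer count."""
    counts = Counter(r[i:i + k] for r in reads for i in range(len(r) - k + 1))
    if not counts:
        return set()
    mean = sum(counts.values()) / len(counts)
    return {kmer for kmer, c in counts.items() if c >= fold * mean}

toy_reads = ["AAAAAAAAAAAA", "ACGT"]
frequent = high_frequency_kmers(toy_reads, k=2, fold=2)
```

In the toy input, `AA` occurs 11 times against a mean of 3.5, so it passes a two-fold threshold but not a ten-fold one.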
  • Of these two ways of obtaining seed sequences, the former relies only on known sequences, such as reference sequences, reads obtained by sequencing, or assembled fragments, while the latter combines known sequences, the RKI and the sequencing wells, and can therefore be applied to de novo assembly of species that have no reference sequence at all.
  • (b) comprises: (i) sliding-cutting the reads into a plurality of Kmers and constructing an index RKI from Kmers to reads, for accessing the corresponding reads by Kmer; and (ii) extending the plurality of seed sequences in parallel, based on the reads and their index RKI, to obtain the plurality of sequence contigs.
  • The RKI is obtained by sliding-cutting the reads into a plurality of Kmers and constructing a hash keyed by Kmer. The hash constitutes the RKI and records, for each Kmer, its frequency, the associated reads, and the position and direction of the Kmer on each read.
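A minimal sketch of such a Kmer-keyed hash is given below. The layout of each entry (a `freq` counter plus a list of `(read_id, position, strand)` hits) and the choice to index the reverse complement as the "-" direction are illustrative assumptions; the source only states which pieces of information the hash records.

```python
from collections import defaultdict

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def build_rki(reads, k):
    """Build a Kmer -> {freq, hits} hash; each hit is (read_id, position, strand)."""
    rki = defaultdict(lambda: {"freq": 0, "hits": []})
    for read_id, seq in enumerate(reads):
        for pos in range(len(seq) - k + 1):
            kmer = seq[pos:pos + k]
            # Index the Kmer as read ("+") and its reverse complement ("-").
            # Note: a palindromic Kmer is therefore counted twice.
            for key, strand in ((kmer, "+"), (reverse_complement(kmer), "-")):
                entry = rki[key]
                entry["freq"] += 1
                entry["hits"].append((read_id, pos, strand))
    return dict(rki)

rki = build_rki(["ACGTA"], k=3)
```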
  • The seed sequence is extended by repeating the following steps: selecting a seed sequence suitable for extension; positioning reads onto the seed sequence to obtain an extension sequence; performing base-by-base consensus processing on the reads positioned at the end of the extension sequence; and, if the consensus processing fails, performing heterozygosity recognition, phasing, and/or resolution of repeat sequences.
  • A seed sequence suitable for extension is selected by: sliding-cutting the seed sequence into Kmers; obtaining the reads corresponding to those Kmers through the RKI; aligning the corresponding reads with the seed sequence; determining, from the sequencing wells of the corresponding reads, how the sequencing wells cover the seed sequence; and determining from this coverage which seed sequences are suitable for extension.
  • Reads are positioned onto the seed sequence by: sliding-cutting the seed sequence into Kmers; obtaining the reads corresponding to those Kmers through the RKI; and mapping those reads onto the seed sequence and aligning them base by base.
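Positioning reads on a seed through shared Kmers can be sketched as below. This is a simplified illustration, not the patented procedure: it builds a forward-strand-only Kmer index inline, lets every shared Kmer vote for an offset (seed position minus read position), and reports the most-voted offset per read. All names are hypothetical, and the base-by-base alignment step is omitted.

```python
from collections import Counter, defaultdict

def locate_reads(seed, reads, k):
    """Return {read_id: offset of the read start on the seed} via Kmer voting."""
    index = defaultdict(list)  # Kmer -> [(read_id, position on read)]
    for read_id, seq in enumerate(reads):
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((read_id, pos))
    votes = defaultdict(Counter)
    for seed_pos in range(len(seed) - k + 1):
        for read_id, read_pos in index.get(seed[seed_pos:seed_pos + k], ()):
            votes[read_id][seed_pos - read_pos] += 1
    return {rid: c.most_common(1)[0][0] for rid, c in votes.items()}

placements = locate_reads("ACGTTGCA", ["GTTGCAGG", "TTTTTT"], k=4)
```

Here the first read starts at offset 2 on the seed and overhangs its end by two bases, which is the kind of read that can contribute to an extension; the second read shares no Kmer with the seed and is not placed.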
  • If the set of effective sequencing wells at the extension site is evenly divided among the different base types observed at the site, the site is judged to be heterozygous.
  • The extension sequence is then split into several branches, which are extended separately.
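The heterozygosity test above can be sketched as a check on how the supporting wells are split between bases. The `tolerance` ratio used to decide what counts as "evenly divided" is an assumption introduced for illustration; the source gives no numeric criterion.

```python
def is_heterozygous(wells_by_base, tolerance=0.5):
    """Judge a site heterozygous if two or more bases have comparable well support.

    wells_by_base maps a base to the set of effective wells supporting it.
    Assumption: "comparable" means smallest support >= tolerance * largest support.
    """
    sizes = [len(wells) for wells in wells_by_base.values() if wells]
    if len(sizes) < 2:
        return False
    return min(sizes) >= tolerance * max(sizes)

balanced = is_heterozygous({"A": {1, 2, 3, 4}, "G": {5, 6, 7}})
skewed = is_heterozygous({"A": {1, 2, 3, 4, 5, 6}, "G": {7}})
```

A strongly skewed split, as in the second call, is more consistent with a sequencing error or a repeat than with two haplotypes.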
  • The set of reads comprises a plurality of read pairs, and the distance between the two reads of a pair is L. During consensus processing, if the distance between a read located downstream in the extension direction and its mate is not L, the position of that downstream read is determined to be the start point of a repeat sequence.
  • Any numerical value stated exactly may represent a range, for example the interval of plus or minus 10% around the value; or, where the underlying population is normally distributed, the interval of plus or minus the standard deviation around the value.
  • The distance L between the two reads of a pair corresponds to the length of the insert.
  • The size of the constructed sequencing library is fixed, that is, the insert size is a fixed value. In theory, the distance between the outer ends of the paired reads obtained by paired-end sequencing equals this fixed value.
  • In practice, the distance between paired reads is normally distributed.
  • The inventors set L to the insert size of the experimental stage. Those skilled in the art will understand that L may instead be set to any value within plus or minus the standard deviation of the experimental insert size, or to that whole interval, and that repeat sequences encountered during extension can still be identified and resolved by the repeat-resolution method of the present invention.
  • The end point of the repeat sequence is either the position at which a read located downstream in the extension direction again lies at distance L from its mate, or the conflict site.
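The pair-distance criterion for the start of a repeat can be illustrated as follows. The sketch assumes each pair is given as `(upstream_position, downstream_position)` on the extension sequence and treats "distance is L" as "within plus or minus 10% of the insert size", one of the tolerance readings the text allows; the function name and data layout are hypothetical.

```python
def first_repeat_start(pairs, insert_size, tolerance=0.1):
    """Return the position of the first downstream read whose mate distance is not L.

    pairs: iterable of (upstream_pos, downstream_pos); returns None if all agree.
    """
    low = insert_size * (1 - tolerance)
    high = insert_size * (1 + tolerance)
    for upstream_pos, downstream_pos in sorted(pairs, key=lambda p: p[1]):
        if not low <= downstream_pos - upstream_pos <= high:
            return downstream_pos
    return None

start = first_repeat_start([(0, 500), (100, 610), (150, 900)], insert_size=500)
```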
  • Whether the repeat sequence is a tandem repeat is determined by: sliding-cutting the reads positioned on the repeat sequence into Kmers; and checking whether the Kmers obtained from any single read contain a repeated Kmer. If no repeated Kmer is present, it is judged that there is no tandem repeat; if a repeated Kmer is present, it is judged that there is a tandem repeat.
  • The tandem repeat sequence is resolved by: determining the length of the tandem repeat sequence; and using the reads that contain the end point of the tandem repeat sequence to adjust the position of the end point and thereby determine the base at the conflict position.
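The repeated-Kmer test for tandem repeats is simple enough to sketch directly; the function name is hypothetical and the choice of `k` is left to the caller.

```python
def has_tandem_repeat(read, k):
    """Return True if any Kmer occurs more than once within a single read."""
    seen = set()
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        if kmer in seen:
            return True
        seen.add(kmer)
    return False

repeated = has_tandem_repeat("ACGACGACGT", k=3)
unique = has_tandem_repeat("ACGTTGCA", k=3)
```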
  • The repeat sequence is judged to be a large double repeat sequence if either of the following holds: the length of the repeat sequence is greater than L, or the mate of a read located downstream of the repeat sequence is also located on the repeat sequence.
  • The large double repeat sequence is resolved by comparing the difference between the number of effective sequencing wells (EW) corresponding to the upstream repeat and the number corresponding to the downstream repeat against the expected number of effective sequencing wells for a base, and resolving the conflict on the large double repeat according to the size of that difference. For example, if the two repeats of a large double repeat are relatively far apart, the upstream conflicting base is covered by more EW than the downstream one, and the difference exceeds half of the expected EW number; the base can then be determined by this comparison, which resolves the conflict. If the difference between the upstream and downstream EW numbers is less than or equal to half of the expected EW number, a helper contig (HC) can be constructed from the upstream arms corresponding to the wrong downstream arms that lie within one insert length of the start of the repeat region, and the EW set on the HC is then compared with the EW sets of the upstream conflicting base and of the downstream conflicting base.
  • EW: effective sequencing wells
  • HC: helper contig
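The first decision in the large double repeat case, comparing the upstream and downstream EW counts against half of the expected EW number, can be sketched as below. The return labels are hypothetical, and the helper contig branch is only signalled, not implemented.

```python
def resolve_large_double_repeat(upstream_wells, downstream_wells, expected_ew):
    """Decide whether the EW difference alone resolves the conflict.

    Returns "upstream" when the upstream base has more than expected_ew / 2
    additional wells; otherwise signals that a helper contig is needed.
    """
    difference = len(upstream_wells) - len(downstream_wells)
    if difference > expected_ew / 2:
        return "upstream"
    return "needs_helper_contig"

clear_case = resolve_large_double_repeat(set(range(10)), {1, 2}, expected_ew=8)
close_case = resolve_large_double_repeat({1, 2, 3, 4, 5}, {6, 7, 8, 9}, expected_ew=8)
```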
  • If the repeat sequence is not a large double repeat sequence and the following condition holds, it is determined to be a small double repeat sequence: the length of the repeat sequence is less than L.
  • The small double repeat sequence is resolved by at least one of: (p) taking the mean distance between the paired reads supporting each base as the expected position of the conflict site, and comparing how close the pair distances supporting each of the two conflicting bases are to that mean, to determine the base at the conflict site; (k) constructing a helper contig from the mates of the paired reads positioned furthest upstream in the non-repetitive part of the extension sequence, within the standard-deviation range, and using the mates of the reads located downstream on the helper contig to determine the base at the conflict site.
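One possible reading of option (p) is sketched below: for each conflicting base, take the mean distance of the read pairs that support it and choose the base whose mean lies closest to the expected pair distance. Both this interpretation and the names used are assumptions, since the source wording is ambiguous.

```python
from statistics import mean

def pick_base_by_pair_distance(distances_by_base, expected_distance):
    """Choose the base whose supporting pairs have a mean distance closest to expected."""
    return min(
        distances_by_base,
        key=lambda base: abs(mean(distances_by_base[base]) - expected_distance),
    )

chosen = pick_base_by_pair_distance(
    {"A": [498, 503, 501], "T": [620, 640]}, expected_distance=500
)
```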
  • If the small double repeat sequence cannot be resolved, the conflict remains unresolved and the extension of the seed sequence is terminated.
  • (c) includes: (iii) establishing merge relationships between the sequence contigs based on the reads; (iv) based on the sequence contigs and the … sequence contigs, constructing the skeleton sequence to obtain the assembly result for the separated long fragment sequences.
  • Linear extension: unlike graph-based assembly algorithms, the linear extension method makes the structures encountered during assembly simpler, with logical relationships that are relatively clear and easy to classify, whereas graph algorithms aggregate the information of all repeat regions in one place, involving more information and more complex structures. In other words, a graph solution is a one-time solution: once one repeat region is resolved, all genomic regions associated with it (those having the same or similar repeat sequences) are resolved at the same time. The linear method resolves only the repeat in the current region, and resolves it again when the extension reaches the same repeat in another region of the genome. Recording the relevant information after a repeat has been resolved makes the next resolution of that repeat easier, which reduces the extra computational cost. In addition, the linear extension mode makes the extension of each seed sequence independent of the others, which makes the computation easier to parallelize.
  • The extension of the seed sequence uses the well information to extend each haplotype along its own path. When a heterozygous region is encountered, it is phased immediately in combination with the previously assembled heterozygous regions; that is, phasing is also linear and runs through the various modules of the assembly system (e.g. contig merging, skeleton sequence construction).
  • the above process is written into an executable program.
  • The executable program proceeds as follows. First, the seeds and reads are read in, and the reads are built into a Kmer index of the reads (Read Kmer Index, RKI), so that target reads can be accessed quickly through a Kmer. The seeds are then extended as far as possible. Since the extensions of different seeds are independent of one another, this step can be accelerated by parallel operation.
  • The extended seeds are the sequence contigs. At this point, the well information is used to pre-build the skeleton sequence (scaffold): only the order relationships between contigs are established, and the scaffold is not constructed immediately. Similarly, the read and paired-read information is used to establish merge relationships between contigs, without merging them immediately. The contig merge relationships and the relationships in the pre-built scaffold are then checked against each other, the relationships are simplified and conflicts are resolved, after which the contigs are merged and the scaffold is constructed, and finally the assembly result is output.
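Because each seed is extended independently, the extension step maps naturally onto a worker pool. The sketch below uses a thread pool from the Python standard library with a placeholder `extend_seed` function; both are illustrative assumptions, and a real CPU-bound implementation in Python would more likely use processes than threads.

```python
from concurrent.futures import ThreadPoolExecutor

def extend_seed(seed):
    """Hypothetical placeholder: a real implementation would extend the seed into a contig."""
    return seed

def extend_all(seeds, extend=extend_seed, workers=4):
    """Extend every seed independently and return the contigs in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(extend, seeds))

contigs = extend_all(["AC", "GT"], extend=lambda s: s + "N")
```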
  • the present invention provides an apparatus for assembling a sequence of separated long segments for performing the method of any of the above-described embodiments of the present invention.
  • The apparatus includes: an input module, which obtains a read set by sequencing and records the sequencing well corresponding to each read in the read set, one sequencing well containing at least one long fragment sequence; an extension module, which uses the reads and their corresponding sequencing wells to extend a plurality of seed sequences in parallel to obtain a plurality of sequence contigs, the seed sequences being determined from known sequences; and a skeleton sequence construction module, which constructs a skeleton sequence based on the reads, the sequence contigs, and the sequencing wells corresponding to the reads contained in the sequence contigs, to obtain the assembly result for the separated long fragment sequences.
  • The seed sequence is obtained from a genomic reference sequence by the following steps: the genomic reference sequence is broken at runs of N; and each piece of the broken reference sequence is cut into segments of a predetermined length to obtain the seed sequences.
  • The predetermined length is not less than the length of the sequencing library insert.
  • The set of reads comprises paired reads, and the seed sequence is obtained from the reads by the following steps: (1) sliding-cutting the reads into a plurality of Kmers and constructing an index RKI from Kmers to reads, for accessing the corresponding reads by Kmer; (2) extracting from the read set a read pair that contains no high-frequency Kmer; (3) using the index RKI, determining all reads that share a Kmer with each of the two reads of the pair in (2), to obtain a first read group and a second read group; (4) determining the sequencing wells corresponding to the first and second read groups in (3), to obtain a first sequencing well set and a second sequencing well set; (5) determining the intersection of the first and second sequencing well sets in (4); if the size of the intersection is not significantly different from the expected number of effective sequencing wells for a base, the read pair in (2) is determined to be the seed sequence.
  • the expected number of effective sequencing wells for the base is determined by the intersection of the
  • If the size of the intersection is between half and two times the expected number of effective sequencing wells for a base, the read pair in (2) is determined to be the seed sequence.
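The half-to-two-times rule for accepting a read pair as a seed can be sketched as a set intersection followed by a range check. The function and parameter names are hypothetical; the two inputs stand for the well sets reached from each read of the pair through the Kmer index.

```python
def is_seed_pair(wells_read1, wells_read2, expected_ew):
    """Accept the pair if the shared well count lies in [expected_ew / 2, 2 * expected_ew]."""
    shared = len(set(wells_read1) & set(wells_read2))
    return expected_ew / 2 <= shared <= 2 * expected_ew

accepted = is_seed_pair({1, 2, 3, 4, 5}, {3, 4, 5, 6}, expected_ew=4)
rejected = is_seed_pair({1, 2}, {3, 4}, expected_ew=4)
```

An intersection that is too small suggests the two reads do not come from the same long fragments, and one that is too large suggests a repeated region; both cases are rejected.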
  • The apparatus further includes an index construction module, coupled to the extension module, for sliding-cutting the reads into a plurality of Kmers and constructing an index RKI from Kmers to reads, for accessing the corresponding reads by Kmer. The extension module is then used to extend the plurality of seed sequences in parallel, based on the reads and their index RKI, to obtain the plurality of sequence contigs.
  • The RKI is obtained by sliding-cutting the reads into a plurality of Kmers and constructing a hash keyed by Kmer. The hash constitutes the RKI and records the frequency of each Kmer, the associated reads, and the position and orientation of the Kmer on each read.
  • the seed sequence is extended by repeating the steps of: selecting a seed sequence suitable for extension; positioning the read to the seed sequence to obtain an extension sequence; positioning the extension Reads at the end of the sequence are per-base-consistent; and if the co-processing fails, heterozygous recognition, phasing processing, and/or parsing of the repetitive sequences are performed.
  • a seed sequence suitable for extension is selected by: cutting the seed sequence into Kmers with a sliding window; obtaining the reads corresponding to those Kmers through the RKI; aligning the corresponding reads to the seed sequence; determining, from the sequencing wells of the corresponding reads, how the sequencing wells cover the seed sequence; and selecting the seed sequences suitable for extension based on that coverage.
  • reads are located on the seed sequence by: cutting the seed sequence into Kmers with a sliding window; obtaining the reads corresponding to those Kmers through the RKI; and mapping those reads to the seed sequence and aligning them base by base.
  • if the set of effective sequencing wells at the extension site is split evenly between the different base types at that site, heterozygosity is judged to be present.
  • the extension sequence is split into a plurality of branches, each of which is extended.
  • the set of reads comprises a plurality of paired reads, the distance between the two reads of a pair being L; during consistency processing, if the distance between the read of a pair located downstream in the extension direction and its mate is not L, the position of the downstream read is determined to be the start point of a repeat sequence.
  • the end point of the repeat sequence is either the position at which the read located downstream in the extension direction lies at distance L downstream of its mate, or a conflict site.
  • whether the repeat sequence is a tandem repeat is determined by: cutting each read located on the repeat sequence into Kmers with a sliding window; and checking whether any Kmer occurs more than once among the Kmers of a read; if no repeated Kmer is present, no tandem repeat is judged to exist, and if a repeated Kmer is present, a tandem repeat is judged to exist.
  • the tandem repeat is resolved by: determining the length of the tandem repeat; and adjusting positions by aligning, at the end point, the reads that contain the end point of the tandem repeat.
  • if the repeat sequence is not a tandem repeat and either of the following conditions holds, the repeat sequence is determined to be a large double repeat: the length of the repeat sequence is greater than L, or, in a pair of reads, the mate of the read located downstream of the repeat sequence is also located on the repeat.
  • the large double repeat is resolved by comparing how far the number of effective sequencing wells corresponding to the upstream repeat sequence and the number corresponding to the downstream repeat sequence each differ from the expected number of effective sequencing wells per base, and resolving the conflict on the large double repeat based on the magnitude of these differences.
  • if the repeat sequence is not a large double repeat and the following condition holds, the repeat sequence is determined to be a small double repeat: the length of the repeat sequence is less than L.
  • the small double repeat is resolved by at least one of: (p) using the mean distance between the paired reads supporting each base as the expected position of the conflict site, and determining the base at the conflict site by comparing how close the distance between the paired reads supporting each of the two conflicting bases is to that mean; (k) constructing a helper contig from the mate of the most upstream read that lies in a non-repeat region within one standard deviation of the end of the extension sequence, using the mates of the downstream reads located on the helper contig to update the read data and thereby the mean distance, and again determining the base at the conflict site by comparing how close the distance between the paired reads supporting each of the two conflicting bases is to the mean.
  • the extension of the seed sequence is terminated.
  • the skeleton sequence construction module includes: a merge-relationship establishing module, configured to establish merge connection relationships between the contigs based on the reads; a primary skeleton sequence construction module, configured to construct a primary skeleton sequence based on the contigs and the sequencing wells corresponding to the reads contained in the contigs; and an assembly module, configured to cross-check the merge connection relationships against the primary skeleton sequence and then merge the contigs to construct the skeleton sequence, thereby obtaining the assembly result for the separated long fragment sequences.
  • the program first reads in the seed data and the read data for use in subsequent steps.
  • the algorithm keeps the read data that takes part in alignment stored in binary form until the entire program terminates.
  • the quality values provided by sequencing (such as the quality values in the FASTQ format) are neither recorded nor used during assembly. This information is intended for the pre-assembly data preprocessing steps, which remove or correct bases and reads with abnormal quality values.
  • that is, the sequences read in are assumed to be of normal sequencing quality, and differences between bases with different quality values are ignored during assembly.
  • RKI: Reads Kmer Index
  • The RKI structure must record a large amount of information; moreover, because a read contains multiple Kmers, the same read appears under multiple Kmer entries, which increases memory consumption further.
  • the memory consumption for 100X sequencing is about 3 TB, more than traditional assembly techniques, which consume less than 1 TB.
  • the data structure can still be optimized further, which would reduce assembly cost and ease resource bottlenecks.
  • one possible optimization is to skip, or limit, the storage of ultra-high-frequency Kmers to reduce memory overhead, because these Kmers contribute little to the assembly and waste considerable time during read positioning, which is extremely inefficient.
  • the RKI data structure differs markedly from the design of some DBG (de Bruijn graph) algorithms. Those algorithms build a graph of the relationships between Kmers while reading the reads from file, do not need to keep the reads in memory, and
  • do not need to record the Kmer-to-read relationships in detail, because the representation of the genome has been transformed from an unordered collection of reads into a set of logically connected Kmers, and the data has been substantially organized and compressed. Once the graph is built, only the Kmer graph needs to be operated on; the reads need not be handled directly. This algorithm, however, does not construct a global graph: during seed extension (genome construction) it must continually access reads through Kmers to obtain the material for extension (assembly).
  • relevant experimental parameters also need to be read in, such as the insert size of the paired reads used in library construction, the number of wells used to separate the LFRs, the number of input cells, and the ploidy of the target genome (obtained by optical means or by informatics analysis).
  • the insert size and the number of input cells can be estimated statistically when the seed sequences are initialized.
  • the ploidy of the target genome can also be calculated through heterozygosity identification.
  • Initialization obtains the relevant information about the seed, which is used for the subsequent legitimacy decision and as material for seed extension.
  • the legitimacy decision is used to discard seeds that lie in an already-extended region and seeds that are not suitable for extension.
  • the seed is cut into Kmers with a sliding window, the reads corresponding to these Kmers are obtained through the RKI, these reads are then finely aligned to the seed, and the well information of these reads is used to determine how the wells cover the seed.
  • EW: Effective Well
  • LFR length 100 Kbp
  • the EW set is determined by how the wells cover the extended (or initialized) region.
  • when an LFR first covers the extending seed, its coverage of the seed rises as the seed extends until the seed is completely covered.
  • the well corresponding to an LFR that completely covers the seed can be used as an EW to assist extension; as the seed continues to extend forward, the LFR gradually leaves the region into which the seed is extending, and its coverage keeps decreasing until it reaches 0.
  • the corresponding well is then no longer used as an EW to assist extension. This creates a transitional change in the EW set during extension.
  • The EW set at the extension point guides read filtering during assembly. It separates the reads that belong near the current seed from the vast majority of reads from the rest of the genome, so that some regions that are complex at whole-genome scale reduce to simple regions at the scale of an LFR length, which are easy to assemble; this is the core of the separated long fragment approach.
  • This assembly method, which considers a set of multiple wells simultaneously during extension, differs from the conventional two-stage assembly method, in which each well is considered on its own (that is, each well is assembled separately and the results are then combined in a second round of assembly).
  • the difference is the use of EW.
  • This approach allows the depth available for assembly to grow from that of a single well to the cumulative depth of the entire EW set, which reduces the sequencing depth required per well; only low coverage is needed.
  • This approach greatly reduces the cost of sequencing and saves time.
  • transitional changes of EW can also be used to determine the positional relationship between contigs and to construct extremely long skeleton sequences.
  • both initialization and extension need to align reads to the seed to obtain information (such as the well and read-pairing information) and material for the extension sequence. Aligning every read is impractical because it would take far too long, so the RKI is used to screen the reads: only reads that share a Kmer with the seed are finely aligned, as follows:
  • 3. Read filtering removes reads that are mislocated because of repeats.
  • the algorithm first uses the EW information to filter out reads mislocated because of genome-wide repeats, and then uses the mate-pair information to filter out right-arm reads mislocated because of simple repeats inside the LFR (left-arm reads are not filtered). Note that left-arm reads in a repeat region do not take part in the legitimacy check of their corresponding right-arm reads, because they may themselves be mislocated; complex repeat types are handled by the repeat resolution module.
  • the reads positioned at the end of the contig must be made consistent base by base. As in read alignment, only substitution sequencing errors are tolerated during consistency processing; indels are not. Because mate-pair filtering can be applied only to right-arm reads, the right-arm reads are essentially free of positioning errors, which gives them a clear advantage over left-arm reads. For this reason, when enough filtered right-arm reads are available, the module uses only those reads for consistency processing. If they can be merged into a single sequence, the extension is complete and control passes to the module that updates the related information.
  • during base-by-base merging, if more than one base is found at non-low frequency at the same position (low-frequency bases are mainly caused by sequencing errors), a conflict arises and consistency at that site is judged to have failed.
  • consistency failure is mainly caused by heterozygosity and repeats. Compared with repeats, heterozygosity in a diploid genome is relatively simple to identify: there are only two conflicting bases, the EW set is split roughly in half between them, and the EW sets supporting each base have almost no intersection. Heterozygosity identification is therefore performed first after a consistency failure.
  • sequencing errors do not fit this pattern.
  • the main features of sequencing errors are low frequency and randomness. Because sequencing errors account for only a very small fraction of bases in most sequencing technologies, they appear as only a small number of differences during consistency processing. Even when sequencing depth fluctuates widely, the differences caused by sequencing errors remain a very small fraction: the absolute number of sequencing errors changes with sequencing depth, but the proportion does not, that is, the sequencing error rate stays constant.
  • their randomness shows in the fact that sequencing errors are not biased toward particular wells: they occur with the same probability in every well. Even if the sequencing error itself has some bias in its error pattern, that bias is the same in every well.
  • When heterozygosity is confirmed, the contig is split into two branches that are extended separately, and these are phased using the EW sets of the preceding heterozygous region.
  • two heterozygous regions from the same haplotype should have similar EW sets.
  • the algorithm can also identify heterozygous large structural variants, which is an advantage of this algorithm over other long-fragment methods for separating haplotypes. Further, because insertion-deletion heterozygosity and large structural-variant heterozygosity are phased in the same way in this algorithm, it can construct longer and more complete haplotypes than phasing methods that consider only single-substitution heterozygous sites. It also offers capabilities that resequencing methods lack, in particular phasing of large structural variants, which is significant for downstream analysis.
  • Repetitive sequences are identical or similar sequences that occur at different positions in the genome. They are abundant: in the human genome, for example, the various types of repeats together account for nearly half of the genome. Repeats have always been a major factor affecting assembly quality, and whether they can be resolved correctly is a central concern of assembly algorithms, which keep trying new strategies to achieve breakthroughs. Repeat resolution is naturally a key module of this algorithm. It mainly handles complex repeats within an LFR, chiefly adjacent small double repeats, large double repeats, and tandem repeats. The solutions for these repeats are described below.
  • Repeat region identification provides key information for identifying the repeat type.
  • its main purpose is to identify left-arm reads that cannot be used for filtering, that is, mate-pair reads that lie in a repeat region. A characteristic of repeat regions is that reads are incorrectly located there, and using incorrectly located reads for extension often leads to erroneous extension.
  • repeat region identification mainly consists of identifying the start point and the end point.
  • if the left arms corresponding to the right-arm reads in the extension region do not lie near the position one insert length upstream, the start point of a repeat region can be determined.
  • the end points of repeat regions fall into two main categories.
  • the end of the repeat region is the position where the right-arm reads of those left arms are no longer present;
  • the conflict point encountered during extension is the end point.
  • difference regions that appear within repeated segments are treated as non-repeat regions; that is, similar repeated segments are strictly divided into several sub-repeat segments for processing.
  • the nature of the region after a difference point is identified differently.
  • the left arms corresponding to the right-arm reads at the start of the repeat region lie in the preceding repeat region rather than in a non-repeat region, so the paired reads may be mislocated. In this case, the left-arm reads located about one insert length before the extension point are checked: if their corresponding right arms all take part in the extension, the region currently being extended is a repeat region; if some are found not to take part, extension has entered a non-repeat region.
  • because mate-pair filtering can be applied only to right-arm reads, right-arm and left-arm reads do not have equal status during extension. If, however, there is a region on which right-arm reads can be used to check the left-arm reads, the left-arm reads can help resolve the repeat. This is the idea behind the Helper Contig (HC).
  • HC: Helper Contig
  • an HC is a contig used to resolve complex repeats; it is auxiliary only and does not appear as a formal contig in the assembly results. In essence it makes further use of the mate-pair information of the reads: the HC is expected to span the current repeat region and land on the downstream non-repeat region, which is then used to help resolve the repeat. If the HC fails to span the repeat region, it generally does not help.
  • HC is mainly applied in the following two cases:
  • the algorithm first uses the mean position calculated from the mate-pair information supporting each base as the expected position, and the base whose distance is closest is considered the one that should be extended. If the two positions are too close together, so that the actual situation is ambiguous, an HC must be constructed to assist identification.
  • the right arm corresponding to the most upstream left arm that lies outside the repeat region, within one SD length of the contig end, is used as the starting point for constructing the HC, and that read is extended in a manner similar to a seed.
  • the positions of the conflicting bases can then be calculated separately from the left arms corresponding to the right-arm reads positioned on this HC.
  • the use of HC in this case only improves the reliability of the distance calculation; if the SD is large, errors may still occur.
  • the left-arm reads are not used for read filtering. In essence, a left arm that lies in a repeat region cannot be used to filter its right arm, regardless of whether the region between the two arms is repeat or non-repeat.
  • the two repeat sequences are far apart.
  • the upstream bases are supported by more EWs than the downstream ones; this comparison can be used to resolve conflicts.
  • an HC can be constructed from the left arms corresponding to the mislocated right arms within one insert length of the start of the repeat region; the EW set on the HC is expected to differ less from the EW set of the upstream conflicting base than from that of the downstream conflicting base.
  • the right arms corresponding to the left-arm reads supporting the upstream base can be found positioned on the HC (the HC is constructed in the same way as for the previous type of large repeat, using the left arms corresponding to the mislocated right arms within one insert length of the start of the repeat region), as shown in Figure 8a;
  • a repeat consisting of multiple consecutive copies of a repeating unit is defined as a Tandem Repeat (TR), and the shortest repeating unit in the tandem repeat region is the Tandem Repeat Unit (TRU).
  • TR: Tandem Repeat
  • TRU: Tandem Repeat Unit
  • a tandem repeat unit has several different phases (for example ACT, CTA, and TAC), the number of which equals its length.
  • a tandem repeat region is equivalent to several conventional repeat regions linked together. Its resolution therefore has both similarities to and differences from the methods used for other, conventional repeats.
  • a tandem repeat shifts the positioning of reads, which often causes conflicts during consistency processing because reads are positioned repeatedly or because the repeat copies differ.
  • the core of this problem is that the Kmers used to locate the reads are repeated Kmers.
  • the head of a mislocated read lies after the start of the tandem repeat region (the non-repeat region before the start point makes a mislocated read fail fine alignment), and the positioning offset of such reads is smaller than the SD of the insert length, so mate-pair filtering cannot remove them. Further, if the tandem repeat is longer than the SD of the insert length, reads in the repeat region pile up erroneously toward the upstream end with a period of one SD: the distance deviation of the mislocated reads does not exceed the SD of the insert length, and they concentrate at the head of each SD-length unit. This means that, until a conflict is encountered, each SD-length unit in the tandem repeat region is compressed and shortened, and the reads of the following SD unit are continually shifted forward.
  • TR is confirmed by finding tandem repeat units. Because a TR longer than the insert length also satisfies the activation condition of HC case 2, the algorithm places the TR activation decision before the activation decision for HC case 2.
  • TRUs are identified mainly by discovering that Kmers recur periodically. For a TRU shorter than (read length - Kmer length + 1), that is, a TRU that appears two or more times on a read, the reads at the conflict site can be cut into Kmers with a sliding window and a Kmer found that occurs more than once within a read. If this holds for most reads, and the repeated Kmers of these reads are identical or consistent up to TRU phase, the current conflict can be judged to be caused by a tandem repeat, and the tandem repeat resolution module is activated. For a TRU longer than (read length - Kmer length + 1), the Kmers in the range from the start of the repeat region to the conflict point are scanned; if a Kmer appears with a fixed period, the conflict can be judged to be caused by a tandem repeat.
  • the reads at the conflict site fall into four categories: 1) reads inside the TR region, which are completely covered by TRUs; 2) reads containing the TR start point, in which TRUs are found only at the tail end; 3) reads containing the TR end point, in which TRUs are found only at the head end; 4) reads containing both the TR start and end points; this case exists only when the TR is shorter than the read length, in which case no type 1 reads completely covered by TRUs are found.
  • the conflicts are eliminated by adjusting positions using the reads that contain the end point of the TR.
  • a TR longer than the read length can only be handled by spanning it and filling in "N".
  • for a TR longer than the read length but shorter than the insert length, there will be right-arm reads that contain the TR end point, are supported by left arms in the non-repeat region, and align consistently after the end point; the inserts of these reads can then be used as a guide.
  • these conflicting reads can be assembled DBG-style, which naturally separates and constructs the difference sequences inside the TR region and the TR end sequence; the positions of the assembled sequences are then calculated from the positions, on the previously extended non-repeat region, of the left-arm reads whose right arms are located on the assembled sequences, and the sequence farthest from the extension point is the correct TR end point.
  • because seeds are extended in parallel, reads that have taken part in an extension must be marked "used" to prevent the same region from being extended repeatedly by several seeds; when another seed encounters such reads during initialization or extension, it stops extending and is later joined by the contig merge module. Note that reads in a repeat region cannot be located unambiguously by mate-pair information. To avoid stopping extension by mistake, reads of a repetitive nature are marked "repeat" and are not used to decide whether an extension is redundant.
  • when a read from a well is positioned on a contig, this may be an erroneous positioning caused by a repeat or a sequencing error, so the algorithm requires a well to have relatively sufficient coverage of the extended seed region before it is accepted as an EW; wells with insufficient coverage are discarded. For efficiency, the EW set is updated only after every fixed length of extension, and only the coverage within that length is examined.
  • contigs are merged after all contigs have been extended, and contig phasing is performed at the same time.
  • contigs that can be merged fall into two cases:
  • contigs whose extension stopped because of repeats, local sequencing depth, and similar causes.
  • the mate-pair reads on the non-repeat region at the end of the contig are examined to see whether their mates are located on other contigs; if so, the distance between the two contigs is estimated from the insert length and the gap is filled with the estimated number of "N"s.
  • the merging step in actual processing first establishes the relationships between contigs (overlapping or non-overlapping) and then builds the skeleton.
  • the contig relationships established in these steps are cross-checked against each other, the relationships are simplified and conflicts are resolved, and the actual contig merge is then performed.
  • the order of the contigs in this sequence is determined mainly from well information; the result is a scaffold, which differs from the traditionally defined skeleton sequence constructed from mate-pair information.
  • the scaffold in this algorithm expresses only the order of the contigs; the specific distance between contigs cannot be determined, and adjacent contigs are joined by a single "N".
  • the scaffold not only determines the order of contigs through the mate-pairs of the reads, but also calculates the distance between contigs from the positions of the paired reads located on the two contigs and their insert length, consistent with the method used in contig merging.
  • Optimal Linear Arrangement, an NP-hard problem.
  • each contig in a scaffold also has its own orientation (the four deoxynucleotides A, C, T, and G are linked by 3',5'-phosphodiester bonds to form a polydeoxynucleotide.
  • the DNA base sequence is how contigs and scaffolds are represented in assembly.
  • the linkage between deoxynucleotides is strictly directional: the 3'-OH of one deoxynucleotide is joined to the 5'-phosphate of the next.
  • 3',5'-phosphodiester bonds form a linear, unbranched DNA macromolecule.
  • DNA defines the 5' end to the 3' end as the "+" direction and the 3' end to the 5' end.
  • a contig in the correct position but the wrong orientation also causes a major assembly error, which appears as a serious sub-sequence inversion error.
  • a well appears in multiple adjacent contigs, so a set of adjacent contigs can be obtained through it.
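
The RKI described above (a hash keyed by Kmer that records frequency, associated reads, position, and orientation) can be illustrated with a minimal Python sketch. This is not the patented implementation: the names `Hit`, `build_rki`, and `reads_sharing_kmer` are hypothetical, and storing each Kmer under the lexicographically smaller of itself and its reverse complement is an assumption made here so that orientation has something to record.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Set

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(_COMPLEMENT)[::-1]


class Hit(NamedTuple):
    read_id: int  # index of the read in the input list
    pos: int      # 0-based start of the Kmer on the read
    strand: str   # "+" if the read carries the key itself, "-" if its reverse complement


def build_rki(reads: List[str], k: int) -> Dict[str, List[Hit]]:
    """Slide a window of width k over every read and index each Kmer.

    Assumption: Kmers are keyed by their canonical form (the smaller of the
    Kmer and its reverse complement). The Kmer frequency is len(rki[key]).
    """
    rki: Dict[str, List[Hit]] = defaultdict(list)
    for read_id, read in enumerate(reads):
        for pos in range(len(read) - k + 1):
            kmer = read[pos:pos + k]
            rc = reverse_complement(kmer)
            if kmer <= rc:
                rki[kmer].append(Hit(read_id, pos, "+"))
            else:
                rki[rc].append(Hit(read_id, pos, "-"))
    return dict(rki)


def reads_sharing_kmer(seed: str, rki: Dict[str, List[Hit]], k: int) -> Set[int]:
    """Return the ids of reads sharing at least one Kmer with the seed.

    These are the candidates that would go on to fine alignment.
    """
    ids: Set[int] = set()
    for pos in range(len(seed) - k + 1):
        kmer = seed[pos:pos + k]
        key = min(kmer, reverse_complement(kmer))
        ids.update(hit.read_id for hit in rki.get(key, ()))
    return ids


if __name__ == "__main__":
    reads = ["ACGTAC", "GTACGG", "TTTTTT"]
    rki = build_rki(reads, k=4)
    print(sorted(reads_sharing_kmer("ACGTACGG", rki, k=4)))
```

Because every read appears once per Kmer it contains, the index grows with read length times read count, which is consistent with the memory concern raised for the RKI; skipping very high-frequency keys would be one way to shrink it.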

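The tandem-repeat test described in the list above (cut each read at the conflict site into Kmers and check whether a Kmer recurs within the same read) can be sketched as follows. This is an illustrative sketch only: the function names and the `min_fraction` threshold standing in for "most reads" are assumptions, and the spacing between recurrences is used here merely as a rough indication of the repeat period.

```python
from typing import Dict, List, Optional


def repeated_kmer_period(read: str, k: int) -> Optional[int]:
    """Return the smallest spacing between two occurrences of the same Kmer
    within one read, or None if no Kmer occurs twice."""
    last_seen: Dict[str, int] = {}
    best: Optional[int] = None
    for pos in range(len(read) - k + 1):
        kmer = read[pos:pos + k]
        if kmer in last_seen:
            spacing = pos - last_seen[kmer]
            if best is None or spacing < best:
                best = spacing
        last_seen[kmer] = pos
    return best


def looks_like_tandem_repeat(reads: List[str], k: int,
                             min_fraction: float = 0.5) -> bool:
    """Judge a conflict to be tandem-repeat driven when more than
    min_fraction of the reads contain a recurring Kmer.

    min_fraction is a hypothetical parameter, not a value from the source.
    """
    if not reads:
        return False
    with_repeat = sum(1 for r in reads if repeated_kmer_period(r, k) is not None)
    return with_repeat / len(reads) > min_fraction


if __name__ == "__main__":
    conflict_reads = ["ACTACTACTACT", "CTACTACTACTA", "ACGTTGCAAG"]
    print(repeated_kmer_period(conflict_reads[0], k=4))
    print(looks_like_tandem_repeat(conflict_reads, k=4))
```

This check only covers repeat units short enough to occur twice on one read; longer units would need the scan across the whole repeat region that the text describes.
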
Abstract

The invention relates to a method, an apparatus, and a system for assembling separated long fragment sequences. The method for assembling separated long fragment sequences comprises: (a) acquiring a set of reads by sequencing and recording the sequencing well corresponding to each read in the set, each sequencing well containing at least one long fragment sequence; (b) using the reads and the sequencing wells corresponding to the reads to extend multiple seed sequences in parallel to obtain multiple contigs, the multiple seed sequences being determined from known sequences; and (c) constructing a skeleton sequence based on the reads, the contigs, and the sequencing wells corresponding to the reads contained in the contigs, to obtain the assembly result for the separated long fragment sequences.
PCT/CN2016/074665 2016-02-26 2016-02-26 Method and apparatus for assembling separated long fragment sequences WO2017143585A1 (fr)

Priority Applications (3)

Application Number Priority Date Filing Date Title
PCT/CN2016/074665 WO2017143585A1 (fr) 2016-02-26 2016-02-26 Method and apparatus for assembling separated long fragment sequences
CN201680063769.5A CN108350495B (zh) 2016-02-26 2016-02-26 Method and apparatus for assembling separated long fragment sequences
HK18113476.6A HK1254399A1 (zh) 2016-02-26 2018-10-21 Method and apparatus for assembling separated long fragment sequences

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2016/074665 WO2017143585A1 (fr) 2016-02-26 2016-02-26 Method and apparatus for assembling separated long fragment sequences

Publications (1)

Publication Number Publication Date
WO2017143585A1 (fr) 2017-08-31

Family

ID=59685953

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/074665 WO2017143585A1 (fr) 2016-02-26 2016-02-26 Method and apparatus for assembling separated long fragment sequences

Country Status (3)

Country Link
CN (1) CN108350495B (fr)
HK (1) HK1254399A1 (fr)
WO (1) WO2017143585A1 (fr)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
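CN110534158A (zh) * 2019-08-16 2019-12-03 浪潮电子信息产业股份有限公司 Gene sequence alignment method, apparatus, server, and medium
CN110544510A (zh) * 2019-05-31 2019-12-06 中南大学 Contig integration method based on an adjacency algebra model and quality grade evaluation
CN111564182A (zh) * 2020-05-12 2020-08-21 西藏自治区农牧科学院水产科学研究所 Method for chromosome-level assembly of highly repetitive fish of the genus Glyptosternon
CN111986729A (zh) * 2019-05-21 2020-11-24 深圳华大基因科技服务有限公司 Method and system for optimizing skeleton sequences, and application thereof
CN112634989A (zh) * 2020-12-29 2021-04-09 山东建筑大学 Two-sided genome fragment filling method and apparatus based on contigs
CN115641911A (zh) * 2022-10-19 2023-01-24 哈尔滨工业大学 Method for detecting overlaps between sequences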

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
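CN111128303B (zh) * 2018-10-31 2023-09-15 深圳华大生命科学研究院 Method and system for determining the corresponding sequence in a target species based on a known sequence
CN109658981B (zh) * 2018-12-10 2022-10-04 海南大学 Data classification method for single-cell sequencing
CN111755075B (zh) * 2019-03-28 2023-09-29 深圳华大生命科学研究院 Method for filtering inter-sample sequence contamination in high-throughput immune repertoire sequencing
CN112825267A (zh) * 2019-11-21 2021-05-21 深圳华大基因科技服务有限公司 Method for determining a set of small nucleic acid sequences and application thereof
CN112802554B (zh) * 2021-01-28 2023-09-22 中国科学院成都生物研究所 Animal mitochondrial genome assembly method based on second-generation sequencing data
CN114333989B (zh) * 2021-12-31 2023-06-13 天津诺禾致源生物信息科技有限公司 Method and apparatus for trait mapping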

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102206704A (zh) * 2011-03-02 2011-10-05 深圳华大基因科技有限公司 Method and apparatus for assembling genome sequences
CN102789553A (zh) * 2012-07-23 2012-11-21 中国水产科学研究院 Method and apparatus for assembling a genome using long transcriptome sequencing results
CN104017883A (zh) * 2014-06-18 2014-09-03 深圳华大基因科技服务有限公司 Method and system for assembling genome sequences
CN104164479A (zh) * 2014-04-04 2014-11-26 深圳华大基因科技服务有限公司 Heterozygous genome processing method
US20150094961A1 (en) * 2013-10-01 2015-04-02 Complete Genomics, Inc. Phasing and linking processes to identify variations in a genome
CN105189308A (zh) * 2013-03-15 2015-12-23 考利达基因组股份有限公司 Multiple tagging of long DNA fragments

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
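CN111986729A (zh) * 2019-05-21 2020-11-24 深圳华大基因科技服务有限公司 Method and system for optimizing scaffold sequences, and application thereof
CN110544510A (zh) * 2019-05-31 2019-12-06 中南大学 Contig integration method based on an adjacency algebraic model and quality grade assessment
CN110544510B (zh) * 2019-05-31 2023-03-24 中南大学 Contig integration method based on an adjacency algebraic model and quality grade assessment
CN110534158A (zh) * 2019-08-16 2019-12-03 浪潮电子信息产业股份有限公司 Gene sequence alignment method, apparatus, server and medium
CN110534158B (zh) * 2019-08-16 2023-08-04 浪潮电子信息产业股份有限公司 Gene sequence alignment method, apparatus, server and medium
CN111564182A (zh) * 2020-05-12 2020-08-21 西藏自治区农牧科学院水产科学研究所 Method for chromosome-level assembly of highly repetitive Glyptosternum fish genomes
CN111564182B (zh) * 2020-05-12 2024-02-09 西藏自治区农牧科学院水产科学研究所 Method for chromosome-level assembly of highly repetitive Glyptosternum fish genomes
CN112634989A (zh) * 2020-12-29 2021-04-09 山东建筑大学 Two-sided genome fragment filling method and device based on contigs
CN115641911A (zh) * 2022-10-19 2023-01-24 哈尔滨工业大学 Method for detecting overlaps between sequences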

Also Published As

Publication number Publication date
CN108350495A (zh) 2018-07-31
CN108350495B (zh) 2021-10-01
HK1254399A1 (zh) 2019-07-19

Similar Documents

Publication Publication Date Title
WO2017143585A1 (fr) Method and apparatus for assembling separated long fragment sequences
US20210217490A1 (en) Method, computer-accessible medium and system for base-calling and alignment
EP3304383B1 (fr) De novo diploid genome assembly and haplotype sequence reconstruction
US11133084B2 (en) Systems and methods for nucleic acid sequence assembly
Kim et al. ECgene: genome-based EST clustering and gene modeling for alternative splicing
Laehnemann et al. Denoising DNA deep sequencing data—high-throughput sequencing errors and their correction
CN104302781B Method and device for detecting chromosomal structural abnormalities
US20120197533A1 (en) Identifying rearrangements in a sequenced genome
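JP2008547080A Method for processing and/or genome mapping of ditag sequences
CN106715711A Method for determining probe sequences and method for detecting genomic structural variation
CN110692101B Method for aligning targeted nucleic acid sequencing data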
Wildschutte et al. Discovery and characterization of Alu repeat sequences via precise local read assembly
IL258999A (en) Methods for detecting copy-number variations in next-generation sequencing
US20210375397A1 (en) Methods and systems for determining fusion events
CN109994154A Screening device for candidate causative genes of monogenic recessive genetic diseases
Sezerman et al. Bioinformatics workflows for genomic variant discovery, interpretation and prioritization
WO2019132010A1 (fr) Method, apparatus and program for estimating base type in a base sequence
JP5414130B2 Program for determining read errors in base sequences
CN111028885B Method and device for detecting yak RNA editing sites
US20200216888A1 (en) Method for increasing accuracy of analysis by removing primer sequence in amplicon-based next-generation sequencing
Yang et al. Terminitor: cleavage site prediction using deep learning models
Bolognini Unraveling tandem repeat variation in personal genomes with long reads
US20170226588A1 (en) Systems and methods for dna amplification with post-sequencing data filtering and cell isolation
Lee et al. Analysis of unmapped regions associated with long deletions in Korean whole genome sequences based on short read data
Iakovishina Detection of structural variants in cancer genomes using a Bayesian approach

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 16891039

Country of ref document: EP

Kind code of ref document: A1

122 EP: PCT application non-entry in European phase

Ref document number: 16891039

Country of ref document: EP

Kind code of ref document: A1