WO2018037289A2 - Systems and methods for computational demultiplexing of genomic barcoded sequences - Google Patents

Info

Publication number
WO2018037289A2
Authority
WO
WIPO (PCT)
Prior art keywords
contigs
contig
unique
kmers
molecule
Prior art date
Application number
PCT/IB2017/001547
Other languages
French (fr)
Other versions
WO2018037289A3 (en)
Inventor
Gil BEN-ZVI
Omer BARAD
Original Assignee
Energin.R Technologies 2009 Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Energin.R Technologies 2009 Ltd.
Publication of WO2018037289A2
Publication of WO2018037289A3

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/20 Sequence assembly

Definitions

  • the field of invention relates to methods of sequencing.
  • the present invention provides methods for computational demultiplexing of genomic barcoded sequences.
  • Figures 1-4 are embodiments of the system used in the present invention.
  • Figures 5A and 5B show embodiments of the connected components as used in the methods of the present invention.
  • Figure 6 shows an exemplary embodiment of the relationship between contigs and barcodes as used in methods of the present invention.
  • the present invention is a method which includes: obtaining a whole genome of an organism,
  • each molecule of the plurality of molecules is 500 base pairs to 2000 kilobases;
  • each of the plurality of tagged raw reads comprises at least one unique barcode
  • the plurality of contigs comprises genomic sequence fragments from 32 base pairs to 100,000 base pairs
  • each contig of the plurality of contigs comprises a set of overlapping DNA fragments
  • the frequency is the number of times each contig of the plurality of contigs is present in the whole genome of the organism
  • each supplementary contig appears two or more times in the whole genome of the organism and is a long contig
  • the corresponding unique contig or the corresponding supplementary contig is fragmented into unique contig Kmers of 32 base pairs to 250 base pairs, wherein at least one origin of the corresponding unique contig or the corresponding supplementary contig is stored,
  • each tagged raw read is fragmented to generate reads Kmers having a length of the confirmed unique contig Kmers
  • each node of the plurality of nodes corresponds to a selected contig
  • the low weight of $E_{ij}$ is less than 3, organizing a first plurality of contigs into a first plurality of groups of assembled molecules,
  • assembled molecules is a connected component in the graph, organizing a second plurality of contigs into a second plurality of groups of assembled molecules,
  • each group of the second plurality of groups of assembled molecules overlaps with at least one portion of a contig
  • the mapping step comprises: identifying a set of overlapping barcodes.
  • the set of the remaining contigs comprises unique contigs or supplementary contigs.
  • connection weight is measured using a contigs pair, $E_{ij}$.
  • the contigs pair is a number of common barcodes within the set of the remaining contigs, wherein the number of common barcodes is at least one barcode at least in duplicate.
  • the distance matrix between two contigs is measured by: $\max_{k,l}(E_{kl}) - E_{ij}$, where the maximum is taken over all the contigs in the group.
  • the distance matrix is reordered to indicate that adjacent contigs are separated by a corresponding distance.
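
The connection weight and distance measure described above can be sketched as follows. This is a minimal illustration, assuming each contig in a group is represented by the set of barcodes mapped to it; the function names `connection_weights` and `distance_matrix` and the example barcodes are hypothetical and do not come from the source. The distance between contigs i and j is taken as the largest connection weight over all contig pairs (k, l) in the group, minus the connection weight of (i, j).

```python
from itertools import combinations

def connection_weights(contig_barcodes):
    """Return E[(i, j)] = number of barcodes shared by contigs i and j."""
    weights = {}
    for i, j in combinations(sorted(contig_barcodes), 2):
        weights[(i, j)] = len(contig_barcodes[i] & contig_barcodes[j])
    return weights

def distance_matrix(contig_barcodes):
    """Return D[(i, j)] = (max of E_kl over all pairs in the group) - E_ij."""
    weights = connection_weights(contig_barcodes)
    max_weight = max(weights.values(), default=0)
    return {pair: max_weight - w for pair, w in weights.items()}

# Hypothetical group of three contigs with the barcodes mapped to each.
group = {
    "c1": {"bc1", "bc2", "bc3", "bc4"},
    "c2": {"bc2", "bc3", "bc4", "bc5"},
    "c3": {"bc4", "bc5", "bc6"},
}
E = connection_weights(group)
D = distance_matrix(group)
```

Under this measure, the pair sharing the most barcodes gets distance 0, so reordering contigs by small distances places strongly linked contigs next to each other.
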
  • This disclosure provides methods and systems for processing polynucleotides.
  • Applications include processing polynucleotides for polynucleotide sequencing.
  • Polynucleotides sequencing includes the sequencing of whole genomes, detection of specific sequences such as single nucleotide polymorphisms (SNPs) and other mutations, detection of nucleic acid (e.g., deoxyribonucleic acid) insertions, and detection of nucleic acid deletions.
  • Utilization of the methods and systems described herein may incorporate, unless otherwise indicated, conventional techniques of organic chemistry, polymer technology, microfluidics, molecular biology and recombinant techniques, cell biology, biochemistry, and immunology.
  • Such conventional techniques include microwell construction, microfluidic device construction, polymer chemistry, restriction digestion, ligation, cloning, polynucleotide sequencing, and polynucleotide sequence assembly.
  • suitable techniques are described throughout this disclosure. However, equivalent procedures may also be utilized. Descriptions of certain techniques may be found in standard laboratory manuals, such as Genome Analysis: A Laboratory Manual Series (Vols.
  • the present invention is an analysis method that enables the de-multiplexing of the tagged reads information into their distinct origin molecules solely based on the tagged reads information.
  • align refers to a method of arranging sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Aligned sequences of nucleotide, for example, are generally represented as rows within a matrix.
  • an "allele" refers to one of a number of alternative forms of the same gene or same genetic locus.
  • fragment assembly refers to aligning and merging at least two fragments of a much longer DNA sequence to reconstruct the original sequence.
  • fragments can range in size from 20 to 30,000 bases.
  • base pair(s) or “bp” refers to a unit consisting of two nucleobases bound to each other by hydrogen bonds. Base pairs form the building blocks of the DNA double helix, and contribute to the folded structure of both DNA and RNA. The base pairs are paired: adenine-thymine or guanine-cytosine, and allow the DNA helix to maintain a regular helical structure.
  • barcoding refers to a method that enables the partition of genomic DNA into a large number of sets (100s-100,000s) such that each set contains several distinct long (e.g., in the range of 10kb-500kb) genomic DNA molecules. Later on, in each set, the long genomic DNA molecules are broken into smaller fragments and tagged with a unique label, e.g., a barcode.
  • a barcode may be a polynucleotide sequence attached to all fragments of a target polynucleotide contained within a particular partition. Finally, the tagged DNA from all sets is pooled to generate a single library for NGS sequencing (e.g., Illumina).
  • Non-limiting examples of such methods are GemCode or Chromium by 10x Genomics ("Haplotyping germline and cancer genomes with high-throughput Linked-Read sequencing", Zheng et al., Nature Biotechnology 2016) and Moleculo by Illumina ("Whole-genome haplotyping using long reads and statistical methods", Kuleshov et al., Nature Biotechnology 2014). Further non-limiting examples of such methods are disclosed in US 9,401,201, which is hereby incorporated by reference in its entirety.
  • the presence of the same barcode on multiple sequences may provide information about the origin of the sequence; e.g., a barcode may indicate that the sequence came from a particular partition and/or a proximal region of a genome.
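
As a small illustration of how a shared barcode carries origin information, the sketch below groups tagged reads by barcode. It assumes each tagged read is a `(barcode, sequence)` tuple; that representation, the function name `group_reads_by_barcode`, and the example sequences are hypothetical.

```python
from collections import defaultdict

def group_reads_by_barcode(tagged_reads):
    """Collect read sequences under the barcode they were tagged with."""
    groups = defaultdict(list)
    for barcode, sequence in tagged_reads:
        groups[barcode].append(sequence)
    return dict(groups)

# Hypothetical tagged reads: two share the barcode "ACGT".
reads = [("ACGT", "TTGACA"), ("GGCA", "CCATGG"), ("ACGT", "GACATT")]
by_barcode = group_reads_by_barcode(reads)
```

Reads in the same group came from the same partition, but not necessarily from the same long molecule, which is why a further demultiplexing step is needed.
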
  • Confidence level refers to a measure of the reliability of at least one estimate. Confidence levels include a range of values (intervals) that can be construed as estimates of an unknown population parameter. The level of confidence of the confidence interval indicates the probability that the confidence range can capture the actual population parameter given a distribution of samples. A confidence level can be represented as a percentage.
  • a "connected component" is a group of contigs. Each of the contigs is a linear chain. The contigs are linked together by leveraging information of the reads (e.g., an original read, e.g., but not limited to, a molecule or a contig) and obtaining a set of scaffolds which constitute the final result of a de novo genome assembly.
  • consensus sequence refers to a calculated order of most frequent residues, either nucleotide or amino acid, found at each position in a sequence alignment.
  • the consensus sequence represents the results of multiple sequence alignments in which related sequences are compared to each other and similar sequence motifs are calculated.
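
A minimal sketch of the consensus definition above, assuming the input sequences are already aligned to equal length. The function name `consensus` and the example alignment are illustrative only; ties are resolved in favor of the residue seen first in the column, which is a simplification.

```python
from collections import Counter

def consensus(aligned):
    """Return the most frequent residue at each column of an alignment."""
    if len({len(s) for s in aligned}) > 1:
        raise ValueError("aligned sequences must have equal length")
    # zip(*aligned) yields one tuple of residues per alignment column.
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*aligned))

# Hypothetical alignment of three sequences.
alignment = ["ACGTA", "ACGTT", "ACCTA"]
result = consensus(alignment)
```
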
  • a "contig" refers to a set of overlapping DNA segments that together represent a consensus region of genomic DNA.
  • the inventive system(s) of the present invention are configured to analyze/sort contigs having at least one haplotype, where the at least one haplotype comprises at least one marker that may be used for genetic analysis (e.g., by identifying an allele).
  • a contig can be a haplotype contig or a non-haplotype contig.
  • a "continuous sequence" is a sequence resulting from the reassembly of small DNA fragments generated.
  • a "corresponding distance" is a calculated distance between at least two contigs or fragments which illustrates the degree of sequence similarity.
  • a "DeBruijn graph" refers to a computational system that assembles a contiguous genome from a large population (e.g., but not limited to, 1 million mer, 10 million mer, 100 million mer, 1 billion mer, 10 billion mer, etc.) of short sequencing reads.
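
To make the DeBruijn graph idea concrete, here is a toy sketch, not the assembler used in the source. It assumes error-free reads on a single strand, uses (k-1)-mers as nodes and k-mers as edges, and spells a contig by following unambiguous edges. The function names and the example reads are hypothetical, and the walk checks only out-degree, which a real assembler would not rely on alone.

```python
from collections import defaultdict

def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes following it."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def extend_contig(graph, start):
    """Follow edges while the current node has exactly one successor."""
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:  # stop on a cycle instead of looping forever
            break
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig

# Hypothetical overlapping reads drawn from the sequence ACGTCAG.
reads = ["ACGTC", "CGTCA", "GTCAG"]
graph = de_bruijn_graph(reads, k=4)
contig = extend_contig(graph, "ACG")
```
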
  • demultiplexing refers to, after sequencing, reads being assigned in silico to their long DNA molecule of origin.
  • multiplexing refers to using short DNA indices to uniquely identify (or more correctly semi-uniquely by assigning the same index to several DNA molecules) each DNA sample.
  • demultiplexing enables barcoding different pieces of DNA in the same sample in order to generate long range information. Residual demultiplexing is needed, since methods assign the same barcode to several long DNA molecules from different genomic loci.
  • a "DNA read" or "read" refers to overlapping fragments of DNA obtained by using, e.g., but not limited to, shotgun sequencing.
  • the read is the sequence of letters at the top of each row. The reads are used to reconstruct an original sequence.
  • error-filtering refers to a system configured to selectively reduce errors and facilitate variant detection in data from sequencing technologies.
  • error-free phase sequence refers to a sequence after error-filtering.
  • phase means that adjacent linked alleles are phased into a single sequence.
  • fragment refers to a physical segment of DNA.
  • the fragment can be an overlapping physical segment of DNA.
  • genetic diversity refers to the level of biodiversity, i.e., the total number of genetic characteristics in the genetic makeup of a species.
  • haplotype refers to a set of DNA variations, or polymorphisms, that tend to be inherited together.
  • a haplotype can refer to a combination of alleles or to a set of single nucleotide polymorphisms (SNPs) found on the same chromosome.
  • k-mer refers to all the possible subsequences (of length "k") from a read obtained through DNA sequencing.
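
A short sketch of the k-mer definition above: every length-k window of a read, taken at each start position. The function name `kmers` and the example read are illustrative.

```python
def kmers(sequence, k):
    """Return all length-k subsequences of a sequence, in order of position."""
    if k <= 0:
        raise ValueError("k must be positive")
    # A sequence shorter than k yields an empty list.
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

result = kmers("ACGTAC", 3)
```

A read of length n yields n - k + 1 k-mers, so the example read of length 6 yields four 3-mers.
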
  • an "input read length" refers to an initial starting sequence used to build at least one contig and includes at least a portion of one kmer.
  • a "marker" or a "genetic marker" refers to a gene or short sequence of DNA used to identify a chromosome or to locate other genes on a genetic map.
  • a "minimal contig order" is a position in which a DNA fragment or portion begins.
  • a "maximum contig order" is a position in which a DNA fragment or portion ends.
  • mer refers to an oligonucleotide, where when applied to DNA, mer refers to the number of bases in the molecules (e.g., 10-mer, 100-mer, etc.).
  • overlapping refers to polynucleotide fragments, generally referring to a collection of polynucleotide fragments with overlapping sequence.
  • a genome may be fragmented randomly (e.g., but not limited to, by shearing in a pipette) or non-randomly (e.g., but not limited to, by digesting with a rare cutter). Fragmenting randomly produces overlapping sequences because each copy of the genome is cut at different positions. After sequencing of the fragments (which provides "sequence contigs"), this overlap may be used to determine the linear order of the fragments, thereby enabling assembly of the entire genomic sequence.
  • fragmentation may be performed, e.g., but not limited to, by enzymatic digestion, exposure to ultraviolet (UV) light, ultrasonication, and/or mechanical agitation.
  • paired-end or "PE" refers to DNA fragments sequenced from both ends (e.g., the 5' end and the 3' end) to generate pairs of reads.
  • a PE library is a population of PE DNA fragments of varying sizes and sequences.
  • a "path" or "consensus path" refers to information provided in a DeBruijn graph that identifies the consensus of a genomic sequence with all sub-repeats in the genomic sequence substituted by the respective consensus sequences.
  • polymorphism refers to a difference between two different sequences.
  • polynucleotide or "nucleic acid," as used herein, refers to biological molecules comprising a plurality of nucleotides.
  • exemplary polynucleotides include deoxyribonucleic acids, ribonucleic acids, and synthetic analogues thereof, including peptide nucleic acids.
  • polynucleotides can be prepared using the methods disclosed in US 9,410,201.
  • a "raw read" refers to the sequencing result produced from an automatic sequencing machine.
  • the raw reads are short DNA sequences and are mixed together, not in genomic order. Inevitably, raw sequence also contains a few gaps, mistakes, and ambiguities.
  • a "reference genome" or "reference assembly" is a digital nucleic acid sequence database, assembled by scientists as a representative example of a species' set of genes. Reference genomes are typically assembled from the sequencing of DNA from a number of donors, and generally do not accurately represent the set of genes of any single organism. Instead, the reference genome provides a haploid mosaic of different DNA sequences from each donor. In plants (e.g., maize, soybean, rice, etc.) the reference genome is typically assembled from a single variety.
  • a "scaffold" refers to a technique which links together a non-continuous series of genomic sequences into a scaffold, consisting of sequences separated by gaps of known length. The sequences that are linked are typically contiguous sequences corresponding to read overlaps.
  • single nucleotide polymorphism or "SNP" is a DNA sequence variation occurring within a population (e.g. 1%) in which a single nucleotide, e.g. Adenine ("A"), Thymine ("T"), Cytosine ("C") or Guanine ("G"), in the genome (or other shared sequence) differs between members of a biological species or paired chromosomes.
  • the DNA sequence variation can occur only once.
  • the DNA sequence variation can occur two or more times.
  • a "structural variation" refers to a region of DNA which is approximately 1 kilobase (kb) or larger in size and can include inversions, balanced translocations, and/or genomic imbalances (e.g., DNA insertions and/or DNA deletions), and is typically referred to as copy number variants (CNVs).
  • unique or "uniqueness" refers to a contig and is related to the variability in the contig's adjacent sequences.
  • a unique contig means that a prediction is made that this contig is embedded in a single long sequence and therefore is predicted to appear only in one locus in the genome. If a sequence is only generated once, then the sequence has high uniqueness. In some embodiments, a sequence having the high uniqueness appears/is present only once in a genome. Alternatively, if a sequence has multiple copies generated, then the sequence has low uniqueness. In contrast, as used herein, a "supplementary"
  • tag refers to combining at least one barcode with a DNA fragment.
  • a "whole genome" refers to the entirety of a genome.
  • the whole genome can be, e.g., mammalian, plant, bacterial, protozoan, etc.
  • the present invention is a method which includes:
  • each supplementary contig is a long contig
  • each node of the plurality of nodes corresponds to a selected contig, identifying a connection weight between each two nodes as the number of shared barcodes between the nodes,
  • the connection weight is measured using a contigs pair, $E_{ij}$.
  • the method of the present invention includes: generating a plurality of tagged raw reads containing unique barcodes for a set of long genomic DNA molecules obtained from a whole genome of an organism, where a group of tagged raw reads of the plurality of tagged raw reads originating from a long genomic DNA molecule is tagged with a barcode selected from a plurality of unique barcodes,
  • analyzing the DeBruijn graph and generating a plurality of contigs, where the plurality of contigs are genomic sequence fragments ranging from 32 base pairs to 100,000 base pairs,
  • each individual contig within the plurality of contigs is a set of overlapping DNA segments that together represent a consensus region of genomic DNA, analyzing the plurality of contigs to determine a number of times each individual contig of the plurality of contigs appears in the whole genome of the organism, identifying unique contigs using the DeBruijn graph, where the unique contigs of the plurality of contigs appear once in the whole genome of the organism, identifying supplementary contigs using the DeBruijn graph, where the supplementary contigs of the plurality of contigs appear a few times in the whole genome of the organism and are long contigs (i.e., greater than 500bp),
  • nodes correspond to the subset of the plurality of the remaining contigs (only long unique contigs or supplementary contigs) which mapped to the barcode, identifying connection weights in the graph, where the connection weight between a pair of contigs is defined as the number of shared barcodes within the additional barcodes set,
  • connection weight in the graph with low weight (i.e., $E_{ij} < 3$), dividing the barcode contigs into groups of predicted common long molecule of origin, where each group is a connected component in the graph,
  • the distance matrix is reordered to indicate that adjacent contigs have a short distance
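
The grouping step described above can be sketched as follows, under stated assumptions. The input is a hypothetical mapping from each contig mapped to a given barcode to the set of other barcodes it shares reads with; an edge between two contigs is kept only when they share at least `min_weight` barcodes (3 here, matching the "low weight" cutoff of fewer than 3), and each connected component of the remaining graph is reported as one predicted molecule of origin. The function name `group_contigs` and the example data are illustrative, not taken from the source.

```python
from itertools import combinations

def group_contigs(contig_barcodes, min_weight=3):
    """Split contigs into connected components of the pruned barcode graph."""
    contigs = sorted(contig_barcodes)
    neighbors = {c: set() for c in contigs}
    for a, b in combinations(contigs, 2):
        weight = len(contig_barcodes[a] & contig_barcodes[b])
        if weight >= min_weight:  # edges with weight below the cutoff are dropped
            neighbors[a].add(b)
            neighbors[b].add(a)
    groups, seen = [], set()
    for start in contigs:
        if start in seen:
            continue
        # Depth-first traversal collects one connected component.
        component, stack = [], [start]
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for nxt in neighbors[node] - seen:
                seen.add(nxt)
                stack.append(nxt)
        groups.append(sorted(component))
    return groups

# Hypothetical contigs and the barcodes each one shares reads with.
contig_barcodes = {
    "c1": {"b1", "b2", "b3", "b4"},
    "c2": {"b1", "b2", "b3", "b9"},
    "c3": {"b1", "b2", "b3", "b4"},
    "c4": {"b7", "b8", "b9"},
    "c5": {"b1", "b7", "b8", "b9"},
}
groups = group_contigs(contig_barcodes)
```

In this example, c2 and c5 share two barcodes, which falls below the cutoff, so the two groups stay separate.
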
  • molecular biology protocols can divide an organism's entire genomic DNA into large sets of genomic DNA molecules (e.g., but not limited to, 10s-100,000s of genomic DNA molecules, such as 10, 100, 1,000, 10,000, 100,000, or 500,000 genomic DNA molecules; also referred to herein as "molecules"), where each set contains dozens of distinct long genomic DNA molecules (e.g., but not limited to, in a range of 10kb-2000kb (2MB)).
  • the long genomic DNA molecules of each set are broken into smaller DNA fragments and each DNA fragment is tagged with a unique barcode.
  • the tagged DNA fragments from all of the sets are pooled to generate a single library for next generation sequencing.
  • the present invention is a method including demultiplexing tagged DNA reads, so as to result in identifying the distinct origin of the tagged DNA read.
  • a plurality of overlapping long DNA molecules from the same genomic region of an organism exist in several sets of genomic DNA, where there is a low probability that long molecules from two different genomic regions co-exist in more than one set.
  • the method includes de-multiplexing the tagged DNA reads by mapping the tagged DNA reads to a sample of origin.
  • the method includes assembling contigs using Debruijn graph construction.
  • the tagged reads are transformed into computationally tagged contigs and then the tagged contigs are de-multiplexed, so as to result in: (i) computational efficiency, where computational efficiency refers to mapping the tagged DNA reads to the tagged contigs, in which the cumulative length of the tagged contigs is similar to the size of the genome, and (ii) mapping efficiency, where mapping the tagged contigs allows for matching reads from overlapping or adjacent genomic regions.
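
One way to picture mapping tagged reads to contigs is an exact k-mer lookup, sketched below. This is an assumption-laden illustration: contigs are fragmented into k-mers with their contig of origin stored, k-mers found in more than one contig are discarded as ambiguous, and a read contributes its barcode to every contig it shares a k-mer with. Reverse complements and sequencing errors are ignored, and the function names and example sequences are hypothetical.

```python
from collections import defaultdict

def build_kmer_index(contigs, k):
    """Map each k-mer found in exactly one contig to that contig's name."""
    origins = defaultdict(set)
    for name, sequence in contigs.items():
        for i in range(len(sequence) - k + 1):
            origins[sequence[i:i + k]].add(name)
    return {
        kmer: next(iter(names))
        for kmer, names in origins.items()
        if len(names) == 1
    }

def barcodes_per_contig(tagged_reads, index, k):
    """Collect, for each contig, the barcodes of reads sharing a k-mer with it."""
    hits = defaultdict(set)
    for barcode, read in tagged_reads:
        for i in range(len(read) - k + 1):
            contig = index.get(read[i:i + k])
            if contig is not None:
                hits[contig].add(barcode)
    return dict(hits)

# Hypothetical contigs and tagged reads.
contigs = {"c1": "ACGTACGGA", "c2": "TTGCAATCC"}
reads = [("bc1", "GTACG"), ("bc2", "GCAAT"), ("bc1", "CAATC")]
index = build_kmer_index(contigs, k=4)
mapped = barcodes_per_contig(reads, index, k=4)
```

The resulting barcode sets per contig are the kind of input from which shared-barcode connection weights between contigs can be counted.
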
  • long contigs are contigs greater than 500bp. In some embodiments, long contigs are contigs greater than 1kb. In some embodiments, long contigs are contigs greater than 1.5kb. In some embodiments, long contigs are contigs greater than 2kb. In some embodiments, long contigs are contigs greater than 2.5kb. In some embodiments, long contigs are contigs greater than 3kb. In some embodiments, long contigs are contigs greater than 3.5kb. In some embodiments, long contigs are contigs greater than 4kb. In some embodiments, long contigs are contigs greater than 4.5kb.
  • long contigs are contigs greater than 5kb. In some embodiments, long contigs are contigs greater than 5.5kb. In some embodiments, long contigs are contigs greater than 6kb. In some embodiments, long contigs are contigs greater than 6.5kb. In some embodiments, long contigs are contigs greater than 7kb. In some embodiments, long contigs are contigs greater than 7.5kb. In some embodiments, long contigs are contigs greater than 8kb. In some embodiments, long contigs are contigs greater than 8.5kb. In some embodiments, long contigs are contigs greater than 9kb. In some embodiments, long contigs are contigs greater than 9.5kb. In some embodiments, long contigs are contigs greater than 10kb.
  • long contigs are contigs from 500bp to 10 kb.
  • long contigs are contigs from 500bp to 9 kb. In some embodiments, long contigs are contigs from 500bp to 8 kb. In some embodiments, long contigs are contigs from 500bp to 7 kb. In some embodiments, long contigs are contigs from 500bp to 6 kb. In some embodiments, long contigs are contigs from 500bp to 5 kb. In some embodiments, long contigs are contigs from 500bp to 4 kb. In some embodiments, long contigs are contigs from 500bp to 3 kb. In some embodiments, long contigs are contigs from 500bp to 2 kb. In some embodiments, long contigs are contigs from 500bp to 1 kb.
  • long contigs are contigs from 1kb to 10 kb. In some embodiments, long contigs are contigs from 2kb to 10 kb. In some embodiments, long contigs are contigs from 3kb to 10 kb. In some embodiments, long contigs are contigs from 4kb to 10 kb. In some embodiments, long contigs are contigs from 5kb to 10 kb. In some embodiments, long contigs are contigs from 6kb to 10 kb. In some embodiments, long contigs are contigs from 7kb to 10 kb. In some embodiments, long contigs are contigs from 8kb to 10 kb. In some embodiments, long contigs are contigs from 9kb to 10 kb.
  • long contigs are contigs from 1kb to 9 kb. In some embodiments, long contigs are contigs from 2kb to 8 kb. In some embodiments, long contigs are contigs from 3kb to 7 kb. In some embodiments, long contigs are contigs from 4kb to 6 kb.
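
The long-contig thresholds listed above combine with the unique/supplementary distinction roughly as in the sketch below. It assumes a copy-number estimate per contig is already available (how it is derived from the graph is not shown), treats the long-contig cutoff as a parameter defaulting to 500 bp, and uses a hypothetical function name `classify_contig` and a hypothetical "other" label for contigs that fit neither category.

```python
def classify_contig(length_bp, copies, long_threshold_bp=500):
    """Label a contig from its length and estimated copy number in the genome."""
    if copies == 1:
        return "unique"
    if copies >= 2 and length_bp > long_threshold_bp:
        return "supplementary"  # repeated, but long enough to keep
    return "other"

labels = [
    classify_contig(1200, 1),  # appears once
    classify_contig(800, 3),   # repeated and longer than 500 bp
    classify_contig(300, 2),   # repeated and short
]
```

Raising `long_threshold_bp` (for example to 1,000 or 10,000) corresponds to the stricter embodiments listed above.
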
  • the genomic sequence fragments are 32bp to
  • the genomic sequence fragments are 50bp to 100,000 bp (100kb). In some embodiments, the genomic sequence fragments are 100bp to 100,000 bp (100kb). In some embodiments, the genomic sequence fragments are 200bp to 100,000 bp (100kb). In some embodiments, the genomic sequence fragments are 300bp to 100,000 bp (100kb). In some embodiments, the genomic sequence fragments are 400bp to 100,000 bp (100kb). In some embodiments, the genomic sequence fragments are 500bp to 100,000 bp (100kb). In some embodiments, the genomic sequence fragments are 600bp to 100,000 bp (100kb).
  • the genomic sequence fragments are 700bp to 100,000 bp (100kb). In some embodiments, the genomic sequence fragments are 800bp to 100,000 bp (100kb). In some embodiments, the genomic sequence fragments are 900bp to 100,000 bp (100kb).
  • the genomic sequence fragments are 1,000bp to
  • the genomic sequence fragments are 10,000bp to 100,000 bp (100kb). In some embodiments, the genomic sequence fragments are 25,000bp to 100,000 bp (100kb). In some embodiments, the genomic sequence fragments are 50,000bp to 100,000 bp (100kb). In some embodiments, the genomic sequence fragments are 75,000bp to 100,000 bp (100kb).
  • the genomic sequence fragments are 32bp to
  • the genomic sequence fragments are 32bp to 50,000 bp (50kb). In some embodiments, the genomic sequence fragments are 32bp to 25,000 bp (25kb). In some embodiments, the genomic sequence fragments are 32bp to 10,000 bp (10kb). In some embodiments, the genomic sequence fragments are 32bp to 1,000 bp (1kb).
  • the genomic sequence fragments are 32bp to
  • the genomic sequence fragments are 32bp to 800 bp. In some embodiments, the genomic sequence fragments are 32bp to 700 bp. In some embodiments, the genomic sequence fragments are 32bp to 600 bp. In some embodiments, the genomic sequence fragments are 32bp to 500 bp. In some embodiments, the genomic sequence fragments are 32bp to 400 bp. In some embodiments, the genomic sequence fragments are 32bp to 300 bp. In some embodiments, the genomic sequence fragments are 32bp to 200 bp. In some embodiments, the genomic sequence fragments are 32bp to 100 bp.
  • the genomic sequence fragments are 500bp to
  • the genomic sequence fragments are 1,000bp to 50,000 bp. In some embodiments, the genomic sequence fragments are 10,000bp to 25,000 bp.
  • the unique contig Kmers are 32 bp to 250bp. In some embodiments, the unique contig Kmers are 32 bp to 225bp. In some embodiments, the unique contig Kmers are 32 bp to 200bp. In some embodiments, the unique contig Kmers are 32 bp to 175bp. In some embodiments, the unique contig Kmers are 32 bp to 150bp. In some embodiments, the unique contig Kmers are 32 bp to 125bp. In some embodiments, the unique contig Kmers are 32 bp to 100bp. In some embodiments, the unique contig Kmers are 32 bp to 75bp. In some embodiments, the unique contig Kmers are 32 bp to 50bp.
  • the unique contig Kmers are 50 bp to 250bp. In some embodiments, the unique contig Kmers are 75 bp to 250bp. In some embodiments, the unique contig Kmers are 100 bp to 250bp. In some embodiments, the unique contig Kmers are 125 bp to 250bp. In some embodiments, the unique contig Kmers are 150 bp to 250bp. In some embodiments, the unique contig Kmers are 175 bp to 250bp. In some embodiments, the unique contig Kmers are 200 bp to 250bp. In some embodiments, the unique contig Kmers are 225 bp to 250bp.
  • the unique contig Kmers are 50 bp to 225bp. In some embodiments, the unique contig Kmers are 75 bp to 200bp. In some embodiments, the unique contig Kmers are 100 bp to 175bp. In some embodiments, the unique contig Kmers are 125 bp to 150bp.
  • the low weight of $E_{ij}$ is less than 3. In some embodiments, the low weight of $E_{ij}$ is less than 2. In some embodiments, the low weight of $E_{ij}$ is less than 1. In some embodiments, the low weight of $E_{ij}$ is less than 0.5.
  • the low weight of $E_{ij}$ is 0.5 to 3. In some embodiments, the low weight of $E_{ij}$ is 1 to 3. In some embodiments, the low weight of $E_{ij}$ is 1.5 to 3. In some embodiments, the low weight of $E_{ij}$ is 2 to 3. In some embodiments, the low weight of $E_{ij}$ is 2.5 to 3. In some embodiments, the low weight of $E_{ij}$ is 0.5 to 2.5. In some embodiments, the low weight of $E_{ij}$ is 0.5 to 2. In some embodiments, the low weight of $E_{ij}$ is 0.5 to 1.5. In some embodiments, the low weight of $E_{ij}$ is 0.5 to 1. In some embodiments, the low weight of $E_{ij}$ is 1 to 2.5. In some embodiments, the low weight of $E_{ij}$ is 1.5 to 2.
  • the whole genome of the organism is fragmented to produce a plurality of molecules.
  • each molecule of the plurality of molecules is a DNA molecule.
  • a molecule is 500bp to 2MB (megabase).
  • a molecule is 500bp to 1MB.
  • a molecule is 500bp to 0.5MB.
  • a molecule is 500bp to 250,000 kb.
  • a molecule is 500bp to 100,000 kb.
  • a molecule is 500bp to 50,000 kb.
  • a molecule is 500bp to 25,000 kb.
  • a molecule is 500bp to 10,000 kb. In some embodiments, a molecule is 500bp to 2,500 kb. In some embodiments, a molecule is 500bp to 1,000 kb. In some embodiments, a molecule is 500bp to 500 kb. In some embodiments, a molecule is 500bp to 250 kb. In some embodiments, a molecule is 500bp to 100 kb. In some embodiments, a molecule is 500bp to 50 kb. In some embodiments, a molecule is 500bp to 25 kb. In some embodiments, a molecule is 500bp to 10 kb. In some embodiments, a molecule is 500bp to 5 kb. In some embodiments, a molecule is 500bp to 2.5 kb. In some embodiments, a molecule is 500bp to 1 kb.
  • a molecule is 1kb to 2MB. In some embodiments, a molecule is 2.5kb to 2MB. In some embodiments, a molecule is 5kb to 2MB. In some embodiments, a molecule is 10kb to 2MB. In some embodiments, a molecule is 25kb to 2MB. In some embodiments, a molecule is 50kb to 2MB. In some embodiments, a molecule is 100kb to 2MB. In some embodiments, a molecule is 250kb to 2MB. In some embodiments, a molecule is 500kb to 2MB. In some embodiments, a molecule is 1,000kb to 2MB.
  • a molecule is 2,500kb to 2MB. In some embodiments, a molecule is 5,000kb to 2MB. In some embodiments, a molecule is 10,000kb to 2MB. In some embodiments, a molecule is 25,000kb to 2MB. In some embodiments, a molecule is 50,000kb to 2MB. In some embodiments, a molecule is 100,000kb to 2MB. In some embodiments, a molecule is 250,000kb to 2MB. In some embodiments, a molecule is 500,000kb to 2MB. In some embodiments, a molecule is 1MB to 2MB. In some embodiments, a molecule is 1.5kb to 2MB. In some embodiments, a molecule is 1MB to 1.5MB.
  • a molecule is 1kb to 1.5MB. In some embodiments, a molecule is 2.5kb to 1MB. In some embodiments, a molecule is 5kb to 0.5MB. In some embodiments, a molecule is 10kb to 0.25MB. In some embodiments, a molecule is 25kb to 100,000kb. In some embodiments, a molecule is 50kb to 50,000kb. In some embodiments, a molecule is 100kb to 25,000kb. In some embodiments, a molecule is 250kb to 10,000kb. In some embodiments, a molecule is 500kb to 5,000kb. In some embodiments, a molecule is 1,000kb to 2,500kb.
  • the whole genome of an organism is derived from a plant, e.g., but not limited to, maize, rice, barley, etc. In some embodiments, the whole genome of an organism is derived from a mammal, e.g., but not limited to, human, feline, canine, murine, etc. In some embodiments, the whole genome of an organism is derived from a single-cell organism, e.g., but not limited to, a bacterium, an archaebacterium, a protozoan, etc.
  • the method of the present invention includes dynamically constructing, by a specifically programmed computer system, a de Bruijn graph from the plurality of tagged raw reads. In some embodiments, the method further includes analyzing, by the specifically programmed computer system, a plurality of contigs. In some embodiments, the method of the present invention includes dynamically constructing, by a specifically programmed computer system, a weighted graph comprising a plurality of nodes.
  • the method further includes identifying, by the specifically programmed computer system, a connection weight between each two nodes as the number of shared barcodes between the nodes, filtering the connection weight between each two nodes as the number of shared barcodes between the nodes, organizing a first plurality of contigs into a first plurality of groups of assembled molecules, organizing a second plurality of contigs into a second plurality of groups of assembled molecules, predicting a continuous sequence of the at least one portion of the contig by constructing a distance matrix between at least two contigs, assigning each of the overlapping groups of the second plurality of groups of assembled molecules an Overlapping Start (OS) and an Overlapping End (OE), assembling a DNA sequence order, wherein the DNA sequence order is organized from a maximum OS to a minimum OE, or any combination thereof.
  • the analysis includes the following steps:
  • the Kmers are the keys of the hash table and the value is their contig id. Kmers that originate from more than one contig are discarded.
  • a matrix is generated as in the following simplified example (where 1 mean ...):
      4 1203 0 1 1 1 0 1 1 1 0 0 0 0 0 0
  • the group of additional barcodes (BX) in which the UCs in B1 co-appear is defined by searching for barcode ids in which the number of UCs from B1 that also appear there is above some threshold (e.g., >0, giving B2-B10 based on the above table example).
  • each LC is a node, and the number of barcodes (from BX, defined at step 4a) in which a pair of LCs is co-detected defines a weighted edge between that pair (E_ij). Edges with a weight lower than a threshold (e.g., 2, 3 or 5, depending on sequence coverage and number of barcodes) are deleted from the graph.
  • each connected component defines LCs in B1 with a common molecule of origin: B1_LCs_1 (with contigs 4, 11, 15, 18 and 20) and B1_LCs_2 (with contigs 6, 7, 12, 16, and 17).
  • Figures 5A and 5B show the connected components as used in the methods of the present invention.
  • the predicted order of the long contigs is: 11, 15, 4, 20 and 18, and the reordered distance matrix for B1_LCs_1 is:
  • the LCs ordering information is used to estimate the overlap of the molecule in each set of BX (e.g., BX_1) with this same molecule in B1 (e.g., B1_LCs_1): for each set of BX we define an Overlapping Start (OS) as the minimal LC order (defined in 4d) that appears in the BX set and an Overlapping End (OE) as the maximal LC order (defined in 4d) that appears in the BX set.
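As a non-limiting illustration, the OS and OE definitions can be sketched in Python. The function name `overlap_bounds`, the 1-based positions, and the example BX sets are assumptions made for this sketch; only the contig order 11, 15, 4, 20, 18 is taken from the example above.

```python
def overlap_bounds(lc_order, bx_contigs):
    """Return (OS, OE) for one BX set, or None if no ordered LC appears in it.

    lc_order   -- long contigs of one molecule in their predicted order
    bx_contigs -- set of contig ids detected in the BX barcode
    Positions are 1-based (an assumption of this sketch).
    """
    positions = [
        pos for pos, contig in enumerate(lc_order, start=1)
        if contig in bx_contigs
    ]
    if not positions:
        return None
    # OS is the minimal LC order present, OE the maximal one.
    return min(positions), max(positions)


order = [11, 15, 4, 20, 18]                 # predicted order from the example
inner = overlap_bounds(order, {15, 4, 20})  # BX set covering the middle
ends = overlap_bounds(order, {18, 11})      # BX set touching both ends
none = overlap_bounds(order, {6, 7})        # BX set from another molecule
```

A BX set whose contigs all lie in the middle of the ordering yields a narrow (OS, OE) interval, while a set that reaches both ends spans the whole molecule.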
  • the contig IDs represent the contigs or their reverse complement contigs, as strand information (i.e., sense strand or anti-sense strand) cannot be identified from the barcoded data.
  • the method includes mapping the barcoded reads into a pre-built de Bruijn graph from additional NGS data from the same sample. This option supports de novo assembly analysis by providing genomic coverage. Additionally, this method can use PCR-free libraries as a data source.
  • this algorithm can be used on barcoded sequencing data generated using a single Chromium™ library (by 10X Genomics, CA, USA) sequenced on two lanes of a HiSeq x10 machine (Illumina) to generate 230Gb of sequence data (785×10^6 2×150bp reads) of a bovine genome (Nellore beef cattle).
  • Figure 6 shows an example of the contigs to barcodes matrix of a long molecule composed of 118 contigs (Y-axis) and overlapped with 79 barcodes (X-axis).
  • White cells indicate that reads from the corresponding barcode were mapped to the corresponding contig, while black cells indicate an unmatched barcode and contig.
  • FIG. 1 illustrates one embodiment of an environment in which the present invention may operate.
  • the inventive system and method may include a large number of members and/or concurrent transactions.
  • the inventive system(s) and method(s) are based on a scalable computer and network architecture that incorporates various strategies for assessing the data, caching, searching, and database connection pooling.
  • An example of the scalable architecture is an architecture that is capable of operating multiple servers.
  • members of the computer system 102-104 include virtually any computing device capable of receiving and sending a message over a network, such as network 105, to and from another computing device, such as servers 106 and 107, each other, and the like.
  • the set of such devices includes devices that typically connect using a wired communications medium such as personal computers, multiprocessor systems, microprocessor-based or programmable consumer electronics, network PCs, and the like.
  • the set of such devices also includes devices that typically connect using a wireless communications medium such as cell phones, smart phones, pagers, walkie talkies, radio frequency (RF) devices, infrared (IR) devices, CBs, integrated devices combining one or more of the preceding devices, or virtually any mobile device, and the like.
  • client devices 102-104 are any device that is capable of connecting using a wired or wireless communication medium such as a PDA, POCKET PC, wearable computer, and any other device that is equipped to communicate over a wired and/or wireless communication medium.
  • the inventive system(s) of the present invention can deliver information (e.g., DNA sequences, an analysis of at least one DNA sequence) to at least one user.
  • the at least one user is remotely located.
  • the at least one user is a farmer.
  • the at least one user may be a company specializing in growing and/or distributing seeds and/or plants (e.g., but not limited to, maize, rice, wheat, etc.)
  • the inventive system(s) of the present invention can deliver information to at least one user by use of a GUI, which can allow for the at least one user to select a crop.
  • each member device within member devices 102-104 may include a browser application that is configured to receive and to send web pages, and the like.
  • the browser application may be configured to receive and display graphics, text, multimedia, and the like, employing virtually any web based language, including, but not limited to Standard Generalized Markup Language (SGML), such as HyperText Markup Language (HTML), a wireless application protocol (WAP), a Handheld Device Markup Language (HDML), such as Wireless Markup Language (WML), WMLScript, XML, JavaScript, and the like.
  • programming may include either Java, .Net, QT, C, C++ or other suitable programming language.
  • member devices 102-104 may be further configured to receive a message from another computing device employing another mechanism, including, but not limited to email, Short Message Service (SMS), Multimedia Message Service (MMS), instant messaging (IM), internet relay chat (IRC), mIRC, Jabber, and the like or a Proprietary protocol.
  • network 105 may be configured to couple one computing device to another computing device to enable them to communicate.
  • network 105 may be enabled to employ any form of computer readable media for communicating information from one electronic device to another.
  • network 105 may include a wireless interface, and/or a wired interface, such as the Internet, in addition to local area networks (LANs), wide area networks (WANs), direct connections, such as through a universal serial bus (USB) port, other forms of computer-readable media, or any combination thereof.
  • a router may act as a link between LANs, enabling messages to be sent from one to another.
  • communication links within LANs typically include twisted wire pair or coaxial cable, while communication links between networks may utilize analog telephone lines, full or fractional dedicated digital lines including T1, T2, T3, and T4, Integrated Services Digital Networks (ISDNs), Digital Subscriber Lines (DSLs), wireless links including satellite links, or other communications links known to those skilled in the art.
  • remote computers and other related electronic devices could be remotely connected to either LANs or WANs via a modem and temporary telephone link.
  • network 105 includes any communication method by which information may travel between client devices 102-104, and servers 106 and 107.
  • FIG. 2 shows another exemplary embodiment of the computer and network architecture that supports the inventive method and system.
  • the member devices 202a, 202b thru 202n shown each at least includes a computer-readable medium, such as a random access memory (RAM) 208 coupled to a processor 210 or FLASH memory.
  • the processor 210 may execute computer-executable program instructions stored in memory 208.
  • Such processors comprise a microprocessor, an ASIC, and state machines.
  • Such processors comprise, or may be in communication with, media, for example computer-readable media, which stores instructions that, when executed by the processor, cause the processor to perform the steps described herein.
  • Embodiments of computer-readable media may include, but are not limited to, an electronic, optical, magnetic, or other storage or transmission device capable of providing a processor, such as the processor 210 of client 202a, with computer-readable instructions.
  • suitable media may include, but are not limited to, a floppy disk, CD-ROM, DVD, magnetic disk, memory chip, ROM, RAM, an ASIC, a configured processor, all optical media, all magnetic tape or other magnetic media, or any other medium from which a computer processor can read instructions.
  • various other forms of computer-readable media may transmit or carry instructions to a computer, including a router, private or public network, or other transmission device or channel, both wired and wireless.
  • the instructions may comprise code from any computer-programming language, including, for example, C, C++, C#, Visual Basic, Java, Python, Perl, and JavaScript.
  • Member devices 202a-n may also comprise a number of external or internal devices such as a mouse, a CD-ROM, DVD, a keyboard, a display, or other input or output devices.
  • client devices 202a-n may be personal computers, digital assistants, personal digital assistants, cellular phones, mobile phones, smart phones, pagers, digital tablets, laptop computers, Internet appliances, and other processor-based devices.
  • a client device 202a may be any type of processor-based platform that is connected to a network 206 and that interacts with one or more application programs.
  • Client devices 202a-n may operate on any operating system capable of supporting a browser or browser-enabled application, such as Microsoft™, Windows™, or Linux.
  • the client devices 202a-n shown may include, for example, personal computers executing a browser application program such as Microsoft Corporation's Internet Explorer™, Apple Computer, Inc.'s Safari™, Mozilla Firefox, and Opera. Through the client devices 202a-n, users 212a-n communicate over the network 206 with each other and with other systems and devices coupled to the network 206. As shown in FIG. 2, server devices 204 and 213 may also be coupled to the network 206.
  • the term "mobile electronic device" may refer to any portable electronic device that may or may not be enabled with location tracking functionality.
  • a mobile electronic device can include, but is not limited to, a mobile phone, Personal Digital Assistant (PDA), Blackberry™, Pager, Smartphone, or any other reasonable mobile electronic device.
  • the terms “cloud,” “Internet cloud,” “cloud computing,” “cloud architecture,” and similar terms correspond to at least one of the following: (1) a large number of computers connected through a realtime communication network (e.g., Internet); (2) providing the ability to run a program or application on many connected computers (e.g., physical machines, virtual machines (VMs)) at the same time; (3) network-based services, which appear to be provided by real server hardware, and are in fact served up by virtual hardware (e.g., virtual servers), simulated by software running on one or more real machines (e.g., allowing to be moved around and scaled up (or down) on the fly without affecting the end user).
  • the instant invention offers/manages the cloud computing/architecture as, but not limited to: infrastructure as a service (IaaS), platform as a service (PaaS), and software as a service (SaaS).
  • Figures 3 and 4 illustrate schematics of exemplary implementations of the cloud computing/architecture.
  • present invention is a system, including: at least one server and specialized software stored on a non-transient computer readable medium accessible by the at least one server, where, when executing the specialized software, the at least one server becomes at least one specifically programmed server that is configured to: analyse a plurality of genome sequences obtained from a plurality of organisms, where each of the plurality of the organisms has at least one distinctive genetic element, where a number of organisms in the plurality of the organisms correlates with a genetic diversity level of the plurality of the organisms; assemble at least one DNA sequence corresponding to the genome sequences of each of the plurality of the organisms, generate a plurality of contigs based on the at least one DNA sequence assembled for each of the plurality of the organisms, plot digital representations of the plurality of the contigs into at least one population DeBruijn graph, map the plurality of the contigs based on a plurality of overlapping DNA sequence regions, identify a plurality of unique contigs from

Abstract

In some embodiments, the present invention is a method which includes: obtaining a whole genome of an organism, fragmenting the whole genome of the organism to produce a plurality of raw reads, tagging each raw read of the plurality of raw reads to produce a plurality of tagged raw reads, constructing a de Bruijn graph from the plurality of tagged raw reads, extrapolating data from the constructed de Bruijn graph to generate a plurality of contigs, mapping the at least one unique barcode of the plurality of unique barcodes to a corresponding unique contig or a corresponding supplementary contig, organizing each contig of the plurality of contigs according to the plurality of unique barcodes, identifying a connection weight using the de Bruijn graph, filtering the connection weight in the graph with low weight, and assembling a DNA sequence order, wherein the DNA sequence order is organized from a maximum OS to a minimum OE.

Description

SYSTEMS AND METHODS FOR COMPUTATIONAL DEMULTIPLEXING OF GENOMIC BARCODED SEQUENCES
RELATED APPLICATIONS
[0001] This application claims the priority of U.S. Patent Appln. No. 62/293,470; filed February 10, 2016; entitled "SYSTEMS AND METHODS FOR COMPUTATIONAL DEMULTIPLEXING OF GENOMIC BARCODED SEQUENCES," which is incorporated herein by reference in its entirety for all purposes.
FIELD OF THE INVENTION
[0002] The field of invention relates to methods of sequencing. In particular, the present invention provides methods for computational demultiplexing of genomic barcoded sequences.
BACKGROUND
[0003] Molecular biology protocols enable tagging short sequencing reads originating from several long genomic DNA molecules with unique barcodes. Those methods enable the extraction of long-range information based on standard short read sequencing of the tagged DNA. This information is useful for the following bioinformatics applications: (1) Phasing of heterozygous polymorphism, (2) Structural variation detection, and (3) De novo assembly analysis.
[0004] All of the above analyses depend on the ability to de-multiplex the reads of each barcode, which are pooled from several molecules, into their distinct molecules of origin. Typically, this bioinformatics analysis is accomplished by mapping to an external reference genome, and it is therefore not suitable for demultiplexing haplotype sequences that do not map to an external reference genome.
BRIEF DESCRIPTION OF THE FIGURES
[0005] The present invention will be further explained with reference to the attached drawings, wherein like structures are referred to by like numerals throughout the several views. The drawings shown are not necessarily to scale, with emphasis instead generally being placed upon illustrating the principles of the present invention. Further, some features may be exaggerated to show details of particular components.
[0006] Figures 1-4 are embodiments of the system used in the present invention.
[0007] Figures 5A and 5B show embodiments of the connected components as used in the methods of the present invention.
[0008] Figure 6 shows an exemplary embodiment of the relationship between contigs and barcodes as used in methods of the present invention.
[0009] The figures constitute a part of this specification and include illustrative embodiments of the present invention and illustrate various objects and features thereof. Further, the figures are not necessarily to scale, some features may be exaggerated to show details of particular components. In addition, any measurements, specifications and the like shown in the figures are intended to be illustrative, and not restrictive. Therefore, specific structural and functional details disclosed herein are not to be interpreted as limiting, but merely as a representative basis for teaching one skilled in the art to variously employ the present invention.
SUMMARY OF THE INVENTION
[00010] In some embodiments, the present invention is a method which includes: obtaining a whole genome of an organism,
fragmenting the whole genome of the organism to produce a plurality of molecules, wherein each molecule of the plurality of molecules is 500 base pairs to 2000 kilobases;
tagging each molecule of the plurality of molecules to produce a plurality of tagged raw reads,
wherein each of the plurality of tagged raw reads comprises at least one unique barcode;
constructing a de Bruijn graph from the plurality of tagged raw reads,
extrapolating data from the constructed de Bruijn graph to generate a plurality of contigs,
wherein the plurality of contigs comprises genomic sequence fragments from 32 base pairs to 100,000 base pairs, and
wherein each contig of the plurality of contigs comprises a set of overlapping DNA fragments,
determining a frequency of each contig of the plurality of contigs,
wherein the frequency is the number of times each contig of the plurality of contigs is present in the whole genome of the organism;
identifying unique contigs of the plurality of contigs,
wherein each unique contig appears once in the whole genome of the organism,
wherein the identifying is performed using the de Bruijn graph,
identifying a plurality of supplementary contigs,
wherein each supplementary contig appears two or more times in the whole genome of the organism and is a long contig,
wherein the identifying is performed using the de Bruijn graph,
wherein each supplementary contig is a long contig,
wherein the long contig is greater than 500 base pairs,
mapping the at least one unique barcode of the plurality of unique barcodes to a corresponding unique contig or a corresponding supplementary contig,
wherein the corresponding unique contig or the corresponding supplementary contig is fragmented into unique contig Kmers of 32 base pairs to 250 base pairs,
wherein at least one origin of the corresponding unique contig or the corresponding supplementary contig is stored,
wherein the unique contig Kmers corresponding with more than one contig are discarded, and
wherein the unique contig Kmers corresponding with one contig are identified as confirmed unique contig Kmers,
wherein each tagged raw read is fragmented to generate reads Kmers having a length of the confirmed unique contig Kmers,
wherein a match is identified when one read Kmer is matched to a unique contig Kmer,
organizing each contig of the plurality of contigs according to the plurality of unique barcodes,
constructing a weighted graph comprising a plurality of nodes,
wherein each node of the plurality of nodes corresponds to a selected contig,
identifying a connection weight between each two nodes as the number of shared barcodes between the nodes,
filtering the connection weight in the graph with low weight,
wherein the low weight of E_ij is less than 3,
organizing a first plurality of contigs into a first plurality of groups of assembled molecules,
wherein each group of the first plurality of groups of assembled molecules is a connected component in the graph,
organizing a second plurality of contigs into a second plurality of groups of assembled molecules,
wherein each group of the second plurality of groups of assembled molecules overlaps with at least one portion of a contig,
predicting a continuous sequence of the at least one portion of the contig by constructing a distance matrix between at least two contigs,
assigning each of the overlapping groups of the second plurality of groups of assembled molecules an Overlapping Start (OS) and an Overlapping End (OE),
wherein an OS is the minimal contig order, and
wherein an OE is the maximal contig order, and
assembling a DNA sequence order, wherein the DNA sequence order is organized from a maximum OS to a minimum OE.
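As a non-limiting illustration, the Kmer indexing and barcode mapping steps recited above can be sketched in Python. The function names, the toy sequences, and k=4 are assumptions chosen for readability (the embodiments above use Kmers of 32 to 250 base pairs), and reverse-complement handling is omitted for brevity.

```python
from collections import defaultdict


def build_unique_kmer_index(contigs, k):
    """Map each Kmer to its contig id, discarding Kmers found in more than one contig."""
    index = {}
    ambiguous = set()
    for contig_id, seq in contigs.items():
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if kmer in ambiguous:
                continue
            owner = index.get(kmer)
            if owner is None:
                index[kmer] = contig_id
            elif owner != contig_id:
                # Kmer originates from more than one contig: discard it.
                del index[kmer]
                ambiguous.add(kmer)
    return index


def map_reads_to_contigs(tagged_reads, index, k):
    """Return {barcode: set of contig ids} from exact read-Kmer matches."""
    hits = defaultdict(set)
    for barcode, read in tagged_reads:
        for i in range(len(read) - k + 1):
            contig_id = index.get(read[i:i + k])
            if contig_id is not None:
                hits[barcode].add(contig_id)
    return dict(hits)


# Toy data (hypothetical): the Kmer "ACGT" occurs in both contigs.
contigs = {"c1": "ACGTACGA", "c2": "TTGACGTT"}
index = build_unique_kmer_index(contigs, k=4)
reads = [("B1", "GTACGA"), ("B2", "TTGACG")]
barcode_to_contigs = map_reads_to_contigs(reads, index, k=4)
```

Here the shared Kmer is dropped from the index, so each barcode is assigned only to the contig that its remaining Kmers identify unambiguously.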
[00011] In some embodiments, the mapping step comprises: identifying a set of overlapping barcodes.
[00012] In some embodiments, the set of the remaining contigs comprises unique contigs or supplementary contigs.
[00013] In some embodiments, the connection weight is measured using a contigs pair, E_ij.
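As a non-limiting illustration, the connection weight between a pair of contigs, the low-weight filter, and the grouping into connected components can be sketched in Python. The function names and the barcode sets are hypothetical toy data, and the threshold of 2 is only one of the example values mentioned in this disclosure.

```python
from itertools import combinations


def connection_weights(contig_barcodes):
    """Weight of each contig pair = number of barcodes shared by the two contigs."""
    weights = {}
    for a, b in combinations(sorted(contig_barcodes), 2):
        shared = len(contig_barcodes[a] & contig_barcodes[b])
        if shared > 0:
            weights[(a, b)] = shared
    return weights


def connected_components(nodes, weights, min_weight):
    """Drop edges lighter than min_weight, then group nodes by connectivity."""
    adjacency = {n: set() for n in nodes}
    for (a, b), w in weights.items():
        if w >= min_weight:
            adjacency[a].add(b)
            adjacency[b].add(a)
    seen, components = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:  # iterative depth-first traversal
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(component)
    return components


# Hypothetical barcode sets per contig id.
contig_barcodes = {
    4: {"B2", "B3", "B4"},
    11: {"B2", "B3", "B4"},
    15: {"B3", "B4", "B5"},
    6: {"B6", "B7", "B8"},
    7: {"B2", "B6", "B7", "B8"},
}
weights = connection_weights(contig_barcodes)
components = connected_components(contig_barcodes, weights, min_weight=2)
```

In this toy input, contigs 4 and 7 share a single barcode, so that edge is removed by the filter and the two groups of contigs remain separate.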
[00014] In some embodiments, the contigs pair is a number of common barcodes within the set of the remaining contigs, wherein the number of common barcodes is at least one barcode at least in duplicate.
[00015] In some embodiments, the distance matrix between two contigs is measured by: $\max_{k,l}(E_{kl}) - E_{ij}$, where the maximum is taken over all the contigs k, l in the group.
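As a non-limiting illustration, the distance between two contigs (the largest connection weight in the group minus the connection weight of the pair) can be sketched in Python. The function name, the example weights, and the choice of 0 on the diagonal are assumptions made for this sketch.

```python
def distance_matrix(contigs, weights):
    """Distance of pair (i, j) = max weight over all pairs in the group - weight of (i, j).

    contigs -- list of contig ids in the group (defines row/column order)
    weights -- {(a, b): shared barcode count}; missing pairs count as 0
    The diagonal is set to 0 (an assumption of this sketch).
    """
    def weight(a, b):
        return weights.get((a, b), weights.get((b, a), 0))

    max_weight = max(
        weight(a, b) for a in contigs for b in contigs if a != b
    )
    return [
        [0 if a == b else max_weight - weight(a, b) for b in contigs]
        for a in contigs
    ]


# Hypothetical shared-barcode counts for three contigs.
group = [11, 15, 4]
shared = {(4, 11): 3, (4, 15): 2, (11, 15): 5}
matrix = distance_matrix(group, shared)
```

Pairs that share the most barcodes receive the smallest distance, so contigs that are adjacent on the molecule end up close together in the matrix.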
[00016] In some embodiments, the distance matrix is reordered to indicate that adjacent contigs are separated by a corresponding distance.
DETAILED DESCRIPTION
[00017] Among those benefits and improvements that have been disclosed, other objects and advantages of this invention will become apparent from the following description taken in conjunction with the accompanying figures. Detailed embodiments of the present invention are disclosed herein; however, it is to be understood that the disclosed embodiments are merely illustrative of the invention that may be embodied in various forms. In addition, each of the examples given in connection with the various embodiments of the invention is intended to be illustrative, and not restrictive.
[00018] Throughout the specification and claims, the following terms take the meanings explicitly associated herein, unless the context clearly dictates otherwise. The phrases "in one embodiment" and "in some embodiments" as used herein do not necessarily refer to the same embodiment(s), though they may. Furthermore, the phrases "in another embodiment" and "in some other embodiments" as used herein do not necessarily refer to a different embodiment, although they may. Thus, as described below, various embodiments of the invention may be readily combined, without departing from the scope or spirit of the invention.
[00019] This disclosure provides methods and systems for processing polynucleotides. Applications include processing polynucleotides for polynucleotide sequencing. Polynucleotide sequencing includes the sequencing of whole genomes, detection of specific sequences such as single nucleotide polymorphisms (SNPs) and other mutations, detection of nucleic acid (e.g., deoxyribonucleic acid) insertions, and detection of nucleic acid deletions.
[00020] Utilization of the methods and systems described herein may incorporate, unless otherwise indicated, conventional techniques of organic chemistry, polymer technology, microfluidics, molecular biology and recombinant techniques, cell biology, biochemistry, and immunology. Such conventional techniques include microwell construction, microfluidic device construction, polymer chemistry, restriction digestion, ligation, cloning, polynucleotide sequencing, and polynucleotide sequence assembly. Specific, non-limiting, illustrations of suitable techniques are described throughout this disclosure. However, equivalent procedures may also be utilized. Descriptions of certain techniques may be found in standard laboratory manuals, such as Genome Analysis: A Laboratory Manual Series (Vols. I-IV), Using Antibodies: A Laboratory Manual, Cells: A Laboratory Manual, PCR Primer: A Laboratory Manual, and Molecular Cloning: A Laboratory Manual (all from Cold Spring Harbor Laboratory Press), and "Oligonucleotide Synthesis: A Practical Approach" 1984, IRL Press London, all of which are herein incorporated in their entirety by reference for all purposes.
[00021] In some embodiments, the present invention is an analysis method that enables the de-multiplexing of the tagged reads information into their distinct origin molecules solely based on the tagged reads information.
[00022] In addition, as used herein, the term "or" is an inclusive "or" operator, and is equivalent to the term "and/or," unless the context clearly dictates otherwise. The term "based on" is not exclusive and allows for being based on additional factors not described, unless the context clearly dictates otherwise. In addition, throughout the specification, the meaning of "a," "an," and "the" include plural references. The meaning of "in" includes "in" and "on."
[00023] As used herein, "align", "alignment" or "sequence alignment" refers to a method of arranging sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Aligned sequences of nucleotide, for example, are generally represented as rows within a matrix.
[00024] As used herein, an "allele" refers to one of a number of alternative forms of the same gene or same genetic locus.
[00025] As used herein, "assembly" or "sequence assembly" refers to aligning and merging at least two fragments of a much longer DNA sequence to reconstruct the original sequence. In some embodiments, fragments can range in size from 20 to 30,000 bases.
[00026] As used herein, "base pair(s)" or "bp" refers to a unit consisting of two nucleobases bound to each other by hydrogen bonds. Base pairs form the building blocks of the DNA double helix, and contribute to the folded structure of both DNA and RNA. The base pairs are paired: adenine-thymine or guanine-cytosine, and allow the DNA helix to maintain a regular helical structure.
[00027] As used herein, "barcoding" or "DNA barcoding" refers to a method that enables the partition of genomic DNA into a large number of sets (100s-100,000s) such that each set contains several distinct long (e.g., in the range of 10kb-500kb) genomic DNA molecules. Later on, in each set, the long genomic DNA molecules are broken into smaller fragments and tagged with a unique label, e.g., a barcode. A barcode may be a polynucleotide sequence attached to all fragments of a target polynucleotide contained within a particular partition. Finally, the tagged DNA from all sets is pooled to generate a single library for NGS sequencing (e.g., Illumina). Non-limiting examples of such methods are GemCode or Chromium by 10X Genomics ("Haplotyping germline and cancer genomes with high-throughput Linked-Read sequencing", Zheng et al., Nature Biotechnology 2016) and Moleculo by Illumina ("Whole-genome haplotyping using long reads and statistical methods", Kuleshov et al., Nature Biotechnology 2014). Further non-limiting examples of such methods are disclosed in US 9,401,201, which is hereby incorporated by reference in its entirety. In some embodiments, the presence of the same barcode on multiple sequences may provide information about the origin of the sequence; e.g., a barcode may indicate that the sequence came from a particular partition and/or a proximal region of a genome.
[00028] As used herein, "confidence level" refers to a measure of the reliability of at least one estimate. Confidence levels include a range of values (intervals) that can be construed as estimates of an unknown population parameter. The level of confidence of the confidence interval indicates the probability that the confidence range can capture the actual population parameter given a distribution of samples. A confidence level can be represented as a percentage.
[00029] As used herein, a "connected component" is a group of contigs. Each of the contigs is a linear chain. The contigs are linked together by leveraging information of the reads (e.g., an original read, e.g., but not limited to, a molecule or a contig) and obtaining a set of scaffolds which constitute the final result of a de novo genome assembly.
[00030] As used herein, "consensus" or "consensus sequence" refers to a calculated order of most frequent residues, either nucleotide or amino acid, found at each position in a sequence alignment. In some embodiments, the consensus sequence represents the results of multiple sequence alignments in which related sequences are compared to each other and similar sequence motifs are calculated.
[00031] As used herein, a "contig" refers to a set of overlapping DNA segments that together represent a consensus region of genomic DNA. In some embodiments, the inventive system(s) of the present invention are configured to analyze/sort contigs having at least one haplotype, where the at least one haplotype comprises at least one marker that may be used for genetic analysis (e.g., by identifying an allele). In some embodiments, a contig can be a haplotype contig or a non-haplotype contig.
[00032] As used herein, a "continuous sequence" is a sequence resulting from the reassembly of small DNA fragments generated.
[00033] As used herein, a "corresponding distance" is a calculated distance between at least two contigs or fragments which illustrates the degree of sequence similarity.
[00034] As used herein, a "DeBruijn graph" refers to a computational system that assembles a contiguous genome from a large population (e.g., but not limited to, 1 million mer, 10 million mer, 100 million mer, 1 billion mer, 10 billion mer, etc.) of short sequencing reads.
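By way of non-limiting illustration, the following minimal Python sketch shows one common way a de Bruijn graph can be built from reads and walked to emit contigs. The function names (de_bruijn_graph, contigs_from_graph), the use of (k-1)-mers as nodes, and the rule of emitting maximal non-branching paths are assumptions made for this sketch and are not taken from the disclosure; reverse complements, sequencing errors, and isolated cycles are ignored.

```python
from collections import defaultdict

def de_bruijn_graph(reads, k):
    """Build a graph whose nodes are (k-1)-mers and whose edges are the k-mers seen in the reads."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def contigs_from_graph(graph):
    """Emit maximal non-branching paths as contigs (isolated cycles are skipped in this sketch)."""
    indeg = {}
    for u, successors in graph.items():
        for v in successors:
            indeg[v] = indeg.get(v, 0) + 1
    nodes = set(graph) | set(indeg)

    def is_simple(node):
        # exactly one incoming and one outgoing edge
        return indeg.get(node, 0) == 1 and len(graph.get(node, ())) == 1

    contigs = []
    for u in sorted(nodes):
        if is_simple(u):
            continue  # paths only start at branching nodes or at path ends
        for v in sorted(graph.get(u, ())):
            path = u + v[-1]
            while is_simple(v):
                v = next(iter(graph[v]))
                path += v[-1]
            contigs.append(path)
    return contigs

graph = de_bruijn_graph(["ATGGC", "GGCGT", "GGCAA"], k=3)
print(contigs_from_graph(graph))  # ['ATGGC', 'GCAA', 'GCGT']
```

In this toy input the node "GC" has two successors, so the walk stops there and three contigs are produced instead of one ambiguous sequence.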
[00035] As used herein, "demultiplexing" refers to, after sequencing, reads being assigned in silico to their long DNA molecule of origin. In contrast, as used herein, "multiplexing" refers to using short DNA indices to uniquely identify (or more correctly semi-uniquely by assigning the same index to several DNA molecules) each DNA sample. In some embodiments, demultiplexing enables barcoding different pieces of DNA in the same sample in order to generate long range information. Residual demultiplexing is needed, since methods assign the same barcode to several long DNA molecules from different genomic loci.
[00036] As used herein, a "DNA read" or "read" refers to overlapping fragments of DNA obtained by using, e.g., but not limited to, shotgun sequencing. When using a chromatogram to sequence DNA, the read is the sequence of letters at the top of each row. The reads are used to reconstruct an original sequence.
[00037] As used herein, "error-filtering" refers to a system configured to selectively reduce errors and facilitate variant detection in data from sequencing technologies.
[00038] As used herein, "error-free phase sequence" refers to a sequence after error-filtering. As used herein, "phase" means that adjacent linked alleles are phased into a single sequence.
[00039] As used herein, a "fragment" refers to a physical segment of DNA. In some embodiments, the fragment can be an overlapping physical segment of DNA.
[00040] As used herein, "genetic diversity" refers to the level of biodiversity, i.e., the total number of genetic characteristics in the genetic makeup of a species.
[00041] As used herein, "haplotype" refers to a set of DNA variations, or polymorphisms, that tend to be inherited together. A haplotype can refer to a combination of alleles or to a set of single nucleotide polymorphisms (SNPs) found on the same chromosome.
[00042] As used herein, a "k-mer" or "kmer" refers to all the possible subsequences (of length "k") from a read obtained through DNA sequencing.
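By way of non-limiting illustration, the k-mers of a read can be enumerated with a sliding window of length k. The function name kmers below is an assumption made for this sketch.

```python
def kmers(read: str, k: int) -> list[str]:
    """Return every length-k substring of `read`, in order of position."""
    if k <= 0:
        raise ValueError("k must be positive")
    # a read shorter than k yields an empty list
    return [read[i:i + k] for i in range(len(read) - k + 1)]

print(kmers("ACGTAC", 4))  # ['ACGT', 'CGTA', 'GTAC']
```

A read of length L yields L - k + 1 k-mers, so a 6-base read yields three 4-mers.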
[00043] As used herein, an "input read length" refers to an initial starting sequence used to build at least one contig and includes at least a portion of one kmer.
[00044] As used herein, a "marker" or a "genetic marker", refers to a gene or short sequence of DNA used to identify a chromosome or to locate other genes on a genetic map.
[00045] As used herein, a "minimal contig order" is a position in which a DNA fragment or portion begins. As used herein, a "maximum contig order" is a position in which a DNA fragment or portion ends.
[00046] As used herein, "mer" refers to an oligonucleotide, where when applied to DNA, mer refers to the number of bases in the molecules (e.g., 10-mer, 100-mer, etc.).
[00047] As used herein, "overlapping," refers to polynucleotide fragments, generally referring to a collection of polynucleotide fragments with overlapping sequence. A genome may be fragmented randomly (e.g., but not limited to, by shearing in a pipette) or non-randomly (e.g., but not limited to, by digesting with a rare cutter). Fragmenting randomly produces overlapping sequences because each copy of the genome is cut at different positions. After sequencing of the fragments (which provides "sequence contigs"), this overlap may be used to determine the linear order of the fragments, thereby enabling assembly of the entire genomic sequence. In some embodiments, fragmentation may be performed, e.g., but not limited to, by enzymatic digestion, exposure to ultraviolet (UV) light, ultrasonication, and/or mechanical agitation.
[00048] As used herein, "paired-end" or "PE" refers to DNA fragments that are sequenced from both ends (e.g., 5' end and 3' end) to generate pairs of reads. In some embodiments, a PE library is a population of PE DNA fragments of varying sizes and sequences.
[00049] As used herein, a "path" or "consensus path" refers to information provided in a DeBruijn graph that identifies the consensus of a genomic sequence with all sub-repeats in the genomic sequence substituted by the respective consensus sequences.
[00050] As used herein, a "polymorphism" refers to a difference between two different sequences.
[00051] The terms "polynucleotide" or "nucleic acid," as used herein, are used herein to refer to biological molecules comprising a plurality of nucleotides. Exemplary polynucleotides include deoxyribonucleic acids, ribonucleic acids, and synthetic analogues thereof, including peptide nucleic acids. In some embodiments, polynucleotides can be prepared using the methods disclosed in US 9,410,201.
[00052] As used herein, a "raw read" refers to the sequencing result produced from an automatic sequencing machine. The raw reads are short DNA sequences and are mixed together, not in genomic order. Inevitably, raw sequence also contains a few gaps, mistakes, and ambiguities.
[00053] As used herein, a "reference genome" or "reference assembly" is a digital nucleic acid sequence database, assembled by scientists as a representative example of a species' set of genes. Reference genomes are typically assembled from the sequencing of DNA from a number of donors, and generally do not accurately represent the set of genes of any single organism. Instead, the reference genome provides a haploid mosaic of different DNA sequences from each donor. In plants (e.g., maize, soybean, rice, etc.) the reference genome is typically assembled from a single variety.
[00054] As used herein, a "scaffold" refers to a technique which links together a non-continuous series of genomic sequences into a scaffold, consisting of sequences separated by gaps of known length. The sequences that are linked are typically contiguous sequences corresponding to read overlaps.
[00055] As used herein, "single nucleotide polymorphism" or "SNP" is a DNA sequence variation occurring within a population (e.g. 1%) in which a single nucleotide, e.g. Adenine ("A"), Thymine ("T"), Cytosine ("C") or Guanine ("G"), in the genome (or other shared sequence) differs between members of a biological species or paired chromosomes. The DNA sequence variation can occur only once. The DNA sequence variation can occur two or more times.
[00056] As used herein, a "structural variation" refers to a region of DNA which is approximately 1 kilobase (kb) or larger in size and can include inversions, balanced translocations, and/or genomic imbalances (e.g., DNA insertions and/or DNA deletions), and is typically referred to as copy number variants (CNVs).
[00057] As used herein, "unique" or "uniqueness" refers to a contig and is related to the variability in the contig's adjacent sequences. A unique contig means that a prediction is made that this contig is embedded in a single long sequence and therefore is predicted to appear only in one locus in the genome. If a sequence is only generated once, then the sequence has high uniqueness. In some embodiments, a sequence having the high uniqueness appears/is present only once in a genome. Alternatively, if a sequence has multiple copies generated, then the sequence has low uniqueness. In contrast, as used herein, a "supplementary"
[00058] As used herein, "tag" or "tagging" refers to combining at least one barcode with a DNA fragment.
[00059] As used herein, a "whole genome" refers to the entirety of a genome. In some embodiments, the whole genome can be, e.g., mammalian, plant, bacterial, protozoan, etc.
Methods for computational demultiplexing of genomic barcoded sequences
[00060] In some embodiments, the present invention is a method which includes:
obtaining a whole genome of an organism,
fragmenting the whole genome of the organism to produce a plurality of molecules, wherein each molecule of the plurality of molecules is 500 base pairs to 2000 kilobases;
tagging each molecule of the plurality of molecules to produce a plurality of tagged raw reads,
wherein each of the plurality of tagged raw reads comprises at least one unique barcode;
constructing a debruijn graph from the plurality of tagged raw reads,
extrapolating data from the constructed debruijn graph to generate a plurality of contigs,
wherein the plurality of contigs comprises genomic sequence fragments from 32 base pairs to 100,000 base pairs, and
wherein each contig of the plurality of contigs comprises a set of overlapping DNA fragments,
determining a frequency of each contig of the plurality of contigs,
wherein the frequency is the number of times each contig of the plurality of contigs is present in the whole genome of the organism;
identifying unique contigs of the plurality of contigs,
wherein each unique contig appears once in the whole genome of the organism, wherein the identifying is performed using the Debruijn graph,
identifying a plurality of supplementary contigs, wherein each supplementary contig appears two or more times in the whole genome of the organism and are long contigs,
wherein the identifying is performed using the Debruijn graph,
wherein each supplementary contig is a long contig,
wherein the long contig is greater than 500 base pairs,
mapping the at least one unique barcode of the plurality of unique barcodes to a corresponding unique contig or a corresponding supplementary contig,
wherein the corresponding unique contig or the corresponding supplementary contig is fragmented into unique contig Kmers of 32 base pairs to 250 base pairs, wherein at least one origin of the corresponding unique contig or the corresponding supplementary contig is stored,
wherein the unique contig Kmers corresponding with more than one contig are discarded, and
wherein the unique contig Kmers corresponding with one contig are identified as confirmed unique contig Kmers,
wherein each tagged raw read is fragmented to generate reads Kmers having a length of the confirmed unique contig Kmers,
wherein a match is identified when one read Kmer is matched to a unique contig Kmer,
organizing each contig of the plurality of contigs according to the plurality of unique barcodes,
constructing a weighted graph comprising a plurality of nodes,
wherein each node of the plurality of nodes corresponds to a selected contig,
identifying a connection weight between each two nodes as the number of shared barcodes between the nodes,
filtering out the connections in the graph that have a low weight,
wherein the low weight of $E_{ij}$ is less than 3,
organizing a first plurality of contigs into a first plurality of groups of assembled molecules,
wherein each group of the first plurality of groups of assembled molecules is a connected component in the graph,
organizing a second plurality of contigs into a second plurality of groups of assembled molecules,
wherein each group of the second plurality of groups of assembled molecules overlaps with at least one portion of a contig,
predicting a continuous sequence of the at least one portion of the contig by constructing a distance matrix between at least two contigs,
assigning each of the overlapping groups of the second plurality of groups of assembled molecules an Overlapping Start (OS) and an Overlapping End (OE),
wherein an OS is the minimal contig order, and
wherein an OE is the maximal contig order, and
assembling a DNA sequence order, wherein the DNA sequence order is organized from a maximum OS to a minimum OE.
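By way of non-limiting illustration, the mapping step recited above (contig Kmers stored with their contig of origin, Kmers found in more than one contig discarded, and read Kmers matched against the remaining Kmers) can be sketched in Python with a dictionary used as the hash table. The function names build_kmer_index and map_barcodes, the input layout, and the toy sequences are assumptions made for this sketch; reverse-complement matching and sequencing errors are not handled.

```python
from collections import defaultdict

def build_kmer_index(contigs, k):
    """Map each k-mer to its contig id; k-mers found in more than one contig are discarded."""
    index = {}
    ambiguous = set()
    for contig_id, seq in contigs.items():
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if kmer in ambiguous:
                continue
            owner = index.get(kmer)
            if owner is None:
                index[kmer] = contig_id
            elif owner != contig_id:
                del index[kmer]          # seen in a second contig: not unique
                ambiguous.add(kmer)
    return index

def map_barcodes(tagged_reads, index, k):
    """tagged_reads: iterable of (barcode, sequence). Returns barcode -> set of matched contig ids."""
    hits = defaultdict(set)
    for barcode, seq in tagged_reads:
        for i in range(len(seq) - k + 1):
            contig_id = index.get(seq[i:i + k])
            if contig_id is not None:
                hits[barcode].add(contig_id)
    return dict(hits)

contigs = {"c1": "ACGTACGG", "c2": "TTGTACGA"}   # toy contigs sharing GTAC and TACG
index = build_kmer_index(contigs, k=4)
reads = [("bc1", "CGTAC"), ("bc1", "TTGTA"), ("bc2", "GTACG")]
print(map_barcodes(reads, index, k=4))
```

In this toy input, barcode bc1 matches both contigs, while bc2 matches none because its only Kmers are the two shared (discarded) ones.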
[00061] In some embodiments, the mapping step comprises: identifying a set of overlapping barcodes.
[00062] In some embodiments, the set of the remaining contigs comprises unique contigs or supplementary contigs.
[00063] In some embodiments, the connection weight is measured using a contigs pair, $E_{ij}$.
[00064] In some embodiments, the contigs pair is a number of common barcodes within the set of the remaining contigs, wherein the number of common barcodes is at least one barcode at least in duplicate.
[00065] In some embodiments, the distance matrix between two contigs $i$ and $j$ is measured by: $\max_{k,l \text{ over all the contigs in the group}}(E_{kl}) - E_{ij}$.
[00066] In some embodiments, the distance matrix is reordered to indicate that adjacent contigs are separated by a corresponding distance.
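By way of non-limiting illustration, the distance described above can be computed from a symmetric matrix E of shared-barcode counts, and the contigs can then be reordered so that neighbors have a short distance. The disclosure does not state a reordering algorithm; the greedy nearest-neighbor walk below, the exclusion of the diagonal from the maximum, and the function names distance_matrix and greedy_order are assumptions made for this sketch.

```python
def distance_matrix(E):
    """D[i][j] = max over pairs k != l of E[k][l], minus E[i][j]; the diagonal is set to 0."""
    n = len(E)
    e_max = max((E[k][l] for k in range(n) for l in range(n) if k != l), default=0)
    return [[0 if i == j else e_max - E[i][j] for j in range(n)] for i in range(n)]

def greedy_order(D):
    """Order contigs by starting at one end of the most distant pair and stepping to the nearest unvisited contig."""
    n = len(D)
    if n < 2:
        return list(range(n))
    pairs = ((i, j) for i in range(n) for j in range(i + 1, n))
    start, _ = max(pairs, key=lambda p: D[p[0]][p[1]])
    order = [start]
    remaining = set(range(n)) - {start}
    while remaining:
        last = order[-1]
        nxt = min(remaining, key=lambda j: (D[last][j], j))
        order.append(nxt)
        remaining.remove(nxt)
    return order

# Toy shared-barcode counts for three contigs; contig 0 lies between contigs 1 and 2.
E = [[0, 8, 7],
     [8, 0, 3],
     [7, 3, 0]]
D = distance_matrix(E)
print(D)                # [[0, 0, 1], [0, 0, 5], [1, 5, 0]]
print(greedy_order(D))  # [1, 0, 2]
```

Contigs that share many barcodes get a distance near zero, so the walk places them next to each other.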
[00067] In some embodiments, the method of the present invention includes: generating a plurality of tagged raw reads containing unique bar codes for a set of long genomic DNA molecules obtained from a whole genome of an organism, where a group of tagged raw reads of the plurality of raw reads originated from a long genomic DNA molecule is tagged with a barcode selected from a plurality of unique barcodes,
constructing a debruijn graph from the plurality of tagged raw reads,
analyzing the debruijn graph and generating a plurality of contigs, where the plurality of contigs are genomic sequence fragments ranging from 32 base pairs to 100,000 base pairs,
where each individual contig within the plurality of contigs is a set of overlapping DNA segments that together represent a consensus region of genomic DNA,
analyzing the plurality of contigs to determine a number of times each individual contig of the plurality of contigs appears in the whole genome of the organism,
identifying unique contigs using the Debruijn graph, where the unique contigs of the plurality of contigs appear once in the whole genome of the organism,
identifying supplementary contigs using the Debruijn graph, where the supplementary contigs of the plurality of contigs appear a few times in the whole genome of the organism and are long contigs (i.e., greater than 500bp),
mapping the barcodes to the unique and supplementary contigs,
where the contigs are broken into unique contig Kmers of a length of between 32bp and 250bp and its unique contig of origin is stored,
where the unique contig Kmers which align with more than one contig are discarded, and
where the unique contig Kmers which align with only one contig are unique contig Kmers,
where the raw reads from a barcode of the plurality of barcodes are broken into reads Kmers at the length of the unique contig Kmers,
where a match is identified between a barcode and a contig when a read Kmer is matched to a unique contig Kmer,
splitting the contigs within each barcode based on their long molecule of origin by the following procedure:
identifying a set of additional barcodes that overlap with the subset of the plurality of the remaining unique contigs which include the barcode,
constructing a graph comprising nodes,
where the nodes correspond to the subset of the plurality of the remaining contigs (only long unique contigs or supplementary contigs) which mapped to the barcode,
identifying connection weights in the graph, where a connection weight between a contigs pair, $E_{ij}$, is defined as the number of shared barcodes within the additional barcodes set,
filtering out the connections in the graph with low weight (i.e., $E_{ij} < 3$),
dividing the barcode contigs into groups of predicted common long molecule of origin, where each group is a connected component in the graph,
dividing the additional barcodes set into groups, where each of the groups overlaps with a group of the contigs,
predicting a contiguous sequence of the contigs within each group of predicted common long molecule of origin by building a distance matrix between the contigs, where the distance between two contigs $i$ and $j$ is measured by:
$\max_{k,l \text{ over all the contigs in the group}}(E_{kl}) - E_{ij}$, and
where the distance matrix is reordered to indicate that adjacent contigs have a short distance,
assigning each of the additional barcodes that overlap the group of contigs with predicted common long molecule of origin an Overlapping Start (OS), which is the minimal contig order that appears in the barcode, and an Overlapping End (OE), which is the maximal contig order that appears in the barcode, and
identifying all barcode contigs (long and short) of the predicted common long molecule of origin as contigs that also appear in the corresponding barcode group from the additional barcode set, and predicting a DNA sequence order in the DNA molecule in the range of $\max(OS_{barcode}) \rightarrow \min(OE_{barcode})$.
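By way of non-limiting illustration, the Overlapping Start (OS) and Overlapping End (OE) assignment can be sketched in Python as follows. The sketch reads "contig order" as the position of a contig in the reordered group and reads the final range as running from the largest OS to the smallest OE across the additional barcodes; both readings, as well as the function names overlap_bounds and common_range, are assumptions made for this sketch.

```python
def overlap_bounds(order, barcode_contigs):
    """For each barcode, return (OS, OE): the minimal and maximal position of its contigs in `order`."""
    position = {contig: i for i, contig in enumerate(order)}
    bounds = {}
    for barcode, contigs in barcode_contigs.items():
        idx = [position[c] for c in contigs if c in position]
        if idx:  # barcodes with no contig in this group are skipped
            bounds[barcode] = (min(idx), max(idx))
    return bounds

def common_range(bounds):
    """Return (max OS, min OE) over all barcodes, or None when the range is empty."""
    if not bounds:
        return None
    start = max(s for s, _ in bounds.values())
    end = min(e for _, e in bounds.values())
    return (start, end) if start <= end else None

order = ["c1", "c2", "c3", "c4", "c5"]          # predicted contig order in one group
barcode_contigs = {
    "bA": {"c1", "c2", "c3", "c4"},
    "bB": {"c2", "c3", "c5"},
    "bC": {"c2", "c4", "cX"},                    # cX is not in this group and is ignored
}
bounds = overlap_bounds(order, barcode_contigs)
print(bounds)                # {'bA': (0, 3), 'bB': (1, 4), 'bC': (1, 3)}
print(common_range(bounds))  # (1, 3)
```

Here positions 1 through 3 (contigs c2 to c4) are covered by every additional barcode.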
[00068] In some embodiments, molecular biology protocols can divide an organism's entire genomic DNA into large sets of genomic DNA molecules (e.g., but not limited to, 10s-100,000s of genomic DNA molecules (e.g., but not limited to, 10 genomic DNA molecules, 100 genomic DNA molecules, 1,000 genomic DNA molecules, 10,000 genomic DNA molecules, 100,000 genomic DNA molecules, 500,000 genomic DNA molecules, etc.); also referred to herein as "molecules"), where each set contains dozens of distinct long genomic DNA molecules (e.g., but not limited to, in a range of 10kb-2000kb (2MB)). Next, the long genomic DNA molecules of each set are broken into smaller DNA fragments and each DNA fragment is tagged with a unique barcode. Finally, the tagged DNA fragments from all of the sets are pooled to generate a single library for next generation sequencing.
[00069] In some embodiments, the present invention is a method including demultiplexing tagged DNA reads, so as to result in identifying the distinct origin of the tagged DNA read.
[00070] In some embodiments of the method of the present invention, a plurality of overlapping long DNA molecules from the same genomic region of an organism exist in several sets of genomic DNA, where there is a low probability that long molecules from two different genomic regions co-exist in more than one set. In some embodiments, the method includes de-multiplexing the tagged DNA reads by mapping the tagged DNA reads to a sample of origin. In some embodiments, the method includes assembling contigs using Debruijn graph construction. In some embodiments, the tagged reads are transformed into computationally tagged contigs and then the tagged contigs are de-multiplexed, so as to result in: (i) computational efficiency, where computational efficiency refers to mapping the tagged DNA reads to the tagged contigs, in which the cumulative length of the tagged contigs is similar to the size of the genome, and (ii) mapping efficiency, where mapping the tagged contigs allows for matching reads from overlapping or adjacent genomic regions.
[00071] In some embodiments, long contigs are contigs greater than 500bp. In some embodiments, long contigs are contigs greater than 1kb. In some embodiments, long contigs are contigs greater than 1.5kb. In some embodiments, long contigs are contigs greater than 2kb. In some embodiments, long contigs are contigs greater than 2.5kb. In some embodiments, long contigs are contigs greater than 3kb. In some embodiments, long contigs are contigs greater than 3.5kb. In some embodiments, long contigs are contigs greater than 4kb. In some embodiments, long contigs are contigs greater than 4.5kb. In some embodiments, long contigs are contigs greater than 5kb. In some embodiments, long contigs are contigs greater than 5.5kb. In some embodiments, long contigs are contigs greater than 6kb. In some embodiments, long contigs are contigs greater than 6.5kb. In some embodiments, long contigs are contigs greater than 7kb. In some embodiments, long contigs are contigs greater than 7.5kb. In some embodiments, long contigs are contigs greater than 8kb. In some embodiments, long contigs are contigs greater than 8.5kb. In some embodiments, long contigs are contigs greater than 9kb. In some embodiments, long contigs are contigs greater than 9.5kb. In some embodiments, long contigs are contigs greater than 10kb.
[00072] In some embodiments, long contigs are contigs from 500bp to 10 kb. In some embodiments, long contigs are contigs from 500bp to 9 kb. In some embodiments, long contigs are contigs from 500bp to 8 kb. In some embodiments, long contigs are contigs from 500bp to 7 kb. In some embodiments, long contigs are contigs from 500bp to 6 kb. In some embodiments, long contigs are contigs from 500bp to 5 kb. In some embodiments, long contigs are contigs from 500bp to 4 kb. In some embodiments, long contigs are contigs from 500bp to 3 kb. In some embodiments, long contigs are contigs from 500bp to 2 kb. In some embodiments, long contigs are contigs from 500bp to 1 kb.
[00073] In some embodiments, long contigs are contigs from 1kb to 10 kb. In some embodiments, long contigs are contigs from 2kb to 10 kb. In some embodiments, long contigs are contigs from 3kb to 10 kb. In some embodiments, long contigs are contigs from 4kb to 10 kb. In some embodiments, long contigs are contigs from 5kb to 10 kb. In some embodiments, long contigs are contigs from 6kb to 10 kb. In some embodiments, long contigs are contigs from 7kb to 10 kb. In some embodiments, long contigs are contigs from 8kb to 10 kb. In some embodiments, long contigs are contigs from 9kb to 10 kb.
[00074] In some embodiments, long contigs are contigs from 1kb to 9 kb. In some embodiments, long contigs are contigs from 2kb to 8 kb. In some embodiments, long contigs are contigs from 3kb to 7 kb. In some embodiments, long contigs are contigs from 4kb to 6 kb.
[00075] In some embodiments, the genomic sequence fragments are 32bp to 100,000 bp (100kb). In some embodiments, the genomic sequence fragments are 50bp to 100,000 bp (100kb). In some embodiments, the genomic sequence fragments are 100bp to 100,000 bp (100kb). In some embodiments, the genomic sequence fragments are 200bp to 100,000 bp (100kb). In some embodiments, the genomic sequence fragments are 300bp to 100,000 bp (100kb). In some embodiments, the genomic sequence fragments are 400bp to 100,000 bp (100kb). In some embodiments, the genomic sequence fragments are 500bp to 100,000 bp (100kb). In some embodiments, the genomic sequence fragments are 600bp to 100,000 bp (100kb). In some embodiments, the genomic sequence fragments are 700bp to 100,000 bp (100kb). In some embodiments, the genomic sequence fragments are 800bp to 100,000 bp (100kb). In some embodiments, the genomic sequence fragments are 900bp to 100,000 bp (100kb).
[00076] In some embodiments, the genomic sequence fragments are 1,000bp to 100,000 bp (100kb). In some embodiments, the genomic sequence fragments are 10,000bp to 100,000 bp (100kb). In some embodiments, the genomic sequence fragments are 25,000bp to 100,000 bp (100kb). In some embodiments, the genomic sequence fragments are 50,000bp to 100,000 bp (100kb). In some embodiments, the genomic sequence fragments are 75,000bp to 100,000 bp (100kb).
[00077] In some embodiments, the genomic sequence fragments are 32bp to 75,000 bp (75kb). In some embodiments, the genomic sequence fragments are 32bp to 50,000 bp (50kb). In some embodiments, the genomic sequence fragments are 32bp to 25,000 bp (25kb). In some embodiments, the genomic sequence fragments are 32bp to 10,000 bp (10kb). In some embodiments, the genomic sequence fragments are 32bp to 1,000 bp (1kb).
[00078] In some embodiments, the genomic sequence fragments are 32bp to 900 bp. In some embodiments, the genomic sequence fragments are 32bp to 800 bp. In some embodiments, the genomic sequence fragments are 32bp to 700 bp. In some embodiments, the genomic sequence fragments are 32bp to 600 bp. In some embodiments, the genomic sequence fragments are 32bp to 500 bp. In some embodiments, the genomic sequence fragments are 32bp to 400 bp. In some embodiments, the genomic sequence fragments are 32bp to 300 bp. In some embodiments, the genomic sequence fragments are 32bp to 200 bp. In some embodiments, the genomic sequence fragments are 32bp to 100 bp.
[00079] In some embodiments, the genomic sequence fragments are 500bp to 75,000 bp. In some embodiments, the genomic sequence fragments are 1,000bp to 50,000 bp. In some embodiments, the genomic sequence fragments are 10,000bp to 25,000 bp.
[00080] In some embodiments, the unique contig Kmers are 32 bp to 250bp. In some embodiments, the unique contig Kmers are 32 bp to 225bp. In some embodiments, the unique contig Kmers are 32 bp to 200bp. In some embodiments, the unique contig Kmers are 32 bp to 175bp. In some embodiments, the unique contig Kmers are 32 bp to 150bp. In some embodiments, the unique contig Kmers are 32 bp to 125bp. In some embodiments, the unique contig Kmers are 32 bp to 100bp. In some embodiments, the unique contig Kmers are 32 bp to 75bp. In some embodiments, the unique contig Kmers are 32 bp to 50bp.
[00081] In some embodiments, the unique contig Kmers are 50 bp to 250bp. In some embodiments, the unique contig Kmers are 75 bp to 250bp. In some embodiments, the unique contig Kmers are 100 bp to 250bp. In some embodiments, the unique contig Kmers are 125 bp to 250bp. In some embodiments, the unique contig Kmers are 150 bp to 250bp. In some embodiments, the unique contig Kmers are 175 bp to 250bp. In some embodiments, the unique contig Kmers are 200 bp to 250bp. In some embodiments, the unique contig Kmers are 225 bp to 250bp. In some embodiments, the unique contig Kmers are 50 bp to 225bp. In some embodiments, the unique contig Kmers are 75 bp to 200bp. In some embodiments, the unique contig Kmers are 100 bp to 175bp. In some embodiments, the unique contig Kmers are 125 bp to 150bp.
[00082] In some embodiments, the low weight of $E_{ij}$ is less than 3. In some embodiments, the low weight of $E_{ij}$ is less than 2. In some embodiments, the low weight of $E_{ij}$ is less than 1. In some embodiments, the low weight of $E_{ij}$ is less than 0.5.
[00083] In some embodiments, the low weight of $E_{ij}$ is 0.5 to 3. In some embodiments, the low weight of $E_{ij}$ is 1 to 3. In some embodiments, the low weight of $E_{ij}$ is 1.5 to 3. In some embodiments, the low weight of $E_{ij}$ is 2 to 3. In some embodiments, the low weight of $E_{ij}$ is 2.5 to 3. In some embodiments, the low weight of $E_{ij}$ is 0.5 to 2.5. In some embodiments, the low weight of $E_{ij}$ is 0.5 to 2. In some embodiments, the low weight of $E_{ij}$ is 0.5 to 1.5. In some embodiments, the low weight of $E_{ij}$ is 0.5 to 1. In some embodiments, the low weight of $E_{ij}$ is 1 to 2.5. In some embodiments, the low weight of $E_{ij}$ is 1.5 to 2.
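By way of non-limiting illustration, the weighted graph, the low-weight filter, and the grouping into connected components can be sketched in Python as follows. The function names shared_barcode_weights and connected_components, the dictionary layout of the input, and the default threshold of 3 (taken from one of the embodiments above) are assumptions made for this sketch.

```python
from itertools import combinations

def shared_barcode_weights(contig_barcodes):
    """contig_barcodes: contig id -> set of barcodes. Returns {(a, b): number of shared barcodes}."""
    weights = {}
    for a, b in combinations(sorted(contig_barcodes), 2):
        w = len(contig_barcodes[a] & contig_barcodes[b])
        if w:
            weights[(a, b)] = w
    return weights

def connected_components(nodes, weights, min_weight=3):
    """Drop edges lighter than min_weight, then group the nodes into connected components."""
    adjacency = {n: set() for n in nodes}
    for (a, b), w in weights.items():
        if w >= min_weight:
            adjacency[a].add(b)
            adjacency[b].add(a)
    seen, components = set(), []
    for node in sorted(nodes):
        if node in seen:
            continue
        stack, component = [node], set()
        while stack:                      # iterative depth-first search
            current = stack.pop()
            if current in component:
                continue
            component.add(current)
            stack.extend(adjacency[current] - component)
        seen |= component
        components.append(component)
    return components

contig_barcodes = {
    "c1": {"b1", "b2", "b3", "b4"},
    "c2": {"b2", "b3", "b4", "b5"},
    "c3": {"b4", "b9"},
    "c4": {"b7", "b8", "b9"},
}
weights = shared_barcode_weights(contig_barcodes)
groups = connected_components(contig_barcodes, weights, min_weight=3)
print(groups)  # c1 and c2 form one group; c3 and c4 are singletons
```

In this toy input only the c1-c2 connection (three shared barcodes) survives the filter, so c1 and c2 are predicted to share a molecule of origin.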
[00084] In some embodiments, the whole genome of the organism is fragmented to produce a plurality of molecules. In some embodiments, each molecule of the plurality of molecules is a DNA molecule. In some embodiments, a molecule is 500bp to 2MB (megabase). In some embodiments, a molecule is 500bp to 1MB. In some embodiments, a molecule is 500bp to 0.5MB. In some embodiments, a molecule is 500bp to 250,000 kb. In some embodiments, a molecule is 500bp to 100,000 kb. In some embodiments, a molecule is 500bp to 50,000 kb. In some embodiments, a molecule is 500bp to 25,000 kb. In some embodiments, a molecule is 500bp to 10,000 kb. In some embodiments, a molecule is 500bp to 2,500 kb. In some embodiments, a molecule is 500bp to 1,000 kb. In some embodiments, a molecule is 500bp to 500 kb. In some embodiments, a molecule is 500bp to 250 kb. In some embodiments, a molecule is 500bp to 100 kb. In some embodiments, a molecule is 500bp to 50 kb. In some embodiments, a molecule is 500bp to 25 kb. In some embodiments, a molecule is 500bp to 10 kb. In some embodiments, a molecule is 500bp to 5 kb. In some embodiments, a molecule is 500bp to 2.5 kb. In some embodiments, a molecule is 500bp to 1 kb.
[00085] In some embodiments, a molecule is 1kb to 2MB. In some embodiments, a molecule is 2.5kb to 2MB. In some embodiments, a molecule is 5kb to 2MB. In some embodiments, a molecule is 10kb to 2MB. In some embodiments, a molecule is 25kb to 2MB. In some embodiments, a molecule is 50kb to 2MB. In some embodiments, a molecule is 100kb to 2MB. In some embodiments, a molecule is 250kb to 2MB. In some embodiments, a molecule is 500kb to 2MB. In some embodiments, a molecule is 1,000kb to 2MB. In some embodiments, a molecule is 2,500kb to 2MB. In some embodiments, a molecule is 5,000kb to 2MB. In some embodiments, a molecule is 10,000kb to 2MB. In some embodiments, a molecule is 25,000kb to 2MB. In some embodiments, a molecule is 50,000kb to 2MB. In some embodiments, a molecule is 100,000kb to 2MB. In some embodiments, a molecule is 250,000kb to 2MB. In some embodiments, a molecule is 500,000kb to 2MB. In some embodiments, a molecule is 1MB to 2MB. In some embodiments, a molecule is 1.5kb to 2MB. In some embodiments, a molecule is 1MB to 1.5MB.
[00086] In some embodiments, a molecule is 1kb to 1.5MB. In some embodiments, a molecule is 2.5kb to 1MB. In some embodiments, a molecule is 5kb to 0.5MB. In some embodiments, a molecule is 10kb to 0.25MB. In some embodiments, a molecule is 25kb to 100,000kb. In some embodiments, a molecule is 50kb to 50,000kb. In some embodiments, a molecule is 100kb to 25,000kb. In some embodiments, a molecule is 250kb to 10,000kb. In some embodiments, a molecule is 500kb to 5,000kb. In some embodiments, a molecule is 1,000kb to 2,500kb.
[00087] In some embodiments, the whole genome of an organism is derived from a plant, e.g., but not limited to, maize, rice, barley, etc. In some embodiments, the whole genome of an organism is derived from a mammal, e.g., but not limited to, human, feline, canine, murine, etc. In some embodiments, the whole genome of an organism is derived from a single-cell organism, e.g., but not limited to, a bacterium, an archaebacterium, a protozoan, etc.
[00088] In some embodiments, the method of the present invention includes dynamically constructing, by a specifically programmed computer system, a debruijn graph from the plurality of tagged raw reads. In some embodiments, the method further includes analyzing, by the specifically programmed computer system, a plurality of contigs.
[00089] In some embodiments, the method of the present invention includes dynamically constructing, by a specifically programmed computer system, a weighted graph comprising a plurality of nodes. In some embodiments, the method further includes identifying, by the specifically programmed computer system, a connection weight between each two nodes as the number of shared barcodes between the nodes, filtering the connection weight between each two nodes as the number of shared barcodes between the nodes, organizing a first plurality of contigs into a first plurality of groups of assembled molecules, organizing a second plurality of contigs into a second plurality of groups of assembled molecules, predicting a continuous sequence of the at least one portion of the contig by constructing a distance matrix between at least two contigs, assigning each of the overlapping groups of the second plurality of groups of assembled molecules an Overlapping Start (OS) and an Overlapping End (OE), assembling a DNA sequence order, wherein the DNA sequence order is organized from a maximum OS to a minimum OE, or any combination thereof.
[00090] Example 1:
[00091] The analysis includes the following steps:
[00092] (1) Building a debruijn graph based on all the raw reads (from all sets) and generating contigs. The contigs represent authentic genomic sequence fragments of varying length (from 64bp, or any other value equal to the debruijn kmer length + 1, to 10,000bp). The longer the contigs, the better the following steps perform, so building the debruijn graph with a large kmer (>100mer) and filtering sequencing errors (which generate false contigs) from the raw reads are required supplementary analysis stages for this step.
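As a rough illustration of step (1), the sketch below builds a debruijn graph from reads and spells out its unbranched paths as contigs. It is a toy under stated assumptions, not the implementation used in this example: the function names are hypothetical, reverse-complement strands, sequencing-error filtering and read coverage are ignored, isolated cycles are not reported, and the kmer length is kept tiny so that the example stays readable.

```python
from collections import defaultdict


def build_debruijn(reads, k):
    """Build successor/predecessor maps whose nodes are (k-1)-mers."""
    succ, pred = defaultdict(set), defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            succ[kmer[:-1]].add(kmer[1:])
            pred[kmer[1:]].add(kmer[:-1])
    return succ, pred


def contigs_from_graph(succ, pred):
    """Spell every maximal unbranched path that starts at a non-linear node."""
    def is_linear(node):
        # A linear node has exactly one way in and exactly one way out.
        return len(pred.get(node, ())) == 1 and len(succ.get(node, ())) == 1

    contigs = []
    for node in set(succ) | set(pred):
        if is_linear(node):
            continue
        for nxt in succ.get(node, ()):
            path, cur = node + nxt[-1], nxt
            while is_linear(cur):
                cur = next(iter(succ[cur]))
                path += cur[-1]
            contigs.append(path)
    return sorted(contigs)


reads = ["ATGGCGT", "GCGTGCA"]  # two overlapping toy reads (hypothetical data)
succ, pred = build_debruijn(reads, k=4)
contigs = contigs_from_graph(succ, pred)
print(contigs)  # ['ATGGCGTGCA']
```

With a branching input such as the reads "AAGCT" and "AAGCC" at k=3, the same functions stop each contig at the branch point, which is why longer kmers and error filtering tend to give longer contigs.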
[00093] (2) Analyzing the debruijn graph structure (including the calculated average read coverage of the contigs and the contig connectivity) and predicting the number of appearances of each contig in the genome (noise, one, two, three, four or more). Specifically, the analysis focuses on the contigs that appear once in the genome, which are defined as "unique contigs" ("UCs"). The information of interest is which two (or more) UCs originate from a single long DNA molecule. In some cases, the UCs are relatively short, so the analysis is supported by the mapping information of the tagged reads to additional long contigs (>500bp) that may appear several times in the genome (between 2 and 10). In addition:
[00094] (a) In a homozygous genome (such as a plant inbred), the diploid genomic sequence is defined to appear once in the genome and is defined as "UCs".
[00095] (b) In a heterozygous genome (such as human), since the two haploid sequences are not identical, the haploid genomic sequence is defined to appear once in the genome and is defined as "UCs", while the diploid genomic sequence is defined to appear twice.
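The text does not spell out how the number of appearances is derived from coverage and connectivity. The sketch below is only one simple possibility, shown to make the categories concrete: it divides a contig's average read coverage by an assumed single-copy coverage and rounds. The function name, the `single_copy_coverage` parameter and the `noise_fraction` cut-off are all hypothetical, and graph connectivity is not used at all.

```python
def classify_copy_number(coverage, single_copy_coverage, noise_fraction=0.25):
    """Map average read coverage to one of the categories named in step (2).

    Assumption: coverage scales linearly with the number of genomic copies.
    """
    ratio = coverage / single_copy_coverage
    if ratio < noise_fraction:
        return "noise"  # too little support, likely a sequencing-error contig
    copies = max(1, round(ratio))
    if copies >= 4:
        return "four or more"
    return ("one", "two", "three")[copies - 1]


# Hypothetical coverages, assuming a single-copy coverage of 30x.
for cov in (2, 30, 61, 200):
    print(cov, classify_copy_number(cov, single_copy_coverage=30))
```

Under this reading, contigs classified as "one" would be the candidate UCs; for a heterozygous genome the single-copy coverage would have to be set to the haploid level.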
[00096] (3) Mapping the tagged reads to the obtained unique or long contigs using a hash table method:
[00097] (a) Contigs are broken into Kmers with a length shorter than the read size (i.e., 100bp).
[00098] (b) The Kmers are the keys of the hash table and the value is their contig id. Kmers that originate from more than one contig are discarded.
[00099] (c) Each read is broken into Kmers, and in case more than a threshold (1, 2, 3 to 10) of the read Kmers are mapped to the same contig, a match is defined between the read barcode and the contig.
A matrix is generated as in the following simplified example (Figure imgf000030_0001), where 1 means that reads from the barcode were mapped to the contig, and the UC and LC columns flag unique and long contigs:
Contig  Length  UC  LC  B1  B2  B3  B4  B5  B6  B7  B8  B9  B10
4       1203    0   1   1   1   0   1   1   1   0   0   0   0
5       175     1   0   1   0   0   0   0   0   0   1   1   0
6       1223    0   1   1   0   0   0   0   0   1   1   1   1
7       2380    0   1   1   0   0   0   0   0   0   0   1   1
8       201     1   0   1   0   0   0   0   0   0   0   1   1
9       175     1   0   1   0   0   0   0   0   1   1   0   1
10      175     1   0   1   1   0   0   1   0   0   0   0   0
11      7805    0   1   1   0   0   0   1   1   0   0   0   0
12      2124    0   1   1   0   0   0   0   0   1   1   0   0
13      212     1   0   1   0   1   0   1   0   0   0   0   0
14      175     1   0   1   1   0   1   0   1   0   0   0   0
15      2351    0   1   1   0   0   1   1   1   0   0   0   0
16      6437    0   1   1   0   0   0   0   0   0   1   1   1
17      1729    0   1   1   0   0   0   0   0   0   0   1   1
18      2323    0   1   1   1   1   1   0   0   0   0   0   0
19      191     1   0   1   0   0   0   0   0   0   1   0   1
20      4231    0   1   1   1   1   1   1   0   0   0   0   0
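The hash table mapping of step (3) can be sketched as follows. This is a minimal illustration under stated assumptions rather than the implementation behind the example: the function names, the toy sequences and the tiny kmer length are hypothetical, and reverse-complement kmers are ignored. A Python dict plays the role of the hash table.

```python
from collections import Counter, defaultdict


def kmers(seq, k):
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_kmer_index(contigs, k):
    """Hash table: kmer -> contig id, keeping only kmers seen in one contig."""
    owners = defaultdict(set)
    for contig_id, seq in contigs.items():
        for kmer in kmers(seq, k):
            owners[kmer].add(contig_id)
    # Kmers that originate from more than one contig are discarded.
    return {kmer: next(iter(ids)) for kmer, ids in owners.items() if len(ids) == 1}


def map_reads(tagged_reads, index, k, threshold):
    """Return (barcode, contig id) matches for reads with > threshold kmer hits."""
    matches = set()
    for barcode, seq in tagged_reads:
        hits = Counter(index[kmer] for kmer in kmers(seq, k) if kmer in index)
        matches.update((barcode, cid) for cid, n in hits.items() if n > threshold)
    return matches


# Toy data (hypothetical): the kmer TTGC occurs in both contigs and is dropped.
contigs = {"c1": "ACGTTGCA", "c2": "GGATTGCC"}
index = build_kmer_index(contigs, k=4)
reads = [("B1", "CGTTGCA"), ("B2", "GATTGCC"), ("B2", "TTGC")]
matches = map_reads(reads, index, k=4, threshold=2)
print(sorted(matches))  # [('B1', 'c1'), ('B2', 'c2')]
```

The set of matches is what fills the contigs-to-barcodes matrix: a cell is 1 when the (barcode, contig) pair is present and 0 otherwise.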
[000100] (4) De-multiplex step: for each barcode id (i.e., B1):
[000101] (a) The group of additional barcodes (BX) in which the UCs in B1 co-appear is defined by searching for barcode ids in which the number of UCs from B1 that also appear there is above some threshold (i.e., >0; B2-B10 based on the above table example).
[000102] (b) Focusing on the long contigs in B1 (LCs, chosen such that their appearance probability in the matched barcodes is high), a B1-LC graph is built (example in FIG. 5): each LC is a node, and the number of barcodes (from BX, defined at step 4a) in which a pair of LCs is co-detected defines the weight of the edge between that pair (E_ij). Edges with a weight lower than a threshold (i.e., 2, 3 or 5, depending on sequence coverage and number of barcodes) are deleted from the graph.
[000103] (c) Searching for connected components in the B1-LC graph (example in FIG. 5): each connected component defines LCs in B1 with a common molecule of origin: B1_LCs_1 (with contigs 4, 11, 15, 18 and 20) and B1_LCs_2 (with contigs 6, 7, 12, 16, and 17). Figures 5A and 5B show the connected components as used in the methods of the present invention.
[000104] (d) Splitting BX into distinct groups (i.e., BX_1, BX_2, etc.), each of which overlaps with a distinct molecule in B1 (i.e., B1_LCs_1), identified by the barcodes in which the LCs in B1_LCs_1 appear. In our example, B1_LCs_1 matches B2-B6 and B1_LCs_2 matches B7-B10.
[000105] (e) Predicting the LCs order within each molecule (i.e., B1_LCs_1). LCs from the same molecule of origin are sorted based on the following procedure: a distance matrix is built between the LCs (1->n). The distance between an LC pair (i and j) is defined as max_{k,l: 1->n}(E_kl) - E_ij. Next, the distance matrix is reordered such that adjacent LCs have a low distance.
[000106] The distance matrix for B1_LCs_1 is:
Figure imgf000032_0001
The predicted order of the long contigs is 11, 15, 4, 20 and 18, and the reordered distance matrix for B1_LCs_1 is:
Figure imgf000032_0002
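Steps 4b, 4c and 4e can be replayed on the example matrix with the sketch below. The barcode sets are transcribed from the table rows of the long contigs; everything else is an assumption made for illustration: the function names are hypothetical, the edge threshold is set to 2 (one of the values the text suggests), and the reordering is done by brute force over all permutations, minimising the summed distance between neighbouring contigs. The text does not state which reordering procedure is actually used, and brute force is only feasible for very small groups.

```python
from itertools import combinations, permutations

# Long contigs (LCs) detected in B1, with the other barcodes (2 = B2, ...)
# in which each of them was also detected.
LC_BARCODES = {
    4: {2, 4, 5, 6}, 11: {5, 6}, 15: {4, 5, 6}, 18: {2, 3, 4}, 20: {2, 3, 4, 5},
    6: {7, 8, 9, 10}, 7: {9, 10}, 12: {7, 8}, 16: {8, 9, 10}, 17: {9, 10},
}
MIN_WEIGHT = 2  # edges lighter than this are deleted


def edge_weights(lc_barcodes):
    """E_ij = number of barcodes in which LCs i and j are co-detected."""
    return {(a, b): len(lc_barcodes[a] & lc_barcodes[b])
            for a, b in combinations(sorted(lc_barcodes), 2)}


def connected_components(nodes, weights, min_weight):
    """Group LCs linked by edges of weight >= min_weight (depth-first search)."""
    adjacent = {n: set() for n in nodes}
    for (a, b), w in weights.items():
        if w >= min_weight:
            adjacent[a].add(b)
            adjacent[b].add(a)
    seen, components = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            cur = stack.pop()
            if cur not in component:
                component.add(cur)
                stack.extend(adjacent[cur] - component)
        seen |= component
        components.append(sorted(component))
    return components


def order_contigs(group, weights):
    """Brute-force the order with the lowest summed distance between neighbours."""
    def weight(a, b):
        return weights[(min(a, b), max(a, b))]

    max_weight = max(weight(a, b) for a, b in combinations(group, 2))

    def cost(order):
        # distance(i, j) = max(E_kl) - E_ij, summed over adjacent pairs
        return sum(max_weight - weight(a, b) for a, b in zip(order, order[1:]))

    return list(min(permutations(group), key=cost))


weights = edge_weights(LC_BARCODES)
molecules = connected_components(LC_BARCODES, weights, MIN_WEIGHT)
print(molecules)                             # [[4, 11, 15, 18, 20], [6, 7, 12, 16, 17]]
print(order_contigs(molecules[0], weights))  # [11, 15, 4, 20, 18]
```

With these assumptions the two connected components and the order 11, 15, 4, 20, 18 agree with the example; the reverse order is equally good, since the data carry no direction.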
[000107] (a) The LCs ordering information is used to estimate the overlap of the molecule in each set of BX (i.e., BX_1) with the same molecule in B1 (i.e., B1_LCs_1): for each set of BX we define an Overlapping Start (OS) as the minimal LCs order (defined in 4e) that appears in the BX set and an Overlapping End (OE) as the maximal LCs order (defined in 4e) that appears in the BX set.
[000108] (b) Finally, all contigs (unique and long) in B1 that belong to B1_LCs_1 are defined as the contigs in B1 that appear in BX_1, and their order in the molecule is predicted to lie in the range max(OS_BX_1)->min(OE_BX_1). The output of the demultiplexing algorithm in the example above is:
Figure imgf000033_0001
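One possible reading of the OS/OE step is sketched below: each barcode of BX_1 gets an OS and an OE from the positions of the long contigs it contains, and each contig of B1 is then placed in the intersection of the intervals of the barcodes in which it was detected, i.e. from the maximum OS to the minimum OE. This interpretation, the function names and the choice of contigs shown are assumptions; the barcode sets come from the example table and the contig order from the text. The values computed here are not taken from the output figure.

```python
# Predicted LC order for molecule B1_LCs_1, as stated in the text.
LC_ORDER = [11, 15, 4, 20, 18]
POSITION = {contig: i + 1 for i, contig in enumerate(LC_ORDER)}  # 1-based order

# Barcodes of BX_1 (2 = B2, ..., 6 = B6) in which each contig was detected.
CONTIG_BARCODES = {
    4: {2, 4, 5, 6}, 11: {5, 6}, 15: {4, 5, 6}, 18: {2, 3, 4}, 20: {2, 3, 4, 5},
    10: {2, 5}, 13: {3, 5}, 14: {2, 4, 6},  # unique contigs of B1
}


def overlap_bounds(barcode):
    """(OS, OE): minimal and maximal LC order detected in this barcode."""
    positions = [POSITION[c] for c in LC_ORDER if barcode in CONTIG_BARCODES[c]]
    return min(positions), max(positions)


def predicted_range(contig):
    """Range max(OS) -> min(OE) over the barcodes that contain the contig."""
    bounds = [overlap_bounds(b) for b in CONTIG_BARCODES[contig]]
    return max(start for start, _ in bounds), min(end for _, end in bounds)


for contig in (10, 13, 14):
    print(contig, predicted_range(contig))
```

Under this reading, unique contig 10 would fall between order positions 3 and 4, that is, between long contigs 4 and 20.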
[000109] The contig IDs represent the contigs or their reverse complement contigs, as strand information (i.e., sense strand or anti-sense strand) cannot be identified from the barcoded data.
[000110] In another embodiment, the method includes mapping the barcoded reads into a pre-built debruijn graph from additional NGS data from the same sample. This option supports de novo assembly analysis by providing genomic coverage. Additionally, this method can use PCR-free libraries as a data source.
[000111] As a non-limiting example, this algorithm can be used on barcoded sequencing data generated using a single Chromium™ library (by 10X Genomics, CA, USA) sequenced on two lanes of a HiSeq x10 machine (Illumina) to generate 230Gb of sequence data (785*10^6 2X150bp reads) of a bovine genome (Nellore beef cattle).
[000112] The statistics of the obtained DeBruijn graph are the following:
Total number of contigs: 22,445,962
Total Assembly size: 3,534,210,010
Contig N50: 398
Max contig length: 62,105
[000113] Analyzing the DeBruijn graph allows identifying 9,538,995 of the contigs as unique, covering 1,693Mbp (47.6% of the total assembly size), and 9,557,415 of the contigs as supplementary, covering 1,210Mbp (34% of the total assembly size).
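For readers unfamiliar with the contig N50 statistic listed above, the sketch below shows the usual definition: the length of the contig at which the running sum of lengths, taken from longest to shortest, first reaches half of the total assembly size. The function name and the toy lengths are hypothetical; the actual contig lengths of the bovine assembly are not given in the text.

```python
def n50(lengths):
    """Contig N50 of a list of contig lengths."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("n50 of an empty list is undefined")


print(n50([2, 3, 4, 5, 6]))  # 5, because 6 + 5 = 11 covers half of 20
```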
[000114] Upon mapping the barcoded reads (tagged raw reads) to the graph, 1,119,393,205 tagged raw reads that mapped to the graph (e.g., with an identical overlap of at least 127bp between the read and the contig) are identified. A matrix is generated, assigning the contigs to a barcode matrix. Reliable barcodes are identified as barcodes having more than 60 tagged raw reads (e.g., but not limited to, 60 tagged raw reads, 61 tagged raw reads, 62 tagged raw reads, etc.). 1,323,829 barcodes are obtained to assign the final contigs to the barcodes matrix.
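The reliable-barcode filter described above amounts to counting mapped reads per barcode and keeping the barcodes above a cut-off. A minimal sketch, assuming the barcodes of the mapped reads are available as a plain list (the function name and the toy barcodes are hypothetical):

```python
from collections import Counter


def reliable_barcodes(read_barcodes, min_reads=60):
    """Keep barcodes carried by more than min_reads mapped tagged reads."""
    counts = Counter(read_barcodes)
    return {barcode for barcode, n in counts.items() if n > min_reads}


# Toy input: barcode "A" has 61 mapped reads, "B" has 60 and "C" has 5.
barcodes_of_mapped_reads = ["A"] * 61 + ["B"] * 60 + ["C"] * 5
print(reliable_barcodes(barcodes_of_mapped_reads))  # {'A'}
```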
[000115] Upon running the algorithm, the long molecules within barcodes were demultiplexed. Overall, 8.24 molecules were identified per barcode, with an average length of 41kbp. On average, each molecule is composed of 111.63 contigs, and part of the molecule appears in 119.6 distinct barcodes.
[000116] Figure 6 shows an example of the contigs to barcodes matrix of a long molecule composed of 118 contigs (Y-axis) and overlapped with 79 barcodes (X-axis). White cells indicate that reads from the corresponding barcode were mapped to the corresponding contig, while black cells indicate an unmatched barcode and contig.
[000117] Illustrative Operating Environments
[000118] FIG. 1 illustrates one embodiment of an environment in which the present invention may operate. However, not all of these components may be required to practice the invention, and variations in the arrangement and type of the components may be made without departing from the spirit or scope of the present invention. In some embodiments, the inventive system and method may include a large number of members and/or concurrent transactions. In other embodiments, the inventive system(s) and method(s) are based on a scalable computer and network architecture that incorporates various strategies for assessing the data, caching, searching, and database connection pooling. An example of the scalable architecture is an architecture that is capable of operating multiple servers.
[000119] In embodiments, members of the computer system 102-104 include virtually any computing device capable of receiving and sending a message over a network, such as network 105, to and from another computing device, such as servers 106 and 107, each other, and the like. In embodiments, the set of such devices includes devices that typically connect using a wired communications medium such as personal computers, multiprocessor systems, microprocessor-based or programmable consumer electronics, network PCs, and the like. In embodiments, the set of such devices also includes devices that typically connect using a wireless communications medium such as cell phones, smart phones, pagers, walkie talkies, radio frequency (RF) devices, infrared (IR) devices, CBs, integrated devices combining one or more of the preceding devices, or virtually any mobile device, and the like. Similarly, in embodiments, client devices 102-104 are any device that is capable of connecting using a wired or wireless communication medium such as a PDA, POCKET PC, wearable computer, and any other device that is equipped to communicate over a wired and/or wireless communication medium.
[000120] In embodiments, the inventive system(s) of the present invention can deliver information (e.g., DNA sequences, an analysis of at least one DNA sequence) to at least one user. In some embodiments, the at least one user is remotely located. In some embodiments, the at least one user is a farmer. In some embodiments, the at least one user may be a company specializing in growing and/or distributing seeds and/or plants (e.g., but not limited to, maize, rice, wheat, etc.). In some embodiments, the inventive system(s) of the present invention can deliver information to at least one user by use of a GUI, which can allow for the at least one user to select a crop.
[000121] In embodiments, each member device within member devices 102-104 may include a browser application that is configured to receive and to send web pages, and the like. In embodiments, the browser application may be configured to receive and display graphics, text, multimedia, and the like, employing virtually any web based language, including, but not limited to Standard Generalized Markup Language (SGML), such as HyperText Markup Language (HTML), a wireless application protocol (WAP), a Handheld Device Markup Language (HDML), such as Wireless Markup Language (WML), WMLScript, XML, JavaScript, and the like. In embodiments, programming may include either Java, .Net, QT, C, C++ or other suitable programming language.
[000122] In embodiments, member devices 102-104 may be further configured to receive a message from another computing device employing another mechanism, including, but not limited to email, Short Message Service (SMS), Multimedia Message Service (MMS), instant messaging (IM), internet relay chat (IRC), mIRC, Jabber, and the like or a Proprietary protocol.
[000123] In embodiments, network 105 may be configured to couple one computing device to another computing device to enable them to communicate. In some embodiments, network 105 may be enabled to employ any form of computer readable media for communicating information from one electronic device to another. Also, in embodiments, network 105 may include a wireless interface, and/or a wired interface, such as the Internet, in addition to local area networks (LANs), wide area networks (WANs), direct connections, such as through a universal serial bus (USB) port, other forms of computer-readable media, or any combination thereof. In embodiments, on an interconnected set of LANs, including those based on differing architectures and protocols, a router may act as a link between LANs, enabling messages to be sent from one to another.
[000124] Also, in some embodiments, communication links within LANs typically include twisted wire pair or coaxial cable, while communication links between networks may utilize analog telephone lines, full or fractional dedicated digital lines including T1, T2, T3, and T4, Integrated Services Digital Networks (ISDNs), Digital Subscriber Lines (DSLs), wireless links including satellite links, or other communications links known to those skilled in the art. Furthermore, in some embodiments, remote computers and other related electronic devices could be remotely connected to either LANs or WANs via a modem and temporary telephone link. In essence, in some embodiments, network 105 includes any communication method by which information may travel between client devices 102-104, and servers 106 and 107.
[000125] FIG. 2 shows another exemplary embodiment of the computer and network architecture that supports the inventive method and system. The member devices 202a, 202b thru 202n shown each at least includes a computer-readable medium, such as a random access memory (RAM) 208 coupled to a processor 210 or FLASH memory. The processor 210 may execute computer-executable program instructions stored in memory 208. Such processors comprise a microprocessor, an ASIC, and state machines. Such processors comprise, or may be in communication with, media, for example computer-readable media, which stores instructions that, when executed by the processor, cause the processor to perform the steps described herein. Embodiments of computer-readable media may include, but are not limited to, an electronic, optical, magnetic, or other storage or transmission device capable of providing a processor, such as the processor 210 of client 202a, with computer-readable instructions. Other examples of suitable media may include, but are not limited to, a floppy disk, CD-ROM, DVD, magnetic disk, memory chip, ROM, RAM, an ASIC, a configured processor, all optical media, all magnetic tape or other magnetic media, or any other medium from which a computer processor can read instructions. Also, various other forms of computer-readable media may transmit or carry instructions to a computer, including a router, private or public network, or other transmission device or channel, both wired and wireless. The instructions may comprise code from any computer-programming language, including, for example, C, C++, C#, Visual Basic, Java, Python, Perl, and JavaScript.
[000126] Member devices 202a-n may also comprise a number of external or internal devices such as a mouse, a CD-ROM, DVD, a keyboard, a display, or other input or output devices. Examples of client devices 202a-n may be personal computers, digital assistants, personal digital assistants, cellular phones, mobile phones, smart phones, pagers, digital tablets, laptop computers, Internet appliances, and other processor-based devices. In general, a client device 202a may be any type of processor-based platform that is connected to a network 206 and that interacts with one or more application programs. Client devices 202a-n may operate on any operating system capable of supporting a browser or browser-enabled application, such as Microsoft™, Windows™, or Linux. The client devices 202a-n shown may include, for example, personal computers executing a browser application program such as Microsoft Corporation's Internet Explorer™, Apple Computer, Inc.'s Safari™, Mozilla Firefox, and Opera. Through the client devices 202a-n, users 212a-n communicate over the network 206 with each other and with other systems and devices coupled to the network 206. As shown in FIG. 2, server devices 204 and 213 may be also coupled to the network 206.
[000127] In some embodiments, the term "mobile electronic device" may refer to any portable electronic device that may or may not be enabled with location tracking functionality. For example, a mobile electronic device can include, but is not limited to, a mobile phone, Personal Digital Assistant (PDA), Blackberry™, Pager, Smartphone, or any other reasonable mobile electronic device. For ease, at times the above variations are not listed or are only partially listed; this is in no way meant to be a limitation.
[000128] For purposes of the instant description, the terms "cloud," "Internet cloud," "cloud computing," "cloud architecture," and similar terms correspond to at least one of the following: (1) a large number of computers connected through a real-time communication network (e.g., Internet); (2) providing the ability to run a program or application on many connected computers (e.g., physical machines, virtual machines (VMs)) at the same time; (3) network-based services, which appear to be provided by real server hardware, and are in fact served up by virtual hardware (e.g., virtual servers), simulated by software running on one or more real machines (e.g., allowing to be moved around and scaled up (or down) on the fly without affecting the end user). In some embodiments, the instant invention offers/manages the cloud computing/architecture as, but not limited to: infrastructure as a service (IaaS), platform as a service (PaaS), and software as a service (SaaS). Figures 3 and 4 illustrate schematics of exemplary implementations of the cloud computing/architecture.
[000129] Of note, the embodiments described herein may, of course, be implemented using any appropriate computer system hardware and/or computer system software. In this regard, those of ordinary skill in the art are well versed in the type of computer hardware that may be used (e.g., a mainframe, a mini-computer, a personal computer ("PC"), a network (e.g., an intranet and/or the internet)), the type of computer programming techniques that may be used (e.g., object oriented programming), and the type of computer programming languages that may be used (e.g., C++, Basic, AJAX, Javascript). The aforementioned examples are, of course, illustrative and not restrictive.
[000130] In some embodiments, the present invention is a system, including: at least one server and specialized software stored on a non-transient computer readable medium accessible by the at least one server, where, when executing the specialized software, the at least one server becomes at least one specifically programmed server that is configured to: analyse a plurality of genome sequences obtained from a plurality of organisms, where each of the plurality of the organisms has at least one distinctive genetic element, where a number of organisms in the plurality of the organisms correlates with a genetic diversity level of the plurality of the organisms; assemble at least one DNA sequence corresponding to the genome sequences of each of the plurality of the organisms, generate a plurality of contigs based on the at least one DNA sequence assembled for each of the plurality of the organisms, plot digital representations of the plurality of the contigs into at least one population DeBruijn graph, map the plurality of the contigs based on a plurality of overlapping DNA sequence regions, identify a plurality of unique contigs from the plurality of contigs, exclude the plurality of unique contigs to produce a plurality of non-unique contigs; and assemble the plurality of non-unique contigs so as to result in at least one ancestor genome sequence.
[000131] All publications, patents and patent applications mentioned in this specification are herein incorporated in their entirety by reference into the specification, to the same extent as if each individual publication, patent or patent application was specifically and individually indicated to be incorporated herein by reference. In addition, citation or identification of any reference in this application shall not be construed as an admission that such reference is available as prior art to the present invention. To the extent that section headings are used, they should not be construed as necessarily limiting.
[000132] While a number of embodiments of the present invention have been described, it is understood that these embodiments are illustrative only, and not restrictive, and that many modifications may become apparent to those of ordinary skill in the art.

Claims

CLAIMS: What is claimed is:
1. A method, comprising:
obtaining a whole genome of an organism,
fragmenting the whole genome of the organism to produce a plurality of molecules, wherein each molecule of the plurality of molecules is 500 base pairs to 2000 kilobases;
tagging each molecule of the plurality of molecules to produce a plurality of tagged raw reads,
wherein each of the plurality of tagged raw reads comprises at least one unique barcode;
constructing a debruijn graph from the plurality of tagged raw reads,
extrapolating data from the constructed debruijn graph to generate a plurality of contigs,
wherein the plurality of contigs comprises genomic sequence fragments from 32 base pairs to 100,000 base pairs, and
wherein each contig of the plurality of contigs comprises a set of overlapping DNA fragments,
determining a frequency of each contig of the plurality of contigs,
wherein the frequency is the number of times each contig of the plurality of contigs is present in the whole genome of the organism;
identifying unique contigs of the plurality of contigs,
wherein each unique contig appears once in the whole genome of the organism,
wherein the identifying is performed using the Debruijn graph,
identifying a plurality of supplementary contigs,
wherein each supplementary contig appears two or more times in the whole genome of the organism and are long contigs,
wherein the identifying is performed using the Debruijn graph,
wherein each supplementary contig is a long contig,
wherein the long contig is greater than 500 base pairs,
mapping the at least one unique barcode of the plurality of unique barcodes to a corresponding unique contig or a corresponding supplementary contig,
wherein the corresponding unique contig or the corresponding supplementary contig is fragmented into unique contig Kmers of 32 base pairs to 250 base pairs,
wherein at least one origin of the corresponding unique contig or the corresponding supplementary contig is stored,
wherein the unique contig Kmers corresponding with more than one contig are discarded, and
wherein the unique contig Kmers corresponding with one contig are identified as confirmed unique contig Kmers,
wherein each tagged raw read is fragmented to generate reads Kmers having a length of the confirmed unique contig Kmers,
wherein a match is identified when one read Kmer is matched to a unique contig Kmer,
organizing each contig of the plurality of contigs according to the plurality of unique barcodes,
constructing a weighted graph comprising a plurality of nodes,
wherein each node of the plurality of nodes corresponds to a selected contig,
identifying a connection weight between each two nodes as the number of shared barcodes between the nodes,
filtering the connection weight in the graph with low weight,
wherein the low weight of E_ij is less than 3,
organizing a first plurality of contigs into a first plurality of groups of assembled molecules,
wherein each group of the first plurality of groups of assembled molecules is a connected component in the graph,
organizing a second plurality of contigs into a second plurality of groups of assembled molecules,
wherein each group of the second plurality of groups of assembled molecules overlaps with at least one portion of a contig,
predicting a continuous sequence of the at least one portion of the contig by constructing a distance matrix between at least two contigs,
assigning each of the overlapping groups of the second plurality of groups of assembled molecules an Overlapping Start (OS) and an Overlapping End (OE),
wherein an OS is the minimal contig order, and
wherein an OE is the maximal contig order, and
assembling a DNA sequence order, wherein the DNA sequence order is organized from a maximum OS to a minimum OE.
2. The method of claim 1, wherein the mapping step comprises: identifying a set of overlapping barcodes.
3. The method of claim 1, wherein the set of the remaining contigs comprises unique contigs or supplementary contigs.
4. The method of claim 1, wherein the connection weight is measured using a contigs pair, E_ij.
5. The method of claim 5, wherein the contigs pair is a number of common barcodes within the set of the remaining contigs,
wherein the number of common barcodes is at least one barcode at least in duplicate.
6. The method of claim 1,
wherein the distance matrix between two contigs is measured by:
max_{k,l over all the contigs in the group}(E_kl) - E_ij.
7. The method of claim 1,
wherein the distance matrix is reordered to indicate that adjacent contigs are separated by a corresponding distance.
PCT/IB2017/001547 2016-02-10 2017-02-10 Systems and methods for computational demultiplexing of genomic barcoded sequences WO2018037289A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201662293470P 2016-02-10 2016-02-10
US62/293,470 2016-02-10

Publications (2)

Publication Number Publication Date
WO2018037289A2 true WO2018037289A2 (en) 2018-03-01
WO2018037289A3 WO2018037289A3 (en) 2018-06-07

Family

ID=61246488

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2017/001547 WO2018037289A2 (en) 2016-02-10 2017-02-10 Systems and methods for computational demultiplexing of genomic barcoded sequences

Country Status (1)

Country Link
WO (1) WO2018037289A2 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115331733A (en) * 2022-10-14 2022-11-11 青岛百创智能制造技术有限公司 Method and device for analyzing sequencing data of space transcriptome chip
WO2023288018A3 (en) * 2021-07-14 2023-04-20 Ultima Genomics, Inc. Barcode selection

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU2014212152B2 (en) * 2013-02-01 2020-02-06 The Regents Of The University Of California Methods for genome assembly and haplotype phasing
EP3058096A1 (en) * 2013-10-18 2016-08-24 Good Start Genetics, Inc. Methods for assessing a genomic region of a subject
US20150286775A1 (en) * 2013-12-18 2015-10-08 Pacific Biosciences Of California, Inc. String graph assembly for polyploid genomes
KR20170023979A (en) * 2014-06-26 2017-03-06 10엑스 제노믹스, 인크. Processes and systems for nucleic acid sequence assembly
US20160106005A1 (en) * 2014-10-13 2016-04-14 Ntherma Corporation Carbon nanotubes as a thermal interface material
JP2018513508A (en) * 2015-03-16 2018-05-24 パーソナル ジノーム ダイアグノスティクス, インコーポレイテッド Systems and methods for analyzing nucleic acids

Also Published As

Publication number Publication date
WO2018037289A3 (en) 2018-06-07

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17843011

Country of ref document: EP

Kind code of ref document: A2

122 Ep: pct application non-entry in european phase

Ref document number: 17843011

Country of ref document: EP

Kind code of ref document: A2