WO2018037289A2 - Systems and methods for computational demultiplexing of genomic barcoded sequences

Info

Publication number: WO2018037289A2
Application number: PCT/IB2017/001547
Authority: WO
Inventors: Gil BEN-ZVI; Omer BARAD
Original assignee: Energin.R Technologies 2009 Ltd.
Priority date: 2016-02-10
Filing date: 2017-02-10
Publication date: 2018-03-01
Also published as: WO2018037289A3

Abstract

In some embodiments, the present invention is a method which includes: obtaining a whole genome of an organism, fragmenting the whole genome of the organism to produce a plurality of raw reads, tagging each raw read of the plurality of raw reads to produce a plurality of tagged raw reads, constructing a debruijn graph from the plurality of tagged raw reads, extrapolating data from the constructed debruijn graph to generate a plurality of contigs, mapping the at least one unique barcode of the plurality of unique barcodes to a corresponding unique contig or a corresponding supplementary contig, organizing each contig of the plurality of contigs according to the plurality of unique barcodes, identifying a connection weight using the Debruijn graph, filtering the connection weight in the graph with low weight, and assembling a DNA sequence order, wherein the DNA sequence order is organized from a maximum OS to a minimum OE.

Description

SYSTEMS AND METHODS FOR COMPUTATIONAL DEMULTIPLEXING OF GENOMIC BARCODED SEQUENCES

RELATED APPLICATIONS

[0001] This application claims the priority of U.S. Patent Appln. No. 62/293,470; filed February 10, 2016; entitled "SYSTEMS AND METHODS FOR COMPUTATIONAL DEMULTIPLEXING OF GENOMIC BARCODED SEQUENCES," which is incorporated herein by reference in its entirety for all purposes.

FIELD OF THE INVENTION

[0002] The field of invention relates to methods of sequencing. In particular, the present invention provides methods for computational demultiplexing of genomic barcoded sequences.

BACKGROUND

[0003] Molecular biology protocols enable tagging short sequencing reads originating from several long genomic DNA molecules with unique barcodes. These methods enable the extraction of long-range information from standard short-read sequencing of the tagged DNA. This information is useful for the following bioinformatics applications: (1) phasing of heterozygous polymorphisms, (2) structural variation detection, and (3) de novo assembly analysis.

[0004] All of the above analyses depend on the ability to de-multiplex the reads of each barcode, which are pooled from several molecules, into their distinct molecules of origin. Typically, this bioinformatics analysis is accomplished by mapping to an external reference genome, and it is therefore not suitable for demultiplexing haplotype sequences that do not map to an external reference genome.

BRIEF DESCRIPTION OF THE FIGURES

[0005] The present invention will be further explained with reference to the attached drawings, wherein like structures are referred to by like numerals throughout the several views. The drawings shown are not necessarily to scale, with emphasis instead generally being placed upon illustrating the principles of the present invention. Further, some features may be exaggerated to show details of particular components.

[0006] Figures 1-4 are embodiments of the system used in the present invention.

[0007] Figures 5A and 5B show embodiments of the connected components as used in the methods of the present invention.

[0008] Figure 6 shows an exemplary embodiment of the relationship between contigs and barcodes as used in methods of the present invention.

[0009] The figures constitute a part of this specification and include illustrative embodiments of the present invention and illustrate various objects and features thereof. Further, the figures are not necessarily to scale; some features may be exaggerated to show details of particular components. In addition, any measurements, specifications and the like shown in the figures are intended to be illustrative, and not restrictive. Therefore, specific structural and functional details disclosed herein are not to be interpreted as limiting, but merely as a representative basis for teaching one skilled in the art to variously employ the present invention.

SUMMARY OF THE INVENTION

[00010] In some embodiments, the present invention is a method which includes: obtaining a whole genome of an organism,

fragmenting the whole genome of the organism to produce a plurality of molecules, wherein each molecule of the plurality of molecules is 500 base pairs to 2000 kilobases;

tagging each molecule of the plurality of molecules to produce a plurality of tagged raw reads,

wherein each of the plurality of tagged raw reads comprises at least one unique barcode;

constructing a debruijn graph from the plurality of tagged raw reads,

extrapolating data from the constructed debruijn graph to generate a plurality of contigs,

wherein the plurality of contigs comprises genomic sequence fragments from 32 base pairs to 100,000 base pairs, and

wherein each contig of the plurality of contigs comprises a set of overlapping DNA fragments,

determining a frequency of each contig of the plurality of contigs,

wherein the frequency is the number of times each contig of the plurality of contigs is present in the whole genome of the organism;

identifying unique contigs of the plurality of contigs,

wherein each unique contig appears once in the whole genome of the organism, wherein the identifying is performed using the Debruijn graph,

identifying a plurality of supplementary contigs,

wherein each supplementary contig appears two or more times in the whole genome of the organism and are long contigs,

wherein the identifying is performed using the Debruijn graph, wherein each supplementary contig is a long contig,

wherein the long contig is greater than 500 base pairs,

mapping the at least one unique barcode of the plurality of unique barcodes to a corresponding unique contig or a corresponding supplementary contig,

wherein the corresponding unique contig or the corresponding supplementary contig is fragmented into unique contig Kmers of 32 base pairs to 250 base pairs, wherein at least one origin of the corresponding unique contig or the corresponding supplementary contig is stored,

wherein the unique contig Kmers corresponding with more than one contig are discarded, and

wherein the unique contig Kmers corresponding with one contig are identified as confirmed unique contig Kmers,

wherein each tagged raw read is fragmented to generate reads Kmers having a length of the confirmed unique contig Kmers,

wherein a match is identified when one read Kmer is matched to a unique contig Kmer,

organizing each contig of the plurality of contigs according to the plurality of unique barcodes,

constructing a weighted graph comprising a plurality of nodes,

wherein each node of the plurality of nodes corresponds to a selected contig,

identifying a connection weight between each two nodes as the number of shared barcodes between the nodes,

filtering the connection weight in the graph with low weight,

wherein the low weight of $E_{ij}$ is less than 3,

organizing a first plurality of contigs into a first plurality of groups of assembled molecules,

wherein each group of the first plurality of groups of assembled molecules is a connected component in the graph,

organizing a second plurality of contigs into a second plurality of groups of assembled molecules,

wherein each group of the second plurality of groups of assembled molecules overlaps with at least one portion of a contig,

predicting a continuous sequence of the at least one portion of the contig by constructing a distance matrix between at least two contigs,

assigning each of the overlapping groups of the second plurality of groups of assembled molecules an Overlapping Start (OS) and an Overlapping End (OE),

wherein an OS is the minimal contig order, and

wherein an OE is the maximal contig order, and

assembling a DNA sequence order, wherein the DNA sequence order is organized from a maximum OS to a minimum OE.
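By way of a non-limiting illustration, the Kmer-based mapping of barcodes to contigs recited above may be sketched as follows. The function names, the toy sequences, and the toy Kmer length of 4 are assumptions made only for readability (the method recites Kmers of 32 to 250 base pairs), and reverse-complement matching is omitted for brevity.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return every length-k substring of seq, in order of position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def confirmed_unique_kmers(contigs, k):
    """Index contig Kmers by origin; keep only Kmers seen in exactly one contig.

    contigs: dict mapping a (hypothetical) contig id to its sequence.
    Kmers that correspond to more than one contig are discarded.
    """
    origins = defaultdict(set)
    for contig_id, seq in contigs.items():
        for kmer in kmers(seq, k):
            origins[kmer].add(contig_id)
    return {kmer: next(iter(ids)) for kmer, ids in origins.items() if len(ids) == 1}


def map_barcodes_to_contigs(tagged_reads, index, k):
    """Record a barcode/contig match whenever a read Kmer equals a confirmed Kmer.

    tagged_reads: iterable of (barcode, read sequence) pairs.
    """
    hits = defaultdict(set)
    for barcode, read in tagged_reads:
        for kmer in kmers(read, k):
            contig_id = index.get(kmer)
            if contig_id is not None:
                hits[barcode].add(contig_id)
    return dict(hits)


# Toy data (assumed): the two contigs share the Kmers GTTG and TTGC.
contigs = {"c1": "ACGTTGCA", "c2": "GGGTTGCC"}
index = confirmed_unique_kmers(contigs, k=4)
reads = [("bc1", "ACGTT"), ("bc1", "GGGTT"), ("bc2", "GTTGC")]
barcode_to_contigs = map_barcodes_to_contigs(reads, index, k=4)
```

In this toy input, the read tagged `bc2` contains only Kmers shared by both contigs, so it yields no match, whereas `bc1` is matched to both `c1` and `c2`.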

[00011] In some embodiments, the mapping step comprises: identifying a set of overlapping barcodes.

[00012] In some embodiments, the set of the remaining contigs comprises unique contigs or supplementary contigs.

[00013] In some embodiments, the connection weight is measured for a contigs pair, $E_{ij}$.

[00014] In some embodiments, the contigs pair is a number of common barcodes within the set of the remaining contigs, wherein the number of common barcodes is at least one barcode at least in duplicate.

[00015] In some embodiments, the distance matrix between two contigs is measured by: $\max_{k,l}(E_{kl}) - E_{ij}$, where $k$ and $l$ range over all the contigs in the group.

[00016] In some embodiments, the distance matrix is reordered to indicate that adjacent contigs are separated by a corresponding distance.
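By way of a non-limiting illustration, the distance between contigs i and j (the maximum connection weight in the group minus the connection weight of the pair) and one possible reordering may be sketched as follows. The reordering procedure is not specified in the text; the greedy nearest-neighbour chain below, the choice of starting contig, and the toy weights are assumptions made for illustration only.

```python
def distance_matrix(weights):
    """Convert shared-barcode counts E[i][j] into distances max(E[k][l]) - E[i][j].

    weights: symmetric list of lists; the diagonal is ignored.
    """
    n = len(weights)
    e_max = max(
        (weights[k][l] for k in range(n) for l in range(n) if k != l),
        default=0,
    )
    return [
        [0 if i == j else e_max - weights[i][j] for j in range(n)]
        for i in range(n)
    ]


def greedy_order(dist, start=0):
    """Assumed heuristic: repeatedly append the closest contig not yet placed."""
    order = [start]
    remaining = set(range(len(dist))) - {start}
    while remaining:
        last = order[-1]
        # Ties are broken by the smaller index so the result is deterministic.
        nxt = min(remaining, key=lambda j: (dist[last][j], j))
        order.append(nxt)
        remaining.remove(nxt)
    return order


# Toy shared-barcode counts for four contigs (assumed values).
E = [
    [0, 4, 9, 1],
    [4, 0, 8, 7],
    [9, 8, 0, 3],
    [1, 7, 3, 0],
]
D = distance_matrix(E)
order = greedy_order(D, start=0)
```

With these toy weights the pairs sharing the most barcodes receive the smallest distances, and the resulting order places contigs 0, 2, 1 and 3 next to each other in that sequence.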

DETAILED DESCRIPTION

[00017] Among those benefits and improvements that have been disclosed, other objects and advantages of this invention will become apparent from the following description taken in conjunction with the accompanying figures. Detailed embodiments of the present invention are disclosed herein; however, it is to be understood that the disclosed embodiments are merely illustrative of the invention that may be embodied in various forms. In addition, each of the examples given in connection with the various embodiments of the invention is intended to be illustrative, and not restrictive.

[00018] Throughout the specification and claims, the following terms take the meanings explicitly associated herein, unless the context clearly dictates otherwise. The phrases "in one embodiment" and "in some embodiments" as used herein do not necessarily refer to the same embodiment(s), though it may. Furthermore, the phrases "in another embodiment" and "in some other embodiments" as used herein do not necessarily refer to a different embodiment, although it may. Thus, as described below, various embodiments of the invention may be readily combined, without departing from the scope or spirit of the invention.

[00019] This disclosure provides methods and systems for processing polynucleotides. Applications include processing polynucleotides for polynucleotide sequencing. Polynucleotides sequencing includes the sequencing of whole genomes, detection of specific sequences such as single nucleotide polymorphisms (SNPs) and other mutations, detection of nucleic acid (e.g., deoxyribonucleic acid) insertions, and detection of nucleic acid deletions.

[00020] Utilization of the methods and systems described herein may incorporate, unless otherwise indicated, conventional techniques of organic chemistry, polymer technology, microfluidics, molecular biology and recombinant techniques, cell biology, biochemistry, and immunology. Such conventional techniques include microwell construction, microfluidic device construction, polymer chemistry, restriction digestion, ligation, cloning, polynucleotide sequencing, and polynucleotide sequence assembly. Specific, non-limiting, illustrations of suitable techniques are described throughout this disclosure. However, equivalent procedures may also be utilized. Descriptions of certain techniques may be found in standard laboratory manuals, such as Genome Analysis: A Laboratory Manual Series (Vols. I-IV), Using Antibodies: A Laboratory Manual, Cells: A Laboratory Manual, PCR Primer: A Laboratory Manual, and Molecular Cloning: A Laboratory Manual (all from Cold Spring Harbor Laboratory Press), and "Oligonucleotide Synthesis: A Practical Approach" 1984, IRL Press London, all of which are herein incorporated in their entirety by reference for all purposes.

[00021] In some embodiments, the present invention is an analysis method that enables the de-multiplexing of the tagged reads information into their distinct origin molecules solely based on the tagged reads information.

[00022] In addition, as used herein, the term "or" is an inclusive "or" operator, and is equivalent to the term "and/or," unless the context clearly dictates otherwise. The term "based on" is not exclusive and allows for being based on additional factors not described, unless the context clearly dictates otherwise. In addition, throughout the specification, the meaning of "a," "an," and "the" include plural references. The meaning of "in" includes "in" and "on."

[00023] As used herein, "align", "alignment" or "sequence alignment" refers to a method of arranging sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Aligned sequences of nucleotide, for example, are generally represented as rows within a matrix.

[00024] As used herein, an "allele" refers to one of a number of alternative forms of the same gene or same genetic locus.

[00025] As used herein, "assembly" or "sequence assembly" refers to aligning and merging at least two fragments of a much longer DNA sequence to reconstruct the original sequence. In some embodiments, fragments can range in size from 20 to 30,000 bases.

[00026] As used herein, "base pair(s)" or "bp" refers to a unit consisting of two nucleobases bound to each other by hydrogen bonds. Base pairs form the building blocks of the DNA double helix, and contribute to the folded structure of both DNA and RNA. The base pairs are paired: adenine-thymine or guanine-cytosine, and allow the DNA helix to maintain a regular helical structure.

[00027] As used herein, "barcoding" or "DNA barcoding" refers to a method that enables the partition of genomic DNA into a large number of sets (100s-100,000s) such that each set contains several distinct long (e.g., in the range of 10kb-500kb) genomic DNA molecules. Later on, in each set, the long genomic DNA molecules are broken into smaller fragments and tagged with a unique label, e.g., a barcode. A barcode may be a polynucleotide sequence attached to all fragments of a target polynucleotide contained within a particular partition. Finally, the tagged DNA from all sets is pooled to generate a single library for NGS sequencing (e.g., Illumina). Non-limiting examples of such methods are GemCode or Chromium by 10x Genomics ("Haplotyping germline and cancer genomes with high-throughput Linked-Read sequencing," Zheng et al., Nature Biotechnology 2016) and Moleculo by Illumina ("Whole-genome haplotyping using long reads and statistical methods," Kuleshov et al., Nature Biotechnology 2014). Further non-limiting examples of such methods are disclosed in US 9,401,201, which is hereby incorporated by reference in its entirety. In some embodiments, the presence of the same barcode on multiple sequences may provide information about the origin of the sequence; e.g., a barcode may indicate that the sequence came from a particular partition and/or a proximal region of a genome.

[00028] As used herein, "confidence level" refers to a measure of the reliability of at least one estimate. Confidence levels include a range of values (intervals) that can be construed as estimates of an unknown population parameter. The level of confidence of the confidence interval indicates the probability that the confidence range can capture the actual population parameter given a distribution of samples. A confidence level can be represented as a percentage.

[00029] As used herein, a "connected component" is a group of contigs. Each of the contigs is a linear chain. The contigs are linked together by leveraging information of the reads (e.g., an original read, e.g., but not limited to, a molecule or a contig) and obtaining a set of scaffolds which constitute the final result of a de novo genome assembly.

[00030] As used herein, "consensus" or "consensus sequence" refers to a calculated order of most frequent residues, either nucleotide or amino acid, found at each position in a sequence alignment. In some embodiments, the consensus sequence represents the results of multiple sequence alignments in which related sequences are compared to each other and similar sequence motifs are calculated.

[00031] As used herein, a "contig" refers to a set of overlapping DNA segments that together represent a consensus region of genomic DNA. In some embodiments, the inventive system(s) of the present invention are configured to analyze/sort contigs having at least one haplotype, where the at least one haplotype comprises at least one marker that may be used for genetic analysis (e.g., by identifying an allele). In some embodiments, a contig can be a haplotype contig or a non-haplotype contig.

[00032] As used herein, a "continuous sequence" is a sequence resulting from the reassembly of small DNA fragments generated.

[00033] As used herein, a "corresponding distance" is a calculated distance between at least two contigs or fragments which illustrates the degree of sequence similarity.

[00034] As used herein, a "DeBruijn graph" refers to a computational system that assembles a contiguous genome from a large population (e.g., but not limited to, 1 million mer, 10 million mer, 100 million mer, 1 billion mer, 10 billion mer, etc.) of short sequencing reads.
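By way of a non-limiting illustration, one common formulation of a De Bruijn graph may be sketched as follows. The formulation assumed here, in which nodes are (k-1)-mers and each k-mer observed in a read contributes an edge from its prefix to its suffix, as well as the function name and the toy reads, are illustrative assumptions and not necessarily the construction used in any particular embodiment.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Build a toy De Bruijn graph: node -> {successor node: edge multiplicity}.

    Each k-mer in a read adds one edge from its first k-1 bases to its
    last k-1 bases; repeated k-mers increase the edge count.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]][kmer[1:]] += 1
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {node: dict(successors) for node, successors in graph.items()}


# Two overlapping toy reads (assumed data) and a toy k of 3.
graph = build_de_bruijn(["ACGTA", "CGTAC"], k=3)
```

Walking unambiguous paths in such a graph (nodes with a single successor) is one conventional way of extracting contigs; edge multiplicities give a rough indication of how often a k-mer was observed.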

[00035] As used herein, "demultiplexing" refers to, after sequencing, reads being assigned in silico to their long DNA molecule of origin. In contrast, as used herein, "multiplexing" refers to using short DNA indices to uniquely identify (or more correctly semi-uniquely by assigning the same index to several DNA molecules) each DNA sample. In some embodiments, demultiplexing enables barcoding different pieces of DNA in the same sample in order to generate long range information. Residual demultiplexing is needed, since such methods assign the same barcode to several long DNA molecules from different genomic loci.

[00036] As used herein, a "DNA read" or "read" refers to overlapping fragments of DNA obtained by using, e.g., but not limited to, shotgun sequencing. When using a chromatogram to sequence DNA, the read is the sequence of letters at the top of each row. The reads are used to reconstruct an original sequence.

[00037] As used herein, "error-filtering" refers to a system configured to selectively reduce errors and facilitate variant detection in data from sequencing technologies.

[00038] As used herein, "error-free phase sequence" refers to a sequence after error-filtering. As used herein, "phase" means that adjacent linked alleles are phased into a single sequence.

[00039] As used herein, a "fragment" refers to a physical segment of DNA. In some embodiments, the fragment can be an overlapping physical segment of DNA.

[00040] As used herein, "genetic diversity" refers to the level of biodiversity, i.e., the total number of genetic characteristics in the genetic makeup of a species.

[00041] As used herein, "haplotype" refers to a set of DNA variations, or polymorphisms, that tend to be inherited together. A haplotype can refer to a combination of alleles or to a set of single nucleotide polymorphisms (SNPs) found on the same chromosome.

[00042] As used herein, a "k-mer" or "kmer" refers to all the possible subsequences (of length "k") from a read obtained through DNA sequencing.
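By way of a non-limiting illustration, the enumeration of all k-mers of a read may be sketched as follows; the function name and the toy read are assumptions made for illustration only.

```python
def kmers(read: str, k: int) -> list[str]:
    """Return every subsequence of length k from a read, in order of position."""
    if k <= 0:
        raise ValueError("k must be positive")
    # A read of length n yields n - k + 1 k-mers, or none if the read is shorter than k.
    return [read[i:i + k] for i in range(len(read) - k + 1)]


example = kmers("ACGTAC", 4)
```

For the toy read of six bases and k of 4, three k-mers are obtained: ACGT, CGTA and GTAC.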

[00043] As used herein, an "input read length" refers to an initial starting sequence used to build at least one contig and includes at least a portion of one kmer.

[00044] As used herein, a "marker" or a "genetic marker", refers to a gene or short sequence of DNA used to identify a chromosome or to locate other genes on a genetic map.

[00045] As used herein, a "minimal contig order" is a position in which a DNA fragment or portion begins. As used herein, a "maximum contig order" is a position in which a DNA fragment or portion ends.
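By way of a non-limiting illustration, the use of the minimal and maximal contig orders as an Overlapping Start (OS) and an Overlapping End (OE) for each barcode, and the resulting range from the maximum OS to the minimum OE, may be sketched as follows. The data layout, the function name, and the interpretation of that range as the span covered by every barcode are assumptions made for illustration only.

```python
def overlap_range(contig_order, barcode_contigs):
    """Compute (OS, OE) per barcode and the range max(OS) -> min(OE).

    contig_order: contig id -> position in the predicted order of one group.
    barcode_contigs: barcode -> set of contig ids observed with that barcode.
    Contigs outside the group (absent from contig_order) are ignored.
    """
    spans = {}
    for barcode, contigs in barcode_contigs.items():
        positions = [contig_order[c] for c in contigs if c in contig_order]
        if positions:
            spans[barcode] = (min(positions), max(positions))  # (OS, OE)
    if not spans:
        return spans, None
    start = max(os_ for os_, _ in spans.values())
    end = min(oe for _, oe in spans.values())
    # If start > end, the barcodes share no common span in this toy model.
    return spans, (start, end)


# Toy order of five contigs and three barcodes (assumed data).
order = {"c1": 0, "c2": 1, "c3": 2, "c4": 3, "c5": 4}
barcodes = {
    "bcA": {"c1", "c2", "c4"},
    "bcB": {"c2", "c3", "c5"},
    "bcC": {"c2", "c4", "x9"},
}
spans, common = overlap_range(order, barcodes)
```

In this toy input the three barcodes span positions 0 to 3, 1 to 4 and 1 to 3 respectively, so the range from the maximum OS to the minimum OE is positions 1 to 3.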

[00046] As used herein, "mer" refers to an oligonucleotide, where when applied to DNA, mer refers to the number of bases in the molecules (e.g., 10-mer, 100-mer, etc.).

[00047] As used herein, "overlapping," refers to polynucleotide fragments, generally referring to a collection of polynucleotide fragments with overlapping sequence. A genome may be fragmented randomly (e.g., but not limited to, by shearing in a pipette) or non-randomly (e.g., but not limited to, by digesting with a rare cutter). Fragmenting randomly produces overlapping sequences because each copy of the genome is cut at different positions. After sequencing of the fragments (which provides "sequence contigs"), this overlap may be used to determine the linear order of the fragments, thereby enabling assembly of the entire genomic sequence. In some embodiments, fragmentation may be performed, e.g., but not limited to, by enzymatic digestion, exposure to ultraviolet (UV) light, ultrasonication, and/or mechanical agitation.

[00048] As used herein, "paired-end" or "PE" refers to DNA fragments sequenced from both ends (e.g., 5' end and 3' end) to generate pairs of reads. In some embodiments, a PE library is a population of PE DNA fragments of varying sizes and sequences.

[00049] As used herein, a "path" or "consensus path" refers to information provided in a DeBruijn graph that identifies the consensus of a genomic sequence with all sub-repeats in the genomic sequence substituted by the respective consensus sequences.

[00050] As used herein, a "polymorphism" refers to a difference between two different sequences.

[00051] The terms "polynucleotide" or "nucleic acid," as used herein, are used herein to refer to biological molecules comprising a plurality of nucleotides. Exemplary polynucleotides include deoxyribonucleic acids, ribonucleic acids, and synthetic analogues thereof, including peptide nucleic acids. In some embodiments, polynucleotides can be prepared using the methods disclosed in US 9,410,201.

[00052] As used herein, a "raw read" refers to the sequencing result produced from an automatic sequencing machine. The raw reads are short DNA sequences and are mixed together, not in genomic order. Inevitably, raw sequence also contains a few gaps, mistakes, and ambiguities.

[00053] As used herein, a "reference genome" or "reference assembly" is a digital nucleic acid sequence database, assembled by scientists as a representative example of a species' set of genes. Reference genomes are typically assembled from the sequencing of DNA from a number of donors, and generally do not accurately represent the set of genes of any single organism. Instead, the reference genome provides a haploid mosaic of different DNA sequences from each donor. In plants (e.g., maize, soybean, rice, etc.) the reference genome is typically assembled from a single variety.

[00054] As used herein, a "scaffold" refers to a technique which links together a non-continuous series of genomic sequences into a scaffold, consisting of sequences separated by gaps of known length. The sequences that are linked are typically contiguous sequences corresponding to read overlaps.

[00055] As used herein, "single nucleotide polymorphism" or "SNP" is a DNA sequence variation occurring within a population (e.g. 1%) in which a single nucleotide, e.g. Adenine ("A"), Thymine ("T"), Cytosine ("C") or Guanine ("G"), in the genome (or other shared sequence) differs between members of a biological species or paired chromosomes. The DNA sequence variation can occur only once. The DNA sequence variation can occur two or more times.

[00056] As used herein, a "structural variation" refers to a region of DNA which is approximately 1 kilobase (kb) or larger in size and can include inversions, balanced translocations, and/or genomic imbalances (e.g., DNA insertions and/or DNA deletions), and is typically referred to as copy number variants (CNVs).

[00057] As used herein, "unique" or "uniqueness" refers to a contig and is related to the variability in the contig's adjacent sequences. A unique contig means that a prediction is made that this contig is embedded in a single long sequence and therefore is predicted to appear only in one locus in the genome. If a sequence is only generated once, then the sequence has high uniqueness. In some embodiments, a sequence having the high uniqueness appears/is present only once in a genome. Alternatively, if a sequence has multiple copies generated, then the sequence has low uniqueness. In contrast, as used herein, a "supplementary"

[00058] As used herein, "tag" or "tagging" refers to combining at least one barcode with a DNA fragment.

[00059] As used herein, a "whole genome" refers to the entirety of a genome. In some embodiments, the whole genome can be, e.g., mammalian, plant, bacterial, protozoan, etc.

Methods for computational demultiplexing of genomic barcoded sequences

[00060] In some embodiments, the present invention is a method which includes:

obtaining a whole genome of an organism,

constructing a debruijn graph from the plurality of tagged raw reads,

determining a frequency of each contig of the plurality of contigs,

identifying unique contigs of the plurality of contigs,

identifying a plurality of supplementary contigs, wherein each supplementary contig appears two or more times in the whole genome of the organism and are long contigs,

wherein the identifying is performed using the Debruijn graph,

wherein each supplementary contig is a long contig,

constructing a weighted graph comprising a plurality of nodes,

wherein each node of the plurality of nodes corresponds to a selected contig,

identifying a connection weight between each two nodes as the number of shared barcodes between the nodes,

filtering the connection weight in the graph with low weight,

wherein the low weight of $E_{ij}$ is less than 3,

organizing a first plurality of contigs into a first plurality of groups of assembled molecules,

wherein each group of the first plurality of groups of

wherein an OS is the minimal contig order, and

wherein an OE is the maximal contig order, and

[00061] In some embodiments, the mapping step comprises: identifying a set of overlapping barcodes.

[00062] In some embodiments, the set of the remaining contigs comprises unique contigs or supplementary contigs.

[00063] In some embodiments, the connection weight is measured for a contigs pair, $E_{ij}$.

[00064] In some embodiments, the contigs pair is a number of common barcodes within the set of the remaining contigs, wherein the number of common barcodes is at least one barcode at least in duplicate.

[00065] In some embodiments, the distance matrix between two contigs is measured by: $\max_{k,l}(E_{kl}) - E_{ij}$, where $k$ and $l$ range over all the contigs in the group.

[00066] In some embodiments, the distance matrix is reordered to indicate that adjacent contigs are separated by a corresponding distance.

[00067] In some embodiments, the method of the present invention includes: generating a plurality of tagged raw reads containing unique bar codes for a set of long genomic DNA molecules obtained from a whole genome of an organism, where a group of tagged raw reads of the plurality of raw reads originated from a long genomic DNA molecule is tagged with a barcode selected from a plurality of unique barcodes,

constructing a debruijn graph from the plurality of tagged raw reads,

analyzing the debruijn graph and generating a plurality of contigs, where the plurality of contigs are genomic sequence fragments ranging from 32 base pairs to 100,000 base pairs,

where each individual contig within the plurality of contigs is a set of overlapping DNA segments that together represent a consensus region of genomic DNA,

analyzing the plurality of contigs to determine a number of times each individual contig of the plurality of contigs appears in the whole genome of the organism,

identifying unique contigs using the Debruijn graph, where the unique contigs of the plurality of contigs appear once in the whole genome of the organism,

identifying supplementary contigs using the Debruijn graph, where the supplementary contigs of the plurality of contigs appear a few times in the whole genome of the organism and are long contigs (i.e., greater than 500bp),

mapping the barcodes to the unique and supplementary contigs,

where the contigs are broken into unique contig Kmers of a length of between 32bp and 250bp and the contig of origin of each Kmer is stored,

where the unique contig Kmers which align with more than one contig are discarded, and

where the unique contig Kmers which align with only one contig are retained as confirmed unique contig Kmers,

where the raw reads from a barcode of the plurality of barcodes are broken into reads Kmers at the length of the unique contig Kmers,

where a match is identified between a barcode and a contig when a read Kmer is matched to a unique contig Kmer,

splitting the contigs within each barcode based on their long molecule of origin by the following procedure:

identifying a set of additional barcodes that overlap with the subset of the plurality of the remaining unique contigs which include the barcode,

constructing a graph comprising nodes,

where the nodes correspond to the subset of the plurality of the remaining contigs (only long unique contigs or supplementary) which mapped to the barcode,

identifying connection weights in the graph, where a connection weight between a contigs pair, $E_{ij}$, is defined as the number of shared barcodes within the additional barcodes set,

filtering the connection weight in the graph with low weight (i.e., $E_{ij} < 3$),

dividing the barcode contigs into groups of predicted common long molecule of origin, where each group is a connected component in the graph,

dividing the additional barcodes set into groups, where each of the groups overlaps with a group of the contigs,

predicting a contiguous sequence of the contig within each group of predicted common long molecule of origin by building a distance matrix between the contigs, where the distance between two contigs is measured by:

$\max_{k,l}(E_{kl}) - E_{ij}$, where $k$ and $l$ range over all the contigs in the group, and

where the distance matrix is reordered to indicate that adjacent contigs have a short distance,

assigning each of the additional barcodes that overlap the group of contigs with predicted common long molecule of origin an Overlapping Start (OS), which is the minimal contig order that appears in the barcode, and an Overlapping End (OE), which is the maximal contig order that appears in the barcode, and

identifying all barcode contigs (long and short) that share the predicted common long molecule of origin as contigs that also appear in the corresponding barcode group from the additional barcode set, and predicting a DNA sequence order in the DNA molecule in the range of $\max(OS_{barcode}) \rightarrow \min(OE_{barcode})$.
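By way of a non-limiting illustration, the construction of the weighted graph, the filtering of low-weight connections (weight less than 3), and the division of contigs into connected components may be sketched as follows. The data layout (a mapping from each contig to the set of additional barcodes in which it appears), the function names, and the toy data are assumptions made for illustration only.

```python
from collections import defaultdict
from itertools import combinations


def connection_weights(contig_barcodes):
    """Weight of each contig pair = number of barcodes shared by the two contigs."""
    weights = {}
    for a, b in combinations(sorted(contig_barcodes), 2):
        shared = len(contig_barcodes[a] & contig_barcodes[b])
        if shared:
            weights[(a, b)] = shared
    return weights


def connected_components(nodes, weights, min_weight=3):
    """Discard connections with weight below min_weight, then group the nodes.

    Each returned group is a connected component, i.e. a set of contigs
    predicted (in this sketch) to share a long molecule of origin.
    """
    adjacency = defaultdict(set)
    for (a, b), weight in weights.items():
        if weight >= min_weight:
            adjacency[a].add(b)
            adjacency[b].add(a)
    seen, groups = set(), []
    for node in sorted(nodes):
        if node in seen:
            continue
        stack, group = [node], set()
        while stack:  # iterative depth-first traversal
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            group.add(current)
            stack.extend(adjacency[current] - seen)
        groups.append(group)
    return groups


# Toy data (assumed): contigs mapped to one barcode, with their additional barcodes.
contig_barcodes = {
    "c1": {"b1", "b2", "b3", "b4"},
    "c2": {"b1", "b2", "b3"},
    "c3": {"b3", "b5", "b6", "b7"},
    "c4": {"b5", "b6", "b7", "b8"},
}
weights = connection_weights(contig_barcodes)
groups = connected_components(contig_barcodes, weights, min_weight=3)
```

In this toy input, `c1` and `c2` share three barcodes, as do `c3` and `c4`, while the cross pairs share at most one; after filtering, two groups remain, each a candidate long molecule of origin.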

[00068] In some embodiments, molecular biology protocols can divide an organism's entire genomic DNA into large sets of genomic DNA molecules (e.g., but not limited to, 10s-100,000s of genomic DNA molecules (e.g., but not limited to, 10 genomic DNA molecules, 100 genomic DNA molecules, 1,000 genomic DNA molecules, 10,000 genomic DNA molecules, 100,000 genomic DNA molecules, 500,000 genomic DNA molecules, etc.); also referred to herein as "molecules"), where each set contains dozens of distinct long genomic DNA molecules (e.g., but not limited to, in a range of 10kb-2000kb (2Mb)). Next, the long genomic DNA molecules of each set are broken into smaller DNA fragments and each DNA fragment is tagged with a unique barcode. Finally, the tagged DNA fragments from all of the sets are pooled to generate a single library for next generation sequencing.

[00069] In some embodiments, the present invention is a method including demultiplexing tagged DNA reads, so as to result in identifying the distinct origin of the tagged DNA read.

[00070] In some embodiments of the method of the present invention, a plurality of overlapping long DNA molecules from the same genomic region of an organism exist in several sets of genomic DNA, where there is a low probability that long molecules from two different genomic regions co-exist in more than one set. In some embodiments, the method includes de-multiplexing the tagged DNA reads by mapping the tagged DNA reads to a sample of origin. In some embodiments, the method includes assembling contigs using Debruijn graph construction. In some embodiments, the tagged reads are transformed into computationally tagged contigs and then the tagged contigs are de-multiplexed, so as to result in: (i) computational efficiency, where computational efficiency refers to mapping the tagged DNA reads to the tagged contigs, in which the cumulative length of the tagged contigs is similar to the size of the genome, and (ii) mapping efficiency, where mapping the tagged contigs allows for matching reads from overlapping or adjacent genomic regions.

[00071] In some embodiments, long contigs are contigs greater than 500bp. In some embodiments, long contigs are contigs greater than 1kb. In some embodiments, long contigs are contigs greater than 1.5kb. In some embodiments, long contigs are contigs greater than 2kb. In some embodiments, long contigs are contigs greater than 2.5kb. In some embodiments, long contigs are contigs greater than 3kb. In some embodiments, long contigs are contigs greater than 3.5kb. In some embodiments, long contigs are contigs greater than 4kb. In some embodiments, long contigs are contigs greater than 4.5kb. In some embodiments, long contigs are contigs greater than 5kb. In some embodiments, long contigs are contigs greater than 5.5kb. In some embodiments, long contigs are contigs greater than 6kb. In some embodiments, long contigs are contigs greater than 6.5kb. In some embodiments, long contigs are contigs greater than 7kb. In some embodiments, long contigs are contigs greater than 7.5kb. In some embodiments, long contigs are contigs greater than 8kb. In some embodiments, long contigs are contigs greater than 8.5kb. In some embodiments, long contigs are contigs greater than 9kb. In some embodiments, long contigs are contigs greater than 9.5kb. In some embodiments, long contigs are contigs greater than 10kb.

[00072] In some embodiments, long contigs are contigs from 500bp to 10 kb. In some embodiments, long contigs are contigs from 500bp to 9 kb. In some embodiments, long contigs are contigs from 500bp to 8 kb. In some embodiments, long contigs are contigs from 500bp to 7 kb. In some embodiments, long contigs are contigs from 500bp to 6 kb. In some embodiments, long contigs are contigs from 500bp to 5 kb. In some embodiments, long contigs are contigs from 500bp to 4 kb. In some embodiments, long contigs are contigs from 500bp to 3 kb. In some embodiments, long contigs are contigs from 500bp to 2 kb. In some embodiments, long contigs are contigs from 500bp to 1 kb.

[00073] In some embodiments, long contigs are contigs from 1kb to 10 kb. In some embodiments, long contigs are contigs from 2kb to 10 kb. In some embodiments, long contigs are contigs from 3kb to 10 kb. In some embodiments, long contigs are contigs from 4kb to 10 kb. In some embodiments, long contigs are contigs from 5kb to 10 kb. In some embodiments, long contigs are contigs from 6kb to 10 kb. In some embodiments, long contigs are contigs from 7kb to 10 kb. In some embodiments, long contigs are contigs from 8kb to 10 kb. In some embodiments, long contigs are contigs from 9kb to 10 kb.

[00074] In some embodiments, long contigs are contigs from 1kb to 9 kb. In some embodiments, long contigs are contigs from 2kb to 8 kb. In some embodiments, long contigs are contigs from 3kb to 7 kb. In some embodiments, long contigs are contigs from 4kb to 6 kb.

[00075] In some embodiments, the genomic sequence fragments are 32bp to 100,000 bp (100kb). In some embodiments, the genomic sequence fragments are 50bp to 100,000 bp (100kb). In some embodiments, the genomic sequence fragments are 100bp to 100,000 bp (100kb). In some embodiments, the genomic sequence fragments are 200bp to 100,000 bp (100kb). In some embodiments, the genomic sequence fragments are 300bp to 100,000 bp (100kb). In some embodiments, the genomic sequence fragments are 400bp to 100,000 bp (100kb). In some embodiments, the genomic sequence fragments are 500bp to 100,000 bp (100kb). In some embodiments, the genomic sequence fragments are 600bp to 100,000 bp (100kb). In some embodiments, the genomic sequence fragments are 700bp to 100,000 bp (100kb). In some embodiments, the genomic sequence fragments are 800bp to 100,000 bp (100kb). In some embodiments, the genomic sequence fragments are 900bp to 100,000 bp (100kb).

[00076] In some embodiments, the genomic sequence fragments are 1,000bp to 100,000 bp (100kb). In some embodiments, the genomic sequence fragments are 10,000bp to 100,000 bp (100kb). In some embodiments, the genomic sequence fragments are 25,000bp to 100,000 bp (100kb). In some embodiments, the genomic sequence fragments are 50,000bp to 100,000 bp (100kb). In some embodiments, the genomic sequence fragments are 75,000bp to 100,000 bp (100kb).

[00077] In some embodiments, the genomic sequence fragments are 32bp to 75,000 bp (75kb). In some embodiments, the genomic sequence fragments are 32bp to 50,000 bp (50kb). In some embodiments, the genomic sequence fragments are 32bp to 25,000 bp (25kb). In some embodiments, the genomic sequence fragments are 32bp to 10,000 bp (10kb). In some embodiments, the genomic sequence fragments are 32bp to 1,000 bp (1kb).

[00078] In some embodiments, the genomic sequence fragments are 32bp to 900 bp. In some embodiments, the genomic sequence fragments are 32bp to 800 bp. In some embodiments, the genomic sequence fragments are 32bp to 700 bp. In some embodiments, the genomic sequence fragments are 32bp to 600 bp. In some embodiments, the genomic sequence fragments are 32bp to 500 bp. In some embodiments, the genomic sequence fragments are 32bp to 400 bp. In some embodiments, the genomic sequence fragments are 32bp to 300 bp. In some embodiments, the genomic sequence fragments are 32bp to 200 bp. In some embodiments, the genomic sequence fragments are 32bp to 100 bp.

[00079] In some embodiments, the genomic sequence fragments are 500bp to 75,000 bp. In some embodiments, the genomic sequence fragments are 1,000bp to 50,000 bp. In some embodiments, the genomic sequence fragments are 10,000bp to 25,000 bp.

[00080] In some embodiments, the unique contig Kmers are 32 bp to 250bp. In some embodiments, the unique contig Kmers are 32 bp to 225bp. In some embodiments, the unique contig Kmers are 32 bp to 200bp. In some embodiments, the unique contig Kmers are 32 bp to 175bp. In some embodiments, the unique contig Kmers are 32 bp to 150bp. In some embodiments, the unique contig Kmers are 32 bp to 125bp. In some embodiments, the unique contig Kmers are 32 bp to 100bp. In some embodiments, the unique contig Kmers are 32 bp to 75bp. In some embodiments, the unique contig Kmers are 32 bp to 50bp.

[00081] In some embodiments, the unique contig Kmers are 50 bp to 250bp. In some embodiments, the unique contig Kmers are 75 bp to 250bp. In some embodiments, the unique contig Kmers are 100 bp to 250bp. In some embodiments, the unique contig Kmers are 125 bp to 250bp. In some embodiments, the unique contig Kmers are 150 bp to 250bp. In some embodiments, the unique contig Kmers are 175 bp to 250bp. In some embodiments, the unique contig Kmers are 200 bp to 250bp. In some embodiments, the unique contig Kmers are 225 bp to 250bp. In some embodiments, the unique contig Kmers are 50 bp to 225bp. In some embodiments, the unique contig Kmers are 75 bp to 200bp. In some embodiments, the unique contig Kmers are 100 bp to 175bp. In some embodiments, the unique contig Kmers are 125 bp to 150bp.

[00082] In some embodiments, the low weight of E_ij is less than 3. In some embodiments, the low weight of E_ij is less than 2. In some embodiments, the low weight of E_ij is less than 1. In some embodiments, the low weight of E_ij is less than 0.5.

[00083] In some embodiments, the low weight of E_ij is 0.5 to 3. In some embodiments, the low weight of E_ij is 1 to 3. In some embodiments, the low weight of E_ij is 1.5 to 3. In some embodiments, the low weight of E_ij is 2 to 3. In some embodiments, the low weight of E_ij is 2.5 to 3. In some embodiments, the low weight of E_ij is 0.5 to 2.5. In some embodiments, the low weight of E_ij is 0.5 to 2. In some embodiments, the low weight of E_ij is 0.5 to 1.5. In some embodiments, the low weight of E_ij is 0.5 to 1. In some embodiments, the low weight of E_ij is 1 to 2.5. In some embodiments, the low weight of E_ij is 1.5 to 2.

[00084] In some embodiments, the whole genome of the organism is fragmented to produce a plurality of molecules. In some embodiments, each molecule of the plurality of molecules is a DNA molecule. In some embodiments, a molecule is 500bp to 2MB (megabase). In some embodiments, a molecule is 500bp to 1MB. In some embodiments, a molecule is 500bp to 0.5MB. In some embodiments, a molecule is 500bp to 250,000 kb. In some embodiments, a molecule is 500bp to 100,000 kb. In some embodiments, a molecule is 500bp to 50,000 kb. In some embodiments, a molecule is 500bp to 25,000 kb. In some embodiments, a molecule is 500bp to 10,000 kb. In some embodiments, a molecule is 500bp to 2,500 kb. In some embodiments, a molecule is 500bp to 1,000 kb. In some embodiments, a molecule is 500bp to 500 kb. In some embodiments, a molecule is 500bp to 250 kb. In some embodiments, a molecule is 500bp to 100 kb. In some embodiments, a molecule is 500bp to 50 kb. In some embodiments, a molecule is 500bp to 25 kb. In some embodiments, a molecule is 500bp to 10 kb. In some embodiments, a molecule is 500bp to 5 kb. In some embodiments, a molecule is 500bp to 2.5 kb. In some embodiments, a molecule is 500bp to 1 kb.

[00085] In some embodiments, a molecule is 1kb to 2MB. In some embodiments, a molecule is 2.5kb to 2MB. In some embodiments, a molecule is 5kb to 2MB. In some embodiments, a molecule is 10kb to 2MB. In some embodiments, a molecule is 25kb to 2MB. In some embodiments, a molecule is 50kb to 2MB. In some embodiments, a molecule is 100kb to 2MB. In some embodiments, a molecule is 250kb to 2MB. In some embodiments, a molecule is 500kb to 2MB. In some embodiments, a molecule is 1,000kb to 2MB. In some embodiments, a molecule is 2,500kb to 2MB. In some embodiments, a molecule is 5,000kb to 2MB. In some embodiments, a molecule is 10,000kb to 2MB. In some embodiments, a molecule is 25,000kb to 2MB. In some embodiments, a molecule is 50,000kb to 2MB. In some embodiments, a molecule is 100,000kb to 2MB. In some embodiments, a molecule is 250,000kb to 2MB. In some embodiments, a molecule is 500,000kb to 2MB. In some embodiments, a molecule is 1MB to 2MB. In some embodiments, a molecule is 1.5kb to 2MB. In some embodiments, a molecule is 1MB to 1.5MB.

[00086] In some embodiments, a molecule is 1kb to 1.5MB. In some embodiments, a molecule is 2.5kb to 1MB. In some embodiments, a molecule is 5kb to 0.5MB. In some embodiments, a molecule is 10kb to 0.25MB. In some embodiments, a molecule is 25kb to 100,000kb. In some embodiments, a molecule is 50kb to 50,000kb. In some embodiments, a molecule is 100kb to 25,000kb. In some embodiments, a molecule is 250kb to 10,000kb. In some embodiments, a molecule is 500kb to 5,000kb. In some embodiments, a molecule is 1,000kb to 2,500kb.

[00087] In some embodiments, the whole genome of an organism is derived from a plant, e.g., but not limited to, maize, rice, barley, etc. In some embodiments, the whole genome of an organism is derived from a mammal, e.g., but not limited to, human, feline, canine, murine, etc. In some embodiments, the whole genome of an organism is derived from a single-cell organism, e.g., but not limited to, a bacterium, an archaebacterium, a protozoan, etc.

[00088] In some embodiments, the method of the present invention includes dynamically constructing, by a specifically programmed computer system, a debruijn graph from the plurality of tagged raw reads. In some embodiments, the method further includes analyzing, by the specifically programmed computer system, a plurality of contigs.

[00089] In some embodiments, the method of the present invention includes dynamically constructing, by a specifically programmed computer system, a weighted graph comprising a plurality of nodes. In some embodiments, the method further includes identifying, by the specifically programmed computer system, a connection weight between each two nodes as the number of shared barcodes between the nodes, filtering the connection weight between each two nodes as the number of shared barcodes between the nodes, organizing a first plurality of contigs into a first plurality of groups of assembled molecules, organizing a second plurality of contigs into a second plurality of groups of assembled molecules, predicting a continuous sequence of the at least one portion of the contig by constructing a distance matrix between at least two contigs, assigning each of the overlapping groups of the second plurality of groups of assembled molecules an Overlapping Start (OS) and an Overlapping End (OE), assembling a DNA sequence order, wherein the DNA sequence order is organized from a maximum OS to a minimum OE, or any combination thereof.

[00090] Example 1:

[00091] The analysis includes the following steps:

[00092] (1) Building a debruijn graph based on all the raw reads (from all sets) and generating contigs. The contigs represent authentic genomic sequence fragments of varying length (64bp (or any other value of the debruijn kmer length +1) to 10,000bp). The longer the contigs, the better the following steps perform, so building the debruijn graph with a large kmer (>100mer) and filtering sequencing errors (which generate false contigs) from the raw reads are required supplementary analysis stages for this step.
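As an illustration of step (1), the following is a minimal sketch of debruijn graph construction and contig generation. It rests on simplifying assumptions that are not taken from the specification: nodes are (k-1)-mers, each distinct k-mer contributes one edge, reverse complements and sequencing-error filtering are ignored, and contigs are spelled as maximal non-branching paths. The function names and the toy reads are hypothetical.

```python
from collections import defaultdict


def build_graph(reads, k):
    """Nodes are (k-1)-mers; each distinct k-mer adds one edge prefix -> suffix."""
    successors = defaultdict(set)
    predecessors = defaultdict(set)
    for read in reads:
        for pos in range(len(read) - k + 1):
            kmer = read[pos:pos + k]
            successors[kmer[:-1]].add(kmer[1:])
            predecessors[kmer[1:]].add(kmer[:-1])
    return successors, predecessors


def contigs_from_graph(successors, predecessors):
    """Spell maximal non-branching paths (isolated cycles are skipped)."""
    nodes = set(successors) | set(predecessors)

    def is_inner(node):
        # An inner node has exactly one way in and one way out.
        return (len(successors.get(node, ())) == 1
                and len(predecessors.get(node, ())) == 1)

    contigs = []
    for node in sorted(nodes):
        if is_inner(node):
            continue  # paths only start at branching or terminal nodes
        for nxt in sorted(successors.get(node, ())):
            path = node + nxt[-1]
            while is_inner(nxt):
                (nxt,) = successors[nxt]
                path += nxt[-1]
            contigs.append(path)
    return contigs


# Hypothetical error-free reads sampled from one short sequence.
reads = ["ATGGCGT", "GGCGTGCA"]
successors, predecessors = build_graph(reads, k=4)
contigs = contigs_from_graph(successors, predecessors)
print(contigs)  # ['ATGGCGTGCA']
```

A real data set would use a much larger k (the text suggests more than 100) and would need error filtering before the paths are spelled.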

[00093] (2) Analyzing the debruijn graph structure (including the calculated average read coverage for the contigs, and contig connectivity) and predicting the number of appearances of each contig in the genome (noise, one, two, three, four or more). Specifically, the analysis focuses on the contigs that appear once in the genome, which are defined as "unique contigs" ("UCs"). The information of interest concerns two (or more) UCs originating from a single long DNA molecule. In some cases, the UCs are relatively short, so the analysis is supported by the mapping information of the tagged reads to additional long contigs (>500bp) that may appear several times in the genome (between 2 and 10 times). In addition:

[00094] (a) In a homozygous genome (such as a plant inbred), the diploid genomic sequence is defined to appear once in the genome and is defined as "UCs".

[00095] (b) In a heterozygous genome (such as human), since the two haploid sequences are not identical, the haploid genomic sequence is defined to appear once in the genome and is defined as "UCs", while the diploid genomic sequence is defined to appear twice.
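The specification does not state how the number of appearances is predicted from coverage. As one possible sketch only, the ratio between a contig's average read coverage and an assumed single-copy coverage can be rounded to a copy number. The function name, the `single_copy_coverage` parameter, the `noise_fraction` cutoff and the coverage values below are all hypothetical, and contig connectivity is not used here.

```python
def predict_copy_number(mean_coverage, single_copy_coverage, noise_fraction=0.25):
    """Return 0 for noise, otherwise 1, 2, 3 or 4 (4 meaning "four or more")."""
    ratio = mean_coverage / single_copy_coverage
    if ratio < noise_fraction:
        return 0  # coverage too low to be a genuine genomic sequence
    # max(1, ...) keeps low but non-noise ratios from rounding down to zero.
    return min(4, max(1, round(ratio)))


# Hypothetical average coverages, assuming single-copy sequence is covered 30x.
coverages = {"c1": 31.0, "c2": 3.0, "c3": 62.0, "c4": 200.0}
copies = {cid: predict_copy_number(cov, 30.0) for cid, cov in coverages.items()}
unique_contigs = sorted(cid for cid, n in copies.items() if n == 1)
print(copies)          # {'c1': 1, 'c2': 0, 'c3': 2, 'c4': 4}
print(unique_contigs)  # ['c1']
```

In a heterozygous genome the single-copy coverage in this sketch would correspond to the haploid coverage, in line with item (b) above.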

[00096] (3) Mapping the tagged reads to the obtained unique or long contigs using a hash table method:

[00097] (a) Contigs are broken into Kmers with a length shorter than the read size (i.e., 100bp).

[00098] (b) The Kmers are the keys of the hash table and the value is their contig id. Kmers that originate from more than one contig are discarded.

[00099] (c) Each read is broken into Kmers, and if more than a threshold number (1, 2, 3 to 10) of the read Kmers are mapped to the same contig, a match is defined between the read barcode and the contig.
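Sub-steps (a) to (c) can be sketched as follows. This is an illustrative sketch rather than the claimed implementation: it ignores reverse-complement matches and sequencing errors, and the function names, the toy contigs and the toy reads are hypothetical.

```python
from collections import Counter


def build_kmer_index(contigs, k):
    """Map each k-mer to its contig id; drop k-mers found in more than one contig."""
    index, ambiguous = {}, set()
    for contig_id, seq in contigs.items():
        for pos in range(len(seq) - k + 1):
            kmer = seq[pos:pos + k]
            if kmer in ambiguous:
                continue
            owner = index.get(kmer)
            if owner is None:
                index[kmer] = contig_id
            elif owner != contig_id:
                del index[kmer]
                ambiguous.add(kmer)
    return index


def match_read(read, index, k, threshold):
    """Contigs hit by more than `threshold` k-mers of the read."""
    hits = Counter()
    for pos in range(len(read) - k + 1):
        contig_id = index.get(read[pos:pos + k])
        if contig_id is not None:
            hits[contig_id] += 1
    return {contig_id for contig_id, n in hits.items() if n > threshold}


def barcode_contig_matrix(tagged_reads, index, k, threshold):
    """Contig id -> set of barcodes with at least one matching read."""
    matrix = {}
    for barcode, read in tagged_reads:
        for contig_id in match_read(read, index, k, threshold):
            matrix.setdefault(contig_id, set()).add(barcode)
    return matrix


# Hypothetical toy data; the k-mers CGTA, GTAC and TACG occur in both contigs.
contigs = {"c1": "ACGTACGGTCA", "c2": "TTGACCGTACG"}
tagged_reads = [("B1", "ACGGTCA"), ("B2", "TTGACCG"), ("B1", "CGTACG")]
index = build_kmer_index(contigs, k=4)
matrix = barcode_contig_matrix(tagged_reads, index, k=4, threshold=2)
print(matrix)  # {'c1': {'B1'}, 'c2': {'B2'}}
```

The third read consists only of k-mers shared by both contigs, so it produces no match; this is the effect of discarding ambiguous Kmers in sub-step (b).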

A matrix is generated as in the following simplified example (where 1 mean

Contig  Length (bp)  UC  LC  B1  B2  B3  B4  B5  B6  B7  B8  B9  B10
4       1203         0   1   1   1   0   1   1   1   0   0   0   0
5       175          1   0   1   0   0   0   0   0   0   1   1   0
6       1223         0   1   1   0   0   0   0   0   1   1   1   1
7       2380         0   1   1   0   0   0   0   0   0   0   1   1
8       201          1   0   1   0   0   0   0   0   0   0   1   1
9       175          1   0   1   0   0   0   0   0   1   1   0   1
10      175          1   0   1   1   0   0   1   0   0   0   0   0
11      7805         0   1   1   0   0   0   1   1   0   0   0   0
12      2124         0   1   1   0   0   0   0   0   1   1   0   0
13      212          1   0   1   0   1   0   1   0   0   0   0   0
14      175          1   0   1   1   0   1   0   1   0   0   0   0
15      2351         0   1   1   0   0   1   1   1   0   0   0   0
16      6437         0   1   1   0   0   0   0   0   0   1   1   1
17      1729         0   1   1   0   0   0   0   0   0   0   1   1
18      2323         0   1   1   1   1   1   0   0   0   0   0   0
19      191          1   0   1   0   0   0   0   0   0   1   0   1
20      4231         0   1   1   1   1   1   1   0   0   0   0   0

[000100] (4) De-multiplex step: For each barcode id (i.e., B1):

[000101] (a) The group of additional barcodes (BX) is defined, in which the UCs in B1 co-appear, by searching for barcode ids in which the number of UCs in B1 that also appear there is above some threshold (i.e., >0, B2-B10 based on the above table example).

[000102] (b) Focusing on long contigs in B1 (LC, such that their appearance probability in the matched barcodes is high), a B1-LC graph is built (example FIG. 5): each LC is a node, and the number of barcodes (from BX, defined at step 4a) in which a pair of LCs is co-detected defines a weighted edge between that pair (E_ij). Edges with weight lower than a threshold (e.g., 2, 3 or 5, depending on sequence coverage and number of barcodes) are deleted from the graph.

[000103] (c) Searching for connected components in the B1-LC graph (example FIG. 5); each connected component defines LCs in B1 with a common molecule of origin: B1_LCs_1 (with contigs 4, 11, 15, 18 and 20) and B1_LCs_2 (with contigs 6, 7, 12, 16, and 17). Figures 5A and 5B show the connected components as used in the methods of the present invention.

[000104] (d) Splitting BX into distinct groups (i.e., BX_1, BX_2, etc.), each of which overlaps with a distinct molecule in B1 (i.e., B1_LCs_1), identified by the barcodes in which the LCs in B1_LCs_1 appear. In our example, B1_LCs_1 matches B2-B6 and B1_LCs_2 matches B7-B10.

[000105] (e) Predicting the LC order within each molecule (i.e., B1_LCs_1). LCs from the same molecule of origin are sorted based on the following procedure: we build a distance matrix between the LCs (1->n). The distance between an LC pair (i and j) is defined as $\max_{k,l \in 1 \ldots n}(E_{kl}) - E_{ij}$. Next, the distance matrix is reordered such that adjacent LCs have low distance.
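Sub-steps (a) and (d) can be checked against the example matrix with a short sketch. The barcode sets below are transcribed from that matrix under the reading that its last ten columns are barcodes B1 to B10 and that the two flag columns mark unique and long contigs; this reading, the function names and the variable names are assumptions. The connected components are taken as stated in sub-step (c).

```python
from collections import Counter

# Contig id -> barcode numbers (1 = B1 ... 10 = B10) in which it was detected.
UNIQUE_CONTIGS = {
    5: {1, 8, 9}, 8: {1, 9, 10}, 9: {1, 7, 8, 10}, 10: {1, 2, 5},
    13: {1, 3, 5}, 14: {1, 2, 4, 6}, 19: {1, 8, 10},
}
LONG_CONTIGS = {
    4: {1, 2, 4, 5, 6}, 6: {1, 7, 8, 9, 10}, 7: {1, 9, 10}, 11: {1, 5, 6},
    12: {1, 7, 8}, 15: {1, 4, 5, 6}, 16: {1, 8, 9, 10}, 17: {1, 9, 10},
    18: {1, 2, 3, 4}, 20: {1, 2, 3, 4, 5},
}
# Connected components of the B1-LC graph as stated in sub-step (c).
COMPONENTS = {"B1_LCs_1": [4, 11, 15, 18, 20], "B1_LCs_2": [6, 7, 12, 16, 17]}


def co_appearing_barcodes(target, unique_contigs, min_shared=0):
    """BX: barcodes sharing more than `min_shared` unique contigs with `target`."""
    shared = Counter()
    for barcodes in unique_contigs.values():
        if target in barcodes:
            shared.update(barcodes - {target})
    return sorted(b for b, n in shared.items() if n > min_shared)


def split_by_component(bx, components, long_contigs):
    """Assign each BX barcode to the components whose long contigs it contains."""
    return {
        name: [b for b in bx if any(b in long_contigs[c] for c in members)]
        for name, members in components.items()
    }


bx = co_appearing_barcodes(1, UNIQUE_CONTIGS)
groups = split_by_component(bx, COMPONENTS, LONG_CONTIGS)
print(bx)      # [2, 3, 4, 5, 6, 7, 8, 9, 10]
print(groups)  # {'B1_LCs_1': [2, 3, 4, 5, 6], 'B1_LCs_2': [7, 8, 9, 10]}
```

Under these assumptions the sketch reproduces the groups named in the text: BX is B2-B10, B1_LCs_1 matches B2-B6 and B1_LCs_2 matches B7-B10.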

[000106] The distance matrix for B1_LCs_1 is:

The predicted order of the long contigs is: 11, 15, 4, 20 and 18, and the reordered distance matrix for B1_LCs_1 is:
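The edge weights, connected components and contig order of this example can be recomputed with the sketch below. The barcode sets are transcribed from the example matrix (restricted to B2-B10, reading its last ten columns as B1 to B10), the edge threshold of 2 is one of the values mentioned in the text, and the reordering is done by a brute-force search over permutations that minimizes the summed distance between adjacent contigs. The text does not specify the reordering procedure, so that search is an assumption, as are all function names.

```python
from itertools import combinations, permutations

# Long contig id -> BX barcodes (2 = B2 ... 10 = B10) in which it was detected.
LC_BARCODES = {
    4: {2, 4, 5, 6}, 6: {7, 8, 9, 10}, 7: {9, 10}, 11: {5, 6},
    12: {7, 8}, 15: {4, 5, 6}, 16: {8, 9, 10}, 17: {9, 10},
    18: {2, 3, 4}, 20: {2, 3, 4, 5},
}
MIN_EDGE_WEIGHT = 2  # assumed threshold; edges below it are deleted


def edge_weights(lc_barcodes):
    """E_ij = number of BX barcodes in which contigs i and j are co-detected."""
    return {(i, j): len(lc_barcodes[i] & lc_barcodes[j])
            for i, j in combinations(sorted(lc_barcodes), 2)}


def connected_components(nodes, edges):
    """Group nodes linked by the retained edges (iterative depth-first search)."""
    adjacency = {n: set() for n in nodes}
    for i, j in edges:
        adjacency[i].add(j)
        adjacency[j].add(i)
    seen, components = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(adjacency[node] - component)
        seen |= component
        components.append(sorted(component))
    return components


def order_contigs(component, weights):
    """Brute-force order minimizing the sum of distances max(E_kl) - E_ij."""
    if len(component) < 2:
        return list(component)

    def weight(i, j):
        return weights[(min(i, j), max(i, j))]

    e_max = max(weight(i, j) for i, j in combinations(component, 2))

    def path_cost(path):
        return sum(e_max - weight(a, b) for a, b in zip(path, path[1:]))

    # The tuple itself breaks ties, which also fixes the reading direction.
    return list(min(permutations(component), key=lambda p: (path_cost(p), p)))


weights = edge_weights(LC_BARCODES)
kept = [pair for pair, w in weights.items() if w >= MIN_EDGE_WEIGHT]
components = connected_components(LC_BARCODES, kept)
order = order_contigs(components[0], weights)
print(components)  # [[4, 11, 15, 18, 20], [6, 7, 12, 16, 17]]
print(order)       # [11, 15, 4, 20, 18]
```

The permutation search is only practical for a handful of contigs; molecules with on the order of a hundred contigs would need a heuristic ordering instead.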

[000107] (a) The LC ordering information is used to estimate the overlap of the molecule in each set of BX (i.e., BX_1) with this same molecule in B1 (i.e., B1_LCs_1): for each set of BX we define an Overlapping Start (OS) as the minimal LC order (defined in 4d) that appears in the BX set and an Overlapping End (OE) as the maximal LC order (defined in 4d) that appears in the BX set.

[000108] (b) Finally, all contigs (unique and long) in B1 that belong to B1_LCs_1 are defined as the contigs in B1 that appear in BX_1, and their order in the molecule is predicted in the range of max(OS_BX_i) -> min(OE_BX_i). The output of the demultiplexing algorithm in the example above is:

[000109] The contig IDs represent the contigs or their reverse complement contigs, as strand information (i.e., sense strand or anti-sense strand) cannot be identified from the barcoded data.
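The OS and OE definitions above can be illustrated on B1_LCs_1 with the predicted order 11, 15, 4, 20, 18. The sketch below is not the output of the algorithm referred to in the text. It assumes one reading of the final sub-step: a remaining contig is placed in the range from the maximum OS to the minimum OE taken over the BX_1 barcodes in which that contig appears. The barcode sets are transcribed from the example matrix, and the function names are hypothetical.

```python
ORDER = [11, 15, 4, 20, 18]  # predicted long-contig order for B1_LCs_1
# Long and unique contigs of B1_LCs_1 -> BX_1 barcodes (2 = B2 ... 6 = B6).
LC_BARCODES = {4: {2, 4, 5, 6}, 11: {5, 6}, 15: {4, 5, 6},
               18: {2, 3, 4}, 20: {2, 3, 4, 5}}
UC_BARCODES = {10: {2, 5}, 13: {3, 5}, 14: {2, 4, 6}}


def overlap_intervals(order, lc_barcodes):
    """Barcode -> (OS, OE): minimal and maximal 1-based LC order seen in it."""
    rank = {contig: pos for pos, contig in enumerate(order, start=1)}
    intervals = {}
    for contig, barcodes in lc_barcodes.items():
        for barcode in barcodes:
            start, end = intervals.get(barcode, (rank[contig], rank[contig]))
            intervals[barcode] = (min(start, rank[contig]), max(end, rank[contig]))
    return intervals


def placement_range(barcodes, intervals):
    """(max OS, min OE) over the barcodes containing a contig, or None."""
    hits = [intervals[b] for b in barcodes if b in intervals]
    if not hits:
        return None
    return max(start for start, _ in hits), min(end for _, end in hits)


intervals = overlap_intervals(ORDER, LC_BARCODES)
placements = {c: placement_range(b, intervals) for c, b in UC_BARCODES.items()}
print(sorted(intervals.items()))
# [(2, (3, 5)), (3, (4, 5)), (4, (2, 5)), (5, (1, 4)), (6, (1, 3))]
print(placements)  # {10: (3, 4), 13: (4, 4), 14: (3, 3)}
```

Under this reading, unique contig 14 would sit at the position of long contig 4, contig 13 at the position of long contig 20, and contig 10 between the two.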

[000110] In another embodiment, the method includes mapping the barcoded reads onto a pre-built debruijn graph from additional NGS data from the same sample. This option supports de novo assembly analysis by providing genomic coverage. Additionally, this method can use PCR-free libraries as a data source.

[000111] As a non-limiting example, this algorithm can be used on barcoded sequencing data generated using a single Chromium™ library (by 10X Genomics, CA, USA) sequenced on two lanes of a HiSeq X10 machine (Illumina) to generate 230Gb of sequence data (785*10^6 2x150bp reads) of a Bovine genome (Nellore beef cattle).

[000112] The statistics of the obtained DeBruijn graph are the following:

Total number of contigs: 22,445,962

Total Assembly size: 3,534,210,010

Contig N50: 398

Max contig length: 62,105

[000113] Analyzing the DeBruijn graph allows for identifying 9,538,995 of the contigs as unique, covering 1,693Mbp (47.6% of the total assembly size), and 9,557,415 of the contigs as supplementary, covering 1,210Mbp (34% of the total assembly size).
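For reference, the Contig N50 statistic reported above is conventionally the contig length L such that contigs of length L or longer account for at least half of the total assembly size. A minimal sketch of that computation, using hypothetical toy lengths rather than the data of this example:

```python
def n50(lengths):
    """Length L such that contigs of length >= L hold at least half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("n50 is undefined for an empty list of contig lengths")


toy_lengths = [80, 70, 50, 40, 30, 20]  # hypothetical contig lengths in bp
print(sum(toy_lengths))  # 290
print(n50(toy_lengths))  # 70
```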

[000114] Upon mapping the barcoded reads (tagged raw reads) to the graph, 1,119,393,205 tagged raw reads that mapped to the graph (e.g., at least identical 127bp overlap between the read and the contig) are identified. A matrix is generated, assigning the contigs to a barcode matrix. Reliable barcodes are identified as barcodes having more than 60 tagged raw reads (e.g., but not limited to, 60 tagged raw reads, 61 tagged raw reads, 62 tagged raw reads, etc.). 1,323,829 barcodes are obtained to assign the final contigs to the barcodes matrix.
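The reliable-barcode filter described above amounts to counting the mapped tagged raw reads per barcode and keeping the barcodes that exceed the cutoff. A minimal sketch, assuming the cutoff is applied strictly ("more than 60") and using hypothetical barcode labels:

```python
from collections import Counter

MIN_READS = 60  # cutoff quoted in the text: more than 60 tagged raw reads


def reliable_barcodes(mapped_read_barcodes, min_reads=MIN_READS):
    """Return the barcodes that tag more than `min_reads` mapped reads."""
    counts = Counter(mapped_read_barcodes)
    return {barcode for barcode, n in counts.items() if n > min_reads}


# Hypothetical toy input: one barcode label per mapped read.
toy_reads = ["BC_A"] * 61 + ["BC_B"] * 60 + ["BC_C"] * 3
kept = reliable_barcodes(toy_reads)
print(sorted(kept))  # ['BC_A']
```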

[000115] Upon running the algorithm, the long molecules within barcodes were demultiplexed. Overall, 8.24 molecules were identified per barcode, with an average length of 41kbp. On average, each molecule is composed of 111.63 contigs and part of the molecule appears in 119.6 distinct barcodes.

[000116] Figure 6 shows an example of the contigs-to-barcodes matrix of a long molecule composed of 118 contigs (Y-axis) and overlapping with 79 barcodes (X-axis). White cells indicate that reads from the corresponding barcode were mapped to the corresponding contig, while black cells indicate an unmatched barcode and contig.

[000117] Illustrative Operating Environments

[000118] FIG. 1 illustrates one embodiment of an environment in which the present invention may operate. However, not all of these components may be required to practice the invention, and variations in the arrangement and type of the components may be made without departing from the spirit or scope of the present invention. In some embodiments, the inventive system and method may include a large number of members and/or concurrent transactions. In other embodiments, the inventive system(s) and method(s) are based on a scalable computer and network architecture that incorporates various strategies for assessing the data, caching, searching, and database connection pooling. An example of the scalable architecture is an architecture that is capable of operating multiple servers.

[000119] In embodiments, members of the computer system 102-104 include virtually any computing device capable of receiving and sending a message over a network, such as network 105, to and from another computing device, such as servers 106 and 107, each other, and the like. In embodiments, the set of such devices includes devices that typically connect using a wired communications medium such as personal computers, multiprocessor systems, microprocessor-based or programmable consumer electronics, network PCs, and the like. In embodiments, the set of such devices also includes devices that typically connect using a wireless communications medium such as cell phones, smart phones, pagers, walkie talkies, radio frequency (RF) devices, infrared (IR) devices, CBs, integrated devices combining one or more of the preceding devices, or virtually any mobile device, and the like. Similarly, in embodiments, client devices 102-104 are any device that is capable of connecting using a wired or wireless communication medium such as a PDA, POCKET PC, wearable computer, and any other device that is equipped to communicate over a wired and/or wireless communication medium.

[000120] In embodiments, the inventive system(s) of the present invention can deliver information (e.g., DNA sequences, an analysis of at least one DNA sequence) to at least one user. In some embodiments, the at least one user is remotely located. In some embodiments, the at least one user is a farmer. In some embodiments, the at least one user may be a company specializing in growing and/or distributing seeds and/or plants (e.g., but not limited to, maize, rice, wheat, etc.) In some embodiments, the inventive system(s) of the present invention can deliver information to at least one user by use of a GUI, which can allow for the at least one user to select a crop.

[000121] In embodiments, each member device within member devices 102-104 may include a browser application that is configured to receive and to send web pages, and the like. In embodiments, the browser application may be configured to receive and display graphics, text, multimedia, and the like, employing virtually any web based language, including, but not limited to Standard Generalized Markup Language (SGML), such as HyperText Markup Language (HTML), a wireless application protocol (WAP), a Handheld Device Markup Language (HDML), such as Wireless Markup Language (WML), WMLScript, XML, JavaScript, and the like. In embodiments, programming may include either Java, .Net, QT, C, C++ or other suitable programming language.

[000122] In embodiments, member devices 102-104 may be further configured to receive a message from another computing device employing another mechanism, including, but not limited to email, Short Message Service (SMS), Multimedia Message Service (MMS), instant messaging (IM), internet relay chat (IRC), mIRC, Jabber, and the like or a Proprietary protocol.

[000123] In embodiments, network 105 may be configured to couple one computing device to another computing device to enable them to communicate. In some embodiments, network 105 may be enabled to employ any form of computer readable media for communicating information from one electronic device to another. Also, in embodiments, network 105 may include a wireless interface, and/or a wired interface, such as the Internet, in addition to local area networks (LANs), wide area networks (WANs), direct connections, such as through a universal serial bus (USB) port, other forms of computer-readable media, or any combination thereof. In embodiments, on an interconnected set of LANs, including those based on differing architectures and protocols, a router may act as a link between LANs, enabling messages to be sent from one to another.

[000124] Also, in some embodiments, communication links within LANs typically include twisted wire pair or coaxial cable, while communication links between networks may utilize analog telephone lines, full or fractional dedicated digital lines including T1, T2, T3, and T4, Integrated Services Digital Networks (ISDNs), Digital Subscriber Lines (DSLs), wireless links including satellite links, or other communications links known to those skilled in the art. Furthermore, in some embodiments, remote computers and other related electronic devices could be remotely connected to either LANs or WANs via a modem and temporary telephone link. In essence, in some embodiments, network 105 includes any communication method by which information may travel between client devices 102-104, and servers 106 and 107.

[000125] FIG. 2 shows another exemplary embodiment of the computer and network architecture that supports the inventive method and system. The member devices 202a, 202b thru 202n shown each at least includes a computer-readable medium, such as a random access memory (RAM) 208 coupled to a processor 210 or FLASH memory. The processor 210 may execute computer-executable program instructions stored in memory 208. Such processors comprise a microprocessor, an ASIC, and state machines. Such processors comprise, or may be in communication with, media, for example computer-readable media, which stores instructions that, when executed by the processor, cause the processor to perform the steps described herein. Embodiments of computer-readable media may include, but are not limited to, an electronic, optical, magnetic, or other storage or transmission device capable of providing a processor, such as the processor 210 of client 202a, with computer- readable instructions. Other examples of suitable media may include, but are not limited to, a floppy disk, CD-ROM, DVD, magnetic disk, memory chip, ROM, RAM, an ASIC, a configured processor, all optical media, all magnetic tape or other magnetic media, or any other medium from which a computer processor can read instructions. Also, various other forms of computer-readable media may transmit or carry instructions to a computer, including a router, private or public network, or other transmission device or channel, both wired and wireless. The instructions may comprise code from any computer-programming language, including, for example, C, C++, C#, Visual Basic, Java, Python, Perl, and JavaScript

[000126] Member devices 202a-n may also comprise a number of external or internal devices such as a mouse, a CD-ROM, DVD, a keyboard, a display, or other input or output devices. Examples of client devices 202a-n may be personal computers, digital assistants, personal digital assistants, cellular phones, mobile phones, smart phones, pagers, digital tablets, laptop computers, Internet appliances, and other processor-based devices. In general, a client device 202a may be any type of processor-based platform that is connected to a network 206 and that interacts with one or more application programs. Client devices 202a-n may operate on any operating system capable of supporting a browser or browser-enabled application, such as Microsoft™, Windows™, or Linux. The client devices 202a-n shown may include, for example, personal computers executing a browser application program such as Microsoft Corporation's Internet Explorer™, Apple Computer, Inc.'s Safari™, Mozilla Firefox, and Opera. Through the client devices 202a-n, users, 212a-n communicate over the network 206 with each other and with other systems and devices coupled to the network 206. As shown in FIG. 2, server devices 204 and 213 may be also coupled to the network 206.

[000127] In some embodiments, the term "mobile electronic device" may refer to any portable electronic device that may or may not be enabled with location tracking functionality. For example, a mobile electronic device can include, but is not limited to, a mobile phone, Personal Digital Assistant (PDA), Blackberry™, Pager, Smartphone, or any other reasonable mobile electronic device. For ease, at times the above variations are not listed or are only partially listed, this is in no way meant to be a limitation.

[000128] For purposes of the instant description, the terms "cloud," "Internet cloud," "cloud computing," "cloud architecture," and similar terms correspond to at least one of the following: (1) a large number of computers connected through a real-time communication network (e.g., Internet); (2) providing the ability to run a program or application on many connected computers (e.g., physical machines, virtual machines (VMs)) at the same time; (3) network-based services, which appear to be provided by real server hardware, and are in fact served up by virtual hardware (e.g., virtual servers), simulated by software running on one or more real machines (e.g., allowing to be moved around and scaled up (or down) on the fly without affecting the end user). In some embodiments, the instant invention offers/manages the cloud computing/architecture as, but not limited to: infrastructure as a service (IaaS), platform as a service (PaaS), and software as a service (SaaS). Figures 3 and 4 illustrate schematics of exemplary implementations of the cloud computing/architecture.

[000129] Of note, the embodiments described herein may, of course, be implemented using any appropriate computer system hardware and/or computer system software. In this regard, those of ordinary skill in the art are well versed in the type of computer hardware that may be used (e.g., a mainframe, a mini-computer, a personal computer ("PC"), a network (e.g., an intranet and/or the internet)), the type of computer programming techniques that may be used (e.g., object oriented programming), and the type of computer programming languages that may be used (e.g., C++, Basic, AJAX, Javascript). The aforementioned examples are, of course, illustrative and not restrictive.

[000130] In some embodiments, the present invention is a system, including: at least one server and specialized software stored on a non-transient computer readable medium accessible by the at least one server, where, when executing the specialized software, the at least one server becomes at least one specifically programmed server that is configured to: analyse a plurality of genome sequences obtained from a plurality of organisms, where each of the plurality of the organisms has at least one distinctive genetic element, where a number of organisms in the plurality of the organisms correlates with a genetic diversity level of the plurality of the organisms; assemble at least one DNA sequence corresponding to the genome sequences of each of the plurality of the organisms, generate a plurality of contigs based on the at least one DNA sequence assembled for each of the plurality of the organisms, plot digital representations of the plurality of the contigs into at least one population DeBruijn graph, map the plurality of the contigs based on a plurality of overlapping DNA sequence regions, identify a plurality of unique contigs from the plurality of contigs, exclude the plurality of unique contigs to produce a plurality of non-unique contigs; and assemble the plurality of non-unique contigs so as to result in at least one ancestor genome sequence.

[000131] All publications, patents and patent applications mentioned in this specification are herein incorporated in their entirety by reference into the specification, to the same extent as if each individual publication, patent or patent application was specifically and individually indicated to be incorporated herein by reference. In addition, citation or identification of any reference in this application shall not be construed as an admission that such reference is available as prior art to the present invention. To the extent that section headings are used, they should not be construed as necessarily limiting.

[000132] While a number of embodiments of the present invention have been described, it is understood that these embodiments are illustrative only, and not restrictive, and that many modifications may become apparent to those of ordinary skill in the art.

Claims

CLAIMS: What is claimed is:

1. A method, comprising:

obtaining a whole genome of an organism,

constructing a debruijn graph from the plurality of tagged raw reads,

determining a frequency of each contig of the plurality of contigs,

identifying unique contigs of the plurality of contigs,

wherein each unique contig appears once in the whole genome of the organism, wherein the identifying is performed using the Debruijn graph, identifying a plurality of supplementary contigs,

wherein the identifying is performed using the Debruijn graph,

wherein each supplementary contig is a long contig,

constructing a weighted graph comprising a plurality of nodes,

filtering the connection weight in the graph with low weight,

wherein the low weight of $E_{ij}$ is less than 3,

wherein each group of the first plurality of groups of

wherein an OS is the minimal contig order, and

wherein an OE is the maximal contig order, and

2. The method of claim 1, wherein the mapping step comprises: identifying a set of overlapping barcodes.

3. The method of claim 1, wherein the set of the remaining contigs comprises unique contigs or supplementary contigs.

4. The method of claim 1, wherein the connection weight is measured using a contigs pair, $E_{ij}$.

5. The method of claim 4, wherein the contigs pair is a number of common barcodes within the set of the remaining contigs,

wherein the number of common barcodes is at least one barcode at least in duplicate.

6. The method of claim 1,

wherein the distance matrix between two contigs is measured by:

$\max_{k,l \text{ over all the contigs in the group}}(E_{kl}) - E_{ij}$.

7. The method of claim 1,

wherein the distance matrix is reordered to indicate that adjacent contigs are separated by a corresponding distance.
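
By way of non-limiting illustration only, the connection weight, the low-weight filtering, and the distance measure recited above may be sketched as follows. The sketch assumes that the connection weight of a contigs pair is simply the count of barcodes shared by the two contigs, that connections with a weight of less than 3 are discarded, and that the maximum is taken over the connections remaining in the group. The function names, the dictionary layout, and the toy barcode data are hypothetical and do not appear in the specification; the "at least in duplicate" condition on common barcodes is not modelled.

```python
from itertools import combinations


def connection_weights(contig_barcodes):
    """Return {(i, j): E_ij}, where E_ij counts the barcodes common to contigs i and j."""
    weights = {}
    for i, j in combinations(sorted(contig_barcodes), 2):
        weights[(i, j)] = len(contig_barcodes[i] & contig_barcodes[j])
    return weights


def filter_low_weights(weights, min_weight=3):
    """Keep only connections whose weight is at least min_weight (low weight: less than 3)."""
    return {pair: w for pair, w in weights.items() if w >= min_weight}


def distances(weights):
    """Return {(i, j): max(E_kl) - E_ij}, with the maximum taken over the given connections."""
    if not weights:
        return {}
    e_max = max(weights.values())
    return {pair: e_max - w for pair, w in weights.items()}


# Hypothetical toy data: each contig maps to the set of barcodes whose reads map to it.
contig_barcodes = {
    "c1": {"b1", "b2", "b3", "b4", "b5"},
    "c2": {"b2", "b3", "b4", "b5", "b6"},
    "c3": {"b4", "b5", "b6", "b7"},
    "c4": {"b9"},
}

weights = connection_weights(contig_barcodes)
kept = filter_low_weights(weights)
dist = distances(kept)
```

In this toy example, contigs sharing more barcodes receive a smaller distance, so a pair that is likely adjacent on the original long DNA molecule ends up close together when the distance matrix is reordered.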