WO2019204702A1 - Error-correcting DNA barcodes - Google Patents

Error-correcting DNA barcodes

Info

Publication number
WO2019204702A1
Authority
WO
WIPO (PCT)
Prior art keywords
barcode
barcodes
genetic
corrupt
error
Prior art date
Application number
PCT/US2019/028279
Other languages
English (en)
Inventor
William H. PRESS
John Hawkins
Original Assignee
Board Of Regents, The University Of Texas System
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Board Of Regents, The University Of Texas System
Publication of WO2019204702A1

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B35/00 ICT specially adapted for in silico combinatorial libraries of nucleic acids, proteins or peptides
    • G16B35/10 Design of libraries
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68 Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6869 Methods for sequencing
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids

Definitions

  • the disclosure generally relates to methods to correct errors in digitized genetic barcode sequences, including substitution, insertion, and deletion errors that occur in pooled populations of diverse genetic materials.
  • DNA barcodes are short, unique DNA sequences that are coupled to each member in the population (FIG. 1A).
  • DNA barcode-based identification is central to such diverse applications as single-cell genome and RNA sequencing [1-7], gene synthesis [8, 9], high-throughput antibody screens [10, 11], and drug discovery [12, 13].
  • Such experiments have been enabled by recent breakthroughs in massively-parallel, pooled DNA synthesis [14, 15].
  • a recent study used DNA barcodes to discover small molecule inhibitors of enzymes by screening about 10^8 small molecules. Each small molecule was attached to a unique set of three DNA barcodes.
  • the highest affinity ligands were enriched via multiple rounds of selection and then identified via high-throughput sequencing of the attached barcodes [16].
  • the rapid growth of such methodologies in all areas of biomedicine requires the development of large pools (>10^6 members) of unique DNA barcodes to identify individual members (e.g., cells, proteins, drugs) in heterogeneous ensembles.
  • Every assay using DNA barcodes is subject to errors introduced during DNA synthesis and sequencing. These errors decrease experimental power and accuracy by confounding the identity of individual biomolecules in the population. Base pair insertions and deletions are particularly challenging to decode because these mutations cause a frameshift in all downstream sequencing. Applying a manufacturer-advertised error rate of up to 1 per 200 nucleotides (nt) [17] to a 20 base pair (bp) long barcode with no error correction translates to a best-case scenario of 10% of data lost or, worse, incorrectly interpreted. Next-generation sequencing also has error rates between 10^-3 and 10^-4. This alone represents errors in approximately 1% of the above example 20 bp barcodes, which can be limiting for detection of rare events. These errors can be overcome through the use of error-correcting DNA barcodes: DNA sequences that can correctly identify the underlying individuals in a pooled experiment even in the presence of DNA sequencing and synthesis errors.
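The arithmetic behind these figures can be sketched as follows. This is a minimal illustration, assuming errors occur independently at each base and reading the sequencing error rates as 10^-3 to 10^-4 per base; the function name `fraction_with_error` is hypothetical and is not part of the disclosed software.

```python
def fraction_with_error(per_base_error_rate: float, length: int) -> float:
    """Probability that a barcode of the given length contains at least one error,
    assuming independent errors at each base (an assumption of this sketch)."""
    return 1.0 - (1.0 - per_base_error_rate) ** length


# Synthesis: up to 1 error per 200 nt applied to a 20-base barcode.
synthesis = fraction_with_error(1 / 200, 20)

# Sequencing: per-base error rates between 1e-3 and 1e-4 on the same barcode.
sequencing_high = fraction_with_error(1e-3, 20)
sequencing_low = fraction_with_error(1e-4, 20)

print(f"synthesis:  {synthesis:.3f}")
print(f"sequencing: {sequencing_low:.4f} to {sequencing_high:.4f}")
```

Under these assumptions the synthesis figure comes out just under the 10% quoted above, and the sequencing range brackets the quoted figure of roughly 1%.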
  • Error-correcting barcodes should efficiently detect and correct DNA sequencing and synthesis errors.
  • Many current DNA barcode strategies repurpose error-correcting codes developed for computers [18, 19], such as Hamming or Reed-Solomon codes, to DNA applications [20, 21].
  • Hamming distance, which describes the number of substitutions between two sequences of equal length, is possibly the most used due to its simplicity.
  • nearly all well-studied error-correcting codes developed in computer science, including the widely-used Hamming codes, were not designed to handle deletions and insertions. Such codes are generally used to only detect errors without correcting them. However, this method also fails to account for the possibility that a single error (e.g., deletion) can convert one barcode into another.
  • Levenshtein codes, also known as edit codes, can theoretically account for all three types of common error (substitutions, insertions, and deletions), but only when the corrupted length of each barcode after errors is known [22, 23]. This is a critical limitation in real-world DNA barcode applications because errors can change the barcode length unpredictably, which can lead to erroneous decoding of Levenshtein-based barcodes in the context of a longer read (FIG. 1B).
  • Levenshtein codes can be used at twice the level of error correction desired for a given application, for example using a 2-error-correcting code when a 1-error-correcting code is desired, but this is inefficient and significantly decreases the number of valid barcodes for a given oligonucleotide length.
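To make the limitation concrete, the sketch below computes Hamming distance and compares a single substitution with a single deletion. The barcode, the downstream bases, and the function name are illustrative assumptions, not sequences from the disclosure.

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


barcode = "ACGTACGT"  # hypothetical 8-base barcode

# One substitution (T -> A at index 3) costs exactly one mismatch.
substituted = "ACGAACGT"

# One deletion (the T at index 3) shifts every later base left; the 8-base
# window is then refilled from whatever follows the barcode (here "GG").
shifted = (barcode[:3] + barcode[4:] + "GG")[:8]

print(hamming_distance(barcode, substituted))
print(hamming_distance(barcode, shifted))
```

A single deletion produces many apparent mismatches, which is why a code designed only around Hamming distance cannot be relied on to correct indels.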
  • existing DNA barcode strategies are unable to efficiently detect and decode real-world errors encountered during DNA synthesis and sequencing.
  • a genetic barcode may be selected for a member of the population, and that genetic barcode will have been created in a manner that allows for a given error tolerance in the barcode at decoding.
  • the compositions and methods disclosed herein address these and other needs.
  • FREE barcodes can correct substitutions, insertions, and deletions even when the edited length of the barcode is unknown, which is a significant advantage over known DNA barcode error-correcting methods.
  • FREE barcodes are designed with experimental considerations in mind, including balanced GC content, minimal homopolymer runs, absence of GGC motifs, and no self-complementarity of more than two bases to reduce internal hairpin propensity. Lists of barcodes were generated with different lengths and error-correction levels that may be broadly useful in diverse high-throughput applications.
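A candidate filter along these lines might look like the sketch below. The thresholds (GC fraction between 0.4 and 0.6, runs of at most two identical bases, stems of three bases) are assumptions chosen for illustration, all function names are hypothetical, and the self-complementarity check ignores hairpin loop length.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def gc_fraction(seq: str) -> float:
    return sum(base in "GC" for base in seq) / len(seq)


def has_homopolymer(seq: str, run_length: int) -> bool:
    """True if any base repeats run_length or more times in a row."""
    return any(base * run_length in seq for base in "ACGT")


def has_self_complementarity(seq: str, k: int) -> bool:
    """True if some k-mer has its reverse complement at a later,
    non-overlapping position, so the two could pair into a hairpin stem."""
    for i in range(len(seq) - k + 1):
        target = reverse_complement(seq[i:i + k])
        if seq.find(target, i + k) != -1:
            return True
    return False


def passes_filters(seq: str, gc_min: float = 0.4, gc_max: float = 0.6,
                   max_run: int = 2, stem: int = 3) -> bool:
    """Apply the four design filters; all thresholds are assumed values."""
    return (gc_min <= gc_fraction(seq) <= gc_max
            and not has_homopolymer(seq, max_run + 1)
            and "GGC" not in seq
            and not has_self_complementarity(seq, stem))


for candidate in ("ACCTGAAC", "ACGTACGT", "AAAGCGCT", "AGGCTACT"):
    print(candidate, passes_filters(candidate))
```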
  • hairpin melting temperatures were calculated which can be used to select subsets of barcodes compatible with experimental conditions.
  • the largest barcode list includes >10^6 unique error-correcting barcodes usable in a single experiment. Moreover, appending two or more barcodes together in combination increases the total barcode set, producing >10^9-10^12 unique error-correcting DNA barcodes.
  • the included software for creating new barcode libraries and decoding/error-correcting observed barcodes is fast and efficient, decoding >120,000 barcodes per second with a single processor, and is designed to be user friendly for a broad biologist community.
  • Modern high-throughput biological assays study pooled populations of individual members by labeling each member with a unique DNA sequence called a “barcode.”
  • DNA barcodes are frequently corrupted by DNA synthesis and sequencing errors, leading to significant data loss and incorrect data interpretation.
  • Disclosed herein is a novel error-correction strategy to improve the efficiency and statistical power of DNA barcodes.
  • the error-correcting method accurately handles insertions and deletions in DNA barcodes, the most common type of error encountered during DNA synthesis and sequencing, resulting in order-of-magnitude increases in accuracy, efficiency, and signal-to-noise.
  • the accompanying software package makes deployment of these barcodes effortless for the broader experimental scientist community.
  • a computerized method of generating a set of unique genetic barcodes for attaching to members of a pooled population of samples represented by respective genetic sequences. The set of genetic barcodes has boundaries established by controlling the presence and absence of selected combinations of base values when the selected combinations represent disallowed physical characteristics within the genetic samples.
  • the method includes (i) identifying a common length of the barcodes in the set, wherein the length is a given number of base positions present in each genetic barcode, (ii) identifying base values available to fill each base position, wherein the respective genetic barcode is subject to error formation in the base values, such that at least one of the respective genetic sequences has a corrupt barcode, and (iii) identifying a center barcode from which a sphere of multiple corrupt barcodes may be generated.
  • Each of the multiple corrupt barcodes refers back to the center barcode also in the sphere.
  • generating the multiple corrupt barcodes utilizes a computer executing software configured to receive an acceptable error level for the sphere of multiple corrupt barcodes, wherein the error level corresponds to the number of edits necessary in any one of the multiple corrupt barcodes to match the center barcode and generates combinations of bases filling the length of respective corrupt barcode sequences, comprising errors in the base values within the acceptable error level, such that the corrupt barcode sequences pack the sphere having the center barcode.
  • the computerized method includes (i) identifying an original length of a generated barcode from a set of unique genetic barcodes, wherein the original length comprises a given number of base positions present in the generated barcode, (ii) identifying a barcode starting value position along a respective genetic sequence of a genetic sample, wherein the starting value position is occupied by a first base of a respective genetic barcode, and (iii) identifying a center barcode within a sphere of multiple corrupt barcodes such that each of the multiple corrupt barcodes refers back to the center barcode also in the sphere.
  • Identifying the center barcode includes a computer executing software configured to receive an acceptable error level for the sphere of multiple corrupt barcodes, wherein the error level corresponds to the number of edits necessary in any one of the corrupt barcodes to match the center barcode.
  • the computer is further configured to utilize the barcode starting value position within the genetic sequence and evaluating bases up to the length of the generated barcode.
  • the method stores the evaluated bases as a corrupt barcode in a computerized memory buffer and identifies a respective decode sphere in which the corrupt barcode exists. Finally, the method includes decoding the corrupt barcode to the center barcode of the respective decode sphere.
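The decoding steps above can be sketched as a table lookup. This is a minimal illustration, assuming the decode sphere is the set of all words reachable from a center barcode by up to m edits followed by fill or truncation to the barcode length; the two toy barcodes and every function name are hypothetical and are not taken from the disclosed software.

```python
from itertools import product

BASES = "ACGT"


def single_edits(seq):
    """seq itself plus every word one substitution, insertion, or deletion away."""
    results = {seq}
    for i in range(len(seq)):
        results.add(seq[:i] + seq[i + 1:])              # deletion
        for b in BASES:
            results.add(seq[:i] + b + seq[i + 1:])      # substitution
    for i in range(len(seq) + 1):
        for b in BASES:
            results.add(seq[:i] + b + seq[i:])          # insertion
    return results


def fill_or_truncate(seq, n):
    """Words observed in an n-base window: truncate long words,
    fill short ones with every possible downstream base."""
    if len(seq) >= n:
        return {seq[:n]}
    return {seq + "".join(t) for t in product(BASES, repeat=n - len(seq))}


def decode_sphere(barcode, m=1):
    words = {barcode}
    for _ in range(m):
        words = set().union(*(single_edits(w) for w in words))
    sphere = set()
    for w in words:
        sphere |= fill_or_truncate(w, len(barcode))
    return sphere


def build_decode_table(barcodes, m=1):
    """Map every corrupt word to its center barcode; spheres must be disjoint."""
    table = {}
    for center in barcodes:
        for word in decode_sphere(center, m):
            if table.get(word, center) != center:
                raise ValueError(f"decode spheres overlap at {word}")
            table[word] = center
    return table


def decode(read, table, start, n):
    """Take n bases from the barcode start position and look up the center."""
    return table.get(read[start:start + n])


table = build_decode_table(["ACACAC", "GTGTGT"], m=1)   # toy 6-base code
clean_read = "ACACAC" + "GGTTGG"
corrupted_read = clean_read[:2] + clean_read[3:]         # delete one base
print(decode(corrupted_read, table, start=0, n=6))
print(decode("GGGGGGTT", table, start=0, n=6))
```

The first lookup recovers the center barcode despite the deletion; the second window lies in neither sphere, so the sketch reports no match. Precomputing the table trades memory for constant-time decoding per read.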
  • FIG. 1(A-C) are schematics depicting applications and error-correction strategies of DNA barcodes.
  • FIG. 1A depicts illustrative examples of high-throughput sequencing assays that require large lists of error-correcting DNA barcodes. Barcodes can be used to identify individual cells or molecules in pooled libraries (Klein, 2015; Fan, 2008; Melkko, 2004).
  • FIG. 1B depicts current strategies to correct synthesis and sequencing errors in DNA barcodes that are confounded by insertions and deletions.
  • Hamming distance can only handle substitutions.
  • Levenshtein distance is limited by the fact that barcodes are prepended to other sequences of interest. Insertions and/or deletions (“indels”) produce phantom Levenshtein distance errors when bases from the remaining DNA molecule shift into or out of the barcode window.
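The phantom error can be reproduced with a standard dynamic-programming Levenshtein distance. The barcode and downstream bases below are illustrative assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of substitutions, insertions, and deletions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # match or substitute
        prev = cur
    return prev[-1]


barcode = "ACGTACGT"      # hypothetical barcode prepended to a longer molecule
downstream = "GGCC"       # hypothetical bases following the barcode

# A single deletion inside the barcode pulls one downstream base into the window.
read = barcode[:3] + barcode[4:] + downstream
window = read[:len(barcode)]

print(window, levenshtein(barcode, window))
```

One real edit is scored as two: the deletion itself plus the base that shifted into the barcode window.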
  • FIG. 1C depicts examples of FREE divergence given an actual edit history. Levenshtein and Hamming distances are also shown for comparison. A substitution and insertion are correctly attributed as two edits by FREE divergence (first column).
  • Indels can have zero cost, particularly near the end of the barcode where they can occasionally be undone by fill or truncation (fourth column). Edits past the barcode end can matter since the fill/truncation step happens only upon observation (fifth column).
  • FIG. 2(A-D) are schematics and graphs showing FREE barcode generation and decoding.
  • FIG. 2A shows error-correcting barcode generation as a sphere packing problem.
  • Reserved around each accepted barcode B (e.g., “CTCA”) is DecodeSphere_m(B), the set of all sequences within FREE divergence m of B. That is, the set of all sequences with any combination of up to m errors from B, followed by fill or truncation as necessary.
  • Any set of disjoint decode spheres is a valid FREE code (FIG. 2A, right panel).
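One simple way to obtain a set of disjoint decode spheres is a greedy scan, sketched below. This is an assumption for illustration and not necessarily the disclosed generation algorithm: candidates are visited in lexicographic order, no synthesis or sequencing filters are applied, and all names are hypothetical.

```python
from itertools import product

BASES = "ACGT"


def decode_sphere(barcode, m=1):
    """All words within m edits of barcode, then filled or truncated to its length."""
    n = len(barcode)
    words = {barcode}
    for _ in range(m):
        edited = set(words)
        for w in words:
            for i in range(len(w)):
                edited.add(w[:i] + w[i + 1:])                        # deletion
                edited.update(w[:i] + b + w[i + 1:] for b in BASES)  # substitution
            for i in range(len(w) + 1):
                edited.update(w[:i] + b + w[i:] for b in BASES)      # insertion
        words = edited
    sphere = set()
    for w in words:
        if len(w) >= n:
            sphere.add(w[:n])                                        # truncation
        else:
            sphere.update(w + "".join(t)
                          for t in product(BASES, repeat=n - len(w)))  # fill
    return sphere


def greedy_free_code(n, m=1):
    """Accept a candidate only if its sphere is disjoint from all reserved words."""
    reserved, accepted = set(), []
    for letters in product(BASES, repeat=n):
        candidate = "".join(letters)
        sphere = decode_sphere(candidate, m)
        if reserved.isdisjoint(sphere):
            accepted.append(candidate)
            reserved |= sphere
    return accepted


codes = greedy_free_code(4, m=1)
print(len(codes), codes)
```

Because each accepted sphere is reserved before the next candidate is tested, the result is a valid code in the sense above, though a greedy scan gives no guarantee of an optimal packing.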
  • FIG. 2B is a graph showing the number of single- and double-error correction barcodes generated for a range of barcode lengths.
  • FIG. 2C shows that the herein disclosed methods and software can decode more than 120,000 barcodes per second for all barcode lengths considered here.
  • FIG. 2D shows a comparison of FREE barcode counts against pruned Hamming codes and Levenshtein codes.
  • Hamming codes were pruned to remove members that did not decode FREE divergence errors, while Levenshtein codes were produced at double the error-correction levels for the same purpose.
  • FREE codes produce more barcodes than either of the other methods for all barcode lengths.
  • FIG. 3(A-C) shows experimental measurement of synthesis and sequencing error rates.
  • FIG. 3A is a schematic of the DNA constructs used for barcode validation experiments.
  • Each member in the synthetic library had a unique pair of left and right barcodes drawn from a list of more than 8,000 17-nucleotide FREE codes with double-error correction.
  • FIG. 3B is a graph showing the measured synthesis error rates, by intended reference base and error type: substitution (sub), deletion (del), and insertion (ins).
  • FIG. 3C is a graph showing measured sequencing substitution error rates, by reference base. Insertions and deletions from Illumina sequencing are extremely rare and are omitted for clarity.
  • FIG. 4(A-B) are graphs showing decoding corrupted barcodes from simulated errors.
  • FIG. 4A shows modeled and simulated decoding error rates given per-base error rate for length 8 barcodes and length 16 barcodes. Barcode sets are labeled according to length and number of errors corrected.
  • FIG. 4B shows that the 16-2 code is length 16 and corrects up to 2 errors.
  • Solid lines show the error rate approximations using a binomial model. Circles and triangles show direct simulation error rates for single- and double-error correcting codes, respectively. Substitution, insertion, and deletion errors each have simulated error rates P(error per base)/3 for simplicity.
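One plausible reading of the binomial model is that decoding fails whenever more than m of the n bases are hit by an error, with errors independent at rate p per base. The sketch below implements that reading; the formula and the function name are assumptions, not quoted from the disclosure.

```python
from math import comb


def decode_failure_rate(n: int, m: int, p: float) -> float:
    """P(more than m errors among n bases) under independent per-base errors."""
    p_correctable = sum(comb(n, k) * p**k * (1 - p)**(n - k)
                        for k in range(m + 1))
    return 1.0 - p_correctable


# Labels follow the length-errors convention used for the barcode sets.
for n, m in [(8, 1), (16, 1), (16, 2)]:
    print(f"{n}-{m}: {decode_failure_rate(n, m, 0.01):.2e}")
```

Under this model, correcting one more error lowers the failure rate sharply at a fixed length, while a longer barcode with the same correction level fails more often because it presents more bases to be hit.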
  • FIG. 5 is a graph showing decoded corrupted barcodes from experimental data. Observed decoding error rates are compared with theoretical rates from the synthesis and sequencing error rates.
  • FIG. 6(A-D) shows combined barcode libraries via concatenation of FREE barcodes.
  • FIG. 6A shows concatenated barcodes can be decoded sequentially in a left-to-right order, even when the end position of each edited sub-barcode is not initially known.
  • the decoded first FREE sub-barcode can be used to find the starting position of the next sub-barcode, and similarly for subsequent sub-barcodes.
  • FIG. 6B shows concatenated barcode decoding error rates.
  • Concatenated barcode labels use the following format: a 3x(16-1) barcode consists of three concatenated sub-barcodes, each of which is 16 bp long and can correct up to 1 error.
  • FIG. 6C and FIG. 6D show concatenating multiple barcodes combining to increase the numbers of effective FREE barcodes.
  • Concatenated barcodes can correct the same number of errors per sub-barcode. When the errors are distributed evenly among the sub-barcodes, concatenated barcodes can correct a higher total number of errors than the individual sub-barcodes.
  • FIG. 6C shows concatenated single-error correcting barcodes.
  • FIG. 6D shows concatenated double-error correcting barcodes. Dashed lines: projected quantities calculated by sampling; dotted lines: log-linear projections.
  • FIG. 7(A-C) are graphs showing decode sphere volumes and code efficiency.
  • FIG. 7A shows that unlike Hamming decode spheres, FREE divergence decode spheres do not have uniform volume due to degeneracy of insertions and deletions.
  • the sequence AACT only has three unique deletions because a deletion of either A generates the same resulting sequence.
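The degeneracy is easy to check by enumerating every single-base deletion and collecting the distinct results. The helper name is hypothetical, and ACGT is included only as a contrasting sequence with no repeated neighbors.

```python
def unique_single_deletions(seq: str) -> set:
    """Distinct sequences obtained by deleting exactly one base from seq."""
    return {seq[:i] + seq[i + 1:] for i in range(len(seq))}


print(sorted(unique_single_deletions("AACT")))
print(sorted(unique_single_deletions("ACGT")))
```

Deleting either A from AACT yields ACT, so four deletion positions give only three distinct words, whereas ACGT gives four.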
  • Sphere volumes of 1- and 2-error codes are shown for all words and for only valid code words after application of FREE code synthesis and sequencing filters (no homopolymer runs, no triplet complementarity, etc.). Black lines are explained in FIG. 7B below.
  • FIG. 7B shows optimal sphere packing bounds.
  • the optimal packing for an error-correcting code is not known in general.
  • Typical code generating algorithms, including the herein disclosed algorithms, are instead heuristics for finding relatively good codes.
  • the volumes of FREE divergence decode spheres are not uniform, so the volume of every sphere in the space is instead determined, sorted, and the minimum number of barcodes at which the cumulative sum of barcode sphere volumes is smaller than the space is determined.
  • the upper bound calculated for valid code words is shown.
  • the volume at which that happens for each code is shown in FIG. 7A as black lines.
  • the lower bound is the best efficiency achieved by any code generation method to date, which for FREE codes is simply the number of barcodes reported herein.
  • the actual maximum possible number of barcodes is somewhere between the two.
  • FIG. 7C shows raw and actual code rates for each FREE barcode set disclosed herein as well as the asymptotic values they approach.
  • FIG. 8(A-B) are graphs depicting error rate simulations by error type.
  • FIG. 8A and FIG. 8B show the simulations performed for FIG. 4, repeated for each error type - substitutions (top panel; also called mismatches), deletions (middle panel), insertions (bottom panel) - individually.
  • Barcode sets are labeled according to length and number of errors corrected; for example, the 16-2 code is length 16 and corrects up to 2 errors. Mismatches follow the binomial approximation closely, while deletions and especially insertions perform slightly better than the binomial approximation.
  • FIG. 8A is shown for base pair length 8.
  • FIG. 8B is shown for base pair length 16.
  • FIG. 9 is a set of graphs depicting error rate comparison with constant barcode length.
  • the barcode length used for each is denoted at the top of each individual graph.
  • FIG. 10 is a set of graphs depicting error rate comparison with constant barcode number of errors corrected. The graphs show the binomial approximation of the decode error rate as a function of the error rate per base, grouped by given number of errors corrected.
  • FIG. 11 is a set of graphs depicting error rate comparison with constant number of barcodes.
  • FIG. 12 is a graph depicting a coverage histogram and statistics for the FREE code validation experiment. Each of the 8,684 oligos was observed with average coverage of 159x.
  • FIG. 13 is a graph depicting maximum error run length probabilities. The graph shows the probability distribution of maximum consecutive-error run lengths from a model assuming independent errors and from herein disclosed data. The two differ significantly because errors in the herein disclosed data are not independent.
  • FIG. 14 is a set of graphs depicting hairpin melting temperatures (Tm). Hairpin melting temperature CDFs are shown for all barcode libraries included with this manuscript. The barcodes included here nearly all have a Tm < 60°C, and users can further filter the barcode sets to avoid hairpins in their specific experimental conditions.
  • compositions disclosed herein have certain functions. Disclosed herein are certain structural requirements for performing the disclosed functions, and it is understood that there are a variety of structures which can perform the same function which are related to the disclosed structures, and that these structures will ultimately achieve the same result.
  • an agent includes a plurality of agents, including mixtures thereof.
  • the terms “can,” “may,” “optionally,” “can optionally,” and “may optionally” are used interchangeably and are meant to include cases in which the condition occurs as well as cases in which the condition does not occur.
  • the statement that a formulation “may include an excipient” is meant to include cases in which the formulation includes an excipient as well as cases in which the formulation does not include an excipient.
  • Ranges can be expressed herein as from “about” one particular value, and/or to “about” another particular value. When such a range is expressed, another embodiment includes from the one particular value and/or to the other particular value. Similarly, when values are expressed as approximations, by use of the antecedent “about,” it will be understood that the particular value forms another embodiment. It will be further understood that the endpoints of each of the ranges are significant both in relation to the other endpoint, and independently of the other endpoint. It is also understood that there are a number of values disclosed herein, and that each value is also herein disclosed as “about” that particular value in addition to the value itself. For example, if the value “10” is disclosed, then “about 10” is also disclosed.
  • a polynucleotide e.g. DNA or RNA
  • a polynucleotide comprises a sequence of nucleotides that enables it to non-covalently bind to another polynucleotide in a sequence-specific, antiparallel manner (i.e., a polynucleotide specifically binds to a complementary polynucleotide) under the appropriate in vitro and/or in vivo conditions of temperature and solution ionic strength.
  • polynucleotide hybridization is the binding of a polymerase chain reaction (PCR) primer to a polynucleotide template (e.g., in a sequencing reaction).
  • PCR polymerase chain reaction
  • a polynucleotide e.g., a PCR primer or probe
  • a polynucleotide can comprise at least 70%, at least 80%, at least 90%, at least 95%, at least 99%, or 100% sequence complementarity to a target site within the target polynucleotide sequence.
  • a “primer” is a short polynucleotide, generally with a free 3'-OH group that binds to a target or “template” potentially present in a sample of interest by hybridizing with the target, and thereafter promoting polymerization of a polynucleotide complementary to the target.
  • a “polymerase chain reaction” (“PCR”) is a reaction in which replicate copies are made of a target polynucleotide using a “pair of primers” or a “set of primers” consisting of an “upstream” and a “downstream” primer, and a catalyst of polymerization, such as a DNA polymerase, and typically a thermally-stable polymerase enzyme. Methods for PCR are well known in the art, and taught, for example in "PCR: A
  • “polynucleotide” and “oligonucleotide” are used interchangeably and generally refer to a linear polymer of nucleotide monomers of DNA or RNA.
  • Monomers making up a polynucleotide are capable of specifically binding to a second polynucleotide by way of a regular pattern of monomer-to-monomer interactions, such as Watson-Crick type of base pairing, base stacking, Hoogsteen or reverse Hoogsteen types of base pairing, for example.
  • Such monomers and their internucleosidic linkages may be naturally occurring or may be analogs thereof (e.g., naturally occurring or non-naturally occurring analogs).
  • Non-limiting examples of non-naturally occurring analogs include phosphorothioate internucleosidic linkages, bases containing linking groups permitting the attachment of labels, such as fluorophores, or haptens.
  • “oligonucleotide” can refer to (relatively) smaller polynucleotides comprising, for example, 2-100, 3-75, or 3-50 monomeric units.
  • Polynucleotides may, in some instances, include the natural deoxyribonucleosides (e.g., deoxyadenosine, deoxycytidine, deoxyguanosine, and deoxythymidine for DNA or their ribose counterparts for RNA) linked by phosphodiester linkages.
  • polynucleotides can also include non-natural nucleotide analogs (e.g., including modified bases, sugars, or internucleosidic linkages).
  • a polynucleotide may be represented by a sequence of letters (upper or lower case), such as “ATGCCTG,” and it will be understood that the nucleotides are in 5'→3' order from left to right and that “A” denotes deoxyadenosine, “C” denotes deoxycytidine, “G” denotes deoxyguanosine, and “T” denotes deoxythymidine, and that “I” denotes deoxyinosine, and “U” denotes deoxyuridine (in the case of RNA), unless otherwise indicated or implied from context. Unless otherwise noted the terminology and atom numbering conventions will follow those disclosed in Strachan and Read, HUMAN MOLECULAR GENETICS, 2nd ed. (Wiley-Liss, New York
  • a barcode can be used to identify a barcoded sample. Often, one or more barcoded samples are mixed in a heterogeneous pool, which can contain known or unknown components.
  • the use of a barcode facilitates the identification of a barcoded sample. Typically, any given barcode identifies only one barcoded sample, and any given barcoded sample is identified by only one barcode. Thus, the barcode and the barcoded sample can be exclusively assigned. This is conceptually similar to the use of scannable universal product codes (UPC) on consumer products of a grocery store shelf.
  • UPC scannable universal product codes
  • the barcode is a specially designed DNA sequence attached to a barcoded sample and which identifies the barcoded sample for each read output by a DNA sequencer.
  • DNA can be extracted from each barcoded sample, and the unique identifying barcode for that sample is amplified in a polymerase chain reaction (PCR).
  • DNA extracts from multiple barcoded samples, or multiple PCR amplicons from PCRs of separate barcoded samples, can be mixed together (multiplexed) for sequencing numerous samples simultaneously in bulk, thereby reducing time and complexity.
  • in multiplexed sequencing, a barcoded sample can thus be tracked by identification of its barcode.
  • DNA barcodes can be selected to be of sufficient length to generate the desired number of barcodes with sufficient variability to account for common sequencing errors, generally ranging in size from about 2 to about 20 bases, but may be longer or shorter. Longer barcodes will permit higher sequence diversity and typically allow more samples to be combined.
  • the target specific PCR sequences for the forward and reverse PCR primers can be specific for any DNA sequence, in coding or non-coding regions of a target genome, plasmid, or organelle.
  • DNA barcoding has been described for many sequencing platforms. For instance, the first platform of next-generation sequencing (NGS), the GS 20, was developed for barcoding in 2007 (Parameswaran P, Jalili R, Tao L, et al. Nucleic Acids Research. 2007; 35(19):e130). Since that time, barcoding strategies have been made commercially available by Illumina (San Diego, Calif.), Pacific Biosciences (Menlo Park, Calif.), and Thermo Fisher Scientific (Waltham, Mass.), such as the Ion Torrent™ (Thermo Fisher Scientific, Waltham, Mass.) sequencing platforms.
  • NGS next-generation sequencing
  • a high throughput sequencing method can be any sequencing method, with high throughput generally meaning greater than 100 (e.g., 1,000 or more) reads per run.
  • Next-generation sequencing refers to modern high throughput sequencing platforms that parallelize the sequencing process, producing thousands or millions of sequences concurrently, in contrast to less-efficient and more expensive standard dye-terminator methods.
  • Non-limiting examples of NGS methods include single molecule real-time sequencing (also referred to as Pacific Biosciences or PacBio), ion semiconductor (also referred to as Ion Torrent sequencing), pyrosequencing (also referred to as Roche 454), sequencing by synthesis (also referred to as Illumina sequencing), sequencing by ligation (also referred to as SOLiD sequencing) and chain termination sequencing (also referred to as Sanger sequencing).
  • High throughput DNA sequencing can be carried out on pooled, barcoded PCR amplicon DNA sequences as described above, producing a file with individual DNA sequences from all samples in random order.
  • the high throughput DNA sequencing is a next-generation sequencing (NGS) method, such as single-molecule real-time sequencing, ion semiconductor sequencing, pyrosequencing, sequencing by synthesis, sequencing by ligation and chain termination sequencing.
  • Obtained DNA sequences from each barcoded sample (or PCR amplicons thereof) can be identified by barcode after sequencing and all reads are sorted into files by barcode.
  • separating the DNA sequences and identifying target DNA sequences and/or barcode sequences can be performed by computer implemented methods, and can further employ database searches.
  • the databases may include online databases such as BLAST
  • barcodes can be designed to exhibit improved read accuracy for sequencing using a sequence-by-synthesis platform (as discussed previously), which can include fluorophore-labeled nucleotide sequencing platforms or non-labeled sequencing platforms, such as, for example, the Ion PGM™ and Ion Proton™ Sequencers, and the Ion S5™ and Ion S5 XL Next Generation™ Sequencing System.
  • Design of the barcodes is not limited to any particular instrument platform or sequencing technology, however.
  • the barcode has a number of nucleotide bases (sometimes referred to as a barcode length) of at least 3 bases, at least 4 bases, at least 5 bases, at least 6 bases, at least 7 bases, at least 8 bases, at least 9 bases, at least 10 bases, at least 11 bases, at least 12 bases, at least 13 bases, at least 14 bases, at least 15 bases, at least 16 bases, at least 17 bases, at least 18 bases, at least 19 bases, or at least 20 bases.
  • the barcode has a number of nucleotide bases from 3 to 20 bases, from 3-18 bases, from 3-16 bases, from 3-15 bases, from 3-12 bases, or from 3-8 bases.
  • two or more barcodes can be used in combination in a concatenated sequence.
  • Combined use of barcodes in concatenated barcodes facilitates generation of a high number of barcodes (e.g., over 10^6 and even 10^12 or more).
  • Combined use of barcodes, as it relates to concatenated barcodes, refers to the combining of two or more barcodes from a DecodeSphere in a single polynucleotide to form a larger barcode (a “concatenated barcode”).
  • a concatenated barcode comprises sub-barcodes in a single polynucleotide wherein the sub-barcodes are disposed along the polynucleotide sufficiently close to an adjacent sub-barcode such that the concatenated barcode can be identified from a single amplicon formed from a PCR amplification reaction.
  • a sub-barcode can be disposed along the polynucleotide within a nucleotide base length of an adjacent sub-barcode of 100 bases or less, 75 bases or less, 50 bases or less, 25 bases or less, 10 bases or less, 9 bases or less, 8 bases or less, 7 bases or less, 6 bases or less, 5 bases or less, 4 bases or less, 3 bases or less, 2 bases or less, or 1 base or less.
  • a first sub-barcode can be immediately adjacent to a second sub-barcode (e.g., the 3’ end of the first sub-barcode is directly and covalently linked by a phosphodiester bond to the 5’ end of a second sub-barcode).
  • a concatenated barcode can be defined by an intended sequence, for example a polynucleotide sequence designed by a user, without regard to the
  • a concatenated barcode can comprise sub-barcodes which are of the same base length (number of bases or base pairs; bp) or of different base lengths.
  • a barcoded sample can be any composition of matter capable of attachment, either directly or indirectly, with a polynucleotide comprising a barcode.
  • the barcoded sample can be, for example, a small molecule, a macromolecule (e.g., a nucleic acid, protein, lipid, polysaccharide, or synthetic or biological polymer), a solid support (e.g., a synthetic particle, a bead such as an affinity bead, a plastic, a metallic substance), a vesicle (e.g., liposome, exosome, micelle), an artificial or biological cell, or others.
  • Suitable beads include silica gel beads, controlled pore glass beads, magnetic beads, Dynabeads, Sephadex/Sepharose beads, cellulose beads, polystyrene beads, or any combination thereof.
  • a barcoded sample and a cell are said to be attached when a barcoded sample either enters a cell or is attached to a component (e.g., a lipid) on the surface of a cell.
  • Direct attachment can include, for instance, covalent linkage of a 3’ end of a polynucleotide comprising a barcode to the barcoded sample, for instance a chemically accessible motif of a small molecule capable of covalent linkage to the 3’ end or other portion of a polynucleotide comprising a barcode.
  • Indirect attachment can include a physical attachment but without a direct covalent linkage between the barcoded sample and the polynucleotide comprising a barcode.
  • an intervening component can link the barcoded sample and the polynucleotide comprising a barcode.
  • An example of an indirect attachment includes a linkage between a polynucleotide comprising a barcode and a solid support (e.g., a bead), wherein the solid support is further linked to the barcoded sample (e.g., an antibody or small molecule).
  • the methods can include a plurality of barcodes.
  • the methods can include at least 50, at least 100, at least 250, at least 500, at least 750, at least 1,000, at least 2,000, at least 5,000, at least 10,000, at least 50,000, at least 100,000, at least 500,000, at least 1,000,000 barcodes.
  • the methods can include at least 10^7, at least 10^8, at least 10^9, at least 10^10, at least 10^11, at least 10^12, at least 10^13, at least 10^14, or at least 10^15 barcodes.
  • the total number of barcodes can include single barcodes only, concatenated barcodes only, or both single barcodes and concatenated barcodes.
  • a plurality of barcodes can sometimes be referred to as a barcode library.
  • the term “barcode library” can also refer to a plurality of barcodes, each attached to a single barcoded sample of a plurality of barcoded samples.
  • a polynucleotide (e.g., a polynucleotide comprising a barcode) can be sequenced by any sequencing method known in the art to sequence a polynucleotide.
  • the DNA sequencing method is a high throughput sequencing method, for instance a next-generation sequencing (NGS) method.
  • the DNA sequencing method is a low throughput sequencing method, for example Sanger sequencing.
  • Sanger sequencing is a method of DNA sequencing based on the selective incorporation of chain-terminating dideoxynucleotides by DNA polymerase during in vitro DNA replication.
  • sequencing results and other polynucleotide analyses can be represented and analyzed by digitized sequences using known computer implemented methods.
  • a polynucleotide comprising a barcode can be immobilized on a solid support.
  • the solid support is a bead (e.g., an affinity bead), multi-well plate (e.g., 96-well plate), filter surface, or other solid support.
  • a barcoded polynucleotide can be attached to a solid support at one end, and attached (directly or indirectly) to a barcoded sample at the other end.
  • the technology disclosed herein takes advantage of a highly efficient way to tag samples in a pooled set of specimens with a genetic barcode as shown in Figure 1A (5A-5C) and account for likely corruption of the barcode sequences during experimentation, genetic synthesis, and genetic sequencing operations.
  • the corruption is caused by naturally occurring errors in genetic sequences that change certain portions of a genetic barcode sequence from an originally generated genetic barcode (25) to a different sequence that does not match the original barcode exactly and may lead to errors for certain identification purposes, absent additional processing.
  • This disclosure utilizes computers, computer processors, and associated kinds of computerized memory to provide additional processing functionality that enables even corrupted barcodes to be decoded back to an intended original genetic barcode, which in turn identifies a property of the specimen.
  • the identified property of the specimen includes, but is not limited to, a source of the specimen, a characteristic of the specimen, or any information or metadata that can be useful to track in regard to the specimen to which the genetic barcode is attached.
  • Figure 1B illustrates that the genetic barcodes (25) described herein start with a particular set of base values (27) filling the base positions (20) of a genetic barcode having a defined length (145). The base values, however, may change during the course of genetic synthesis and genetic sequencing operations.
  • FIG. 1C shows that originally generated genetic barcodes (25) have base values with a starting position (45A-45E) in an overall genetic sequence for a sample.
  • the base values fill the base positions in the barcode sequence up to the length (147) of the genetic barcode (25).
  • these base values (27) in an expected configuration of the original genetic barcode (25) are subject to actual edits (53) such that the observed version (55) of the barcode (corrupted with mutations in the base values) is different due to truncation and fill effects as shown.
  • One goal of the technology described herein lies not only in identifying the corrupt barcodes (63) as observed after various edits, but also in using those corrupt barcodes to reliably determine which of the originally generated genetic barcodes (25) the corrupt barcode (63) should refer back to after decoding.
  • the set of genetic barcodes has boundaries (33) established by controlling the presence and absence of selected combinations of base values (20) when the selected combinations represent disallowed physical characteristics within the genetic samples. Controlling select combinations of base values involves implementing lexicographical constraints (e.g., constraining the genetic code in a sequence) and identifying base values (27) available to fill each base position in terms of “GC” code content, disallowing homopolymer triplets, disallowing triplet self-complementarity, and disallowing GGC base values (an Illumina error motif).
  • the method includes (i) identifying a common length (145) of the barcodes (25) in the set (23), wherein the length (145) is a given number of base positions (20) present in each genetic barcode (25), (ii) identifying base values (27) available to fill each base position (20), wherein the respective genetic barcode (25) is subject to error formation in the base values, such that at least one of the respective genetic sequences has a corrupt barcode (63), and (iii) identifying a center barcode (125) from which a sphere (70) of multiple corrupt barcodes (63) may be generated.
  • Each of the multiple corrupt barcodes (63) refers back to the center barcode (125) also in the sphere (70).
  • the method also includes generating the multiple corrupt barcodes (63) by using a computer executing software configured to receive an acceptable error level (herein, “m”) for the sphere (70) of multiple corrupt barcodes, wherein the error level corresponds to the maximum number of edits necessary in any one of the multiple corrupt barcodes (63) to match the center barcode (125) and generates combinations of base values (27) filling the length (145) of respective corrupt barcode (63) sequences.
  • the corrupt barcodes (63) within a given sphere (70) and within given set (33) have errors in the base values (27) within the acceptable error level, such that the corrupt barcode sequences pack the sphere (70) having the center barcode (125).
  • Generating combinations of base values (27) having up to the maximum number of errors allowed by the error level includes adjusting base values (27) from the center barcode (125) with end fills and end truncations of base values at one end of the length of the genetic barcode, up to the entire length (145) of the genetic barcode.
  • the method further includes generating distinct center barcodes (125) for each member of the pooled population of genetic samples (5A-5C) and using respective distinct center barcodes (125) to generate additional spheres of corrupt barcodes that relate back to a corresponding distinct center barcode.
  • the computerized method is configured to result in the sphere and the additional spheres (80, 90, 110) being disjoint, with no common members at the identified error level.
  • genetic barcodes (25) can be concatenated so that multiple barcodes form a single new barcode (i.e., the multiple barcodes become sub-barcodes), and, using the sub-barcodes, new spheres (150, 175, 195) of corrupt barcodes are generated with computerized equipment.
  • the computerized method includes (i) identifying an original length (145) of a generated barcode (25) from a set of unique genetic barcodes, wherein the original length comprises a given number of base positions (20) present in the generated barcode, (ii) identifying a barcode starting value position (45) along a respective genetic sequence of a genetic sample, wherein the starting value position is occupied by a first base value of a respective genetic barcode, and (iii) identifying a center barcode (125) within a sphere (70) of multiple corrupt barcodes (63) such that each of the multiple corrupt barcodes (63) refers back to the center barcode also in the sphere (70).
  • Identifying the center barcode (125) includes a computer executing software configured to receive an acceptable error level for the sphere of multiple corrupt barcodes, wherein the error level corresponds to the maximum number of edits (53) necessary in any one of the corrupt barcodes to match the center barcode (125).
  • the computer is further configured to utilize the barcode starting value position (45) within the genetic sequence and to evaluate bases up to the length (145) of the generated barcode.
  • the method stores the evaluated bases as a corrupt barcode in a computerized memory buffer and identifies a respective decode sphere (70) in which the corrupt barcode (63) exists.
  • the method includes decoding the corrupt barcode to the center barcode of the respective decode sphere.
  • a computerized method as described above for decoding genetic barcodes further includes identifying the respective decode sphere in which the corrupt barcode exists by evaluating the base position values (27) after the starting point (45) and up to the original length of the generated barcode (25).
  • the base position values after the starting point (45) and up to the original length (145) of the generated barcode include filled positions at an end of the length, wherein the filled positions do not match the corresponding positions of the generated barcode.
  • the base position values after the starting point and up to the original length of the generated barcode do not include base positions from the originally generated barcode that have been truncated at an end of the length.
  • the computerized method identifies a true length (145) of the first corrupt barcode and from that identifies a next start position in the genetic sequence for a sequential barcode (175, 195) to be decoded.
  • the hardware and software used to enable this disclosure will be sufficient to provide for sets of genetic barcodes to be stored, subject to entry in a look-up table or other matching means, and configured with a graphical user interface to receive at least one of an acceptable error level, an acceptable probability of a number of errors in a corrupted barcode, and a genetic barcode length.
  • DNA barcodes, short DNA sequences prepended to DNA libraries, are used for identification of individuals in pooled biomolecule populations.
  • DNA synthesis and sequencing errors confound the correct interpretation of observed barcodes and can lead to significant data loss or spurious results.
  • Widely-used error-correcting codes borrowed from computer science e.g., Hamming and Levenshtein codes
  • Disclosed herein are experimentally validated FREE (Filled/truncated Right End Edit) barcodes, which can correct substitution, insertion, and deletion errors, even when these errors alter the barcode length.
  • FREE barcodes are designed with experimental considerations in mind, including balanced GC content, minimal homopolymer runs, and reduced internal hairpin propensity. Lists of barcodes with different lengths and error-correction levels were generated which may be useful in diverse high-throughput applications, including >10^6 single-error correcting 16-mers that strike a balance between decoding accuracy, barcode length, and library size. Moreover, concatenating two or more FREE codes into a single barcode increases the available barcode space in combination, generating lists with >10^15 error-correcting barcodes. The included software for creating barcode libraries and decoding sequenced barcodes is efficient and designed to be user-friendly for the general biology community.
  • Results
  • FIG. 1C shows a typical example of how FREE divergence captured the actual number of barcode edits in the context of a longer read.
  • An insertion caused the final T to move out of the barcode window, but FREE divergence correctly accounts for its loss.
  • a barcode length n is set, and any DNA sequence of length n is referred to as a word.
  • For a barcode B, the set of all words W such that FreeDiv(B, W) ≤ m is called the m-error decode sphere of B, written as DecodeSphere_m(B), or just DecodeSphere(B) if m is clear from context. Any observed DNA sequence within DecodeSphere(B) will by definition decode to (error-correct to) the center word B (FIG. 2A). Then, an m-error correcting FREE code is simply any set of barcodes such that the m-error decode spheres of all barcodes are disjoint, or, in other words, no two decode spheres overlap. Any corrupted barcode with up to m errors is thus in the decode sphere of exactly one barcode and can be decoded (error-corrected) uniquely (FIG. 2A).
  • Requiring disjoint decode spheres places a limit on the relationship between allowed m, the number of correctable errors, and n, the barcode length: to fit more than one non-overlapping decode sphere in the space requires that 2m + 1 ≤ n.
  • Candidate barcodes must have: (1) balanced GC content (40-60%); (2) no homopolymer triples (e.g., AAA); (3) no GGC (a known Illumina-based error motif 25 ); and (4) no self-complementarity of >2 bases to reduce hairpin propensity. All disclosed software is available in the GitHub repository (https://github.com/finkelsteinlab).
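  • The four candidate filters listed above can be sketched in a few lines of Python. This is a minimal illustration, not the disclosed software: the function names are hypothetical, and the self-complementarity test is an assumed reading of constraint (4), flagging any 3-base stretch whose reverse complement also occurs anywhere in the barcode.

```python
# Hypothetical helper names; a sketch of the four candidate filters.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def gc_fraction(seq):
    """Fraction of bases that are G or C."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def has_homopolymer_triple(seq):
    """True if any base occurs three times in a row (e.g., AAA)."""
    return any(base * 3 in seq for base in "ACGT")


def has_self_complementarity(seq, k=3):
    """ASSUMPTION: 'self-complementarity of >2 bases' is read as any k-mer
    (k = 3) whose reverse complement also appears somewhere in the barcode."""
    kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    return any(reverse_complement(kmer) in kmers for kmer in kmers)


def is_candidate(seq):
    """Apply constraints (1)-(4) to one candidate barcode."""
    return (
        0.4 <= gc_fraction(seq) <= 0.6
        and not has_homopolymer_triple(seq)
        and "GGC" not in seq
        and not has_self_complementarity(seq)
    )
```

  • A stricter hairpin model (for example, one requiring a minimum loop length between the complementary stretches) would change only `has_self_complementarity`.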
  • the number of available error-correcting barcodes for a DNA sequence of length n depends on the experimentally-required degree of error-correction (FIG. 2B).
  • Libraries of single-error correcting codes up to a 16-nucleotide length were generated, containing >1,600,000 barcodes.
  • more robust, double-error correcting codes up to a 17-nucleotide length with >23,000 unique members were generated (Table 1).
  • Barcodes correcting m errors required length at least 2m + 1 bp to avoid having all decode spheres overlap all other decode spheres.
  • the 1-error and 2-error correcting barcode libraries had minimum lengths of 3 bp and 5 bp respectively.
  • the barcode decoding software ran in time proportional to the length of the barcodes but constant with respect to the number of barcodes in the library. Hence, 1-error and 2-error correcting codes decoded at the same speed for a given barcode length even though the 1-error libraries contained many more barcodes (FIG. 2C). Even the slowest decodes considered, the 17-mer double-error correction barcodes, decoded at >120,000 barcodes per second on a desktop computer using a single processor.
  • Table 1: Number of FREE barcodes for each barcode set disclosed herein, by barcode length and number of errors corrected.
  • Comparison with current error-correcting DNA barcode strategies. Current state-of-the-art error-correcting DNA barcoding applications often use Hamming or Levenshtein error-correction strategies 2023 . Hamming codes only correct substitutions, and are thus insufficient for any DNA barcode applications with indels 26 . However, Hamming codes are linear codes, meaning the code words form a well-structured lattice in barcode space.
  • Levenshtein codes can be used directly (e.g., without pruning) because they account for indels, but must be used at 2-fold higher error correction for DNA barcode applications (FIG. 1B). Such over-corrected Levenshtein barcode sets were generated in a manner similar to the FREE code generation strategy. This strategy produced even fewer barcodes than the pruned Hamming code sets (FIG. 2D). Sequence-Levenshtein codes attempted to solve the problems inherent in using Levenshtein codes for DNA applications, but an error in the derivation of these codes often caused them to decode to the wrong barcode 27 .
  • FREE codes offer a substantially larger number of usable barcodes for a given barcode length, when taking into consideration real-world errors such as deletions, insertions, and substitutions that are encountered during DNA sequencing and synthesis.
  • Error Correction in Real and Simulated Data.
  • FREE barcodes generated herein were validated by both numerical simulation and by experiment. Pooled oligonucleotide synthesis was used to produce a library of >8,000 oligos with double-error correcting barcodes at both ends (FIG. 3A). The barcodes were arranged such that each left barcode should only be observed on the same oligo with one specific right barcode sequence, and similarly for right barcodes. Hence, the rate of incorrectly decoding barcodes could be measured by observing unexpected left-right barcode pairs. 1.4 million copies of this library were sequenced on an Illumina MiSeq for an average coverage of 159x using the standard Illumina workflow.
  • Full-length, paired-end Illumina sequencing was used to measure the background synthesis and sequencing error rates (FIG. 3B and 3C). Using full-length paired-end reads permitted discrimination between synthesis and sequencing errors. Substitution, insertion, and deletion error rates from library amplification using Q5 polymerase have previously been reported to occur at rates less than 10^-5, and thus are a negligible fraction of the measured synthesis errors 28 . Measured errors were dominated by single-base synthesis deletions, which occurred at rates of approximately 1 in 200 bp and 1 in 100 bp in the left and right barcode regions, respectively (FIG. 3B and FIG. 13). The two-fold difference in synthesis error rates between the two sides was consistent with statements from the manufacturer regarding their synthesis error rates 17 .
  • Sequencing error rates were between 10^-4 and 10^-3, as advertised by Illumina (FIG. 3C). In sum, experimental error rates were dominated by deletion errors. As Hamming codes are not designed to error-correct deletions in barcodes, they perform very poorly in DNA-based experiments.
  • each increase in error correction level resulted in at least an order of magnitude improvement in the decoding error rate (FIG. 4).
  • For example, experimental data showed an overall per-base p_err of approximately 10^-2 (FIG. 3B and 3C).
  • the approximate uncorrected decode error rate (solid line) was 8% for length 8 barcodes and 15% for length 16 barcodes. Without error correction, a best-case scenario would be that these errors could be successfully filtered out, representing a significant loss of data. In other scenarios, these data might be erroneously counted.
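  • The uncorrected figures quoted above can be reproduced with a simple binomial model. The sketch below is an illustration under stated assumptions, not the disclosed theoretical model: it assumes errors strike each base independently with the same probability p_err, and that a barcode fails to decode exactly when it carries more than m errors. The function name is hypothetical.

```python
from math import comb


def decode_failure_rate(n, p_err, m=0):
    """Probability that a length-n barcode carries more than m errors.

    ASSUMPTION: independent, identically distributed per-base errors, so the
    error count is Binomial(n, p_err). m = 0 means no error correction.
    """
    p_decodable = sum(
        comb(n, k) * p_err**k * (1 - p_err) ** (n - k) for k in range(m + 1)
    )
    return 1 - p_decodable


# Uncorrected rates at p_err = 0.01, to compare with the 8% and 15% above.
for length in (8, 16):
    print(length, round(decode_failure_rate(length, 0.01), 4))
```

  • Passing m = 1 or m = 2 gives the corresponding estimate for single- or double-error correction under the same assumptions.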
  • FREE barcodes were validated by measuring the decoding error rates for the experimental dataset (FIG. 5). For double-error correction, mismatches in barcode pairs were used to identify erroneously decoded barcodes. After corrections, error rates of 0.29% and 0.46% were observed for left and right barcodes, respectively.
  • the 0- and 1-error correction rates shown in FIG. 5 were counted by also counting the number of errors observed in each correctly decoded barcode. That is, 0-error correction decode error rates were calculated as the number of erroneously decoded barcodes plus the number of correctly decoded barcodes with 1 or 2 errors; 1-error correction errors were counted similarly.
  • the theoretical model was calculated using the synthesis and sequencing error rates found in FIG. 3 to calculate the decode error probability of each barcode depending on its base composition, and then combined for an overall error rate.
  • each barcode may not be defined exactly because the primer region could also have errors.
  • The left-most sub-barcode (referred to as sub-barcode 1) was decoded first, and then the decoded sub-barcode was used to find the starting position of the immediately adjacent sub-barcode 3’ to sub-barcode 1 (referred to as sub-barcode 2).
  • the error-correction level of each FREE sub-barcode remained the same, such that, for example, three concatenated double-error correction sub-barcodes could each correct up to two errors for a maximum total of six corrected errors if, and only if, the errors were evenly distributed, two per sub-barcode.
  • Overall concatenated barcode decoding error rates were given by the probability of any decoding error in any sub-barcode or sub-barcodes. Concatenated barcode error rates were thus slightly higher than for the individual sub-barcodes (FIG. 6B). The decoding process was performed automatically using the disclosed software.
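  • As a rough illustration of why concatenated error rates are only slightly higher, the probability of at least one sub-barcode decoding error can be computed as the complement of every sub-barcode decoding correctly. The sketch assumes sub-barcode decoding errors are independent, which the text does not state; the function name and the example rates are hypothetical.

```python
from math import prod


def concatenated_error_rate(sub_barcode_error_rates):
    """Probability that at least one sub-barcode decodes incorrectly.

    ASSUMPTION: decoding errors in different sub-barcodes are independent.
    """
    return 1 - prod(1 - p for p in sub_barcode_error_rates)


# Hypothetical rates: three sub-barcodes, each with a 0.5% decoding error rate.
print(concatenated_error_rate([0.005, 0.005, 0.005]))
```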
  • Concatenating FREE barcodes results in combinations of large barcode sets that are sufficient for even the most demanding high-throughput sequencing applications (FIG. 6).
  • Concatenated barcodes were pruned to remain compatible with experimental constraints by removing DNA sequences that had triplet repeats of a single base or had excess self-complementarity (defined as any self-complementarity of three or more bases). Even with these filters, lists of up to 10^10 barcodes were generated with concatenation of three single-error correcting codes (FIG. 6). Beyond that, the projected total barcode count was estimated via subsampling, where possible.
  • FREE barcodes are a powerful tool to correct DNA barcode errors, reducing measurement errors in modern, high-throughput experiments. Use of FREE barcodes improves these assays in three key ways: (1) helping avoid spurious results; (2) decreasing the amount of discarded data; and (3) increasing experimental signal-to-noise ratios. Decreasing spurious results and discarded data is important for any experiment involving DNA barcodes. Further, increased signal-to-noise ratios facilitate new and useful possibilities for assays. The power to decrease error rates from 15% to 0.05%, as in FIG. 4B, opens the door for entirely new assay designs. FREE barcodes are broadly useful for the ever-growing set of pooled high-throughput sequencing experiments in cell and molecular biology, protein engineering, and drug discovery.
  • Methods
  • The word length, n, is given. Any DNA sequence of length n is a word, and any word observed in the data is an observed word.
  • Strings of DNA are represented as base-4 numbers where A, C, G, and T correspond to 0, 1, 2, and 3 respectively.
  • For example, the word AAGCT reads as 00213 in base 4, which is 39 in decimal; 39 is the word number and 5 is the word length.
  • the word length is required to uniquely convert numbers to DNA to account for leading A’s.
  • the word number from the example above, 39, with word length 3 is simply GCT.
  • For word length n, the largest valid word number is 4^n - 1.
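  • The base-4 representation described above can be sketched as a pair of conversion functions. The function names are hypothetical; only the mapping A, C, G, T to 0, 1, 2, 3 and the need for an explicit word length come from the text.

```python
BASES = "ACGT"  # A=0, C=1, G=2, T=3


def word_to_number(word):
    """Read a DNA word as a base-4 number, most significant base first."""
    number = 0
    for base in word:
        number = number * 4 + BASES.index(base)
    return number


def number_to_word(number, length):
    """Convert a word number back to DNA, padding with leading A's."""
    if not 0 <= number < 4**length:
        raise ValueError("word number out of range for this word length")
    digits = []
    for _ in range(length):
        number, digit = divmod(number, 4)
        digits.append(BASES[digit])
    return "".join(reversed(digits))
```

  • The same word number maps to different words at different lengths, which is why the length must be carried alongside the number: 39 gives GCT at length 3 and the same three bases preceded by two A's at length 5.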
  • a decode sphere is defined around a barcode B to be the set of all words with FreeDiv less than or equal to m, and an encode sphere is defined to be the set of all words of FreeDiv less than or equal to 2m.
  • FREE barcode sets are generated with a modified lexicographic code generation method.
  • Lexicographic code generation consists of marching through all words lexicographically, alphabetically in this case, and adding new words to the list of barcodes whenever they are sufficiently far from all previous barcodes 30 .
  • lexicographic codes are linear 30 , and more generally, lexicographic code generation is shown to have relatively good sphere packing efficiency 24 .
  • the first FREE modification to the procedure is to enforce the following sequencing and synthesis properties:
  • the coloring of barcodes, decode spheres, and encode spheres is accomplished by having an array of 4^n integers valued 0, 1, or 2: 0 for uncolored, 1 for black, and 2 for red.
  • the location of each integer in memory itself represents the word, via the numerical representation of DNA given above. This is both memory and speed efficient. Memory efficiency is important, as it is a limiting resource for this method.
  • the memory required for barcode generation is 4^n bytes, which for the disclosed experiments was up to 16 GB of random access memory (RAM).
  • Barcode Decoding. The decoding process builds the code book and looks up decoded words directly. This is performed in a memory efficient fashion as follows. For each barcode in a list, the barcode index is defined as the index of that barcode within the list of barcodes. A user again reserves a space of 4^n integers to represent the code space. For each barcode B, a user stores the barcode index of B at every word of DecodeSphere(B). A user stores barcode indices rather than barcode numbers because barcode indices require fewer bits per word.
  • the memory required for barcode decoding is (1, 2, or 4) x 4^n bytes, requiring 1, 2, or 4 bytes to store each barcode index. For the disclosed experiments, the maximum memory used for barcode decoding was 32 GB of RAM.
  • Barcode Pruning. Specific barcode lists from the literature or elsewhere may sometimes be required for a given experiment, but require pruning to find a subset with error-correction. A user can accomplish barcode pruning via the same strategy as barcode generation, but only considering the input set of barcodes as potential new barcodes. This pruning method was also used to prune the linear Hamming codes.
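  • The generation, decoding, and pruning steps above share one data structure: an array of 4^n cells indexed by word number. The sketch below is a simplified single-error (m = 1) illustration, not the disclosed implementation. It assumes a greedy rule that accepts a candidate only if none of the words in its decode sphere is already claimed (rather than the black/red coloring of decode and encode spheres), it omits the sequencing and synthesis filters, and all function names are hypothetical.

```python
from itertools import product

BASES = "ACGT"  # A=0, C=1, G=2, T=3


def word_to_number(word):
    """Read a DNA word as a base-4 number."""
    number = 0
    for base in word:
        number = number * 4 + BASES.index(base)
    return number


def one_error_sphere(barcode):
    """All length-n words reachable by at most one edit followed by
    right-end truncation (after an insertion) or fill (after a deletion)."""
    n = len(barcode)
    sphere = {barcode}
    for i in range(n):
        for b in BASES:
            sphere.add(barcode[:i] + b + barcode[i + 1:])    # substitution
            sphere.add((barcode[:i] + b + barcode[i:])[:n])  # insertion, truncate
            sphere.add(barcode[:i] + barcode[i + 1:] + b)    # deletion, fill
    return sphere


def build_code(candidates, n):
    """Greedily accept candidates whose decode spheres are still unclaimed.

    Returns the accepted barcodes and a code book of 4**n cells holding the
    barcode index for every word in an accepted sphere (-1 if unclaimed).
    """
    codebook = [-1] * 4**n
    barcodes = []
    for word in candidates:
        cells = [word_to_number(w) for w in one_error_sphere(word)]
        if all(codebook[c] == -1 for c in cells):
            for c in cells:
                codebook[c] = len(barcodes)
            barcodes.append(word)
    return barcodes, codebook


def decode(observed, barcodes, codebook):
    """Look up an observed word; None means it lies in no decode sphere."""
    index = codebook[word_to_number(observed)]
    return barcodes[index] if index >= 0 else None


def all_words(n):
    """Every length-n word in lexicographic order."""
    return ("".join(p) for p in product(BASES, repeat=n))
```

  • Passing `all_words(n)` as the candidates corresponds to generation; passing an existing barcode list corresponds to pruning. Decoding is a single array lookup, which is why its cost does not depend on the number of barcodes in the library.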
  • Levenshtein Barcodes. Levenshtein barcodes were generated lexicographically using the standard technique of code generation with a metric. Briefly, for desired barcode length n and number of correctable errors e, a user can walk through the space of n-mers lexicographically adding any new word if it: (a) satisfies the same sequencing and synthesis properties as above, and (b) is Levenshtein distance at least 2e + 1 from any previously accepted barcode.
  • Oligonucleotide pools were designed as in FIG. 3A, with primers and barcodes on each end and a spacer in the middle (116 bp total length). To test the FREE method, 8,634 barcodes of length 17 and double-error correction were used in 8,634 unique pairs. Oligos were synthesized (CustomArray), and the oligo pool was amplified for twenty cycles with Q5 polymerase (NEB) and sequenced on an Illumina MiSeq machine with 2x150 bp paired-end reads. Maximum likelihood sequences were inferred using both reads.
  • the left and right primer sequences were used to determine both the read orientation and the starting position of each barcode.
  • Each barcode was then decoded using the FREE decoding software. Matching barcodes identified correctly decoded barcodes, while mismatching barcodes indicated an error.
  • the FREE method was powerful enough to reveal a surprising and unrelated source of error: the creation of oligo chimeras, sequences with the left part of one oligo and right part of another, which were then also accounted for (Example 2).
  • FreeDiv(X, Y ) can be efficiently calculated with a modified Needleman-Wunsch algorithm 1 , where the last row and column of the matrix have zero penalty for insertion and deletion corresponding to right-end fill or truncation respectively.
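  • The modified Needleman-Wunsch calculation described above can be sketched as an ordinary edit-distance table in which horizontal moves along the last row and vertical moves down the last column cost nothing. This is a minimal illustration written from that one-sentence description, assuming unit cost for every substitution, insertion, and deletion and equal-length words; the function name is hypothetical.

```python
def free_divergence(x, y):
    """FREE divergence between two words of equal length n.

    d[i][j] is the cheapest way to align x[:i] with y[:j]. Moves along the
    last row (all of x consumed) are free and model right-end fill; moves
    down the last column (all of y consumed) are free and model right-end
    truncation. ASSUMPTION: all other edits cost 1.
    """
    n = len(x)
    if len(y) != n:
        raise ValueError("words must have the same length")
    d = [[0] * (n + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i
        d[0][i] = i
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            substitute = d[i - 1][j - 1] + (x[i - 1] != y[j - 1])
            delete = d[i - 1][j] + (0 if j == n else 1)  # free in last column
            insert = d[i][j - 1] + (0 if i == n else 1)  # free in last row
            d[i][j] = min(substitute, delete, insert)
    return d[n][n]
```

  • For instance, deleting the C from ACGT and filling the right end with an A gives AGTA, and inserting a T after the first base of ACGT and truncating gives ATCG; both count as a single edit under this calculation, whereas plain Levenshtein distance would also charge for the base gained or lost at the right end.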
  • FREE divergence is symmetric because any minimum filled/truncated right end edit path (FREE path) is invertible by inverting all the edits and then inverting the fill/truncation step. Substitutions are invertible with substitutions, while insertions and deletions are invertible with each other in the natural way, so edits by themselves are invertible.
  • Invertibility with the fill/truncation step is less obvious, and requires no edit be truncated off the end. For example, a substitution in the last position followed by any insertion results in the substitution getting truncated off the end. Minimum FREE paths never have any edits truncated off the end, because any truncated edit can be omitted to create a shorter edit path.
  • Let X and Y be barcodes and let P be any minimum FREE path from X to Y. If P has no fill or truncation, then the fill/truncation step is trivially invertible by doing nothing.
  • Suppose P has a fill step which fills f bases at the end. Then starting at Y and inverting the edits results in exactly those f bases being outside the barcode window, so they are truncated to arrive at X.
  • Suppose P has a truncation step which truncates t bases. Since P is a minimum edit path, none of the truncated bases were edited bases, so they are not needed for the inverted edit path starting at Y.
  • Thus FreeDiv(X, Y) = FreeDiv(Y, X), and any inverted minimum FREE path is itself a minimum FREE path.
  • FREE divergence is not a metric.
  • the counter-example shown in the right column of FIG. 1C was used.
  • For FreeDiv(TAGA, ACGC), the modified Needleman-Wunsch algorithm described above produces
  • Sphere iterator: Central to the disclosed generation and decoding algorithms is the ability to deterministically iterate over decode and encode spheres. Recursive iteration is far too slow for practical use due to redundancy. For example, attempting to find DecodeSphere_2(B) by finding DecodeSphere_1(W) of all words W in DecodeSphere_1(B) results in iterating over each 2-error word at least twice, by switching the order of added edits. As the number of edits, m, grows, the redundancy grows as m! due to edit permutations.
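To make the notion of a decode sphere concrete, the sketch below enumerates every word within a given number of edits of a barcode and then applies the fill/truncation step. It is a naive set-based enumeration, not the disclosed sphere iterator: it avoids visiting a word twice by remembering everything already seen, which costs memory. The names `single_edits` and `decode_sphere` are hypothetical.

```python
from itertools import product

BASES = "ACGT"

def single_edits(word):
    """Yield every word one insertion, deletion or substitution away from word."""
    for i in range(len(word) + 1):
        for b in BASES:
            yield word[:i] + b + word[i:]              # insertion
    for i in range(len(word)):
        yield word[:i] + word[i + 1:]                  # deletion
        for b in BASES:
            if b != word[i]:
                yield word[:i] + b + word[i + 1:]      # substitution

def decode_sphere(barcode, max_edits):
    """Return all length-n words reachable by up to max_edits edits plus fill/truncation."""
    n = len(barcode)
    seen = {barcode}
    frontier = {barcode}
    for _ in range(max_edits):
        next_frontier = set()
        for word in frontier:
            for edited in single_edits(word):
                if edited not in seen:                 # skip words reached by another edit order
                    seen.add(edited)
                    next_frontier.add(edited)
        frontier = next_frontier
    sphere = set()
    for word in seen:
        if len(word) >= n:
            sphere.add(word[:n])                       # truncate to the barcode window
        else:
            for fill in product(BASES, repeat=n - len(word)):
                sphere.add(word + "".join(fill))       # fill with every possible right end
    return sphere
```

The `seen` set is what removes the permutation redundancy described above; a production iterator would presumably avoid storing the whole sphere.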
  • Code efficiency is measured, where possible, in terms of a code rate, defined as the number of usable "message" bits that can be encoded in a single barcode divided by the actual number of bits in the sent barcode.
  • k message bits have r bits added for error correction, giving a code rate of k/(k + r).
  • each sent base is two bits of information, so the denominator is 2n.
  • the numerator is the effective number of message bits: the length of the largest binary number smaller than the number of barcodes, given by $\lfloor \log_2(\text{Number of barcodes}) \rfloor$.
  • the number of message bits does not need to be an integer, so one can refer to the previous as the actual message bits, while one may be more interested in the "raw" message bits: $\log_2(\text{Number of barcodes})$ without a floor function. These correspond to raw and actual code rates, shown in FIG. 7.
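The raw and actual code rates defined above reduce to a few lines of arithmetic. The sketch below uses hypothetical function names and, as a worked example, the 8,634 barcodes of length 17 mentioned earlier in this section.

```python
import math

def raw_message_bits(num_barcodes: int) -> float:
    """log2 of the number of barcodes, without the floor."""
    return math.log2(num_barcodes)

def actual_message_bits(num_barcodes: int) -> int:
    """floor(log2(num_barcodes)), computed exactly with integer arithmetic."""
    return num_barcodes.bit_length() - 1

def code_rate(message_bits: float, barcode_length: int) -> float:
    """Message bits over sent bits; each DNA base carries two bits."""
    return message_bits / (2 * barcode_length)

# Worked example: 8,634 barcodes of length 17.
actual_rate = code_rate(actual_message_bits(8634), 17)   # 13 / 34
raw_rate = code_rate(raw_message_bits(8634), 17)         # slightly above 13 / 34
```

With these inputs the actual number of message bits is 13, since 2^13 = 8,192 is at most 8,634 and 2^14 = 16,384 exceeds it.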
  • the code rate of FREE codes increases with barcode length, and appears to asymptotically approach a maximal code rate determined by the properties of the decode sphere packing.
  • FIG. 2B shows that after some boundary effects at short barcode lengths, the number of raw message bits (log of the number of barcodes) increases linearly with the length of the barcodes. The slope of this line, up to a factor of 2 for the x-axis due to using base-4 instead of base-2, is an empirical estimate for the asymptotic code rate (message bits over sent bits) for this packing method. Estimated asymptotic values are shown for single- and double-error correcting codes as dashed lines in FIG. 1C.
  • H: parity check matrix
  • Primer processing: Primers were used both chemically for library amplification and informatically to distinguish left from right sides. However, the possibility of insertions and/or deletions in these primer sites introduced some uncertainty in the starting position of the DNA barcodes.
  • A custom adaptation of the Smith-Waterman algorithm for overhanging sequences was written. The user specifies an expected primer sequence, a full-length observed read, and a maximum allowable number of errors, which can be chosen to be 2 for both the left (19 bp) and right (18 bp) primers. Using the modified Smith-Waterman algorithm with unity penalties for all error types, the highest scoring prefix of the observed sequence which matched the expected sequence was identified. If two or more possible lengths had the same score, the one closest to the expected length was selected. If the number of edits is less than or equal to 2, this best inferred length then determines the position to be used as the start of the barcode sequence.
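The prefix search described above can be approximated with an ordinary edit-distance table between the expected primer and every prefix of the read. This is a sketch under the assumption that "unity penalties" means unit-cost edit distance; it is not the custom Smith-Waterman adaptation itself, and the name `find_barcode_start` is hypothetical.

```python
def find_barcode_start(primer, read, max_errors=2):
    """Return the inferred primer length in the read (the barcode start), or None.

    Picks the read prefix with the fewest edits relative to the primer; ties go to
    the prefix length closest to len(primer).
    """
    n = len(primer)
    # prev[j] holds the edit distance between primer[:i] and read[:j] for the current i.
    prev = list(range(len(read) + 1))
    for i in range(1, n + 1):
        cur = [i] + [0] * len(read)
        for j in range(1, len(read) + 1):
            cur[j] = min(
                prev[j - 1] + (primer[i - 1] != read[j - 1]),  # match or substitution
                prev[j] + 1,                                   # primer base missing from read
                cur[j - 1] + 1,                                # extra base in read
            )
        prev = cur
    best_len = min(range(len(read) + 1), key=lambda j: (prev[j], abs(j - n)))
    return best_len if prev[best_len] <= max_errors else None
```

If two prefix lengths are equally good and equally close to the expected length, this sketch returns the shorter one; the source does not say how such a tie is broken.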
  • Decode errors are detected by whether or not the left and right barcodes, as shown in FIG. 3A, match an intended left/right barcode pair. There are two possible ways to decode incorrectly: either by decoding to a wrong barcode or by decoding to "None" if the observed barcode is not in any decode sphere at all. If a barcode decodes to "None", then that decode is an error. If a barcode decodes to an incorrect barcode, then the observed output is that the left and right barcodes mismatch but it is unclear which is actually the decode error. To determine which barcode was in error, the edit distance of the entire oligo was measured against the two possible intended sequences, accepting the one with lowest edit distance. To measure the 0- and 1-error correction data in FIG. 5, the edit distance of each observed barcode to the intended barcode was measured using the primer processing algorithm described above.
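The paired-barcode check above might be organized as in the following sketch. The function names, the status labels and the representation of intended pairs as a set of tuples are all assumptions made for illustration.

```python
def edit_distance(a, b):
    """Standard unit-cost Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        cur = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cur[j] = min(prev[j - 1] + (a[i - 1] != b[j - 1]), prev[j] + 1, cur[j - 1] + 1)
        prev = cur
    return prev[-1]

def classify_pair(left, right, intended_pairs):
    """Classify one read from its decoded left/right barcodes (None = not in any sphere)."""
    if left is None or right is None:
        return "decode_failed"
    if (left, right) in intended_pairs:
        return "match"
    return "mismatch"

def resolve_mismatch(observed_oligo, candidate_oligos):
    """Pick the intended oligo closest in edit distance to the observed one."""
    return min(candidate_oligos, key=lambda c: edit_distance(observed_oligo, c))
```

For a mismatching read, `candidate_oligos` would hold the two full intended sequences implied by the left and right decodes; the barcode belonging to the rejected candidate is then counted as the decode error.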
  • the observed number of wrong barcodes with zero errors was ~$10^4$, the approximate size of a decode sphere, so this was accepted as an approximation for the number of chimera oligos having barcodes with zero errors.
  • This number and the distribution of correct barcode errors were then used to approximate how many of the wrong barcodes 1 error and 2 errors away from the wrong barcode were chimera oligos. These were then omitted from analysis.
  • the decoding error rate of an m-error correction code is the probability of seeing more than m errors in a given barcode.
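As a rough illustration of this definition, the sketch below computes the probability of more than m errors under the simplifying assumption that each of the n bases errs independently with the same probability p. That independence is an assumption of this example only; the text elsewhere notes that experimental errors clump together, which raises the true rate.

```python
from math import comb

def decode_error_rate(n: int, m: int, p: float) -> float:
    """P(more than m errors among n bases), assuming independent errors with rate p."""
    at_most_m = sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(min(m, n) + 1)
    )
    return 1.0 - at_most_m
```

For example, `decode_error_rate(17, 2, 0.01)` gives the modeled failure probability of a double-error-correcting barcode of length 17 at a 1% per-base error rate.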
  • each barcode was modeled as a queue of intended bases. At each read position, an intended base is popped off the queue and attempted to be added. One of four things will happen: 1) the correct base will be added, 2) an incorrect base will be added, 3) the base will be deleted, or 4) another base will be inserted and the intended base will go back to the top of the queue. The first three options do not return the base to the queue, resulting in the same structure of expected output → observed output. However, insertions cause the intended base to return to the top of the queue, and the output was never expected in the first place.
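The queue model above lends itself to a short simulation. The sketch below assumes a single substitution, deletion and insertion rate shared by all bases (the source uses per-base rates), draws inserted and substituted bases uniformly, and treats the intended sequence as the barcode followed by whatever lies downstream. The function name and parameters are hypothetical.

```python
import random
from collections import deque

BASES = "ACGT"

def simulate_read(intended, n, p_sub, p_del, p_ins, rng):
    """Simulate the first n observed bases of an intended sequence under the queue model."""
    queue = deque(intended)
    out = []
    while len(out) < n and queue:
        base = queue.popleft()
        r = rng.random()
        if r < p_ins:
            out.append(rng.choice(BASES))   # an extra base is emitted...
            queue.appendleft(base)          # ...and the intended base returns to the queue
        elif r < p_ins + p_del:
            continue                        # intended base is lost, nothing emitted
        elif r < p_ins + p_del + p_sub:
            out.append(rng.choice([b for b in BASES if b != base]))  # wrong base emitted
        else:
            out.append(base)                # correct base emitted
    return "".join(out)
```

A call such as `simulate_read(barcode + spacer, len(barcode), 0.005, 0.005, 0.002, random.Random(1))` would produce one simulated observed barcode; the rates shown are placeholders, not measured values.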
  • CDS: the 3-by-4 matrix with columns corresponding to the DNA bases, and rows corresponding to all non-insertion outputs: correct bases, deletions, and substitutions.
  • Insertion and deletion rates, $p_i(b)$ and $p_d(b)$, are taken directly from synthesis error rate measurements.
  • Substitution rates, $p_s(b)$, are calculated as the probability of not observing the event {no synthesis substitution and no sequencing substitution} nor the event {synthesis substitution to another base c and correcting synthesis substitution back to b}, and are thus given by
  • FIG. 5 since experimental errors clump together, increasing the probability of having more than two, say, in a single barcode.
  • the user would use a sphere iterator (see Supplemental Methods) which, instead of iterating over all barcodes with up to some number of errors, iterates over the most likely erroneous barcodes on the chosen synthesis and sequencing platform until the total probability of the barcodes contained in the sphere is at least $1 - 10^{-6}$.
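The stopping rule above can be illustrated with a small sketch. It assumes the probabilities of candidate erroneous words have already been computed by some platform-specific error model, which is not shown; the function name and the dictionary input are hypothetical, and a real iterator would generate words lazily rather than sort a full table.

```python
def probability_sphere(word_probs, target=1 - 1e-6):
    """Collect the most likely words until their total probability reaches target."""
    sphere, total = [], 0.0
    ranked = sorted(word_probs.items(), key=lambda kv: kv[1], reverse=True)
    for word, prob in ranked:
        if total >= target:
            break
        sphere.append(word)
        total += prob
    return sphere, total

# Toy example with made-up probabilities and a loose 95% target.
toy = {"ACGT": 0.90, "ACGA": 0.06, "ACGC": 0.03, "ACGG": 0.01}
toy_sphere, toy_total = probability_sphere(toy, target=0.95)
```

In the toy example the sphere stops after the two most likely words, because together they already exceed the 95% target.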
  • the rest of the generation and decoding process would remain the same.
  • These barcodes would only achieve the expected accuracy on exactly the same DNA synthesis and sequencing pipeline, but they would be more efficient (produce more barcodes per barcode length) for a given desired decode error rate. As such, popular synthesis and sequencing pipelines may warrant their own dedicated barcodes in the future.
  • Exemplary embodiments may include program products comprising computer or machine-readable media for carrying or having machine-executable instructions or data structures stored thereon.
  • the sensing electrode may be computer driven.
  • Exemplary embodiments illustrated in the methods of the figures may be controlled by program products comprising computer or machine-readable media for carrying or having machine-executable instructions or data structures stored thereon.
  • Such computer or machine-readable media can be any available media which can be accessed by a general purpose or special purpose computer or other machine with a processor.
  • Such computer or machine-readable media can comprise RAM, ROM, EPROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to carry or store desired program code in the form of machine-executable instructions or data structures and which can be accessed by a general purpose or special purpose computer or other machine with a processor. Combinations of the above are also included within the scope of computer or machine-readable media.
  • Computer or machine-executable instructions comprise, for example, instructions and data which cause a general purpose computer, special purpose computer, or special purpose processing machines to perform a certain function or group of functions.
  • Software implementations of the present disclosure could be accomplished with standard programming techniques with rule based logic and other logic to accomplish the various connection steps, processing steps, comparison steps and decision steps.
  • elements shown as integrally formed may be constructed of multiple parts or elements shown as multiple parts may be integrally formed, the operation of the assemblies may be reversed or otherwise varied, the length or width of the structures and/or members or connectors or other elements of the system may be varied, the nature or number of adjustment or attachment positions provided between the elements may be varied.
  • the elements and/or assemblies of the system may be constructed from any of a wide variety of materials that provide sufficient strength or durability. Accordingly, all such modifications are intended to be included within the scope of the present disclosure.
  • the order or sequence of any process or method steps may be varied or re-sequenced according to alternative embodiments.
  • Other substitutions, modifications, changes and omissions may be made in the design, operating conditions and arrangement of the preferred and other exemplary embodiments without departing from the spirit of the present subject matter.
  • the methods and systems may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects.
  • the methods and systems may take the form of a computer program product on a computer-readable storage medium having computer-readable program instructions (e.g., computer software) embodied in the storage medium.
  • the present methods and systems may take the form of web-implemented computer software. Any suitable computer-readable storage medium may be utilized including hard disks, CD-ROMs, optical storage devices, or magnetic storage devices.
  • These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including computer-readable instructions for implementing the function specified in the flowchart block or blocks.
  • the computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer-implemented process such that the instructions that execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart block or blocks.
  • blocks of the block diagram and flowchart illustration support combinations of means for performing the specified functions, combinations of steps for performing the specified functions and program instruction means for performing the specified functions. It will also be understood that each block of the block diagram and flowchart illustration, and combinations of blocks in the block diagram and flowchart illustration, can be implemented by special purpose hardware-based computer systems that perform the specified functions or steps, or combinations of special purpose hardware and computer instructions.
  • Results can be delivered to a gateway (remote computer via the Internet or satellite) for display in graphical user interface format.
  • the described system can be used with an algorithm, such as those disclosed herein.
  • the computer may include a processing unit that communicates with other elements. Also included in the computer readable medium may be an output device and an input device for receiving and displaying data. This display device/input device may be, for example, a keyboard or pointing device that is used in combination with a monitor.
  • the computer system may further include at least one storage device, such as a hard disk drive, a floppy disk drive, a CD-ROM drive, SD disk, optical disk drive, or the like for storing information on various computer-readable media, such as a hard disk, a removable magnetic disk, or a CD-ROM disk.
  • each of these storage devices may be connected to the system bus by an appropriate interface.
  • the storage devices and their associated computer-readable media may provide nonvolatile storage. It is important to note that the computer described above could be replaced by any other type of computer in the art. Such media include, for example, magnetic cassettes, flash memory cards and digital video disks.
  • a network interface controller can be implemented via a gateway that comprises a general-purpose computing device in the form of a computing device or computer.
  • bus structures can be used as well, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures.
  • such architectures can comprise an Industry Standard Architecture (ISA) bus, a Micro Channel Architecture (MCA) bus, an Enhanced ISA (EISA) bus, and a Video Electronics Standards Association (VESA) local bus.
  • AGP: Accelerated Graphics Port
  • PCI: Peripheral Component Interconnects
  • PCMCIA: Personal Computer Memory Card Industry Association
  • USB: Universal Serial Bus
  • the bus, and all buses specified in this description can also be implemented over a wired or wireless network connection and each of the subsystems, including the processor, a mass storage device, an operating system, network interface controller, Input/Output Interface, and a display device, can be contained within one or more remote computing devices at physically separate locations, connected through buses of this form, in effect implementing a fully distributed system.
  • the computer typically comprises a variety of computer readable media.
  • Exemplary readable media can be any available media that is accessible by the computer and comprises, for example and not meant to be limiting, both volatile and non-volatile media, removable and non-removable media.
  • the system memory comprises computer readable media in the form of volatile memory, such as random access memory (RAM), and/or non-volatile memory, such as read only memory (ROM).
  • the computer can also comprise other removable/non-removable, volatile/non-volatile computer storage media.
  • a mass storage device can be a hard disk, a removable magnetic disk, a removable optical disk, magnetic cassettes or other magnetic storage devices, flash memory cards, CD-ROM, digital versatile disks (DVD) or other optical storage, random access memories (RAM), read only memories (ROM), electrically erasable programmable read-only memory (EEPROM), and the like.
  • any number of program modules can be stored on the mass storage device, including by way of example, an operating system and computational software.
  • Each of the operating system and computational software (or some combination thereof) can comprise elements of the programming and the computational software.
  • Data can also be stored on the mass storage device.
  • Data can also be stored in any of one or more databases known in the art. Examples of such databases comprise DB2™, MICROSOFT™ ACCESS, MICROSOFT™ SQL Server, ORACLE™, mySQL, PostgreSQL, and the like.
  • the databases can be centralized or distributed across multiple systems.
  • the user can enter commands and information into the computer 102 via an input device.
  • input devices comprise, but are not limited to, a keyboard, pointing device (e.g., a "mouse"), a microphone, a joystick, a scanner, tactile input devices such as gloves, and other body coverings, and the like.
  • a human machine interface that is coupled to the network interface controller, but can be connected by other interface and bus structures, such as a parallel port, game port, an IEEE 1394 Port (also known as a Firewire port), a serial port, or a universal serial bus (USB).
  • a display device can also be connected to the system bus via an interface, such as a display adapter.
  • the computer can have more than one display adapter and the computer can have more than one display device.
  • a display device can be a monitor, an LCD (Liquid Crystal Display), or a projector.
  • other output peripheral devices can comprise components such as speakers and a printer which can be connected to the computer via Input/Output Interface. Any step and/or result of the methods can be output in any form to an output device.
  • Such output can be any form of visual representation, including, but not limited to, textual, graphical, animation, audio, tactile, and the like.
  • the computer can operate in a networked environment.
  • a remote computing device can be a personal computer, portable computer, a server, a router, a network computer, a peer device, sensor node, or other common network node, and so on.
  • Logical connections between the computer and a remote computing device can be made via a local area network (LAN), a general wide area network (WAN), or any other form of a network.
  • LAN: local area network
  • WAN: wide area network
  • a network adapter can be implemented in both wired and wireless environments.
  • Such networking environments are conventional and commonplace in offices, enterprise-wide computer networks, intranets, and other networks such as the Internet.
  • Computer readable media can be any available media that can be accessed by a computer.
  • Computer readable media can comprise "computer storage media" and "communications media." "Computer storage media" comprise volatile and non-volatile, removable and non-removable media implemented in any methods or technology for storage of information such as computer readable instructions, data structures, program modules, or other data.
  • Exemplary computer storage media comprises, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer.
  • the methods and systems described herein can employ Artificial Intelligence techniques such as machine learning and iterative learning.
  • Such techniques include, but are not limited to, expert systems, case-based reasoning, Bayesian networks, behavior based AI, neural networks, fuzzy systems, evolutionary computation (e.g. genetic algorithms), swarm intelligence (e.g. ant algorithms), and hybrid intelligent systems (e.g. Expert inference rules generated through a neural network or production rules from statistical learning).
  • CustomArray, Inc. maker of custom microarrays, oligo pools and instrumentation. Available at: http://www.customarrayinc.com/aboutus_main.htm. (Accessed: 8th January 2018)

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Library & Information Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Theoretical Computer Science (AREA)
  • Biophysics (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Bioethics (AREA)
  • Databases & Information Systems (AREA)
  • Chemical & Material Sciences (AREA)
  • Biochemistry (AREA)
  • Molecular Biology (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention relates to a computerized method of generating a set of genetic barcodes for attachment to samples, which predicts errors in base values of genetic sequences and allows decoding of corrupted barcodes that include the errors. The set of genetic barcodes is constrained by requiring or preventing selected combinations of base values that correspond to physical characteristics of a genetic sample. The method includes identifying a common length of the barcodes in the set, identifying the base values available to fill each base position, and identifying a central barcode from which a sphere of multiple corrupted barcodes can be generated. The sphere of barcodes referring to a given central barcode is stored as a lookup table for decoding corrupted barcodes to a single, unique central barcode.
PCT/US2019/028279 2018-04-20 2019-04-19 Error-correcting DNA barcodes WO2019204702A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201862660531P 2018-04-20 2018-04-20
US62/660,531 2018-04-20

Publications (1)

Publication Number Publication Date
WO2019204702A1 true WO2019204702A1 (fr) 2019-10-24

Family

ID=68239272

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2019/028279 WO2019204702A1 (fr) 2018-04-20 2019-04-19 Error-correcting DNA barcodes

Country Status (1)

Country Link
WO (1) WO2019204702A1 (fr)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021053208A1 (fr) * 2019-09-20 2021-03-25 Sophia Genetics S.A. Methods of DNA library generation to facilitate the detection and reporting of low-frequency variants
WO2022060889A3 (fr) * 2020-09-16 2022-04-28 10X Genomics, Inc. Methods and systems for barcode error correction
CN114774516A (zh) * 2022-03-28 2022-07-22 深圳裕康医学检验实验室 UMI sequence design method for correcting sequencing errors and application thereof

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160289740A1 (en) * 2015-03-30 2016-10-06 Cellular Research, Inc. Methods and compositions for combinatorial barcoding

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160289740A1 (en) * 2015-03-30 2016-10-06 Cellular Research, Inc. Methods and compositions for combinatorial barcoding

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
BUSCHMANN ET AL.: "Levenshtein error-correcting barcodes for multiplexed DNA sequencing", BMC BIOINFORMATICS, vol. 14, no. 272, 11 September 2013 (2013-09-11), pages 1 - 10, XP055194297 *
COSTEA ET AL.: "TagGD: fast and accurate software for DNA Tag generation and demultiplexing", PLOS ONE, vol. 8, no. 3, 4 March 2013 (2013-03-04), pages 1 - 5, XP055289247, DOI: 10.1371/journal.pone.0057521 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021053208A1 (fr) * 2019-09-20 2021-03-25 Sophia Genetics S.A. Methods of DNA library generation to facilitate the detection and reporting of low-frequency variants
EP4031664A1 (fr) * 2019-09-20 2022-07-27 Sophia Genetics S.A. Methods of DNA library generation to facilitate the detection and reporting of low-frequency variants
WO2022060889A3 (fr) * 2020-09-16 2022-04-28 10X Genomics, Inc. Methods and systems for barcode error correction
CN114774516A (zh) * 2022-03-28 2022-07-22 深圳裕康医学检验实验室 UMI sequence design method for correcting sequencing errors and application thereof
CN114774516B (zh) * 2022-03-28 2024-04-12 深圳裕康医学检验实验室 UMI sequence design method for correcting sequencing errors and application thereof

Similar Documents

Publication Publication Date Title
Liu et al. Hi-TOM: a platform for high-throughput tracking of mutations induced by CRISPR/Cas systems
Buschmann et al. Levenshtein error-correcting barcodes for multiplexed DNA sequencing
Marsan et al. Algorithms for extracting structured motifs using a suffix tree with an application to promoter and regulatory site consensus identification
US20200032334A1 (en) Methods, systems, computer readable media, and kits for sample identification
Bystrykh Generalized DNA barcode design based on Hamming codes
Organick et al. Scaling up DNA data storage and random access retrieval
US20180211001A1 (en) Trace reconstruction from noisy polynucleotide sequencer reads
CN107969138B (zh) Barcode sequences and related systems and methods
WO2019204702A1 (fr) Error-correcting DNA barcodes
US20210074380A1 (en) Reverse concatenation of error-correcting codes in dna data storage
EP2984598A1 (fr) Systems and methods for determining copy number variation
WO2011137368A2 (fr) Systems and methods for analyzing nucleic acid sequences
US20120185177A1 (en) Harnessing high throughput sequencing for multiplexed specimen analysis
Bhardwaj et al. Trace reconstruction problems in computational biology
Tambe et al. Barcode identification for single cell genomics
Liu et al. NullSeq: a tool for generating random coding sequences with desired amino acid and GC contents
Westesson et al. Accurate detection of recombinant breakpoints in whole-genome alignments
Anavy et al. Improved DNA based storage capacity and fidelity using composite DNA letters
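CN114424288A (zh) Method for determining a measure related to the probability that two mutated sequence reads are derived from the same sequence comprising the mutation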
Erlich et al. Capacity-approaching DNA storage
US20210202032A1 (en) Method of tagging nucleic acid sequences, composition and use thereof
Milenkovic et al. DNA-based data storage systems: A review of implementations and code constructions
Hawkins et al. Error-correcting DNA barcodes for high-throughput sequencing
Sharma et al. Efficiently Enabling Block Semantics and Data Updates in DNA Storage
Simpson Efficient sequence assembly and variant calling using compressed data structures

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19788060

Country of ref document: EP

Kind code of ref document: A1