WO2020180659A1 - Nucleic acid labeling methods and composition - Google Patents

Nucleic acid labeling methods and composition

Info

Publication number
WO2020180659A1
Authority
WO
WIPO (PCT)
Prior art keywords
label
sequence
oligonucleotide
degenerate
cell
Prior art date
Application number
PCT/US2020/020321
Other languages
French (fr)
Inventor
Colin J.H. Brenan
Michael Kopczynski
Steven SCHERR
Ely PORTER
Original Assignee
1Cellbio Inc.
Priority date
Filing date
Publication date
Application filed by 1Cellbio Inc.
Publication of WO2020180659A1

Classifications

    • C CHEMISTRY; METALLURGY
    • C40 COMBINATORIAL TECHNOLOGY
    • C40B COMBINATORIAL CHEMISTRY; LIBRARIES, e.g. CHEMICAL LIBRARIES
    • C40B40/00 Libraries per se, e.g. arrays, mixtures
    • C40B40/04 Libraries containing only organic compounds
    • C40B40/06 Libraries containing nucleotides or polynucleotides, or derivatives thereof
    • C CHEMISTRY; METALLURGY
    • C12 BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12N MICROORGANISMS OR ENZYMES; COMPOSITIONS THEREOF; PROPAGATING, PRESERVING, OR MAINTAINING MICROORGANISMS; MUTATION OR GENETIC ENGINEERING; CULTURE MEDIA
    • C12N15/00 Mutation or genetic engineering; DNA or RNA concerning genetic engineering, vectors, e.g. plasmids, or their isolation, preparation or purification; Use of hosts therefor
    • C12N15/09 Recombinant DNA-technology
    • C12N15/10 Processes for the isolation, preparation or purification of DNA or RNA
    • C12N15/1034 Isolating an individual clone by screening libraries
    • C12N15/1065 Preparation or screening of tagged libraries, e.g. tagged microorganisms by STM-mutagenesis, tagged polynucleotides, gene tags

Definitions

  • the present application contains a sequence listing that is submitted in ASCII format via EFS-Web concurrent with the filing of this application, containing the file name 37578_0063P1_SL which is 4,096 bytes in size, created on February 25, 2020, and is herein incorporated by reference in its entirety.
  • the present invention is directed to methods for labeling nucleic acids in a cell with an oligonucleotide label as a unique identifier of molecules from a single cell and a second oligonucleotide label derived from the first label that further identifies individual molecules from the same cell.
  • the first label is a meta-label synthesized from a pre-existing library of IUPAC bases representing degenerate nucleotides wherein each bi-degenerate or tri-degenerate base encodes for two or three pre-specified nucleotides.
  • a string of bi-degenerate labels, for example, generates a unique code that is used to label molecules from a single cell and, through application of combinatorial synthesis methods, a library of unique bi-degenerate labels is created to tag molecules from the same cell with the same bi-degenerate label and molecules from a different cell with a different label.
  • Each bi-degenerate label, in turn, encodes an oligonucleotide library of unique molecular identifiers derived from the bi-degenerate label. These identifiers are separate from the cell label in that each molecular label is different from one molecule to the next, and they share in common the same bi-degenerate sequence that indicates they are from the same cell.
  • oligonucleotide label as a molecular and a cell identifier is a unique and different approach to associating molecules from individual cells in a population of cells.
  • agents of interest e.g. DNA, RNA, proteins, chemicals etc
  • Oligonucleotide labeling of agents associated with an individual cell in a population or group of cells is useful for a number of applications.
  • One non-limiting example is to uniquely label individual RNA molecules in a cell to enumerate the number of RNA molecules in a specific cell and to distinguish and enumerate RNA molecules from different cells. Sequencing the labeled molecules allows the number of RNA molecules to be counted and their assignment and association to an individual cell from a population of cells to be made.
  • This information can be of great significance in scientific research, in the research and development of therapeutic drugs, in the research and development of medical devices and in the research and development of diagnostic and prognostic tests of disease in humans, animals and plants. Examples include tracing cell lineages during embryogenesis for scientific research; tracking cell proliferation or differentiation during tumorigenesis for medical research and therapeutic or diagnostic development; identifying genotypically rare cells in a population of cells for scientific research; and measuring differences in immune cell populations as a function of disease state or application of a therapeutic drug for therapeutic or diagnostic test development.
  • the source of cells for single cell genetic analysis is varied and can be individual plants, viruses, fungi, prokaryotic cells including bacteria, eukaryotic cells including animals and humans.
  • Cells can be prepared to be labeled in a variety of different ways, which include, but are not limited to, the common steps of (a) enclosing or isolating an individual cell in a container; (b) introducing a lysis agent into the container to lyse the cell and release the agents to be labeled; (c) introducing the oligonucleotide label; (d) attaching the label to one or more agents; and (e) finally, determining the sequence of the label to extract information specific to the cell and different from other cells in the same population and, separately, to the labeled agents in each cell.
  • an oligonucleotide tag or label comprising multiple parts: an oligo dT sequence, a sequencing primer binding site, a common sequence that is the same for all oligonucleotide labels and a unique label tag sequence wherein the unique label tag sequence is selected from a set of at least m different label tag sequences.
  • Encoded in the label in separate parts are oligonucleotide sequences comprising the label that is common to all molecules in a cell and one that is unique and variable from one molecule to the next.
  • Common to each of these nucleic acid labeling schemes is the need to synthesize a diverse library of labels, which could be cumbersome and expensive since, to label molecules from a population of cells, the sequence diversity of the library of oligonucleotide labels must be an order of magnitude or more greater than the number of cells in the population.
  • the label diversity depends on the number of individual nucleotides that comprise the label so to have a diverse library of labels, the label is composed of a number of nucleotides.
  • the larger the number of nucleotides in the label, the greater the diversity but the larger the number of sequencing cycles in an Illumina sequencer needed to measure and identify the label. This in turn leaves fewer cycles for reading the nucleic acid sequence to which the label is attached and illustrates the need for a balance between library diversity and length of the labels in the library.
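  • The trade-off between diversity and read length can be made concrete with a short calculation. The sketch below is illustrative only: it assumes a conventional label built from the four non-degenerate nucleotides, so that a label of length L has at most 4^L distinct sequences, and it takes a tenfold excess over the number of cells as the default margin. The function name and its parameters are hypothetical and do not come from the application.

```python
def min_label_length(num_cells: int, margin: int = 10, alphabet_size: int = 4) -> int:
    """Smallest label length whose diversity is at least margin * num_cells.

    Assumes every position can hold any of `alphabet_size` nucleotides
    (a conventional, non-degenerate label), which is an illustrative model.
    """
    required = margin * num_cells
    length = 0
    diversity = 1  # number of distinct labels of the current length
    while diversity < required:
        length += 1
        diversity *= alphabet_size
    return length


# Each extra nucleotide in the label costs one more sequencing cycle.
print(min_label_length(10_000))
```

  • Under these assumptions, 10,000 cells call for a 9-nucleotide label, since 4^8 = 65,536 falls short of the required 100,000 while 4^9 = 262,144 exceeds it.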
  • the more information included in the label, the longer the label needs to be to accommodate this expanded information set. For example, at least two independent sequences are needed in the same label to identify the cell of origin and to identify and count the number of agents per cell.
  • the present invention seeks to overcome the intrinsic limitations of current approaches for labeling of agents with oligonucleotide labels and to simplify the labeling by eliminating the unique molecular identifier label as an independent sequence and component of the barcode oligonucleotide sequence.
  • the current art provides methods for generating one oligonucleotide label for uniquely labeling the molecular agents in a cell with the same label and a second oligonucleotide label for tagging each molecular agent in a cell with a different specific and unique molecular identifier and combining the two labels into a single label for tagging molecular agents to identify each agent uniquely and to identify from which cell the agent originated in a population of cells.
  • this oligonucleotide label is typically a concatenation of a cell label and a unique molecular identifier.
  • the methods for detection and identification of the unique labels tagging molecular agents from one or more cells is varied and may include methods based on sequencing the label, hybridization with a complementary fluorescently-labeled sequence (FISH), tagging the label with a unique set or combination of fluorophores and/or optically active molecules, measurement of the mass of the oligonucleotide label with a mass spectrometer or measurement of the length of the oligonucleotide with gel electrophoresis.
  • Synthesis of labels typically makes use of a plurality of detectable bi-degenerate bases to generate unique oligonucleotide labels.
  • a library of bi-degenerate bases is synthesized by specifying to an oligonucleotide synthesizer the International Union of Pure and Applied Chemistry (IUPAC) nucleotide labels representing a single nucleotide from a group of one of five possible nucleotides (e.g. Adenine (A), Guanine (G), Thymine (T), Uracil (U) or Cytosine (C)).
  • the synthesizer translates the input labels to synthesize an oligonucleotide sequence and this process is repeated independently with different unique sequences of labels to generate a library of unique oligonucleotide labels.
  • this approach, wherein each label is independently synthesized, is costly and therefore not the preferred embodiment where libraries exceeding 384 labels are needed.
  • Another approach is to combine combinatorically two or more libraries of oligonucleotide labels to synthesize a single library with greater tag diversity than the diversity of the starting libraries.
  • This library of bi-degenerate bases can be used to uniquely label molecular species in a single cell so that all species in a cell thus labeled are distinguished from other cells uniquely.
  • a further embodiment is to attach or synthesize an additional random N-mer sequence to each oligonucleotide label so that each labeled molecule and its replicates in an individual cell is readily identified.
  • the methods allow a plurality of agents, ranging from 2 to millions of agents, to be uniquely labeled without the need to manually generate the same number of unique labels.
  • the agents may be of diverse nature.
  • the agents are nucleic acids such as genomic DNA fragments, RNA transcripts, long non-coding RNA, microRNA, circRNA, chromatin-DNA fragments or they can be non-nucleic acids such as proteins or small molecules.
  • the invention contemplates that agents may be labeled in order to identify them, identify their source, identify their relationship with other agents, enumerate the number of agents in a population of agents, and/or identify one or more conditions to which the agents have been subject.
  • the methods of the invention also provide for amplifying nucleic acids to increase the number of read pairs that can be properly identified via their unique index combination.
  • the method of the invention allows for each end-labeled nucleic acid to be identically labeled at its 5' end, its 3' end, or both.
  • the methods of the invention provide for enumeration of the number of copies of the labeled agent in a population of agents.
  • degenerate, when used to refer to a nucleotide sequence, refers to one or more positions which may contain any of a plurality of different bases. Degenerate residues within an oligonucleotide or nucleotide sequence are denoted by standard IUPAC nucleic acid notation (see Figure 1) and are sub-divided into bi-degenerate and tri-degenerate bases.
  • the unique labels provided herein are at least partly nucleic acid in nature.
  • the invention contemplates the labels are prepared by sequentially attaching either a bi-degenerate or tri-degenerate base to each other.
  • the order in which the bases attach to each other can be from a library of known base sequences or in a random manner. In turn and depending on the base degeneracy, at each position in the base sequence there is either one of two or one of three possible non-degenerate bases.
  • the invention is based, in part, on the appreciation by the inventors that a sequence of bi-degenerate or tri-degenerate bases results in one unique label and a second label is created from the sequences of non-degenerate bases derived from the single degenerate base sequence.
  • the degenerate base sequence is defined as a meta-code packet and the non-degenerate sequences derived from a single meta-code are defined as badges.
  • This combination of a single meta-code packet and multiple, dependent badges is a cipher coding scheme useful for multiple applications where an object or item is labeled with a known degenerate sequence and its dependencies are labeled with a known yet related non-degenerate sequence.
  • One non-limiting example is to use a meta-code to uniquely label the molecular agents from a single cell and the associated badges uniquely label individual molecular agents from the same cell.
  • This compact, two-level cipher uniquely combines, in a single coding scheme and a single oligonucleotide sequence, two critical pieces of information: labeling of an individual cell and assigning unique and different labels derived from the cell label to each molecular agent in the cell.
  • the invention allows a large number of labels to be generated (and thus a large number of agents to be uniquely labeled) using a relatively small number of oligonucleotides.
  • a meta-code can be constructed from IUPAC bi-degenerate symbols wherein each symbol encodes for a specific pair of nucleotides in equal measure.
  • N bi-degenerate symbols can be strung together to form a meta-code to label molecules from a specific cell and, from the same meta-code, there are 2^N unique and different nucleotide sequences as badges for labeling individual molecules from the same cell.
  • the invention contemplates the badge encoding independent information about the agent including identifying the population of agents relative to their common source and the number of unique agents in that population. Additionally, the invention contemplates the badge encoding another extrinsic species interacting with the agents from a single cell. This could be, for example, an antibody or a small molecule wherein the badge encodes an antibody or small molecule interacting specifically with an agent of a population of agents derived or associated with a single cell.
  • the information encoded by the badge in this example would include cell-specific information and be directly related to the meta-code that unifies the relationship of these different agents to a common source, in this case a single cell.
  • Another non-limiting example from the healthcare industry is to use the meta-code to uniquely label an individual patient and the associated badges label all the health information associated with this patient, including different electronic medical records, lab tests, medical imaging results and results from visits to different physicians.
  • Another non-limiting example from the insurance industry is to use the meta-code to uniquely label an individual insurance policy holder and the associated badges label all the insurance information associated with the policy holder including claims, claim information and insurance policies held by the individual.
  • Another non-limiting example from inventory control is to use the meta-code to uniquely label an item in inventory and the associated badges label all the information associated with the inventory item including physical attributes of the item, its manufacturing history, its time in inventory and shipping history to a customer.
  • the invention provides, in part, a method comprising obtaining a sample comprising a plurality of cells; labeling at least a portion of two or more molecular agents such as DNA, RNA, proteins, small molecules, microRNA, long non-coding RNA, metabolites or other chemicals in the cell, complements thereof, or reaction products therefrom, from a first cell of the plurality and a second cell of the plurality with a first same cell label specific to the first cell and a second same cell label specific to the second cell; and a unique label specific to each of one or more molecular agents, complements thereof, or reaction products therefrom, from the first cell; and wherein a unique label specific to each of one or more molecular agents, complements thereof, or reaction products therefrom, from the second cell are unique with respect to each other.
  • the cell specific label or meta-code enables the assignment of labeled molecules to a given cell and the molecular badge uniquely identifies different labeled molecules from that same cell.
  • An oligonucleotide label is typically synthesized by successive addition and polymerization of individual non-degenerate nucleotides (A, G, T or C) to create a single physical oligonucleotide sequence but only encodes for information as described by the position and type of nucleotide in the physical sequence.
  • a key inventive step of this invention is the realization that defining a sequence from IUPAC degenerate bases enables a two-level coding scheme comprised of two parts: a degenerate base sequence that defines an object-specific label and multiple non-degenerate nucleotide sequences derived from the object-specific label that define additional labels to tag additional objects related to, derived from or dependent on the first object labeled with the original degenerate base sequence.
  • the synthesis process can be performed manually with standard oligonucleotide chemistries, by programming a commercial oligonucleotide synthesizer to synthesize uniquely different labels or by synthesizing a set of labels and creating a library of unique labels using combinatorial methods.
  • the cell meta-code can be constructed from standard IUPAC symbols encoding for either two different nucleotides (bi-degenerate cipher) or three different nucleotides (tri-degenerate cipher) at each position during synthesis of the oligonucleotide meta-code and that this encryption scheme permits the cell meta-code to be derived from the dependent molecular badges from the same degenerate base sequence as opposed to discrete sequences joined together to make a single, longer label sequence.
  • a key benefit therefore is a compact and efficient approach to specifically labeling individual agents from individual cells with a shorter, informatically more efficient label than described in current methods. This translates into either lower cost sequencing for the same number of nucleotides sequenced or a larger number of nucleotides sequenced (deeper sequencing) of the attached nucleic acid.
  • a common approach to molecular labeling is to construct a molecular label from the sequential addition of an oligonucleotide barcode sequence plus a unique molecular identifier (UMI).
  • the UMI typically consists of between 6 and 10 nucleotides randomly selected and attached to form a unique oligonucleotide sequence.
  • the schema described here effectively eliminates the UMI, thus making the overall barcode shorter and requiring fewer sequencing cycles to read out. Fewer read cycles on the sequencer translate into lower sequencing costs.
  • a second key benefit is the ability to easily correct for sequencing or synthesis errors since the meta-code sequence is a well-defined and known sequence.
  • barcode labels constructed with a UMI consist of a known barcode sequence concatenated to a random sequence that is the unique molecular identifier.
  • Because the molecular identifier is an unknown random sequence, it can be challenging to identify in the sequence data where the identifier sequence ends and the actual sequence data begins, given that there could be errors in the sequence data itself that could confound this identification.
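  • Because every position of a known meta-code admits only a defined pair of nucleotides, a read can be checked position by position against the meta-code. The sketch below is one plausible illustration of that idea, not a procedure described in the application: the function names are hypothetical, the symbol table is the standard IUPAC two-base code, and ties between candidate packets are reported as ambiguous rather than resolved.

```python
from typing import List, Optional

# Standard IUPAC two-base ambiguity codes.
IUPAC_BI = {"R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT"}


def mismatch_positions(read: str, packet: str) -> List[int]:
    """Positions where the read holds a base the packet symbol does not allow."""
    if len(read) != len(packet):
        raise ValueError("read and packet must have the same length")
    return [
        i
        for i, (base, symbol) in enumerate(zip(read, packet))
        if base not in IUPAC_BI[symbol]
    ]


def closest_packet(read: str, packets: List[str]) -> Optional[str]:
    """Known packet with the fewest mismatches, or None when the best score is tied.

    `packets` must be non-empty (hypothetical helper for illustration).
    """
    scored = sorted((len(mismatch_positions(read, p)), p) for p in packets)
    if len(scored) > 1 and scored[0][0] == scored[1][0]:
        return None  # ambiguous: two packets explain the read equally well
    return scored[0][1]


print(mismatch_positions("ACA", "WWW"))       # position 1 holds a base W does not allow
print(closest_packet("ACA", ["WWW", "SSS"]))  # WWW needs one correction, SSS two
```

  • How reliably a read can be corrected in this way depends on how far apart the packets in the library are; the second benefit described above presupposes a library in which that separation is adequate.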
  • a third key benefit is the availability of a much larger number of molecular labels without adding additional bases in the label sequence. This further economizes the sequencing of the labeled nucleic acid without decreasing diversity of possible molecular labels.
  • An oligonucleotide synthesizer is programmed to synthesize a label based on a string of either bi-degenerate or tri-degenerate symbols. At each nucleotide position, and depending on whether the symbol is bi-degenerate or tri-degenerate, the synthesizer randomly inserts one of two (bi-degenerate) or one of three (tri-degenerate) possible nucleotides, in equal parts, so that a single nucleotide occupies that specific position in the synthesized oligonucleotide. This process repeats itself N times to build an N-mer molecular label (badge) sequence defined by the meta-code sequence.
  • the IUPAC nomenclature specifies either a bi-degenerate or tri-degenerate coding scheme.
  • In bi-degenerate coding we can distinguish two classes of encoding. In the exclusive class, a combination of bi-degenerate symbols is restricted to certain symbols and their associated nucleotides as defined by IUPAC.
  • Alternatively, the bi-degenerate encoding can be non-exclusive and not follow the IUPAC encoding scheme.
  • the paired symbols (R, Y), (M, K), and (S, W) encode for either of two nucleotides, each of which is incorporated in equal proportion at that position by the synthesizer and inserted into the growing sequence as the oligonucleotide molecular label sequence (badge) is synthesized.
  • the bi-degenerate sequence may include specific non-degenerate nucleotides instead of a degenerate base at any position in the nucleotide sequence.
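  • The synthesis step described above can be mimicked in software. The following is a minimal sketch, assuming the standard IUPAC meaning of the six two-base symbols and an unbiased random choice at each degenerate position; the function name and the seeded random generator are illustrative choices, not part of the application. Non-degenerate nucleotides in the meta-code are copied through unchanged.

```python
import random

# Standard IUPAC two-base ambiguity codes, grouped here as the pairs (R, Y), (M, K), (S, W).
IUPAC_BI = {"R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT"}


def synthesize_badge(meta_code: str, rng: random.Random) -> str:
    """Simulate one synthesized badge molecule for a given meta-code."""
    badge = []
    for symbol in meta_code:
        if symbol in IUPAC_BI:
            # Degenerate position: either nucleotide is incorporated with equal probability.
            badge.append(rng.choice(IUPAC_BI[symbol]))
        elif symbol in "ACGT":
            # Fixed position: the specified nucleotide is incorporated as is.
            badge.append(symbol)
        else:
            raise ValueError(f"unsupported symbol: {symbol!r}")
    return "".join(badge)


rng = random.Random(0)  # seeded only to make the illustration repeatable
print([synthesize_badge("WSAS", rng) for _ in range(4)])
```

  • Every simulated badge keeps the fixed A at the third position, while the other three positions vary within the pair allowed by their symbol.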
  • Each IUPAC bi-degenerate base represents a defined pair of the nucleotides A, G, T and C. The bi-degenerate base sequences are combined to define a meta-code packet, and a cell meta-barcode label is created by combining one or more meta-code packets wherein each packet defines multiple unique badge sequences for labelling molecular agents from a single cell. Decoding the packet sequence to its constituent nucleotides results in a library of unique nucleotide sequences called molecular badge sequences that can be sequenced to reveal the specific label identifying a molecular agent.
  • Measurement of the molecular labels in this manner can be used to decipher the badge sequence to the meta-code packets and cell meta-barcodes and assign labeled agents to a specific cell and enumerate the number of copies of the agent from an individual cell as the badge sequences attached to the agents are processed for sequencing analysis.
  • the sequencer records a string of nucleotides (A, G, T or C) as a label associated with a specific molecular agent from a specific cell.
  • the library of molecular badges is already known and is loaded as packets into a look-up table.
  • the table of known badge sequences is scanned to identify a match between the measured sequence and the packets in the look-up table.
  • the cell meta-barcode label is then reconstructed from multiple packets based on the unique, one-to-one correspondence between the molecular badge sequence and the meta-code packet sequences. In this way many molecular badge sequences can be associated with one cell meta-barcode.
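  • The look-up step can be sketched as a dictionary from badge segments to meta-code packets. This is a hedged illustration rather than the application's implementation: the function names are hypothetical, the packets are assumed to have equal length, and the example library is drawn from the single symbol pair (S, W) so that no badge segment can arise from two different packets. The builder raises an error if that one-to-one assumption is violated.

```python
from itertools import product

# Standard IUPAC two-base ambiguity codes.
IUPAC_BI = {"R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT"}


def build_lookup(packets):
    """Map every badge segment to the single packet that encodes it."""
    table = {}
    for packet in packets:
        for combo in product(*(IUPAC_BI[s] for s in packet)):
            badge = "".join(combo)
            if badge in table:
                raise ValueError(
                    f"{badge} is encoded by both {table[badge]} and {packet}"
                )
            table[badge] = packet
    return table


def decode_read(read, table, packet_len):
    """Split a sequenced label into segments and recover the cell meta-barcode."""
    segments = [read[i:i + packet_len] for i in range(0, len(read), packet_len)]
    # A KeyError here means a segment matches no known packet (e.g. a read error).
    return "-".join(table[segment] for segment in segments)


table = build_lookup(["WSS", "SWS", "SSW"])
print(decode_read("ACGCACGGT", table, 3))  # ACG | CAC | GGT
print(decode_read("TGCGTGCCA", table, 3))  # a different badge from the same cell
```

  • Both reads decode to the same meta-barcode, WSS-SWS-SSW, which is how many distinct badge sequences become associated with one cell.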
  • a similar concept is applied to the tri-degenerate encoding where the symbol H, B, V or D is input to the synthesizer, paired with the single nucleotide not included in the tri-degenerate cipher, to create a new sequence and where one of the three nucleotides is incorporated into the oligonucleotide molecular label sequence in equal proportion, to pair with the remaining single base.
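  • For reference, the four tri-degenerate symbols and the single nucleotide each one leaves out can be tabulated directly from the standard IUPAC code. The sketch below only enumerates these sets and the sequences a tri-degenerate packet can yield; it does not model how the excluded nucleotide is paired in during synthesis, and the function names are hypothetical.

```python
from itertools import product

# Standard IUPAC three-base ambiguity codes.
IUPAC_TRI = {"H": "ACT", "B": "CGT", "V": "ACG", "D": "AGT"}


def excluded_nucleotide(symbol):
    """The one nucleotide not covered by a tri-degenerate symbol."""
    (missing,) = set("ACGT") - set(IUPAC_TRI[symbol])
    return missing


def expand_tri_packet(packet):
    """All nucleotide sequences a tri-degenerate packet can give rise to."""
    return ["".join(combo) for combo in product(*(IUPAC_TRI[s] for s in packet))]


print({symbol: excluded_nucleotide(symbol) for symbol in IUPAC_TRI})
print(len(expand_tri_packet("HBV")))  # 3 choices at each of 3 positions
```

  • A packet of N tri-degenerate symbols therefore yields 3^N sequences, compared with 2^N for a bi-degenerate packet of the same length.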
  • a meta-code packet is constructed from strings of bi-degenerate symbols and by way of illustration, Figure 3a shows an example where a packet string of 3 bi-degenerate symbols gives rise to 8 unique nucleotide sequences.
  • For N bi-degenerate symbols comprising a meta-code there are 2^N unique nucleotide sequences.
  • programming the synthesizer to generate the bi-degenerate string WSS will give rise to a unique combination of 8 different nucleotide sequences, all with the common property that they originated from the same packet sequence of bi-degenerate symbols.
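  • The WSS example can be enumerated directly. The sketch below assumes the standard IUPAC meanings W = A or T and S = C or G; the function name is illustrative.

```python
from itertools import product

# Standard IUPAC two-base ambiguity codes.
IUPAC_BI = {"R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT"}


def expand_packet(packet):
    """List every nucleotide sequence encoded by a bi-degenerate packet."""
    choices = [IUPAC_BI[symbol] for symbol in packet]
    return ["".join(combo) for combo in product(*choices)]


badges = expand_packet("WSS")
print(len(badges), badges)
```

  • The eight sequences are ACC, ACG, AGC, AGG, TCC, TCG, TGC and TGG, matching 2^3 = 8 for a packet of three bi-degenerate symbols.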
  • FIG. 3b shows another example where packets are formed from the combination of three bi-degenerate or tri-degenerate symbols selected from their respective canonical sets.
  • Three such packets yield 8^3 = 512 unique nucleotide sequences, where 8 is the number of nucleotide sequences per meta-code packet and 3 is the number of meta-code packets joined together to create the cell meta-barcode.
  • the badge sequence is the label attached through different means to the molecular agents in a single cell.
  • An integral step of the process to associate individual molecular agents to their respective cells of origins requires sequencing the badge sequence attached as labels to the molecular agents.
  • Depending on the molecular agent, specifically if it is a nucleic acid, the molecular agent will also be sequenced to identify the specific nucleic acid associated with the label.
  • This coding schema allows for many badge sequences to be associated with and originate from a single cell meta-code packet. Labeling individual agents within a cell with this labeling scheme makes for a compact single barcode label by combining two labels - one at the cell level and the second at the molecular level - in the same label to associate individual molecules and their replicates to a specific cell. If the agent is replicated as preparation to measuring the nucleotide sequence that defines the label, then the molecular badge is also replicated.
  • the number of agents uniquely associated with a specific cell can be enumerated and the number compared against the number of an internal agent used as a standard or against the number of an external agent added to the sample of cells as an external standard. This is an important property of the cipher scheme since it enables the number of individual agents from a cell to be counted and compared against an internal or external standard.
  • the number of agents from different cells can be compared and the benefit of doing so is to determine the natural differences between cells based on the relative abundance of a particular agent or the differences that may arise due to an external factor such as application of a natural or synthetic chemical, a biological molecule or another external physical, biological or chemical perturbation that would change the cell state in a manner measurable in the number of labeled agents.
  • the library of sequence badges can be synthesized directly or assembled combinatorially from a library of known bi-degenerate symbols such that a table of all possible nucleotide badge sequences corresponding to all possible packet sequences from the synthesized library is generated. Sequencing the sequence badges from a group of cells results in nucleotide sequences that can be compared computationally against the table of known packets. This allows the badge sequences in the sequence data to assign and associate individual molecular agents to their origin from different specific cells.
  • Each molecular agent in a cell will also be labeled with a unique nucleotide badge sequence and if a particular badge is identified more than once in the sequence data then the number of times it occurs in the sequence data can be used to assess and compensate for any bias in the sequencing process. It is particularly useful when enumerating the number of specific molecular agents in a cell.
  • One non-limiting example would be to enumerate the number of copies of a specific transcript from a given cell based on counting the number of times a particular nucleotide badge sequence appears in the sequence data.
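  • A counting step of this kind might look as follows. The sketch takes one reading of the passage, in line with how unique molecular identifiers are conventionally used: distinct badges attached to the same transcript are counted as distinct molecules, and repeated observations of the same badge are treated as replicates from amplification. The input format, the function name and the transcript names are all hypothetical, and the reads are assumed to have already been assigned to a single cell.

```python
from collections import Counter, defaultdict


def count_molecules(reads):
    """Summarise (badge, transcript) reads from one cell.

    Returns the number of distinct badges per transcript and the number of
    reads observed for each (badge, transcript) pair.
    """
    reads_per_badge = Counter(reads)
    badges_per_transcript = defaultdict(set)
    for badge, transcript in reads:
        badges_per_transcript[transcript].add(badge)
    molecules = {t: len(badges) for t, badges in badges_per_transcript.items()}
    return molecules, reads_per_badge


reads = [
    ("ACC", "transcript_1"),
    ("ACC", "transcript_1"),  # same badge seen twice: a replicate, not a new molecule
    ("TGG", "transcript_1"),
    ("AGC", "transcript_2"),
]
molecules, reads_per_badge = count_molecules(reads)
print(molecules)
print(reads_per_badge[("ACC", "transcript_1")])
```

  • Here transcript_1 is counted as two molecules and transcript_2 as one, while the duplicate count for badge ACC gives a handle on amplification or sequencing bias.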
  • Figure 1 The IUPAC coding scheme is used to create meta-codes that direct synthesis of unique molecular labels (badges) for labeling individual molecular agents in a cell.
  • the meta-codes can be combined to create cell meta-barcodes to identify the cell where the agent originated.
  • Figure 2 This figure shows the nomenclature used for the molecular agent labeling process.
  • a meta-code is created from a string of bi-degenerate symbols and a barcode label is synthesized by combining one or more meta-codes wherein each meta-code can be an index combined to synthesize the barcode label.
  • the sequence badge is derived from the meta-code and represents another level of labeling wherein each molecular agent from a single cell will have a unique sequence badge label. This allows for copies of a molecular agent to be enumerated and the number of copies normalized to account and correct for any bias in the sequence process.
  • Figure 3a shows the process by which a library of badge sequences is created from a string of three bi-degenerate symbols from the same canonical set of bi-degenerate symbols (packet).
  • the badge sequence labels all the molecular agents from a particular cell and enables the association of each of those agents to a specific cell. Note that because each bi-degenerate symbol can be replaced by one of two possible nucleotides, the number of possible nucleotide sequences is 2^N, where N is the number of bi-degenerate symbols that comprise the packet.
  • Figure 3b shows a different example where the combination of bi-degenerate and tri-degenerate symbols yields a unique library of badge sequence labels by the combination of three bi-degenerate and tri-degenerate symbols for each meta-code packet.
  • Figures 4a-c illustrate as an example a cell meta-barcode label derived from three meta-code packets and the resulting number of unique molecular badges which is derived from this meta-barcode. In practice this meta-barcode would be used to identify molecular agents originating from the same cell and the sequence badges would be used to track the number of copies of individual molecules during the preparation and sequencing of molecular agents.
  • Figures 4a-c show that a combination of the three indices WSS-MKM-RRY codes for 512 unique sequence cassettes.
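  • The figure of 512 can be checked by multiplying the number of choices at each position, or by enumerating the sequences outright. The sketch below assumes the standard IUPAC two-base codes and treats the hyphen as a purely visual separator between packets; the function names are illustrative.

```python
from itertools import product
from math import prod

# Standard IUPAC two-base ambiguity codes.
IUPAC_BI = {"R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT"}


def barcode_diversity(barcode, sep="-"):
    """Number of badge sequences encoded by a meta-barcode, without enumerating them."""
    return prod(len(IUPAC_BI[s]) for s in barcode.replace(sep, ""))


def expand_barcode(barcode, sep="-"):
    """Enumerate every badge sequence encoded by a meta-barcode."""
    symbols = barcode.replace(sep, "")
    return ["".join(combo) for combo in product(*(IUPAC_BI[s] for s in symbols))]


print(barcode_diversity("WSS-MKM-RRY"))         # 8 * 8 * 8
print(len(set(expand_barcode("WSS-MKM-RRY"))))  # all enumerated sequences are distinct
```

  • Both calls give 512, that is 8^3, or equivalently 2^9 for nine bi-degenerate positions.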
  • Figure 5 describes a combinatorial split pool synthesis as the preferred method by which the oligonucleotide barcode sequences are constructed.
  • the invention provides methods and compositions for uniquely labeling agents of interest including for example nucleic acids such as DNA, DNA fragments, chromatin DNA, RNA, miRNA, long non-coding RNA, proteins, small molecules, peptides and metabolites.
  • the ability to uniquely label agents has a number of applications, as contemplated by the invention, including but not limited to genomic sequencing, genomic assembly, screening of putative drugs and biologics, analysis of environmental samples to discover new organisms, labeling of individual elements of synthetic biology constructs, labeling of samples from a specific source or donor, quality control analysis of reagents to verify purity, the analysis of nucleic acids from single cells, analysis of various conditions on populations of cells, single cells or cell components.
  • One of the major limitations of prior art labeling techniques is the limited number of available unique labels capable of labeling both the population of agents relative to their common source and the number of unique agents in that population. Typically, the number of agents to be labeled in any given application far exceeds the number of unique labels that are available.
  • the methods of the invention can be used to synthesize essentially an infinite number of unique labels capable of dual level encoding of individual agents. Moreover, because of their nature, the labels can be easily detected and distinguished from each other, making them suitable for many applications and uses.
  • the methods of the invention easily and efficiently generate libraries of unique labels.
  • libraries may be of any size and are preferably large libraries including tens to hundreds of millions to billions of unique labels.
  • the libraries of unique labels may be synthesized separately from agents and then associated with agents post-synthesis. Alternately, unique labels may be synthesized in real-time (e.g. while in the presence of the agent) and in some instances the label synthesis is a function of the history of the agent. This means, in some instances, that synthesis of the label may occur while an agent is being exposed to one or more conditions. This would occur, for example, in a continuous flow system or in a microfluidic droplet.
  • the invention therefore contemplates that the resultant label may store (or code) within it information about the agent (i.e. agent-specific information), including the origin or source of the agent, the relatedness of the agent to other agents (e.g. the number of agents in a population of agents), the genomic distance between two agents (e.g. in the case of genomic fragments), conditions to which the agent may have been exposed, and the like.
  • Some methods of the invention comprise determining information about an agent based on the unique label associated with the agent.
  • determining information about the agent may comprise obtaining the nucleotide sequence of the unique label (i.e. sequencing the unique label).
  • determining information about the agent may comprise determining the presence, number and/or order of non-nucleic acid detectable moieties.
  • determining information about the agent may comprise obtaining the nucleotide sequence of the unique label and determining the presence, number and/or order of non-nucleic acid detectable moieties.
  • agent refers to any moiety or entity that can be associated with, including being attached to, a unique label.
  • An agent may be a single entity, or it may be a plurality of entities.
  • An agent may be a nucleic acid, a peptide, a protein, a cell, a cell lysate, a solid support, a polymer, a chemical, a metabolite, and the like, or an agent may be a plurality of any of the foregoing, or it may be a mixture of the foregoing.
  • an agent may be nucleic acids (e.g. mRNA transcripts and/or genomic DNA fragments), solid supports such as beads or polymers, and/or proteins from a single cell or from a single cell population (e.g. a tumor or non-tumor tissue sample).
  • an agent is a nucleic acid.
  • the nucleic acid agent may be single-stranded (ss) or double-stranded (ds), or it may be partially single-stranded and partially double-stranded.
  • Nucleic acid agents include but are not limited to DNA such as genomic DNA fragments, PCR and other amplification products, RNA, cDNA, and the like. Nucleic acid agents may be fragments of larger nucleic acids such as but not limited to genomic DNA fragments.
  • An agent of interest may be associated with a unique label.
  • "associated” refers to a relationship between the agent and the unique label such that the unique label may be used to identify the agent, identify the source or origin of the agent, identify one or more conditions to which the agent has been exposed, etc.
  • a label that is associated with an agent may be, for example, physically attached to the agent, either directly or indirectly, or it may be in the same defined, typically a physically separate, volume as the agent.
  • a defined volume may be an emulsion droplet, a well (of for example a multiwell plate), a tube, a container, and the like. It is understood that the defined volume will typically contain only one agent and the label with which it is associated, although a volume containing multiple agents with multiple copies of the label is also contemplated depending on the application.
  • An agent may be associated with a single copy of a unique label or it may be associated with multiple copies of the same unique label including for example 2, 3, 4, 5, 6, 7, 8, 9, 10, 100, 1,000 (10^3), 10,000
  • the label is considered unique because it is different from labels associated with other, different agents.
  • Attachment of labels to agents may be direct or indirect.
  • the attachment chemistry will depend on the nature of the agent and/or any derivatization or functionalization applied to the agent.
  • labels can be directly attached through covalent attachment.
  • the label may include a moiety, which may be a non-nucleotide chemical modification, to facilitate attachment.
  • the label may include methylated nucleotides, uracil bases, phosphorothioate groups, ribonucleotides, diol linkages, disulphide linkages, etc. to enable covalent attachment to an agent.
  • a label can be attached to an agent via a linker or in another indirect manner.
  • linkers include, but are not limited to, carbon-containing chains, polyethylene glycol (PEG), nucleic acids, monosaccharide units and peptides.
  • the linkers may be cleavable under certain conditions. Cleavable linkers are discussed in greater detail herein.
  • nucleic acid labels to nucleic acid agents
  • methods for attaching nucleic acids to each other include but are not limited to ligation, such as blunt end ligation or cohesive overhang ligation, and polymerase-mediated attachment methods (see, e.g. US Patent Nos. 7863058 and 7754429; Green and Sambrook, Molecular Cloning: A Laboratory Manual, Fourth edition, 2012; Current Protocols in Molecular Biology, and Current Protocols in Nucleic Acid Chemistry).
  • oligonucleotide adapters are used to attach a unique label to an agent or to a solid support.
  • an oligonucleotide adapter comprises one or more known sequences, e.g. an amplification sequence, a capture sequence, a primer sequence, and the like.
  • the adapter comprises a thymidine (T) tail overhang.
  • TdT terminal deoxynucleotide transferase
  • the oligonucleotide adapter comprises a region that is forked.
  • the adapter comprises a capture or detection moiety.
  • moieties include, but are not limited to, fluorophores, microparticles such as quantum dots, gold nanoparticles, microbeads, biotin, DNP (dinitrophenyl), fucose, digoxigenin, avidin, streptavidin, amino acid-based tags such as, but not limited to HA-, Myc-, FLAG-, MBP-, SUMO-, Protein A-, polyhistidine- and GST-tags, antigens and other moieties known to those skilled in the art.
  • the moiety is biotin.
  • a label and/or an agent may be attached to a solid support.
  • a label or multiple copies of the same label
  • suitable solid supports include, but are not limited to, inert polymers (preferably non-nucleic acid polymers), porous hydrogel polymers, beads, magnetic beads, hydrogel beads, glass, ceramics, metals, with limited mobility carbon nanofibers or nanotubes, or peptides.
  • the solid support is an inert polymer or bead (porous or non-porous).
  • the solid support may be functionalized to permit covalent attachment of the agent and/or label. Such functionalization may comprise placing on the solid support reactive groups that permit covalent attachment to an agent and/or a label.
  • Labels and/or agents may be attached to each other or to solid supports using cleavable linkers.
  • Cleavable linkers are known in the art and include, but are not limited to, TEV, trypsin, thrombin, cathepsin B, cathepsin D, cathepsin K, caspase 1, matrix metalloproteinase sequences, phosphodiester, phospholipid, ester, beta galactose, β-glucuronide, dialkyl dialkoxysilane, cyanoethyl group, sulfone, ethylene glycolyl disuccinate, 2-N-acyl nitrobenzenesulfonamide, α-thiophenylester, unsaturated vinyl sulfide, sulfonamide after activation, malondialdehyde (MDA)-indole derivative, levulinoyl ester, hydrazone, acylhydrazone, alkyl thioester
  • Cleavage conditions and reagents include, but are not limited to enzymes, nucleophilic/basic reagents, reducing agents, photo-irradiation, electrophilic/acidic reagents, organometallic and metal reagents, and oxidizing reagents.
  • the unique labels of the invention are, at least in part, nucleic acid in nature, and are generated either by sequentially attaching two or more detectable bi-degenerate base positions to each other or by synthesis as a single sequence of bi-degenerate bases.
  • the preferred embodiment constructs the barcode label by sequential attachment of two or more detectable bi-degenerate bases.
  • a detectable bi-degenerate position is one where either one of two possible nucleotides defined by the specific base at that single base position in the sequence is incorporated uniquely into the oligonucleotide sequence and can be detected by sequencing of its nucleotide sequence and/or by detecting non-nucleic acid detectable moieties it may be attached to.
  • the barcode label can be constructed from other degenerate sequence libraries. A similar embodiment would involve a sequence of tri-degenerate bases.
  • the oligonucleotide bi-degenerate bases are typically randomly selected from a diverse plurality of oligonucleotide bi-degenerate bases.
  • an oligonucleotide tag may be present once in a plurality or it may be present multiple times in a plurality.
  • the plurality of bi-degenerate positions may be comprised of a number of subsets each comprising a plurality of identical bi-degenerate bases. In some important embodiments, these subsets are physically separate from each other. Physical separation may be achieved by providing the subsets in separate wells of a multiwell plate or separate droplets from an emulsion.
  • oligonucleotide bi-degenerate positions that result in multiple primary sequences (badges) that correspond to a unique label. Accordingly, the number of distinct (i.e., different) oligonucleotide bi-degenerate positions required to uniquely label a plurality of agents can be far less than the number of agents being labeled. This is particularly advantageous when the number of agents is large (e.g. when the agents are members of a library).
  • a similar embodiment would involve a sequence of tri-degenerate bases.
  • the oligonucleotide bi-degenerate bases may be detectable by virtue of the nucleotide sequence, or by virtue of a non-nucleic acid detectable moiety attached to the oligonucleotide such as but not limited to a fluorophore, or by virtue of a combination of the nucleotide sequence and the non-nucleic acid detectable moiety.
  • a similar embodiment would involve a sequence of tri-degenerate bases.
  • oligonucleotide refers to a nucleic acid such as deoxyribonucleic acid (DNA), ribonucleic acid (RNA), or DNA/RNA hybrids and includes analogs of either DNA or RNA made from nucleotide analogs known in the art. Oligonucleotides may be single-stranded (such as sense or anti-sense oligonucleotides), double-stranded or partially single-stranded and partially double stranded.
  • the invention provides methods for generating unique labels.
  • the methods typically use a plurality of detectable bi-degenerate bases to generate a plurality of molecular labels (badges) relating to a single packet sequence.
  • a library of bi-degenerate bases is synthesized according to the International Union of Pure and Applied Chemistry (IUPAC) nomenclature of degenerate base symbols where each symbol represents a position in an oligonucleotide sequence that can have multiple possible alternatives derived from the canonical nucleotide bases of Adenine (A), Guanine (G), Thymine (T) or Cytosine (C) (Figure 1).
  • the degenerate symbols can be combined together to form a unique label packet sequence and meta-barcode for tagging a population of agents and reducing the degenerate sequences into the set of unique canonical sequences tagging the individual agents with a common association, such as those derived from a single cell in a population of cells, and a unique label or molecular badge for each agent in the population of agents with a common association.
  • the bi-degenerate base symbols Weak (W), Strong (S), aMino (M), Keto (K), puRine (R) and pYrimidine (Y), each representing one of two canonical nucleotides, are used to synthesize a label for tagging a population of agents.
  • the bi-degenerate symbols are grouped according to the reliability of synthesis of the underlying nucleotide sequence and to allow for unambiguous synthesis of a set of molecular badges uniquely associated with the cell meta-code packet.
  • bi-degenerate symbols are grouped into the following canonical sets: (R,Y), (M,K) and (W,S) and individual packets are synthesized only from the pair of canonical sequences. These packets can be joined together to form a cell meta-code barcode label.
  • a label defined by a packet sequence of N bi-degenerate symbols allows for the synthesis of a library of 2^N unique molecular badge sequences that form a library of unique labels for tagging agents. This leads to a library of 2^N unique oligonucleotide sequences defined by the N length packet sequence.
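The expansion of a packet into its 2^N canonical badge sequences can be sketched as a Cartesian product over the nucleotides allowed at each position. This is an illustrative sketch only, assuming the standard IUPAC definitions of the six bi-degenerate symbols; `IUPAC_BI` and `expand_packet` are hypothetical names, not part of the specification.

```python
from itertools import product

# Standard IUPAC bi-degenerate symbols and the two nucleotides each represents.
IUPAC_BI = {
    "R": "AG", "Y": "CT",  # puRine / pYrimidine
    "S": "CG", "W": "AT",  # Strong / Weak
    "K": "GT", "M": "AC",  # Keto / aMino
}


def expand_packet(packet):
    """Return every canonical sequence encoded by a bi-degenerate packet."""
    choices = [IUPAC_BI[symbol] for symbol in packet]
    return ["".join(bases) for bases in product(*choices)]


badges = expand_packet("WSS")
print(len(badges))  # 2**3 = 8
print(badges)
```

For a packet of N symbols the list has 2^N entries, so enumerating it is practical only for short packets.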
  • a key advantage of this label scheme is that the same label tags, at the level of the degenerate meta-barcode sequence, agents from a common source that are related to each other (such as agents from a single cell) and codes, at the level of canonical oligonucleotide badge sequences, individual agents in the population of agents, which enables digital counting of agents and copies of agents from a common origin or source.
  • the meta-barcode library encodes a population of agents with a common association and the individual molecular badge sequence encodes individual agents in that same population.
  • unique barcode labels can be synthesized from a second set of IUPAC tri-degenerate base symbols, specifically H, B, V and D, to create a meta-barcode label of N tri-degenerate base symbols in length.
  • the symbol H is matched with the base G, B with A, V with T and D with C to produce codes in the same manner as the bi-degenerate base codes.
  • This similarly allows for the synthesis of a library of 2^N unique nucleotide sequences, and combining M meta-codes into a barcode label results in 2^(N×M) unique molecular badges.
  • An example would be if each meta-code is 3 tri-degenerate bases in length and the meta-codes are combined into barcode labels of 5 meta-codes each,
  • unique molecular badge labels can be synthesized from a combination of IUPAC degenerate (and/or nondegenerate) base symbols at each position in an N symbol sequence depending on the degenerate base symbol in the sequence of N different symbols defining a library of unique meta-barcode labels.
  • the oligonucleotide label is synthesized to consist of at least two sets of at least two or three consecutive nucleotides encoding standard IUPAC symbols wherein each set of at least two or three consecutive nucleotides has a Hamming distance of at least 1, 2 or 3 to every other set of at least two or three nucleotides encoded within the oligonucleotide label.
  • the oligonucleotide label described beforehand herein optionally comprised within a composition, comprises at least two or more sets of two or three consecutive nucleotides encoding predefined IUPAC symbols wherein the code of IUPAC symbols encode a set of bi-degenerate and/or tri-degenerate bases each having a Hamming distance of at least
  • a unique nucleotide badge sequence may be a nucleotide sequence that is different (and thus distinguishable) from the sequence of each detectable oligonucleotide tag in a plurality of detectable oligonucleotide bi-degenerate bases.
  • a unique nucleotide sequence may also be a nucleotide sequence that is different (and thus distinguishable) from the sequence of each detectable oligonucleotide tag in a first plurality of detectable oligonucleotide bi-degenerate bases but identical to the sequence of at least one detectable oligonucleotide tag in a second plurality of detectable oligonucleotide bi-degenerate bases.
  • a unique sequence may differ from other sequences by multiple bases (or base pairs).
  • the multiple bases may be contiguous or non-contiguous.
  • Methods for obtaining nucleotide sequences e.g. sequencing methods are described herein and/or are known in the art.
  • detectable bi-degenerate bases comprise one or more of a ligation sequence, a priming sequence, a capture sequence, and a unique sequence (optionally referred to herein as a badge sequence).
  • a ligation sequence is a sequence complementary to a second nucleotide sequence which allows for ligation of the detectable oligonucleotide tag to another entity comprising the second nucleotide sequence, e.g.
  • a priming sequence is a sequence complementary to a primer, e.g. an oligonucleotide primer used for an amplification reaction such as but not limited to PCR.
  • a capture sequence is a sequence capable of being bound by a capture entity.
  • a capture entity may be an oligonucleotide comprising a nucleotide sequence complementary to a capture sequence, e.g. a second detectable oligonucleotide tag or an oligonucleotide attached to a bead.
  • the bi-degenerate bases could comprise nucleotides selected from purine bases, pyrimidine bases, natural nucleotide bases, chemically modified nucleotide bases, biochemically modified nucleotide bases and non-natural nucleotide bases.
  • a capture entity may also be any other entity capable of binding to the capture sequence, e.g. an analog, antibody or peptide.
  • An index sequence is a sequence comprising a unique nucleotide sequence and/or a detectable moiety as described above (Figure 4). Incorporation into a single barcode tag is shown in Figure 6.
  • “Complementary” is a term which is used to indicate a sufficient degree of complementarity between two nucleotide sequences such that stable and specific binding occurs between one and preferably more bases (or nucleotides, as the terms are used interchangeably herein) of the two sequences. For example, if a nucleotide in a first nucleotide sequence is capable of hydrogen bonding with a nucleotide in a second nucleotide sequence, then the bases are considered to be complementary to each other. Complete (i.e. 100%) complementarity between a first nucleotide sequence and a second nucleotide is preferable, but not required for ligation, priming or capture sequences.
  • Each unique label comprises two or more detectable oligonucleotide bi-degenerate bases.
  • the two or more bi-degenerate bases may be three or more bi-degenerate bases, four or more bi-degenerate bases, or five or more bi-degenerate bases.
  • a unique label comprises 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 30, 40, 50, 100 or more detectable bi-degenerate bases.
  • a similar embodiment would involve a sequence of tri-degenerate bases.
  • the bi-degenerate bases are typically bound to each other, typically in a directional manner.
  • Ligation reactions include blunt end ligation and cohesive overhang ligation. In some instances, ligation may comprise both blunt end and cohesive overhang ligation.
  • a "cohesive overhang” (also referred to as a "cohesive end” or an "overhang”) is a single stranded end sequence (attached to a double stranded sequence) capable of binding to another single stranded sequence thereby forming a double stranded sequence.
  • a cohesive overhang may be generated by a polymerase, a restriction endonuclease, a combination of a polymerase and a restriction endonuclease, or Uracil-Specific Excision Reagent (USER) enzyme (NEB) or a combination of a Uracil DNA glycosylase enzyme and a DNA glycosylase-lyase Endonuclease VIII enzyme.
  • a "cohesive overhang" may be a thymidine tail.
  • Polymerization reactions include enzyme-mediated polymerization such as a polymerase-mediated fill-reaction. Similar considerations apply to tri-degenerate-based barcode labels. A similar embodiment would involve a sequence of tri-degenerate bases.
  • a cleavable linker attaches the oligonucleotide label to a solid substrate such as a hydrogel bead.
  • the label may contain a promoter or primer sequence followed by one (PE1) of two sequences (PE1, W1) used to construct a sequence library suitable for sequencing on an Illumina sequencer.
  • the first barcode index (index 1) is attached to PE1 and W1 is connected to index 1 followed by index 2 and a polyT sequence for hybridization sequence capture of the RNA poly A sequence to the hydrogel bead.
  • detection comprises determining the presence, number, and/or order of detectable bi-degenerate bases that comprise a unique molecular badge label. For example, if the unique label comprises detectable moieties, as described herein, fluorometry, mass spectrometry or other detection methodology can be used for detection. In another example, if the unique label comprises unique nucleotide sequences, a sequencing methodology can be used for detection. If the unique label comprises both unique sequences and detectable moieties, a combination of detection methods may be appropriate, e.g. fluorometry and a sequencing reaction. A similar embodiment would involve a sequence of tri-degenerate bases. Methods of sequencing oligonucleotides and nucleic acids are well known in the art (see US Pat 5525464, 5202231, etc.).
  • the bi-degenerate "RS" encodes for four sequences ("AC", "AG", "GC", "GG").
  • Another bi-degenerate "MY" encodes for "AC", "AT", "CC", "CT". Note that because "AC" is encoded by both bi-degenerate bases in this example, these bi-degenerate bases are not compatible with each other.
  • it is useful to define a metric, called the "bi-degenerate distance", between two bi-degenerate bases as the smallest Hamming distance between any pair of encoded sequences compared from the bi-degenerate bases.
  • the Hamming distance is the smallest number of substitutions required to change one sequence to another sequence.
  • the bi-degenerate bases "RS” and “MY” have a bi-degenerate distance of zero, because zero substitutions are required when comparing the sequence "AC” (from the "RS” bi-degenerate) and the sequence "AC” (from the "MY” bi-degenerate).
  • the maximum bi-degenerate distance is also the maximum Hamming distance, which is the length of the bi-degenerate sequence itself.
  • a pair of bi-degenerate bases with a bi-degenerate distance of at least one can be used together to create labels which can be distinguished from each other. These bi-degenerate bases are then said to be “compatible”.
  • “RW” encodes for "AA”, “AT”, “GA”, and “GT”. It has a bi-degenerate distance of 1 with “RS” because any comparison of two sequences from the bi-degenerate bases requires at least one substitution.
  • the sequence “AA” from “RW” compared with “AC” from "RS” requires one change, whereas “AA” from “RW” compared to "GG” from “RS” requires two substitutions.
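The bi-degenerate distance defined above can be computed directly by expanding both bi-degenerate bases and taking the smallest pairwise Hamming distance. The following is a brute-force sketch for illustration, assuming the standard IUPAC symbol definitions; the function names are hypothetical.

```python
from itertools import product

# Standard IUPAC bi-degenerate symbols and the two nucleotides each represents.
IUPAC_BI = {"R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC"}


def expand(code):
    """List every canonical sequence encoded by a bi-degenerate code."""
    return ["".join(p) for p in product(*(IUPAC_BI[s] for s in code))]


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def bi_degenerate_distance(code_a, code_b):
    """Smallest Hamming distance between any pair of encoded sequences."""
    return min(hamming(a, b) for a in expand(code_a) for b in expand(code_b))


print(bi_degenerate_distance("RS", "MY"))  # 0: both encode "AC", not compatible
print(bi_degenerate_distance("RW", "RS"))  # 1: compatible
```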
  • larger bi-degenerate distances between a pair of bi-degenerate bases are beneficial because error removal or correction methods can then be used to distinguish between the bi-degenerate bases when the returned sequence has errors.
  • a bi-degenerate distance of two enables error removal: A sequence which is not from either bi-degenerate base but has an error such that it has a Hamming distance of 1 from each base can be thrown away without misassignment to an incorrect base.
  • a bi-degenerate distance of three or greater enables error correction, where it is possible to still make the correct base assignment despite an error in the returned sequence. If the returned sequence has a Hamming distance of 1 with a known base and a Hamming distance greater than 1 to the others, it can be properly assigned to the known base.
  • a set of compatible bi-degenerate bases can be created by choosing a bi-degenerate of the desired length and calculating its bi-degenerate distance across all existing members of the set. If the smallest bi-degenerate distance is at least one, the chosen bi-degenerate can then be added to the set.
  • the set can then be increased in size until the desired number of bi-degenerate bases is found, subject to the limitations based on the length of each base. It is beneficial to define a metric, called the "set distance", for a set of bi-degenerate bases, which is defined as the minimum distance between any two bi-degenerate members of the set. A similar embodiment would involve a sequence of tri-degenerate bases.
  • the set of three bi-degenerate bases (“RSSM”, “YSWM”, “RWWK”) has a set distance of two because the minimum bi-degenerate distance amongst any pair of bi-degenerate bases is two. That means error removal can be used for any returned sequence in this set of bi-degenerate bases.
  • the returned sequence “AGAT” has Hamming distance 2 from “RSSM”, 2 from “YSWM”, and 1 from "RWWK", so it can be assigned to "RWWK". Note that sets with the same set distance may have different average bi-degenerate distances amongst set members, so may differ in opportunities for error removal.
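The set distance and the assignment of "AGAT" can be checked with a short sketch. Because each position varies independently, the bi-degenerate distance equals the number of positions at which the two symbols share no nucleotide, which avoids expanding every sequence. The assignment rule used here (unique nearest member wins, ties are discarded) is an assumption made for illustration, as are all function names.

```python
from itertools import combinations

# Standard IUPAC bi-degenerate symbols and the two nucleotides each represents.
IUPAC_BI = {"R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC"}


def code_distance(code_a, code_b):
    """Bi-degenerate distance: positions where the symbols share no nucleotide."""
    return sum(set(IUPAC_BI[x]).isdisjoint(IUPAC_BI[y])
               for x, y in zip(code_a, code_b))


def set_distance(codes):
    """Minimum bi-degenerate distance over all pairs in the set."""
    return min(code_distance(a, b) for a, b in combinations(codes, 2))


def read_distance(read, code):
    """Hamming distance from a read to the nearest sequence the code encodes."""
    return sum(base not in IUPAC_BI[sym] for base, sym in zip(read, code))


def assign(read, codes):
    """Return the unique nearest code, or None when the nearest is tied."""
    dists = {code: read_distance(read, code) for code in codes}
    best = min(dists.values())
    winners = [code for code, d in dists.items() if d == best]
    return winners[0] if len(winners) == 1 else None


codes = ["RSSM", "YSWM", "RWWK"]
print(set_distance(codes))                         # 2
print([read_distance("AGAT", c) for c in codes])   # [2, 2, 1]
print(assign("AGAT", codes))                       # RWWK
```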
  • a set of bi-degenerate bases can be created which has the appropriate number of members, with a desired set distance. If a set cannot be generated with enough members even after multiple trials with randomized choices for membership candidates, then the length of the bi-degenerate bases can be increased to increase the total number of possibilities.
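The set-generation procedure described above can be sketched as a randomized greedy search that grows the length when too few members are found. This is one possible reading of the procedure, not a definitive implementation; the trial count, the seed, the length bounds and all names are assumptions chosen for illustration.

```python
import random

# Standard IUPAC bi-degenerate symbols and the two nucleotides each represents.
IUPAC_BI = {"R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC"}
SYMBOLS = "RYSWKM"


def code_distance(code_a, code_b):
    """Bi-degenerate distance: positions where the symbols share no nucleotide."""
    return sum(set(IUPAC_BI[x]).isdisjoint(IUPAC_BI[y])
               for x, y in zip(code_a, code_b))


def build_set(length, size, min_distance, trials=10000, seed=0):
    """Greedily collect random candidates that keep the set distance >= min_distance."""
    rng = random.Random(seed)
    chosen = []
    for _ in range(trials):
        if len(chosen) == size:
            break
        candidate = "".join(rng.choice(SYMBOLS) for _ in range(length))
        if all(code_distance(candidate, c) >= min_distance for c in chosen):
            chosen.append(candidate)
    return chosen


def build_set_growing(size, min_distance, start_length=2, max_length=12):
    """Increase the length until a set with enough members is found."""
    for length in range(start_length, max_length + 1):
        codes = build_set(length, size, min_distance)
        if len(codes) == size:
            return codes
    raise ValueError("no set found within the allowed lengths")


print(build_set(length=4, size=3, min_distance=2))
```

A greedy search of this kind does not guarantee the largest possible set for a given length, which is why repeated trials with different random candidates can help.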
  • the oligonucleotide label has a length of 24 to 200 nucleotides. In another embodiment, the oligonucleotide label has a length of 24 to 100 nucleotides. In a further embodiment, the oligonucleotide label has a length of 24 to 75 nucleotides. In a preferred embodiment, the oligonucleotide label has a length of 24 to 50 nucleotides. In an even more preferred embodiment, the oligonucleotide label has a length of 24 to 45 nucleotides. In the most preferred embodiment, the oligonucleotide label has a length of 30 nucleotides.
  • the oligonucleotide label comprises one or more barcode sequences, each 10-50 nucleotides in length. In a preferred embodiment, the oligonucleotide label comprises one or more barcode sequences, each 10-35 nucleotides in length. In an even more preferred embodiment, the oligonucleotide label comprises one or two barcode sequences, each 10-20 nucleotides in length. In the most preferred embodiment, the oligonucleotide label comprises two barcode sequences, each 10 nucleotides in length.
  • the cell specific label which is encoded within the oligonucleotide label and comprised of IUPAC symbols, has the sequence N*N*N*N*N*N*N*N*N*N*N*N*N*N*N*, wherein N* represents any IUPAC symbol, except the symbol "N" which encodes for one of four non-degenerate nucleotides.
  • the cell specific label which is encoded within the oligonucleotide label and comprised of IUPAC symbols, is encoded by a first barcode with the sequence WN*N*N*N*N*N*N*N*N*N*N*N* and by a second barcode with the sequence RN*N*N*N*N*N*N*N*N*N*N*N*N*N*N*, wherein N* represents any IUPAC symbol, except the symbol "N" which encodes for one of four non-degenerate nucleotides.
  • the invention provides methods for generating unique molecular badge labels.
  • the methods typically use a plurality of detectable bi-degenerate bases to generate unique labels.
  • a unique label is produced by sequentially attaching two or more detectable oligonucleotide bi-degenerate bases to each other.
  • the detectable bi-degenerate bases may be present or provided in a plurality of detectable bases.
  • the same or a different plurality of bi-degenerate bases may be used as the source of each detectable tag comprised in a unique label.
  • a plurality of bi-degenerate bases may be subdivided into subsets and single subsets may be used as the source for each tag.
  • each well of an M-well microplate holds one degenerate sequence N symbols in length, and each well in the microplate contains 2^N non-degenerate oligonucleotide sequences corresponding to that one degenerate sequence.
  • a population of hydrogel beads is distributed in each well of the microplate, typically >10,000 beads per well, and the degenerate sequence is irreversibly captured onto the solid substrate.
  • a library of unique meta-barcode is created from the random combination of three packets, one from each well in the microwell plate. Following the mix and react sequence P times, the size of the meta-barcode library synthesized in this ⁇
  • GSP gene specific primer
  • Decoding a returned sequence first involves identifying each molecular badge from the returned sequence. Because each molecular badge is associated with a known set of bi-degenerate bases when encoded, error correction methods can be used as described in the set generation process to assign the molecular sequence to its corresponding bi-degenerate. If it is ambiguous to which bi-degenerate the badge should be assigned, then the error can be removed by throwing away the returned sequence to prevent misassignments. If all badges are able to be assigned to their corresponding bi-degenerate bases, the meta-barcode is then known and can be assigned. A similar embodiment would involve a sequence of tri-degenerate bases.
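The decoding steps can be sketched as follows: split the returned sequence into badges, assign each badge to its nearest bi-degenerate, and discard the whole read if any badge is ambiguous. The fixed-length badge layout, the one-error tolerance and the names `assign_badge` and `decode` are assumptions made for this illustration.

```python
# Standard IUPAC bi-degenerate symbols and the two nucleotides each represents.
IUPAC_BI = {"R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC"}


def read_distance(read, code):
    """Hamming distance from a read to the nearest sequence the code encodes."""
    return sum(base not in IUPAC_BI[sym] for base, sym in zip(read, code))


def assign_badge(badge, codes, max_errors=1):
    """Return the unique nearest code within max_errors, else None."""
    ranked = sorted((read_distance(badge, code), code) for code in codes)
    best_distance, best_code = ranked[0]
    if best_distance > max_errors:
        return None  # too many errors to trust any assignment
    if len(ranked) > 1 and ranked[1][0] == best_distance:
        return None  # tie between codes: ambiguous, so remove the read
    return best_code


def decode(read, code_sets):
    """Decode a read into a meta-barcode; code_sets holds one code list per badge."""
    packets = []
    position = 0
    for codes in code_sets:
        length = len(codes[0])  # assumes all codes in a list share one length
        badge = read[position:position + length]
        position += length
        code = assign_badge(badge, codes)
        if code is None:
            return None
        packets.append(code)
    return "-".join(packets)


code_sets = [["RSSM", "YSWM", "RWWK"]] * 2
print(decode("AGCACGAA", code_sets))  # RSSM-YSWM
print(decode("AGATCGAA", code_sets))  # RWWK-YSWM (first badge is 1 away from RWWK)
print(decode("AGAACGAA", code_sets))  # None (first badge ties RSSM and YSWM)
```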
  • the invention provides methods for encoding both a unique molecular badge as well as a deterministic identifier (or cell meta-barcode) in the same position.
  • the method typically uses a symmetrical-key algorithm, but may also use an asymmetrical-key algorithm.
  • in a symmetrical-key algorithm, the known packet sequence consisting of the bi-degenerate base symbols Weak (W), Strong (S), aMino (M), Keto (K), puRine (R) and pYrimidine (Y), each representing one of two canonical nucleotides, is translated into a mixed-base nucleotide sequence during barcode synthesis.
  • to convert the nucleotide sequence back into the packet sequence, the same encryption key is needed. This can be done in several ways, such as, but not limited to, a symmetrical-key algorithm, a look-up table, or a hash table.
  • a symbolic find and replace function can be used to convert between the badge and packet.
  • the algorithm would take a string consisting of a nucleotide sequence as an input and would search for all instances of A and G and replace them with R. It would then search for all instances of T and C and replace them with Y. The output of the function would be the converted packet sequence.
  • a position-indifferent unique bi-degenerate pair such as Purine (R) and Pyrimidine (Y)
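The find-and-replace conversion for the position-indifferent R/Y pair can be sketched with a single translation table, which is equivalent to the two sequential replacements described above because the two substitutions do not overlap. The function name is hypothetical.

```python
def badge_to_packet_ry(sequence):
    """Convert a nucleotide sequence to its purine/pyrimidine packet."""
    # A and G are purines (R); C and T are pyrimidines (Y).
    table = str.maketrans({"A": "R", "G": "R", "C": "Y", "T": "Y"})
    return sequence.upper().translate(table)


print(badge_to_packet_ry("AGCTTG"))  # RRYYYR
```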
  • the symmetrical-key algorithm can also work with an additional input which denotes a specified symbolic encryption for each position.
  • the algorithm would have two inputs: the first being the string consisting of the nucleotide sequence, the second being the identifier that determines the unique bi-degenerate pair (R/Y, W/S, or M/K) used for each position.
  • the position-specific symmetrical-key algorithm adds an additional layer of encryption to the packet conversion.
  • the symmetrical-key required to decipher the packet sequence as above can be enciphered in the nucleotide sequence in a specific, defined position so that knowledge of that sequence would define which bi-degenerate pair had been used to direct the synthesis of that molecular badge sequence.
  • the conversion from nucleotide sequence to barcode can also be done by generating all possible nucleotide sequences of the given length and mapping each sequence to a respective barcode based on a predetermined key.
  • This matrix can be used as a look-up table to search for any nucleotide sequence and have it map to a specified barcode.
  • a hash table can also be used as a more computationally efficient method for determining barcode sequences from strings of nucleotides when a large number of base pairs is needed.
  • the invention relates to a method comprising the encoding of the information of both a unique molecular badge sequence and a deterministic cell packet in the same physical sequence of an oligonucleotide, wherein either a symmetrical-key algorithm or an asymmetrical-key algorithm is used to translate a defined sequence of bi-degenerate or tri-degenerate IUPAC symbols into a mixed-base nucleotide sequence during barcode synthesis.
  • the invention relates to a method, wherein a position-specific symmetrical-key algorithm is used to translate a defined sequence of bi-degenerate or tri-degenerate IUPAC symbols into a mixed-base nucleotide sequence during barcode synthesis.
  • the invention relates to a method, wherein the beforehand described method is used for the multiplexing of samples within a sample library for subsequent sequencing analysis.
  • the invention relates to a method, wherein the decoding and/or demultiplexing of sequencing data derived from a sample library, comprising oligonucleotide labels described beforehand herein, is achieved by utilizing a symmetrical-key algorithm, a position-specific symmetrical-key algorithm, an asymmetrical-key algorithm, a look-up or a hash table.
  • Example 1: Compare barcode labels with a Unique Molecular Identifier sequence vs meta-barcode label sequences. A comparison of the incumbent method of tagging agents within a cell with the method described herein illustrates the utility and novelty of this labelling method.
  • the information encoded in the deterministic identifier is contained in a separate nucleotide sequence from that which identifies the plurality of agents associated with a specific cell.
  • the labels that identify the molecular agents, often called the Unique Molecular Identifiers (UMIs)
  • UMIs are nucleotide sequences, usually of 5-10 randomly incorporated bases, added onto the barcode string. UMIs are quite useful in identifying the agents that originated in the same cell before any transcription or amplification events that are subsequently carried out on the cell contents.
  • the number of available UMIs should be much greater than the anticipated number of the molecular agents in the cell. Because the current method uses a short string of nucleotides alone, the diversity of sequences represented is simply 4^N, where N is the length of the UMI sequence. Most examples in practice are 6-8 bases long, which provide roughly 4,000-65,000 combinations (4^6 = 4,096 to 4^8 = 65,536). There is incentive to keep the sequence as short as possible because adding bases to the barcoding region reduces the number of bases that can be read by NGS sequencing in the biologically informative portion of the nucleotide.
  • the method here described utilizes the same sequence to both identify the associated molecular agent and, by inference, the originating cell. This results in a much larger potential diversity in molecular badges for the same number of cell meta-barcodes, which is determined by the length of the sequence of the packet sequences which make up the meta-barcode.
  • Adding an 8-base UMI sequence to the barcode region would add 65,536 unique molecular identifiers.
  • the total barcoding region is 24 bases: two 8-base barcodes and one 8-base UMI portion.
  • 884,736 cell meta-barcodes would be available.
  • the meta-barcode sequences are synthesized through application of solid phase synthesis using the phosphoramidite method and phosphoramidite building blocks derived from protected 2'-deoxynucleosides (dA, dC, dG and dT), ribonucleosides (A, C, G, and U) or chemically modified nucleosides, e.g. LNA or BNA.
  • the barcode sequences could be synthesized directly on the solid support or, in another preferred embodiment, multiple oligonucleotide indices synthesized from relatively short sequences could be combined in a combinatorial manner to create a diverse library of unique barcode sequences.
  • Hydrogel beads with a cleavable/non-cleavable anchor linker incorporated into the hydrogel matrix are distributed into the wells of the three library plates and the first set of badge sequences are hybridized or ligated to the anchor sequence.
  • the number of beads dispensed into each well is in excess of the number of sequences per well and is typically >10,000 beads per well or 3,000,000 beads for all three plates.
  • the concentration of each sequence is in the nanomole range to ensure an excess of molecules are available to react and attach to the anchor and synthesized index.
  • the beads with the first index are pooled and redistributed into a second set of plates with the same sequences as the first set of plates.
  • the second index is attached to the beads and the process is repeated with a third set of plates.
  • the meta-code sequences are synthesized and attached to an individual antibody selected to bind with high affinity to a specific target antigen.
  • a collection of different antibodies uniquely targeting different antigens are distinguished in part by having a unique barcode defining the antibody with a poly A sequence to be captured by the poly T sequence of the meta-barcode sequence on the barcode gel bead. In this way the unique barcode sequence identifying each unique antibody in a collection of antibodies targeting different cell antigens can be distinguished from each other and associated with the cell whose antigens they have labeled.
  • small molecules can be tagged with a library of meta-code sequences wherein one badge tags one small molecule and the badge is derived from the meta-code sequence associated with the cell wherein the small molecule is derived or interacts with other molecular agents in the cell.
  • a library of meta-codes can tag a population of cells and each badge related to a meta-code tags a small molecule interacting with or associated with a tagged cell.
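The symbolic find-and-replace conversion described in the list above, in both its position-indifferent and position-specific forms, can be sketched in Python. This is a minimal illustration under stated assumptions, not the applicants' implementation: the function name `badge_to_packet` and the `pair_key` format (one letter per position, where `R`, `W` or `M` selects the R/Y, W/S or M/K pair) are hypothetical and introduced here only for the example.

```python
# IUPAC bi-degenerate symbols and the two nucleotides each one stands for.
IUPAC_PAIRS = {
    "R": "AG", "Y": "CT",   # puRine / pYrimidine
    "W": "AT", "S": "CG",   # Weak / Strong
    "M": "AC", "K": "GT",   # aMino / Keto
}

# Hypothetical key alphabet: one letter per position naming the pair class.
PAIR_CLASSES = {"R": ("R", "Y"), "W": ("W", "S"), "M": ("M", "K")}


def badge_to_packet(badge, pair_key=None):
    """Convert a nucleotide badge into its bi-degenerate packet.

    With no pair_key, every position uses the R/Y pair (position-indifferent).
    With a pair_key, position i uses the pair class named by pair_key[i].
    """
    if pair_key is None:
        pair_key = "R" * len(badge)
    if len(pair_key) != len(badge):
        raise ValueError("pair_key must have one entry per base")
    packet = []
    for base, cls in zip(badge.upper(), pair_key.upper()):
        # Pick whichever symbol of the chosen pair contains this base.
        for symbol in PAIR_CLASSES[cls]:
            if base in IUPAC_PAIRS[symbol]:
                packet.append(symbol)
                break
        else:
            raise ValueError(f"unexpected base {base!r}")
    return "".join(packet)


print(badge_to_packet("AGTC"))          # RRYY
print(badge_to_packet("AGTC", "RWMR"))  # RSKY
```

The same badge yields different packets under different keys, which is the sense in which the position-specific key adds a further layer to the conversion.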

Abstract

The invention relates to a method, comprising obtaining a sample comprising a plurality of cells, labeling at least a portion of one or more molecular agents in the cell, complements thereof, or reaction products therefrom, from a first cell of the plurality and a second cell of the plurality with a first same cell label specific to the first cell and a second same cell label specific to the second cell; and a unique label specific to each of one or more molecular agents derived from the cell label, complements thereof, or reaction products therefrom, from the first cell; and wherein a unique label specific to each of one or more molecular agents, complements thereof, or reaction products therefrom, from the second cell are unique with respect to each other.

Description

NUCLEIC ACID LABELING METHODS AND COMPOSITION
CROSS REFERENCE TO RELATED APPLICATIONS
This application claims the benefit of U.S. Provisional Application No. 62/812,496, filed on March 1, 2019. The content of this earlier filed application is hereby incorporated by reference herein in its entirety.
INCORPORATION OF THE SEQUENCE LISTING
The present application contains a sequence listing that is submitted in ASCII format via EFS-Web concurrent with the filing of this application, containing the file name 37578_0063P1_SL which is 4,096 bytes in size, created on February 25, 2020, and is herein incorporated by reference in its entirety.
FIELD OF THE INVENTION
The present invention is directed to methods for labeling nucleic acids in a cell with an oligonucleotide label as a unique identifier of molecules from a single cell and a second oligonucleotide label derived from the first label that further identifies individual molecules from the same cell. The first label is a meta-label synthesized from a pre-existing library of IUPAC bases representing degenerate nucleotides wherein each bi-degenerate or tri-degenerate base encodes for two or three pre-specified nucleotides. A string of bi-degenerate labels, for example, generates a unique code that is used to label molecules from a single cell and, through application of combinatorial synthesis methods, a library of unique bi-degenerate labels is created to tag molecules from the same cell with the same bi-degenerate label and molecules from a different cell with a different label. Each bi-degenerate label, in turn, encodes an oligonucleotide library of unique molecular identifiers derived from the bi-degenerate label. These identifiers are separate from the cell label in that each molecular label is different from one molecule to the next and they share in common the same bi-degenerate sequence that indicates they are from the same cell. Using a bi-degenerate oligonucleotide label as a molecular and a cell identifier is a unique and different approach to associating molecules from individual cells in a population of cells.
BACKGROUND OF THE INVENTION
The ability to label and distinguish agents of interest (e.g. DNA, RNA, proteins, chemicals etc) from individual cells in a population of cells serves multiple useful purposes in research and industry. Oligonucleotide labeling of agents associated with an individual cell in a population or group of cells is useful for a number of applications. One non-limiting example is to uniquely label individual RNA molecules in a cell to enumerate the number of RNA molecules in a specific cell and to distinguish and enumerate RNA molecules from different cells. Sequencing the labeled molecules allows the number of RNA molecules to be counted and their assignment and association to an individual cell from a population of cells to be made. This allows identification of transcriptional differences within a cell and between cells in a population of cells originating from one or more tissues from one or more individuals. This information can be of great significance in scientific research, in the research and development of therapeutic drugs, in the research and development of medical devices and in the research and development of diagnostic and prognostic tests of disease in humans, animals and plants. Examples include tracing cell lineages during embryogenesis for scientific research; tracking cell proliferation or differentiation during tumorigenesis for medical research and therapeutic or diagnostic development; identifying genotypically rare cells in a population of cells for scientific research; and measuring differences in immune cell populations as a function of disease state or application of a therapeutic drug for therapeutic or diagnostic test development. The source of cells for single cell genetic analysis is varied and can be individual plants, viruses, fungi, prokaryotic cells including bacteria, eukaryotic cells including animals and humans. 
Cells can be prepared for labeling in a variety of ways, including, but not limited to, the common steps of (a) enclosing or isolating an individual cell in a container; (b) introducing a lysis agent into the container to lyse the cell and release the agents to be labeled; (c) introducing the oligonucleotide label; (d) attaching the label to one or more agents; and (e) determining the sequence of the label to extract information specific to the cell and different from other cells in the same population and, separately, to the labeled agents in each cell.
Despite this progress there are limitations with current methods. Authors describe a method for determining the sequences of a set of nucleic acid targets, comprising the tagging of each individual nucleic acid target in a set of nucleic acid targets to create a set of tagged constructs, with: (a) at least one unique identifier oligonucleotide that distinguishes the tagged construct from other tagged constructs in the set, wherein the at least one unique identifier oligonucleotide is not part of the nucleic acid target; and (b) at least one amplification site; amplifying the set of tagged constructs; and (c) determining the sequences of the amplified set of tagged constructs comprising the unique identifier oligonucleotides and the nucleic acid targets. This description is not applicable to tagging molecules from a single cell in a population of cells because there will be at least one molecule from all the molecules in a cell with a unique identifier that distinguishes it from other tagged molecules from the same cell. This allows distinguishing between molecules in a cell but not necessarily molecules from different cells. Additionally, the scheme described by Chee is constrained to tagging only nucleic acids and not other molecules like proteins, polysaccharides or small molecule drugs.
Other authors describe a similar approach for tagging and identifying polynucleotides except for the fact that a distinguishing feature of the tagged polynucleotide is that it has a particular nucleic acid sequence at one or more loci to distinguish it from other tagged polynucleotides in a mixture.
Some take a different approach where an oligonucleotide tag or label is described comprising multiple parts: an oligo dT sequence, a sequencing primer binding site, a common sequence that is the same for all oligonucleotide labels and a unique label tag sequence wherein the unique label tag sequence is selected from a set of at least m different label tag sequences. Encoded in the label in separate parts are oligonucleotide sequences comprising the label that is common to all molecules in a cell and one that is unique and variable from one molecule to the next. With this multi-part construction of the label, one section encodes from which cell the molecules originate and a label specific to each molecule from the cell.
Common to each of these nucleic acid labeling schemes is the need to synthesize a diverse library of labels, which can be cumbersome and expensive because, to label molecules from a population of cells, the sequence diversity of the library of oligonucleotide labels must be an order of magnitude or more greater than the number of cells in the population. The label diversity depends on the number of individual nucleotides that comprise the label, so a diverse library requires labels composed of multiple nucleotides. As a non-limiting example, a 6-mer length label provides a library of 4^6 = 4,096 unique combinations of 6-mer nucleotides, wherein each nucleotide at any given position is selected randomly from the group of non-degenerate nucleotides A, G, T or C. The larger the number of nucleotides in the label, the greater the diversity but the larger the number of sequencing cycles in an Illumina sequencer needed to measure and identify the label. This in turn leaves fewer cycles for reading the nucleic acid sequence to which the label is attached and illustrates the need for a balance between library diversity and length of the labels in the library. Furthermore, the more information included in the label, the longer the label needs to be to accommodate this expanded information set. For example, at least two independent sequences are needed in the same label to identify the cell of origin and to identify and count the number of agents per cell.
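The trade-off between label length and library diversity described above can be checked with a few lines of Python. This is an illustrative calculation only; the helper name `min_label_length` and the default tenfold margin (one reading of "an order of magnitude or more") are assumptions made for the example.

```python
def min_label_length(n_cells, margin=10, alphabet_size=4):
    """Smallest N such that alphabet_size**N >= margin * n_cells."""
    n = 1
    while alphabet_size ** n < margin * n_cells:
        n += 1
    return n


# Diversity of a random N-mer built from A, G, T and C.
for length in (6, 8, 10):
    print(length, 4 ** length)   # 4096, 65536, 1048576

# Label length needed for 10,000 cells with a tenfold diversity margin.
print(min_label_length(10_000))  # 9
```

Each extra base multiplies the diversity by four but also costs one more sequencing cycle, which is the balance the paragraph above refers to.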
The present invention seeks to overcome the intrinsic limitations of current approaches for labeling of agents with oligonucleotide labels and to simplify the labeling by eliminating the unique molecular identifier label as an independent sequence and component of the barcode oligonucleotide sequence.
The current art provides methods for generating one oligonucleotide label for uniquely labeling the molecular agents in a cell with the same label and a second oligonucleotide label for tagging each molecular agent in a cell with a different specific and unique molecular identifier, and for combining the two labels into a single label for tagging molecular agents to identify each agent uniquely and to identify from which cell the agent originated in a population of cells. Within the current art this oligonucleotide label is typically a concatenation of a cell label and a unique molecular identifier. The methods for detection and identification of the unique labels tagging molecular agents from one or more cells are varied and may include methods based on sequencing the label, hybridization with a complementary fluorescently-labeled sequence (FISH), tagging the label with a unique set or combination of fluorophores and/or optically active molecules, measurement of the mass of the oligonucleotide label with a mass spectrometer or measurement of the length of the oligonucleotide with gel electrophoresis.
Synthesis of labels typically makes use of a plurality of detectable bi-degenerate bases to generate unique oligonucleotide labels. Typically, a library of bi-degenerate bases is synthesized by specifying to an oligonucleotide synthesizer the International Union of Pure and Applied Chemistry (IUPAC) nucleotide labels, each representing a single nucleotide from a group of five possible nucleotides (e.g. Adenine (A), Guanine (G), Thymine (T), Uracil (U) or Cytosine (C)). The synthesizer translates the input labels to synthesize an oligonucleotide sequence, and this process is repeated independently with different unique sequences of labels to generate a library of unique oligonucleotide labels. This approach, in which each label is synthesized independently, is costly and is therefore not the preferred embodiment where libraries exceeding 384 labels are needed.
Another approach is to combine combinatorially two or more libraries of oligonucleotide labels to synthesize a single library with greater tag diversity than the diversity of the starting libraries. This library of bi-degenerate bases can be used to uniquely label molecular species in a single cell so that all species in a cell thus labeled are distinguished from other cells uniquely. A further embodiment is to attach or synthesize an additional random N-mer sequence to each oligonucleotide label so that each labeled molecule and its replicates in an individual cell is readily identified.
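The gain in diversity from combining libraries can be illustrated with a short Python sketch. The two small index libraries below are made-up examples, and `combine_libraries` is a hypothetical helper name; the only point shown is that the combined library size is the product of the starting library sizes.

```python
from itertools import product


def combine_libraries(*libraries):
    """Concatenate one index from each library in every possible combination."""
    return ["".join(parts) for parts in product(*libraries)]


# Made-up bi-degenerate index libraries, for illustration only.
lib_a = ["RYRY", "YRYR", "RRYY"]
lib_b = ["YYRR", "RYYR"]

combined = combine_libraries(lib_a, lib_b)
print(len(combined))  # 6
print(combined[0])    # RYRYYYRR
```

By the same arithmetic, three libraries of 96 indices each would give 96**3 = 884,736 combinations, while only 3 x 96 = 288 indices need to be synthesized individually.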
SUMMARY OF THE INVENTION
The methods allow a plurality of agents, ranging from 2 to millions of agents, to be uniquely labeled without the need to manually generate the same number of unique labels. The agents may be of diverse nature. In some important embodiments, the agents are nucleic acids such as genomic DNA fragments, RNA transcripts, long non-coding RNA, microRNA, circRNA, chromatin-DNA fragments or they can be non-nucleic acids such as proteins or small molecules. The invention contemplates that agents may be labeled in order to identify them, identify their source, identify their relationship with other agents, enumerate the number of agents in a population of agents, and/or identify one or more conditions to which the agents have been subject.
The methods of the invention also provide for amplifying nucleic acids to increase the number of read pairs that can be properly identified via their unique index combination. In a further embodiment, the method of the invention allows for each end-labeled nucleic acid to be identically labeled at either its 5' and/or 3' ends. In a further embodiment, the methods of the invention provide for enumeration of the number of copies of the labeled agent in a population of agents.
As used herein, the term "degenerate," when used to refer to a nucleotide sequence, refers to one or more positions which may contain any of a plurality of different bases. Degenerate residues within an oligonucleotide or nucleotide sequence are denoted by standard IUPAC nucleic acid notation (see Figure 1) and are sub-divided into bi-degenerate and tri-degenerate bases.
The unique labels provided herein are at least partly nucleic acid in nature. The invention contemplates the labels are prepared by sequentially attaching either a bi-degenerate or tri-degenerate base to each other. The order in which the bases attach to each other can be from a library of known base sequences or in a random manner. In turn and depending on the base degeneracy, at each position in the base sequence there is either one of two or one of three possible non-degenerate bases. The invention is based, in part, on the appreciation by the inventors that a sequence of bi-degenerate or tri-degenerate bases results in one unique label and a second label is created from the sequences of non-degenerate bases derived from the single degenerate base sequence. The degenerate base sequence is defined as a meta-code packet and the non-degenerate sequences derived from a single meta-code are defined as badges. This combination of a single meta-code packet and multiple, dependent badges is a cipher coding scheme useful for multiple applications where an object or item is labeled with a known degenerate sequence and its dependencies labeled with a known yet related non-degenerate sequence.
One non-limiting example is to use a meta-code to uniquely label the molecular agents from a single cell and the associated badges uniquely label individual molecular agents from the same cell. This compact, two-level cipher uniquely combines in a single coding scheme and a single oligonucleotide sequence two critical pieces of information: labeling of an individual cell and assigning unique and different labels derived from the cell label to each molecular agent in the cell. Furthermore, the invention allows a large number of labels to be generated (and thus a large number of agents to be uniquely labeled) using a relatively small number of oligonucleotides. As an example, a meta-code can be constructed from IUPAC bi-degenerate symbols wherein each symbol encodes for a specific pair of nucleotides in equal measure. In this manner N bi-degenerate symbols can be strung together to form a meta-code to label molecules from a specific cell and, from the same meta-code, there are 2^N unique and different nucleotide sequences available as badges for labeling individual molecules from the same cell. For example, if the cell meta-code is constructed from N = 3 bi-degenerate bases then there will be 2^3 = 8 unique nucleotide sequences as unique badges to label different molecular agents from that cell. Additionally, the invention contemplates the badge encoding independent information about the agent including identifying the population of agents relative to their common source and the number of unique agents in that population. Additionally, the invention contemplates the badge encoding another extrinsic species interacting with the agents from a single cell. This could be, for example, an antibody or a small molecule wherein the badge encodes an antibody or small molecule interacting specifically with an agent of a population of agents derived or associated with a single cell.
The information encoded by the badge in this example would include cell-specific information and be directly related to the meta-code that unifies the relationship of these different agents to a common source, in this case a single cell.
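The relationship between a meta-code of N bi-degenerate symbols and its 2^N badges can be made concrete with a small Python sketch. The function name `enumerate_badges` is hypothetical; the symbol-to-nucleotide mapping is the standard IUPAC one.

```python
from itertools import product

# Standard IUPAC bi-degenerate symbols and the nucleotides they represent.
IUPAC_PAIRS = {"R": "AG", "Y": "CT", "W": "AT", "S": "CG", "M": "AC", "K": "GT"}


def enumerate_badges(meta_code):
    """Return every nucleotide sequence encoded by a bi-degenerate meta-code."""
    choices = [IUPAC_PAIRS[symbol] for symbol in meta_code]
    return ["".join(bases) for bases in product(*choices)]


badges = enumerate_badges("RYR")   # N = 3 symbols
print(len(badges))                 # 8
print(badges)
# ['ACA', 'ACG', 'ATA', 'ATG', 'GCA', 'GCG', 'GTA', 'GTG']
```

Every badge in the list collapses back to the same meta-code `RYR`, which is what lets a single sequence carry both the cell identity and the molecule identity.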
Another non-limiting example from the healthcare industry is to use the meta-code to uniquely label an individual patient and the associated badges label all the health information associated with this patient, including different electronic medical records, lab tests, medical imaging results and results from visits to different physicians.
Another non-limiting example from the insurance industry is to use the meta-code to uniquely label an individual insurance policy holder and the associated badges label all the insurance information associated with the policy holder including claims, claim information and insurance policies held by the individual.
Another non-limiting example from inventory control is to use the meta-code to uniquely label an item in inventory and the associated badges label all the information associated with the inventory item including physical attributes of the item, its manufacturing history, its time in inventory and shipping history to a customer.
The invention provides, in part, a method comprising obtaining a sample comprising a plurality of cells; labeling at least a portion of two or more molecular agents such as DNA, RNA, proteins, small molecules, microRNA, long non-coding RNA, metabolites or other chemicals in the cell, complements thereof, or reaction products therefrom, from a first cell of the plurality and a second cell of the plurality with a first same cell label specific to the first cell and a second same cell label specific to the second cell; and a unique label specific to each of one or more molecular agents, complements thereof, or reaction products therefrom, from the first cell; and wherein a unique label specific to each of one or more molecular agents, complements thereof, or reaction products therefrom, from the second cell are unique with respect to each other. The cell specific label or meta-code enables the assignment of labeled molecules to a given cell and the molecular badge uniquely identifies different labeled molecules from that same cell.
An oligonucleotide label is typically synthesized by successive addition and polymerization of individual non-degenerate nucleotides (A, G, T or C) to create a single physical oligonucleotide sequence but only encodes for information as described by the position and type of nucleotide in the physical sequence.
A key inventive step of this invention is the realization that defining a sequence from IUPAC degenerate bases enables a two-level coding scheme comprised of two parts: a degenerate base sequence that defines an object-specific label and multiple non-degenerate nucleotide sequences derived from the object-specific label that define additional labels to tag additional objects related to, derived from or dependent on the first object labeled with the original degenerate base sequence. The synthesis process can be performed manually with standard oligonucleotide chemistries, by programming a commercial oligonucleotide synthesizer to synthesize uniquely different labels or by synthesizing a set of labels and creating a library of unique labels using combinatorial methods. As applied to labeling cells and their molecular content, the cell meta-code can be constructed from standard IUPAC symbols encoding for either two different nucleotides (bi-degenerate cipher) or three different nucleotides (tri-degenerate cipher) at each position during synthesis of the oligonucleotide meta-code, and this encryption scheme permits the cell meta-code to be derived from the dependent molecular badges from the same degenerate base sequence as opposed to discrete sequences joined together to make a single, longer label sequence. A key benefit therefore is a compact and efficient approach to specifically labeling individual agents from individual cells with a shorter, informatically more efficient label than described in current methods. This translates into either lower cost sequencing for the same number of nucleotides sequenced or a larger number of nucleotides sequenced (deeper sequencing) of the attached nucleic acid.
There is further benefit of this approach relative to current methods for molecular labeling. A common approach to molecular labeling is to construct a molecular label from the sequential addition of an oligonucleotide barcode sequence plus a unique molecular identifier (UMI). The UMI typically consists of between 6-10 nucleotides randomly selected and attached to form a unique oligonucleotide sequence. The schema described here effectively eliminates the UMI, thus making the overall barcode shorter and requiring fewer sequencing cycles to read out. Fewer read cycles on the sequencer translates into lower sequencing costs.
A second key benefit is the ability to easily correct for sequencing or synthesis errors since the meta-code sequence is a well-defined and known sequence. In contrast, barcode labels constructed with a UMI consist of a known barcode sequence concatenated to a random sequence that is the unique molecular identifier. In this configuration, since the molecular identifier is an unknown random sequence, it can be challenging to identify in the sequence data where the identifier sequence ends and the actual sequence data begins, given there could be errors in the sequence data itself that could confound this identification.
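One way such error handling could look is sketched below in Python. This is an assumption-laden illustration rather than the method of the application: it assumes meta-codes built only from the R/Y pair, a made-up list of known packets (`KNOWN_PACKETS`), and a simple Hamming-distance rule that accepts a read only when exactly one known packet lies within one mismatch and otherwise discards it.

```python
# Hypothetical library of known meta-code packets (R/Y symbols only).
KNOWN_PACKETS = ["RRRRYYYY", "YYYYRRRR", "RYRYRYRY"]

# Collapse table: purines (A, G) become R, pyrimidines (C, T) become Y.
RY_TABLE = str.maketrans("AGCT", "RRYY")


def collapse(read):
    """Map a nucleotide read onto its R/Y representation."""
    return read.upper().translate(RY_TABLE)


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def assign_packet(read, known=KNOWN_PACKETS, max_mismatches=1):
    """Return the single known packet within max_mismatches, else None."""
    observed = collapse(read)
    hits = [
        packet for packet in known
        if len(packet) == len(observed)
        and hamming(packet, observed) <= max_mismatches
    ]
    # Ambiguous or unmatched reads are discarded (None) rather than guessed.
    return hits[0] if len(hits) == 1 else None


print(assign_packet("AGAGCTCT"))  # exact match       -> RRRRYYYY
print(assign_packet("AGACCTCT"))  # one mismatch      -> RRRRYYYY
print(assign_packet("AACCAACC"))  # no packet nearby  -> None
```

In this sketch a substitution within the same class (for example A read as G) does not change the collapsed packet at all, so it affects only the badge, not the cell assignment.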
A third key benefit is the availability of a much larger number of molecular labels without adding additional bases in the label sequence. This further economizes the sequencing of the labeled nucleic acid without decreasing diversity of possible molecular labels.
How the meta-barcode is synthesized in one non-limiting embodiment is as follows. An oligonucleotide synthesizer is programmed to synthesize a label based on a string of either bi-degenerate or tri-degenerate symbols. At each nucleotide position, depending on whether the symbol is bi-degenerate or tri-degenerate, the synthesizer randomly inserts one of two (bi-degenerate) or one of three (tri-degenerate) possible nucleotides, in equal parts, so that a single nucleotide occupies that specific position in the synthesized oligonucleotide. This process repeats N times to build an N-mer molecular label (badge) sequence defined by the meta-code sequence.
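The position-by-position random incorporation described above can be simulated in a few lines of Python. This is a toy model of the chemistry, not a description of any synthesizer's software: the function name `synthesize_badge` is hypothetical, and equal incorporation probability at each position is assumed, as stated in the text.

```python
import random

# Standard IUPAC bi-degenerate and tri-degenerate symbols.
IUPAC_DEGENERATE = {
    "R": "AG", "Y": "CT", "W": "AT", "S": "CG", "M": "AC", "K": "GT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
}


def synthesize_badge(meta_code, rng=random):
    """Draw one nucleotide per position, uniformly from the symbol's options."""
    return "".join(rng.choice(IUPAC_DEGENERATE[symbol]) for symbol in meta_code)


rng = random.Random(0)          # seeded only to make the demo repeatable
meta_code = "RYRYBD"
for _ in range(3):
    print(synthesize_badge(meta_code, rng))
```

Each call models one oligonucleotide molecule; repeated calls on the same meta-code give different badges that all remain consistent with that meta-code.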
The lUPAC nomenclature specifies either a bi-degenerate or tri-degenerate coding scheme. For bi degenerate coding, we can distinguish two classes of encoding: exclusive to a combination of bi degenerate symbols are certain symbols and the corresponding associated nucleotides as defined by lUPAC. In another embodiment the bi-degenerate encoding can be non-exclusive and not follow the lUPAC encoding scheme. The paired symbols (R, Y), (M, K), and (S, W) encode for either of two nucleotides, each of which is incorporated in equal proportion at that position by the synthesizer and inserted into the growing sequence as the oligonucleotide molecular label sequence (badge) is synthesized. Alternatively the bi-degenerate sequence may include specific non-degenerate nucleotides instead of a degenerate base at any position in the nucleotide sequence.
Individual meta-codes are synthesized from one of three canonical paired bi-degenerate symbols: (R, Y); (M, K); (S, W) so that the corresponding possible badges, the group of actual nucleotide sequences derived from a meta-code sequence, can be unambiguously assigned to a meta-code. This is further explained in Figure 2, where an example label barcode is synthesized from pairs of canonical bi-degenerate symbols. Each IUPAC bi-degenerate base represents a defined pair of the nucleotides A, G, T and C. The bi-degenerate base sequences are combined to define a meta-code packet, and a cell meta-barcode label is created by combining one or more meta-code packets, wherein each packet defines multiple unique badge sequences for labeling molecular agents from a single cell. Decoding the packet sequence to its constituent nucleotides results in a library of unique nucleotide sequences, called molecular badge sequences, that can be sequenced to reveal the specific label identifying a molecular agent. Measurement of the molecular labels in this manner can be used to decipher the badge sequence to the meta-code packets and cell meta-barcodes, assign labeled agents to a specific cell, and enumerate the number of copies of the agent from an individual cell as the badge sequences attached to the agents are processed for sequencing analysis.
The decoding process works in reverse. As shown in Figure 2, the sequencer records a string of nucleotides (A, G, T or C) as a label associated with a specific molecular agent from a specific cell. From the synthesis procedure, the library of molecular badges is already known and is loaded as packets into a look-up table. The table of known badge sequences is scanned to identify a match between the measured sequence and the packets in the look-up table. The cell meta-barcode label is then reconstructed from multiple packets based on the unique, one-to-one correspondence between the molecular badge sequence and the meta-code packet sequences. In this way many molecular badge sequences can be associated with one cell meta-barcode.
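The look-up step described above can be sketched in Python. This is a minimal illustration, not the applicants' implementation: the packet library, the fixed packet length of three, and the function names `build_lookup` and `decode` are all assumptions made for the example. It also assumes that each packet position in the label draws from a single canonical set, as in the WSS-MKM-RRY example of Figures 4a-c, so that one table is kept per position.

```python
from itertools import product

# Standard IUPAC bi-degenerate symbols and the two nucleotides each stands for.
IUPAC_BI = {"R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT"}

def build_lookup(packets):
    """Map every badge sequence to the meta-code packet it derives from."""
    table = {}
    for packet in packets:
        for bases in product(*(IUPAC_BI[s] for s in packet)):
            badge = "".join(bases)
            if badge in table:
                raise ValueError(f"badge {badge} maps to more than one packet")
            table[badge] = packet
    return table

def decode(read, tables, packet_len=3):
    """Split a measured label into packets and look each one up by position."""
    chunks = [read[i:i + packet_len] for i in range(0, len(read), packet_len)]
    return "-".join(table[chunk] for table, chunk in zip(tables, chunks))

# Hypothetical library: position 1 uses (W,S), position 2 (M,K), position 3 (R,Y).
LIBRARY = [["WSS", "SWS"], ["MKM", "KKM"], ["RRY", "YRY"]]
TABLES = [build_lookup(packets) for packets in LIBRARY]

print(decode("TCGAGCGAT", TABLES))  # WSS-MKM-RRY
print(decode("AGGCTAGGC", TABLES))  # WSS-MKM-RRY (same cell, different badge)
```

Two different measured badge sequences resolve to the same cell meta-barcode, which is the many-to-one association the paragraph describes.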
A similar concept is applied to the tri-degenerate encoding where the symbol H, B, V or D is input to the synthesizer, paired with the single nucleotide not included in the tri-degenerate cipher, to create a new sequence and where one of the three nucleotides is incorporated into the oligonucleotide molecular label sequence in equal proportion, to pair with the remaining single base.
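For reference, the standard IUPAC tri-degenerate symbols and the single nucleotide each one excludes can be tabulated with a short Python sketch. The dictionary follows the IUPAC definitions; the helper name `excluded_base` is an assumption made for illustration.

```python
# Standard IUPAC tri-degenerate symbols and the three nucleotides each covers.
IUPAC_TRI = {"H": "ACT", "B": "CGT", "V": "ACG", "D": "AGT"}

def excluded_base(symbol):
    """Return the single nucleotide not covered by a tri-degenerate symbol."""
    (missing,) = set("ACGT") - set(IUPAC_TRI[symbol])
    return missing

for symbol in "HBVD":
    print(symbol, IUPAC_TRI[symbol], "pairs with", excluded_base(symbol))
```

Each symbol is thereby paired with exactly one remaining base: H with G, B with A, V with T and D with C.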
A meta-code packet is constructed from strings of bi-degenerate symbols and, by way of illustration, Figure 3a shows an example where a packet string of 3 bi-degenerate symbols gives rise to 8 unique nucleotide sequences. In general, for N bi-degenerate symbols comprising a meta-code there are $2^{N}$ unique nucleotide sequences. In one non-limiting example, programming the synthesizer to generate the bi-degenerate string WSS will give rise to a unique combination of 8 different nucleotide sequences, all with the common property that they originated from the same packet sequence of bi-degenerate symbols. In this way agents from different cells are identified by their bi-degenerate symbol packet sequence, and agents from within the same cell can be enumerated by their nucleotide badge sequence label. Figure 3b shows another example where packets are formed from the combination of three bi-degenerate or tri-degenerate symbols selected from their respective canonical sets.
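The expansion of a packet into its badge sequences can be illustrated with a short Python sketch. The symbol table follows the standard IUPAC definitions; the function name `expand_packet` is an assumption made for the example.

```python
from itertools import product

# Standard IUPAC bi-degenerate symbols and the two nucleotides each stands for.
IUPAC_BI = {"R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT"}

def expand_packet(packet):
    """Return every nucleotide badge sequence encoded by a meta-code packet."""
    choices = [IUPAC_BI[symbol] for symbol in packet]
    return ["".join(bases) for bases in product(*choices)]

badges = expand_packet("WSS")
print(len(badges))  # 8, i.e. 2**3
print(badges)       # ['ACC', 'ACG', 'AGC', 'AGG', 'TCC', 'TCG', 'TGC', 'TGG']
```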
Different bi-degenerate symbols can be synthetically combined to create a large and diverse library of different bi-degenerate and nucleotide sequences. Consider first the non-limiting example in Figure 4 where three bi-degenerate symbols from a canonical set have been joined into a single group termed a meta-code packet and different packets can be combinatorially joined together to yield a diverse library of unique molecular label badge sequences. These sequence badges are the labels of specific molecular agents in a cell. In this non-limiting example, nine bi-degenerate symbols joined into a cell meta-barcode sequence of three meta-code packets of three bi-degenerate symbols each represents $8^{3}$ = 512 unique nucleotide sequences, where 8 is the number of nucleotide sequences per meta-code packet and 3 represents the number of meta-code packets joined together to create the cell meta-barcode.
To generalize this schema, if the number of bi-degenerate symbols in a meta-code packet is N and the number of meta-code packets in a cell meta-barcode is M, then the number of unique nucleotide molecular badge sequences is

$$2^{N \times M}$$

As in the previous example, N = 3 and M = 3, so the number of nucleotide sequences comprising the sequence badge is $2^{3 \times 3}$ = 512.
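The count can be checked by brute-force enumeration. The following Python sketch uses the WSS-MKM-RRY meta-barcode named in the description of Figures 4a-c; the function names `count_badges` and `expand_meta_barcode` are assumptions made for illustration.

```python
from itertools import product

# Standard IUPAC bi-degenerate symbols and the two nucleotides each stands for.
IUPAC_BI = {"R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT"}

def count_badges(n_symbols, m_packets):
    """Closed-form number of badges for M packets of N bi-degenerate symbols."""
    return 2 ** (n_symbols * m_packets)

def expand_meta_barcode(meta_barcode):
    """Enumerate the distinct badge sequences of a dash-separated meta-barcode."""
    symbols = meta_barcode.replace("-", "")
    return {"".join(bases) for bases in product(*(IUPAC_BI[s] for s in symbols))}

badges = expand_meta_barcode("WSS-MKM-RRY")
print(len(badges), count_badges(3, 3))  # 512 512
```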
The badge sequence is the label attached through different means to the molecular agents in a single cell. An integral step of the process to associate individual molecular agents with their respective cells of origin is sequencing the badge sequences attached as labels to the molecular agents. Depending on the molecular agent, specifically if it is a nucleic acid, the molecular agent will also be sequenced to identify the specific nucleic acid associated with the label.
This coding schema allows for many badge sequences to be associated with and originate from a single cell meta-code packet. Labeling individual agents within a cell with this labeling scheme makes for a compact single barcode label by combining two labels - one at the cell level and the second at the molecular level - in the same label to associate individual molecules and their replicates to a specific cell. If the agent is replicated as preparation to measuring the nucleotide sequence that defines the label, then the molecular badge is also replicated. By counting the number of replicated unique molecular badges associated with a single cell, the number of agents uniquely associated with a specific cell can be enumerated and the number compared against the number of an internal agent used as a standard or against the number of an external agent added to the sample of cells as an external standard. This is an important property of the cipher scheme since it enables the number of individual agents from a cell to be counted and compared against an internal or external standard. In this way the number of agents from different cells can be compared and the benefit of doing so is to determine the natural differences between cells based on the relative abundance of a particular agent or the differences that may arise due to an external factor such as application of a natural or synthetic chemical, a biological molecule or another external physical, biological or chemical perturbation that would change the cell state in a manner measurable in the number of labeled agents.
The library of sequence badges can be synthesized directly or assembled combinatorially from a library of known bi-degenerate symbols such that a table of all possible nucleotide badge sequences corresponding to all possible packet sequences from the synthesized library is generated. Sequencing the sequence badges from a group of cells results in nucleotide sequences that can be compared computationally against the table of known packets. This allows the badge sequences in the sequence data to be used to assign individual molecular agents to the specific cells from which they originated. Each molecular agent in a cell will also be labeled with a unique nucleotide badge sequence, and if a particular badge is identified more than once in the sequence data then the number of times it occurs can be used to assess and compensate for any bias in the sequencing process. This is particularly useful when enumerating the number of specific molecular agents in a cell. One non-limiting example would be to enumerate the number of copies of a specific transcript from a given cell based on counting the number of times a particular nucleotide badge sequence appears in the sequence data.
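One way to picture the enumeration step is the following Python sketch. It assumes the reads have already been decoded into (cell meta-barcode, badge sequence, agent) triples; the record layout, the function name `count_agents`, and the transcript name GAPDH are hypothetical choices for illustration only.

```python
from collections import defaultdict

def count_agents(reads):
    """Count distinct badges per (cell meta-barcode, agent) pair.

    Repeated observations of the same badge are treated as replicates of
    one original molecule and are counted once.
    """
    seen = defaultdict(set)
    for cell, badge, agent in reads:
        seen[(cell, agent)].add(badge)
    return {key: len(badges) for key, badges in seen.items()}

# Hypothetical decoded reads: (cell meta-barcode, badge sequence, agent).
reads = [
    ("WSS-MKM-RRY", "TCGAGCGAT", "GAPDH"),
    ("WSS-MKM-RRY", "TCGAGCGAT", "GAPDH"),  # replicate of the read above
    ("WSS-MKM-RRY", "AGGCTAGGC", "GAPDH"),
    ("SWS-MKM-RRY", "CACAGCGAT", "GAPDH"),
]
counts = count_agents(reads)
print(counts[("WSS-MKM-RRY", "GAPDH")])  # 2
print(counts[("SWS-MKM-RRY", "GAPDH")])  # 1
```

Collapsing replicated badges before counting is what lets the per-cell counts be compared against an internal or external standard.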
This schema can be extended further by using combinatorial methods. In one embodiment (Figure 5), a library of N-length bi-degenerate meta-code packets is synthesized where each well in a 96-well plate contains a single N-length packet. This synthesis is repeated in two additional plates to produce a library of 96 x 96 x 96 = 884,736 unique N-length packets. These packets can then be combined into cell meta-barcode labels of 3 meta-code packets each, where the meta-barcode label is assembled combinatorially by taking a different meta-code packet from a different well of each library plate. This would result in $96^{3}$ unique barcode labels and $2^{N \times 3}$ unique molecular badges. As an example (Figure 6), a label is constructed from 10 bi-degenerate symbols per packet, combinatorially synthesized by combining meta-labels from two 384-microwell plates, resulting in $384^{2}$ = 147,456 packets synthesized and 1,048,576 unique molecular badges per cell meta-barcode label.
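The library sizes quoted above follow from two simple products, sketched here in Python. The function names are assumptions made for illustration; the arithmetic assumes one packet per well and one packet drawn from each plate.

```python
def barcode_labels(wells_per_plate, plates):
    """Number of labels when one packet is taken from each plate."""
    return wells_per_plate ** plates

def badges_per_label(symbols_per_packet, packets_per_label):
    """Number of badge sequences encoded by one bi-degenerate label."""
    return 2 ** (symbols_per_packet * packets_per_label)

print(barcode_labels(96, 3))    # 884736
print(barcode_labels(384, 2))   # 147456
print(badges_per_label(10, 2))  # 1048576
```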
DESCRIPTION OF DRAWINGS
Non-limiting embodiments of the present invention will be described by way of example with reference to the accompanying Figures, which are schematic and are not intended to be drawn to scale. For purposes of clarity, not every component is labeled in every figure, nor is every component of each embodiment of the invention shown where illustration is not necessary to allow those of ordinary skill in the art to understand the invention.
Figure 1: The IUPAC coding scheme is used to create meta-codes that direct synthesis of unique molecular labels (badges) for labeling individual molecular agents in a cell. The meta-codes can be combined to create cell meta-barcodes to identify the cell where the agent originated.
Figure 2: This figure shows the nomenclature used for the molecular agent labeling process. A meta-code is created from a string of bi-degenerate symbols and a barcode label is synthesized by combining one or more meta-codes, wherein each meta-code can be an index combined to synthesize the barcode label. The sequence badge is derived from the meta-code and represents another level of labeling wherein each molecular agent from a single cell will have a unique sequence badge label. This allows for copies of a molecular agent to be enumerated and the number of copies normalized to account and correct for any bias in the sequencing process.
Figures 3a-b: Figure 3a shows the process by which a library of badge sequences is created from a string of three bi-degenerate symbols from the same canonical set of bi-degenerate symbols (packet). The badge sequence labels all the molecular agents from a particular cell and enables the association of each of those agents to a specific cell. Note that because each bi-degenerate symbol can be replaced by one of two possible nucleotides, the number of possible nucleotide sequences is $2^{N}$, where N is the number of bi-degenerate symbols that comprise the packet. Figure 3b shows a different example where the combination of bi-degenerate and tri-degenerate symbols yields a unique library of badge sequence labels by the combination of three bi-degenerate and tri-degenerate symbols for each meta-code packet.
Figures 4a-c illustrate as an example a cell meta-barcode label derived from three meta-code packets and the resulting number of unique molecular badges derived from this meta-barcode. In practice this meta-barcode would be used to identify molecular agents originating from the same cell, and the sequence badges would be used to track the number of copies of individual molecules during the preparation and sequencing of molecular agents. Figures 4a-c show that a combination of the three indices WSS-MKM-RRY codes for 512 unique sequence cassettes.
Figure 5 describes a combinatorial split pool synthesis as the preferred method by which the oligonucleotide barcode sequences are constructed.
Figure 6 describes a meta-barcode comprised of two packets, each constructed from 10 bi-degenerate symbols. If $384^{2}$ = 147,456 indices are synthesized with this nomenclature there will be unique cell barcodes represented by the bi-degenerate symbols W, S, M, K, R and Y and 1,048,576 unique single nucleotide sequences per cell barcode.
DETAILED DESCRIPTION OF THE INVENTION
The invention provides methods and compositions for uniquely labeling agents of interest including for example nucleic acids such as DNA, DNA fragments, chromatin DNA, RNA, miRNA, long non-coding RNA, proteins, small molecules, peptides and metabolites. The ability to uniquely label agents has a number of applications, as contemplated by the invention, including but not limited to genomic sequencing, genomic assembly, screening of putative drugs and biologics, analysis of environmental samples to discover new organisms, labeling of individual elements of synthetic biology constructs, labeling of samples from a specific source or donor, quality control analysis of reagents to verify purity, the analysis of nucleic acids from single cells, and analysis of various conditions on populations of cells, single cells or cell components. One of the major limitations of prior art labeling techniques is the limited number of available unique labels capable of labeling both the population of agents relative to their common source and the number of unique agents in that population. Typically, the number of agents to be labeled in any given application far exceeds the number of unique labels that are available. The methods of the invention can be used to synthesize essentially an infinite number of unique labels capable of dual level encoding of individual agents. Moreover, because of their nature, the labels can be easily detected and distinguished from each other, making them suitable for many applications and uses.
The methods of the invention easily and efficiently generate libraries of unique labels. Such libraries may be of any size and are preferably large libraries including tens to hundreds of millions to billions of unique labels. The libraries of unique labels may be synthesized separately from agents and then associated with agents post-synthesis. Alternately, unique labels may be synthesized in real-time (e.g. while in the presence of the agent) and in some instances the label synthesis is a function of the history of the agent. This means, in some instances, that synthesis of the label may occur while an agent is being exposed to one or more conditions. This would occur, for example, in a continuous flow system or in a microfluidic droplet. The invention therefore contemplates the resultant label may store (or code) within it information about the agent (i.e. agent-specific information) including the origin or source of the agent, the relatedness of the agent to other agents (e.g. the number of agents in a population of agents), the genomic distance between two agents (e.g. in the case of genomic fragments), conditions to which the agent may have been exposed, and the like.
Some methods of the invention comprise determining information about an agent based on the unique label associated with the agent. In some instances, determining information about the agent may comprise obtaining the nucleotide sequence of the unique label (i.e. sequencing the unique label). In other instances, determining information about the agent may comprise determining the presence, number and/or order of non-nucleic acid detectable moieties. In still other instances, determining information about the agent may comprise obtaining the nucleotide sequence of the unique label and determining the presence, number and/or order of non-nucleic acid detectable moieties. Methods for nucleic acid sequencing and detection of non-nucleic acid detectable moieties are known in the art and are described herein. As used herein, "agent" or "agents" refers to any moiety or entity that can be associated with, including being attached to, a unique label. An agent may be a single entity, or it may be a plurality of entities. An agent may be a nucleic acid, a peptide, a protein, a cell, a cell lysate, a solid support, a polymer, a chemical, a metabolite, and the like, or an agent may be a plurality of any of the foregoing, or it may be a mixture of the foregoing. As an example, an agent may be nucleic acids (e.g. mRNA transcripts and/or genomic DNA fragments), solid supports such as beads or polymers, and/or proteins from a single cell or from a single cell population (e.g. a tumor or non-tumor tissue sample).
In some important embodiments, an agent is a nucleic acid. The nucleic acid agent may be single-stranded (ss) or double-stranded (ds), or it may be partially single-stranded and partially double-stranded. Nucleic acid agents include but are not limited to DNA such as genomic DNA fragments, PCR and other amplification products, RNA, cDNA, and the like. Nucleic acid agents may be fragments of larger nucleic acids such as but not limited to genomic DNA fragments.
An agent of interest may be associated with a unique label. As used herein, "associated" refers to a relationship between the agent and the unique label such that the unique label may be used to identify the agent, identify the source or origin of the agent, identify one or more conditions to which the agent has been exposed, etc. A label that is associated with an agent may be, for example, physically attached to the agent, either directly or indirectly, or it may be in the same defined, typically physically separate, volume as the agent. A defined volume may be an emulsion droplet, a well (of for example a multiwell plate), a tube, a container, and the like. It is understood that the defined volume will typically contain only one agent and the label with which it is associated, although a volume containing multiple agents with multiple copies of the label is also contemplated depending on the application.
An agent may be associated with a single copy of a unique label or it may be associated with multiple copies of the same unique label including for example 2, 3, 4, 5, 6, 7, 8, 9, 10, 100, 1,000 ($10^{3}$), 10,000 ($10^{4}$), 100,000 ($10^{5}$), 1,000,000 ($10^{6}$), 10,000,000 ($10^{7}$), 100,000,000 ($10^{8}$), 1,000,000,000 ($10^{9}$) or more copies of the same unique label. In this context, the label is considered unique because it is different from labels associated with other, different agents.
Attachment of labels to agents may be direct or indirect. The attachment chemistry will depend on the nature of the agent and/or any derivatization or functionalization applied to the agent. For example, labels can be directly attached through covalent attachment. The label may include a moiety, which may be a non-nucleotide chemical modification, to facilitate attachment. By way of non-limiting example, the label may include methylated nucleotides, uracil bases, phosphorothioate groups, ribonucleotides, diol linkages, disulphide linkages, etc. to enable covalent attachment to an agent.
In another example, a label can be attached to an agent via a linker or in another indirect manner. Examples of linkers include, but are not limited to, carbon-containing chains, polyethylene glycol (PEG), nucleic acids, monosaccharide units and peptides. The linkers may be cleavable under certain conditions. Cleavable linkers are discussed in greater detail herein.
Methods for attaching nucleic acids to each other, as for example attaching nucleic acid labels to nucleic acid agents, are known in the art. Such methods include but are not limited to ligation, such as blunt end ligation or cohesive overhang ligation, and polymerase-mediated attachment methods (see, e.g. US Patent Nos. 7863058 and 7754429; Green and Sambrook, Molecular Cloning: A Laboratory Manual, Fourth edition, 2012; Current Protocols in Molecular Biology; and Current Protocols in Nucleic Acid Chemistry).
In some embodiments, oligonucleotide adapters are used to attach a unique label to an agent or to a solid support. In some embodiments, an oligonucleotide adapter comprises one or more known sequences, e.g. an amplification sequence, a capture sequence, a primer sequence, and the like. In some embodiments, the adapter comprises a thymidine (T) tail overhang. Methods for producing a thymidine tail overhang are known in the art, e.g. using terminal deoxynucleotidyl transferase (TdT) or a polymerase that adds a thymidine overhang at the termination of polymerization. In some embodiments, the oligonucleotide adapter comprises a region that is forked.
In some embodiments, the adapter comprises a capture or detection moiety. Examples of such moieties include, but are not limited to, fluorophores, microparticles such as quantum dots, gold nanoparticles, microbeads, biotin, DNP (dinitrophenyl), fucose, digoxigenin, avidin, streptavidin, amino acid-based tags such as, but not limited to, HA-, Myc-, FLAG-, MBP-, SUMO-, Protein A-, polyhistidine- and GST-tags, antigens and other moieties known to those skilled in the art. In some embodiments the moiety is biotin.
A label and/or an agent may be attached to a solid support. In some instances, a label (or multiple copies of the same label) and the agent are attached to the same solid support. Examples of suitable solid supports include, but are not limited to, inert polymers (preferably non-nucleic acid polymers), porous hydrogel polymers, beads, magnetic beads, hydrogel beads, glass, ceramics, metals, carbon nanofibers or nanotubes with limited mobility, or peptides. In some embodiments, the solid support is an inert polymer or bead (porous or non-porous). The solid support may be functionalized to permit covalent attachment of the agent and/or label. Such functionalization may comprise placing on the solid support reactive groups that permit covalent attachment to an agent and/or a label.
Labels and/or agents may be attached to each other or to solid supports using cleavable linkers. Cleavable linkers are known in the art and include, but are not limited to, TEV, trypsin, thrombin, cathepsin B, cathepsin D, cathepsin K, caspase 1, matrix metalloproteinase sequences, phosphodiester, phospholipid, ester, β-galactose, β-glucuronide, dialkyl dialkoxysilane, cyanoethyl group, sulfone, ethylene glycolyl disuccinate, 2-N-acyl nitrobenzenesulfonamide, α-thiophenylester, unsaturated vinyl sulfide, sulfonamide after activation, malondialdehyde (MDA)-indole derivative, levulinoyl ester, hydrazone, acylhydrazone, alkyl thioester, disulfide bridges, azo compounds, 2-nitrobenzyl derivatives, phenacyl ester, 8-quinolinyl benzenesulfonate, coumarin, phosphotriester, bis-arylhydrazone, bimane bi-thiopropionic acid derivative, paramethoxybenzyl derivative, tert-butylcarbamate analogue, dialkyl or diaryl dialkoxysilane, orthoester, acetal, aconityl, hydrazone, β-thiopropionate, phosphoramidate, imine, trityl, vinyl ether, polyketal, alkyl 2-(diphenylphosphino)benzoate derivatives, allyl ester, 8-hydroxyquinoline ester, picolinate ester, vicinal diols, and selenium compounds. Cleavage conditions and reagents include, but are not limited to, enzymes, nucleophilic/basic reagents, reducing agents, photo-irradiation, electrophilic/acidic reagents, organometallic and metal reagents, and oxidizing reagents.
The unique labels of the invention are, at least in part, nucleic acid in nature, and are generated either by sequentially attaching two or more detectable bi-degenerate base positions to each other or by synthesis as a sequence of bi-degenerate bases. The preferred embodiment is constructing the barcode label by sequential attachment of two or more detectable bi-degenerate bases. As used herein, a detectable bi-degenerate position is one where either one of two possible nucleotides defined by the specific base at that single base position in the sequence is incorporated uniquely into the oligonucleotide sequence and can be detected by sequencing of its nucleotide sequence and/or by detecting non-nucleic acid detectable moieties it may be attached to. Similarly, the barcode label can be constructed from other degenerate sequence libraries. A similar embodiment would involve a sequence of tri-degenerate bases.
The oligonucleotide bi-degenerate bases are typically randomly selected from a diverse plurality of oligonucleotide bi-degenerate bases. In some instances, an oligonucleotide tag may be present once in a plurality or it may be present multiple times in the plurality. In the latter instance, the plurality of bi-degenerate positions may be comprised of a number of subsets each comprising a plurality of identical bi-degenerate bases. In some important embodiments, these subsets are physically separate from each other. Physical separation may be achieved by providing the subsets in separate wells of a multiwell plate or separate droplets from an emulsion. It is the random selection from between the bi-degenerate bases and the combination of oligonucleotide bi-degenerate positions that result in multiple primary sequences (badges) that correspond to a unique label. Accordingly, the number of distinct (i.e., different) oligonucleotide bi-degenerate positions required to uniquely label a plurality of agents can be far less than the number of agents being labeled. This is particularly advantageous when the number of agents is large (e.g. when the agents are members of a library). A similar embodiment would involve a sequence of tri-degenerate bases.
The oligonucleotide bi-degenerate bases may be detectable by virtue of the nucleotide sequence, or by virtue of a non-nucleic acid detectable moiety attached to the oligonucleotide such as but not limited to a fluorophore, or by virtue of a combination of the nucleotide sequence and the non-nucleic acid detectable moiety. A similar embodiment would involve a sequence of tri-degenerate bases.
As used herein, the term "oligonucleotide" refers to a nucleic acid such as deoxyribonucleic acid (DNA), ribonucleic acid (RNA), or DNA/RNA hybrids and includes analogs of either DNA or RNA made from nucleotide analogs known in the art. Oligonucleotides may be single-stranded (such as sense or anti-sense oligonucleotides), double-stranded or partially single-stranded and partially double-stranded.
In some aspects, the invention provides methods for generating unique labels. The methods typically use a plurality of detectable bi-degenerate bases to generate a plurality of molecular labels (badges) relating to a single packet sequence. In one preferred embodiment, a library of bi-degenerate bases is synthesized according to the International Union of Pure and Applied Chemistry (IUPAC) nomenclature of degenerate base symbols, where each symbol represents a position in an oligonucleotide sequence that can have multiple possible alternatives derived from the canonical nucleotide bases of Adenine (A), Guanine (G), Thymine (T) or Cytosine (C) (Figure 1). The degenerate symbols can be combined together to form a unique label packet sequence and meta-barcode for tagging a population of agents. Reducing the degenerate sequences into the unique set of canonical sequences tags the individual agents with a common association, such as those derived from a single cell in a population of cells, and provides a unique label or molecular badge for each agent in the population of agents with a common association.
In a first instance, the bi-degenerate base symbols Weak (W), Strong (S), aMino (M), Keto (K), puRine (R) and pYrimidine (Y), each representing one of two canonical nucleotides, are used to synthesize a label for tagging a population of agents. The bi-degenerate symbols are grouped according to the reliability of synthesis of the underlying nucleotide sequence and to allow for unambiguous synthesis of a set of molecular badges uniquely associated with the cell meta-code packet. In this regard the bi-degenerate symbols are grouped into the following canonical sets: (R,Y), (M,K) and (W,S), and individual packets are synthesized only from the symbols of a single canonical pair. These packets can be joined together to form a cell meta-code barcode label. A label defined by a packet sequence of N bi-degenerate symbols allows for the synthesis of $2^{N}$ unique molecular badge sequences, defined by the N-length packet sequence, that form a library of unique labels for tagging agents. A key advantage of this label scheme is that it allows the same label to tag, at the level of the degenerate meta-barcode sequence, agents from a common source that are related to each other (such as agents from a single cell) and to code, at the level of canonical oligonucleotide badge sequences, individual agents in the population of agents, which enables digital counting of agents and copies of agents from a common origin or source. In one example (Figure 3), a packet sequence label of 10 degenerate base symbols will allow for the synthesis of a library of $2^{10}$ = 1024 unique molecular badge sequences, and for meta-barcode labels made from 3 packets each, the oligonucleotide library size is $2^{10 \times 3}$ = 1,073,741,824. The meta-barcode library encodes a population of agents with a common association and the individual molecular badge sequence encodes individual agents in that same population.
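Why the canonical grouping gives an unambiguous assignment can be seen in a short Python sketch: within each canonical pair the two symbols cover disjoint nucleotide pairs that together span A, C, G and T, so every nucleotide maps back to exactly one symbol of that pair. The function names `symbol_for` and `packet_for` are assumptions made for illustration.

```python
# Standard IUPAC bi-degenerate symbols and the two nucleotides each stands for.
IUPAC_BI = {"R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT"}
CANONICAL_SETS = [("R", "Y"), ("M", "K"), ("W", "S")]

def symbol_for(nucleotide, canonical_set):
    """Return the one symbol of a canonical pair that covers the nucleotide."""
    matches = [s for s in canonical_set if nucleotide in IUPAC_BI[s]]
    if len(matches) != 1:
        raise ValueError(f"{nucleotide} is not uniquely covered by {canonical_set}")
    return matches[0]

def packet_for(badge, canonical_set):
    """Recover the meta-code packet of a badge, given its canonical set."""
    return "".join(symbol_for(n, canonical_set) for n in badge)

for pair in CANONICAL_SETS:
    first, second = (set(IUPAC_BI[s]) for s in pair)
    print(pair, first.isdisjoint(second), first | second == set("ACGT"))

print(packet_for("TCG", ("W", "S")))  # WSS
print(packet_for("AGC", ("M", "K")))  # MKM
print(packet_for("AGC", ("W", "S")))  # WSS
```

The last two lines show that the same nucleotides read as different packets under different canonical sets, which is why a packet is tied to a single canonical pair.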
In a related instance, unique barcode labels can be synthesized from a second set of IUPAC tri-degenerate base symbols, specifically H, B, V and D, to create a meta-barcode label of N tri-degenerate base symbols in length. The symbol H is matched with the base G, B with A, V with T and D with C to produce codes in the same manner as the bi-degenerate base codes. This similarly allows for the synthesis of a library of $2^{N}$ unique nucleotide sequences, and combining M meta-codes into a barcode label results in $2^{N \times M}$ unique molecular badges. As an example, if each meta-code is 3 tri-degenerate bases in length and the meta-codes are combined into barcode labels of 5 meta-codes each, then the library of sequence badges would be $2^{3 \times 5}$ = 32,768 in size. In another instance, unique molecular badge labels can be synthesized from a combination of IUPAC degenerate (and/or non-degenerate) base symbols at each position in an N-symbol sequence, depending on the degenerate base symbol in the sequence of N different symbols defining a library of unique meta-barcode labels.
In one embodiment of the invention, the oligonucleotide label is synthesized to consist of at least two sets of at least two or three consecutive nucleotides encoding standard IUPAC symbols, wherein each set of at least two or three consecutive nucleotides has a Hamming distance of at least 1, 2 or 3 to every other set of at least two or three nucleotides encoded within the oligonucleotide label.
In another embodiment of the invention, the oligonucleotide label described hereinbefore, optionally comprised within a composition, comprises at least two or more sets of two or three consecutive nucleotides encoding predefined IUPAC symbols, wherein the code of IUPAC symbols encodes a set of bi-degenerate and/or tri-degenerate bases each having a Hamming distance of at least 2 to any of the bi-degenerate or tri-degenerate bases contained within the entire oligonucleotide label, thereby allowing the detection of at least 1 sequencing error within said oligonucleotide label when subsequently sequenced.
In a further embodiment of the invention, the oligonucleotide label described hereinbefore, optionally comprised within a composition, comprises at least two or more sets of two or three consecutive nucleotides encoding predefined IUPAC symbols, wherein the code of IUPAC symbols encodes a set of bi-degenerate and/or tri-degenerate bases each having a Hamming distance of at least 3 to any of the bi-degenerate or tri-degenerate bases contained within the entire oligonucleotide label, thereby allowing the correction of at least 1 sequencing error within said oligonucleotide label when subsequently sequenced.
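The relationship between minimum Hamming distance and error handling can be sketched in Python. A code with minimum pairwise distance d can detect up to d - 1 substitution errors and correct up to (d - 1) // 2 of them, which matches the thresholds of 2 (detect one error) and 3 (correct one error) given above. The codeword set and the function names below are hypothetical and chosen only to illustrate the principle; they are not sequences from the specification.

```python
from itertools import combinations

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))

def min_distance(codewords):
    """Smallest pairwise Hamming distance within a set of codewords."""
    return min(hamming(a, b) for a, b in combinations(codewords, 2))

def correct(read, codewords, max_errors):
    """Return the unique codeword within max_errors of the read, else None."""
    hits = [c for c in codewords if hamming(read, c) <= max_errors]
    return hits[0] if len(hits) == 1 else None

# Hypothetical three-nucleotide codewords with pairwise distance 3.
CODEWORDS = ["AAA", "CCC", "GGG", "TTT"]
d = min_distance(CODEWORDS)
print(d, d - 1, (d - 1) // 2)         # 3 2 1
print(correct("AAC", CODEWORDS, 1))   # AAA (one error corrected)
print(correct("ACG", CODEWORDS, 1))   # None (too many errors to correct)
```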
A unique nucleotide badge sequence may be a nucleotide sequence that is different (and thus distinguishable) from the sequence of each detectable oligonucleotide tag in a plurality of detectable oligonucleotide bi-degenerate bases. A unique nucleotide sequence may also be a nucleotide sequence that is different (and thus distinguishable) from the sequence of each detectable oligonucleotide tag in a first plurality of detectable oligonucleotide bi-degenerate bases but identical to the sequence of at least one detectable oligonucleotide tag in a second plurality of detectable oligonucleotide bi-degenerate bases. A unique sequence may differ from other sequences by multiple bases (or base pairs). The multiple bases may be contiguous or non-contiguous. Methods for obtaining nucleotide sequences (e.g. sequencing methods) are described herein and/or are known in the art. In some embodiments, detectable bi-degenerate bases comprise one or more of a ligation sequence, a priming sequence, a capture sequence, and a unique sequence (optionally referred to herein as a badge sequence). A ligation sequence is a sequence complementary to a second nucleotide sequence which allows for ligation of the detectable oligonucleotide tag to another entity comprising the second nucleotide sequence, e.g. another detectable oligonucleotide tag or an oligonucleotide adapter. A priming sequence is a sequence complementary to a primer, e.g. an oligonucleotide primer used for an amplification reaction such as but not limited to PCR. A capture sequence is a sequence capable of being bound by a capture entity. A capture entity may be an oligonucleotide comprising a nucleotide sequence complementary to a capture sequence, e.g. a second detectable oligonucleotide tag or an oligonucleotide attached to a bead.
In some embodiments, the bi-degenerate bases could comprise nucleotides selected from purine bases, pyrimidine bases, natural nucleotide bases, chemically modified nucleotide bases, biochemically modified nucleotide bases and non-natural nucleotide bases.
A capture entity may also be any other entity capable of binding to the capture sequence, e.g. an analog, antibody or peptide. An index sequence is a sequence comprising a unique nucleotide sequence and/or a detectable moiety as described above (Figure 4). Incorporation into a single barcode tag is shown in Figure 6.
"Complementary" is a term which is used to indicate a sufficient degree of complementarity between two nucleotide sequences such that stable and specific binding occurs between one and preferably more bases (or nucleotides, as the terms are used interchangeably herein) of the two sequences. For example, if a nucleotide in a first nucleotide sequence is capable of hydrogen bonding with a nucleotide in a second nucleotide sequence, then the bases are considered to be complementary to each other. Complete (i.e. 100%) complementarity between a first nucleotide sequence and a second nucleotide is preferable, but not required for ligation, priming or capture sequences.
Each unique label comprises two or more detectable oligonucleotide bi-degenerate bases. The two or more bi-degenerate bases may be three or more bi-degenerate bases, four or more bi-degenerate bases, or five or more bi-degenerate bases. In some embodiments a unique label comprises 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 30, 40, 50, 100 or more detectable bi-degenerate bases. A similar embodiment would involve a sequence of tri-degenerate bases. The bi-degenerate bases are typically bound to each other, typically in a directional manner. Methods for sequentially attaching nucleic acids such as oligonucleotides to each other are known in the art and include, but are not limited to, ligation and polymerization, or a combination of both. Ligation reactions include blunt end ligation and cohesive overhang ligation. In some instances, ligation may comprise both blunt end and cohesive overhang ligation. A "cohesive overhang" (also referred to as a "cohesive end" or an "overhang") is a single stranded end sequence (attached to a double stranded sequence) capable of binding to another single stranded sequence thereby forming a double stranded sequence. A cohesive overhang may be generated by a polymerase, a restriction endonuclease, a combination of a polymerase and a restriction endonuclease, or Uracil-Specific Excision Reagent (USER) enzyme (NEB) or a combination of a Uracil DNA glycosylase enzyme and a DNA glycosylase-lyase Endonuclease VIII enzyme. A "cohesive overhang" may be a thymidine tail. Polymerization reactions include enzyme-mediated polymerization such as a polymerase-mediated fill-in reaction. Similar considerations apply to tri-degenerate-based barcode labels.
One non-limiting example is shown in Figure 6 where a cleavable linker attaches the oligonucleotide label to a solid substrate such as a hydrogel bead. The label may contain a promoter or primer sequence followed by one (PE1) of two sequences (PE1, W1) used to construct a sequence library suitable for sequencing on an Illumina sequencer. This sequence construction could be different for a different commercial sequencer since the sequencing mechanism will likely be different. The first barcode index (index 1) is attached to PE1 and W1 is connected to index 1 followed by index 2 and a polyT sequence for hybridization sequence capture of the RNA poly A sequence to the hydrogel bead. The same approach can be used with the poly A sequence replaced with a sequence specific primer for sequencing specific regions of a nucleic acid molecule targeted by the primer.
Methods for detecting and analyzing unique labels are known in the art. In some embodiments, detection comprises determining the presence, number, and/or order of detectable bi-degenerate bases that comprise a unique molecular badge label. For example, if the unique label comprises detectable moieties, as described herein, fluorometry, mass spectrometry or other detection methodology can be used for detection. In another example, if the unique label comprises unique nucleotide sequences, a sequencing methodology can be used for detection. If the unique label comprises both unique sequences and detectable moieties, a combination of detection methods may be appropriate, e.g. fluorometry and a sequencing reaction. A similar embodiment would involve a sequence of tri-degenerate bases. Methods of sequencing oligonucleotides and nucleic acids are well known in the art (see US Pat 5525464, 5202231, etc.).
It should be considered that not every bi-degenerate is compatible with every other bi-degenerate. For example, a bi-degenerate "RS" encodes for four sequences ("AC", "AG", "GC", "GG"). Another bi-degenerate "MY" encodes for "AC", "AT", "CC", "CT". Note that because "AC" is encoded by both bi-degenerate bases in this example, these bi-degenerate bases are not compatible with each other.
It is useful to define a metric, called the "bi-degenerate distance", between two bi-degenerate bases as the smallest Hamming distance between any pair of encoded sequences compared from the bi-degenerate bases. The Hamming distance is the smallest number of substitutions required to change one sequence to another sequence. In the previous example, the bi-degenerate bases "RS" and "MY" have a bi-degenerate distance of zero, because zero substitutions are required when comparing the sequence "AC" (from the "RS" bi-degenerate) and the sequence "AC" (from the "MY" bi-degenerate). Note the maximum bi-degenerate distance is also the maximum Hamming distance, which is the length of the bi-degenerate sequence itself.
A pair of bi-degenerate bases with a bi-degenerate distance of at least one can be used together to create labels which can be distinguished from each other. These bi-degenerate bases are then said to be "compatible". For example, "RW" encodes for "AA", "AT", "GA", and "GT". It has a bi-degenerate distance of 1 with "RS" because any comparison of two sequences from the bi-degenerate bases requires at least one substitution. The sequence "AA" from "RW" compared with "AC" from "RS" requires one change, whereas "AA" from "RW" compared to "GG" from "RS" requires two substitutions.
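The definitions above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names (`expand`, `hamming`, `bi_degenerate_distance`) are hypothetical and not part of the disclosure, and the symbol table is the standard IUPAC nucleotide code. The distance is computed by brute force over every pair of encoded sequences, exactly as the metric is worded.

```python
from itertools import product

# Standard IUPAC nucleotide codes; two-base symbols are "bi-degenerate",
# three-base symbols are "tri-degenerate".
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
}

def expand(degenerate):
    """List every concrete sequence encoded by a degenerate sequence."""
    return ["".join(p) for p in product(*(IUPAC[s] for s in degenerate))]

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))

def bi_degenerate_distance(d1, d2):
    """Smallest Hamming distance between any pair of encoded sequences."""
    return min(hamming(a, b) for a in expand(d1) for b in expand(d2))

print(expand("RS"))                        # ['AC', 'AG', 'GC', 'GG']
print(bi_degenerate_distance("RS", "MY"))  # 0: both encode "AC", not compatible
print(bi_degenerate_distance("RS", "RW"))  # 1: compatible
```

Brute-force expansion grows as 2^N per bi-degenerate sequence, so for long sequences a position-by-position comparison of the symbol sets may be preferable.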
It is beneficial to have larger bi-degenerate distances between a pair of bi-degenerate bases because error removal or correction methods can then be used to distinguish between the bi-degenerate bases when the returned sequence has errors. A bi-degenerate distance of two enables error removal: A sequence which is not from either bi-degenerate base but has an error such that it has a Hamming distance of 1 from each base can be thrown away without misassignment to an incorrect base.
A bi-degenerate distance of three or greater enables error correction, where it is possible to still make the correct base assignment despite an error in the returned sequence. If the returned sequence has a Hamming distance of 1 with a known base and a Hamming distance greater than 1 to the others, it can be properly assigned to the known base. A set of compatible bi-degenerate bases can be created by choosing a bi-degenerate of the desired length and calculating its bi-degenerate distance across all existing members of the set. If the smallest bi-degenerate distance is at least one, the chosen bi-degenerate can then be added to the set. The set can then be increased in size until the desired number of bi-degenerate bases is found, subject to the limitations based on the length of each base. It is beneficial to define a metric, called the "set distance", for a set of bi-degenerate bases, which is defined as the minimum distance between any two bi-degenerate members of the set. A similar embodiment would involve a sequence of tri-degenerate bases.
For example, the set of three bi-degenerate bases ("RSSM", "YSWM", "RWWK") has a set distance of two because the minimum bi-degenerate distance amongst any pair of bi-degenerate bases is two. That means error removal can be used for any returned sequence in this set of bi-degenerate bases. However, because the bi-degenerate distance between "RSSM" and "RWWK" is 3, error correction can be used for some returned sequences. The returned sequence "AGAT" has Hamming distance 2 from "RSSM", 2 from "YSWM", and 1 from "RWWK", so it can be assigned to "RWWK". Note that sets with the same set distance may have different average bi-degenerate distances amongst set members, so may differ in opportunities for error removal.
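The worked example above can be checked with a short Python sketch. The helper names (`read_distance`, `set_distance`, `assign`) and the `max_errors` parameter are assumptions made for illustration only. The sketch relies on the observation that, because every position varies independently, the smallest Hamming distance equals the number of positions whose base sets do not overlap.

```python
from itertools import combinations

# Bi-degenerate IUPAC symbols and the two bases each one encodes.
PAIRS = {"R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC"}

def read_distance(read, degenerate):
    """Hamming distance from a read to the nearest sequence the degenerate encodes."""
    return sum(base not in PAIRS[sym] for base, sym in zip(read, degenerate))

def bi_degenerate_distance(d1, d2):
    """Count positions whose base sets are disjoint; each forces one substitution."""
    return sum(not set(PAIRS[a]) & set(PAIRS[b]) for a, b in zip(d1, d2))

def set_distance(members):
    """Minimum bi-degenerate distance between any two members of the set."""
    return min(bi_degenerate_distance(a, b) for a, b in combinations(members, 2))

def assign(read, members, max_errors=1):
    """Return the uniquely nearest member, or None to discard the read."""
    scored = sorted((read_distance(read, m), m) for m in members)
    best_distance, best = scored[0]
    if best_distance > max_errors:
        return None  # too many errors to trust any assignment
    if len(scored) > 1 and scored[1][0] == best_distance:
        return None  # ambiguous: error removal
    return best      # exact match or corrected single error

members = ["RSSM", "YSWM", "RWWK"]
print(set_distance(members))     # 2
print(assign("AGAT", members))   # RWWK (distances 2, 2 and 1)
print(assign("ACAA", members))   # None (distance 1 from both RSSM and YSWM)
```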
Thus a set of bi-degenerate bases can be created which has the appropriate number of members, with a desired set distance. If a set cannot be generated with enough members even after multiple trials with randomized choices for membership candidates, then the length of the bi-degenerate bases can be increased to increase the total number of possibilities.
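A minimal sketch of the randomized set construction described above, assuming a simple greedy strategy: draw a random candidate, keep it only if it is at least the desired distance from every member already chosen, and stop after a fixed number of trials. The function name `build_set` and the parameters `trials` and `seed` are hypothetical. If the returned list is shorter than requested, the caller can retry with a longer sequence length, as the text suggests.

```python
import random

PAIRS = {"R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC"}
SYMBOLS = "RYSWKM"

def bi_degenerate_distance(d1, d2):
    """Count positions whose base sets are disjoint."""
    return sum(not set(PAIRS[a]) & set(PAIRS[b]) for a, b in zip(d1, d2))

def build_set(length, size, min_distance, trials=10_000, seed=0):
    """Greedily collect up to `size` members with pairwise distance >= min_distance."""
    rng = random.Random(seed)
    members = []
    for _ in range(trials):
        if len(members) == size:
            break
        candidate = "".join(rng.choice(SYMBOLS) for _ in range(length))
        if all(bi_degenerate_distance(candidate, m) >= min_distance for m in members):
            members.append(candidate)
    return members  # may hold fewer than `size` members if trials run out

chosen = build_set(length=8, size=4, min_distance=2)
print(chosen)
```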
In one embodiment of the invention, the oligonucleotide label has a length of 24 to 200 nucleotides. In another embodiment, the oligonucleotide label has a length of 24 to 100 nucleotides. In a further embodiment, the oligonucleotide label has a length of 24 to 75 nucleotides. In a preferred embodiment, the oligonucleotide label has a length of 24 to 50 nucleotides. In an even more preferred embodiment, the oligonucleotide label has a length of 24 to 45 nucleotides. In the most preferred embodiment, the oligonucleotide label has a length of 30 nucleotides.
In one embodiment of the invention, the oligonucleotide label comprises one or more barcode sequences of each 10-50 nucleotides in length. In a preferred embodiment, the oligonucleotide label comprises one or more barcode sequences of each 10-35 nucleotides in length. In an even more preferred embodiment, the oligonucleotide label comprises one or two barcode sequences of each 10-20 nucleotides in length. In the most preferred embodiment, the oligonucleotide label comprises two barcode sequences of each 10 nucleotides in length.
In one embodiment of the invention, the cell specific label, which is encoded within the oligonucleotide label and comprised of IUPAC symbols, has the sequence N*N*N*N*N*N*N*N*N*N*, wherein N* represents any IUPAC symbol, except the symbol "N" which encodes for one of four non-degenerate nucleotides.
In another embodiment of the invention, the cell specific label, which is encoded within the oligonucleotide label and comprised of IUPAC symbols, is encoded by a first barcode with the sequence WN*N*N*N*N*N*N*N*N* and by a second barcode with the sequence RN*N*N*N*N*N*N*N*N*, wherein N* represents any IUPAC symbol, except the symbol "N" which encodes for one of four non-degenerate nucleotides.
In some aspects, the invention provides methods for generating unique molecular badge labels. The methods typically use a plurality of detectable bi-degenerate bases to generate unique labels. In some embodiments, a unique label is produced by sequentially attaching two or more detectable oligonucleotide bi-degenerate bases to each other. The detectable bi-degenerate bases may be present or provided in a plurality of detectable bases. The same or a different plurality of bi-degenerate bases may be used as the source of each detectable tag comprised in a unique label. In other words, a plurality of bi-degenerate bases may be subdivided into subsets and single subsets may be used as the source for each tag. This is exemplified in at least Figure 4 where synthesized in each well of a Q-well microplate is one degenerate sequence of N symbols in length and each well in the microplate contains 2^N non-degenerate oligonucleotide sequences corresponding to that one degenerate sequence. A population of hydrogel beads is distributed in each well of the microplate, typically >10,000 beads per well, and the degenerate sequence is irreversibly captured onto the solid substrate. Following a typical combinatorial chemistry scheme, a library of unique meta-barcodes is created from the random combination of three packets, one from each well in the microwell plate. Following the mix and react sequence P times, the size of the meta-barcode library synthesized in this way will be Q^P and the number of unique canonical molecular badge sequences to tag each molecule in a population will be 2^(N x M), where N = number of bi-degenerate bases in each meta-code and M = number of packets per meta-barcode label. For a Q = 96 well microplate and for a P = 3 packet barcode, the number of available barcodes in the library can be 96^3 = 884,736 unique barcode labels. Included in either example would be a poly T sequence to capture the RNA poly A sequence or a gene specific primer (GSP) for hybridizing to a specific transcript sequence.
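The two counts in this passage can be reproduced with a short calculation, reading the library size as Q^P and the number of badge sequences as 2^(N x M). The function names below are hypothetical, and the 8-symbol packet length is taken from Example 1 rather than from this paragraph.

```python
def meta_barcode_library_size(wells, packets):
    """Q^P: one of Q packets is chosen at each of P split-and-pool rounds."""
    return wells ** packets

def badge_sequence_count(symbols_per_packet, packets):
    """2^(N x M): every bi-degenerate position resolves to one of two bases."""
    return 2 ** (symbols_per_packet * packets)

print(meta_barcode_library_size(96, 3))  # 884736
print(badge_sequence_count(8, 3))        # 16777216
```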
Decoding a returned sequence first involves identifying each molecular badge from the returned sequence. Because each molecular badge is associated with a known set of bi-degenerate bases when encoded, error correction methods can be used as described in the set generation process to assign the molecular sequence to its corresponding bi-degenerate. If it is ambiguous to which bi-degenerate the badge should be assigned, then the error can be removed by throwing away the returned sequence to prevent misassignments. If all badges are able to be assigned to their corresponding bi-degenerate bases, then the meta-barcode is then known and can be assigned. A similar embodiment would involve a sequence of tri-degenerate bases.
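The decoding procedure can be sketched as follows, assuming fixed-length packets laid end to end in the read and a known candidate set for each packet position. The function names, the `max_errors` threshold, and the read layout are illustrative assumptions; a real read would also carry adapter and capture sequences that must be trimmed first.

```python
PAIRS = {"R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC"}

def read_distance(read, degenerate):
    """Number of read positions not allowed by the degenerate symbol."""
    return sum(base not in PAIRS[sym] for base, sym in zip(read, degenerate))

def assign(badge, members, max_errors=1):
    """Uniquely nearest member within max_errors, else None."""
    scored = sorted((read_distance(badge, m), m) for m in members)
    if scored[0][0] > max_errors:
        return None
    if len(scored) > 1 and scored[1][0] == scored[0][0]:
        return None
    return scored[0][1]

def decode_meta_barcode(read, packet_sets, packet_length):
    """Return the tuple of packets for a read, or None if any badge is ambiguous."""
    if len(read) != packet_length * len(packet_sets):
        return None  # truncated or over-long read
    packets = []
    for i, members in enumerate(packet_sets):
        badge = read[i * packet_length:(i + 1) * packet_length]
        packet = assign(badge, members)
        if packet is None:
            return None  # throw the whole read away to prevent misassignment
        packets.append(packet)
    return tuple(packets)

sets = [["RSSM", "YSWM", "RWWK"], ["RSSM", "YSWM", "RWWK"]]
print(decode_meta_barcode("ACCAAGAT", sets, 4))  # ('RSSM', 'RWWK')
print(decode_meta_barcode("ACAAAGAT", sets, 4))  # None
```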
In some aspects, the invention provides methods for encoding both a unique molecular badge as well as a deterministic identifier (or cell meta-barcode) in the same position. The method typically uses a symmetrical-key algorithm, but may also use an asymmetrical-key algorithm. In one instance, using a symmetrical-key, the known packet sequence consisting of the bi-degenerate base symbols Weak (W), Strong (S), aMino (M), Keto (K), puRine (R) and pYrimidine (Y), each representing one of two canonical nucleotides, is translated into a mixed-base nucleotide sequence during barcode synthesis. To convert the molecular badge sequence, consisting of A, T, G, and C, back to the packet sequence, the same encryption key is needed. This can be done in several ways, such as but not limited to, a symmetrical-key algorithm, a look-up table, or a hash table.
In the simplest form, using a symmetrical-key algorithm, a symbolic find and replace function can be used to convert between the badge and packet. In this case, the algorithm would take a string consisting of a nucleotide sequence as an input and would search for all instances of A and G and replace them with R. It would then search for all instances of T and C and replace them with Y. The output of the function would be the converted packet sequence. Using the symmetrical-key algorithm with a position-indifferent unique bi-degenerate pair, such as Purine (R) and Pyrimidine (Y), represents the simplest form of the conversion. The symmetrical-key algorithm can also work with an additional input which denotes a specified symbolic encryption for each position. In this case the algorithm would have two inputs, the first being the string consisting of the nucleotide sequence, the second being the identifier which determines the unique bi-degenerate pair (R/Y, W/S, or M/K) that is used for each position. Using the position-specific symmetrical-key algorithm adds an additional layer of encryption to the packet conversion. In another embodiment, the symmetrical-key required to decipher the packet sequence as above can be enciphered in the nucleotide sequence in a specific, defined position so that knowledge of that sequence would define which bi-degenerate pair had been used to direct the synthesis of that molecular badge sequence.
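Both variants of the find and replace conversion can be sketched in Python. The key labels ("R/Y", "W/S", "M/K") follow the pairs named in the text, while the function names and the representation of the per-position key as a list of labels are assumptions made for illustration.

```python
# Each key maps the two bases of a bi-degenerate pair to its IUPAC symbol.
KEYS = {
    "R/Y": {"A": "R", "G": "R", "C": "Y", "T": "Y"},
    "W/S": {"A": "W", "T": "W", "C": "S", "G": "S"},
    "M/K": {"A": "M", "C": "M", "G": "K", "T": "K"},
}

def badge_to_packet(badge, key="R/Y"):
    """Position-indifferent conversion: one key applied to every base."""
    return badge.translate(str.maketrans(KEYS[key]))

def badge_to_packet_positional(badge, keys):
    """Position-specific conversion: keys[i] names the pair used at position i."""
    if len(badge) != len(keys):
        raise ValueError("one key is required per position")
    return "".join(KEYS[k][base] for base, k in zip(badge, keys))

print(badge_to_packet("AGTCA"))                                           # RRYYR
print(badge_to_packet_positional("AGTC", ["R/Y", "W/S", "M/K", "R/Y"]))   # RSKY
```

The same badge yields a different packet under a different key, which is why the key (or its position-specific schedule) must be known to whoever decodes the reads.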
The conversion from nucleotide sequence to barcode can also be done by generating all possible nucleotide sequences of the given length and mapping each sequence to a respective barcode based on a predetermined key. This matrix can be used as a look-up table to search for any nucleotide sequence and have it map to a specified barcode. A hash table can also be used as a more computationally efficient method for determining barcode sequences from strings of nucleotides when a large number of base pairs is needed.
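A look-up table of this kind might be built as in the sketch below. As a simplifying assumption it enumerates only the nucleotide sequences encoded by a known list of packets, rather than every possible sequence of the given length, and it uses a Python `dict`, which is itself a hash table. The function name `build_lookup` is hypothetical.

```python
from itertools import product

PAIRS = {"R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC"}

def build_lookup(packets):
    """Map every concrete nucleotide sequence to the packet that encodes it."""
    table = {}
    for packet in packets:
        for bases in product(*(PAIRS[s] for s in packet)):
            seq = "".join(bases)
            if seq in table:
                # Two packets at bi-degenerate distance zero cannot share a table.
                raise ValueError(f"{seq} is encoded by both {table[seq]} and {packet}")
            table[seq] = packet
    return table

table = build_lookup(["RSSM", "YSWM", "RWWK"])
print(len(table))           # 48
print(table["ACCA"])        # RSSM
print(table.get("AGAT"))    # None: not an exact match for any packet
```

An exact-match table does not correct errors on its own; reads that miss the table would still need the distance-based assignment described earlier.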
In some embodiments, the invention relates to a method comprising the encoding of the information of both a unique molecular badge sequence and a deterministic cell packet in the same physical sequence of an oligonucleotide, wherein either a symmetrical-key algorithm or an asymmetrical-key algorithm is used to translate a defined sequence of bi-degenerate or tri-degenerate IUPAC symbols into a mixed-base nucleotide sequence during barcode synthesis.
In a further embodiment, the invention relates to a method, wherein a position-specific symmetrical-key algorithm is used to translate a defined sequence of bi-degenerate or tri-degenerate IUPAC symbols into a mixed-base nucleotide sequence during barcode synthesis.
In one embodiment, the invention relates to a method, wherein the beforehand described method is used for the multiplexing of samples within a sample library for subsequent sequencing analysis.
In another embodiment, the invention relates to a method, wherein the decoding and/or demultiplexing of sequencing data derived from a sample library, comprising oligonucleotide labels described beforehand herein, is achieved by utilizing a symmetrical-key algorithm, a position-specific symmetrical-key algorithm, an asymmetrical-key algorithm, a look-up table or a hash table.
EXAMPLES
Example 1 - Compare barcode labels with a Unique Molecular Identifier sequence vs meta-barcode label sequences
A comparison of the incumbent method of tagging agents within a cell with the method described herein illustrates the utility and novelty of this labelling method.
Currently, the information encoded in the deterministic identifier, referred to as the cell barcode, is contained in a separate nucleotide sequence from that which identifies the plurality of agents associated with a specific cell. There are multiple methods to encode the information in a cell barcode, principally split and recombine combinatorial synthesis methods and specifically defined barcode sequences. In addition, the labels that identify the molecular agents, often called the Unique Molecular Identifiers (UMIs), are nucleotide sequences, usually of 5-10 randomly incorporated bases, added onto the barcode string. UMIs are quite useful in identifying the agents that originated in the same cell before any transcription or amplification events that are subsequently carried out on the cell contents. Counting the number of UMIs, rather than detecting and counting the resulting strands of nucleic acid sequences individually, reduces the bias potentially introduced by cDNA amplification. In order to be sure that the count of original molecular species is representative of the pre-amplification state of the cell, the number of available UMIs should be much greater than the anticipated number of the molecular agents in the cell. Because the current method uses a short string of nucleotides alone, the diversity of sequences represented is simply 4^N, where N is the length of the UMI sequence. Most examples in practice are 6-8 bases long, which provide approximately 4,000-65,000 combinations. There is incentive to keep the sequence as short as possible because adding bases to the barcoding region reduces the number of bases that can be read by NGS sequencing in the biologically informative portion of the nucleotide.
In contrast, the method here described utilizes the same sequence to both identify the associated molecular agent and, by inference, the originating cell. This results in a much larger potential diversity in molecular badges for the same number of cell meta-barcodes, which is determined by the length of the sequence of the packet sequences which make up the meta-barcode.
Specifically, for the incumbent example of a cell barcode made of two 8-base sections, added sequentially by a 384 well split and recombine method, the total number of sequences would be 384 x 384 = 147,456 cell barcodes. Adding an 8-base UMI sequence to the barcode region would add 4^8 = 65,536 unique molecular identifiers. The total barcoding region is 24 bases: two 8-base barcodes and one 8-base UMI portion. By contrast, using the same number of 24 bases for coding, the method herein described, with three 8-base packets and a 96 x 96 x 96 split and recombine strategy, would make 884,736 cell meta-barcodes available. The number of molecular badge combinations generated by those three packets combined is 2^24 = 16,777,216. This would indicate the meta-barcoding scheme can synthesize near 100-fold more unique cell and molecular identifiers, all from the same nucleotide sequence.
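The figures quoted in this example can be recomputed directly. The sketch below only restates the arithmetic given above for a 24-base barcoding region; the dictionary layout and function names are illustrative.

```python
def umi_diversity(length):
    """A random UMI of `length` bases has 4 choices per position."""
    return 4 ** length

def badge_diversity(symbols):
    """A bi-degenerate badge of `symbols` positions has 2 choices per position."""
    return 2 ** symbols

# Incumbent: two 8-base barcodes from 384-well plates plus one 8-base UMI.
incumbent = {"cell_barcodes": 384 * 384, "molecular_labels": umi_diversity(8)}
# Meta-barcode: three 8-symbol packets from 96-well plates, badges in the same bases.
meta = {"cell_barcodes": 96 ** 3, "molecular_labels": badge_diversity(3 * 8)}

for name, scheme in (("incumbent", incumbent), ("meta-barcode", meta)):
    print(f"{name}: {scheme['cell_barcodes']:,} cell barcodes, "
          f"{scheme['molecular_labels']:,} molecular labels")
```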
Example 2 - Synthesis of a library of diverse meta-barcode labels
In a related embodiment the meta-barcode sequences are synthesized through application of solid phase synthesis using the phosphoramidite method and phosphoramidite building blocks derived from protected 2'-deoxynucleosides (dA, dC, dG and dT), ribonucleosides (A, C, G, and U) or chemically modified nucleosides, e.g. LNA or BNA. The barcode sequences could be synthesized directly on the solid support or, in another preferred embodiment, multiple oligonucleotide indices synthesized from relatively short sequences could be combined in a combinatorial manner to create a diverse library of unique barcode sequences.
This is exemplified in at least Figure 3 where synthesized in each well of a 96 well microplate is one degenerate sequence and all the 2^N non-degenerate oligonucleotide sequences that correspond to that one degenerate sequence. If the degenerate sequence is N symbols long, then the number of non-degenerate sequences per container is 2^N. This provides a starting library of 288 unique degenerate sequences, each in a different well of one of three 96 well microplates. The process is summarized as follows:
• Synthesized in each well of the 3x96 well microplates is a different 7-symbol molecular badge specified by a combination of the IUPAC labels W and S; M and K; R and Y. This provides a starting library of 288 packets from which to construct a library of 96^3 = 884,736 unique 3 packet, cell meta-barcode labels; see also description above.
• In each well there are 2^7 = 128 different and unique nucleotide sequences. When combined into a library of three index barcodes, this results in 2^21 = 2,097,152 unique molecular identifying sequences per barcode.
• Hydrogel beads with a cleavable/non-cleavable anchor linker incorporated into the hydrogel matrix are distributed into the wells of the three library plates and the first set of badge sequences are hybridized or ligated to the anchor sequence. The number of beads dispensed into each well is in excess of the number of sequences per well and is typically >10,000 beads per well or approximately 3,000,000 beads for all three plates. The concentration of each sequence is in the nanomole range to ensure an excess of molecules are available to react and attach to the anchor and synthesized index.
• The beads with the first index are pooled and redistributed into a second set of plates with the same sequences as the first set of plates. The second index is attached to the beads and the process is repeated with a third set of plates.
• The final result is a library of hydrogel beads where each bead has multiple copies (>10 ) of the same 2 nucleotide, 3 index molecular badge sequence (the cell barcode) but different single nucleotide sequences (molecular identifier).
Example 3 - Oligonucleotide labeled antibodies for multiplexed protein analysis
In a related embodiment the meta-code sequences are synthesized and attached to an individual antibody selected to bind with high affinity to a specific target antigen. A collection of different antibodies uniquely targeting different antigens are distinguished in part by having a unique barcode defining the antibody with a poly A sequence to be captured by the poly T sequence of the meta-barcode sequence on the barcode gel bead. In this way the unique barcode sequences identifying each unique antibody in a collection of antibodies targeting different cell antigens can be distinguished from each other and associated with the cell whose antigens they have labeled. In a similar way small molecules can be tagged with a library of meta-code sequences wherein one badge tags one small molecule and the badge is derived from the meta-code sequence associated with the cell wherein the small molecule is derived or interacts with other molecular agents in the cell. A library of meta-codes can tag a population of cells and each badge related to a meta-code tags a small molecule interacting with or associated with a tagged cell.

Claims

1. A method, comprising:
a) obtaining a sample comprising a plurality of cells,
b) labeling at least a portion of one or more molecular agents in the cell, complements thereof, or reaction products therefrom, from a first cell of the plurality and a second cell of the plurality with a first same cell label specific to the first cell and a second same cell label specific to the second cell; and a unique label specific to each of one or more molecular agents derived from the cell label, complements thereof, or reaction products therefrom, from the first cell; and wherein a unique label specific to each of one or more molecular agents, complements thereof, or reaction products therefrom, from the second cell are unique with respect to each other.
2. The method of claim 1, wherein the agent comprises a nucleic acid, a peptide, a protein, a cell, a cell lysate, a solid support, a polymer, a chemical, a metabolite, and the like, or a plurality of any of the foregoing, or a mixture of the foregoing.
3. The method of any one of claims 1 or 2, wherein the unique labels provided are at least partly nucleic acid in nature.
4. The method of any one of the preceding claims, wherein the one or more molecular agents are each associated with a single copy of a unique label or with multiple copies of the same unique label, comprising 2, 3, 4, 5, 6, 7, 8, 9, 10, 100, 10^3, 10^4, 10^5, 10^6, 10^7, 10^8, 10^9, or more copies of the same unique label.
5. The method of any one of the preceding claims, wherein the unique label is an oligonucleotide label.
6. The method of any one of the preceding claims, wherein an oligonucleotide label is constructed from standard IUPAC symbols encoding for either one of two different nucleotides or one of three different nucleotides at each position during synthesis of the oligonucleotide label.
7. The method of any one of the preceding claims, wherein an oligonucleotide label is synthesized by successive addition and polymerization of individual nucleotides according to the IUPAC symbols defining the label to create a single oligonucleotide sequence wherein the first part is a molecular specific label (the badge) and the second part is a unique label (the meta-code packet), wherein the second part is encrypted within the first part.
8. The method of any one of the preceding claims, wherein the meta-bar code is created by combining one or more meta-code packets, wherein the meta-code packet is defined by combining one or more individual bi-degenerate bases, and wherein each packet defines a unique label badge sequence for labelling molecular agents from a cell.
9. The method of any one of the preceding claims, wherein the packet sequence is decoded to its constituent nucleotides resulting in a library of unique nucleotide sequences called molecular badge sequences, and wherein sequencing the molecular badge sequences reveals the specific label identifying a molecular agent.
10. The method of any one of the preceding claims, wherein the unique labels are synthesized in real-time.
11. The method according to any one of the preceding claims, wherein the oligonucleotide label is synthesized to consist of at least two sets of two or three consecutive nucleotides encoding standard IUPAC symbols, wherein each set of at least two or three consecutive nucleotides has a Hamming distance of at least 1 to every other set of at least two or three nucleotides encoded within the oligonucleotide label.
12. The method according to any one of the preceding claims, wherein the oligonucleotide label is synthesized to consist of at least two sets of two or three consecutive nucleotides encoding standard IUPAC symbols, wherein each set of at least two or three consecutive nucleotides has a Hamming distance of at least 2 to every other set of at least two or three nucleotides encoded within the oligonucleotide label.
13. The method according to any one of the preceding claims, wherein the oligonucleotide label is synthesized to consist of at least two sets of two or three consecutive nucleotides encoding standard IUPAC symbols, wherein each set of at least two or three consecutive nucleotides has a Hamming distance of at least 3 to every other set of at least two or three nucleotides encoded within the oligonucleotide label.
14. The method according to any one of the preceding claims, wherein the oligonucleotide label is covalently attached, hybridized or ligated to an adaptor, a linker, to another oligonucleotide, a capture entity or a solid support.
15. The method according to any one of the preceding claims, wherein the oligonucleotide label is covalently attached, hybridized or ligated to a solid support via an adaptor or a linker.
16. The method according to any one of the preceding claims, wherein the adaptor is at least partly nucleic acid in nature.
17. The method according to any one of the preceding claims, wherein the adaptor is an oligonucleotide comprising one or more known sequences.
18. The method according to any one of the preceding claims, wherein the adaptor is an oligonucleotide comprising one or more primer sequences.
19. The method according to any one of the preceding claims, wherein the adaptor comprises a capture or detection moiety.
20. The method according to any one of the preceding claims, wherein the oligonucleotide label is covalently attached, hybridized or ligated to one or more hydrogel beads.
21. The method according to any one of the preceding claims, wherein said oligonucleotide label is covalently attached, hybridized or ligated to one or more hydrogel beads using a cleavable or non-cleavable linker.
22. The method according to any one of the preceding claims, wherein said oligonucleotide label from a plurality of said oligonucleotide labels is covalently attached, hybridized or ligated to a solid support optionally using said cleavable or non-cleavable linker, wherein the plurality of said oligonucleotides that are covalently attached, hybridized or ligated to a solid support or each solid support within a group of solid supports create a library, wherein each solid support or each solid support within its group of solid supports has multiple copies (>2) of said oligonucleotide label encoding the same meta-barcode (the agent barcode) and same nucleotide badge sequence (molecular specific label), wherein said nucleotide badge sequence is unique to each of said solid supports or each of said groups of solid supports.
23. A method according to claim 22, wherein each group of said solid supports, which is covalently attached, hybridized or ligated to an oligonucleotide encoding a nucleotide badge sequence, which is unique to said group of solid supports, is placed in a separate well of a microplate with multiple separate wells.
24. A method according to any one of the proceeding claims, wherein each group of solid supports is placed in a separate well of a microplate with multiple separate wells prior to or after synthesis of said oligonucleotide or a plurality of said oligonucleotides in same said wells, wherein each solid support is covalently attached, hybridized or ligated to said oligonucleotide label or a plurality of said oligonucleotide labels, optionally using a cleavable or non-cleavable linker, wherein each solid support or group of solid supports within a separate well of said microplate is attached to said oligonucleotide or a plurality of said oligonucleotides encoding a nucleotide badge sequence that is unique to each group of said attached solid supports and each said separate well.
25. A nucleic acid of a length between 24 and 50 bases comprising at least one region of 4 to 24 nucleotides which can be used as a barcode or a label, wherein said barcode or label encodes two informational parts within one single physical nucleotide sequence, a first part being a molecular specific label and a second part being a cell specific label.
26. A nucleic acid according to claim 25, wherein the cell specific label is constructed from standard IUPAC symbols, which can represent one of two different nucleotides (bi-degenerate cipher) or one of three different nucleotides (tri-degenerate cipher) at each single nucleotide position of said label sequence, wherein a string of N bi-degenerate IUPAC symbols can encode 2^N unique nucleotide sequences and a sequence of N tri-degenerate IUPAC symbols can encode 3^N unique nucleotide sequences.
27. A nucleic acid according to any one of claims 25-26, wherein the cell specific label of standard IUPAC symbols is encoded within the nucleotide sequence of the molecular specific label.
28. A nucleic acid according to any one of claims 25-27, wherein the cell specific label comprises a string of either bi-degenerate or tri-degenerate IUPAC symbols or a combination thereof, and wherein at each nucleotide position, depending on whether it is a bi-degenerate or tri-degenerate IUPAC symbol, encodes randomly one of two or similarly one of three possible nucleotides in equal parts with a single nucleotide in that specific position, and wherein this string repeats itself N times within an N-mer molecular label sequence.
29. A nucleic acid according to any one of claims 25-28, wherein said barcode or label comprises at least two or more sets of each two or three consecutive nucleotides, which each encode a predefined IUPAC symbol, wherein each set of consecutive nucleotides has a Hamming distance of at least 2 to any of the other sets contained within the entire barcode or label, thereby allowing the detection of at least one sequencing error when subsequently sequenced and analyzed.
30. A nucleic acid according to any one of claims 25-29, wherein said barcode or label comprises at least two or more sets of each two or three consecutive nucleotides, which each encode a predefined IUPAC symbol, wherein each set of consecutive nucleotides has a Hamming distance of at least 3 to any of the other sets contained within the entire barcode or label, thereby allowing the correction of at least one sequencing error when subsequently sequenced and analyzed.
31. A nucleic acid according to any one of claims 25-30, comprising at least one additional adaptor or spacer sequence.
32. A nucleic acid according to any one of claims 25-31, comprising at least one additional primer sequence.
33. A nucleic acid according to any one of claims 25-32, comprising at least one additional cleavable or non-cleavable linker sequence.
34. A nucleic acid according to claim 33, wherein said one or more linker sequences comprise one or more restriction enzyme target sequences.
35. A nucleic acid according to any one of claims 25 to 34 having a length of 50 nucleotides, comprising a barcode sequence of 9 nucleotides, wherein the cell specific label of IUPAC symbols encoded by the barcode sequence is WSSMKMRRY.
36. A nucleic acid according to any one of claims 25 to 34 having a length of 100 nucleotides, comprising two barcode sequences of each 10 nucleotides, wherein the cell specific label of IUPAC symbols encoded by the first barcode is WWWSSWSSSS and the cell specific label of IUPAC symbols encoded by the second barcode is MKKKMMKKKM.
37. A composition comprising, an oligonucleotide label comprising, a plurality of oligonucleotides, wherein the oligonucleotides are randomly attached to each other, wherein the same oligonucleotide label encodes two parts, a first part being a molecular specific label and a second part being a cell specific label.
38. The composition of claim 37, wherein the cell specific label is constructed from standard IUPAC symbols encoding for either two different nucleotides (bi-degenerate cipher) or three different nucleotides (tri-degenerate cipher) at each position during synthesis of the oligonucleotide label.
39. The composition of any one of claims 37 or 38, wherein the cell specific label is derived from the molecular specific label from a single oligonucleotide sequence.
40. The composition of any one of claims 37 to 39, wherein the oligonucleotide label is synthesized by successive addition and polymerization of individual nucleotides, and wherein the synthesis process is performed manually with standard oligonucleotide chemistries or by synthesizing a set of labels.
41. The composition of any one of claims 37 to 40, wherein the cell specific label is synthesized based on a string of either bi-degenerate or tri-degenerate symbols or a combination thereof, and wherein at each nucleotide position, depending on whether it is a bi-degenerate or tri-degenerate symbol, randomly inserts one of two or similarly one of three possible nucleotides in equal parts with a single nucleotide in that specific position to synthesize a unique oligonucleotide, and wherein the process repeats itself N times to build an N-mer molecular label sequence.
42. A composition comprising, an oligonucleotide comprising, at least one barcode or label encoding two informational parts within one single physical nucleotide sequence, a first part being a molecular specific label and a second part being a cell specific label, wherein the cell specific label consists of standard IUPAC symbols and is encoded within the nucleotide sequence of the molecular specific label.
43. The composition of claim 42, wherein said barcode or label comprises at least two or more sets of each two or three consecutive nucleotides, which each encode a predefined IUPAC symbol, wherein each set of consecutive nucleotides has a Hamming distance of at least 2 to any of the other sets contained within the entire barcode or label, thereby allowing for the detection of at least one sequencing error when subsequently sequenced and analyzed.
44. The composition according to any one of claims 42 to 43, wherein said barcode or label comprises at least two or more sets of each two or three consecutive nucleotides, which each encode a predefined IUPAC symbol, wherein each set of consecutive nucleotides has a Hamming distance of at least 3 to any of the other sets contained within the entire barcode or label, thereby allowing for the correction of at least 1 sequencing error within said oligonucleotide label when subsequently sequenced.
45. A method comprising, encoding of the information of both a unique molecular badge sequence as well as a deterministic cell meta-barcode in the same physical sequence of an oligonucleotide, wherein either a symmetrical-key algorithm or an asymmetrical-key algorithm is used to translate a defined sequence of bi-degenerate or tri-degenerate IUPAC symbols into a mixed-base nucleotide sequence during barcode synthesis.
46. A method according to claim 45, wherein a position-specific symmetrical-key algorithm is used to translate a defined sequence of bi-degenerate or tri-degenerate IUPAC symbols into a mixed-base nucleotide sequence during barcode synthesis.
47. A method according to any one of claims 45 or 46, wherein the decoding of sequencing data derived from a sample library comprising the composition according to claims 33-39 is achieved by utilizing a symmetrical-key algorithm, a position-specific symmetrical-key algorithm, an asymmetrical-key algorithm, a look-up or a hash table.
48. A kit comprising:
a) a composition facilitating the labeling of one or more cells, one or more molecular agents in a cell, one or more complements thereof, or one or more reaction products therefrom with the composition comprising:
i. a composition according to any one of claims 37 to 44,
ii. optionally reagents facilitating the attachment, hybridization or ligation of one or more oligonucleotide labels comprised in said composition to at least one said cell, said complements thereof, said molecular agents in said cell or to at least one other nucleic acid,
b) optionally, reagents facilitating the amplification and/or sequencing of said oligonucleotide labels and any attached or ligated nucleic acids,
c) optionally, reagents facilitating the detection of said oligonucleotide labels and any attached or ligated cells, nucleic acids, cellular complements, molecular agents within or reaction products from one or more cells,
d) optionally, a composition facilitating the lysis of cells,
e) optionally, a composition inhibiting nucleic acid degradation and/or digestion,
f) optionally, a composition comprising one or more solid support entities to be covalently attached, hybridized or ligated to said oligonucleotides as well as, optionally, reagents facilitating such attachment.
49. A kit according to claim 48, wherein the solid support entity is a bead, a magnetic bead or a hydrogel bead.
50. A method comprising,
a) attaching a plurality of diverse label tags to a nucleic acid target from a sample that contains multiple copies of the nucleic acid target, thereby producing a plurality of labeled targets, wherein a label tag of the plurality of diverse label tags comprises nucleotides selected from purine bases, pyrimidine bases, natural nucleotide bases, chemically modified nucleotide bases, biochemically modified nucleotide bases, non-natural nucleotide bases and a labeled target of the plurality of labeled targets comprises a distinct label tag and at least a portion of a nucleic acid target or its complementary sequence,
b) amplifying the plurality of labeled targets to produce a plurality of amplified labeled targets, wherein an amplified labeled target of the plurality of amplified labeled targets comprises a copy of at least a portion of the nucleic acid target, or its complementary sequence, and a copy of the label tag; and
c) detecting the plurality of amplified labeled targets by sequencing at least a portion of the target and the label tag; and
d) determining the number of copies of the nucleic acid target, as indicated by the number of different label tags, that are associated with the nucleic acid target, wherein the label tag comprises a nucleic acid according to claims 25 to 36.
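
As an illustration only, and not as the applicants' implementation, the following Python sketch shows the arithmetic behind the degenerate-symbol claims. Each bi-degenerate IUPAC symbol (R, Y, S, W, K, M) stands for two bases and each tri-degenerate symbol (B, D, H, V) for three, so a string of N bi-degenerate symbols admits 2^N concrete sequences and a string of N tri-degenerate symbols admits 3^N. The function names (count_sequences, expand, matches, hamming, min_pairwise_hamming) are hypothetical and chosen for this sketch. Decoding by pattern matching is one simple assumption about how a read could be assigned to a cell specific label; the claims also mention key-based algorithms and hash tables, which are not modeled here.

```python
from itertools import product

# Standard IUPAC nucleotide codes: plain bases, bi-degenerate and
# tri-degenerate symbols, and N (any base).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def count_sequences(pattern):
    """Number of concrete sequences a degenerate pattern can encode."""
    total = 1
    for symbol in pattern:
        total *= len(IUPAC[symbol])
    return total


def expand(pattern):
    """List every concrete sequence encoded by a (short) pattern."""
    return ["".join(bases) for bases in product(*(IUPAC[s] for s in pattern))]


def matches(read, pattern):
    """True if each base of the read is allowed by the symbol at that position."""
    return len(read) == len(pattern) and all(
        base in IUPAC[symbol] for base, symbol in zip(read, pattern)
    )


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def min_pairwise_hamming(words):
    """Smallest Hamming distance between any two words of a code."""
    return min(
        hamming(a, b) for i, a in enumerate(words) for b in words[i + 1:]
    )


if __name__ == "__main__":
    label = "WSSMKMRRY"  # cell specific label recited in claim 35
    print(count_sequences(label))       # nine bi-degenerate symbols
    print(matches("ACGATAAGC", label))  # a read consistent with the label
    print(min_pairwise_hamming(["AAA", "CCC", "GGG"]))
```

A minimum pairwise distance of 2 between code words is enough to detect a single substitution, and a minimum distance of 3 is enough to correct one, which is the standard coding-theory relation that the Hamming distance thresholds in the claims rely on.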
PCT/US2020/020321 2019-03-01 2020-02-28 Nucleic acid labeling methods and composition WO2020180659A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201962812496P 2019-03-01 2019-03-01
US62/812,496 2019-03-01

Publications (1)

Publication Number Publication Date
WO2020180659A1 (en) 2020-09-10

Family

ID=72337568

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2020/020321 WO2020180659A1 (en) 2019-03-01 2020-02-28 Nucleic acid labeling methods and composition

Country Status (1)

Country Link
WO (1) WO2020180659A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP4001432A1 (en) * 2020-11-13 2022-05-25 Miltenyi Biotec B.V. & Co. KG Algorithmic method for efficient indexing of genetic sequences using associative arrays
WO2022118027A1 (en) * 2020-12-02 2022-06-09 Oxford University Innovation Limited Oligonucleotides

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8976049B2 (en) * 2013-06-03 2015-03-10 Good Start Genetics, Inc. Methods and systems for storing sequence read data
US20160289740A1 (en) * 2015-03-30 2016-10-06 Cellular Research, Inc. Methods and compositions for combinatorial barcoding
US9902950B2 (en) * 2010-10-08 2018-02-27 President And Fellows Of Harvard College High-throughput single cell barcoding
US20180320224A1 (en) * 2017-05-03 2018-11-08 The Broad Institute, Inc. Single-cell proteomic assay using aptamers

Similar Documents

Publication Publication Date Title
AU2018266377B2 (en) Universal short adapters for indexing of polynucleotide samples
CN108350499B (en) Convertible marking compositions, methods, and processes incorporating same
Van Dijk et al. Ten years of next-generation sequencing technology
US10253363B2 (en) Materials and methods to analyze RNA isoforms in transcriptomes
US11789906B2 (en) Systems and methods for genomic manipulations and analysis
AU2018261332A1 (en) Optimal index sequences for multiplex massively parallel sequencing
JP7332733B2 (en) High molecular weight DNA sample tracking tags for next generation sequencing
US20060263789A1 (en) Unique identifiers for indicating properties associated with entities to which they are attached, and methods for using
US20110257031A1 (en) Nucleic acid, biomolecule and polymer identifier codes
AU2017359048B2 (en) Methods for expression profile classification
US20160194699A1 (en) Molecular coding for analysis of composition of macromolecules and molecular complexes
EP1709203A2 (en) Improving polynucleotide ligation reactions
WO2020180659A1 (en) Nucleic acid labeling methods and composition
KR20230065357A (en) Methods for identification of samples
JP5926189B2 (en) RNA analysis method
US20220177964A1 (en) A high throughput sequencing method and kit
US20240117423A1 (en) Quantitative detection and analysis of molecules

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20766952

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20766952

Country of ref document: EP

Kind code of ref document: A1