WO2016063059A1 - Improved nucleic acid re-sequencing using a reduced number of identified bases - Google Patents

Improved nucleic acid re-sequencing using a reduced number of identified bases

Info

Publication number
WO2016063059A1
Authority
WO
WIPO (PCT)
Prior art keywords
nucleotide
nucleic acid
sequencing
nucleotides
sequence
Prior art date
Application number
PCT/GB2015/053153
Other languages
French (fr)
Inventor
Tobias William Barr Ost
Russell Smith HAMILTON
Original Assignee
Cambridge Epigenetix Ltd
Priority date
Filing date
Publication date
Application filed by Cambridge Epigenetix Ltd
Publication of WO2016063059A1

Classifications

    • C: CHEMISTRY; METALLURGY
    • C12: BIOCHEMISTRY; BEER; SPIRITS; WINE; VINEGAR; MICROBIOLOGY; ENZYMOLOGY; MUTATION OR GENETIC ENGINEERING
    • C12Q: MEASURING OR TESTING PROCESSES INVOLVING ENZYMES, NUCLEIC ACIDS OR MICROORGANISMS; COMPOSITIONS OR TEST PAPERS THEREFOR; PROCESSES OF PREPARING SUCH COMPOSITIONS; CONDITION-RESPONSIVE CONTROL IN MICROBIOLOGICAL OR ENZYMOLOGICAL PROCESSES
    • C12Q1/00: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions
    • C12Q1/68: Measuring or testing processes involving enzymes, nucleic acids or microorganisms; Compositions therefor; Processes of preparing such compositions involving nucleic acids
    • C12Q1/6869: Methods for sequencing

Definitions

  • the present technology relates to nucleic acid re-sequencing.
  • the detection of nucleic acid sequences present in a biological sample has been used, for example, as a method for identifying and classifying microorganisms, diagnosing infectious diseases, detecting and characterizing genetic abnormalities, identifying genetic changes associated with cancer, studying genetic susceptibility to disease, and measuring response to various types of treatment.
  • a common technique for detecting specific nucleic acid sequences in a biological sample is nucleic acid sequencing of the entire nucleic acid content of the sample. Sequencing methodologies are currently in use which allow for the parallel processing of billions of nucleic acids in a single sequencing run. As such, the information generated from a single sequencing run can be enormous. There is a great need to reduce the cost of generating, storing and processing sequencing data.
  • DNA sequencing can be carried out using electrophoresis to separate different length fragments.
  • Sanger sequencing involves the use of labelled, terminating ddNTPs which act to prevent strand extension once incorporated.
  • the use of a mixture of labelled ddNTPs and conventional unlabelled dNTPs produces a ladder of terminated fragments, where the identity of the terminal base is known from the identity of the label.
  • each of the four nucleic acid bases can be separately labelled.
  • Electrophoresis allows separation of the fragments of different lengths, and the identity of the labelled bases can be determined. If one of the four bases is labelled, then a ladder is produced showing labelled bands and gaps where the labels relate to a single nucleotide.
  • the size of the gaps can only be accurately determined by comparing with reads carried out with the other three labelled bases (i.e. you need four lanes on the same gel run at the same time).
  • a series of bands in the single lane of a gel gives no reliable information relating to the number of nucleic acid bases in the gaps.
  • the use of four lanes allows for the complete sequence to be determined as the identity of each base can be established.
  • US20060147935 describes certain methods for nucleic acid sequencing.
  • the application is not specific regarding the details of how to implement any methods disclosed therein.
  • the application requires the comparison of multiple runs in order to generate the sequencing read.
  • paragraph [0002] describes that where a single stopping nucleotide is used, four runs using different stopping nucleotides are required.
  • the invention described herein generates sequencing data from a single run. Sequence reads are generated without having to perform multiple runs on the same sample.
  • the methods of the invention do not require size based separation.
  • the methods of the invention are based on the ability to obtain useful information on the base sequence of a nucleic acid sample without having to generate a complete sequence read of all four nucleotide bases. Methods of the invention are carried out without size based separation. Methods of the invention are carried out without the need to compare multiple reads on the same sample in order to build the sequence.
  • Embodiments of the present invention relate to methods for obtaining nucleic acid sequence information where an original comparative sequence is already known.
  • the inventors have developed a method to reduce the complexity of nucleic acid sequence data by recording the appearance of only one of the four nucleic acid bases, together with an accurate measurement of the size (or length) of the gaps between appearances of that base.
  • Such low resolution sequencing can be viewed as one base sequencing.
  • only a single base is identified, along with the number of bases in the gaps between the identified, or called, bases.
  • the complexity of nucleic acid sequencing data is reduced, whilst useful information can still be obtained by comparing with the known reference sequence. Reducing the number of detectable labels from four to one reduces the cost of sequencing reagents, and reduces the complexity of instrumentation required to determine sequences.
  • the method can involve the use of repeated cycles of polymerase extension using one labelled, blocked nucleotide triphosphate and three unlabelled (dark) blocked nucleotide triphosphates.
  • the blocking groups are reversible, thereby allowing repetition of the cycles to continue extension of the same strands.
  • the method can use repeated alternating cycles using one nucleotide triphosphate followed by three nucleotide triphosphates.
  • the method can use a single labelled nucleotide and three unlabelled nucleotides, with the separation between the labels being recorded.
  • the method can use nanopore sequencing.
  • One embodiment of nanopore sequencing is to prepare samples where one of the four nucleotides is labelled with an identifiable label, and the gap size (or length) between adjacent labels is determined. As an example, sequencing reads of the invention are simplified thus:
  • D is any of the three bases A or G or T/U (i.e. D is 'not C').
  • V is any of the three bases A or C or G (i.e. V is 'not T').
  • B is any of the three bases C or G or T/U (i.e. B is 'not A').
  • H is any of the three bases A or C or T/U (i.e. H is 'not G').
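The four "not X" symbols listed above can be illustrated with a minimal Python sketch. The helper name `one_base_read` and the example sequence are hypothetical and are not taken from the application. The sketch assumes that the full sequence is known, which is never the case for a real read, and is only meant to show what the simplified read looks like.

```python
# Illustrative sketch only: the function name and example data are assumptions.
# Symbols for "any base except the called one", as listed above:
# not-A = B, not-C = D, not-G = H, not-T/U = V.
NOT_CODE = {"A": "B", "C": "D", "G": "H", "T": "V", "U": "V"}


def one_base_read(seq: str, called: str) -> str:
    """Keep the called base and replace every other base with its 'not' symbol."""
    called = called.upper()
    dark = NOT_CODE[called]
    return "".join(base if base == called else dark for base in seq.upper())


print(one_base_read("ACGTTCCA", "C"))  # DCDDDCCD
print(one_base_read("ACGT", "T"))      # VVVT
```

Each position is still represented, so the length of every gap is preserved even though the identity of the gap bases is discarded.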
  • the invention may be useful in the field of methylation analysis where the conversion of a subset of the C's to U's using bisulfite can be determined.
  • the invention is also useful in the rapid analysis of samples for the detection of pathogens. Such rapid analysis is useful in infection control and also quality control in manufacturing operations.
  • the detection can be carried out to identify small or large scale genetic changes.
  • the method can be used to identify copy number differences, large scale deletions, insertions or rearrangements.
  • the invention can also be used in genetic fingerprinting or forensic applications.
  • the invention can be used in HLA typing or tissue typing.
  • Some such methods include the steps of providing sequencing reagents to a target nucleic acid in the presence of a polymerase, the sequencing reagents including all four nucleotide monomers, each of which includes a reversibly terminating (blocking) moiety, wherein three of the nucleotide monomers are unlabelled (dark), and one is labelled.
  • each strand is extended by a single nucleotide, but on average three of the four strands are 'dark' on each extension cycle (i.e. three quarters of the strands incorporate a single dark nucleotide, the identity of which is not determined).
  • the cycles are repeated, and each read is recorded as 'dark' or 'labelled' for each cycle.
  • the number of 'dark' cycles for each read determines the size of the gaps between 'labelled' cycles.
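The cycle-by-cycle bookkeeping described above can be sketched as follows. This is a simplified simulation under stated assumptions: exactly one blocked nucleotide is incorporated per cycle, detection is error free, and the function names `cycle_signals` and `gaps_from_signals` are hypothetical.

```python
# Illustrative sketch only: names and example data are assumptions.
def cycle_signals(synthesised_strand: str, labelled: str = "C") -> list[bool]:
    """One entry per cycle: True if the labelled nucleotide was incorporated."""
    return [base == labelled for base in synthesised_strand]


def gaps_from_signals(signals: list[bool]) -> tuple[list[int], int]:
    """Count the dark cycles preceding each labelled cycle.

    Returns (gaps, trailing), where trailing is the number of dark cycles
    recorded after the last labelled cycle.
    """
    gaps, dark = [], 0
    for is_labelled in signals:
        if is_labelled:
            gaps.append(dark)
            dark = 0
        else:
            dark += 1
    return gaps, dark


signals = cycle_signals("ACGTTCCA", labelled="C")
print(gaps_from_signals(signals))  # ([1, 3, 0], 1)
```

A gap of 0 corresponds to two labelled cycles in a row, i.e. two adjacent copies of the called base.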
  • Alternative methods include the steps of providing sequencing reagents to a target nucleic acid in the presence of a polymerase, the sequencing reagents including repeated alternating cycles where a first cycle has one of the four nucleotide monomers, and a second cycle has the remaining three nucleotide monomers. A third cycle has the one nucleotide monomer of the first cycle, and a fourth cycle has the remaining three nucleotide monomers etc. The number of bases incorporated on the 'three monomer' cycle is recorded, and thus the size of the gaps determined. In such methods, the nucleotide monomers lack a reversibly terminating moiety.
  • Both examples of the above-described methods include removing unincorporated sequencing reagents between cycles.
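The alternating-cycle variant can be sketched in the same spirit. Because the monomers are not blocked, each cycle extends the strand until a base outside the supplied set is required, so the quantity recorded per cycle is a count. The function name `alternating_cycles` is hypothetical, counts are assumed to be read without error, and cycles with zero incorporation (for example a first one-monomer cycle when the strand starts with a different base) are simply omitted.

```python
# Illustrative sketch only: names and example data are assumptions.
from itertools import groupby


def alternating_cycles(synthesised_strand: str, single: str = "C") -> list[tuple[str, int]]:
    """Return (cycle type, number of bases incorporated) for each productive cycle."""
    return [
        ("single" if is_single else "three", len(list(run)))
        for is_single, run in groupby(synthesised_strand, key=lambda base: base == single)
    ]


print(alternating_cycles("ACGTTCCA", single="C"))
# [('three', 1), ('single', 1), ('three', 3), ('single', 2), ('three', 1)]
```

The counts on the 'three' cycles are the gap sizes, and the counts on the 'single' cycles give the number of adjacent copies of the called base.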
  • where the nucleotide monomers include a reversibly terminating moiety, further steps in the method may include removing the reversibly terminating moiety between consecutive sequencing cycles.
  • further steps in the method may include detecting incorporation of the nucleotide monomer into said polynucleotide.
  • the detecting includes detecting a label.
  • the detecting includes detecting pyrophosphate.
  • detecting pyrophosphate can include, but is not limited to, detecting a signal that is produced in the presence of, by the incorporation of or by the degradation of pyrophosphate.
  • the detecting includes detecting protons released upon monomer incorporation.
  • the detecting includes detecting the release of labelled pyrophosphate.
  • the at least one nucleotide monomer includes a label.
  • the label is selected from the group consisting of fluorescent moieties, chromophores, antigens, dyes, phosphorescent groups, radioactive materials, chemiluminescent moieties, scattering or fluorescent nanoparticles, Raman signal generating moieties, and electrochemical detection moieties.
  • Some embodiments of the method, where the nucleotide monomer includes a label may also include cleaving the label from the nucleotide monomer.
  • the first sequencing reagent is provided simultaneously to a plurality of target nucleic acids.
  • the plurality of target nucleic acids includes target nucleic acids having different nucleotide sequences.
  • the first sequencing reagent is provided in parallel to a plurality of target nucleic acids at separate features of an array.
  • the polymerase includes a polymerase selected from the group consisting of a DNA polymerase, an RNA polymerase, a reverse transcriptase, and mixtures thereof.
  • the polymerase may be a thermostable polymerase or a thermodegradable polymerase.
  • Methods for obtaining nucleic acid sequence information include the steps of (a) providing a first sequencing reagent to a target nucleic acid in the presence of a polymerase, the first sequencing reagent including a plurality of different nucleotide monomers, each including a reversibly terminating moiety, where three of the nucleotide monomers are unlabelled (dark), and one is labelled, (b) detecting the label, (c) removing the reversibly terminating moiety and the label, and (d) repeating steps (a), (b) and (c), whereby sequence information for at least a portion of the target nucleic acid is obtained.
  • Methods include the steps of (a) providing a first sequencing reagent to a target nucleic acid in the presence of a ligase, wherein the first sequencing reagent includes at least four oligonucleotides, wherein the oligonucleotides all include a reversibly terminating moiety and one of the four contains a label, (b) detecting the label, (c) removing the reversibly terminating moiety and the label, and (d) repeating steps (a), (b) and (c), whereby sequence information for at least a portion of the target nucleic acid is obtained.
  • the terminating moiety can be at the end of the oligonucleotide, or partly within the strand
  • Methods of the invention measure the length of dark regions interspersed with the nucleotide of interest. Thus reads of single bases are obtained with known separation. The dark regions are effectively degenerate with respect to the other three nucleotides.
  • the single nucleotide being called may be cytidine (C), adenosine (A), guanosine (G), thymidine (T) or uridine (U).
  • the single nucleotide being called may be cytidine (C).
  • the single nucleotide being called may be thymidine (T) or uridine (U).
  • the nucleotide being analysed may appear at least twice in the read.
  • the single nucleotide may appear at least 10 times in the sequencing read.
  • the single nucleotide may appear at least 20 times in the sequencing read.
  • the single nucleotide may appear at least 30 times in the sequencing read.
  • the nucleotide reads may span at least 50 nucleotides.
  • the nucleotide reads may span at least 100 nucleotides. Reads in this context refer to the total length of sequence determined (i.e. the total of N plus X bases where N is the bases in the gap and X is the single base being called).
  • Methods include a method for analysing cytosine methylation in a nucleotide sample, the method comprising;
  • the method may determine nucleic acid sequence data using repeated cycles of polymerase extension using one labelled, blocked nucleotide triphosphate and three unlabelled blocked nucleotide triphosphates. Each cycle has the four nucleotides present, but only one of the four is identifiable.
  • the labelled nucleotide may be adenosine (A), thereby measuring T or U, and the size of gaps therebetween.
  • the method may determine nucleic acid sequence data using repeated alternating cycles of polymerase extension comprising a first cycle of one nucleotide triphosphate and a second cycle of three nucleotide triphosphates, wherein on the cycles where three nucleotide triphosphates are added together, the number of incorporated nucleotides is determined.
  • the first cycle may be carried out using dATP and the second cycle carried out using dCTP, dTTP and dGTP.
  • the first cycle may be carried out using dGTP and the second cycle carried out using dCTP, dTTP and dATP.
  • the process may determine a large number of reads in parallel. For example, the method may generate at least 1 thousand sequence reads in parallel. For example, the method may generate at least 1 million sequence reads in parallel.
  • the sequencing may be carried out on a solid support.
  • kits comprising one labelled, blocked nucleotide triphosphate and three unlabelled blocked nucleotide triphosphates and a nucleic acid polymerase.
  • the blocking moiety may be a 3' azidomethyl (CH₂N₃) group.
  • Figure 1 shows mapping efficiency of selected bases for unique matches for native (0%) and bisulfite converted genomes (98, 99, 100%). Sequence reads of only one base can therefore be mapped back against the reference genome providing the size of the gaps is accurately determined.
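How a one-base read might be placed on a reference can be sketched as below. This is a naive exact-match search on the forward strand only, with no tolerance for errors. It is not the mapping procedure used to produce Figure 1, and the names `mask` and `map_positions` and the toy sequences are assumptions.

```python
# Illustrative sketch only: names, sequences and the exact-match strategy are assumptions.
def mask(seq: str, called: str = "C") -> str:
    """Reduce a sequence to two symbols: X for the called base, N for any other base."""
    return "".join("X" if base == called else "N" for base in seq)


def map_positions(read_mask: str, reference: str, called: str = "C") -> list[int]:
    """Return every reference offset at which the masked read matches exactly."""
    ref_mask = mask(reference, called)
    hits = []
    start = ref_mask.find(read_mask)
    while start != -1:
        hits.append(start)
        start = ref_mask.find(read_mask, start + 1)
    return hits


reference = "ACGTTCCAGGCTA"
read = mask("TTCCAG", called="C")               # "NNXXNN"
print(map_positions(read, reference, "C"))      # [3]
```

A read maps uniquely when exactly one offset is returned. Short reads, or reads with few called bases, tend to return several offsets, which is why accurate gap sizes and sufficient read length matter.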
  • the sequencing methods described herein produce a molecular signature that may be compared with other signatures and predicted signatures.
  • the signature need not provide a nucleotide sequence at single nucleotide resolution. Rather, the signature can provide a unique identification of a nucleic acid based on a low resolution sequence of the nucleic acid.
  • the low resolution sequence can be, for example, degenerate with respect to the identity of the nucleotide type at one or more position in the nucleotide sequence of the nucleic acid.
  • sequence information that can be obtained using the methods described herein can be used in applications involved in genotyping, expression profiling, capturing alternative splicing, genome mapping, amplicon sequencing, methylation detection and metagenomics, as well as for applications involving detection of contaminant sequences, for example pathogen detection.
  • Embodiments of the present invention relate to methods for obtaining nucleic acid sequence information.
  • the inventors have developed a method to reduce the complexity of nucleic acid sequence data by recording the appearance of only one of the four nucleic acid bases, together with the size of the gaps between appearances of that base. Thus rather than identifying all four nucleotides in each sequence read, only a single base is identified, along with the number of bases in the gap between the identified, or called, bases. Thus the complexity of nucleic acid sequencing data is reduced, whilst useful information can still be obtained when compared against a reference sequence.
  • the current method accurately records the exact number of bases in the gaps.
  • the method does not rely on size separation, and does not rely on aligning multiple reads taken on different aliquots of the same sample in order to establish the size of the gaps.
  • the length of the gaps is determined by accurately noting the number of unlabelled nucleotides between the labelled or identified nucleotide.
  • Advantages of the method include a reduced cost of instrumentation.
  • the ability to detect only one colour simplifies the optics of the detection system, resulting in fewer lasers and filters.
  • the speed of acquiring data will be four times faster, as only one base is being recorded per cycle.
  • the reduced level of structural modification of the nucleic acids, together with the reduced level of imaging of each nucleic acid (which reduces photo-induced nucleic acid damage), allows a longer read length to be obtained per read.
  • the resultant reads are computationally easier to align as the alignment is carried out using two bases (X and N, where X is the specific base and N is the other three bases) rather than all four.
  • the method can involve the use of repeated cycles of polymerase extension using one labelled, blocked nucleotide triphosphate and three unlabelled (dark) blocked nucleotide triphosphates. Reducing the number of labels from four to one reduces the cost of sequencing reagents. Alternatively the method can use repeated alternating cycles using one nucleotide triphosphate followed by three nucleotide triphosphates.
  • Methods of the invention measure the length of dark regions interspersed with the nucleotide of interest. Thus reads of single bases are obtained with known separation. The dark regions are effectively degenerate with respect to the other three nucleotides. The length of the dark region can be determined, either from the length of the 'homopolymer' run obtained using three unlabelled nucleotides, or the number of cycles of reversibly blocked incorporation which are carried out between appearances of the labelled nucleotide monomer.
  • reducing the complexity refers to lowering the number of nucleotides whose identity is determined in an individual sequencing read.
  • the complexity can be reduced if only one or two of the bases in a sequence are identified, and the other positions are left marked as N. N signifies that a nucleotide is present, but that its identity is not known.
  • 'N' is being used as an abbreviation for 'the other three nucleotides'.
  • where C and gaps are recorded, the sequence becomes a string of Cs and Ds, where D is any of the three bases A or G or T/U (i.e. D is 'not C').
  • where T and gaps are recorded, the sequence becomes a string of Ts and Vs, where V is any of the three bases A or C or G (i.e. V is 'not T').
  • where A and gaps are recorded, the sequence becomes a string of As and Bs, where B is any of the three bases C or G or T/U (i.e. B is 'not A').
  • where G and gaps are recorded, the sequence becomes a string of Gs and Hs, where H is any of the three bases A or C or T/U (i.e. H is 'not G').
  • the length of the unknown regions needs to be accurately established (i.e. the number of unknown bases in the gaps must be determined).
  • the sequence reads effectively become bases followed by gaps of known length before the next base. Any of the four bases can be measured, but for the purposes of examples below T is indicated.
  • the T can alternatively be A, C or G and V can alternatively be B, D or H. Sequences thus become TVnTVnTVn etc., where V is any base other than T and n is the number of consecutive A, C and G bases in each gap.
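The TVnTVnTVn notation can be made concrete with a small encoder and decoder. The compact string format used here (a gap symbol followed by its length, e.g. `V3`) is an assumption chosen for illustration, as are the function names. The application does not prescribe a file or string format.

```python
# Illustrative sketch only: the string format and names are assumptions.
import re
from itertools import groupby

NOT_CODE = {"A": "B", "C": "D", "G": "H", "T": "V"}


def encode(seq: str, called: str = "T") -> str:
    """Write called bases literally and each gap as <not-symbol><gap length>."""
    dark = NOT_CODE[called]
    parts = []
    for is_called, run in groupby(seq, key=lambda base: base == called):
        n = len(list(run))
        parts.append(called * n if is_called else f"{dark}{n}")
    return "".join(parts)


def decode(encoded: str) -> str:
    """Expand the compact form back to one symbol per position."""
    return "".join(sym * int(n or 1) for sym, n in re.findall(r"([A-Z])(\d*)", encoded))


print(encode("ATGGTTACG", called="T"))  # V1TV2TTV3
print(decode("V1TV2TTV3"))              # VTVVTTVVV
```

Decoding restores one symbol per position, so the encoded form loses nothing beyond the identity of the gap bases.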
  • the single nucleotide being called may be cytidine (C).
  • the single nucleotide being called may be thymidine (T) or uridine (U).
  • the application may be particularly useful in methylation analysis.
  • Bisulfite treatment converts most of the C bases to U. Thus comparison of the C or U bases pre and post bisulfite treatment gives an accurate record of the methylation status without needing to determine the exact sequences of all four bases.
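The pre/post comparison can be sketched with C-only information. The sketch assumes idealised, complete conversion of unprotected C to U and reads that are already aligned position by position. The function names and the toy sequence are hypothetical.

```python
# Illustrative sketch only: idealised 100% conversion; names and data are assumptions.
def bisulfite_convert(seq: str, protected: set[int]) -> str:
    """Simulate conversion: C becomes U unless its position is protected (e.g. methylated)."""
    return "".join(
        "U" if base == "C" and i not in protected else base
        for i, base in enumerate(seq)
    )


def c_positions(seq: str) -> set[int]:
    """Positions of C, which is all a C-called one-base read reports."""
    return {i for i, base in enumerate(seq) if base == "C"}


def call_methylation(untreated: str, treated: str) -> dict[str, list[int]]:
    """Compare C positions before and after treatment."""
    before, after = c_positions(untreated), c_positions(treated)
    return {"protected": sorted(before & after), "converted": sorted(before - after)}


untreated = "ACGTCCGA"
treated = bisulfite_convert(untreated, protected={1, 5})
print(treated)                                # ACGTUCGA
print(call_methylation(untreated, treated))   # {'protected': [1, 5], 'converted': [4]}
```

Only the C positions of each read are consulted, which is why a one-base read of C (or of T/U) is enough for this comparison.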
  • the nucleotide being analysed may appear at least twice in the read.
  • the single nucleotide may appear at least 10 times in the sequencing read.
  • the single nucleotide may appear at least 20 times in the sequencing read.
  • the single nucleotide may appear at least 30 times in the sequencing read.
  • the read may be any length desired.
  • the nucleotide reads may span at least 50 nucleotides.
  • the nucleotide reads may span at least 100 nucleotides. Due to the lower number of fluorescent reads per strand, and the lower level of strand modification, the read length may be longer than that obtained using cycles where every base has a modification in order to allow a label to be attached. In the case of 'one base' sequencing, only the one base being identified is modified; the remaining three 'dark' bases are all natural. Thus read lengths may be 1000 bases or greater. Reads in this context refer to the total length of sequence determined (i.e. the total of N plus X bases where N is the gap and X is the single base being called).
  • Typical fragments analysed on the Illumina system may be 500-1000 base pairs of duplex DNA.
  • the reads on said fragments may be for example 100-300 bases per read.
  • the reads can be taken from both ends of the fragment.
  • approximately 75 bases of a single type would be determined if the sequence was an equal mix of all four bases.
  • the single nucleotide may appear approximately 75 times if the length of the read spanned 300 nucleotides.
  • the length of fragments to be analysed can be thousands of bases, and each base may appear hundreds of times in each of the reads.
  • Methods include a method for analysing cytosine methylation in a nucleotide sample, the method comprising;
  • the identity of the bases between the T or U bases, or C bases, respectively is not determined.
  • the size, or length, in terms of the number of bases is recorded, but not the identity of the intervening bases. This is contrary to the prior art 'all base' sequencing, where the identity of all four nucleotides is recorded. In the methods described, there is no way of knowing the identity of three of the four bases, as they are supplied together in an unlabelled form which cannot be differentiated.
  • Bisulfite treatment converts both cytosine and 5-formylcytosine residues in a polynucleotide into uracil. Where any 5-carboxycytosine is present (as a product of the oxidation step), this 5-carboxycytosine is converted into uracil in the bisulfite treatment. Without wishing to be bound by theory, it is believed that the reaction of the 5-formylcytosine proceeds via loss of the formyl group to yield cytosine, followed by a subsequent deamination to give uracil. The 5-carboxycytosine is believed to yield the uracil through a sequence of decarboxylation and deamination steps. Bisulfite treatment may be performed under conditions that convert both cytosine and 5-formylcytosine or 5-carboxycytosine residues in a polynucleotide as described herein into uracil.
  • a portion of the population may be treated with bisulfite by incubation with bisulfite ions (HSO₃⁻).
  • the use of bisulfite ions (HSO₃⁻) to convert unmethylated cytosines in nucleic acids into uracil is standard in the art and suitable reagents and conditions are well known to the skilled person. Numerous suitable protocols and reagents are also commercially available (for example, EpiTect™, Qiagen NL; EZ DNA Methylation™, Zymo Research Corp CA; CpGenome Turbo Bisulfite Modification Kit, Millipore).
  • the method may include additional steps.
  • the method may include the step(s) of oxidising and/or reducing the sample prior to bisulfite treatment.
  • Bisulfite sequencing allows 5-methylcytosine to be distinguished from the unmethylated cytosine.
  • other cytosine modifications including 5-hydroxymethyl and 5-formyl have been identified.
  • techniques involving oxidation and/or reduction of the samples prior to bisulfite sequencing have been developed.
  • In order to extract value from bisulfite sequencing, the sequencing output must be compared with a sample which has not undergone bisulfite treatment.
  • a portion of the nucleic acid sample may be oxidised using an oxidising agent.
  • the oxidising agent may be a non-enzymatic oxidising agent, for example, an organic or inorganic chemical compound.
  • Suitable oxidising agents are well known in the art and include metal oxides, such as KRuO₄, MnO₂ and KMnO₄.
  • Particularly useful oxidising agents are those that may be used in aqueous conditions, which are most convenient for the handling of the polynucleotide. However, oxidising agents that are suitable for use in organic solvents may also be employed where practicable.
  • the oxidising agent may comprise a perruthenate anion (RuO₄⁻).
  • Suitable perruthenate oxidising agents include organic and inorganic perruthenate salts, such as potassium perruthenate (KRuO₄) and other metal perruthenates; tetraalkyl ammonium perruthenates, such as tetrapropylammonium perruthenate (TPAP) and tetrabutylammonium perruthenate (TBAP); polymer-supported perruthenate (PSP) and tetraphenylphosphonium ruthenate.
  • the oxidising agent may be a metal (VI) oxo complex.
  • the oxidising agent may be manganate (Mn(VI)O₄²⁻), ferrate (Fe(VI)O₄²⁻), osmate (Os(VI)O₄²⁻), ruthenate (Ru(VI)O₄²⁻), or molybdate (Mo(VI)O₄²⁻).
  • the portions of polynucleotides may be reduced by treatment with a reducing agent.
  • the reducing agent is any agent suitable for generating an alcohol from an aldehyde.
  • the reducing agent or the conditions employed in the reduction step may be selected so that any 5-formylcytosine is selectively reduced (i.e. the reducing agent or reduction conditions are selective for 5-formylcytosine).
  • the reducing agent or conditions are selected to minimise or prevent any degradation of the polynucleotide.
  • Suitable reducing agents are well-known in the art and include NaBH₄, NaCNBH₃ and LiBH₄.
  • Particularly useful reducing agents are those that may be used in aqueous conditions, as such are most convenient for the handling of the polynucleotide.
  • reducing agents that are suitable for use in organic solvents may also be employed where practicable.
  • the reduced and oxidised portions of the population are treated with bisulfite.
  • a second portion of the population which has not been oxidised or reduced is also treated with bisulfite.
  • the bisulfite treatment can be done separately on the three samples, or, if tagged primers are used, the samples can be pooled so that the reduced, oxidised and untreated samples are all exposed to bisulfite in the same reaction.
  • the method may determine nucleic acid sequence data using repeated cycles of polymerase extension using one labelled, blocked nucleotide triphosphate and three unlabelled blocked nucleotide triphosphates.
  • the labelled nucleotide may be adenosine (A), thereby measuring T or U, and the size of gaps therebetween.
  • the labelled nucleotide may be guanosine (G), thereby measuring the remaining C bases and the size of gaps therebetween.
  • the labelled nucleotide may be any of the four nucleotides.
  • the labelled nucleotide may be A, G, C or T.
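The relationship between the labelled nucleotide and the template base it reports follows from base pairing: labelled A is incorporated opposite T or U in the template, and labelled G opposite C. A minimal sketch, with a hypothetical function name and template, in which positions are counted along the template in the order the polymerase reads it:

```python
# Illustrative sketch only: names and example data are assumptions.
# Template base -> nucleotide incorporated opposite it.
PAIRS = {"A": "T", "C": "G", "G": "C", "T": "A", "U": "A"}


def labelled_cycles(template: str, labelled: str) -> list[int]:
    """Template positions at which the labelled nucleotide would be incorporated."""
    return [i for i, base in enumerate(template) if PAIRS[base] == labelled]


template = "ACUUGC"  # e.g. a bisulfite-treated strand containing U
print(labelled_cycles(template, "A"))  # [2, 3]  -> reports T or U in the template
print(labelled_cycles(template, "G"))  # [1, 5]  -> reports the remaining C bases
```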
  • the method may determine nucleic acid sequence data using repeated alternating cycles of polymerase extension comprising a first cycle of one nucleotide triphosphate and a second cycle of three nucleotide triphosphates, wherein on the cycles where three nucleotide triphosphates are added together, the number of incorporated nucleotides is determined.
  • the first cycle may be carried out using dATP and the second cycle carried out using dCTP, dTTP and dGTP.
  • the first cycle may be carried out using dGTP and the second cycle carried out using dCTP, dTTP and dATP.
  • the individual nucleotide may be any of the four nucleotides.
  • the individual nucleotide may be A, G, C or T.
  • the first cycle may be carried out using dATP and the second cycle carried out using dCTP, dTTP and dGTP.
  • the first cycle may be carried out using dCTP and the second cycle carried out using dATP, dTTP and dGTP.
  • the first cycle may be carried out using dGTP and the second cycle carried out using dCTP, dTTP and dATP.
  • the first cycle may be carried out using dTTP and the second cycle carried out using dCTP, dATP and dGTP.
  • the process may determine a large number of reads in parallel. For example, the method may generate at least 1 thousand sequence reads in parallel. For example, the method may generate at least 1 million sequence reads in parallel.
  • the sequencing may be carried out on a solid support.
  • aspects of the present invention relate to methods for obtaining nucleic acid sequence information of a target nucleic acid.
  • the methods described herein relate to obtaining a molecular signature of a target nucleic acid, where the molecular signature includes a low resolution (less than all four nucleotides) representation of the target nucleic acid sequence.
  • Some embodiments of these methods can be employed with nucleotide monomers while others utilize oligonucleotides.
  • where oligonucleotides are used, one or more of the oligonucleotides can include a reversibly terminating moiety.
  • where nucleotide monomers are used, one or more of the nucleotide monomers can include a reversibly terminating moiety.
  • Methods for obtaining nucleic acid sequence information include the steps of (a) providing a first sequencing reagent to a target nucleic acid in the presence of a polymerase, the first sequencing reagent including a plurality of different nucleotide monomers, each including a reversibly terminating moiety, where three of the nucleotide monomers are unlabelled (dark), and one is labelled, (b) detecting the label, (c) removing the reversibly terminating moiety and the label, and (d) repeating steps (a), (b) and (c), whereby sequence information for at least a portion of the target nucleic acid is obtained.
  • Methods include the steps of (a) providing a first sequencing reagent to a target nucleic acid in the presence of a ligase, wherein the first sequencing reagent includes at least four oligonucleotides, wherein the oligonucleotides all include a reversibly terminating moiety and one of the four contains a label, (b) detecting the label, (c) removing the reversibly terminating moiety and the label, and (d) repeating steps (a), (b) and (c), whereby sequence information for at least a portion of the target nucleic acid is obtained.
  • the terminating moiety can be at the end of the oligonucleotide, or partly within the strand such that the oligonucleotide length is reduced by the removal of the terminating moiety.
  • "oligonucleotide" and/or "nucleic acid" can refer to at least two nucleotide monomers linked together.
  • a nucleic acid can generally contain phosphodiester bonds; however, in some embodiments, nucleic acid analogs may have other types of backbones, comprising, for example, phosphoramide, phosphorothioate, phosphorodithioate, O-methylphosphoroamidite and peptide nucleic acid backbones and linkages.
  • modifications of nucleotides or oligonucleotides may be made to facilitate the addition of additional moieties such as labels, or to increase the stability of such molecules under certain conditions.
  • mixtures of different nucleic acid analogs, and mixtures of naturally occurring nucleic acids and analogs may be made.
  • the nucleic acids may be single stranded or double stranded, as specified, or contain portions of both double stranded and single stranded sequence.
  • the nucleic acid may be DNA, for example, genomic or cDNA, RNA or a hybrid.
  • a nucleic acid can contain any combination of deoxyribo- and ribo-nucleotides, and any combination of bases, including uracil, adenine, thymine, cytosine, guanine, inosine, xanthine, hypoxanthine, isocytosine, isoguanine, and base analogs such as nitropyrrole (including 3-nitropyrrole) and nitroindole (including 5-nitroindole), etc.
  • nucleotide monomer can refer to a nucleotide or nucleotide analog that can become incorporated into a primer polynucleotide.
  • the nucleotide monomers are separate non-linked nucleotides. That is, the nucleotide monomers are not present as dimers, trimers, etc.
  • Such nucleotide monomers may be substrates for an enzyme that may extend a polynucleotide strand.
  • nucleotide monomers may be nucleotide 5'-triphosphates (rNTPs, dNTPs or blocked NTPs).
  • Nucleotide monomers may or may not become incorporated into a nascent polynucleotide in a flow step depending on the sequence of the complementary polynucleotide.
  • Nucleotide monomers may or may not contain label moieties and/or terminator moieties.
  • Terminator moieties include reversibly terminating moieties. Incorporation of a nucleotide monomer comprising reversibly terminating moieties can inhibit extension of a primer polynucleotide, however, the moiety can be removed and the primer polynucleotide may be extended further once the block is removed.
  • nucleotide monomers include deoxyribonucleotides, modified deoxyribonucleotides, ribonucleotides, modified ribonucleotides, peptide nucleotides, modified peptide nucleotides, modified phosphate sugar backbone nucleotides and mixtures thereof. Nucleotide analogs which include a modified nucleobase can also be used in the methods described herein.
  • a nucleotide monomer may comprise a label moiety and/or a terminator moiety.
  • sequencing reagent can refer to a composition, such as a solution, comprising one or more precursors of a polymer such as nucleotide monomers.
  • a sequencing reagent includes one or more nucleotide monomers having a label moiety, a terminator moiety, or both.
  • moieties are chemical groups that are not naturally occurring moieties of nucleic acids, being introduced by synthetic means to alter the natural characteristics of the nucleotide monomers with regard to detectability under particular conditions or enzymatic reactivity under particular conditions.
  • a sequencing reagent comprises one or more nucleotide monomers that lack a label moiety and/or a terminator moiety.
  • the sequencing reagent consists of or consists essentially of one nucleotide monomer type, two different nucleotide monomer types, three different nucleotide monomer types or four different nucleotide monomer types.
  • "Different" nucleotide monomer types are nucleotide monomers that have different base moieties. Two or more nucleotide monomer types can have other moieties, such as those set forth above, that are the same as each other or different from each other.
  • a sequencing reagent may comprise an oligonucleotide that may be incorporated into a polymer. The oligonucleotide may comprise a terminator moiety and/or a label moiety.
  • For ease of illustration, various methods and compositions are described herein with respect to multiple nucleotide monomers. It will be understood that the multiple nucleotide monomers of these methods or compositions can be of the same or different types unless explicitly indicated otherwise. It should be understood that when providing a sequencing reagent comprising multiple nucleotide monomers to a target nucleic acid, the nucleotide monomers do not necessarily have to be provided at the same time. For example, two nucleotide monomers can be delivered, either together or separately, to a target nucleic acid. In such embodiments, a sequencing reagent comprising two nucleotide monomers will have been provided to the target nucleic acid. In some embodiments, zero, one or two of the nucleotide monomers will be incorporated into a polynucleotide that is complementary to the target nucleic acid.
  • complementary polynucleotides includes polynucleotide strands that are not necessarily complementary to the full length of the target sequence. That is, a complementary polynucleotide can be complementary to only a portion of the target nucleic acid. Complementary polynucleotides can be produced by extending a primer polynucleotide using cycles of sequencing reagents. As more nucleotide monomers are incorporated into the complementary polynucleotide, the complementary polynucleotide becomes complementary to a greater portion of the target nucleic acid. Typically, the complementary portion is a contiguous portion of the target nucleic acid.
  • sequencing run refers to a repetitive process of physical or chemical steps that is carried out to obtain signals indicative of the order of monomers in a polymer.
  • the signals can be indicative of an order of monomers at single monomer resolution or lower resolution.
  • the steps can be initiated on a nucleic acid target and carried out to obtain signals indicative of size of gaps between repeated appearances of the same base.
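The idea of reading out gap sizes between repeated appearances of one base can be sketched as follows. The helper `gap_sizes` is hypothetical. It assumes the run has already been reduced to one 0/1 signal per position, where 1 marks the detected base.

```python
def gap_sizes(signals):
    """Count the undetected positions between consecutive detected positions."""
    positions = [i for i, s in enumerate(signals) if s]
    return [later - earlier - 1 for earlier, later in zip(positions, positions[1:])]


# Detected base at positions 0, 3, 4 and 8 gives gaps of 2, 0 and 3.
print(gap_sizes([1, 0, 0, 1, 1, 0, 0, 0, 1]))  # [2, 0, 3]
```

A list of gap sizes carries the same information as the 0/1 string between its first and last detected position, so either form could serve as the low resolution signature.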
  • the process can be carried out to its completion, which may be the point at which signals from the process can no longer distinguish bases of the target with a reasonable level of certainty. If desired, completion can occur earlier, for example, once a desired amount of sequence information has been obtained, for example a particular read length.
  • a sequencing run can be carried out on a single target nucleic acid molecule or simultaneously on a population of target nucleic acid molecules having the same sequence, or simultaneously on a population of target nucleic acids having different sequences.
  • cycle refers to the portion of a sequencing run that is repeated to indicate the presence of at least one monomer in a polymer.
  • a cycle includes several steps such as steps for delivery of reagents, washing away unreacted reagents and detection of signals indicative of changes occurring in response to added reagents.
  • a cycle of a sequencing-by-synthesis (SBS) reaction can include delivery of a sequencing reagent that includes one or more types of nucleotide, washing to remove unreacted nucleotides, and detection to detect one or more nucleotides that are incorporated in an extended nucleic acid.
  • cycle can refer to the portion of a sequencing run that is repeated to extend a polynucleotide complementary to a target nucleic acid.
  • a cycle can include several steps such as the delivery of a first reagent, washing away unreacted reagents, and delivery of a second reagent.
  • delivery steps can be for limited extension of a polynucleotide complementary to a target nucleic acid.
  • the polynucleotide strand may be extended in each delivery step by one or more nucleotides, depending on the number of monomers which are terminated. Where all four monomers are terminated, each cycle extends each polynucleotide by a single monomer.
  • low resolution when used in reference to a sequence read, means providing less information on the order and type of monomers in a polymer than provided by a complete monomer resolution sequence representation of the same polymer.
  • the term can refer to a resolution at which at least one type of monomeric unit in a polymer can be distinguished from at least a first other type of monomeric unit in the polymer, but cannot necessarily be distinguished from a second other type of monomeric unit in the polymer.
  • low resolution when used in reference to a sequence representation of a nucleic acid means that two or three possible nucleotide types can be indicated as candidate residents at any particular position in the sequence while the two or three nucleotide types cannot necessarily be distinguished from each other in any and all of the sequence representation or in a portion of the sequence representation.
  • two different monomeric units from an actual polymer sequence can be assigned a common label (N) or identifier in a low resolution sequence representation.
  • three different monomeric units from an actual polymer sequence can be assigned a common label (N) or identifier in a low resolution sequence representation.
  • a low resolution representation of a nucleic acid can include a string of symbols and the number of different symbol types in the string can be less than the number of different nucleotide types in the actual sequence of the nucleic acid.
  • a low resolution sequence representation can include regions where the identity of monomeric units is unknown.
  • a sequence representation can include a sequence of distinguishable monomeric units interspersed with symbols representing regions of unknown content. The length of the region is known, just not the identity of each of the specific bases.
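A low resolution representation of this kind can be written as a string in which identified bases keep their letter and every other position receives a common symbol. This is a minimal sketch. The function name `low_resolution` and the use of `N` as the common symbol are illustrative assumptions.

```python
def low_resolution(sequence, identified="C", unknown="N"):
    """Keep the identified base types; replace every other base with `unknown`.

    The output has the same length as the input, so the size of each unknown
    region is preserved even though its base content is not.
    """
    keep = set(identified)
    return "".join(base if base in keep else unknown for base in sequence)


print(low_resolution("ACGTTAGCC"))        # NCNNNNNCC
print(low_resolution("ACGTTAGCC", "CG"))  # NCGNNNGCC
```

With one identified base the string uses two symbol types; with two identified bases it uses three. Both are fewer than the four nucleotide types in the actual sequence.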
  • the term "type", when used in reference to a monomer, nucleotide or other unit of a polymer, is intended to refer to the species of monomer, nucleotide or other unit. The type of monomer, nucleotide or other unit can be identified independent of their positions in the polymer. Similarly, when used in reference to a symbol or other identifier in a sequence representation, the term is intended to refer to the species of symbol or identifier and can be independent of their positions in the sequence representation.
  • Exemplary types of nucleotide monomers are those having either adenine (A), cytosine (C), guanine (G), thymine (T), or uracil (U) bases.
  • included among nucleotide monomers having cytosine are those that are methylated at the 5-position, such as 5-methyl cytosine or 5-hydroxymethyl cytosine, and those that are not methylated at the 5-position.
  • degenerate means a location having more than one possible base, where the identity of the base is not identified at a particular location.
  • the term refers to a position in the nucleic acid representation for which two or more nucleotide types are identified as candidate occupants in the corresponding position of the actual nucleic acid sequence.
  • a degenerate position in a nucleic acid can have, for example, 2, 3 or 4 nucleotide types as candidate occupants.
  • the number of different nucleotide types at a degenerate position in a sequence representation can be greater than one and less than three, namely, two.
  • the number of different nucleotide types at a degenerate position in a sequence representation can be greater than one and less than four, namely, two or three. In other embodiments, the number of different nucleotide types at a degenerate position in a sequence representation can be greater than two and less than four, namely, three. Typically, the number of different nucleotide types at a degenerate position in a sequence representation can be less than the number of different nucleotide types present in the actual nucleic acid sequence that is represented.
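One way to make degenerate positions concrete is to write each position with the standard IUPAC nucleotide ambiguity codes, which map a single symbol to its set of candidate bases. The text above does not mention IUPAC notation; its use here, and the helper names `degeneracy` and `degenerate_positions`, are assumptions made for illustration.

```python
# Standard IUPAC nucleotide codes mapped to their candidate bases.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def degeneracy(representation):
    """Number of candidate nucleotide types at each position."""
    return [len(IUPAC[symbol]) for symbol in representation]


def degenerate_positions(representation):
    """Indices of positions with more than one candidate nucleotide type."""
    return [i for i, symbol in enumerate(representation) if len(IUPAC[symbol]) > 1]


print(degeneracy("ACRNB"))            # [1, 1, 2, 4, 3]
print(degenerate_positions("ACRNB"))  # [2, 3, 4]
```

Under this notation, a read in which only C is identified could be written with `D` (A, G or T) at every dark position, giving three candidate types there.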
  • performing a limited extension can include delivering a sequencing reagent to a target nucleic acid in the presence of a polymerase, where the sequencing reagent lacks at least one type of nucleotide monomer that can base-pair with at least one nucleotide in a target nucleic acid.
  • the sequencing reagent may contain at least one type of nucleotide monomer, but no more than three types of nucleotide monomer.
  • a sequencing reagent containing three different nucleotide monomers (A, C, G) can be delivered to a target nucleic acid in the presence of polymerase.
  • a polynucleotide complementary to the target nucleic acid may be extended until the polymerase reaches an 'A'; here extension will be limited because of the lack of 'T' in the sequencing reagent.
  • Such embodiments may be referred to as dark extensions since one purpose of this process is to extend down a target nucleic acid without necessarily reading the sequence of the target nucleic acid. In such cases the number of A, C and G nucleotides incorporated can be determined, but the order of incorporation is not determined.
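A dark extension with a three-monomer reagent can be sketched as a loop that copies the template until it meets a base whose partner is absent from the reagent. The function `dark_extension` is hypothetical. It assumes unterminated monomers, perfect Watson-Crick pairing, and a template listed in the order the polymerase copies it.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def dark_extension(template, start=0, supplied="ACG"):
    """Extend from `start` until the template needs a monomer not in `supplied`.

    Returns the template index where extension stops and the number of each
    supplied monomer incorporated. The counts say how many of each monomer
    were added, not the order in which they were added.
    """
    counts = {base: 0 for base in supplied}
    position = start
    while position < len(template) and COMPLEMENT[template[position]] in counts:
        counts[COMPLEMENT[template[position]]] += 1
        position += 1
    return position, counts


# Extension stops at index 5, where template 'A' would need the missing 'T'.
print(dark_extension("GCTTGACG"))  # (5, {'A': 2, 'C': 2, 'G': 1})
```

A following single-monomer cycle could then resume from the returned position.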
  • the three base cycle can be followed by a cycle in which a single nucleotide monomer is incorporated.
  • the sequencing reagent can lack at least one, two, or three types of nucleotide monomer that may base-pair with at least one type of nucleotide in a target nucleic acid. In other embodiments, the sequencing reagent can lack at least one, two, or three different types of nucleotide monomer. It is also contemplated that in some embodiments, a sequencing reagent may contain a promiscuous nucleotide monomer such as a universal nucleotide monomer or semi-universal nucleotide monomer, that may base-pair with more than one type of nucleotide in a target nucleic acid.
  • by "universal nucleotide monomer" is meant a nucleotide monomer that pairs with the entire complement of nucleotides present in the target nucleic acid.
  • by "semi-universal nucleotide monomer" is meant a nucleotide monomer that pairs with more than one but less than the entire complement of nucleotides present in the target nucleic acid.
  • the sequencing reagent can lack at least one type of nucleotide that may base-pair with at least one nucleotide in the target. Thus rather than having three nucleotide monomers, less than three can be used to achieve the same length of base extension before the extension stops due to lack of the correct monomer.
  • Additional methods that can be used for limited extension of a polynucleotide complementary to a target nucleic acid include delivering a sequencing reagent to a target nucleic acid in the presence of a polymerase, where the sequencing reagent includes at least one type of nucleotide monomer comprising a terminating moiety.
  • the nucleotide monomer may base-pair with at least one nucleotide that may be present in a target nucleic acid.
  • the terminating moiety is reversibly terminating.
  • the terminating moiety is a 3'-azidomethyl group.
  • a sequencing reagent containing A, C, G, T monomers is delivered to a target nucleic acid in the presence of polymerase.
  • Each of the monomers contains a terminating moiety.
  • One of the monomers is labelled. Thus on average, 25% of the reads will become labelled per cycle.
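The 25% figure follows when the four bases are equally likely at the position being read. The sketch below checks this by enumerating every possible 3-base template rather than sampling real reads. The uniform composition and the name `labelled_fraction` are assumptions; real genomes with skewed base composition would deviate from 25%.

```python
from itertools import product

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def labelled_fraction(templates, cycle, labelled_base="C"):
    """Fraction of templates that incorporate the labelled monomer at `cycle`."""
    hits = sum(1 for t in templates if COMPLEMENT[t[cycle]] == labelled_base)
    return hits / len(templates)


# All 4**3 = 64 templates of length 3, i.e. a perfectly uniform population.
templates = ["".join(bases) for bases in product("ACGT", repeat=3)]
print([labelled_fraction(templates, cycle) for cycle in range(3)])  # [0.25, 0.25, 0.25]
```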
  • the extension of a polynucleotide using nucleotides with terminating moieties can be repeated. In such embodiments, reversibly terminating moieties can be used to facilitate subsequent extensions.
  • sequencing reagents in subsequent steps of limited extension can contain nucleotide monomers comprising terminating moieties that are the same or different.
  • the reversibly terminating moiety of an incorporated nucleotide monomer can be removed prior to a subsequent extension step.
  • unincorporated nucleotide monomers can be removed prior to delivering a subsequent sequencing reagent.
  • the terminating moiety is a 3'-azidomethyl group.
  • the sequencing reagent can use a mixture of four blocked NTP's.
  • the blocked NTP's can have a chemical block at the 3' position.
  • the blocking moiety can be a small chemical moiety, for example an allyl, methoxymethyl or azidomethyl group.
  • the extended primers can be unblocked by removing the block at the 3'-position to release a 3'-OH.
  • the release can be carried out using a phosphine reagent.
  • the block can be removed using palladium and a phosphine.
  • Additional methods that can be used for limited extension of a polynucleotide complementary to a target nucleic acid include delivering a sequencing reagent to a target nucleic acid in the presence of a ligase, where the sequencing reagent includes at least one oligonucleotide.
  • the oligonucleotide comprises a terminating moiety.
  • the terminating moiety is a reversibly terminating moiety.
  • the oligonucleotide can be complementary to the target nucleic acid such that the oligonucleotide can be ligated to a polynucleotide complementary to at least a portion of the target nucleic acid, thus extending the polynucleotide complementary to the target nucleic acid.
  • An oligonucleotide can comprise at least two linked nucleotide monomers.
  • the oligonucleotide can be at least a 2-mer, 3-mer, 4-mer, 5-mer, 6-mer, 7-mer, 8-mer, 9-mer, or 10-mer.
  • the length of the oligonucleotide may exceed 10 linked nucleotides.
  • oligonucleotides of any length can be designed in order to facilitate accurate and/or rapid limited extensions.
  • the limited extensions can be dark extensions, however, as with the above examples of limited extension, there is no requirement that these limited extensions are dark extensions.
  • the sequencing reagent for limited extension can include a plurality of oligonucleotides.
  • the plurality of oligonucleotides can include different oligonucleotides.
  • the plurality of oligonucleotides can include degenerate oligonucleotides or oligonucleotides comprising promiscuous bases.
  • the plurality of oligonucleotides includes at least one oligonucleotide that is complementary to the target nucleic acid such that the oligonucleotide can be ligated to a polynucleotide complementary to at least a portion of the target nucleic acid, thus extending the polynucleotide complementary to the target nucleic acid.
  • a portion of the oligonucleotides corresponding to a single nucleotide to be identified can be labelled.
  • a sequencing reagent can be delivered to a target nucleic acid in the presence of ligase, where the sequencing reagent contains a plurality of oligonucleotides comprising reversibly terminating moieties. Some of the oligonucleotides may hybridize to various nucleotide sequences of the target nucleic acid, including a sequence where the hybridizing oligonucleotide can be ligated to a polynucleotide complementary to at least a portion of the target nucleic acid, thus extending the polynucleotide complementary to the target nucleic acid.
  • the extension of the polynucleotide is limited because the reversibly terminating moiety of the ligated oligonucleotide can prevent further extension of the polynucleotide.
  • the reversibly terminating moiety can be removed prior to a subsequent reagent delivery step.
  • the labels can be fluorescent labels.
  • the fluorescent labels can be attached to the base of the nucleotide.
  • the labels can be attached through cleavable linkers, which may be chemically cleavable.
  • the labels may be attached through linkers which are cleavable under the same conditions as those used to remove the reversible terminator moieties.
  • pyrophosphate released on incorporation of a nucleotide monomer into a polynucleotide complementary to at least a portion of the target nucleic acid can be detected using pyrosequencing techniques. Pyrosequencing detects the release of pyrophosphate as particular nucleotides are incorporated into a nascent polynucleotide.
  • protons released on incorporation of a nucleotide monomer into a polynucleotide complementary to at least a portion of the target nucleic acid can be detected using ion sensor techniques.
  • a target nucleic acid can include any nucleic acid of interest.
  • Target nucleic acids can include, but are not limited to, DNA, RNA, peptide nucleic acid, morpholino nucleic acid, locked nucleic acid, glycol nucleic acid, threose nucleic acid, mixtures thereof, and hybrids thereof.
  • genomic DNA fragments or amplified copies thereof are used as the target nucleic acid.
  • Some embodiments can utilize a single target nucleic acid.
  • Other embodiments can utilize a plurality of target nucleic acids.
  • a plurality of target nucleic acids can include a plurality of the same target nucleic acids, a plurality of different target nucleic acids where some target nucleic acids are the same, or a plurality of target nucleic acids where all target nucleic acids are different.
  • Embodiments that utilize a plurality of target nucleic acids can be carried out in multiplex formats such that reagents are delivered simultaneously to the target nucleic acids, for example, in a single chamber or on an array surface.
  • target nucleic acids can be amplified as described in more detail herein.
  • the plurality of target nucleic acids can include substantially all of a particular organism's genome.
  • the plurality of target nucleic acids can include at least a portion of a particular organism's genome including, for example, at least about 1%, 5%, 10%, 25%, 50%, 75%, 80%, 85%, 90%, 95%, or 99% of the genome.
  • the portion can have an upper limit that is at most about 1%, 5%, 10%, 25%, 50%, 75%, 80%, 85%, 90%, 95%, or 99% of the genome.
  • the population of target nucleic acid molecules may be a sample of DNA or RNA, for example a genomic DNA sample.
  • Suitable DNA and RNA samples may be obtained or isolated from a sample of cells, for example, mammalian cells such as human cells or tissue samples, such as biopsies.
  • the sample may be obtained from a formalin fixed paraffin embedded (FFPE) tissue sample.
  • Target nucleic acids can be obtained from any source.
  • target nucleic acids may be prepared from nucleic acid molecules obtained from a single organism or from populations of nucleic acid molecules obtained from natural sources that include one or more organisms.
  • Sources of nucleic acid molecules include, but are not limited to, organelles, cells, tissues, organs, or organisms.
  • Cells that may be used as sources of target nucleic acid molecules may be prokaryotic (bacterial cells, for example, Escherichia, Bacillus, Serratia, Salmonella, Staphylococcus, Streptococcus, Clostridium, Chlamydia, Neisseria, Treponema, Mycoplasma, Borrelia, Legionella, Pseudomonas, Mycobacterium, Helicobacter, Erwinia, Agrobacterium, Rhizobium, and Streptomyces genera); archaeal, such as crenarchaeota, nanoarchaeota or euryarchaeota; or eukaryotic such as fungi (for example, yeasts), plants, protozoans and other parasites, and animals (including insects (for example, Drosophila spp.), nematodes (for example, Caenorhabditis elegans), and mammals (for example, rat, mouse, monkey, non
  • the population may be a diverse population of nucleic acid molecules, for example a library, such as a whole genome library or a locus-specific library.
  • Nucleic acid strands in the population may be amplified nucleic acid molecules, for example, amplified fragments of the same genetic locus or region from different samples.
  • Nucleic acid strands in the population may be enriched.
  • the population may be an enriched subset of a sample produced by pull-down onto a hybridisation array or digestion with a restriction enzyme.
  • the samples may be further processed, for example by amplification or sequencing.
  • the samples may be copied using a nucleic acid polymerase. If adaptors are attached to both ends of the target fragments, the population of fragments can be amplified using a single pair of primers complementary to the adaptors.
  • the nucleic acids from different sources can be separately tagged. The tags can thereby be used to help identify sequences from different sources.
  • the disclosure herein includes the use of two or more different populations of identifiable tags for the multiplexing of the analysis of different samples.
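Sorting reads by sample tag can be sketched as a prefix lookup. This is only an illustration: it assumes each tag sits at the start of the read, is read at full resolution, and matches exactly. The function `demultiplex`, the sample names and the tag sequences are all hypothetical.

```python
def demultiplex(reads, tags):
    """Group reads by the tag found at their start.

    `tags` maps a sample name to its tag sequence. Returns a dict of
    tag-trimmed reads per sample and a list of reads that matched no tag.
    """
    by_sample = {name: [] for name in tags}
    unassigned = []
    for read in reads:
        for name, tag in tags.items():
            if read.startswith(tag):
                by_sample[name].append(read[len(tag):])  # strip the tag
                break
        else:
            unassigned.append(read)  # no tag matched this read
    return by_sample, unassigned


tags = {"sample_1": "ACGT", "sample_2": "TGCA"}
reads = ["ACGTGGCC", "TGCAATAT", "GGGGCCCC"]
print(demultiplex(reads, tags))
# ({'sample_1': ['GGCC'], 'sample_2': ['ATAT']}, ['GGGGCCCC'])
```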
  • the methods described herein can be used in conjunction with a variety of sequencing techniques.
  • the sequencing may be carried out using a commercially available high throughput sequencing platform. Suitable sequencing platforms include Illumina TruSeq, LifeTech IonTorrent, Roche 454 and PacBio RS.
  • the process to determine the nucleotide sequence of a target nucleic acid can be an automated process.
  • Embodiments include sequencing by synthesis (SBS) techniques.
  • SBS techniques generally involve the enzymatic extension of a nascent nucleic acid strand through the iterative addition of nucleotides against a template strand.
  • a single nucleotide monomer may be provided to a target nucleic acid in the presence of a polymerase in each delivery.
  • more than one type of nucleotide monomer can be provided to a target nucleic acid in the presence of a polymerase in a delivery.
  • SBS can utilize nucleotide monomers that have a terminator moiety or those that lack any terminator moieties.
  • Methods utilizing nucleotide monomers lacking terminators include, for example, pyrosequencing and sequencing using gamma-phosphate-labelled nucleotides. In methods using nucleotide monomers lacking terminators, the number of different nucleotides added in each cycle can be dependent upon the template sequence and the mode of nucleotide delivery.
  • the terminator can be effectively irreversible under the sequencing conditions used as is the case for traditional Sanger sequencing which utilizes dideoxynucleotides, or the terminator can be reversible as is the case for sequencing methods developed by Solexa (now Illumina, Inc.). In preferred methods a terminator moiety can be reversibly terminating.
  • SBS techniques can utilize nucleotide monomers that have a label moiety or those that lack a label moiety. Accordingly, incorporation events can be detected based on a characteristic of the label, such as fluorescence of the label; a characteristic of the nucleotide monomer such as molecular weight or charge; a by-product of incorporation of the nucleotide, such as release of pyrophosphate; or the like.
  • the different nucleotides can be distinguishable from each other, or alternatively, the two or more different labels can be indistinguishable under the detection techniques being used.
  • the different nucleotides present in a sequencing reagent can have different labels and they can be distinguished using appropriate optics as exemplified by the sequencing methods developed by Solexa (now Illumina, Inc.).
  • one, two or three of the nucleotide monomers can be unlabelled.
  • Embodiments can utilize sequencing by ligation techniques. Such techniques utilize DNA ligase to incorporate oligonucleotides and identify the incorporation of such oligonucleotides.
  • Embodiments can include nanopore sequencing techniques.
  • the target nucleic acid passes through a nanopore.
  • the nanopore can be a synthetic pore or biological membrane protein, such as alpha-hemolysin.
  • each base-pair can be identified by measuring fluctuations in the electrical conductance of the pore.
  • the bases can be naturally occurring, or can be modified. Modified strands can be prepared where one base is made to carry a larger moiety, and is thus easier to detect.
  • the use of a nucleic acid where each A, C, G or T carries a label is within the scope of the current disclosure.
  • the gaps between the labels can be determined in the absence of being able to identify each base in the sequence between the labels. In such a way, the requirements to identify every base can be eliminated, and the complexity of nanopore based sequencing methods reduced.
  • Embodiments can utilize methods involving the real-time monitoring of DNA polymerase activity. Nucleotide incorporations can be detected through fluorescence resonance energy transfer (FRET) interactions between a fluorophore-bearing polymerase and gamma-phosphate-labelled nucleotides, which can be detected with zero-mode waveguides.
  • the illumination can be restricted to a zeptoliter-scale volume around a surface-tethered polymerase such that incorporation of fluorescently labelled nucleotides can be observed with low background.
  • a single molecule real-time (SMRT) chip comprises a plurality of zero-mode waveguides (ZMW).
  • Each ZMW comprises a cylindrical hole tens of nanometers in diameter perforating a thin metal film supported by a transparent substrate.
  • attenuated light may penetrate the lower 20-30 nm of each ZMW creating a detection volume of about 10⁻²¹ L. Smaller detection volumes increase the sensitivity of detecting fluorescent signals by reducing the amount of background that can be observed.
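The zeptoliter scale of the detection volume can be sanity-checked with simple geometry. This is an order-of-magnitude sketch only: it treats the illuminated region as a uniform cylinder, and the 20 nm diameter and 25 nm depth are assumed example values chosen from the stated ranges. Wider holes give proportionally larger volumes.

```python
import math


def cylinder_volume_litres(diameter_nm, depth_nm):
    """Volume of a cylinder in litres, given its dimensions in nanometres."""
    volume_nm3 = math.pi * (diameter_nm / 2) ** 2 * depth_nm
    return volume_nm3 * 1e-24  # 1 nm^3 = 1e-27 m^3 = 1e-24 L


volume = cylinder_volume_litres(diameter_nm=20, depth_nm=25)
print(f"{volume:.2e} L")  # a few zeptoliters (1 zL = 1e-21 L)
```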
  • SMRT chips and similar technology can be used in association with nucleotide monomers fluorescently labelled on the terminal phosphate of the nucleotide.
  • the label is cleaved from the nucleotide monomer on incorporation of the nucleotide into the polynucleotide. Accordingly, the label is not incorporated into the polynucleotide, increasing the signal:background ratio. Moreover, the need for conditions to cleave a label from labelled nucleotide monomers is reduced.
  • Target nucleic acids can be prepared where target nucleic acid sequences are interspersed approximately every 20 bp with adaptor sequences.
  • the target nucleic acids can be amplified using rolling circle replication, and the amplified target nucleic acids can be used to prepare an array of target nucleic acids.
  • Methods of sequencing such arrays include sequencing by ligation, in particular, sequencing by combinatorial probe-anchor ligation (cPAL).
  • the sequencing methods described herein can be advantageously carried out in multiplex formats such that multiple different target nucleic acids are manipulated simultaneously.
  • different target nucleic acids can be treated in a common reaction vessel or on a surface of a particular substrate. This allows convenient delivery of sequencing reagents, removal of unreacted reagents and detection of incorporation events in a multiplex manner.
  • the target nucleic acids can be in an array format. In an array format, the target nucleic acids can be typically bound to a surface in a spatially distinguishable manner.
  • the target nucleic acids can be bound by direct covalent attachment, attachment to a bead or other particle or binding to a polymerase or other molecule that is attached to the surface.
  • the array can include a single copy of a target nucleic acid at each site (also referred to as a feature) or multiple copies having the same sequence can be present at each site or feature. Multiple copies can be produced by amplification methods such as bridge amplification or emulsion PCR.
  • any of the above-described sequencing processes can be incorporated into the methods described herein.
  • the methods can utilize sequencing reagents having mixtures of one or more nucleotide monomers or can otherwise be carried out under conditions where one or more nucleotide monomers contact a target nucleic acid in a single sequencing cycle.
  • the methods can utilize sequencing reagents having mixtures of oligonucleotides and ligase.
  • other known sequencing processes can be easily implemented for use with the methods and/or systems described herein.
  • Methods described herein are a useful tool in obtaining the molecular signature of a sequence, such as a DNA sequence.
  • sequence information that can be obtained using the methods described herein can be used in applications involved in genotyping, expression profiling, capturing alternative splicing, genome mapping, amplicon sequencing, methylation detection and metagenomics.
  • the reads obtained can be compared to a reference sequence. The comparison with previously obtained data may be referred to as re-sequencing, as opposed to de-novo sequencing.
  • low resolution sequence representations can provide a signature for different nucleic acids in a sample. Accordingly, the actual sequence of a target nucleic acid need not be determined at single-nucleotide resolution and, instead, a low resolution sequence representation of the nucleic acid can be used.
  • the low resolution sequence representations comprise one or more positions where single nucleotide assignments are not made.
  • the low resolution representations comprise one or more regions where no nucleotide assignment or a completely ambiguous nucleotide assignment is made interspersed by regions where at least one position is assigned with single base resolution. In some embodiments, these regions contain multiple consecutive positions (high resolution sequence islands) that are assigned with single base resolution.
  • the high resolution sequence island may contain one or more areas of sequence ambiguity; however, high resolution sequence islands are often preferred.
  • a low resolution sequence representation can be used to determine the presence or absence of a target nucleic acid in a particular sample or to quantify the amount of the target nucleic acid. Exemplary applications include, but are not limited to, expression analysis, identification of organisms, or evaluation of structure for chromosomes, expressed RNAs or other nucleic acids.
  • low resolution sequence representations for one or more target mRNA molecules can be used to determine expression levels in one or more samples of interest. So long as the low resolution sequence representations are sufficiently indicative of the mRNA, the actual sequence need not be known at single nucleotide resolution. For example, if a low resolution sequence representation distinguishes a target mRNA from all other mRNA species expressed in a target sample and in a reference sample, then comparison of the low resolution sequence representations from both samples can be used to determine relative expression levels.
  • Target nucleic acids used in expression methods can be obtained from any of a variety of different samples including, for example, cells, tissues or biological fluids from organisms such as those set forth above.
  • Presence or absence, or even quantities of target nucleic acids can be determined for samples that have been treated with different chemical agents, physical manipulations, environmental conditions or the like.
  • samples can be from organisms that are experiencing any of a variety of diseases, conditions, developmental states or the like.
  • a reference sample and target sample will differ in regard to one or more of the above factors (for example, treatment, conditions, species origin, or cell type).
  • low resolution sequence representations for target nucleic acids obtained from a particular organism can be used to characterize or identify the organism.
  • a pathogenic organism can be identified in an environmental sample or in a clinical sample from an individual based on at least one low resolution sequence representation for a target nucleic acid from the sample. So long as the one or more low resolution sequence representations are sufficiently indicative of the organism, the actual sequence need not be known at single nucleotide resolution. For example, if a low resolution sequence representation distinguishes a pathogenic bacterial strain from other bacteria, then comparison of the low resolution sequence representations from the sample of interest to low resolution sequence representations from reference samples or from a database can be used to detect presence or absence of the pathogenic bacterial strain.
  • a low resolution sequence representation of the 16S rRNA gene can be used to characterize and/or identify an organism.
  • the 16S rRNA gene is highly conserved across species and contains highly conserved sequences that may be interspersed with variable sequences that may be species-specific.
  • a low resolution sequence representation of a 16S rRNA gene may identify a particular organism through the pattern of uniform and variable regions that may be obtained at low resolution.
  • the composition of particular sequencing reagents can be determined to obtain sequence information at low resolution in highly conserved regions of the 16S rRNA gene, and to obtain sequence information at a higher resolution in variable regions of the 16S rRNA gene.
  • the structure of a chromosome, RNA or other nucleic acid can be determined based on low resolution sequence representations. For example, if a low resolution sequence representation distinguishes a chromosomal region from other regions of a chromosome, then comparison of the low resolution sequence representations from a target sample and a reference sample for which the chromosome structure is known can be used to identify insertions, deletions or rearrangements in the target sample. Similarly, if a low resolution sequence representation distinguishes a target mRNA isoform (i.e.
  • Target nucleic acids used to determine chromosome or RNA structure can be obtained from any of a variety of samples including, but not limited to those exemplified above.
  • low resolution sequence representations can be obtained for a plurality of target nucleic acids that are fragments of a larger nucleic acid such as a genome.
  • the sequence information for the individual fragments can be used to determine the actual sequence of the larger nucleic acid at single nucleotide resolution.
  • multiple low resolution sequence representations from each feature can be used to determine the actual sequence of each fragment target nucleic acid at single nucleotide resolution.
  • the actual sequence of each fragment can then be used to determine the actual sequence of the larger sequence, for example, by alignment to a reference sequence or by de novo assembly methods.
  • the low resolution sequence representations from different features can be used directly to determine the actual sequence of the larger sequence, for example, using pattern matching methods.
  • low resolution sequence representation of a target nucleic acid can provide a scaffold on which to map other sequence representations of a target nucleic acid.
  • methylated cytosine residues may be identified in a target nucleic acid.
  • a target nucleic acid can be treated under conditions where cytosine residues are converted to uracil residues, but methylcytosine residues are protected, such as using bisulfite treatment of DNA.
  • a sequencing reagent for a flow step for limited extension may allow limited extension until a cytosine residue in the target nucleic acid is reached by the polymerase.
  • the sequencing reagent may contain a labelled GTP comprising a reversibly terminating moiety or, alternatively, the sequencing reagent may contain no GTP. At least one nucleotide may then be identified in at least one subsequent flow step, for example, by using a nucleotide having a distinguishable label. Thus a low resolution sequence representation of methylated cytosines in a target nucleic acid can be obtained. Additionally or alternatively, a sequencing reagent for a flow step for limited extension may allow limited extension until a uracil residue in the target nucleic acid is reached by the polymerase.
  • the sequencing reagent may contain a labelled ATP comprising a reversibly terminating moiety or, alternatively, the sequencing reagent may contain no ATP. At least one nucleotide may then be identified in at least one subsequent flow step, for example, by using a nucleotide having a distinguishable label. Thus a low resolution sequence representation of non-methylated cytosines in a target nucleic acid can be obtained.
  • a methylation profile of the sample can be obtained.
  • a first low resolution sequence representation can be obtained from a target nucleic acid that has been treated under conditions wherein cytosine residues are converted to uracil residues and a second low resolution sequence representation can be obtained from a sample of the target nucleic acid that has not been treated in this way.
  • the first low resolution sequence representation can be compared to the second low resolution sequence representation and differences in methylation status can be determined based on differences in the number of cytosines, uracils or both.
  • the use of unlabelled bases, and fewer fluorescent imaging cycles per strand, may allow longer reads to be obtained than are currently obtainable using SBS cycles where every base is labelled. Thus read lengths of greater than 1000 bases per read may be obtained if desired. Multiple reads can be obtained on each sample where the fluorescently labelled or omitted base is varied, such that aliquots of the same sample are analysed in different ways.
  • Such methods can include obtaining at least two different low resolution sequence representations of a target nucleic acid and combining the predicted representations.
  • One example can include obtaining low resolution sequence representations where the C's and gaps are determined, and a second run where the T's and gaps are determined.
  • the different reads can be done on different aliquots of the same sample, or may be carried out on the same aliquot in the form of paired reads.
  • the two low resolution sequence representations may be combined to provide a higher resolution sequence representation.
  • Such paired read technologies can use different bases in each of the two reads.
  • Paired-end sequencing methods can include preparing a target nucleic acid, and/or plurality of target nucleic acids, by fragmenting larger nucleic acid molecules and flanking the nucleic acid fragments with adaptors to allow sequencing reactions to be primed from each end of the adaptor-flanked molecules. In such instances different nucleotides can be used in each direction if desired.
  • a low resolution sequence can be compared to a reference sequence or a plurality of reference sequences, such as those obtained from an electronic database or a biological database.
  • a low resolution sequence can include the sequence of a target sequence.
  • a reference sequence can include a sequence representation of the target sequence.
  • a reference sequence can include the predicted sequence representation of a target sequence, where the sequence representation of a target sequence is obtained using methods described herein.
  • a sequence is analyzed by comparing the sequence to reference sequences, for example, reference nucleotide sequences.
  • Sequences can be compared utilizing a variety of methods. Examples of methods include utilizing a heuristic algorithm, such as a Basic Local Alignment Search Tool (BLAST) algorithm, a BLAST-like Alignment Tool (BLAT) algorithm, or a FASTA algorithm.
  • Some embodiments described herein include databases. Databases can be used in comparing the sequence with a population of database sequences. Databases can contain a population of reference sequences. The population can include a variety of types of reference sequences, for example, nucleotide sequences, polypeptide sequences, or mixtures thereof.
  • the barcode sequence can be compared to one or more reference sequences obtained from any source.
  • the barcode sequence can be compared to one or more sequences generated by sequencing nucleic acids from one or more reference organisms either prior to or in parallel with generating the low resolution sequence data.
  • a population of reference sequences can be indexed.
  • a database can be pre-indexed for use with the methods and compositions described herein. Indexing can improve the efficiency of accessing the sequences and/or attributes associated with such sequences in a database.
  • An index can be created from a population of database sequences using one or more characteristics of each sequence. Such characteristics can be intrinsic or extrinsic to a database sequence. Intrinsic characteristics can include the primary structure of a sequence, and secondary structure of a sequence. The secondary structure of a polypeptide sequence or a nucleic acid sequence can be determined by methods well known in the art, such as methods using predictive algorithms. Extrinsic characteristics can include a variety of traits, for example, the source of a sequence, and the function of a sequence.
  • the identity of the source of a target nucleic acid can be identified, or otherwise characterized, by one or a plurality of traits and such traits will vary with the application of the methods and systems described herein.
  • the source of a sequence can be identified by comparing the low resolution sequence to reference sequences.
  • Disclosed are kits for carrying out the method.
  • the kits may include one labelled, blocked nucleotide triphosphate and three unlabelled blocked nucleotide triphosphates and a nucleic acid polymerase.
  • the blocked nucleotide triphosphate may have a 3' azidomethyl group.
  • Figure 1 shows that using 100 base reads of all four bases, native 4 base sequencing aligned 96.7% of the 100000 reads to the reference human genome (HG, build 37). One base only reads with position context preserved gave alignments between 92.4 - 92.9% depending on the single base selected.
  • Bisulfite converted 4 base reads (so AUTG not ACTG) aligned to 95.3%.
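The combination of two low resolution sequence representations mentioned above (one run in which the C's and gaps are determined, and a second run in which the T's and gaps are determined) can be illustrated with a short sketch. It is only an illustration under stated assumptions: the function name `combine_tracks` is hypothetical, the two reads are assumed to cover exactly the same positions in register, and positions that are neither C nor T are written with the IUPAC code R (A or G).

```python
def combine_tracks(c_track: str, t_track: str) -> str:
    """Merge a 'C and gaps' read with a 'T and gaps' read of the same region.

    Positions called C or T keep their call; every other position can only
    be A or G, which is written here as the IUPAC code 'R' (an assumption
    made for this sketch, not a notation used in the source).
    """
    if len(c_track) != len(t_track):
        raise ValueError("reads must cover the same number of positions")
    merged = []
    for pos, (c, t) in enumerate(zip(c_track, t_track)):
        if c == "C" and t == "T":
            raise ValueError(f"conflicting calls at position {pos}")
        merged.append("C" if c == "C" else "T" if t == "T" else "R")
    return "".join(merged)


# The 'C and gaps' and 'T and gaps' example reads given in the Summary.
c_read = "DCDCDDCDDDDCDDDDCCDDDDCDDDDDDCCDCDDCDDDCDDC"
t_read = "VVVVVTVTTVVVVVTVVVTVVTVVVVVVVVVVVVVVTVVVTVV"
print(combine_tracks(c_read, t_read))
```

Each position in the merged read is narrowed from three candidate bases to two, while the C and T positions are fully resolved.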

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Chemical & Material Sciences (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Organic Chemistry (AREA)
  • Zoology (AREA)
  • Wood Science & Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Microbiology (AREA)
  • Immunology (AREA)
  • Biotechnology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Analytical Chemistry (AREA)
  • Physics & Mathematics (AREA)
  • Biochemistry (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Genetics & Genomics (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The inventors have developed a method to reduce the complexity of nucleic acid sequence data by recording the appearance of only one of the four nucleic acid bases, together with the size of the gaps between successive appearances of that base. The method can involve the use of repeated cycles of polymerase extension using one labelled, blocked nucleotide triphosphate and three unlabelled (dark) blocked nucleotide triphosphates, or cycles using one nucleotide triphosphate followed by three nucleotide triphosphates.

Description

Improved nucleic acid re-sequencing using a reduced number of identified bases
The present technology relates to nucleic acid re-sequencing. The detection of nucleic acid sequences present in a biological sample has been used, for example, as a method for identifying and classifying microorganisms, diagnosing infectious diseases, detecting and characterizing genetic abnormalities, identifying genetic changes associated with cancer, studying genetic susceptibility to disease, and measuring response to various types of treatment. A common technique for detecting specific nucleic acid sequences in a biological sample is nucleic acid sequencing of the entire nucleic acid content of the sample. Sequencing methodologies are currently in use which allow for the parallel processing of billions of nucleic acids in a single sequencing run. As such, the information generated from a single sequencing run can be enormous. There is a great need to reduce the cost of generating, storing and processing sequencing data.
DNA sequencing can be carried out using electrophoresis to separate different length fragments. Sanger sequencing involves the use of labelled, terminated ddNTP's which act to prevent strand extension once incorporated. The use of a mixture of labelled ddNTP's and conventional unlabelled dNTP's produces a ladder of terminated fragments, where the identity of the terminal base is known by the identity of the label. In such instances, each of the four nucleic acid bases can be separately labelled. Electrophoresis allows separation of the fragments of different lengths, and the identity of the labelled bases can be determined. If one of the four bases is labelled, then a ladder is produced showing labelled bands and gaps where the labels relate to a single nucleotide. However the size of the gaps can only be accurately determined by comparing with reads carried out with the other three labelled bases (i.e. you need four lanes on the same gel run at the same time). A series of bands in the single lane of a gel gives no reliable information relating to the number of nucleic acid bases in the gaps. The use of four lanes allows for the complete sequence to be determined as the identity of each base can be established.
US20060147935 describes certain methods for nucleic acid sequencing. The application is not specific regarding the details of how to implement any methods disclosed therein. The application requires the comparison of multiple runs in order to generate the sequencing read. For example paragraph [0002] describes that where a single stopping nucleotide is used, four runs using different stopping nucleotides are required. In contrast, the invention described herein generates sequencing data from a single run. Sequence reads are generated without having to perform multiple runs on the same sample. The methods of the invention do not require size based separation.
Summary
The methods of the invention are based on the ability to obtain useful information on the base sequence of a nucleic acid sample without having to generate a complete sequence read of all four nucleotide bases. Methods of the invention are carried out without size based separation. Methods of the invention are carried out without the need to compare multiple reads on the same sample in order to build the sequence.
Embodiments of the present invention relate to methods for obtaining nucleic acid sequence information where an original comparative sequence is already known. The inventors have developed a method to reduce the complexity of nucleic acid sequence data by recording the appearance of only a single one of the four nucleic acid bases, together with the accurate size (or length) of the gaps between successive appearances of that base. Such low resolution sequencing can be viewed as one base sequencing. Thus rather than identifying all four nucleotides in each sequence read, only a single base is identified, along with the number of bases in the gaps between the identified, or called, bases. Thus the complexity of nucleic acid sequencing data is reduced, whilst useful information can still be obtained by comparing with the known reference sequence. Reducing the number of detectable labels from four to one reduces the cost of sequencing reagents, and reduces the complexity of instrumentation required to determine sequences.
The method can involve the use of repeated cycles of polymerase extension using one labelled, blocked nucleotide triphosphate and three unlabelled (dark) blocked nucleotide triphosphates. The blocking groups are reversible, thereby allowing repetition of the cycles to continue extension of the same strands. Alternatively the method can use repeated alternating cycles using one nucleotide triphosphate followed by three nucleotide triphosphates. Alternatively the method can use a single labelled nucleotide and three unlabelled nucleotides, with the separation between the labels being recorded. The method can use nanopore sequencing. One embodiment of nanopore sequencing is to prepare samples where one of the four nucleotides is labelled with an identifiable label, and the gap size (or length) between adjacent labels is determined. As an example, sequencing reads of the invention are simplified thus:
GCACATCTTGACGGTACCTAATCAGAAAGCCACGGCTAACTAC
In cases where C and gaps are recorded, the sequence becomes:
DCDCDDCDDDDCDDDDCCDDDDCDDDDDDCCDCDDCDDDCDDC
where D is any of the three bases A or G or T/U (i.e. D is 'not C').
In cases where T and gaps are recorded, the sequence becomes:
VVVVVTVTTVVVVVTVVVTVVTVVVVVVVVVVVVVVTVVVTVV
where V is any of the three bases A or C or G (i.e. V is 'not T').
In cases where A and gaps are recorded, the sequence becomes:
BBABABBBBBABBBBABBBAABBABAAABBBABBBBBAABBAB
where B is any of the three bases C or G or T/U (i.e. B is 'not A').
In cases where G and gaps are recorded, the sequence becomes:
GHHHHHHHHGHHGGHHHHHHHHHHGHHHGHHHHGGHHHHHHHH
where H is any of the three bases A or C or T/U (i.e. H is 'not G').
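The four simplified reads above can be reproduced with a few lines of code. The following is a minimal sketch rather than part of the disclosed method; the function name `one_base_representation` and the lookup table `NOT_CODE` are introduced here purely for illustration.

```python
# Ambiguity letters used in the text for "any base except the called one".
NOT_CODE = {"A": "B", "C": "D", "G": "H", "T": "V", "U": "V"}


def one_base_representation(sequence: str, called: str) -> str:
    """Keep every occurrence of `called`; write all other bases as 'not called'."""
    called = called.upper()
    dark = NOT_CODE[called]
    return "".join(base if base == called else dark for base in sequence.upper())


example = "GCACATCTTGACGGTACCTAATCAGAAAGCCACGGCTAACTAC"
for base in "CTAG":
    print(base, one_base_representation(example, base))
```

Running the loop prints the C, T, A and G representations in the same order as they are listed in the text.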
The invention may be useful in the field of methylation analysis where the conversion of a subset of the C's to U's using bisulfite can be determined.
The invention is also useful in the rapid analysis of samples for the detection of pathogens. Such rapid analysis is useful in infection control and also quality control in manufacturing operations. The detection can be carried out to identify small or large scale genetic changes. The method can be used to identify copy number differences, large scale deletions, insertions or rearrangements. The invention can also be used in genetic fingerprinting or forensic applications. The invention can be used in HLA typing or tissue typing.
Some such methods include the steps of providing sequencing reagents to a target nucleic acid in the presence of a polymerase, the sequencing reagents including all four nucleotide monomers, each of which includes a reversibly terminating (blocking) moiety, wherein three of the nucleotide monomers are unlabelled (dark), and one is labelled. Thus each strand is extended by a single nucleotide, but on average three of the four strands are 'dark' on each extension cycle (i.e. three quarters of the strands incorporate a single dark nucleotide, the identity of which is not determined). The cycles are repeated, and each read is recorded as 'dark' or 'labelled' for each cycle. Thus the number of 'dark' cycles for each read determines the size of the gaps between 'labelled' cycles.
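The bookkeeping described in this paragraph, in which each cycle is recorded as 'dark' or 'labelled' and the dark cycles are counted, might be sketched as follows. This is an idealised illustration: it assumes exactly one incorporation per cycle on every strand and an error-free signal, and the names `cycle_signals` and `gap_lengths` are hypothetical.

```python
def cycle_signals(incorporated: str, labelled: str) -> list:
    """One entry per extension cycle: True if the incorporated base is the labelled one."""
    return [base == labelled for base in incorporated]


def gap_lengths(signals: list) -> list:
    """Count the dark cycles preceding each labelled cycle.

    The final entry is the number of dark cycles after the last labelled cycle.
    """
    gaps, dark = [], 0
    for lit in signals:
        if lit:
            gaps.append(dark)
            dark = 0
        else:
            dark += 1
    gaps.append(dark)
    return gaps


# Twelve cycles on a strand that incorporates G, C, A, C, A, T, C, T, T, G, A, C
# with only C labelled.
signals = cycle_signals("GCACATCTTGAC", "C")
print(gap_lengths(signals))
```

Because every cycle adds exactly one blocked nucleotide, the count of dark cycles is the gap length in bases.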
Alternative methods include the steps of providing sequencing reagents to a target nucleic acid in the presence of a polymerase, the sequencing reagents including repeated alternating cycles where a first cycle has one of the four nucleotide monomers, and a second cycle has the remaining three nucleotide monomers. A third cycle has the one nucleotide monomer of the first cycle, and a fourth cycle has the remaining three nucleotide monomers etc. The number of bases incorporated on the 'three monomer' cycle is recorded, and thus the size of the gaps determined. In such methods, the nucleotide monomers lack a reversibly terminating moiety.
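The alternating scheme can be simulated in a few lines. The sketch below is illustrative only. It assumes that extension runs to completion in every flow, that the number of nucleotides incorporated in a flow can be read out quantitatively, and that a run of the single monomer is incorporated within one 'single' flow because no terminating moiety is present. The function name `alternating_flows` is hypothetical.

```python
def alternating_flows(synthesised: str, single: str) -> list:
    """Return (flow, count) pairs for alternating 'single' and 'three' monomer flows.

    `synthesised` is the strand being built by the polymerase; `single` is the
    one monomer supplied in the first, third, fifth ... flow.
    """
    flows, i, use_single = [], 0, True
    while i < len(synthesised):
        start = i
        # Extend for as long as the next required base is available in this flow.
        while i < len(synthesised) and (synthesised[i] == single) == use_single:
            i += 1
        flows.append(("single" if use_single else "three", i - start))
        use_single = not use_single
    return flows


for flow, count in alternating_flows("GCACATCTTGAC", "C"):
    print(flow, count)
```

The counts recorded on the 'three' flows are the gap sizes, which for this strand are 1, 1, 2 and 4.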
Both examples of the above-described methods include removing unincorporated sequencing reagents between cycles.
Where the nucleotide monomers include a reversibly terminating moiety, further steps in the method may include removing the reversibly terminating moiety between consecutive sequencing cycles.
Where the nucleotide monomers are incorporated, further steps in the method may include detecting incorporation of the nucleotide monomer into said polynucleotide. In some embodiments of the above-described methods, the detecting includes detecting a label. In other embodiments of the above-described methods, the detecting includes detecting pyrophosphate. In some such embodiments, detecting pyrophosphate can include, but is not limited to, detecting a signal that is produced in the presence of, by the incorporation of or by the degradation of pyrophosphate. In other embodiments of the above-described methods, the detecting includes detecting protons released upon monomer incorporation. In other embodiments the detecting includes detecting the release of labelled pyrophosphate.
In some embodiments of the methods, the at least one nucleotide monomer includes a label. In some such methods, the label is selected from the group consisting of fluorescent moieties, chromophores, antigens, dyes, phosphorescent groups, radioactive materials, chemiluminescent moieties, scattering or fluorescent nanoparticles, Raman signal generating moieties, and electrochemical detection moieties. Some embodiments of the method, where the nucleotide monomer includes a label, may also include cleaving the label from the nucleotide monomer.
In certain embodiments of the methods, the first sequencing reagent is provided simultaneously to a plurality of target nucleic acids. In some such methods, the plurality of target nucleic acids includes target nucleic acids having different nucleotide sequences. In some embodiments of the above-described methods, the first sequencing reagent is provided in parallel to a plurality of target nucleic acids at separate features of an array.
In embodiments of the described methods, the polymerase includes a polymerase selected from the group consisting of a DNA polymerase, an RNA polymerase, a reverse transcriptase, and mixtures thereof. The polymerase may be a thermostable polymerase or a thermodegradable polymerase.
Methods for obtaining nucleic acid sequence information include the steps of (a) providing a first sequencing reagent to a target nucleic acid in the presence of a polymerase, the first sequencing reagent including a plurality of different nucleotide monomers, each including a reversibly terminating moiety, where three of the nucleotide monomers are unlabelled (dark), and one is labelled, (b) detecting the label, (c) removing the reversibly terminating moiety and the label, and (d) repeating steps (a), (b) and (c), whereby sequence information for at least a portion of the target nucleic acid is obtained.
Alternative methods for obtaining nucleic acid sequence information include the use of oligonucleotides (and ligases) rather than nucleotide monomers (and polymerases). Such methods are within the scope of the disclosure, and can be implemented as alternatives. Methods include the steps of (a) providing a first sequencing reagent to a target nucleic acid in the presence of a ligase, wherein the first sequencing reagent includes at least four oligonucleotides, wherein the oligonucleotides all include a reversibly terminating moiety and one of the four contains a label, (b) detecting the label, (c) removing the reversibly terminating moiety and the label, and (d) repeating steps (a), (b) and (c), whereby sequence information for at least a portion of the target nucleic acid is obtained. In the case of oligonucleotides, the terminating moiety can be at the end of the oligonucleotide, or partly within the strand such that the length is reduced by the removal of the terminating moiety.
Methods of the invention measure the length of dark regions interspersed with the nucleotide of interest. Thus reads of single bases are obtained with known separation. The dark regions are effectively degenerate with respect to the other three nucleotides.
Any of the four nucleotides can be identified in the reads. The single nucleotide being called may be cytidine (C), adenosine (A), guanosine (G), thymidine (T) or uridine (U). The single nucleotide being called may be cytidine (C). The single nucleotide being called may be thymidine (T) or uridine (U).
For each sequencing read, the nucleotide being analysed may appear at least twice in the read. The single nucleotide may appear at least 10 times in the sequencing read. The single nucleotide may appear at least 20 times in the sequencing read. The single nucleotide may appear at least 30 times in the sequencing read.
The nucleotide reads may span at least 50 nucleotides. The nucleotide reads may span at least 100 nucleotides. Reads in this context refer to the total length of sequence determined (i.e. the total of N plus X bases where N is the bases in the gap and X is the single base being called).
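As a small worked example of the N plus X definition, the following sketch counts the called bases (X) and gap bases (N) in a one-base read. The function name `read_summary` and the dictionary keys are assumptions made for illustration.

```python
def read_summary(read: str, called: str) -> dict:
    """Summarise a one-base read: X called bases, N gap bases, span = N + X."""
    x = read.count(called)   # X: occurrences of the single base being called
    n = len(read) - x        # N: bases lying in the gaps
    return {"called": x, "gap_bases": n, "span": n + x}


# The 'C and gaps' example read from the Summary.
c_read = "DCDCDDCDDDDCDDDDCCDDDDCDDDDDDCCDCDDCDDDCDDC"
print(read_summary(c_read, "C"))
```

For this read the called base appears 13 times within a span of 43 bases.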
Methods include a method for analysing cytosine methylation in a nucleotide sample, the method comprising;
a) treating a first portion of the sample with bisulfite;
b) obtaining sequencing reads on said bisulfite treated first portion relating to the position of T or U nucleotides and the number of nucleotides in the gaps between one T or U and the next T or U nucleotide;
c) obtaining sequencing reads relating to the position of T or U nucleotides and the number of nucleotides in the gaps between one T or U and the next T or U nucleotide on a second portion of the nucleotide sample which has not undergone bisulfite treatment; and
d) comparing the two sets of sequencing reads to determine the extent of cytosine methylation.
Methods include a method for analysing cytosine methylation in a nucleotide sample, the method comprising;
a) treating a first portion of the sample with bisulfite;
b) obtaining sequencing reads on said bisulfite treated first portion relating to the position of C nucleotides and the number of nucleotides in the gaps between one C and the next C nucleotide;
c) obtaining sequencing reads relating to the position of C nucleotides and the number of nucleotides in the gaps between one C and the next C nucleotide on a second portion of the nucleotide sample which has not undergone bisulfite treatment; and
d) comparing the two sets of sequencing reads to determine the extent of cytosine methylation.
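The comparison in step d) can be illustrated for the 'C and gaps' variant. This is a simplified sketch and not the claimed method: it assumes complete bisulfite conversion of unmethylated cytosines, and that the treated and untreated reads cover the same region of the same strand in register. The function name `methylation_calls` is hypothetical.

```python
def methylation_calls(untreated: str, treated: str, called: str = "C"):
    """Split the C positions of the untreated read into methylated and unmethylated.

    A C that is still read as C after bisulfite treatment is taken as methylated
    (protected); a C that is no longer read as C is taken as unmethylated.
    """
    if len(untreated) != len(treated):
        raise ValueError("reads must cover the same number of positions")
    methylated, unmethylated = [], []
    for pos, (u, t) in enumerate(zip(untreated, treated)):
        if u == called:
            (methylated if t == called else unmethylated).append(pos)
    return methylated, unmethylated


untreated = "DCDCDDCDDDDC"   # C at positions 1, 3, 6 and 11 (0-based)
treated = "DDDCDDDDDDDD"     # only the C at position 3 survives treatment
meth, unmeth = methylation_calls(untreated, treated)
print(meth, unmeth, len(meth) / (len(meth) + len(unmeth)))
```

In this toy example one of the four cytosines is retained after treatment, giving a methylated fraction of 0.25.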
The method may determine nucleic acid sequence data using repeated cycles of polymerase extension using one labelled, blocked nucleotide triphosphate and three unlabelled blocked nucleotide triphosphates. Each cycle has the four nucleotides present, but only one of the four is identifiable. In the case of methylation analysis, the labelled nucleotide may be adenosine (A), thereby measuring T or U, and the size of gaps therebetween.
The method may determine nucleic acid sequence data using repeated alternating cycles of polymerase extension comprising a first cycle of one nucleotide triphosphate and a second cycle of three nucleotide triphosphates, wherein on the cycles where three nucleotide triphosphates are added together, the number of incorporated nucleotides is determined. In the case of methylation analysis the first cycle may be carried out using dATP and the second cycle carried out using dCTP, dTTP and dGTP. Alternatively in the case of methylation analysis the first cycle may be carried out using dGTP and the second cycle carried out using dCTP, dTTP and dATP.
The process may determine a large number of reads in parallel. For example, the method may generate at least 1 thousand sequence reads in parallel. For example, the method may generate at least 1 million sequence reads in parallel. The sequencing may be carried out on a solid support. Disclosed are kits comprising one labelled, blocked nucleotide triphosphate and three unlabelled blocked nucleotide triphosphates and a nucleic acid polymerase. The blocking moiety may be a 3' azidomethyl (CH2N3) group.
Description of Figures
Figure 1 shows mapping efficiency of selected bases for unique matches for native (0%) and bisulfite converted genomes (98, 99, 100%). Sequence reads of only one base can therefore be mapped back against the reference genome providing the size of the gaps is accurately determined.
Detailed Description
The sequencing methods described herein produce a molecular signature that may be compared with other signatures and predicted signatures. In some embodiments, the signature need not provide a nucleotide sequence at single nucleotide resolution. Rather, the signature can provide a unique identification of a nucleic acid based on a low resolution sequence of the nucleic acid. The low resolution sequence can be, for example, degenerate with respect to the identity of the nucleotide type at one or more position in the nucleotide sequence of the nucleic acid. The sequence information that can be obtained using the methods described herein can be used in applications involved in genotyping, expression profiling, capturing alternative splicing, genome mapping, amplicon sequencing, methylation detection and metagenomics, as well as for applications involving detection of contaminant sequences, for example pathogen detection.
Embodiments of the present invention relate to methods for obtaining nucleic acid sequence information. The inventors have developed a method to reduce the complexity of nucleic acid sequence data by recording the appearance of only a single one of the four nucleic acid bases, and the size of the gaps between the appearances of the single base being measured. Thus, rather than identifying all four nucleotides in each sequence read, only a single base is identified, along with the number of bases in the gap between the identified, or called, bases. The complexity of nucleic acid sequencing data is thereby reduced, whilst useful information can still be obtained when compared against a reference sequence.
Unlike methods described in the prior art, the current method accurately records the exact number of bases in the gaps. The method does not rely on size separation, and does not rely on aligning multiple reads taken on different aliquots of the same sample in order to establish the size of the gaps. The length of the gaps is determined by accurately noting the number of unlabelled nucleotides between the labelled or identified nucleotide.
Advantages of the method include a reduced cost of instrumentation. The ability to detect only one colour simplifies the optics of the detection system, resulting in fewer lasers and filters. The speed of acquiring data will be four times faster, as only one base is being recorded per cycle. The reduced level of structural modification of the nucleic acids, along with the reduced level of imaging of each nucleic acid (which reduces the level of photo-induced nucleic acid damage), allows a longer read length to be obtained per read. The resultant reads are computationally easier to align, as the alignment is carried out using two symbols (X and N, where X is the specific base and N is any of the other three bases) rather than all four.
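As an illustration of the two-symbol alignment mentioned above, the following is a minimal sketch, assuming an exact-match search against a short reference held in memory. The function names `reduce_to_two_letters` and `find_matches` and the example sequences are hypothetical and are not taken from the disclosure; a practical aligner would also need to tolerate errors and index the reference.

```python
def reduce_to_two_letters(seq: str, called: str = "T") -> str:
    """Collapse a four-letter sequence to X (called base) and N (any other base)."""
    return "".join("X" if base == called else "N" for base in seq.upper())


def find_matches(read_xn: str, reference: str, called: str = "T") -> list[int]:
    """Return every 0-based offset where the reduced read matches the reduced reference."""
    ref_xn = reduce_to_two_letters(reference, called)
    hits = []
    start = ref_xn.find(read_xn)
    while start != -1:
        hits.append(start)
        start = ref_xn.find(read_xn, start + 1)  # allow overlapping matches
    return hits


# Hypothetical reference and read, for illustration only.
reference = "ACGTTAGCTAGGCTTACGATCGTA"
read = reduce_to_two_letters("GCTTACGA", "T")
print(read)                           # NNXXNNNN
print(find_matches(read, reference))  # [11]
```

A read is uniquely mapped in this toy model when `find_matches` returns exactly one offset.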
The method can involve the use of repeated cycles of polymerase extension using one labelled, blocked nucleotide triphosphate and three unlabelled (dark) blocked nucleotide triphosphates. Reducing the number of labels from four to one reduces the cost of sequencing reagents. Alternatively the method can use repeated alternating cycles using one nucleotide triphosphate followed by three nucleotide triphosphates.
Methods of the invention measure the length of dark regions interspersed with the nucleotide of interest. Thus reads of single bases are obtained with known separation. The dark regions are effectively degenerate with respect to the other three nucleotides. The length of the dark region can be determined, either from the length of the 'homopolymer' run obtained using three unlabelled nucleotides, or the number of cycles of reversibly blocked incorporation which are carried out between appearances of the labelled nucleotide monomer.
Included is a method of reducing the complexity of nucleic acid sequencing reads by recording the appearance of only a single one of the four nucleotide bases and the size of the gaps between the appearances of the single nucleotide base being recorded, wherein the identity of the intervening nucleotides in said gaps is not determined.
The term reducing the complexity refers to lowering the number of nucleotides whose identity is determined in an individual sequencing read. The complexity can be reduced if only one or two of the bases in a sequence are identified, and the other positions are left marked as N. N signifies that a nucleotide is present, but that its identity is not known.
The term 'N' is being used as an abbreviation for 'the other three nucleotides'. According to official (IUPAC) notation, in cases where C and gaps are recorded, the sequence becomes a string of Cs and Ds, where D is any of the three bases A or G or T/U (i.e. D is 'not C'). In cases where T and gaps are recorded, the sequence becomes a string of Ts and Vs, where V is any of the three bases A or C or G (i.e. V is 'not T'). In cases where A and gaps are recorded, the sequence becomes a string of As and Bs, where B is any of the three bases C or G or T/U (i.e. B is 'not A'). In cases where G and gaps are recorded, the sequence becomes a string of Gs and Hs, where H is any of the three bases A or C or T/U (i.e. H is 'not G').
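The notation above can be sketched in a few lines. This is an illustrative example only; the dictionary `NOT_CODE` and the function `to_degenerate` are hypothetical names, and the input is assumed to be a DNA string over A, C, G and T.

```python
# 'Not X' codes as listed in the paragraph above: the called base maps to
# the symbol standing for "any of the other three bases".
NOT_CODE = {"C": "D", "T": "V", "A": "B", "G": "H"}


def to_degenerate(seq: str, called: str) -> str:
    """Keep the called base and replace every other base with its 'not X' code."""
    called = called.upper()
    other = NOT_CODE[called]
    return "".join(base if base == called else other for base in seq.upper())


print(to_degenerate("ACGTTAGC", "T"))  # VVVTTVVV
print(to_degenerate("ACGTTAGC", "C"))  # DCDDDDDC
```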
For the purposes of the invention, the length of the unknown regions (gaps) needs to be accurately established (i.e. the number of unknown bases in the gaps must be determined). The sequence reads effectively become bases followed by gaps of known length before the next base. Any of the four bases can be measured, but for the purposes of examples below T is indicated. The T can alternatively be A, C or G and V can alternatively be B, D or H. Sequences thus become TVnTVnTVn etc, where V is any base other than T and n is the number of A, C and G bases in the gap.
The single nucleotide being called may be cytidine (C). The single nucleotide being called may be thymidine (T) or uridine (U). The application may be particularly useful in methylation analysis. Bisulfite treatment converts most of the C bases to U. Thus comparison of the C or U bases pre and post bisulfite treatment gives an accurate record of the methylation status without needing to determine the exact sequences of all four bases.
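The TVnTVn representation can be illustrated with a short sketch that stores only the gap lengths. The helper names `gap_lengths` and `expand` are hypothetical, and the convention of reporting a leading and a trailing gap (possibly of length zero) is an assumption made for this example.

```python
def gap_lengths(seq: str, called: str = "T") -> list[int]:
    """Lengths of the runs of uncalled bases before, between and after called bases."""
    return [len(run) for run in seq.upper().split(called)]


def expand(gaps: list[int], called: str = "T", other: str = "V") -> str:
    """Rebuild the degenerate string (e.g. VVVTTVVVT) from the gap lengths."""
    return called.join(other * n for n in gaps)


gaps = gap_lengths("ACGTTAGCT", "T")
print(gaps)          # [3, 0, 3, 0]
print(expand(gaps))  # VVVTTVVVT
```

A gap of zero between two called bases corresponds to adjacent Ts, so no information about the spacing is lost.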
For each sequencing read, the nucleotide being analysed may appear at least twice in the read. The single nucleotide may appear at least 10 times in the sequencing read. The single nucleotide may appear at least 20 times in the sequencing read. The single nucleotide may appear at least 30 times in the sequencing read. The read may be any length desired.
The nucleotide reads may span at least 50 nucleotides. The nucleotide reads may span at least 100 nucleotides. Due to the lower number of fluorescent reads per strand, and the lower level of strand modification, the read length may be longer than that obtained using cycles where every base has a modification in order to allow a label to be attached. In the case of 'one base' sequencing, only the one base being identified is modified; the remaining three 'dark' bases are all natural. Thus read lengths may be 1000 bases or greater. Reads in this context refer to the total length of sequence determined (i.e. the total of N plus X bases where N is the gap and X is the single base being called).
Examples are given for Illumina sequencing, and can be modified for other platforms as desired. Typical fragments analysed on the Illumina system may be 500-1000 base pairs of duplex DNA. The reads on said fragments may be for example 100-300 bases per read. In the case of paired end reads, the reads can be taken from both ends of the fragment. Thus in a single read of say 300 bases, approximately 75 bases of a single type would be determined if the sequence was an equal mix of all four bases. Thus the single nucleotide may appear approximately 75 times if the length of the read spanned 300 nucleotides. For systems where the read length is not constrained, the length of fragments to be analysed can be thousands of bases, and each base may appear hundreds of times in each of the reads.
Methods include a method for analysing cytosine methylation in a nucleotide sample, the method comprising:
a) treating a first portion of the sample with bisulfite;
b) obtaining sequencing reads on said bisulfite treated first portion relating to the position of T or U nucleotides and the number of nucleotides in the gaps between one T or U and the next T or U nucleotide;
c) obtaining sequencing reads relating to the position of T or U nucleotides and the number of nucleotides in the gaps between one T or U and the next T or U nucleotide on a second portion of the nucleotide sample which has not undergone bisulfite treatment; and
d) comparing the two sets of sequencing reads to determine the extent of cytosine methylation.
Methods include a method for analysing cytosine methylation in a nucleotide sample, the method comprising:
a) treating a first portion of the sample with bisulfite;
b) obtaining sequencing reads on said bisulfite treated first portion relating to the position of C nucleotides and the number of nucleotides in the gaps between one C and the next C nucleotide;
c) obtaining sequencing reads relating to the position of C nucleotides and the number of nucleotides in the gaps between one C and the next C nucleotide on a second portion of the nucleotide sample which has not undergone bisulfite treatment; and
d) comparing the two sets of sequencing reads to determine the extent of cytosine methylation.
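The comparison in step d) can be sketched in silico as follows. This is a toy model under stated assumptions: the treated and untreated reads are taken to be already aligned to the same coordinates, conversion of unprotected C is assumed to be complete, and the names `bisulfite_convert`, `c_positions` and `methylation_calls` are hypothetical. "Protected" here stands for any cytosine that survives bisulfite treatment as C.

```python
def bisulfite_convert(seq: str, protected: set[int]) -> str:
    """Convert every unprotected C to U; protected positions stay as C."""
    return "".join(
        "U" if base == "C" and i not in protected else base
        for i, base in enumerate(seq)
    )


def c_positions(seq: str) -> set[int]:
    """0-based positions at which C is called; other bases are not identified."""
    return {i for i, base in enumerate(seq) if base == "C"}


def methylation_calls(untreated: str, treated: str) -> dict[str, list[int]]:
    """Compare C positions before and after bisulfite treatment."""
    before, after = c_positions(untreated), c_positions(treated)
    return {
        "protected": sorted(before & after),    # C retained after treatment
        "unprotected": sorted(before - after),  # C lost (converted to U)
    }


untreated = "ACGTCCGATCGA"  # hypothetical fragment
treated = bisulfite_convert(untreated, protected={1, 9})
calls = methylation_calls(untreated, treated)
print(treated)  # ACGTUUGATCGA
print(calls)    # {'protected': [1, 9], 'unprotected': [4, 5]}
```

Only the C positions and the spacing between them are used in the comparison, which mirrors the single-base reads described in steps b) and c).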
In the reads described above, the identity of the bases between the T or U bases, or C bases, respectively, is not determined. The size, or length, in terms of the number of bases is recorded, but not the identity of the intervening bases. This is contrary to the prior art 'all base' sequencing, where the identity of all four nucleotides is recorded. In the methods described, there is no way of knowing the identity of three of the four bases, as they are supplied together in an unlabelled form which cannot be differentiated.
Bisulfite treatment converts both cytosine and 5-formylcytosine residues in a polynucleotide into uracil. Where any 5-carboxycytosine is present (as a product of the oxidation step), this 5-carboxycytosine is converted into uracil in the bisulfite treatment. Without wishing to be bound by theory, it is believed that the reaction of the 5-formylcytosine proceeds via loss of the formyl group to yield cytosine, followed by a subsequent deamination to give uracil. The 5-carboxycytosine is believed to yield the uracil through a sequence of decarboxylation and deamination steps. Bisulfite treatment may be performed under conditions that convert both cytosine and 5-formylcytosine or 5-carboxycytosine residues in a polynucleotide as described herein into uracil.
A portion of the population may be treated with bisulfite by incubation with bisulfite ions (HSO3⁻). The use of bisulfite ions (HSO3⁻) to convert unmethylated cytosines in nucleic acids into uracil is standard in the art and suitable reagents and conditions are well known to the skilled person. Numerous suitable protocols and reagents are also commercially available (for example, EpiTect™, Qiagen NL; EZ DNA Methylation™, Zymo Research Corp CA; CpGenome Turbo Bisulfite Modification Kit, Millipore).
The method may include additional steps. The method may include the step(s) of oxidising and/or reducing the sample prior to bisulfite treatment. Bisulfite sequencing allows 5-methylcytosine to be distinguished from the unmethylated cytosine. In addition to 5-methylcytosine, other cytosine modifications including 5-hydroxymethyl and 5-formyl have been identified. In order to differentiate between these different cytosine modifications, techniques involving oxidation and/or reduction of the samples prior to bisulfite sequencing have been developed. In order to extract value from bisulfite sequencing, the sequencing output must be compared with a sample which has not undergone bisulfite treatment.
Both 5-formylcytosine (5fC) and cytosine (C) are converted to uracil upon bisulfite treatment. Reduction of the formyl group to hydroxymethyl C (hmC) prior to bisulfite treatment allows C and 5fC to be identified. 5-Methylcytosine (5mC) and 5-hydroxymethylcytosine (5hmC) are not converted to uracil by bisulfite. Oxidation of the hydroxymethyl group to a formyl group allows the two to be differentiated. A summary of the relevant transformations is shown below:
[Figure: summary of the relevant transformations]
The structures of the bases are shown below:
[Figure: structures of cytosine (C), 5-methylcytosine (5mC), 5-hydroxymethylcytosine (5hmC) and 5-formylcytosine (5fC)]
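The conversions described above (bisulfite alone, oxidation followed by bisulfite, and reduction followed by bisulfite) can be summarised as a small lookup. This is a sketch that encodes only the behaviour stated in the text; the treatment labels "BS", "oxBS" and "redBS" and the function names are shorthand introduced for this example.

```python
MODS = ("C", "5mC", "5hmC", "5fC")
TREATMENTS = ("BS", "oxBS", "redBS")


def reads_as_c(mod: str, treatment: str) -> bool:
    """True if the base is still read as C after the given treatment."""
    if treatment == "oxBS" and mod == "5hmC":
        mod = "5fC"   # oxidation: hydroxymethyl group -> formyl group
    if treatment == "redBS" and mod == "5fC":
        mod = "5hmC"  # reduction: formyl group -> hydroxymethyl group
    # Bisulfite converts C and 5fC to uracil; 5mC and 5hmC are not converted.
    return mod in ("5mC", "5hmC")


def infer_modification(bs: bool, ox_bs: bool, red_bs: bool):
    """Return the modification consistent with the three readouts, or None."""
    for mod in MODS:
        if tuple(reads_as_c(mod, t) for t in TREATMENTS) == (bs, ox_bs, red_bs):
            return mod
    return None


for mod in MODS:
    print(mod, [reads_as_c(mod, t) for t in TREATMENTS])
print(infer_modification(True, False, True))  # 5hmC
```

Each of the four cytosine forms gives a distinct pattern across the three treatments, which is why the combination allows them to be told apart.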
In some embodiments, a portion of the nucleic acid sample may be oxidised using an oxidising agent. The oxidising agent may be a non-enzymatic oxidising agent, for example, an organic or inorganic chemical compound. Suitable oxidising agents are well known in the art and include metal oxides, such as KRuO4, MnO2 and KMnO4. Particularly useful oxidising agents are those that may be used in aqueous conditions, which are most convenient for the handling of the polynucleotide. However, oxidising agents that are suitable for use in organic solvents may also be employed where practicable.
In some embodiments, the oxidising agent may comprise a perruthenate anion (RuO4⁻). Suitable perruthenate oxidising agents include organic and inorganic perruthenate salts, such as potassium perruthenate (KRuO4) and other metal perruthenates; tetraalkyl ammonium perruthenates, such as tetrapropylammonium perruthenate (TPAP) and tetrabutylammonium perruthenate (TBAP); polymer-supported perruthenate (PSP) and tetraphenylphosphonium ruthenate. The oxidising agent may be a metal (VI) oxo complex. The oxidising agent may be manganate (Mn(VI)O4²⁻), ferrate (Fe(VI)O4²⁻), osmate (Os(VI)O4²⁻), ruthenate (Ru(VI)O4²⁻), or molybdate (Mo(VI)O4²⁻).
The portions of polynucleotides may be reduced by treatment with a reducing agent. The reducing agent is any agent suitable for generating an alcohol from an aldehyde. The reducing agent or the conditions employed in the reduction step may be selected so that any 5-formylcytosine is selectively reduced (i.e. the reducing agent or reduction conditions are selective for 5-formylcytosine). Thus, substantially no other functionality in the polynucleotide is reduced in the reduction step. The reducing agent or conditions are selected to minimise or prevent any degradation of the polynucleotide. Suitable reducing agents are well-known in the art and include NaBH4, NaCNBH3 and LiBH4. Particularly useful reducing agents are those that may be used in aqueous conditions, as such are most convenient for the handling of the polynucleotide. However, reducing agents that are suitable for use in organic solvents may also be employed where practicable.
Following oxidation and reduction respectively, the oxidised and reduced portions of the population are treated with bisulfite. A further portion of the population which has not been oxidised or reduced is also treated with bisulfite. The bisulfite treatment can be done separately on the three samples, or, if tagged primers are used, the samples can be pooled so that the reduced, oxidised and untreated samples are all exposed to bisulfite in the same reaction.
The method may determine nucleic acid sequence data using repeated cycles of polymerase extension using one labelled, blocked nucleotide triphosphate and three unlabelled blocked nucleotide triphosphates. In the case of methylation analysis, the labelled nucleotide may be adenosine (A), thereby measuring T or U, and the size of gaps therebetween. In the case of methylation analysis, the labelled nucleotide may be guanosine (G), thereby measuring the remaining C bases and the size of gaps therebetween. In other applications, the labelled nucleotide may be any of the four nucleotides. The labelled nucleotide may be A, G, C or T.
The method may determine nucleic acid sequence data using repeated alternating cycles of polymerase extension comprising a first cycle of one nucleotide triphosphate and a second cycle of three nucleotide triphosphates, wherein on the cycles where three nucleotide triphosphates are added together, the number of incorporated nucleotides is determined. In the case of methylation analysis the first cycle may be carried out using dATP and the second cycle carried out using dCTP, dTTP and dGTP. In the case of methylation analysis the first cycle may be carried out using dGTP and the second cycle carried out using dCTP, dTTP and dATP. In other applications, the individual nucleotide may be any of the four nucleotides. The individual nucleotide may be A, G, C or T. In certain embodiments, the first cycle may be carried out using dATP and the second cycle carried out using dCTP, dTTP and dGTP. In certain embodiments, the first cycle may be carried out using dCTP and the second cycle carried out using dATP, dTTP and dGTP. In certain embodiments, the first cycle may be carried out using dGTP and the second cycle carried out using dCTP, dTTP and dATP. In certain embodiments, the first cycle may be carried out using dTTP and the second cycle carried out using dCTP, dATP and dGTP.
The process may determine a large number of reads in parallel. For example, the method may generate at least 1 thousand sequence reads in parallel. For example, the method may generate at least 1 million sequence reads in parallel. The sequencing may be carried out on a solid support.
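The alternating-cycle scheme can be illustrated with a toy simulation. This sketch makes several assumptions that go beyond the text: the nucleotides are treated as unblocked, so a cycle extends through a whole run of matching template bases; incorporation is error-free; the template string is written in the order in which the polymerase encounters it; and the number of incorporations is counted in both cycle types. The function name `alternating_cycles` is hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def alternating_cycles(template: str, single: str = "A") -> list[tuple[int, int]]:
    """Simulate rounds of (one nucleotide, then the other three nucleotides).

    Returns one (n_single, n_other) pair per round, giving the number of
    nucleotides incorporated in each of the two cycles of that round.
    """
    target = COMPLEMENT[single]  # template base that pairs with the single nucleotide
    pos, rounds = 0, []
    while pos < len(template):
        n_single = 0
        while pos < len(template) and template[pos] == target:
            n_single += 1  # first cycle: only the single nucleotide is available
            pos += 1
        n_other = 0
        while pos < len(template) and template[pos] != target:
            n_other += 1   # second cycle: extension stops at the next target base
            pos += 1
        rounds.append((n_single, n_other))
    return rounds


# With dATP as the single nucleotide, template T bases are being located.
print(alternating_cycles("TTGCATGGT", single="A"))  # [(2, 3), (1, 2), (1, 0)]
```

The second number in each pair is the gap length, and the identities of the bases within the gap are never recorded.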
Aspects of the present invention relate to methods for obtaining nucleic acid sequence information of a target nucleic acid. The methods described herein relate to obtaining a molecular signature of a target nucleic acid, where the molecular signature includes a low resolution (less than all four nucleotides) representation of the target nucleic acid sequence. Some embodiments of these methods can be employed with nucleotide monomers while others utilize oligonucleotides. When oligonucleotides are used, one or more of the oligonucleotides can include a reversibly terminating moiety. In embodiments where nucleotide monomers are used, one or more of the nucleotide monomers can include a reversibly terminating moiety.
Methods for obtaining nucleic acid sequence information include the steps of (a) providing a first sequencing reagent to a target nucleic acid in the presence of a polymerase, the first sequencing reagent including a plurality of different nucleotide monomers, each including a reversibly terminating moiety, where three of the nucleotide monomers are unlabelled (dark), and one is labelled, (b) detecting the label, (c) removing the reversibly terminating moiety and the label, and (d) repeating steps (a), (b) and (c), whereby sequence information for at least a portion of the target nucleic acid is obtained.
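Steps (a) to (d) can be sketched as a simple simulation in which every cycle incorporates exactly one reversibly terminated nucleotide and a signal is seen only when that nucleotide is the labelled one. This is a minimal model assuming error-free incorporation and complete deblocking; the names `run_cycles` and `signals_to_gaps` are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def run_cycles(template: str, labelled: str = "A") -> list[bool]:
    """One entry per cycle: True if the incorporated nucleotide carried the label."""
    return [COMPLEMENT[base] == labelled for base in template]


def signals_to_gaps(signals: list[bool]) -> tuple[list[int], int]:
    """Count the dark cycles before each signal, plus any trailing dark cycles."""
    gaps, dark = [], 0
    for signal in signals:
        if signal:
            gaps.append(dark)
            dark = 0
        else:
            dark += 1
    return gaps, dark


# Labelled A reports each T in the template (hypothetical template shown).
signals = run_cycles("GCTTAGCT", labelled="A")
print(signals)                   # [False, False, True, True, False, False, False, True]
print(signals_to_gaps(signals))  # ([2, 0, 3], 0)
```

Because every nucleotide is blocked, the gap length is simply the number of cycles between successive signals.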
Alternative methods for obtaining nucleic acid sequence information include the use of oligonucleotides (and ligases) rather than nucleotide monomers (and polymerases). Such methods are within the scope of the disclosure, and can be implemented as alternatives. Methods include the steps of (a) providing a first sequencing reagent to a target nucleic acid in the presence of a ligase, wherein the first sequencing reagent includes at least four oligonucleotides, wherein the oligonucleotides all include a reversibly terminating moiety and one of the four contains a label, (b) detecting the label, (c) removing the reversibly terminating moiety and the label, and (d) repeating steps (a), (b) and (c), whereby sequence information for at least a portion of the target nucleic acid is obtained. In the case of oligonucleotides, the terminating moiety can be at the end of the oligonucleotide, or partly within the strand such that the oligonucleotide length is reduced by the removal of the terminating moiety.
As used herein, "oligonucleotide" and/or "nucleic acid" can refer to at least two nucleotide monomers linked together. A nucleic acid can generally contain phosphodiester bonds, however, in some embodiments, nucleic acid analogs may have other types of backbones, comprising, for example, phosphoramide, phosphorothioate, phosphorodithioate, O-methylphosphoroamidite and peptide nucleic acid backbones and linkages.
Modifications of the nucleotides or oligonucleotides may be done to facilitate the addition of additional moieties such as labels, or to increase the stability of such molecules under certain conditions. In addition, mixtures of naturally occurring nucleic acids and analogs can be made. Alternatively, mixtures of different nucleic acid analogs, and mixtures of naturally occurring nucleic acids and analogs may be made. The nucleic acids may be single stranded or double stranded, as specified, or contain portions of both double stranded and single stranded sequence. The nucleic acid may be DNA, for example, genomic or cDNA, RNA or a hybrid. A nucleic acid can contain any combination of deoxyribo- and ribo-nucleotides, and any combination of bases, including uracil, adenine, thymine, cytosine, guanine, inosine, xanthine, hypoxanthine, isocytosine, isoguanine, and base analogs such as nitropyrrole (including 3-nitropyrrole) and nitroindole (including 5-nitroindole), etc.
As used herein, "nucleotide monomer" can refer to a nucleotide or nucleotide analog that can become incorporated into a primer polynucleotide. In the methods described herein, the nucleotide monomers are separate non-linked nucleotides. That is, the nucleotide monomers are not present as dimers, trimers, etc. Such nucleotide monomers may be substrates for an enzyme that may extend a polynucleotide strand. Such nucleotide monomers may be nucleotide 5'-triphosphates (rNTP's, dNTP's or blocked NTP's). Nucleotide monomers may or may not become incorporated into a nascent polynucleotide in a flow step depending on the sequence of the complementary polynucleotide. Nucleotide monomers may or may not contain label moieties and/or terminator moieties. Terminator moieties include reversibly terminating moieties. Incorporation of a nucleotide monomer comprising reversibly terminating moieties can inhibit extension of a primer polynucleotide, however, the moiety can be removed and the primer polynucleotide may be extended further once the block is removed. Such reversibly terminating moieties are well known in the art, and include for example an allyl, methoxymethyl or azidomethyl group. Examples of nucleotide monomers include deoxyribonucleotides, modified deoxyribonucleotides, ribonucleotides, modified ribonucleotides, peptide nucleotides, modified peptide nucleotides, modified phosphate sugar backbone nucleotides and mixtures thereof. Nucleotide analogs which include a modified nucleobase can also be used in the methods described herein. A nucleotide monomer may comprise a label moiety and/or a terminator moiety.
As used herein, "sequencing reagent" can refer to a composition, such as a solution, comprising one or more precursors of a polymer such as nucleotide monomers. In some embodiments, a sequencing reagent includes one or more nucleotide monomers having a label moiety, a terminator moiety, or both. Such moieties are chemical groups that are not naturally occurring moieties of nucleic acids, being introduced by synthetic means to alter the natural characteristics of the nucleotide monomers with regard to detectability under particular conditions or enzymatic reactivity under particular conditions. Alternatively, a sequencing reagent comprises one or more nucleotide monomers that lack a label moiety and/or a terminator moiety. In some embodiments, the sequencing reagent consists of or consists essentially of one nucleotide monomer type, two different nucleotide monomer types, three different nucleotide monomer types or four different nucleotide monomer types. "Different" nucleotide monomer types are nucleotide monomers that have different base moieties. Two or more nucleotide monomer types can have other moieties, such as those set forth above, that are the same as each other or different from each other. In some embodiments, a sequencing reagent may comprise an oligonucleotide that may be incorporated into a polymer. The oligonucleotide may comprise a terminator moiety and/or a label moiety.
For ease of illustration, various methods and compositions are described herein with respect to multiple nucleotide monomers. It will be understood that the multiple nucleotide monomers of these methods or compositions can be of the same or different types unless explicitly indicated otherwise. It should be understood that when providing a sequencing reagent comprising multiple nucleotide monomers to a target nucleic acid, the nucleotide monomers do not necessarily have to be provided at the same time. For example, two nucleotide monomers can be delivered, either together or separately, to a target nucleic acid. In such embodiments, a sequencing reagent comprising two nucleotide monomers will have been provided to the target nucleic acid. In some embodiments, zero, one or two of the nucleotide monomers will be incorporated into a polynucleotide that is complementary to the target nucleic acid.
As used herein, "complementary polynucleotides" includes polynucleotide strands that are not necessarily complementary to the full length of the target sequence. That is, a complementary polynucleotide can be complementary to only a portion of the target nucleic acid. Complementary polynucleotides can be produced by extending a primer polynucleotide using cycles of sequencing reagents. As more nucleotide monomers are incorporated into the complementary polynucleotide, the complementary polynucleotide becomes complementary to a greater portion of the target nucleic acid. Typically, the complementary portion is a contiguous portion of the target nucleic acid.
As used herein, "sequencing read" refers to a repetitive process of physical or chemical steps that is carried out to obtain signals indicative of the order of monomers in a polymer. The signals can be indicative of an order of monomers at single monomer resolution or lower resolution. In particular embodiments, the steps can be initiated on a nucleic acid target and carried out to obtain signals indicative of size of gaps between repeated appearances of the same base. The process can be carried out to its completion, which may be the point at which signals from the process can no longer distinguish bases of the target with a reasonable level of certainty. If desired, completion can occur earlier, for example, once a desired amount of sequence information has been obtained, for example a particular read length. A sequencing run can be carried out on a single target nucleic acid molecule or simultaneously on a population of target nucleic acid molecules having the same sequence, or simultaneously on a population of target nucleic acids having different sequences.
As used herein, "cycle" refers to the portion of a sequencing run that is repeated to indicate the presence of at least one monomer in a polymer. Typically, a cycle includes several steps such as steps for delivery of reagents, washing away unreacted reagents and detection of signals indicative of changes occurring in response to added reagents. For example, a cycle of a sequencing-by-synthesis (SBS) reaction can include delivery of a sequencing reagent that includes one or more type of nucleotide, washing to remove unreacted nucleotides, and detection to detect one or more nucleotides that are incorporated in an extended nucleic acid. In addition, "cycle" can refer to the portion of a sequencing run that is repeated to extend a polynucleotide complementary to a target nucleic acid. For example, a cycle can include several steps such as the delivery of first reagent, washing away unreacted agents, and delivery of a second reagent. Typically, such delivery steps can be for limited extension of a polynucleotide complementary to a target nucleic acid. In such embodiments, the polynucleotide strand may be extended in each delivery step by one or more nucleotides, depending on the number of monomers which are terminated. Where all four monomers are terminated, each cycle extends each polynucleotide by a single monomer.
As used herein, "low resolution", when used in reference to a sequence read, means providing less information on the order and type of monomers in a polymer than provided by a complete monomer resolution sequence representation of the same polymer. The term can refer to a resolution at which at least one type of monomeric unit in a polymer can be distinguished from at least a first other type of monomeric unit in the polymer, but cannot necessarily be distinguished from a second other type of monomeric unit in the polymer. For example, "low resolution" when used in reference to a sequence representation of a nucleic acid means that two or three possible nucleotide types can be indicated as candidate residents at any particular position in the sequence while the two or three nucleotide types cannot necessarily be distinguished from each other in any and all of the sequence representation or in a portion of the sequence representation. In particular embodiments, two different monomeric units from an actual polymer sequence can be assigned a common label (N) or identifier in a low resolution sequence representation. In some embodiments, three different monomeric units from an actual polymer sequence can be assigned a common label (N) or identifier in a low resolution sequence representation. Typically, the diversity of different characters in a low resolution sequence representation will be fewer than the diversity of different types of monomers in the polymer represented by the non low resolution sequence representation. For example, a low resolution representation of a nucleic acid can include a string of symbols and the number of different symbol types in the string can be less than the number of different nucleotide types in the actual sequence of the nucleic acid. In some examples, a low resolution sequence representation can include regions where the identity of monomeric units is unknown. 
For example, a sequence representation can include a sequence of distinguishable monomeric units interspersed with symbols representing regions of unknown content. The length of the region is known, just not the identity of each of the specific bases. As used herein the term "type," when used in reference to a monomer, nucleotide or other unit of a polymer, is intended to refer to the species of monomer, nucleotide or other unit. The type of monomer, nucleotide or other unit can be identified independent of their positions in the polymer. Similarly, when used in reference to a symbol or other identifier in a sequence representation, the term is intended to refer to the species of symbol or identifier and can be independent of their positions in the sequence representation. Exemplary types of nucleotide monomers are those having either adenine (A), cytosine (C), guanine (G), thymine (T), or uracil (U) bases. Among the nucleotide monomers having cytosine are included those that are methylated at the 5-position, such as 5-methyl cytosine or 5-hydroxymethyl cytosine, and those that are not methylated at the 5-position.
As used herein, "degenerate" means a location having more than one possible base, where the identity of the base is not identified at a particular location. When used in reference to a nucleic acid representation, the term refers to a position in the nucleic acid representation for which two or more nucleotide types are identified as candidate occupants in the corresponding position of the actual nucleic acid sequence. A degenerate position in a nucleic acid can have, for example, 2, 3 or 4 nucleotide types as candidate occupants. In particular embodiments, the number of different nucleotide types at a degenerate position in a sequence representation can be greater than one and less than three, namely, two. In other embodiments, the number of different nucleotide types at a degenerate position in a sequence representation can be greater than one and less than four, namely, two or three. In other embodiments, the number of different nucleotide types at a degenerate position in a sequence representation can be greater than two and less than four, namely, three. Typically, the number of different nucleotide types at a degenerate position in a sequence representation can be less than the number of different nucleotide types present in the actual nucleic acid sequence that is represented.
Methods for Limited Extension of a Polynucleotide
Lack of at Least One Nucleotide Monomer
Disclosed herein are methods that can be used for limited extension of a polynucleotide complementary to a target nucleic acid. In some embodiments, performing a limited extension can include delivering a sequencing reagent to a target nucleic acid in the presence of a polymerase, where the sequencing reagent lacks at least one type of nucleotide monomer that can base-pair with at least one nucleotide in a target nucleic acid. In some embodiments, the sequencing reagent may contain at least one type of nucleotide monomer, but no more than three types of nucleotide monomer.
In one example, a sequencing reagent containing three different nucleotide monomers (A, C, G) can be delivered to a target nucleic acid in the presence of polymerase. In this example, a polynucleotide complementary to the target nucleic acid may be extended until the polymerase reaches an 'A'; here extension will be limited because of the lack of 'T' in the sequencing reagent. Such embodiments may be referred to as dark extensions since one purpose of this process is to extend down a target nucleic acid without necessarily reading the sequence of the target nucleic acid. In such cases the number of A, C and G nucleotides incorporated can be determined, but the order of incorporation is not determined. The three base cycle can be followed by a cycle in which a single nucleotide monomer is incorporated.
In certain embodiments, the sequencing reagent can lack at least one, two, or three types of nucleotide monomer that may base-pair with at least one type of nucleotide in a target nucleic acid. In other embodiments, the sequencing reagent can lack at least one, two, or three different types of nucleotide monomer. It is also contemplated that in some embodiments, a sequencing reagent may contain a promiscuous nucleotide monomer such as a universal nucleotide monomer or semi-universal nucleotide monomer, that may base-pair with more than one type of nucleotide in a target nucleic acid. By "universal nucleotide monomer" is meant a nucleotide monomer that pairs with the entire complement of nucleotides present in the target nucleic acid. By "semi-universal nucleotide monomer" is meant a nucleotide monomer that pairs with more than one but fewer than the entire complement of nucleotides present in the target nucleic acid. In such embodiments, the sequencing reagent can lack at least one type of nucleotide that may base-pair with at least one nucleotide in the target. Thus rather than having three nucleotide monomers, fewer than three can be used to achieve the same length of base extension before the extension stops due to lack of the correct monomer.
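The three-monomer example can be pictured with a minimal simulation. The sketch assumes simple Watson-Crick pairing, a template written in the order in which the polymerase encounters its bases, and an idealised polymerase that never misincorporates; the function name `dark_extension` is hypothetical.

```python
from collections import Counter

# Watson-Crick partners for the four DNA bases.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def dark_extension(template, supplied):
    """Extend along `template` until a base needs a monomer not in `supplied`.

    Returns the monomers incorporated, in order of incorporation.
    """
    incorporated = []
    for base in template:
        monomer = COMPLEMENT[base]
        if monomer not in supplied:
            break  # extension halts: the required monomer is absent
        incorporated.append(monomer)
    return "".join(incorporated)


# Reagent with A, C and G but no T: extension stops at the first template 'A'.
extended = dark_extension("GCCTAGG", supplied={"A", "C", "G"})
print(extended)                           # CGGA
print(sorted(Counter(extended).items()))  # [('A', 1), ('C', 1), ('G', 2)]
```

The tally in the last line mirrors the point made above: the number of each monomer incorporated may be known even when the order is not read.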
Nucleotide Monomer with Terminating Moiety
Additional methods that can be used for limited extension of a polynucleotide complementary to a target nucleic acid include delivering a sequencing reagent to a target nucleic acid in the presence of a polymerase, where the sequencing reagent includes at least one type of nucleotide monomer comprising a terminating moiety. In some embodiments, the nucleotide monomer may base-pair with at least one nucleotide that may be present in a target nucleic acid. In preferred embodiments, the terminating moiety is reversibly terminating. In certain embodiments, the terminating moiety is a 3'-azidomethyl group.
In one example, a sequencing reagent containing A, C, G, T monomers is delivered to a target nucleic acid in the presence of polymerase. Each of the monomers contains a terminating moiety. One of the monomers is labelled. Thus on average, 25% of the reads will become labelled per cycle. The extension of a polynucleotide using nucleotides with terminating moieties can be repeated. In such embodiments, reversibly terminating moieties can be used to facilitate subsequent extensions. Furthermore, sequencing reagents in subsequent steps of limited extension can contain nucleotide monomers comprising terminating moieties that are the same or different.
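A minimal sketch of this scheme follows, assuming that every cycle incorporates exactly one reversibly terminated monomer and that only one monomer type carries a label. The names `single_label_read` and `labelled_fraction` and the `-` gap symbol are illustrative choices, not terms from this disclosure.

```python
# Watson-Crick partners for the four DNA bases.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def single_label_read(template, labelled, gap="-"):
    """One cycle per template base; a signal appears only when the
    incorporated monomer is the labelled type."""
    read = []
    for base in template:
        monomer = COMPLEMENT[base]
        read.append(monomer if monomer == labelled else gap)
    return "".join(read)


def labelled_fraction(templates, labelled, cycle):
    """Fraction of templates that give a signal at a 0-based cycle."""
    hits = sum(COMPLEMENT[t[cycle]] == labelled for t in templates)
    return hits / len(templates)


print(single_label_read("GATGGC", labelled="C"))  # C--CC-

# Four templates whose first bases are G, C, A and T: one in four is labelled.
templates = ["GATG", "CATG", "AATG", "TATG"]
print(labelled_fraction(templates, labelled="C", cycle=0))  # 0.25
```

The 25% figure holds in this toy case because the four template bases are equally represented at the cycle examined; a skewed base composition would shift it.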
In preferred embodiments, the reversibly terminating moiety of an incorporated nucleotide monomer can be removed prior to a subsequent extension step. In certain embodiments, unincorporated nucleotide monomers can be removed prior to delivering a subsequent sequencing reagent. In certain embodiments, the terminating moiety is a 3'-azidomethyl group.
The sequencing reagent can use a mixture of four blocked NTPs. The blocked NTPs can have a chemical block at the 3' position. The blocking moiety can be a small chemical moiety, for example an allyl, methoxymethyl or azidomethyl group. Once the extension has been carried out, the extended primers can be unblocked by removing the block at the 3' position to release a 3'-OH. In the case of the azidomethyl group, the release can be carried out using a phosphine reagent. In the case of the allyl group, the block can be removed using palladium and a phosphine.
Oligonucleotides with a Terminator Moiety
Additional methods that can be used for limited extension of a polynucleotide complementary to a target nucleic acid include delivering a sequencing reagent to a target nucleic acid in the presence of a ligase, where the sequencing reagent includes at least one oligonucleotide. In some embodiments, the oligonucleotide comprises a terminating moiety. In preferred embodiments, the terminating moiety is a reversibly terminating moiety. The oligonucleotide can be complementary to the target nucleic acid such that the oligonucleotide can be ligated to a polynucleotide complementary to at least a portion of the target nucleic acid, thus extending the polynucleotide complementary to the target nucleic acid. An oligonucleotide can comprise at least two linked nucleotide monomers. In some embodiments, the oligonucleotide can be at least a 2-mer, 3-mer, 4-mer, 5-mer, 6-mer, 7-mer, 8-mer, 9-mer, or 10-mer. In some embodiments, the length of the oligonucleotide may exceed 10 linked nucleotides. It will be appreciated that oligonucleotides of any length can be designed in order to facilitate accurate and/or rapid limited extensions. It will also be appreciated that the limited extensions can be dark extensions, however, as with the above examples of limited extension, there is no requirement that these limited extensions are dark extensions.
In certain embodiments, the sequencing reagent for limited extension can include a plurality of oligonucleotides. In some embodiments, the plurality of oligonucleotides can include different oligonucleotides. In particular embodiments, the plurality of oligonucleotides can include degenerate oligonucleotides or oligonucleotides comprising promiscuous bases. In preferred embodiments, the plurality of oligonucleotides includes at least one oligonucleotide that is complementary to the target nucleic acid such that the oligonucleotide can be ligated to a polynucleotide complementary to at least a portion of the target nucleic acid, thus extending the polynucleotide complementary to the target nucleic acid. A portion of the oligonucleotides corresponding to a single nucleotide to be identified can be labelled.
In one example, a sequencing reagent can be delivered to a target nucleic acid in the presence of ligase, where the sequencing reagent contains a plurality of oligonucleotides comprising reversibly terminating moieties. Some of the oligonucleotides may hybridize to various nucleotide sequences of the target nucleic acid, including a sequence where the hybridizing oligonucleotide can be ligated to a polynucleotide complementary to at least a portion of the target nucleic acid, thus extending the polynucleotide complementary to the target nucleic acid. However, the extension of the polynucleotide is limited because the reversibly terminating moiety of the ligated oligonucleotide can prevent further extension of the polynucleotide. The reversibly terminating moiety can be removed prior to a subsequent reagent delivery step.
The labels can be fluorescent labels. The fluorescent labels can be attached to the base of the nucleotide. The labels can be attached through cleavable linkers, which may be chemically cleavable. The labels may be attached through linkers which are cleavable under the same conditions as the removal of the reversible terminator moieties. In some methods for detecting the incorporation of nucleotide monomers, pyrophosphate released on incorporation of a nucleotide monomer into a polynucleotide complementary to at least a portion of the target nucleic acid can be detected using pyrosequencing techniques. Pyrosequencing detects the release of pyrophosphate as particular nucleotides are incorporated into a nascent polynucleotide.
In some methods for detecting the incorporation of nucleotide monomers, protons released on incorporation of a nucleotide monomer into a polynucleotide complementary to at least a portion of the target nucleic acid can be detected using ion sensor techniques.
A target nucleic acid can include any nucleic acid of interest. Target nucleic acids can include, but are not limited to, DNA, RNA, peptide nucleic acid, morpholino nucleic acid, locked nucleic acid, glycol nucleic acid, threose nucleic acid, mixtures thereof, and hybrids thereof. In a preferred embodiment, genomic DNA fragments or amplified copies thereof are used as the target nucleic acid.
Some embodiments can utilize a single target nucleic acid. Other embodiments can utilize a plurality of target nucleic acids. In such embodiments, a plurality of target nucleic acids can include a plurality of the same target nucleic acids, a plurality of different target nucleic acids where some target nucleic acids are the same, or a plurality of target nucleic acids where all target nucleic acids are different. Embodiments that utilize a plurality of target nucleic acids can be carried out in multiplex formats such that reagents are delivered simultaneously to the target nucleic acids, for example, in a single chamber or on an array surface. In preferred embodiments, target nucleic acids can be amplified as described in more detail herein. In some embodiments, the plurality of target nucleic acids can include substantially all of a particular organism's genome. The plurality of target nucleic acids can include at least a portion of a particular organism's genome including, for example, at least about 1%, 5%, 10%, 25%, 50%, 75%, 80%, 85%, 90%, 95%, or 99% of the genome. In particular embodiments the portion can have an upper limit that is at most about 1%, 5%, 10%, 25%, 50%, 75%, 80%, 85%, 90%, 95%, or 99% of the genome.
The population of target nucleic acid molecules may be a sample of DNA or RNA, for example a genomic DNA sample. Suitable DNA and RNA samples may be obtained or isolated from a sample of cells, for example, mammalian cells such as human cells or tissue samples, such as biopsies. In some embodiments, the sample may be obtained from a formalin-fixed paraffin-embedded (FFPE) tissue sample. Suitable cells include somatic and germ-line cells.
Target nucleic acids can be obtained from any source. For example, target nucleic acids may be prepared from nucleic acid molecules obtained from a single organism or from populations of nucleic acid molecules obtained from natural sources that include one or more organisms. Sources of nucleic acid molecules include, but are not limited to, organelles, cells, tissues, organs, or organisms. Cells that may be used as sources of target nucleic acid molecules may be prokaryotic (bacterial cells, for example, Escherichia, Bacillus, Serratia, Salmonella, Staphylococcus, Streptococcus, Clostridium, Chlamydia, Neisseria, Treponema, Mycoplasma, Borrelia, Legionella, Pseudomonas, Mycobacterium, Helicobacter, Erwinia, Agrobacterium, Rhizobium, and Streptomyces genera); archaeal, such as crenarchaeota, nanoarchaeota or euryarchaeota; or eukaryotic such as fungi (for example, yeasts), plants, protozoans and other parasites, and animals (including insects (for example, Drosophila spp.), nematodes (for example, Caenorhabditis elegans), and mammals (for example, rat, mouse, monkey, non-human primate and human)).
The population may be a diverse population of nucleic acid molecules, for example a library, such as a whole genome library or a locus-specific library.
Nucleic acid strands in the population may be amplified nucleic acid molecules, for example, amplified fragments of the same genetic locus or region from different samples.
Nucleic acid strands in the population may be enriched. For example, the population may be an enriched subset of a sample produced by pull-down onto a hybridisation array or digestion with a restriction enzyme.
The samples may be further processed, for example by amplification or sequencing. The samples may be copied using a nucleic acid polymerase. If adaptors are attached to both ends of the target fragments, the population of fragments can be amplified using a single pair of primers complementary to the adaptors. In order to further multiplex the readout, the nucleic acids from different sources can be separately tagged. The tags can thereby be used to help identify sequences from different sources. Thus the disclosure herein includes the use of two or more different populations of identifiable tags for the multiplexing of the analysis of different samples.
Sequencing Methods
The methods described herein can be used in conjunction with a variety of sequencing techniques. The sequencing may be carried out using a commercially available high throughput sequencing platform. Suitable sequencing platforms include Illumina TruSeq, LifeTech IonTorrent, Roche 454 and PacBio RS. The sequencing may be carried out on a solid support. In some embodiments, the process to determine the nucleotide sequence of a target nucleic acid can be an automated process.
Embodiments include sequencing by synthesis (SBS) techniques. SBS techniques generally involve the enzymatic extension of a nascent nucleic acid strand through the iterative addition of nucleotides against a template strand. In traditional methods of SBS, a single nucleotide monomer may be provided to a target nucleic acid in the presence of a polymerase in each delivery. However, in some of the methods described herein, more than one type of nucleotide monomer can be provided to a target nucleic acid in the presence of a polymerase in a delivery.
SBS can utilize nucleotide monomers that have a terminator moiety or those that lack any terminator moieties. Methods utilizing nucleotide monomers lacking terminators include, for example, pyrosequencing and sequencing using gamma-phosphate-labelled nucleotides. In methods using nucleotide monomers lacking terminators, the number of different nucleotides added in each cycle can be dependent upon the template sequence and the mode of nucleotide delivery. For SBS techniques that utilize nucleotide monomers having a terminator moiety, the terminator can be effectively irreversible under the sequencing conditions used as is the case for traditional Sanger sequencing which utilizes dideoxynucleotides, or the terminator can be reversible as is the case for sequencing methods developed by Solexa (now Illumina, Inc.). In preferred methods a terminator moiety can be reversibly terminating.
SBS techniques can utilize nucleotide monomers that have a label moiety or those that lack a label moiety. Accordingly, incorporation events can be detected based on a characteristic of the label, such as fluorescence of the label; a characteristic of the nucleotide monomer such as molecular weight or charge; a by-product of incorporation of the nucleotide, such as release of pyrophosphate; or the like. In embodiments where two or more different nucleotides are present in a sequencing reagent, the different nucleotides can be distinguishable from each other, or alternatively, the two or more different labels can be indistinguishable under the detection techniques being used. For example, the different nucleotides present in a sequencing reagent can have different labels and they can be distinguished using appropriate optics as exemplified by the sequencing methods developed by Solexa (now Illumina, Inc.). Alternatively one, two or three of the nucleotide monomers can be unlabelled. However, it is also possible to use the same label for the two or more different nucleotides present in a sequencing reagent or to use detection optics that do not necessarily distinguish the different labels.
Embodiments can utilize sequencing by ligation techniques. Such techniques utilize DNA ligase to incorporate oligonucleotides and identify the incorporation of such oligonucleotides.
Embodiments can include nanopore sequencing techniques. In such embodiments, the target nucleic acid passes through a nanopore. The nanopore can be a synthetic pore or biological membrane protein, such as alpha-hemolysin. As the target nucleic acid passes through the nanopore, each base-pair can be identified by measuring fluctuations in the electrical conductance of the pore. In nanopore sequencing, the bases can be naturally occurring, or can be modified. Modified strands can be prepared where one base is made to carry a larger moiety, and is thus easier to detect. The use of a nucleic acid where each A, C, G or T carries a label, is within the scope of the current disclosure. The gaps between the label can be determined in the absence of being able to identify each base in the sequence between the labels. In such a way, the requirements to identify every base can be eliminated, and the complexity of nanopore based sequencing methods reduced.
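The idea of recording only the spacing between labelled bases can be sketched as follows. The example assumes an idealised signal in which every labelled base is detected and the number of unlabelled bases between labels can be counted; the function name `gap_lengths` is hypothetical.

```python
def gap_lengths(strand, labelled):
    """Lengths of the unlabelled stretches before, between and after
    each occurrence of the labelled base type."""
    return [len(stretch) for stretch in strand.split(labelled)]


# With 'C' labelled, only the labels and the distances between them are kept.
print(gap_lengths("ACGTTCAAC", labelled="C"))  # [1, 3, 2, 0]
```

The resulting list describes the strand as one unidentified base, a label, three unidentified bases, a label, two unidentified bases and a final label, without naming any of the intervening bases.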
Embodiments can utilize methods involving the real-time monitoring of DNA polymerase activity. Nucleotide incorporations can be detected through fluorescence resonance energy transfer (FRET) interactions between a fluorophore-bearing polymerase and gamma-phosphate-labelled nucleotides, which can be detected with zero-mode waveguides. The illumination can be restricted to a zeptoliter-scale volume around a surface-tethered polymerase such that incorporation of fluorescently labelled nucleotides can be observed with low background. In one example, single molecule, real-time (SMRT) DNA sequencing technology provided by Pacific Biosciences Inc. can be utilized with the methods described herein. A SMRT chip comprises a plurality of zero-mode waveguides (ZMW). Each ZMW comprises a cylindrical hole tens of nanometers in diameter perforating a thin metal film supported by a transparent substrate. When the ZMW is illuminated through the transparent substrate, attenuated light may penetrate the lower 20-30 nm of each ZMW creating a detection volume of about 10⁻²¹ L. Smaller detection volumes increase the sensitivity of detecting fluorescent signals by reducing the amount of background that can be observed.
SMRT chips and similar technology can be used in association with nucleotide monomers fluorescently labelled on the terminal phosphate of the nucleotide. The label is cleaved from the nucleotide monomer on incorporation of the nucleotide into the polynucleotide. Accordingly, the label is not incorporated into the polynucleotide, increasing the signal:background ratio. Moreover, the need for conditions to cleave a label from labelled nucleotide monomers is reduced.
An additional example of a sequencing platform that can be used in association with the methods described herein is provided by Complete Genomics Inc. Libraries of target nucleic acids can be prepared where target nucleic acid sequences are interspersed approximately every 20 bp with adaptor sequences. The target nucleic acids can be amplified using rolling circle replication, and the amplified target nucleic acids can be used to prepare an array of target nucleic acids. Methods of sequencing such arrays include sequencing by ligation, in particular, sequencing by combinatorial probe-anchor ligation (cPAL).
The sequencing methods described herein can be advantageously carried out in multiplex formats such that multiple different target nucleic acids are manipulated simultaneously. In particular embodiments, different target nucleic acids can be treated in a common reaction vessel or on a surface of a particular substrate. This allows convenient delivery of sequencing reagents, removal of unreacted reagents and detection of incorporation events in a multiplex manner. In embodiments using surface-bound target nucleic acids, the target nucleic acids can be in an array format. In an array format, the target nucleic acids can be typically bound to a surface in a spatially distinguishable manner. The target nucleic acids can be bound by direct covalent attachment, attachment to a bead or other particle or binding to a polymerase or other molecule that is attached to the surface. The array can include a single copy of a target nucleic acid at each site (also referred to as a feature) or multiple copies having the same sequence can be present at each site or feature. Multiple copies can be produced by amplification methods such as bridge amplification or emulsion PCR.
It will be appreciated that any of the above-described sequencing processes can be incorporated into the methods described herein. For example, the methods can utilize sequencing reagents having mixtures of one or more nucleotide monomers or can otherwise be carried out under conditions where one or more nucleotide monomers contact a target nucleic acid in a single sequencing cycle. In addition, the methods can utilize sequencing reagents having mixtures of oligonucleotides and ligase. Furthermore, it will be appreciated that other known sequencing processes can be easily implemented for use with the methods and/or systems described herein.
Applications
Methods described herein are a useful tool in obtaining the molecular signature of a sequence, such as a DNA sequence. The sequence information that can be obtained using the methods described herein can be used in applications involved in genotyping, expression profiling, capturing alternative splicing, genome mapping, amplicon sequencing, methylation detection and metagenomics. The reads obtained can be compared to a reference sequence. The comparison with previously obtained data may be referred to as re-sequencing, as opposed to de-novo sequencing.
In one example, low resolution sequence representations can provide a signature for different nucleic acids in a sample. Accordingly, the actual sequence of a target nucleic acid need not be determined at single-nucleotide resolution and, instead, a low resolution sequence representation of the nucleic acid can be used. In some embodiments, the low resolution sequence representations comprise one or more positions where single nucleotide assignments are not made. The low resolution representations comprise one or more regions where no nucleotide assignment or a completely ambiguous nucleotide assignment is made interspersed by regions where at least one position is assigned with single base resolution. In some embodiments, these regions contain multiple consecutive positions (high resolution sequence islands) that are assigned with single base resolution. In some embodiments, the high resolution sequence island may contain one or more areas of sequence ambiguity, however, high resolution sequence islands are often preferred. A low resolution sequence representation can be used to determine the presence or absence of a target nucleic acid in a particular sample or to quantify the amount of the target nucleic acid. Exemplary applications include, but are not limited to, expression analysis, identification of organisms, or evaluation of structure for chromosomes, expressed RNAs or other nucleic acids.
In particular embodiments, low resolution sequence representations for one or more target mRNA molecules can be used to determine expression levels in one or more samples of interest. So long as the low resolution sequence representations are sufficiently indicative of the mRNA, the actual sequence need not be known at single nucleotide resolution. For example, if a low resolution sequence representation distinguishes a target mRNA from all other mRNA species expressed in a target sample and in a reference sample, then comparison of the low resolution sequence representations from both samples can be used to determine relative expression levels. Target nucleic acids used in expression methods can be obtained from any of a variety of different samples including, for example, cells, tissues or biological fluids from organisms such as those set forth above. Presence or absence, or even quantities of target nucleic acids can be determined for samples that have been treated with different chemical agents, physical manipulations, environmental conditions or the like. Alternatively or additionally, samples can be from organisms that are experiencing any of a variety of diseases, conditions, developmental states or the like. Typically, a reference sample and target sample will differ in regard to one or more of the above factors (for example, treatment, conditions, species origin, or cell type).
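As a rough illustration, relative expression can be estimated by counting how often a distinguishing low resolution signature appears among the reads of each sample. The sketch below assumes that the signature is unique to the target mRNA and that normalising by total read count is adequate; the function name `relative_expression` and the toy read strings are hypothetical.

```python
from collections import Counter


def relative_expression(target_reads, reference_reads, signature):
    """Ratio of the signature's read frequency in the target sample
    to its read frequency in the reference sample."""
    target_freq = Counter(target_reads)[signature] / len(target_reads)
    reference_freq = Counter(reference_reads)[signature] / len(reference_reads)
    if reference_freq == 0:
        raise ValueError("signature not observed in the reference sample")
    return target_freq / reference_freq


# Toy data: the signature makes up 8/16 target reads and 4/16 reference reads.
target = ["C--CC-"] * 8 + ["-C-C--"] * 8
reference = ["C--CC-"] * 4 + ["-C-C--"] * 12
print(relative_expression(target, reference, "C--CC-"))  # 2.0
```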
In particular embodiments, low resolution sequence representations for target nucleic acids obtained from a particular organism can be used to characterize or identify the organism. For example, a pathogenic organism can be identified in an environmental sample or in a clinical sample from an individual based on at least one low resolution sequence representation for a target nucleic acid from the sample. So long as the one or more low resolution sequence representations are sufficiently indicative of the organism, the actual sequence need not be known at single nucleotide resolution. For example, if a low resolution sequence representation distinguishes a pathogenic bacterial strain from other bacteria, then comparison of the low resolution sequence representations from the sample of interest to low resolution sequence representations from reference samples or from a database can be used to detect presence or absence of the pathogenic bacterial strain. In another example, a low resolution sequence representation of the 16S rRNA gene can be used to characterize and/or identify an organism. The 16S rRNA gene is highly conserved across species and contains highly conserved sequences that may be interspersed with variable sequences that may be species-specific. In some embodiments, a low resolution sequence representation of a 16S rRNA gene may identify a particular organism through the pattern of uniform and variable regions that may be obtained at low resolution. In other embodiments, the composition of particular sequencing reagents can be determined to obtain sequence information at low resolution in highly conserved regions of the 16S rRNA gene, and to obtain sequence information at a higher resolution in variable regions of the 16S rRNA gene.
In another example, the structure of a chromosome, RNA or other nucleic acid can be determined based on low resolution sequence representations. For example, if a low resolution sequence representation distinguishes a chromosomal region from other regions of a chromosome, then comparison of the low resolution sequence representations from a target sample and a reference sample for which the chromosome structure is known can be used to identify insertions, deletions or rearrangements in the target sample. Similarly, if a low resolution sequence representation distinguishes a target mRNA isoform (i.e. alternative splice product of a gene) from another mRNA isoform expression product of the same gene, then comparison of the low resolution sequence representations for both isoforms can be used to determine presence or absence of the target isoform. Target nucleic acids used to determine chromosome or RNA structure can be obtained from any of a variety of samples including, but not limited to those exemplified above.
In particular embodiments, low resolution sequence representations can be obtained for a plurality of target nucleic acids that are fragments of a larger nucleic acid such as a genome. In such embodiments, the sequence information for the individual fragments can be used to determine the actual sequence of the larger nucleic acid at single nucleotide resolution. For example, multiple low resolution sequence representations from each feature can be used to determine the actual sequence of each fragment target nucleic acid at single nucleotide resolution. The actual sequence of each fragment can then be used to determine the actual sequence of the larger sequence, for example, by alignment to a reference sequence or by de novo assembly methods. In an alternative embodiment, the low resolution sequence representations from different features can be used directly to determine the actual sequence of the larger sequence, for example, using pattern matching methods.
In more embodiments, low resolution sequence representation of a target nucleic acid can provide a scaffold on which to map other sequence representations of a target nucleic acid. In certain embodiments, methylated cytosine residues may be identified in a target nucleic acid. For example, a target nucleic acid can be treated under conditions where cytosine residues are converted to uracil residues, but methylcytosine residues are protected, such as using bisulfite treatment of DNA. In some embodiments, a sequencing reagent for a flow step for limited extension may allow limited extension until a cytosine residue in the target nucleic acid is reached by the polymerase. For example, the sequencing reagent may contain a labelled GTP comprising a reversibly terminating moiety or, alternatively, the sequencing reagent may contain no GTP. At least one nucleotide may then be identified in at least one subsequent flow step, for example, by using a nucleotide having a distinguishable label. Thus a low resolution sequence representation of methylated cytosines in a target nucleic acid can be obtained. Additionally or alternatively, a sequencing reagent for a flow step for limited extension may allow limited extension until a uracil residue in the target nucleic acid is reached by the polymerase. For example, the sequencing reagent may contain a labelled ATP comprising a reversibly terminating moiety or, alternatively, the sequencing reagent may contain no ATP. At least one nucleotide may then be identified in at least one subsequent flow step, for example, by using a nucleotide having a distinguishable label. Thus a low resolution sequence representation of non-methylated cytosines in a target nucleic acid can be obtained.
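The conversion step and the resulting stop points can be sketched as below. The model is deliberately simplified: conversion is assumed to be complete, methylated positions are supplied as known 0-based indices, and the function names `bisulfite_convert` and `stop_positions` are hypothetical.

```python
def bisulfite_convert(sequence, methylated):
    """Convert unmethylated C to U; C at an index in `methylated` is protected."""
    return "".join(
        "U" if base == "C" and index not in methylated else base
        for index, base in enumerate(sequence)
    )


def stop_positions(template, halting_bases):
    """Template indices at which extension would halt because the reagent
    lacks (or terminates with) the monomer pairing with those bases."""
    return [i for i, base in enumerate(template) if base in halting_bases]


converted = bisulfite_convert("ACGTCCGA", methylated={1})
print(converted)                        # ACGTUUGA

# Reagent without GTP: extension halts at the protected (methylated) C.
print(stop_positions(converted, "C"))   # [1]

# Reagent without ATP: extension halts wherever the template needs an A.
print(stop_positions(converted, "UT"))  # [3, 4, 5]
```

In this toy model a reagent lacking ATP halts at template T as well as at U, since both pair with A.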
By using a sequencing reagent containing a labelled ATP comprising a reversibly terminating moiety or, alternatively, a sequencing reagent containing no ATP, and comparing the results of bisulfite and non-bisulfite treated samples, a methylation profile of the sample can be obtained. In particular embodiments, a first low resolution sequence representation can be obtained from a target nucleic acid that has been treated under conditions wherein cytosine residues are converted to uracil residues and a second low resolution sequence representation can be obtained from a sample of the target nucleic acid that has not been treated in this way. The first low resolution sequence representation can be compared to the second low resolution sequence representation and differences in methylation status can be determined based on differences in the number of cytosines, uracils or both. The use of unlabelled bases and fewer fluorescent imaging cycles per strand may allow longer reads to be obtained than are currently obtainable using SBS cycles where every base is labelled. Thus read lengths of greater than 1000 bases per read may be obtained if desired. Multiple reads can be obtained on each sample where the fluorescently labelled or omitted base is varied, such that aliquots of the same sample are analysed in different ways. Such methods can include obtaining at least two different low resolution sequence representations of a target nucleic acid and combining the predicted representations. One example can include obtaining low resolution sequence representations where the C's and gaps are determined, and a second run where the T's and gaps are determined. The different reads can be done on different aliquots of the same sample, or may be carried out on the same aliquot in the form of paired reads. The two low resolution sequence representations may be combined to provide a higher resolution sequence representation.
Such paired read technologies can use different bases in each of the two reads.
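Combining two low resolution representations of the same region can be sketched as follows. This is an illustration only: it assumes both representations are already written in a two-letter code (B for the identified base, N for any other base), cover the same positions, and are in register. The function name `combine_representations` is hypothetical.

```python
def combine_representations(code_a, base_a, code_b, base_b):
    """Merge two B/N-coded reads of the same region into one partially resolved sequence.

    Positions identified in neither read remain 'N'.
    """
    if len(code_a) != len(code_b):
        raise ValueError("representations must cover the same positions")
    merged = []
    for a, b in zip(code_a, code_b):
        if a == "B" and b == "B":
            # One position cannot be two different bases at once.
            raise ValueError("conflicting base calls at the same position")
        merged.append(base_a if a == "B" else base_b if b == "B" else "N")
    return "".join(merged)


# C-only and T-only representations of the illustrative sequence ACGTCCGA.
c_only = "NBNNBBNN"
t_only = "NNNBNNNN"
higher_resolution = combine_representations(c_only, "C", t_only, "T")
```

Here the combined representation resolves both the C and the T positions, leaving only the A and G positions undetermined.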
In more embodiments, methods described herein can be applied to paired-end sequencing methods. Paired-end sequencing methods can include preparing a target nucleic acid, and/or a plurality of target nucleic acids, by fragmenting larger nucleic acid molecules and flanking the nucleic acid fragments with adaptors to allow sequencing reactions to be primed from each end of the adaptor-flanked molecules. In such instances different nucleotides can be used in each direction if desired.
In some embodiments, at least a portion of a low resolution sequence can be compared to a reference sequence or a plurality of reference sequences, such as those obtained from an electronic database or a biological database. In some embodiments, a low resolution sequence can include the sequence of a target sequence. In some embodiments, a reference sequence can include a sequence representation of the target sequence. For example, a reference sequence can include the predicted sequence representation of a target sequence, where the sequence representation of a target sequence is obtained using methods described herein.
In some embodiments, a sequence is analyzed by comparing the sequence to reference sequences, for example, reference nucleotide sequences. Sequences can be compared utilizing a variety of methods. Examples of methods include utilizing a heuristic algorithm, such as a Basic Local Alignment Search Tool (BLAST) algorithm, a BLAST-like Alignment Tool (BLAT) algorithm, or a FASTA algorithm. Some embodiments described herein include databases. Databases can be used in comparing the sequence with a population of database sequences. Databases can contain a population of reference sequences. The population can include a variety of types of reference sequences, for example, nucleotide sequences, polypeptide sequences, or mixtures thereof.
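As an illustration of comparing a one-base read with a reference, the sketch below encodes the reference in the same two-letter code as the read and reports every offset at which the read matches exactly. It is a simple exact-match stand-in, not an implementation of BLAST, BLAT or FASTA; the function names and the example sequences are hypothetical.

```python
def encode(seq, base):
    """Write `seq` in a two-letter code: B where `base` occurs, N elsewhere."""
    return "".join("B" if b == base else "N" for b in seq)


def find_matches(read_code, reference, base):
    """Return every 0-based offset where the coded read matches the coded reference."""
    ref_code = encode(reference, base)
    hits = []
    start = ref_code.find(read_code)
    while start != -1:
        hits.append(start)
        # Search again from the next offset so overlapping matches are found.
        start = ref_code.find(read_code, start + 1)
    return hits


reference = "GATTACAGATTACCA"
read_code = encode("TTACAG", "T")  # a T-only read taken from offset 2
hits = find_matches(read_code, reference, "T")
```

In this toy case the coded read matches at two offsets although the full four-base read occurs only once, which shows why short reduced-code reads are less specific and why longer reads help to place them uniquely.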
Although some of the analyses of the sequence are described in connection with database sequences, it will be appreciated that it is not necessary to compare the barcode sequence to a population of sequences in a database. In some embodiments, the barcode sequence can be compared to one or more reference sequences obtained from any source. For example, the barcode sequence can be compared to one or more sequences generated by sequencing nucleic acids from one or more reference organisms either prior to or in parallel with generating the low resolution sequence data.
In some embodiments, a population of reference sequences can be indexed. In preferred embodiments, a database can be pre-indexed for use with the methods and compositions described herein. Indexing can improve the efficiency of accessing the sequences and/or attributes associated with such sequences in a database. An index can be created from a population of database sequences using one or more characteristics of each sequence. Such characteristics can be intrinsic or extrinsic to a database sequence. Intrinsic characteristics can include the primary structure of a sequence, and secondary structure of a sequence. The secondary structure of a polypeptide sequence or a nucleic acid sequence can be determined by methods well known in the art, such as methods using predictive algorithms. Extrinsic characteristics can include a variety of traits, for example, the source of a sequence, and the function of a sequence.
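One way to pre-index a reference for use with one-base reads is sketched below. It assumes the intrinsic characteristic used for indexing is the primary structure written in the two-letter code, and it keys the index on fixed-length windows of that code. The function name `build_index` and the window length are hypothetical choices made for illustration.

```python
from collections import defaultdict


def build_index(reference, base, k):
    """Map every length-k window of the B/N-coded reference to its start offsets."""
    code = "".join("B" if b == base else "N" for b in reference)
    index = defaultdict(list)
    for i in range(len(code) - k + 1):
        index[code[i:i + k]].append(i)
    return index


# Illustrative reference indexed on the positions of C with windows of 4.
index = build_index("CACACGCA", base="C", k=4)
candidate_offsets = index["BNBN"]
```

A lookup in such an index returns candidate offsets in a single step, so a coded read only needs to be checked against those candidates rather than against every position in the reference.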
The identity of the source of a target nucleic acid can be identified, or otherwise characterized, by one or a plurality of traits and such traits will vary with the application of the methods and systems described herein. In one embodiment, the source of a sequence can be identified by comparing the low resolution sequence to reference sequences.
Kits
Disclosed are kits for carrying out the method. The kits may include one labelled, blocked nucleotide triphosphate and three unlabelled blocked nucleotide triphosphates and a nucleic acid polymerase. The blocked nucleotide triphosphate may have a 3' azidomethyl group.
Experimental data
Computer simulations of 100K reads (human) or 1M reads (Drosophila) show that the data generated using one base is representative of data generated using four bases. Simulated reads of 100 bases in length were compared using all four bases and using a two-letter code of B and N, where B is the one identified base and N is any of the other three bases. Such reads are termed 'one base only' reads.
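The two-letter code, and the equivalent description of a read as the gaps between appearances of the identified base, can be sketched as follows. The function names are hypothetical and the short read is illustrative only; it is not taken from the simulations.

```python
def to_two_letter_code(read, base):
    """Write a four-base read as B (the identified base) and N (any other base)."""
    return "".join("B" if b == base else "N" for b in read)


def to_gap_lengths(read, base):
    """Return the number of unidentified bases before, between and after each B."""
    code = to_two_letter_code(read, base)
    return [len(run) for run in code.split("B")]


read = "ACGTCCGA"
code = to_two_letter_code(read, "C")
gaps = to_gap_lengths(read, "C")
```

For a read containing n occurrences of the identified base, the gap list has n + 1 entries, and adjacent occurrences appear as a gap of zero.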
Figure 1 shows that using 100 base reads of all four bases, native 4 base sequencing aligned 96.7% of the 100000 reads to the reference human genome (HG, build 37). One base only reads with position context preserved gave alignments between 92.4 - 92.9% depending on the single base selected.
Bisulfite converted 4 base reads (so AUTG not ACTG) aligned to 95.3%.
Bisulfite converted T one base only reads (T only + positional context) aligned to 89.9%.
Thus only using a single base rather than four bases still allows alignment within the genome, and still allows useful output from the sequencing data despite the reduced complexity of the data being generated and stored.
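The comparison of T-only reads from bisulfite treated and untreated portions of a sample can be sketched as below. The sketch assumes both reads are written in the B/N code, cover the same positions of the same strand, and that T and U are both recorded as B because each pairs with the labelled A. The function name `converted_positions` is hypothetical.

```python
def converted_positions(untreated_code, treated_code):
    """Return positions that read as N before treatment and as B after treatment.

    Under the stated assumptions these positions correspond to cytosines
    that were converted to uracil, that is, cytosines that were not methylated.
    """
    if len(untreated_code) != len(treated_code):
        raise ValueError("reads must cover the same positions")
    return [
        i
        for i, (before, after) in enumerate(zip(untreated_code, treated_code))
        if before == "N" and after == "B"
    ]


# T-only codes for ACGTCCGA (untreated) and AUGTCUGA (bisulfite treated).
untreated = "NNNBNNNN"
treated = "NBNBNBNN"
unmethylated = converted_positions(untreated, treated)
```

Positions that remain N in both reads cannot be resolved from T-only data alone, since a protected methylcytosine looks the same as an A or a G in this code.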

Claims

1. A method of reducing the complexity of nucleic acid sequencing reads by recording the appearance of only a single one of the four nucleotide bases and the size of the gaps between the appearances of the single nucleotide base being recorded, wherein the identity of the intervening nucleotides in said gaps is not determined, wherein the method comprises repeated cycles of polymerase extension using one labelled, blocked nucleotide triphosphate and three unlabelled blocked nucleotide triphosphates in each cycle.
2. The method of claim 1 wherein the single nucleotide is cytidine (C).
3. The method of claim 1 wherein the single nucleotide is thymidine (T) or uridine (U).
4. The method of any preceding claim wherein the sample is bisulfite treated prior to sequencing.
5. The method of any preceding claim wherein the single nucleotide appears at least 20 times in the sequencing read.
6. The method of any preceding claim wherein the sequencing reads span at least 100 nucleotides.
7. The method of claim 1 for analysing cytosine methylation in a nucleotide sample, the method comprising;
a) treating a first portion of the sample with bisulfite;
b) obtaining sequencing reads on said bisulfite treated first portion relating to the position of T or U nucleotides and the number of nucleotides in the gaps between one T or U and the next T or U nucleotide;
c) obtaining sequencing reads relating to the position of T or U nucleotides and the number of nucleotides in the gaps between one T or U and the next T or U nucleotide on a second portion of the nucleotide sample which has not undergone bisulfite treatment; and
d) comparing the two sets of sequencing reads to determine the extent of cytosine methylation.
8. The method of claim 1 for analysing cytosine methylation in a nucleotide sample, the method comprising;
a) treating a first portion of the sample with bisulfite;
b) obtaining sequencing reads on said bisulfite treated first portion relating to the position of C nucleotides and the number of nucleotides in the gaps between one C and the next C nucleotide;
c) obtaining sequencing reads relating to the position of C nucleotides and the number of nucleotides in the gaps between one C and the next C nucleotide on a second portion of the nucleotide sample which has not undergone bisulfite treatment; and
d) comparing the two sets of sequencing reads to determine the extent of cytosine methylation.
9. The method of claim 1 wherein the labelled nucleotide is adenosine (A), thereby measuring T or U, and the size of gaps therebetween.
10. The method of claim 1 wherein the labelled nucleotide is guanosine (G), thereby measuring C, and the size of gaps therebetween.
11. The method of any one preceding claim, wherein the method comprises the steps of
(a) providing a first sequencing reagent to a target nucleic acid in the presence of a polymerase, the first sequencing reagent including a plurality of different nucleotide monomers, each including a reversibly terminating moiety, where three of the nucleotide monomers are unlabelled (dark), and one is labelled,
(b) detecting the label,
(c) removing the reversibly terminating moiety and the label, and
(d) repeating steps (a), (b) and (c), whereby sequence information for at least a portion of the target nucleic acid is obtained.
12. The method of any one preceding claim wherein at least 1 million sequence reads are obtained in parallel.
13. The method of claim 12 wherein the sequencing is carried out on a solid support.
14. A kit comprising one labelled, blocked nucleotide triphosphate and three unlabelled blocked nucleotide triphosphates and a nucleic acid polymerase.
15. A kit according to claim 14 wherein the blocked nucleotide triphosphate has a 3' azidomethyl group.
PCT/GB2015/053153 2014-10-21 2015-10-21 Improved nucleic acid re-sequencing using a reduced number of identified bases WO2016063059A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GBGB1418718.1A GB201418718D0 (en) 2014-10-21 2014-10-21 Improved nucleic acid re-sequencing using a reduced number of identified bases
GB1418718.1 2014-10-21

Publications (1)

Publication Number Publication Date
WO2016063059A1 true WO2016063059A1 (en) 2016-04-28

Family

ID=52013372

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB2015/053153 WO2016063059A1 (en) 2014-10-21 2015-10-21 Improved nucleic acid re-sequencing using a reduced number of identified bases

Country Status (2)

Country Link
GB (1) GB201418718D0 (en)
WO (1) WO2016063059A1 (en)

Also Published As

Publication number Publication date
GB201418718D0 (en) 2014-12-03

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15793887

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15793887

Country of ref document: EP

Kind code of ref document: A1