WO2016062713A1 - A computational method for the identification of variants in nucleic acid sequences - Google Patents

A computational method for the identification of variants in nucleic acid sequences Download PDF

Info

Publication number
WO2016062713A1
WO2016062713A1 PCT/EP2015/074253 EP2015074253W WO2016062713A1 WO 2016062713 A1 WO2016062713 A1 WO 2016062713A1 EP 2015074253 W EP2015074253 W EP 2015074253W WO 2016062713 A1 WO2016062713 A1 WO 2016062713A1
Authority
WO
WIPO (PCT)
Prior art keywords
reads
sequence
state
variants
suffix
Prior art date
Application number
PCT/EP2015/074253
Other languages
French (fr)
Inventor
David TORRENTS ARENALES
Santiago GONZÁLEZ ROSADO
Valentí MONCUNILL GONZALEZE
Original Assignee
Barcelona Supercomputing Center - Centro Nacional De Supercomputación
Institució Catalana De Recerca I Estudis Avançats
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Barcelona Supercomputing Center - Centro Nacional De Supercomputación, Institució Catalana De Recerca I Estudis Avançats filed Critical Barcelona Supercomputing Center - Centro Nacional De Supercomputación
Priority to EP15784335.0A priority Critical patent/EP3210145A1/en
Publication of WO2016062713A1 publication Critical patent/WO2016062713A1/en

Links

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection

Definitions

  • the present invention relates to a computer implemented method for the identification and characterization of sequence variants in nucleic acids.
  • this method is able to quickly and accurately identify most types of sequence genome variations that have been associated to disease, that is, from single nucleotide substitutions to large structural variants.
  • This method has multiple and direct applications in diagnosis, prognosis and therapy.
  • NGS Next Generation Sequencing
  • a wide range of genome variation of cells and individuals has been identified to be the direct cause, or a predisposition, to genetic diseases: from single nucleotide variants (SNVs if they are somatic, and SNPs if they are polymorphic in the population), to structural variants (SVs), which can correspond to deletions, insertions, inversions, translocations and copy number changes (CNVs), ranging from a few nucleotides to large genomic regions, including complete chromosome arms.
  • SNVs single nucleotide variants
  • SVs structural variants
  • CNVs copy number changes
  • These variations can exist between patients and also emerge among cells of the same patient.
  • the unveiling of changes in the genome is driving discoveries such as the Philadelphia translocation between chromosomes 9 and 22, whose presence implies the development of chronic myelogenous leukemia (CML) and its identification allows the development and selection of last-generation therapies.
  • CML chronic myelogenous leukemia
  • Cancer is one of the most active diagnostic and therapeutic areas where NGS is being applied.
  • NGS NGS is being applied.
  • having access to all somatic variation accumulated in a tumor cell will allow the identification of the genetic causes of the tumor and, consequently, a more precise and specific (personalized) diagnosis, prognosis and therapy.
  • the identification of somatic variation in cancer genomes has currently the following sources of errors and limitations: (i) the initial alignment step, on which all the methods rely, is time consuming and particularly error prone with the tumor reads that carry the sequence variation, which are the most relevant for the analysis. It has been proven that many of these reads that carry changes and differences in their sequence are difficult or even impossible to align to the reference unmutated genome. The absence and the misplacement of tumor reads in the final alignment drastically affect all existing downstream methods for variant searching and calling. Although a number of alternative methods exist, this alignment step is generally performed with the same program (Li, H., et.al. "Fast and accurate short read alignment with Burrows-Wheeler transform" Bioinformatics 2009, vol.
  • reference-free has recently been used to describe other methods that, with different applications to the ones outlined above, also analyse NGS data without relying on a reference genome.
  • One of them describes a computational method for quantifying the abundance of RNA isoforms from RNA-seq data, which does not rely on the mapping of reads to a reference genome (Patro R., et. al. "Sailfish enables alignment-free isoform
  • Nordstrom K.J.V et.al. "Mutation identification by direct comparison of whole- genome sequencing data from mutant and wild-type individuals using k-mers” Nature Biotech. 2013, vol. 31 , pp. 325-331 .
  • This work presents a reference- free genome assembler that allows discovering homozygous mutations based on comparing k-mers in whole genome sequencing data.
  • these two methods can operate without the need of a reference sequence, they are not applicable in comparative studies such as the identification of somatic mutations and thus, cannot be used for genome analysis in a biomedical context.
  • the method is not restricted to the identification of a certain type of variant (SNVs,
  • a first aspect of the present invention is a computational method for the identification of nucleic acid variants between two genomic states comprising the steps of:
  • the performance and speed of this method make it more suitable for clinical applications (such as genomic analysis of cancer cells) than the complex and time-consuming pipelines developed so far.
  • the method can even be used to define very complex genomic scenarios such as the ones taking place in chromoplexy and chromothripsis related to cancer development and aggressiveness.
  • a second aspect of the present invention is a computer program product comprising program instructions for causing a computer system to perform the method for the identification of nucleic acid variants between two genomic states of the first aspect of the invention.
  • the computer program product may be embodied on a storage medium (for example, a CD-ROM, a DVD, a USB drive, on a computer memory or on a read-only memory) or carried on a carrier signal (for example, on an electrical or optical carrier signal).
  • a storage medium for example, a CD-ROM, a DVD, a USB drive, on a computer memory or on a read-only memory
  • a carrier signal for example, on an electrical or optical carrier signal
  • the computer program may be in the form of source code, object code, a code intermediate source and object code such as in partially compiled form, or in any other form suitable for use in the implementation of the processes according to the invention.
  • the carrier may be any entity or device capable of carrying the computer program.
  • the carrier may comprise a storage medium, such as a ROM, for example a CD ROM or a semiconductor ROM, or a magnetic recording medium, for example a floppy disc or hard disk.
  • a storage medium such as a ROM, for example a CD ROM or a semiconductor ROM, or a magnetic recording medium, for example a floppy disc or hard disk.
  • the carrier may be a transmissible carrier such as an electrical or optical signal, which may be conveyed via electrical or optical cable or by radio or other means.
  • the carrier may be constituted by such cable or other device or means.
  • the carrier may be an integrated circuit in which the computer program is embedded, the integrated circuit being adapted for performing, or for use in the performance of, the relevant methods.
  • a third aspect of the invention is a system for the identification of nucleic acid variants between two genomic states comprising the steps of: A) Computer/electronic means for Inputting 2 sets of nucleic acid reads, which are sequences retrieved from a sequencing method, wherein the first set of reads corresponds to cells representing a first test state, and the second set of reads corresponds to cells representing a second control state; B)
  • Computer/electronic means for filtering the reads comprising: B1 ) Keeping only the reads with at least a percentage X1 of their bases with a phred quality score higher than 20, being X1 equal or above 90%; B2) Splitting the reads with an undefined nucleotide, giving one sequence before, and one sequence after the undefined nucleotide, the latter being discarded; and B3) Discarding the sequence reads with less than X2 bases, wherein X2 is from 25 to 40; C) Computer/electronic means for generating a suffix structure, wherein the generation of the suffix structure comprises: C1 ) Generating a number of N-X2 new reads for each read of sequence length N, wherein the new N-X2 reads correspond to all suffixes with a length larger than X2 nucleotides and derived from each read of sequence length N, being a suffix the complete sequence of a sequenced read or a sub-sequence
  • the described system can be a part (for example, hardware in the form of a PCI card) of a computer system (for example a personal computer). Furthermore, the system can be an external hardware connected to the computer system by appropiate means.
  • the electronic/computer means may be used
  • a fourth aspect of the invention is a system comprising a processor and a memory, wherein the memory stores computer executable instructions that, when executed, cause the system to perform the method for the identification of nucleic acid variants between two genomic states of the first aspect of the invention.
  • FIG1 A non-limiting example of a suffix tree.
  • the string used for the example is banana$.
  • the string is expressed by the use of this suffix structure.
  • the root of the suffix is the uppermost node found on the picture.
  • FIG. Construction of the suffix structure. Construction of a suffix tree from two different sequences. The sequence “BANANA” which corresponds to a sequence of the first state and the sequence “BAANA” which corresponds to a second state sequence. Below those sequences, there are all their possible derived suffixes.
  • the tree structure is represented by a three squares' box, the left square represents the first state count, the middle square represents the corresponding letter and the right square represents the second state count, a)
  • the box corresponds to what in the text is referenced as node, position or nucleotide, b) That area corresponds to an unambiguous path: all nodes have a unique child with counts of both states, c) That specific node constitutes a breakpoint: the first state count and the second state count diverge in different ways at their children.
  • FIG. Breakpoint block Each dashed box represents a pair of first state (regular letters) and second state (italic and underlined letters) sequence that contains a breakpoint. All the dashed boxes contain sequences with the same suffix or prefix around the breakpoint which means that all of them constitute a unique breakpoint block (Note: the sequence used for this image does not represent any prior art sequence that holds any relation to the invention disclosed herein, i.e. it is simply shown for illustrative purposes).
  • NGS Next Generation Sequencing
  • base and the term “nucleotide” are herein used interchangeably, and refer to the monomers (subunits) which are repeated in a nucleic acid such as DNA or RNA, giving its sequence or primary structure.
  • reference genome refers to the complete nucleic acid sequence representing the whole genome of a species normally accepted by the wide community. Since the reference genome is usually assembled from the sequencing of DNA from a number of donors, it does not accurately represent the set of genes of any one single individual. Instead, a reference genome provides a mosaic of different DNA sequences from each donor. But, at general levels, the reference genome provides a good approximation of the DNA of any single individual. However, in genomic regions with high allelic diversity, the reference genome may differ
  • GRCh37 the Genome Reference Consortium human genome (build 37) is derived from thirteen anonymous volunteers from New York. Reference genomes are typically used as a guide on which new genomes are built and aligned, enabling their assembly and comparison.
  • forward strand refers to a nucleic acid sequence read from 5' terminal to 3' terminal ends.
  • reverse strand refers to the nucleic acid sequence which is complementary to the forward strand.
  • nucleic acid variant refers to a difference in sequence between two genomic states.
  • a variant can be a single nucleotide variant (SNV) if the difference between the two genomes (or two states of the same genome) is only due to the change of a single nucleotide. All other variants, among them insertions, deletions, inversions, duplications, translocations and others are termed structural variants (SV). The latter can have many sizes, from two bases up to entire pieces of a chromosome.
  • SNV single nucleotide variant
  • SV structural variants
  • genomic state can refer to two different genomes derived from two different individuals, or two genomes derived from two different cells of the same individual.
  • the two different cells can be a normal vs. a pathological cell, an undifferentiated vs. a differentiated cell, a cell which has been exposed to a certain external factor vs. an unexposed cell, etc.
  • mapping refers to aligning blocks of the first and second state to a reference genome.
  • read refers to a fragment of nucleic acid that is sequenced in its entirety.
  • the nucleic acid might be DNA, RNA, or even chemically altered nucleic acids.
  • the initial step in a high throughput sequencing run is the random fragmentation of a genome into millions of partly overlapping fragments called reads, which are usually amplified by Polymerase Chain Reaction and sequenced using a variety of techniques that are platform-dependent.
  • the lengths of the reads can also vary depending on the platform, and are usually on the order of a few dozens to a few hundreds of nucleotides.
  • the partly overlapping reads must be assembled if a complete picture of the genome is to be built.
  • depth of coverage refers to the number of times a nucleotide is read during the sequencing process. Deep sequencing means that the total number of reads is many times larger than the length of the sequence under study. Standard depth of coverage currently range from 30 to 100x for whole genomes, meaning that each position in the genome is represented from 30 to 100 times. Coverage similarly designates the average number of reads representing a given nucleotide in the reconstructed sequence.
  • Depth of coverage can be calculated from the length of the original genome (G), the number of reads (A/), and the average read length (L) as N * L/G.
  • the term "undefined nucleotide” as used herein refers to a certain position inside a sequenced read that could not be determined during the sequencing process, that is, a position for which the sequencing experiment has not unambiguously resolved whether it is occupied by an adenine (A), guanine (G), cytosine (C) or thymine (T), and therefore its nature is unknown.
  • Undefined nucleotides in reads are filtered out (removed) in the method of the invention, generating two or more fragments of defined sequence if the undefined nucleotides are removed form inner positions of the read.
  • the term "phred quality score” as used herein refers to the quality score given to each nucleotide base call in a sequenced read.
  • the phred score is a property given to each sequenced nucleotide and it is logarithmically related to the base-calling error probability.
  • a phred score of 10 assigned to a certain nucleotide in a sequenced read means that there is a 90% probability that the base call is correct
  • a phred score of 20 means that there is a 99% probability that the base call is correct
  • a phred score of 30 means that there is a 99.9% probability that the base call is correct.
  • the term “assembling” as used herein refers to grouping all the first state reads, and separately second state reads that share the same variant.
  • sequenced NGS read refers to the complete sequence of a sequenced NGS read or a sub-sequence derived from the latter when one nucleotide at the 5' terminal end is discarded. For instance, let us imagine a read of a length of 40 nucleotides which represents the beginning of the sequence that codes for the Homo sapiens PTGS2 enzyme (prostaglandin endoperoxide synthase-2, also known as cox-2, Genbank entry D28235). If an original read generated in an NGS experiment of the latter gene were: ATTATTAAATTATCAAAAAGAAAATGATCCACGCTCTTAG (lengh 40) its first five suffixes would be:
  • Suffix structure refers to a structure for string storage and compression that allows computational comparison and matching of sequences in a highly efficient manner.
  • Suffix structures that can be used in the method of the invention are, for instance, suffix trees and suffix arrays.
  • suffix tree also known as "suffix trie” as used herein refers to a tree-like data structure that is constructed to enable fast string matching. It allows the convenient storage of all substrings in a given string, that is, in terms of the present invention, the storage of all nucleotide sub-sequences (suffixes) in a given nucleotide sequence (of a read).
  • suffix tree see FIG.1 where a suffix tree is constructed for the term banana.
  • quadternary suffix tree refers to a suffix tree built with the 4-letter vocabulary of nucleic acids (G,A,T/U,C).
  • the suffix tree has a root, which is the 0 position from which the counting of the nucleotides begins.
  • the suffix array A contains the starting positions of these sorted suffixes: i 1234567
  • - ⁇ [3] contains the value 4, and therefore refers to the suffix starting at position 4 within S, which is the suffix ana$.
  • Suffix arrays are closely related to suffix trees. Suffix arrays can be
  • suffix array is an auxiliary array for the "suffix array”.
  • the LCP array H is constructed by comparing lexicographically consecutive suffixes to determine their longest common prefix:
  • prefix refers to the first part of a sequencing read, that is, from position 1 to a given position depending on the context. This term is used here, as it is used in a grammatical context referring to words.
  • breakpoint refers to the sequence immediately flanking a sequence variant.
  • a breakpoint is the point where the DNA broke in the second state and appears as a change in the first state compared to the second control state. In other words, where the continuity of the sequence of the control second state breaks in the first state.
  • node refers to any given position (nucleotide) in the tree. Relevant nodes in this invention are those that determine a change in the sequence of the second state compared to the first state (breakpoint nodes). In FIG. 2 it is represented a node where healthy and mutated reads take different paths because they contain a different sequence.
  • block or “breakpoint block” as used herein refers to all the reads derived from the two genomes compared, derived from the same position and including a variation in first state read.
  • a breakpoint block (FIG. 3). is composed of aligned reads derived from the sequencing of all four alleles involved (two coming from the first state genome and the other two derived from the second state genome). Normally, in the case of heterozygous variation, only the reads derived from the altered allele will contain the mutation or variant.
  • ambiguous path refers to multiple possible sequence solutions in a given tree. It is referred here as the opposite of unique and unambiguous path or sequence.
  • the first aspect of the present invention is a computational method for the identification of nucleic acid variants between two genomic states comprising the steps of:
  • computational method further comprises the step of: F) Cataloguing and annotating blocks according to the following: F1 ) If blocks between the first and second states only differ in one substituted nucleotide, the variant is catalogued as containing a single nucleotide variant and the single nucleotide variant is annotated; F2) If blocks between the first and second states differ in more than one nucleotide but the whole difference in sequence is contained within the block, the variant is catalogued as a small structural variant, and the small structural variant is annotated; and F3) If blocks between the first and second states differ in more than one nucleotide and the whole difference in sequence is not contained within the block, the variant is catalogued as a large structural variant, and the boundaries of all large structural variants are extended by retrieving suffixes overlapping at least X2 nucleotides in an iterative process which ends when the extended sequence reaches 200 nucleotides or when an ambiguous path is found.
  • F1 If blocks between the first and second states only differ in one
  • the method further comprises optionally mapping second state blocks, and subsequently mapping first state blocks, on a reference genome.
  • the suffix structure is a suffix tree or a suffix array.
  • X1 is equal or above 95%.
  • X1 is equal or above 99%.
  • X2 is from 30 to 35.
  • X2 is from 30 to 32.
  • X2 is equal to 30.
  • X3 is equal or above 4.
  • X3 is equal or above 6.
  • X3 is equal or above 8.
  • X3 is directly proportional to the depth of coverage in the sequencing experiment. This means that, the deeper the coverage, the more restrictive (higher) is the value for X3 (the more reads with the same variation between first and second states must be present for the node to be accepted as a breakpoint node).
  • X4 is between 5-10%.
  • X4 is between 5-7%.
  • X4 is 5%.
  • X5 is between 20-25%. In a particular embodiment of the first aspect of the invention, optionally in combination with any embodiment above or below, X5 is 20%.
  • the first set of reads corresponds to pathological cells of a patient, and the second set of reads corresponds to non-pathological cells of the same patient;
  • the first set of reads corresponds to cancer cells of a patient
  • the second set of reads corresponds to non-cancer cells of the same patient.
  • the first set of reads corresponds to virus-infected cells of a patient
  • the second set of reads corresponds to non-infected cells of the same patient.
  • the first set of reads and the second set of reads correspond to the same cell of the same patient in two different developmental stages.
  • the first set of reads corresponds to cells of a patient which have been exposed to a drug
  • the second set of reads corresponds to cells of the same patient which have not been exposed to a drug
  • the first set of reads corresponds to cells of a tissue
  • the second set of reads corresponds to cells of a tissue
  • the embodiments of the invention comprise processes performed in computer apparatus, the invention also extends to computer apparatus and to computer programs, particularly computer programs on or in a carrier, adapted for putting the invention into practice.
  • a computer program product comprising program instructions for causing a computer system to perform the method for the identification of nucleic acid variants between two genomic states as defined above.
  • the computer program product is embodied on a storage medium.
  • the computer program product is carried on a carrier signal.
  • the carrier may be any entity or device capable of carrying the program.
  • the goal in the case of the two aggressive forms of tumors was to test the method of the invention for the detection of somatic mutations accumulated in cancer cells as compared to normal (non-tumor) cells, and to assess how the method compares to other methods found in the state of the art.
  • somatic variants associated with cancer typically implies the sequencing of tumor and normal genome samples from the same patient, followed by multiple sequence comparisons. This process implies the alignment of normal and pathological reads to a reference genome, identifying sequence changes and finally isolating the somatic fraction of variants (i.e. those detected only in the tumor cell reads).
  • the method of the invention is capable of identifying SNVs and SVs of all sorts and sizes in a single run with great accuracy.
  • the method of the invention is even proven to be capable of identifying breakpoints associated with complex chromosomal rearrangements (chromotripsis and chromoplexy) in two forms of aggressive cancers.
  • characterization carried out by the method of the invention comprise the steps outlined below: A) Input data.
  • the method takes high quality sequencing data directly from FASTQ files of tumor and normal samples of the same individual. Alternatively, it is also able to accept BAM files, from which it extracts all the sequencing reads.
  • the user can define a cutoff, so that reads having over a certain threshold of their bases with a phred quality score ⁇ q20 are discarded.
  • X1 90 has been found to be especially suited for the purposes tested. This means that only reads with at least 90% of their bases with a phred quality score higher than 20 are kept.
  • the next step implies the building of a suffix structure.
  • This structure can, for instance, be a suffix tree or a suffix array.
  • a quaternary suffix tree (or "quad-tree") structure is first generated using all high quality normal and tumor reads (see FIG.1 and FIG.2 for a simplified version). All these sequences are
  • Each node of the tree has, at most, 4 branches, each one representing one of the four nucleotides.
  • the empirical value for X2 that has been mostly used in the tests disclosed is 30. This means that all reads with less than 30 bases are discarded. In the case of the presence of undefined base pairs ("N"), these are removed and the original sequence is split forming new shorter reads, which are inserted into the tree only if they are longer than 30 (X2) base pairs.
  • N undefined base pairs
  • Each of sequences accepted is inserted into the tree (both in forward and reverse), from the root, in original form (i.e. starting from nucleotide 1 to the end of the read), together with all derived suffixes larger than 30 bp (recursively starting from nucleotide 2 to the end, 3 to the end, etc., as defined in the definitions).
  • the next step consists in identifying all tumor specific reads. Because it was expected that variants generate new and distinct sequences in the mutated genome (first state) compared to the non-mutated control genome (second state), the method first searches and collects sequences (reads) that are only present in the tumor sample. These sequences are identified from the tree, as nodes and branches with an unbalanced representation of Normal (Count Normal Reads - CNR) and Tumor (Count Tumor Reads - CTR) reads . It was expected that nodes or branches covering a variation in the tumor sequence (first state) will theoretically have no representation in normal reads (second state).
  • nodes or branches with a CNR to CTR ratio below a certain threshold are selected.
  • This threshold can be adjusted by the user to account for expected levels of contamination of tumor cell reads among normal cell reads, i.e. reads of state one contaminating the sample of state two. This inconvenience may happen if a small group of cancer cells contaminate the sample of non-tumor cells, which is not uncommon.
  • a X4 larger than 0 always results in a higher sensitivity, but at the cost of lower specificity.
  • X4 was set to 0 for the in silico analysis and to 0.05 for the real tumor samples analyzed here, where it was assumed a maximum of 5% contamination of tumor reads in the control sample.
  • a final condition is imposed, which works as a reverse of X4.
  • a certain percentage of second state (control) reads can be present among first state (test, cancer) reads, which happens when genomes of state one contaminate state two (for example, it is common in cancer, where tumor cells can contaminate the normal sample during the extraction from the patient). This can also be corrected by only accepting nodes which are not over a threshold X5.
  • the next step consists in grouping those that are suspected to cover the same variant.
  • candidate sequences are organized by identity: two sequences belong to the same group if they overlap at least X2 base pairs. X2 was set to 30 in the examples disclosed.
  • Reverse complementary sequences are also evaluated during this grouping in order to be able to cover the variant in both
  • Sequence blocks (groups) with sequences in only one of the orientations (or with less than 4 tumor reads as explained above) are discarded. Once these groups are generated, it was interrogated the tree, also on the 30 base pair-overlap basis, to extract the normal (non-mutated) reads of the same region and add them to the block.
  • each block will represent a region in the genome containing the mutated and the non-mutated version (see a detailed example of a breakpoint block in FIG.3).
  • the consensus mutated and normal sequences from these blocks.
  • the corresponding normal consensus sequence can be used at the end of the procedure and mapped onto a reference genome to obtain the coordinates of the variant.
  • the method also includes step F:
  • the next step consists in identifying and classifying the variation included there.
  • Normal and Tumor consensus sequences derived from these blocks are recursively compared to identify differences.
  • a fist evaluation will search for small variants, which consist of those that are completely included within the consensus sequences (SNV and small SVs: insertions, deletions and inversions). All the blocks that do not match this criterion are then considered candidates for large SVs, i.e. likely to cover break points of intra or interchromosomal transitions, part of large deletions, insertions or inversions .
  • each tumor consensus sequence is extended on both ends by interrogating the tree for unambiguous tumor reads that overlap at least X2 (30 in this case) base pairs with the tumor consensus, reconstructing a (maximum) of 200 base pair region around the break and allowing the detection of newly generated sequence at the point of the break.
  • inventors identify the coordinates of the changes by mapping onto a reference genome the normal consensus sequences corresponding to each of the variants, avoiding potential mapping conflicts derived from the presence of the variant, as usually happens when using reference-based approaches. Sequences mapping (with the same score) to several positions in the genome are discarded. The calibration and default parameters for the method were adjusted using a high quality set of -1000 SNVs identified with the Sidron software in a CLL sample (Puente X.S. "Whole-genome sequencing identifies recurrent mutations in chronic lymphocytic leukaemia" Nature 201 1 , vol. 475, pp. 101 -105).
  • a personalized genome was simulated using the hg19 reference genome downloaded from UCSC (with no repeat-masking) (http://www.ucsc.edu ), and modifying it to match a randomly chosen human haplotype from 1000 genome database.
  • These 7,194,026 variants consist of 4,745,917 SNPs and
  • the catalog of somatic variants further added to this personalized genome includes 8240 SNVs (more than 100bp apart), 20 known tumor translocations (Egan, J.B. et al. "Whole genome analyses of a well -differentiated liposarcoma reveals novel SYT1 and DDR2
  • M003, M004 and MB1 were obtained with informed consent and an ethical vote (Institutional Review Board) following ICGC guidelines (www.icgc.org). M003, M004 and MB1 were accessed through the European Genome- phenome Archive (EGA, https://www.ebi.ac.uk/ega/) under access numbers EGAS00001000510 and EGAS00001000085.
  • EAA European Genome- phenome Archive
  • Variant genes in tumor samples were identified by analyzing all the changes identified with ANNOVAR (Wang K., et.al. "ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data” Nucleic acids research 2010, vol. 38, pp. e164). The analysis of the resulting genes potentially modified at coding or splicing level were further analyzed with Intogen (Gonzalez-Perez, A. et al. "IntOGen-mutations identifies cancer drivers across tumor types” Nature methods 2013, vol. 10, pp. 1081 -1082) in order to infer their potential role in oncogenesis.
  • PCR primers were designed on sequence blocks of 2000bp around the target variant using Primer 3 (http://frodo.wi.mit.edu/primer3) (Schgasser, A. et al. "Primer3--new capabilities and interfaces”. Nucleic acids research 2012, vol. 40, pp. e1 15). PCR reactions were performed for tumor and control samples. Each target locus was amplified using 50 ng of DNA. The amplification was performed using Qiagen Multiplex PCR Kit (Qiagen), and the reaction mix contained 2x QIAGEN Multiplex PCR Master Mix, 10x primer mix (2 ⁇ of each primer) and RNase-free water until a total reaction volume of 25 ⁇ .
  • PCR conditions were as follows: 96°C 10', 2 cycles of 96°C 30"-60°C 30"-72°C 1 '30", 2 cycles of 96°C 30"-58°C 30"-72°C 1 '30", 2 cycles of 96°C 30"-56°C 30"-72°C 1 '30", 35 cycles of 96°C 30"-54°C 30"-72°C 1 '30", and 70°C 10'.
  • PCR products have been run in a capillary electrophoresis gel (QIAxcel Advenced System, Qiagen) with the QIAxcel DNA screening kit (Qiagen), and the multiband PCR products were purified using NucleoSpin Gel and PCR Clean-up (Mercherey-Nagel).
  • Sanger sequencing PCR products were cleaned using ExoSAP-IT (USB) and sequenced using ABI Prism BigDye terminator v3.1 (Applied Biosystems) with 5 pmol of each primer. Sequencing reactions were run on an ABI-3730 Sanger sequencing platform (Applied Biosystems). Sequences were examined with the Mutation Surveyor DNA Variant Analysis Software (Softgenetics). G-bandinq, FISH and M-FISH analysis
  • inventors generated normal and tumor test genomes by first applying a random human haplotype (Schgasser et.al., ibid) and a predesigned catalog of somatic variants to the human reference genome, to then simulate whole genome sequencing at different depths of coverage. To assess for the applicability of the method in the current context of cancer genome analysis, it was compared its performance with existing strategies for somatic variant calling. For this, it was selected a
  • Variants are distributed as follows: 8240 SNV and 1798 SVs (738 deletions, 715 insertions and 345 inversions).
  • the table shows the number of breakpoints that define SVs.
  • the method was able to identify 4409 somatic SNVs and 1094 small SVs in this tumor genome.
  • a random set of 1 1 1 of these somatic calls were amplified and verified by Sanger sequencing using the same DNA used for whole genome sequencing. This process allowed us to verify >94% of SNVs (76 of 81 ) and >80% of SVs (28 of 35) identified by our method. These specificity rates are in agreement with the corresponding values obtained from the in silico analysis.
  • MB1 was previously described as presenting chromothripsis, a complex structural alteration of the genome hypothesized to arise from a single catastrophic event that generates multiple breakpoints, often affecting a single chromosome.
  • the method of the invention uncovered a total of 102 breakpoints corresponding to "large" SVs (i.e.
  • translocations From the assessment of a random set of 39 of these breaks through PCR amplification and Sanger sequencing, it could positively be verified 36 (92%). Among all the breakpoints detected, 25 were found to agree with the intervals of chromosomal translocations that previously let defining chromothripsis in this tumor, including 3 of the 4 verified at base pair resolution. In addition, it could be identified 65 novel breakpoints in the same tumor, comprising 53 intra and 12 interchromosomal translocations. From a subset of 37 of these translocations (16 intra- and 1 1 interchromosomal), 25 were verified (92.5%).
  • Cibulskis K., et.al. “Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples” Nat. Biotech. 2013, vol. 31 , pp. 213-219 Wang K., et.al. "ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data” Nucleic acids research 2010, vol. 38, pp. e164
  • Patro R., et. al. "Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms” Nature Biotech. 2014, vol. 32, pp. 462-464

Abstract

There is provided a computational method for the identification of nucleic acid variants between two cells, such as a normal cell vs. a pathological cell of a patient, or a cell at two different stages of development. The method is alignment-free, as it does not depend on the use of a reference genome, and accurately identifies all sorts of genetic variants, ranging from single nucleotide substitutions (SNVs) to large structural variants with great sensitivity and specificity. It thus allows to speed up in silico genomic analysis and personalized medicine.

Description

A computational method for the identification of variants in nucleic acid sequences
The present invention relates to a computer implemented method for the identification and characterization of sequence variants in nucleic acids. In particular, this method is able to quickly and accurately identify most types of sequence genome variations that have been associated to disease, that is, from single nucleotide substitutions to large structural variants. This method has multiple and direct applications in diagnosis, prognosis and therapy.
BACKGROUND ART
The genetic basis of disease is increasingly becoming more accessible thanks to the emergence of the Next Generation Sequencing (NGS) platforms, which have extremely reduced the costs and increased the throughput of genomic sequencing. For the first time in history, personalized medicine is close to becoming a reality through the analysis of each patient's genome. A wide range of genome variation of cells and individuals has been identified to be the direct cause, or a predisposition, to genetic diseases: from single nucleotide variants (SNVs if they are somatic, and SNPs if they are polymorphic in the population), to structural variants (SVs), which can correspond to deletions, insertions, inversions, translocations and copy number changes (CNVs), ranging from a few nucleotides to large genomic regions, including complete chromosome arms. These variations can exist between patients and also emerge among cells of the same patient. The unveiling of changes in the genome is driving discoveries such as the Philadelphia translocation between chromosomes 9 and 22, whose presence implies the development of chronic myelogenous leukemia (CML) and its identification allows the development and selection of last-generation therapies.
The ideal exploitation of genomic sequencing should involve the accurate identification of all variants, in order to derive a correct diagnosis and to select the best therapy (Xuan J. et.al. "Next-generation sequencing in the clinic: Promises and challenges" Cancer Lett. 2013, vol. 340, pp. 284-295). For clinical purposes, it is important that this computational process can be done within an effective timeframe. But a simple sequencing experiment typically yields thousands of millions of reads per genome, which have to be stored and analysed. The task is severely hindered by a variety of factors such as PCR-amplification, sequencing errors, limitations intrinsically linked to the size of the reads, biases in the sequencing techniques employed, the inherent repetitive and dynamic nature of the genomic sequences, and others. As a consequence, the analysis of genomes with diagnostic and therapeutic purposes is still a great challenge, both in the design of efficient algorithms and at the level of computing performance.
Cancer is one of the most active diagnostic and therapeutic areas where NGS is being applied. As the central aim of current research and upcoming clinical practice of this field, having access to all somatic variation accumulated in a tumor cell will allow the identification of the genetic causes of the tumor and, consequently, a more precise and specific (personalized) diagnosis, prognosis and therapy. The variants responsible for the origin and
progression of tumors are currently searched using a common scheme that involves the sequencing of both, tumor and normal genome samples of the same patient, and the subsequent scan and identification of the differences between them (Isakov O. et.al. "Deep sequencing data analysis: Challenges and solutions" in Bioinformatics - Trends and Methodologies, Humana Press, USA, Nov. 201 1 ). With available methods, prior to this invention, this identification process results complex and always implies an initial common step, where all the normal and tumor sequence reads are aligned to a reference genome. A great number of different methodologies have been developed to then analyse this initial sequence alignment and identify the changes present in the tumor compared with the normal and reference genomes. This implies that all the intrinsic limitations and errors associated with this initial alignment step affect the performance and accuracy of all existing methods for variant calling.
More precisely, the identification of somatic variation in cancer genomes has currently the following sources of errors and limitations: (i) the initial alignment step, on which all the methods rely, is time consuming and particularly error prone with the tumor reads that carry the sequence variation, which are the most relevant for the analysis. It has been proven that many of these reads that carry changes and differences in their sequence are difficult or even impossible to align to the reference unmutated genome. The absence and the misplacement of tumor reads in the final alignment drastically affect all existing downstream methods for variant searching and calling. Although a number of alternative methods exist, this alignment step is generally performed with the same program (Li, H., et.al. "Fast and accurate short read alignment with Burrows-Wheeler transform" Bioinformatics 2009, vol. 25, pp. 1754-1760), which implies that nearly all analyses done nowadays share the same type of errors derived from this mapping of reads. (ii) The usage of a reference genome also implies that the several millions of inherited variations (germline, i.e. not somatic) that exist in each particular patient emerges when using the reference genome, which clearly interferes with the identification of the target somatic fraction (normally comprising only between 2 and 10 thousand variants). A considerable number of these germline variants are then frequently mispredicted as somatic changes, increasing the rate of false positive and decreasing, consequently, the final reliability of the results.
On top of the limitations and errors inherent to the generation and
dependency of this initial alignment, the subsequent analysis, where somatic changes are finally identified, also implies a number of restrictions and complications. For example, despite the great deal of possibilities in terms of available methods, not a single one of them is able to identify a wide range of somatic variation, but instead, each is limited to the detection of a particular size and type of mutation (Medvedev P. "Computational methods for discovering structural variation with next-generation sequencing" Nat- Methods Suppl. 2009, vol. 6, S13-S20). There are programs that use this alignment to detect only SNVs and others that only identify SVs, among which, each one is able to detect a particular variant size. For instance, some methods identify insertions or deletions that comprise a few nucleotides (from
2 to a few dozens), others detect medium size SVs (from a few dozens of nucleotides to a few hundreds), and a small fraction of them, are designed for the identification of larger SVs (Ding, L, et.al. "Expanding the computational toolbox for mining cancer genomes" Nat. Rev. Genet. 2014, vol. 15, pp. 556- 570). As the detection complexity increases with the detection range, methods designed for the identification of large SVs are also more imprecise in defining the exact location in the genome and the type of change, which are often necessary for being able to derive the functional consequences of the mutation. These programs often only report regions where SVs might be located. All this implies that, complete analysis of genomes currently results complex, time consuming and imprecise. The analysis of a single genome pair in a typical cancer study now involves the combination of up to four or five different methods: one for the initial alignment, another for the calling of SNVs, another for small SVs, and one or more additional programs for medium to large size SVs. Each of these methods has been developed independently and thus involves their own filtering and scoring schemes and specific computational requirements, making the complete process even more burdensome and only accessible to a few centres and groups with enough computing resources and expertise. Even if all these requirements are met and complete and complex pipelines are built and put into practice, an important fraction of large SVs, which are relevant in disease, still remained either poorly described or undetected.
Of note, all the limitations mentioned hinder the incorporation of this type of genomic analysis into the clinical practice, which calls for much faster and more accurate computational methods. But current research and method development activity is centred in developing new combinations of
procedures still on the basis of the analysis of initial alignment. So far, there is not a single available method for the identification of somatic variation, that does not depend on the initial alignment of tumor and normal reads onto a reference genome. Thus, it would be highly desirable to have an alignment- free sequence comparison method that does not suffer from the limitations outlined above. Such a method would be already faster than existing methods only by excluding the need of the prior alignment step.
The term reference-free has recently been used to describe other methods that, with different applications to the ones outlined above, also analyse NGS data without relying on a reference genome. One of them describes a computational method for quantifying the abundance of RNA isoforms from RNA-seq data, which does not rely on the mapping of reads to a reference genome (Patro R., et. al. "Sailfish enables alignment-free isoform
quantification from RNA-seq reads using lightweight algorithms" Nature Biotech. 2014, vol. 32, pp. 462-464). A second study is related to "c/e novo" assembly of genomes carried out within the evolutionary biology field,
Nordstrom K.J.V, et.al. "Mutation identification by direct comparison of whole- genome sequencing data from mutant and wild-type individuals using k-mers" Nature Biotech. 2013, vol. 31 , pp. 325-331 . This work presents a reference- free genome assembler that allows discovering homozygous mutations based on comparing k-mers in whole genome sequencing data. However, although these two methods can operate without the need of a reference sequence, they are not applicable in comparative studies such as the identification of somatic mutations and thus, cannot be used for genome analysis in a biomedical context.
Clearly quick and robust comparative methods, able to detect all kinds of SVs differentiating two states (normal vs. pathological, undifferentiated vs.
differentiated, etc.) with high sensitivity, specificity and speed are still needed.
SUMMARY OF THE INVENTION
In contrast to what is found in the prior art, inventors have come up with a method for the identification of nucleic acid variants between two genomic states that does not depend on the alignment of reads of either state to a reference genome. This method is, on its own, able to accurately identify all types of variants (heterozygous and homozygous), from single nucleotide variations to large structural variants at base pair resolution with
unprecedented specificity, sensitivity and speed. Importantly, the method is not restricted to the identification of a certain type of variant (SNVs,
insertions, inversions, etc.) nor does it only perform for variants of a certain size, as is the case for many of the methods found in the art. Because of its underlying principle, all the limitations inherent to the use of the reference genome that apply to all other existing methods (outlined above) are overcome. This translates to a method that is not only more robust, but is also more thorough and much faster than the methods described to date.
Thus, a first aspect of the present invention is a computational method for the identification of nucleic acid variants between two genomic states comprising the steps of:
A) Inputting 2 sets of nucleic acid reads, which are sequences retrieved from a sequencing method, wherein the first set of reads corresponds to cells representing a first test state, and the second set of reads corresponds to cells representing a second control state; B) Filtering the reads, wherein the filtering comprises: B1 ) Keeping only the reads with at least a percentage X1 of their bases with a phred quality score higher than 20, being X1 equal or above 90%; B2) Splitting the reads with an undefined nucleotide, giving one sequence before, and one sequence after the undefined nucleotide, the latter being discarded; and B3) Discarding the sequence reads with less than X2 bases, wherein X2 is from 25 to 40; C) Generating a suffix structure comprising: C1 ) Generating a number of N-X2 new reads for each read of sequence length N, wherein the new N-X2 reads correspond to all suffixes with a length larger than X2 nucleotides and derived from each read of sequence length N, being a suffix the complete sequence of a sequenced read or a sub-sequence derived from the latter when one nucleotide at the 5' terminal end is discarded; and C2) Building a suffix structure, which includes all the suffixes generated in step C1 ); D) Detecting variants in the sequence between the reads of the first state and the reads of the second state by annotating breakpoint nodes, wherein a breakpoint node is a point in the suffix structure that determines a change in the sequence of the second state compared to the first state, said breakpoint nodes fulfilling all the following requirements: D1 ) The node must be at least X2 positions from the root of the suffix structure; D2) The node must have at least X3 reads with the same variation between first and second states, being X3 at least 2; D3) The percentage of first state reads in second state reads is not over a threshold X4, to account for possible contamination of control state reads with test state reads, wherein X4 is at least 5%; and D4) The percentage of second state reads in first state reads is not over a threshold X5, to account for possible contamination of test state reads with control state reads, wherein X5 is at least 20%; E) Assembling and filtering test and control reads derived from all breakpoint nodes accepted in step D, wherein: E1 ) Variants whose prefixes share more than X2 nucleotides are merged to give a block; and E2) If the nucleic acid whose variants are being identified is a double stranded DNA, then variants that do not have blocks in both forward and reverse sequence strands are discarded.
The performance and speed of this method make it more suitable for clinical applications (such as genomic analysis of cancer cells) than the complex and time-consuming pipelines developed so far. As it is shown in the data below, the method can even be used to define very complex genomic scenarios such as the ones taking place in chromoplexy and chromothripsis related to cancer development and aggressiveness.
A second aspect of the present invention is a computer program product comprising program instructions for causing a computer system to perform the method for the identification of nucleic acid variants between two genomic states of the first aspect of the invention.
The computer program product may be embodied on a storage medium (for example, a CD-ROM, a DVD, a USB drive, on a computer memory or on a read-only memory) or carried on a carrier signal (for example, on an electrical or optical carrier signal).
The computer program may be in the form of source code, object code, a code intermediate source and object code such as in partially compiled form, or in any other form suitable for use in the implementation of the processes according to the invention. The carrier may be any entity or device capable of carrying the computer program.
For example, the carrier may comprise a storage medium, such as a ROM, for example a CD ROM or a semiconductor ROM, or a magnetic recording medium, for example a floppy disc or hard disk. Further, the carrier may be a transmissible carrier such as an electrical or optical signal, which may be conveyed via electrical or optical cable or by radio or other means.
When the computer program is embodied in a signal that may be conveyed directly by a cable or other device or means, the carrier may be constituted by such cable or other device or means.
Alternatively, the carrier may be an integrated circuit in which the computer program is embedded, the integrated circuit being adapted for performing, or for use in the performance of, the relevant methods.
A third aspect of the invention is a system for the identification of nucleic acid variants between two genomic states comprising the steps of: A) Computer/electronic means for Inputting 2 sets of nucleic acid reads, which are sequences retrieved from a sequencing method, wherein the first set of reads corresponds to cells representing a first test state, and the second set of reads corresponds to cells representing a second control state; B)
Computer/electronic means for filtering the reads, wherein the filtering comprises: B1 ) Keeping only the reads with at least a percentage X1 of their bases with a phred quality score higher than 20, being X1 equal or above 90%; B2) Splitting the reads with an undefined nucleotide, giving one sequence before, and one sequence after the undefined nucleotide, the latter being discarded; and B3) Discarding the sequence reads with less than X2 bases, wherein X2 is from 25 to 40; C) Computer/electronic means for generating a suffix structure, wherein the generation of the suffix structure comprises: C1 ) Generating a number of N-X2 new reads for each read of sequence length N, wherein the new N-X2 reads correspond to all suffixes with a length larger than X2 nucleotides and derived from each read of sequence length N, being a suffix the complete sequence of a sequenced read or a sub-sequence derived from the latter when one nucleotide at the 5' terminal end is discarded; and C2) Building a suffix structure, which includes all the suffixes generated in step C1 ); D) Computer/electronic means for detecting variants in the sequence between the reads of the first state and the reads of the second state, said detection is performed by annotating breakpoint nodes, wherein a breakpoint node is a point in the suffix structure that determines a change in the sequence of the second state compared to the first state, said breakpoint nodes fulfilling all the following requirements: D1 ) The node must be at least X2 positions from the root of the suffix structure; D2) The node must have at least X3 reads with the same variation between first and second states, being X3 at least 2; D3) The percentage of first state reads in second state reads is not over a threshold X4, to account for possible contamination of control state reads with test state reads, wherein X4 is at least 5%; and D4) The percentage of second state reads in first state reads is not over a threshold X5, to account for possible contamination of test state reads with control state reads, wherein X5 is at least 20%;
E) Computer/electronic means for assembling and filtering test and control reads derived from all breakpoint nodes accepted in step D, wherein in the assembling and filtering: E1 ) Variants whose prefixes share more than X2 nucleotides are merged to give a block; and E2) If the nucleic acid whose variants are being identified is a double stranded DNA, then variants that do not have blocks in both forward and reverse sequence strands are discarded.
It is important to highlight that the described system can be a part (for example, hardware in the form of a PCI card) of a computer system (for example a personal computer). Furthermore, the system can be an external hardware connected to the computer system by appropiate means.
On the other hand, the electronic/computer means may be used
interchangeably, that is, a part of the described means may be electronic means and the other part may be computer means, or all described means may be electronic means or all described means may be computer means. Examples of an apparatus comprising only electronic means may be a CPLD, a FPGA or an ASIC. A fourth aspect of the invention is a system comprising a processor and a memory, wherein the memory stores computer executable instructions that, when executed, cause the system to perform the method for the identification of nucleic acid variants between two genomic states of the first aspect of the invention.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG1 . A non-limiting example of a suffix tree. The string used for the example is banana$. The string is expressed by the use of this suffix structure. The root of the suffix is the uppermost node found on the picture.
FIG2. Construction of the suffix structure. Construction of a suffix tree from two different sequences. The sequence "BANANA" which corresponds to a sequence of the first state and the sequence "BAANA" which corresponds to a second state sequence. Below those sequences, there are all their possible derived suffixes. The tree structure is represented by a three squares' box, the left square represents the first state count, the middle square represents the corresponding letter and the right square represents the second state count, a) The box corresponds to what in the text is referenced as node, position or nucleotide, b) That area corresponds to an unambiguous path: all nodes have a unique child with counts of both states, c) That specific node constitutes a breakpoint: the first state count and the second state count diverge in different ways at their children.
FIG3. Breakpoint block. Each dashed box represents a pair of first state (regular letters) and second state (italic and underlined letters) sequence that contains a breakpoint. All the dashed boxes contain sequences with the same suffix or prefix around the breakpoint which means that all of them constitute a unique breakpoint block (Note: the sequence used for this image does not represent any prior art sequence that holds any relation to the invention disclosed herein, i.e. it is simply shown for illustrative purposes).
DETAILED DESCRIPTION OF THE INVENTION
All terms as used herein, unless otherwise stated, shall be understood in their ordinary meaning as known in the art. Other more specific definitions for certain terms as used in the present application are as set forth below and are intended to apply uniformly throughout the description and claims unless an otherwise expressly set out definition provides a broader definition.
The terms "computational" and "computer implemented" are taken here to mean the same and are used interchangeably. Therefore, "computational method" and "computer implemented method" are taken as synonyms.
The terms "Next Generation Sequencing" (NGS), "deep sequencing", "ultra- deep sequencing", "high throughput sequencing" are all used interchangeably and refer to the technology platforms currently being used as standard to enable the sequencing of genomes ("sequencing methods") with high speed and contained cost, such as the Roche/454, lllumina/Solexa, Life/APG and Pacific Biosciences platforms. The term "base" and the term "nucleotide" are herein used interchangeably, and refer to the monomers (subunits) which are repeated in a nucleic acid such as DNA or RNA, giving its sequence or primary structure.
The term "reference genome" as used herein refers to the complete nucleic acid sequence representing the whole genome of a species normally accepted by the wide community. Since the reference genome is usually assembled from the sequencing of DNA from a number of donors, it does not accurately represent the set of genes of any one single individual. Instead, a reference genome provides a mosaic of different DNA sequences from each donor. But, at general levels, the reference genome provides a good approximation of the DNA of any single individual. However, in genomic regions with high allelic diversity, the reference genome may differ
significantly from any one single individual. For example, GRCh37, the Genome Reference Consortium human genome (build 37) is derived from thirteen anonymous volunteers from New York. Reference genomes are typically used as a guide on which new genomes are built and aligned, enabling their assembly and comparison.
The term "forward strand" as used herein refers to a nucleic acid sequence read from 5' terminal to 3' terminal ends. The term "reverse strand" refers to the nucleic acid sequence which is complementary to the forward strand.
The term "nucleic acid variant" or simply "variant" as used herein refers to a difference in sequence between two genomic states. A variant can be a single nucleotide variant (SNV) if the difference between the two genomes (or two states of the same genome) is only due to the change of a single nucleotide. All other variants, among them insertions, deletions, inversions, duplications, translocations and others are termed structural variants (SV). The latter can have many sizes, from two bases up to entire pieces of a chromosome.
The term "genomic state" as used herein can refer to two different genomes derived from two different individuals, or two genomes derived from two different cells of the same individual. In the second case, the two different cells can be a normal vs. a pathological cell, an undifferentiated vs. a differentiated cell, a cell which has been exposed to a certain external factor vs. an unexposed cell, etc.
The term "mapping" as used herein refers to aligning blocks of the first and second state to a reference genome.
The term "read" as used herein refers to a fragment of nucleic acid that is sequenced in its entirety. The nucleic acid might be DNA, RNA, or even chemically altered nucleic acids. The initial step in a high throughput sequencing run is the random fragmentation of a genome into millions of partly overlapping fragments called reads, which are usually amplified by Polymerase Chain Reaction and sequenced using a variety of techniques that are platform-dependent. The lengths of the reads can also vary depending on the platform, and are usually on the order of a few dozens to a few hundreds of nucleotides. The partly overlapping reads must be assembled if a complete picture of the genome is to be built.
The term "depth of coverage" as used herein refers to the number of times a nucleotide is read during the sequencing process. Deep sequencing means that the total number of reads is many times larger than the length of the sequence under study. Standard depth of coverage currently range from 30 to 100x for whole genomes, meaning that each position in the genome is represented from 30 to 100 times. Coverage similarly designates the average number of reads representing a given nucleotide in the reconstructed sequence.
Depth of coverage can be calculated from the length of the original genome (G), the number of reads (A/), and the average read length (L) as N*L/G. The term "undefined nucleotide" as used herein refers to a certain position inside a sequenced read that could not be determined during the sequencing process, that is, a position for which the sequencing experiment has not unambiguously resolved whether it is occupied by an adenine (A), guanine (G), cytosine (C) or thymine (T), and therefore its nature is unknown.
Undefined nucleotides in reads are filtered out (removed) in the method of the invention, generating two or more fragments of defined sequence if the undefined nucleotides are removed form inner positions of the read.
The term "phred quality score" as used herein refers to the quality score given to each nucleotide base call in a sequenced read. The phred score is a property given to each sequenced nucleotide and it is logarithmically related to the base-calling error probability. A phred score of 10 assigned to a certain nucleotide in a sequenced read means that there is a 90% probability that the base call is correct, a phred score of 20 means that there is a 99% probability that the base call is correct, and a phred score of 30 means that there is a 99.9% probability that the base call is correct. The term "assembling" as used herein refers to grouping all the first state reads, and separately second state reads that share the same variant.
The term "suffix" as used herein refers to the complete sequence of a sequenced NGS read or a sub-sequence derived from the latter when one nucleotide at the 5' terminal end is discarded. For instance, let us imagine a read of a length of 40 nucleotides which represents the beginning of the sequence that codes for the Homo sapiens PTGS2 enzyme (prostaglandin endoperoxide synthase-2, also known as cox-2, Genbank entry D28235). If an original read generated in an NGS experiment of the latter gene were: ATTATTAAATTATCAAAAAGAAAATGATCCACGCTCTTAG (lengh 40) its first five suffixes would be:
ATTATTAAATTATCAAAAAGAAAATGATCCACGCTCTTAG (length 40) TTATTAAATTATCAAAAAGAAAATGATCCACGCTCTTAG (length 39) TATTAAATTATCAAAAAGAAAATGATCCACGCTCTTAG (length 38)
ATTAAATTATCAAAAAGAAAATGATCCACGCTCTTAG (length 37)
TTAAATTATCAAAAAGAAAATGATCCACGCTCTTAG (length 36)
The term "suffix structure" as used herein refers to a structure for string storage and compression that allows computational comparison and matching of sequences in a highly efficient manner. Suffix structures that can be used in the method of the invention are, for instance, suffix trees and suffix arrays.
The term "suffix tree" also known as "suffix trie" as used herein refers to a tree-like data structure that is constructed to enable fast string matching. It allows the convenient storage of all substrings in a given string, that is, in terms of the present invention, the storage of all nucleotide sub-sequences (suffixes) in a given nucleotide sequence (of a read). For an example of a suffix tree, see FIG.1 where a suffix tree is constructed for the term banana. The term "quaternary suffix tree" as used herein refers to a suffix tree built with the 4-letter vocabulary of nucleic acids (G,A,T/U,C). The suffix tree has a root, which is the 0 position from which the counting of the nucleotides begins. The term "suffix array" as used herein is an array of suffixes of a certain string of characters. Just like suffix trees, it is used for storage and compression of character strings, and it also allows to quickly perform string matching. The way a "suffix array" is constructed is: Consider the text i?=banana$ to be indexed:
i 1234567
b a n a n a $
The text ends with the special sentinel letter $ that is unique and
lexicographically smaller than any other character. The text has the following suffixes:
Suffix I
banana$ 1
anana$ 2
nana$ 3
ana$ 4
na$ 5
a$ 6
$ 7
These suffixes can be sorted:
Suffix i
$ 7
a$ 6
ana$ 4
anana$ 2
banana$ 1
na$ 5
nana$ 3
The suffix array A contains the starting positions of these sorted suffixes: i 1234567
Figure imgf000015_0001
Then, the complete array with suffixes itself:
i 1234567
7642153
1 $aaabnn
2 $ n n a a a
3 a a n $ n 4 $ n a a
5 a n $
6 $ a
7 $
So for example, -^[3] contains the value 4, and therefore refers to the suffix starting at position 4 within S, which is the suffix ana$.
Suffix arrays are closely related to suffix trees. Suffix arrays can be
constructed by performing a depth-first traversal of a suffix tree. The suffix array corresponds to the leaf-labels given in the order in which these are visited during the traversal, if edges are visited in the lexicographical order of their first character. A suffix tree can be constructed in linear time by using a combination of suffix and LCP array. It has been shown that every suffix tree algorithm can be systematically replaced with an algorithm that uses a suffix array enhanced with additional information (such as the LCP array) and solves the same problem in the same time complexity. Advantages of suffix arrays over suffix trees include improved space requirements, simpler linear time construction algorithms and improved cache locality. The term "LCP array" or "longest common prefix array" is an auxiliary array for the "suffix array". It stores the lengths of the longest common prefixes between pairs of consecutive suffixes in the sorted suffix array. In other words, it is the length of prefix that is common between the two consecutive suffixes in a sorted suffix array. For the suffix array explained above
(banana), the LCP array H is constructed by comparing lexicographically consecutive suffixes to determine their longest common prefix:
i 1 2 3 4 5 6 7
± 0 1 3 0 0 2
So, for example, H[4]=3 is the length of the longest common prefix ana shared by the suffixes ^[3] = S[4, 7] = aj¾i$and
,4 [4] = S[2, 7} = anana$ Note that ^[1] = _L Sjnce there is no lexicographically smaller suffix.
The term "prefix" as used herein, refers to the first part of a sequencing read, that is, from position 1 to a given position depending on the context. This term is used here, as it is used in a grammatical context referring to words.
The term "breakpoint" as used herein refers to the sequence immediately flanking a sequence variant. For SVs, a breakpoint is the point where the DNA broke in the second state and appears as a change in the first state compared to the second control state. In other words, where the continuity of the sequence of the control second state breaks in the first state.
The term "node" as used herein refers to any given position (nucleotide) in the tree. Relevant nodes in this invention are those that determine a change in the sequence of the second state compared to the first state (breakpoint nodes). In FIG. 2 it is represented a node where healthy and mutated reads take different paths because they contain a different sequence. The term "block" or "breakpoint block" as used herein refers to all the reads derived from the two genomes compared, derived from the same position and including a variation in first state read. A breakpoint block (FIG. 3). is composed of aligned reads derived from the sequencing of all four alleles involved (two coming from the first state genome and the other two derived from the second state genome). Normally, in the case of heterozygous variation, only the reads derived from the altered allele will contain the mutation or variant.
The term "ambiguous path" as used herein refers to multiple possible sequence solutions in a given tree. It is referred here as the opposite of unique and unambiguous path or sequence.
As is revealed in Paszkiewicz K., et.al. "De novo assembly of short sequence reads" Brief Bioinform. 2010, vol. 1 1 , pp. 475-472, the approximate minimum length that NGS reads must have in order to be able to reconstruct a genome is around 30. Bearing in mind the latter, a minimum length of approximately 30 bases was taken to be the minimum length of a productive read. 30 is a value that was found to be a viable cutoff for variable X2 in the definition of the method of the invention, although slightly smaller values might be viable as well.
As it has been cited above, the first aspect of the present invention is a computational method for the identification of nucleic acid variants between two genomic states comprising the steps of:
A) Inputting 2 sets of nucleic acid reads, which are sequences retrieved from a sequencing method, wherein the first set of reads corresponds to cells representing a first test state, and the second set of reads corresponds to cells representing a second control state; B) Filtering the reads, wherein the filtering comprises: B1 ) Keeping only the reads with at least a percentage X1 of their bases with a phred quality score higher than 20, being X1 equal or above 90%; B2) Splitting the reads with an undefined nucleotide, giving one sequence before, and one sequence after the undefined nucleotide, the latter being discarded; and B3) Discarding the sequence reads with less than X2 bases, wherein X2 is from 25 to 40; C) Generating a suffix structure comprising: C1 ) Generating a number of N-X2 new reads for each read of sequence length N, wherein the new N-X2 reads correspond to all suffixes with a length larger than X2 nucleotides and derived from each read of sequence length N, being a suffix the complete sequence of a sequenced read or a sub-sequence derived from the latter when one nucleotide at the 5' terminal end is discarded; and C2) Building a suffix structure, which includes all the suffixes generated in step C1 ); D) Detecting variants in the sequence between the reads of the first state and the reads of the second state by annotating breakpoint nodes, wherein a breakpoint node is a point in the suffix structure that determines a change in the sequence of the second state compared to the first state, said breakpoint nodes fulfilling all the following requirements: D1 ) The node must be at least X2 positions from the root of the suffix structure; D2) The node must have at least X3 reads with the same variation between first and second states, being X3 at least 2; D3) The percentage of first state reads in second state reads is not over a threshold X4, to account for possible contamination of control state reads with test state reads, wherein X4 is at least 5%; and D4) The percentage of second state reads in first state reads is not over a threshold X5, to account for possible contamination of test state reads with control state reads, wherein X5 is at least 20%; E) Assembling and filtering test and control reads derived from all breakpoint nodes accepted in step D, wherein: E1 ) Variants whose prefixes share more than X2 nucleotides are merged to give a block; and E2) If the nucleic acid whose variants are being identified is a double stranded DNA, then variants that do not have blocks in both forward and reverse sequence strands are discarded. In a particular embodiment of the first aspect of the invention, the
computational method further comprises the step of: F) Cataloguing and annotating blocks according to the following: F1 ) If blocks between the first and second states only differ in one substituted nucleotide, the variant is catalogued as containing a single nucleotide variant and the single nucleotide variant is annotated; F2) If blocks between the first and second states differ in more than one nucleotide but the whole difference in sequence is contained within the block, the variant is catalogued as a small structural variant, and the small structural variant is annotated; and F3) If blocks between the first and second states differ in more than one nucleotide and the whole difference in sequence is not contained within the block, the variant is catalogued as a large structural variant, and the boundaries of all large structural variants are extended by retrieving suffixes overlapping at least X2 nucleotides in an iterative process which ends when the extended sequence reaches 200 nucleotides or when an ambiguous path is found.
In a particular embodiment of the first aspect of the invention, the method further comprises optionally mapping second state blocks, and subsequently mapping first state blocks, on a reference genome.
In a particular embodiment of the first aspect of the invention, the suffix structure is a suffix tree or a suffix array. In a particular embodiment of the first aspect of the invention, optionally in combination with any embodiment above or below, X1 is equal or above 95%.
In a particular embodiment of the first aspect of the invention, optionally in combination with any embodiment above or below, X1 is equal or above 99%.
In a particular embodiment of the first aspect of the invention, optionally in combination with any embodiment above or below, X2 is from 30 to 35.
In a particular embodiment of the first aspect of the invention, optionally in combination with any embodiment above or below, X2 is from 30 to 32.
In a particular embodiment of the first aspect of the invention, optionally in combination with any embodiment above or below, X2 is equal to 30.
In a particular embodiment of the first aspect of the invention, optionally in combination with any embodiment above or below, X3 is equal or above 4.
In a particular embodiment of the first aspect of the invention, optionally in combination with any embodiment above or below, X3 is equal or above 6.
In a particular embodiment of the first aspect of the invention, optionally in combination with any embodiment above or below, X3 is equal or above 8.
In a particular embodiment of the first aspect of the invention, X3 is directly proportional to the depth of coverage in the sequencing experiment. This means that, the deeper the coverage, the more restrictive (higher) is the value for X3 (the more reads with the same variation between first and second states must be present for the node to be accepted as a breakpoint node).
In a particular embodiment of the first aspect of the invention, optionally in combination with any embodiment above or below, X4 is between 5-10%.
In a particular embodiment of the first aspect of the invention, optionally in combination with any embodiment above or below, X4 is between 5-7%.
In a particular embodiment of the first aspect of the invention, optionally in combination with any embodiment above or below, X4 is 5%.
In a particular embodiment of the first aspect of the invention, optionally in combination with any embodiment above or below, X5 is between 20-25%. In a particular embodiment of the first aspect of the invention, optionally in combination with any embodiment above or below, X5 is 20%.
In a particular embodiment of the first aspect of the invention, the first set of reads corresponds to pathological cells of a patient, and the second set of reads corresponds to non-pathological cells of the same patient;
In another particular embodiment of the first aspect of the invention, the first set of reads corresponds to cancer cells of a patient, and the second set of reads corresponds to non-cancer cells of the same patient.
In another particular embodiment of the first aspect of the invention, the first set of reads corresponds to virus-infected cells of a patient, and the second set of reads corresponds to non-infected cells of the same patient.
In another particular embodiment of the first aspect of the invention, the first set of reads and the second set of reads correspond to the same cell of the same patient in two different developmental stages.
In another particular embodiment of the first aspect of the invention, the first set of reads corresponds to cells of a patient which have been exposed to a drug, and the second set of reads corresponds to cells of the same patient which have not been exposed to a drug.
In a particular embodiment of the first aspect of the invention, the first set of reads corresponds to cells of a tissue, and the second set of reads
corresponds to cells of another tissue of the same or a different individual;
Although the present invention has been described in detail for purpose of illustration, it is understood that such detail is solely for that purpose, and variations can be made therein by those skilled in the art without departing from the scope of the invention.
Thus, while the preferred embodiments of the methods and of the systems have been described in reference to the environment in which they were developed, they are merely illustrative of the principles of the invention. Other embodiments and configurations may be devised without departing from the scope of the appended claims.
Further, although the embodiments of the invention comprise processes performed in computer apparatus, the invention also extends to computer apparatus and to computer programs, particularly computer programs on or in a carrier, adapted for putting the invention into practice.
Accordingly, it also forms part of the invention a computer program product comprising program instructions for causing a computer system to perform the method for the identification of nucleic acid variants between two genomic states as defined above. In a preferred embodiment, the computer program product is embodied on a storage medium.
In another preferred embodiment, the computer program product is carried on a carrier signal. The carrier may be any entity or device capable of carrying the program.
Throughout the description and claims the word "comprise" and variations of the word, are not intended to exclude other technical features, additives, components, or steps. Furthermore, the word "comprise" and its variations encompasses the term "consisting of. Additional objects, advantages and features of the invention will become apparent to those skilled in the art upon examination of the description or may be learned by practice of the invention. The following examples are provided by way of illustration, and they are not intended to be limiting of the present invention. Furthermore, the present invention covers all possible combinations of particular and preferred embodiments described herein.
EXAMPLES Examples of the usage of the method of the invention for the detection of variants in a test in silico case and two real aggressive forms of solid and haematological tumors are given below. In these examples, the first state reads are derived from tumor cells and the second state reads from non- tumor control cells.
The goal in the case of the two aggressive forms of tumors was to test the method of the invention for the detection of somatic mutations accumulated in cancer cells as compared to normal (non-tumor) cells, and to assess how the method compares to other methods found in the state of the art.
Currently, the identification of somatic variants associated with cancer typically implies the sequencing of tumor and normal genome samples from the same patient, followed by multiple sequence comparisons. This process implies the alignment of normal and pathological reads to a reference genome, identifying sequence changes and finally isolating the somatic fraction of variants (i.e. those detected only in the tumor cell reads).
As explained above, the determination of all somatic variants (SNVs and SVs) for a given tumor cell still requires complex computational pipelines with combinations of different methods, each of which is restricted to the detection of a particular type, either SNVs or SVs of particular sizes and types. For example, widely used programs, such as BreakDancer (Chen K., et.al.
"Breakdancer: an algorithm for high-resolution mapping of genomic structural variation" Nat. Methods 2009, vol. 6, pp. 677-681 ) or Delly (Rausch T., et.al. "Delly:structural variant discovery by integrated paired-end and split-read analysis" Bioinformatics 2012, vol. 28, pp. Ϊ333-Ϊ339), can identify SVs larger than 20 and 150 base pairs, respectively. Pindel (Ye K., et.al. "Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads" Bioinformatics 2009, vol. 25, pp. 2865-2871 ) is another widely used program. While these methods show high sensitivity and specificity values, the identification of large structural variants, as stated above, still remains an important challenge. Unfortunately, the need for complex scoring and filtering schemes present in current strategies to achieve acceptable levels of specificity also implies lowering drastically the sensitivity, leaving an important fraction of structural variants undetected. Both in the in silico test and also in the two cases of real tumors, it is revealed that the method of the invention is capable of identifying SNVs and SVs of all sorts and sizes in a single run with great accuracy. Remarkably, the method of the invention is even proven to be capable of identifying breakpoints associated with complex chromosomal rearrangements (chromotripsis and chromoplexy) in two forms of aggressive cancers.
Material and methods
An implementation of the Method
The general structure and the complete variant identification and
characterization carried out by the method of the invention comprise the steps outlined below: A) Input data.
As input, the method takes high quality sequencing data directly from FASTQ files of tumor and normal samples of the same individual. Alternatively, it is also able to accept BAM files, from which it extracts all the sequencing reads.
B) and C) Filtering of reads and generating a suffix structure.
When inputting the data, the user can define a cutoff, so that reads having over a certain threshold of their bases with a phred quality score < q20 are discarded. X1 =90 has been found to be especially suited for the purposes tested. This means that only reads with at least 90% of their bases with a phred quality score higher than 20 are kept.
The next step implies the building of a suffix structure. This structure can, for instance, be a suffix tree or a suffix array. In the version of the method that operates by constructing a suffix tree, a quaternary suffix tree (or "quad-tree") structure is first generated using all high quality normal and tumor reads (see FIG.1 and FIG.2 for a simplified version). All these sequences are
sequentially loaded into the tree on the basis of their sequence. Each node of the tree has, at most, 4 branches, each one representing one of the four nucleotides. To avoid sequence ambiguity derived from the complexity of the genome, only fragments of at least X2 base pairs are inserted into the tree. The empirical value for X2 that has been mostly used in the tests disclosed is 30. This means that all reads with less than 30 bases are discarded. In the case of the presence of undefined base pairs ("N"), these are removed and the original sequence is split forming new shorter reads, which are inserted into the tree only if they are longer than 30 (X2) base pairs. Each of sequences accepted is inserted into the tree (both in forward and reverse), from the root, in original form (i.e. starting from nucleotide 1 to the end of the read), together with all derived suffixes larger than 30 bp (recursively starting from nucleotide 2 to the end, 3 to the end, etc., as defined in the definitions).
Since subsequent searches through the tree start from the root, the presence of read suffixes allows a rapid identification of particular sequences and reads. D) Detecting nodes of the suffix structure containing variants.
Once all the sequences and derived suffixes are loaded into the tree, the next step consists in identifying all tumor specific reads. Because it was expected that variants generate new and distinct sequences in the mutated genome (first state) compared to the non-mutated control genome (second state), the method first searches and collects sequences (reads) that are only present in the tumor sample. These sequences are identified from the tree, as nodes and branches with an unbalanced representation of Normal (Count Normal Reads - CNR) and Tumor (Count Tumor Reads - CTR) reads . It was expected that nodes or branches covering a variation in the tumor sequence (first state) will theoretically have no representation in normal reads (second state). To favour this condition it is started to search the tree from the level 30 (X2) towards the leafs. This means that, starting from the root, the first 30 levels of depth in the suffix tree are not taken into account. Only nodes and branches that have a difference between first and second states of at least X3 reads are accepted. The value of X3 that has been empirically found to be most suitable is 4. This last point means that the difference between the first state and the second state must be repeated by suffixes derived from at least 4 different reads for the node to be taken as a variant between both states. With this strategy, it is expected that most differences between first and second state that derive from sequencing errors will be eliminated, and only those differences that are repeated a certain number of times (4 in this case) between both states are kept.
Additionally, nodes or branches with a CNR to CTR ratio below a certain threshold (X4) are selected. This threshold can be adjusted by the user to account for expected levels of contamination of tumor cell reads among normal cell reads, i.e. reads of state one contaminating the sample of state two. This inconvenience may happen if a small group of cancer cells contaminate the sample of non-tumor cells, which is not uncommon. Thus, X4=0 implies no expected contamination, i.e. no acceptance of reads coming from the normal sample (CNR) on that candidate variant node or branch, which implies lower final sensitivity but higher specificity. On the other hand, a X4 larger than 0 always results in a higher sensitivity, but at the cost of lower specificity. X4 was set to 0 for the in silico analysis and to 0.05 for the real tumor samples analyzed here, where it was assumed a maximum of 5% contamination of tumor reads in the control sample.
A final condition is imposed, which works as a reverse of X4. A certain percentage of second state (control) reads can be present among first state (test, cancer) reads, which happens when genomes of state one contaminate state two (for example, it is common in cancer, where tumor cells can contaminate the normal sample during the extraction from the patient). This can also be corrected by only accepting nodes which are not over a threshold X5.
E) Assembling and filtering test and control reads.
After all detectable tumor specific reads have been identified; the next step consists in grouping those that are suspected to cover the same variant. For this, candidate sequences are organized by identity: two sequences belong to the same group if they overlap at least X2 base pairs. X2 was set to 30 in the examples disclosed. Reverse complementary sequences are also evaluated during this grouping in order to be able to cover the variant in both
orientations. Sequence blocks (groups) with sequences in only one of the orientations (or with less than 4 tumor reads as explained above) are discarded. Once these groups are generated, it was interrogated the tree, also on the 30 base pair-overlap basis, to extract the normal (non-mutated) reads of the same region and add them to the block.
Ideally, each block will represent a region in the genome containing the mutated and the non-mutated version (see a detailed example of a breakpoint block in FIG.3). In order to classify and characterize the type of variation identified, there were extracted the consensus mutated and normal sequences from these blocks. The corresponding normal consensus sequence can be used at the end of the procedure and mapped onto a reference genome to obtain the coordinates of the variant.
Optionally, the method also includes step F:
F) Cataloguing and annotating blocks.
Once all possible breakpoint blocks are defined, the next step consists in identifying and classifying the variation included there. Normal and Tumor consensus sequences derived from these blocks are recursively compared to identify differences. A fist evaluation will search for small variants, which consist of those that are completely included within the consensus sequences (SNV and small SVs: insertions, deletions and inversions). All the blocks that do not match this criterion are then considered candidates for large SVs, i.e. likely to cover break points of intra or interchromosomal transitions, part of large deletions, insertions or inversions . In this case, each tumor consensus sequence is extended on both ends by interrogating the tree for unambiguous tumor reads that overlap at least X2 (30 in this case) base pairs with the tumor consensus, reconstructing a (maximum) of 200 base pair region around the break and allowing the detection of newly generated sequence at the point of the break.
After small and large somatic variants are defined, inventors identify the coordinates of the changes by mapping onto a reference genome the normal consensus sequences corresponding to each of the variants, avoiding potential mapping conflicts derived from the presence of the variant, as usually happens when using reference-based approaches. Sequences mapping (with the same score) to several positions in the genome are discarded. The calibration and default parameters for the method were adjusted using a high quality set of -1000 SNVs identified with the Sidron software in a CLL sample (Puente X.S. "Whole-genome sequencing identifies recurrent mutations in chronic lymphocytic leukaemia" Nature 201 1 , vol. 475, pp. 101 -105).
Construction of the in silico genome.
A personalized genome was simulated using the hg19 reference genome downloaded from UCSC (with no repeat-masking) (http://www.ucsc.edu ), and modifying it to match a randomly chosen human haplotype from 1000 genome database. These 7,194,026 variants consist of 4,745,917 SNPs and
2,447,367 deletions. The catalog of somatic variants further added to this personalized genome includes 8240 SNVs (more than 100bp apart), 20 known tumor translocations (Egan, J.B. et al. "Whole genome analyses of a well -differentiated liposarcoma reveals novel SYT1 and DDR2
rearrangements" PloS one 2014, vol. 9, pp. e871 13; Love, C. et al. "The genetic landscape of mutations in Burkitt lymphoma" Nature genetics 2012, vol. 44, pp. 1321 -1325; Ma, Y. et al. "Developmental timing of mutations revealed by whole-genome sequencing of twins with acute lymphoblastic leukemia" Proc. Natl. Acad. Sci. 2013, vol. 1 10, pp. 7429-7433;
Papaemmanuil, E. et al. "RAG-mediated recombination is the predominant driver of oncogenic rearrangement in ETV6-RUNX1 acute lymphoblastic leukemia" Nature genetics 2014 vol. 46, pp. 1 16-125; Richter, J. et al.
"Recurrent mutation of the ID3 gene in Burkitt lymphoma identified by integrated genome, exome and transcriptome sequencing" Nature genetics 2012, vol. 44, pp. 1316-1320; Teles Alves, I. et al. "Next-generation
sequencing reveals novel rare fusion events with functional implication in prostate cancer" Oncogene 2014, Epub, ahead of print), 715 random
Insertions, 738 random Deletions and 345 random Inversions, all ranging from 1 bp to 100MBp. In silico sequencing was simulated using ART lllumina (Huang W. et.al. "ART: a next-generation sequencing read simulator"
Bioinformatics 2012, vol. 28, pp. 593-594). For this, inventors first generated a profile using the M004 sample to extract parameters, like sequence variation or read length. Inventors then run the method at different depths of coverage, using the resulting parameters and the default error rate (0.00009).
Analysis of the in silico (simulated) genome with external methods
Each of the external methods tested for the comparison with the method of the invention was run on pooled libraries (normal and tumor) using default settings except for the following parameters: BreakDancer was run with -q 10 (mapping quality) and score cutoff of >80 as described before (Chen, K. et.al. ibid; Young, M.A. et al. "Background mutations in parental cells account for most of the genetic heterogeneity of induced pluripotent stem cells" Cell stem cell 2012, vol. 10, pp. 570-582); Pindel's results with less than 5 supporting reads were not considered as recommended elsewhere to increase specificity (Jones, D.T. et al. "Recurrent somatic alterations of FGFR1 and NTRK2 in pilocytic astrocytoma" Nature genetics 2013, vol. 45, pp. 927-932).
Predictions obtained with Delly were rejected if the number of supporting reads were less than 3 and the mapping quality below 20
(http://www.embl.de/~rausch/delly.html). For Breakdancer, Pindel and Delly, somatic variants were obtained by filtering out all the SVs found in both normal and tumor libraries: inventors only kept those SVs with no unique supporting reads from the normal library. CREST (Want J., et.al. "CREST maps somatic structural variation in cancer genomes with base-pair resolution" Nat. Methods 201 1 , vol. 12, pp. 652-654) and Mutect (Cibulskis
K., et.al. "Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples" Nat. Biotech. 2013, vol. 31 , pp. 213-219) already provided somatic variants as direct results. BreakDancer and Pindel were used as complementary methods covering large and small SVs, respectively, as advised by the developers.
Datasets M003, M004 and MB1 were obtained with informed consent and an ethical vote (Institutional Review Board) following ICGC guidelines (www.icgc.org). M003, M004 and MB1 were accessed through the European Genome- phenome Archive (EGA, https://www.ebi.ac.uk/ega/) under access numbers EGAS00001000510 and EGAS00001000085.
Identification and analysis of variant genes
Variant genes in tumor samples were identified by analyzing all the changes identified with ANNOVAR (Wang K., et.al. "ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data" Nucleic acids research 2010, vol. 38, pp. e164). The analysis of the resulting genes potentially modified at coding or splicing level were further analyzed with Intogen (Gonzalez-Perez, A. et al. "IntOGen-mutations identifies cancer drivers across tumor types" Nature methods 2013, vol. 10, pp. 1081 -1082) in order to infer their potential role in oncogenesis.
Experimental Verification of variants
PCR primers were designed on sequence blocks of 2000bp around the target variant using Primer 3 (http://frodo.wi.mit.edu/primer3) (Untergasser, A. et al. "Primer3--new capabilities and interfaces". Nucleic acids research 2012, vol. 40, pp. e1 15). PCR reactions were performed for tumor and control samples. Each target locus was amplified using 50 ng of DNA. The amplification was performed using Qiagen Multiplex PCR Kit (Qiagen), and the reaction mix contained 2x QIAGEN Multiplex PCR Master Mix, 10x primer mix (2μΜ of each primer) and RNase-free water until a total reaction volume of 25μΙ. PCR conditions were as follows: 96°C 10', 2 cycles of 96°C 30"-60°C 30"-72°C 1 '30", 2 cycles of 96°C 30"-58°C 30"-72°C 1 '30", 2 cycles of 96°C 30"-56°C 30"-72°C 1 '30", 35 cycles of 96°C 30"-54°C 30"-72°C 1 '30", and 70°C 10'. All the PCR products have been run in a capillary electrophoresis gel (QIAxcel Advenced System, Qiagen) with the QIAxcel DNA screening kit (Qiagen), and the multiband PCR products were purified using NucleoSpin Gel and PCR Clean-up (Mercherey-Nagel). Regarding the Sanger sequencing, PCR products were cleaned using ExoSAP-IT (USB) and sequenced using ABI Prism BigDye terminator v3.1 (Applied Biosystems) with 5 pmol of each primer. Sequencing reactions were run on an ABI-3730 Sanger sequencing platform (Applied Biosystems). Sequences were examined with the Mutation Surveyor DNA Variant Analysis Software (Softgenetics). G-bandinq, FISH and M-FISH analysis
Conventional cytogenetics, was performed on Giemsa-banded chromosomes (G-banding) obtained after a 72-hour culture and stimulation with
tetradecanoyl-phorbol-acetate. Results of the ten metaphases analyzed were described according to the International System for Human Cytogenetic Nomenclature. (REF: ISCN 2013: Karger Medical and Scientific Publishers Editors. Lisa G Shaffer). FISH studies for the presence of the t(1 1 ;14) translocation and 17p deletions were performed using Vysis LSI IGH/CCND1 Dual Color Dual Fusion and Vysis LSI TP53 (17p13.1 )(Abbott Molecular, Des Plaines, IL) on fixed cells according to the manufacturer's specifications. Two hundred nuclei were examined for each probe. To identify the chromosomes involved in marker chromosomes and to disclose other possible structural balanced abnormalities, inventors performed 24-color karyotyping using 24XCyte human multicolor FISH (mFISH) probe kit according to
manufacturer's instructions (MetaSystems, Altlussheim, Germany) consisting of 24 different chromosome painting probes (combinatorial labeling). Image capture was done with Nikon Eclipse 50i equipped with a CCD-camera (CoolCubel , MetaSystems) and appropriate filters using Isis software.
Karyotyping was done using the 24-color mFISH upgrade package.
Additionally, whole chromosomal paintings (WCP) of chromosome 4
(Spectrum green) and 12 (spectrum orange) were performed simultaneously.
Results
As explained above, to assess the performance of the method of the invention, inventors measured both the fraction of somatic variants detected (sensitivity) and the precision of this detection (specificity) using both simulated and real cancer genome data together with orthogonal
experimental techniques.
In silico validation and comparison with other methods
To be able to measure the sensitivity and specificity of the method of the invention in a unique controlled context, inventors generated normal and tumor test genomes by first applying a random human haplotype (Untergasser et.al., ibid) and a predesigned catalog of somatic variants to the human reference genome, to then simulate whole genome sequencing at different depths of coverage. To assess for the applicability of the method in the current context of cancer genome analysis, it was compared its performance with existing strategies for somatic variant calling. For this, it was selected a
Figure imgf000031_0001
representative set of available protocols that are common parts of current pipelines for the analysis of tumor genomes: Mutect for SNVs, and
BreakDancer, Pindel, Delly, and CREST for SVs of different sizes (Table 1 , below). For the present comparison, they were run as described in their corresponding publication or website.
Figure imgf000031_0002
Table 1 . Scope of detection in the methods used in the in silico analysis.
Inventors first observed that the calling of somatic SNVs is nearly optimal and within the same range between Mutect and the method of the invention (which is here named as "Smufin"), with sensitivities of 97% and 92%, and
specificities of 93% and 99%, respectively (Tables 2 and 3, below).
Table 2. In silico assessment of variant calling.
1 Variants are distributed as follows: 8240 SNV and 1798 SVs (738 deletions, 715 insertions and 345 inversions). The table shows the number of breakpoints that define SVs.
2 Variants that fall into the rang of detection for each of the methods.
3 Performance values obtained counting only variants within the detection range of each of the methods. See Table 3 for comparison against the
Figure imgf000032_0001
complete SV catalog.
4 Expressed as average "distance +/- SD" from the breakpoint position.
5 According to the original publication, CREST has no size limit in detection. Nevertheless, among the predictions obtained, none was below 20nt.
Table 3. Sensitivity of SV calling considering all SVs introduced (1798)
On the other hand, the calling efficiency of somatic structural variants varied greatly between different methods, revealing clear differences when
compared to the method developed by the inventors. Some methods reached reasonable levels of sensitivity when the evaluation was restricted to their range of SV detection (Pindel and Delly), but these dropped drastically when compared against the complete catalog of structural variations in the tumor (Table 3, above). By contrast, the method of the invention was able to identify somatic SVs with a sensitivity of 74% independently of the size of the SV, reaching >90% sensitivity when only SVs larger than the read size were taken into account. The sensitivity of the method for somatic SNV and SV calling is actually similar to that resulting from the combination of all the above indicated methods together: 94% versus 89% for the method of the invention. The downside of combining these methods as a strategy for variant calling is actually the low levels of specificities achieved. In fact, in terms of specificity, the values for the external SV callers oscillated between 29% and 77%, whereas the method of the invention (here called Smufin) reached values of 91 % across all SVs. Detection of small somatic variants in human tumors
To further investigate the performance of the method of the invention with real data, it was also calculated and assessed the positive discovery rate on somatic SNVs and SVs calling using whole genome sequence (WGS) data from primary tumor and matched non-tumor samples. It was first tested the detection of "small" variants by analyzing a previously described sample (M004) of mantle cell lymphoma (MCL) (Bea, S., et.al. "Landscape of somatic mutations and clonal evolution in mantle cell lymphoma" Proc. Natl. Acad. Sci. of the USA 2013, vol. 1 10, pp. 18250-18255), an aggressive subtype of lymphoid neoplasia. The method was able to identify 4409 somatic SNVs and 1094 small SVs in this tumor genome. To evaluate the specificity of the method of the invention, a random set of 1 1 1 of these somatic calls were amplified and verified by Sanger sequencing using the same DNA used for whole genome sequencing. This process allowed us to verify >94% of SNVs (76 of 81 ) and >80% of SVs (28 of 35) identified by our method. These specificity rates are in agreement with the corresponding values obtained from the in silico analysis.
Identification of complex structural variation in aggressive tumors
It was next evaluated accuracy of the method of the invention in detecting large SVs involving the somatic insertion, deletion, inversion or translocation of DNA fragments from one hundred base pairs to megabases long. For this test, it was analyzed the WGS of another MCL (M003) and a pediatric form of a medulloblastoma (MB1 ), both known to present complex landscapes of chromosomal rearrangements (Bea, S., et.al, ibid; Rausch, T. et.al. "Genome sequencing of pediatric medulloblastoma links catastrophic DNA
rearrangements with TP53 mutations" Cell 2012, vol. 148, pp. 59-71 ).
Because these representative examples correspond to a hematological and a solid tumor, each sequenced in a different sequencing facility, this analysis also measures the consistency of the method of the invention across different types of data. Identification of Chromothripsis
MB1 was previously described as presenting chromothripsis, a complex structural alteration of the genome hypothesized to arise from a single catastrophic event that generates multiple breakpoints, often affecting a single chromosome. In this tumor sample, the method of the invention uncovered a total of 102 breakpoints corresponding to "large" SVs (i.e.
beyond the read size), covering 85 intra and 17 interchromosomal
translocations. From the assessment of a random set of 39 of these breaks through PCR amplification and Sanger sequencing, it could positively be verified 36 (92%). Among all the breakpoints detected, 25 were found to agree with the intervals of chromosomal translocations that previously let defining chromothripsis in this tumor, including 3 of the 4 verified at base pair resolution. In addition, it could be identified 65 novel breakpoints in the same tumor, comprising 53 intra and 12 interchromosomal translocations. From a subset of 37 of these translocations (16 intra- and 1 1 interchromosomal), 25 were verified (92.5%). Together with the clusters of breakpoints already pointed for chromosomes 3 and 4 in this tumor, new calls uncovered by the method of the invention let us define a third damaged region in chromosome 1 1 , with a density of 6 DNA breaks per Mb (between positions 39 and 45Mb). Interestingly, a high number of these breakpoints are chained in the tumor with breaks in chromosome 17. Furthermore, and complementary to the previous functional characterization of this tumor, novel genes were found to be affected by the novel breakpoints found in this tumor by the method, including some that have been previously associated with several types of tumors, such as NCOR-1, SIN3P, WDR52 and PALLD. This fact underscores the applicability of reference-free approaches to identify somatic
rearrangements in cancer not detected by traditional current strategies.
Identification of Chromoplexy
It was also analyzed an aggressive form of MCL (M003), previously described to have undergone complex chromosomal rearrangements. In this tumor, the method could identify a total of 30 breakpoints corresponding to large SVs. Using PCR amplification followed by Sanger sequencing, inventors verified
19 of the 22 breakpoints tested, involving 7 intra and 15 interchromosomal translocations. This not only confirms the correct location and the type of translocation identified, but it also shows that the method of the invention is able to reconstruct the correct sequence around the variants, as 5 of the breakpoints presented additional DNA stretches between 5 and 30 nt long.
It was next evaluated whether the method is able to define the karyotype of this tumor. For this, all 30 breakpoints identified were crossed with 18 non- centromeric and non-telomeric regions of chromosomal imbalances previously detected using Affymetrix SNP6.0 array (Affymetrix, Santa Clara, CA) (Bea, S., et.al, ibid). It was seen that the method could redefine, at base pair resolution, 16 of these 18 regions. By puzzling all the translocations detected, the landscape of this genome could be modelled, which included three derivative chromosomes formed by combinations of large fragments of chromosomes 3 and 12, with smaller parts of 4. These chimeric chromosomes were experimentally confirmed in the MCL cells by a combination of multicolor FISH and whole chromosome painting analysis. Furthermore, the resolution provided by the method of the invention allowed the identification of the fragmentation and fusion of genes not previously described in this sample. For example, it was found that these translocations cause the fusion of ANK2 and SOX5 genes. Interestingly, these two rearrangement events do not appear to be independent as the corresponding fragments generated after the double strand break are rejoined again reciprocally, i.e. generating both, 12 to 4 and 4 to 12 translocations and two different forms of ANK2-SOX5 fusions. In fact, 8 out of the 18 breakpoints, appear to be rejoined reciprocally, as recently described in prostate tumors (Baca, S.C. "Punctuated evolution of prostate cancer genomes" Cell 2013, vol. 153, pp. 666-667), suggesting an original organization of the chromatin where these regions were physically proximal and somehow interacting. A third translocation identified in the M003 tumor implies the breakage and putative inactivation of ARID2, a gene involved in chromatin remodeling.
Taken together, these results underscore the reliability of the method of the invention in the detection of nucleic acid variants of most types in a complex pathology such as cancer. The implementation of the method was able to detect all kinds of SNVs and SVs with great sensitivity and specificity, giving better results to what is obtained by a combination of different methods found in the art.
REFERENCES CITED IN THE APPLICATION Xuan J. et.al. "Next-generation sequencing in the clinic: Promises and challenges" Cancer Lett. 2013, vol. 340, pp. 284-295.
Isakov O. et.al. "Deep sequencing data analysis: Challenges and solutions" in Bioinformatics - Trends and Methodologies, Humana Press, USA, Nov. 201 1 .
Li, H., et.al. "Fast and accurate short read alignment with Burrows-Wheeler transform" Bioinformatics 2009, vol. 25, pp. 1754-1760. Medvedev P. "Computational methods for discovering structural variation with next-generation sequencing" Nat. Methods Suppl. 2009, vol. 6, S13-S20. Simpson JT. "Abyss: A parallel assembler for short read sequence data"
Genome Res. 2009, vol. 19, pp. 1 1 17-1 123.
Butler J. et.al. "Allpaths: de novo assembly of whole-genome shotgun microreads" Genome Res. 2008, vol. 18, pp. 810-820.
Puente X.S. "Whole-genome sequencing identifies recurrent mutations in chronic lymphocytic leukaemia" Nature 201 1 , vol. 475, pp. 101 -105
Egan, J.B. et al. "Whole genome analyses of a well-differentiated liposarcoma reveals novel SYT1 and DDR2 rearrangements" PloS one 2014, vol. 9, pp. e871 13
Love, C. et al. "The genetic landscape of mutations in Burkitt lymphoma" Nature genetics 2012, vol. 44, pp. 1321 -1325
Ma, Y. et al. "Developmental timing of mutations revealed by whole-genome sequencing of twins with acute lymphoblastic leukemia" Proc. Natl. Acad. Sci. 2013, vol. 1 10, pp. 7429-7433 Papaemmanuil, E. et al. "RAG-mediated recombination is the predominant driver of oncogenic rearrangement in ETV6-RUNX1 acute lymphoblastic leukemia" Nature genetics 2014 vol. 46, pp. 1 16-125
Richter, J. et al. "Recurrent mutation of the ID3 gene in Burkitt lymphoma identified by integrated genome, exome and transcriptome sequencing"
Nature genetics 2012, vol. 44, pp. 1316-1320
Teles Alves, I. et al. "Next-generation sequencing reveals novel rare fusion events with functional implication in prostate cancer" Oncogene 2014, Epub, ahead of print
Huang W. et.al. "ART: a next-generation sequencing read simulator" Bioinformatics 2012, vol. 28, pp. 593-594
Chen, K. et.al. ibid; Young, M.A. et al. "Background mutations in parental cells account for most of the genetic heterogeneity of induced pluripotent stem cells" Cell stem cell 2012, vol. 10, pp. 570-582
Jones, D.T. et al. "Recurrent somatic alterations of FGFR1 and NTRK2 in pilocytic astrocytoma" Nature genetics 2013, vol. 45, pp. 927-932
Want J., et.al. "CREST maps somatic structural variation in cancer genomes with base-pair resolution" Nat. Methods 201 1 , vol. 12, pp. 652-654
Cibulskis K., et.al. "Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples" Nat. Biotech. 2013, vol. 31 , pp. 213-219 Wang K., et.al. "ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data" Nucleic acids research 2010, vol. 38, pp. e164
Gonzalez-Perez, A. et al. "IntOGen-mutations identifies cancer drivers across tumor types" Nature methods 2013, vol. 10, pp. 1081 -1082
Untergasser, A. et al. "Primer3--new capabilities and interfaces". Nucleic acids research 2012, vol. 40, pp. e1 15 Bea, S., et.al. "Landscape of somatic mutations and clonal evolution in mantle cell lymphoma" Proc. Natl. Acad. Sci. of the USA 2013, vol. 1 10, pp. 18250- 18255
Rausch, T. et.al. "Genome sequencing of pediatric medulloblastoma links catastrophic DNA rearrangements with TP53 mutations" Cell 2012, vol. 148, pp. 59-71
Baca, S.C. "Punctuated evolution of prostate cancer genomes" Cell 2013, vol. 153, pp. 666-667
Patro R., et. al. "Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms" Nature Biotech. 2014, vol. 32, pp. 462-464
Nordstrom K.J.V, et.al. "Mutation identification by direct comparison of whole- genome sequencing data from mutat and wild-type individuals using k-mers" Nature Biotech. 2013, vol. 31 , pp. 325-331
Chen K., et.al. "Breakdancer: an algorithm for high-resolution mapping of genomic structural variation" Nat. Methods 2009, vol. 6, pp. 677-681 Rausch T., et.al. "Delly:structural variant discovery by integrated paired-end and split-read analysis" Bioinformatics 2012, vol. 28, pp. Ϊ333-Ϊ339
Ye K., et.al. "Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads"
Bioinformatics 2009, vol. 25, pp. 2865-2871
Paszkiewicz K., et.al. "De novo assembly of short sequence reads" Brief Bioinform. 2010, vol. 1 1 , pp. 475-472 Ding, L, et.al. "Expanding the computational toolbox for mining cancer genomes" Nat. Rev. Genet. 2014, vol. 15, pp. 556-570

Claims

1 . A computational method for the identification of nucleic acid variants between two genomic states comprising the steps of:
A) Inputting 2 sets of nucleic acid reads, which are sequences retrieved from a sequencing method, wherein the first set of reads corresponds to cells representing a first test state, and the second set of reads corresponds to cells representing a second control state;
B) Filtering the reads, wherein the filtering comprises:
B1 ) Keeping only the reads with at least a percentage X1 of their bases with a phred quality score higher than 20, being X1 equal or above 90%;
B2) Splitting the reads with an undefined nucleotide, giving one sequence before, and one sequence after the undefined nucleotide, the latter being discarded; and
B3) Discarding the sequence reads with less than X2 bases, wherein X2 is from 25 to 40;
C) Generating a suffix structure comprising:
C1 ) Generating a number of N-X2 new reads for each read of sequence length N, wherein the new N-X2 reads correspond to all suffixes with a length larger than X2 nucleotides and derived from each read of sequence length N, being a suffix the complete sequence of a sequenced read or a sub-sequence derived from the latter when one nucleotide at the 5' terminal end is discarded; and
C2) Building a suffix structure, which includes all the suffixes generated in step C1 );
D) Detecting variants in the sequence between the reads of the first state and the reads of the second state by annotating breakpoint nodes, wherein a breakpoint node is a point in the suffix structure that determines a change in the sequence of the second state compared to the first state, said breakpoint nodes fulfilling all the following requirements:
D1 ) The node must be at least X2 positions from the root of the suffix structure; D2) The node must have at least X3 reads with the same variation between first and second states, being X3 at least 2;
D3) The percentage of first state reads in second state reads is not over a threshold X4, to account for possible contamination of control state reads with test state reads, wherein X4 is at least 5%; and
D4) The percentage of second state reads in first state reads is not over a threshold X5, to account for possible contamination of test state reads with control state reads, wherein X5 is at least 20%; E) Assembling and filtering test and control reads derived from all breakpoint nodes accepted in step D, wherein:
E1 ) Variants whose prefixes share more than X2 nucleotides are merged to give a block; and
E2) If the nucleic acid whose variants are being identified is a double stranded DNA, then variants that do not have blocks in both forward and reverse sequence strands are discarded.
2. The computational method according to claim 1 , further comprising the step of:
F) Cataloguing and annotating blocks according to the following:
F1 ) If blocks between the first and second states only differ in one substituted nucleotide, the variant is catalogued as containing a single nucleotide variant and the single nucleotide variant is annotated;
F2) If blocks between the first and second states differ in more than one nucleotide but the whole difference in sequence is contained within the block, the variant is catalogued as a small structural variant, and the small structural variant is annotated; and
F3) If blocks between the first and second states differ in more than one nucleotide and the whole difference in sequence is not contained within the block, the variant is catalogued as a large structural variant, and the boundaries of all large structural variants are extended by retrieving suffixes overlapping at least X2 nucleotides in an iterative process which ends when the extended sequence reaches 200 nucleotides or when an ambiguous path is found.
3. The computational method according to any of the claims 1 to 2 further comprising optionally mapping second state blocks, and subsequently mapping first state blocks, on a reference genome.
4. The computational method according to any one of claims 1 to 3, wherein the suffix structure is a suffix array or a suffix tree.
5. The computational method according to any one of claims 1 to 4, wherein X2 is from 30 to 35.
6. The computational method according to any one of claims 1 to 5, wherein X3 is equal to or above 4.
7. The computational method according to any one of claims 1 to 5, wherein X3 is directly proportional to the depth of coverage in the sequencing experiment.
8. The computational method according to any one of claims 1 to 7, wherein threshold X4 is 5%.
9. The computational method according to any one of claims 1 to 8, wherein the threshold X5 is 20%.
10. The computational method according to any one of claims 1 to 9, wherein the first set of reads corresponds to pathological cells of a patient and the second set of reads corresponds to non-pathological cells of the same patient.
1 1 . A computer program product comprising program instructions for causing a computer system to perform the method for the identification of nucleic acid variants between two genomic states as defined in any of claims 1 to 10.
12. The computer program product according to claim 1 1 embodied on a storage medium.
13. The computer program product according to claim 1 1 carried on a carrier signal.
14. A system for the identification of nucleic acid variants between two genomic states comprising the steps of:
A) Computer/electronic means for Inputting 2 sets of nucleic acid reads, which are sequences retrieved from a sequencing method, wherein the first set of reads corresponds to cells representing a first test state, and the second set of reads corresponds to cells representing a second control state; B) Computer/electronic means for filtering the reads, wherein the filtering comprises:
B1 ) Keeping only the reads with at least a percentage X1 of their bases with a phred quality score higher than 20, being X1 equal or above 90%;
B2) Splitting the reads with an undefined nucleotide, giving one sequence before, and one sequence after the undefined nucleotide, the latter being discarded; and
B3) Discarding the sequence reads with less than X2 bases, wherein X2 is from 25 to 40;
C) Computer/electronic means for generating a suffix structure, wherein the generation of the suffix structure comprises:
C1 ) Generating a number of N-X2 new reads for each read of sequence length N, wherein the new N-X2 reads correspond to all suffixes with a length larger than X2 nucleotides and derived from each read of sequence length N, being a suffix the complete sequence of a sequenced read or a sub-sequence derived from the latter when one nucleotide at the 5' terminal end is discarded; and
C2) Building a suffix structure, which includes all the suffixes generated in step C1 );
D) Computer/electronic means for detecting variants in the sequence between the reads of the first state and the reads of the second state, said detection is performed by annotating breakpoint nodes, wherein a breakpoint node is a point in the suffix structure that determines a change in the sequence of the second state compared to the first state, said breakpoint nodes fulfilling all the following requirements: D1 ) The node must be at least X2 positions from the root of the suffix structure;
D2) The node must have at least X3 reads with the same variation between first and second states, being X3 at least 2;
D3) The percentage of first state reads in second state reads is not over a threshold X4, to account for possible contamination of control state reads with test state reads, wherein X4 is at least 5%; and
D4) The percentage of second state reads in first state reads is not over a threshold X5, to account for possible contamination of test state reads with control state reads, wherein X5 is at least 20%;
E) Computer/electronic means for assembling and filtering test and control reads derived from all breakpoint nodes accepted in step D, wherein in the assembling and filtering:
E1 ) Variants whose prefixes share more than X2 nucleotides are merged to give a block; and
E2) If the nucleic acid whose variants are being identified is a double stranded DNA, then variants that do not have blocks in both forward and reverse sequence strands are discarded.
15. A system comprising a processor and a memory, wherein the memory stores computer executable instructions that, when executed, cause the system to perform the method for the identification of nucleic acid variants between two genomic states as defined in any of claims 1 to 10.
PCT/EP2015/074253 2014-10-21 2015-10-20 A computational method for the identification of variants in nucleic acid sequences WO2016062713A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP15784335.0A EP3210145A1 (en) 2014-10-21 2015-10-20 A computational method for the identification of variants in nucleic acid sequences

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP14189691 2014-10-21
EP14189691.0 2014-10-21

Publications (1)

Publication Number Publication Date
WO2016062713A1 true WO2016062713A1 (en) 2016-04-28

Family

ID=51986982

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2015/074253 WO2016062713A1 (en) 2014-10-21 2015-10-20 A computational method for the identification of variants in nucleic acid sequences

Country Status (2)

Country Link
EP (1) EP3210145A1 (en)
WO (1) WO2016062713A1 (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3267346A1 (en) * 2016-07-08 2018-01-10 Barcelona Supercomputing Center-Centro Nacional de Supercomputación A computer-implemented and reference-free method for identifying variants in nucleic acid sequences
CN109920485A (en) * 2018-12-29 2019-06-21 浙江安诺优达生物科技有限公司 The method and its application of variation simulation are carried out to sequencing sequence
US10395759B2 (en) 2015-05-18 2019-08-27 Regeneron Pharmaceuticals, Inc. Methods and systems for copy number variant detection
EP3479271A4 (en) * 2016-06-30 2020-04-08 Nantomics, LLC Synthetic wgs bioinformatics validation
CN112735516A (en) * 2020-12-29 2021-04-30 上海派森诺生物科技股份有限公司 Group variation detection analysis method without reference genome

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
A. L. DELCHER ET AL: "Fast algorithms for large-scale genome alignment and comparison", NUCLEIC ACIDS RESEARCH, vol. 30, no. 11, 1 June 2002 (2002-06-01), pages 2478 - 2483, XP055180650, DOI: 10.1093/nar/30.11.2478 *
DANIEL M. SAVEL ET AL: "Suffix-Tree Based Error Correction of NGS Reads Using Multiple Manifestations of an Error", PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON BIOINFORMATICS, COMPUTATIONAL BIOLOGY AND BIOMEDICAL INFORMATICS, BCB'13, 1 September 2013 (2013-09-01), New York, New York, USA, pages 351 - 358, XP055181354, ISBN: 978-1-45-032434-2, DOI: 10.1145/2506583.2506644 *
K. PASZKIEWICZ ET AL: "De novo assembly of short sequence reads", BRIEFINGS IN BIOINFORMATICS, vol. 11, no. 5, 19 August 2010 (2010-08-19), pages 457 - 472, XP055181149, ISSN: 1467-5463, DOI: 10.1093/bib/bbq020 *
PEILIN JIA ET AL: "Consensus Rules in Variant Detection from Next-Generation Sequencing Data", PLOS ONE, vol. 7, no. 6, 8 June 2012 (2012-06-08), pages e38470, XP055181188, DOI: 10.1371/journal.pone.0038470 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10395759B2 (en) 2015-05-18 2019-08-27 Regeneron Pharmaceuticals, Inc. Methods and systems for copy number variant detection
US11568957B2 (en) 2015-05-18 2023-01-31 Regeneron Pharmaceuticals Inc. Methods and systems for copy number variant detection
EP3479271A4 (en) * 2016-06-30 2020-04-08 Nantomics, LLC Synthetic wgs bioinformatics validation
US10984890B2 (en) 2016-06-30 2021-04-20 Nantomics, Llc Synthetic WGS bioinformatics validation
EP3267346A1 (en) * 2016-07-08 2018-01-10 Barcelona Supercomputing Center-Centro Nacional de Supercomputación A computer-implemented and reference-free method for identifying variants in nucleic acid sequences
WO2018007034A1 (en) * 2016-07-08 2018-01-11 Barcelona Supercomputing Center - Centro Nacional De Supercomputación A computer-implemented and reference-free method for identifying variants in nucleic acid sequences
CN109920485A (en) * 2018-12-29 2019-06-21 浙江安诺优达生物科技有限公司 The method and its application of variation simulation are carried out to sequencing sequence
CN109920485B (en) * 2018-12-29 2023-10-31 浙江安诺优达生物科技有限公司 Method for carrying out mutation simulation on sequencing sequence and application thereof
CN112735516A (en) * 2020-12-29 2021-04-30 上海派森诺生物科技股份有限公司 Group variation detection analysis method without reference genome

Also Published As

Publication number Publication date
EP3210145A1 (en) 2017-08-30

Similar Documents

Publication Publication Date Title
Kumar et al. Next-generation sequencing and emerging technologies
Moncunill et al. Comprehensive characterization of complex structural variations in cancer by directly comparing genome sequence reads
De Roeck et al. NanoSatellite: accurate characterization of expanded tandem repeat length and sequence through whole genome long-read sequencing on PromethION
Wadapurkar et al. Computational analysis of next generation sequencing data and its applications in clinical oncology
Ren et al. RNA-seq analysis of prostate cancer in the Chinese population identifies recurrent gene fusions, cancer-associated long noncoding RNAs and aberrant alternative splicings
Hong et al. Incorporation of unique molecular identifiers in TruSeq adapters improves the accuracy of quantitative sequencing
Okonechnikov et al. InFusion: advancing discovery of fusion genes and chimeric transcripts from deep RNA-sequencing data
Tran et al. Objective and comprehensive evaluation of bisulfite short read mapping tools
Sakarya et al. RNA-Seq mapping and detection of gene fusions with a suffix array algorithm
AU2015367290A1 (en) Sequencing controls
WO2016062713A1 (en) A computational method for the identification of variants in nucleic acid sequences
Zhao et al. Robustness of RNA sequencing on older formalin-fixed paraffin-embedded tissue from high-grade ovarian serous adenocarcinomas
Gilpatrick et al. Targeted nanopore sequencing with Cas9 for studies of methylation, structural variants, and mutations
EP3482329B1 (en) A computer-implemented and reference-free method for identifying variants in nucleic acid sequences
Fasterius et al. A novel RNA sequencing data analysis method for cell line authentication
Ergin et al. RNA sequencing and its applications in cancer and rare diseases
Bacher et al. Mutational profiling in patients with MDS: ready for every-day use in the clinic?
Hallast et al. Assembly of 43 human Y chromosomes reveals extensive complexity and variation
Ahsan et al. A survey of algorithms for the detection of genomic structural variants from long-read sequencing data
Puurand et al. AluMine: alignment-free method for the discovery of polymorphic Alu element insertions
Wan et al. RNA sequencing and its applications in cancer diagnosis and targeted therapy
Seifi et al. Application of next-generation sequencing in clinical molecular diagnostics
US11718873B2 (en) Correcting for deamination-induced sequence errors
Gindin et al. Analytical principles of cancer next generation sequencing
KR101977976B1 (en) Method for increasing read data analysis accuracy in amplicon based NGS by using primer remover

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15784335

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

REEP Request for entry into the european phase

Ref document number: 2015784335

Country of ref document: EP