EP3210145A1 - A computational method for the identification of variants in nucleic acid sequences - Google Patents
A computational method for the identification of variants in nucleic acid sequences
- Publication number
- EP3210145A1 (EP15784335.0A)
- Authority
- EP
- European Patent Office
- Prior art keywords
- reads
- sequence
- state
- variants
- suffix
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/20—Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
Definitions
- the present invention relates to a computer implemented method for the identification and characterization of sequence variants in nucleic acids.
- this method is able to quickly and accurately identify most types of sequence genome variations that have been associated to disease, that is, from single nucleotide substitutions to large structural variants.
- This method has multiple and direct applications in diagnosis, prognosis and therapy.
- NGS Next Generation Sequencing
- a wide range of genome variation of cells and individuals has been identified to be the direct cause, or a predisposition, to genetic diseases: from single nucleotide variants (SNVs if they are somatic, and SNPs if they are polymorphic in the population), to structural variants (SVs), which can correspond to deletions, insertions, inversions, translocations and copy number changes (CNVs), ranging from a few nucleotides to large genomic regions, including complete chromosome arms.
- SNVs single nucleotide variants
- SVs structural variants
- CNVs copy number changes
- These variations can exist between patients and also emerge among cells of the same patient.
- the unveiling of changes in the genome is driving discoveries such as the Philadelphia translocation between chromosomes 9 and 22, whose presence implies the development of chronic myelogenous leukemia (CML) and its identification allows the development and selection of last-generation therapies.
- CML chronic myelogenous leukemia
- Cancer is one of the most active diagnostic and therapeutic areas where NGS is being applied.
- having access to all somatic variation accumulated in a tumor cell will allow the identification of the genetic causes of the tumor and, consequently, a more precise and specific (personalized) diagnosis, prognosis and therapy.
- the identification of somatic variation in cancer genomes currently has the following sources of errors and limitations: (i) the initial alignment step, on which all the methods rely, is time consuming and particularly error prone with the tumor reads that carry the sequence variation, which are the most relevant for the analysis. It has been proven that many of these reads that carry changes and differences in their sequence are difficult or even impossible to align to the reference unmutated genome. The absence and the misplacement of tumor reads in the final alignment drastically affect all existing downstream methods for variant searching and calling. Although a number of alternative methods exist, this alignment step is generally performed with the same program (Li, H., et al. "Fast and accurate short read alignment with Burrows-Wheeler transform" Bioinformatics 2009, vol.
- reference-free has recently been used to describe other methods that, with different applications to the ones outlined above, also analyse NGS data without relying on a reference genome.
- One of them describes a computational method for quantifying the abundance of RNA isoforms from RNA-seq data, which does not rely on the mapping of reads to a reference genome (Patro R., et. al. "Sailfish enables alignment-free isoform
- Nordstrom K.J.V et al. "Mutation identification by direct comparison of whole-genome sequencing data from mutant and wild-type individuals using k-mers" Nature Biotech. 2013, vol. 31, pp. 325-331.
- This work presents a reference-free genome assembler that allows discovering homozygous mutations based on comparing k-mers in whole genome sequencing data.
- although these two methods can operate without the need of a reference sequence, they are not applicable in comparative studies such as the identification of somatic mutations and thus cannot be used for genome analysis in a biomedical context.
- the method is not restricted to the identification of a certain type of variant (SNVs,
- a first aspect of the present invention is a computational method for the identification of nucleic acid variants between two genomic states comprising the steps of:
- the performance and speed of this method make it more suitable for clinical applications (such as genomic analysis of cancer cells) than the complex and time-consuming pipelines developed so far.
- the method can even be used to define very complex genomic scenarios such as the ones taking place in chromoplexy and chromothripsis related to cancer development and aggressiveness.
- a second aspect of the present invention is a computer program product comprising program instructions for causing a computer system to perform the method for the identification of nucleic acid variants between two genomic states of the first aspect of the invention.
- the computer program product may be embodied on a storage medium (for example, a CD-ROM, a DVD, a USB drive, on a computer memory or on a read-only memory) or carried on a carrier signal (for example, on an electrical or optical carrier signal).
- the computer program may be in the form of source code, object code, a code intermediate source and object code such as in partially compiled form, or in any other form suitable for use in the implementation of the processes according to the invention.
- the carrier may be any entity or device capable of carrying the computer program.
- the carrier may comprise a storage medium, such as a ROM, for example a CD ROM or a semiconductor ROM, or a magnetic recording medium, for example a floppy disc or hard disk.
- the carrier may be a transmissible carrier such as an electrical or optical signal, which may be conveyed via electrical or optical cable or by radio or other means.
- the carrier may be constituted by such cable or other device or means.
- the carrier may be an integrated circuit in which the computer program is embedded, the integrated circuit being adapted for performing, or for use in the performance of, the relevant methods.
- a third aspect of the invention is a system for the identification of nucleic acid variants between two genomic states comprising the steps of: A) Computer/electronic means for Inputting 2 sets of nucleic acid reads, which are sequences retrieved from a sequencing method, wherein the first set of reads corresponds to cells representing a first test state, and the second set of reads corresponds to cells representing a second control state; B)
- Computer/electronic means for filtering the reads comprising: B1 ) Keeping only the reads with at least a percentage X1 of their bases with a phred quality score higher than 20, being X1 equal or above 90%; B2) Splitting the reads with an undefined nucleotide, giving one sequence before, and one sequence after the undefined nucleotide, the latter being discarded; and B3) Discarding the sequence reads with less than X2 bases, wherein X2 is from 25 to 40; C) Computer/electronic means for generating a suffix structure, wherein the generation of the suffix structure comprises: C1 ) Generating a number of N-X2 new reads for each read of sequence length N, wherein the new N-X2 reads correspond to all suffixes with a length larger than X2 nucleotides and derived from each read of sequence length N, being a suffix the complete sequence of a sequenced read or a sub-sequence
- the described system can be a part (for example, hardware in the form of a PCI card) of a computer system (for example a personal computer). Furthermore, the system can be external hardware connected to the computer system by appropriate means.
- the electronic/computer means may be used
- a fourth aspect of the invention is a system comprising a processor and a memory, wherein the memory stores computer executable instructions that, when executed, cause the system to perform the method for the identification of nucleic acid variants between two genomic states of the first aspect of the invention.
- FIG. 1. A non-limiting example of a suffix tree.
- the string used for the example is banana$.
- the string is expressed by the use of this suffix structure.
- the root of the suffix tree is the uppermost node found on the picture.
- FIG. 2. Construction of the suffix structure. Construction of a suffix tree from two different sequences: the sequence "BANANA", which corresponds to a sequence of the first state, and the sequence "BAANA", which corresponds to a second state sequence. Below those sequences, there are all their possible derived suffixes.
- the tree structure is represented by a three squares' box: the left square represents the first state count, the middle square represents the corresponding letter and the right square represents the second state count.
- a) The box corresponds to what in the text is referenced as node, position or nucleotide. b) That area corresponds to an unambiguous path: all nodes have a unique child with counts of both states. c) That specific node constitutes a breakpoint: the first state count and the second state count diverge in different ways at their children.
- FIG. 3. Breakpoint block. Each dashed box represents a pair of first state (regular letters) and second state (italic and underlined letters) sequences that contains a breakpoint. All the dashed boxes contain sequences with the same suffix or prefix around the breakpoint, which means that all of them constitute a unique breakpoint block (Note: the sequence used for this image does not represent any prior art sequence that holds any relation to the invention disclosed herein, i.e. it is simply shown for illustrative purposes).
- base and the term “nucleotide” are herein used interchangeably, and refer to the monomers (subunits) which are repeated in a nucleic acid such as DNA or RNA, giving its sequence or primary structure.
- reference genome refers to the complete nucleic acid sequence representing the whole genome of a species normally accepted by the wide community. Since the reference genome is usually assembled from the sequencing of DNA from a number of donors, it does not accurately represent the set of genes of any one single individual. Instead, a reference genome provides a mosaic of different DNA sequences from each donor. But, at general levels, the reference genome provides a good approximation of the DNA of any single individual. However, in genomic regions with high allelic diversity, the reference genome may differ
- GRCh37 the Genome Reference Consortium human genome (build 37) is derived from thirteen anonymous volunteers from New York. Reference genomes are typically used as a guide on which new genomes are built and aligned, enabling their assembly and comparison.
- forward strand refers to a nucleic acid sequence read from 5' terminal to 3' terminal ends.
- reverse strand refers to the nucleic acid sequence which is complementary to the forward strand.
- nucleic acid variant refers to a difference in sequence between two genomic states.
- a variant can be a single nucleotide variant (SNV) if the difference between the two genomes (or two states of the same genome) is only due to the change of a single nucleotide. All other variants, among them insertions, deletions, inversions, duplications, translocations and others are termed structural variants (SV). The latter can have many sizes, from two bases up to entire pieces of a chromosome.
- genomic state can refer to two different genomes derived from two different individuals, or two genomes derived from two different cells of the same individual.
- the two different cells can be a normal vs. a pathological cell, an undifferentiated vs. a differentiated cell, a cell which has been exposed to a certain external factor vs. an unexposed cell, etc.
- mapping refers to aligning blocks of the first and second state to a reference genome.
- read refers to a fragment of nucleic acid that is sequenced in its entirety.
- the nucleic acid might be DNA, RNA, or even chemically altered nucleic acids.
- the initial step in a high throughput sequencing run is the random fragmentation of a genome into millions of partly overlapping fragments called reads, which are usually amplified by Polymerase Chain Reaction and sequenced using a variety of techniques that are platform-dependent.
- the lengths of the reads can also vary depending on the platform, and are usually on the order of a few dozens to a few hundreds of nucleotides.
- the partly overlapping reads must be assembled if a complete picture of the genome is to be built.
- depth of coverage refers to the number of times a nucleotide is read during the sequencing process. Deep sequencing means that the total number of reads is many times larger than the length of the sequence under study. Standard depth of coverage currently ranges from 30 to 100x for whole genomes, meaning that each position in the genome is represented from 30 to 100 times. Coverage similarly designates the average number of reads representing a given nucleotide in the reconstructed sequence.
- Depth of coverage can be calculated from the length of the original genome (G), the number of reads (N), and the average read length (L) as N * L / G.
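As a quick check of the formula above, the following minimal sketch computes N * L / G. The function name and the example figures (900 million reads of 100 nucleotides over a 3-gigabase genome) are illustrative assumptions, not values taken from the patent.

```python
def depth_of_coverage(num_reads: int, read_length: float, genome_length: int) -> float:
    """Expected depth of coverage, N * L / G."""
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    return num_reads * read_length / genome_length


# Hypothetical run: N = 900 million reads, L = 100 nt, G = 3 billion nt.
coverage = depth_of_coverage(900_000_000, 100, 3_000_000_000)
print(f"{coverage:.0f}x")  # 30x
```

With these assumed numbers the result sits at the lower end of the 30 to 100x range quoted above.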
- the term "undefined nucleotide” as used herein refers to a certain position inside a sequenced read that could not be determined during the sequencing process, that is, a position for which the sequencing experiment has not unambiguously resolved whether it is occupied by an adenine (A), guanine (G), cytosine (C) or thymine (T), and therefore its nature is unknown.
- Undefined nucleotides in reads are filtered out (removed) in the method of the invention, generating two or more fragments of defined sequence if the undefined nucleotides are removed from inner positions of the read.
- the term "phred quality score” as used herein refers to the quality score given to each nucleotide base call in a sequenced read.
- the phred score is a property given to each sequenced nucleotide and it is logarithmically related to the base-calling error probability.
- a phred score of 10 assigned to a certain nucleotide in a sequenced read means that there is a 90% probability that the base call is correct
- a phred score of 20 means that there is a 99% probability that the base call is correct
- a phred score of 30 means that there is a 99.9% probability that the base call is correct.
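The three values above follow from the standard phred definition, Q = -10 * log10(P), where P is the base-calling error probability. The sketch below reproduces them; the function names are illustrative.

```python
import math


def phred_to_error_probability(q: float) -> float:
    """Base-calling error probability P for a phred score Q (P = 10^(-Q/10))."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Phred score Q for an error probability P (Q = -10 * log10(P))."""
    return -10 * math.log10(p)


for q in (10, 20, 30):
    accuracy = 1 - phred_to_error_probability(q)
    print(q, f"{accuracy:.1%}")
# 10 90.0%
# 20 99.0%
# 30 99.9%
```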
- the term “assembling” as used herein refers to grouping all the first state reads, and separately second state reads that share the same variant.
- suffix refers to the complete sequence of a sequenced NGS read or a sub-sequence derived from the latter when one nucleotide at the 5' terminal end is discarded. For instance, let us imagine a read of a length of 40 nucleotides which represents the beginning of the sequence that codes for the Homo sapiens PTGS2 enzyme (prostaglandin endoperoxide synthase-2, also known as cox-2, Genbank entry D28235). If an original read generated in an NGS experiment of the latter gene were: ATTATTAAATTATCAAAAAGAAAATGATCCACGCTCTTAG (length 40) its first five suffixes would be:
- Suffix structure refers to a structure for string storage and compression that allows computational comparison and matching of sequences in a highly efficient manner.
- Suffix structures that can be used in the method of the invention are, for instance, suffix trees and suffix arrays.
- suffix tree also known as "suffix trie” as used herein refers to a tree-like data structure that is constructed to enable fast string matching. It allows the convenient storage of all substrings in a given string, that is, in terms of the present invention, the storage of all nucleotide sub-sequences (suffixes) in a given nucleotide sequence (of a read).
- suffix tree see FIG.1 where a suffix tree is constructed for the term banana.
- quaternary suffix tree refers to a suffix tree built with the 4-letter vocabulary of nucleic acids (G,A,T/U,C).
- the suffix tree has a root, which is the 0 position from which the counting of the nucleotides begins.
- the suffix array A contains the starting positions of these sorted suffixes: i = 1 2 3 4 5 6 7
- A[3] contains the value 4, and therefore refers to the suffix starting at position 4 within S, which is the suffix ana$.
- Suffix arrays are closely related to suffix trees. Suffix arrays can be
- LCP array (longest common prefix array) is an auxiliary array for the "suffix array".
- the LCP array H is constructed by comparing lexicographically consecutive suffixes to determine their longest common prefix:
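For the example string S = banana$, the suffix array and LCP array can be sketched as follows. This is a naive construction for illustration only; the text does not say how these arrays are built. Positions are 1-based as in the text, and setting the first LCP entry to 0 is a convention assumed here.

```python
from typing import List


def suffix_array(s: str) -> List[int]:
    """1-based start positions of the suffixes of s, in lexicographic order."""
    return [i + 1 for i in sorted(range(len(s)), key=lambda i: s[i:])]


def lcp_array(s: str, sa: List[int]) -> List[int]:
    """H[k]: longest common prefix of the k-th sorted suffix and its predecessor."""
    def common_prefix(a: str, b: str) -> int:
        n = 0
        for x, y in zip(a, b):
            if x != y:
                break
            n += 1
        return n

    h = [0]  # assumed convention: the first suffix has no predecessor
    for prev, cur in zip(sa, sa[1:]):
        h.append(common_prefix(s[prev - 1:], s[cur - 1:]))
    return h


S = "banana$"
A = suffix_array(S)
H = lcp_array(S, A)
print(A)  # [7, 6, 4, 2, 1, 5, 3]
print(H)  # [0, 0, 1, 3, 0, 0, 2]
```

The third entry of A (index 2 in Python) is 4, which matches the suffix ana$ mentioned above.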
- prefix refers to the first part of a sequencing read, that is, from position 1 to a given position depending on the context. This term is used here, as it is used in a grammatical context referring to words.
- breakpoint refers to the sequence immediately flanking a sequence variant.
- a breakpoint is the point where the DNA broke in the second state and appears as a change in the first state compared to the second control state. In other words, where the continuity of the sequence of the control second state breaks in the first state.
- node refers to any given position (nucleotide) in the tree. Relevant nodes in this invention are those that determine a change in the sequence of the second state compared to the first state (breakpoint nodes). FIG. 2 shows a node where healthy and mutated reads take different paths because they contain a different sequence.
- block or “breakpoint block” as used herein refers to all the reads derived from the two genomes compared, derived from the same position and including a variation in first state read.
- a breakpoint block (FIG. 3) is composed of aligned reads derived from the sequencing of all four alleles involved (two coming from the first state genome and the other two derived from the second state genome). Normally, in the case of heterozygous variation, only the reads derived from the altered allele will contain the mutation or variant.
- ambiguous path refers to multiple possible sequence solutions in a given tree. It is referred here as the opposite of unique and unambiguous path or sequence.
- the first aspect of the present invention is a computational method for the identification of nucleic acid variants between two genomic states comprising the steps of:
- computational method further comprises the step of: F) Cataloguing and annotating blocks according to the following: F1 ) If blocks between the first and second states only differ in one substituted nucleotide, the variant is catalogued as containing a single nucleotide variant and the single nucleotide variant is annotated; F2) If blocks between the first and second states differ in more than one nucleotide but the whole difference in sequence is contained within the block, the variant is catalogued as a small structural variant, and the small structural variant is annotated; and F3) If blocks between the first and second states differ in more than one nucleotide and the whole difference in sequence is not contained within the block, the variant is catalogued as a large structural variant, and the boundaries of all large structural variants are extended by retrieving suffixes overlapping at least X2 nucleotides in an iterative process which ends when the extended sequence reaches 200 nucleotides or when an ambiguous path is found.
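The cataloguing rules F1 to F3 can be sketched as a small decision function over the two consensus sequences of a block. This is an illustration, not the patented implementation. In particular, the test for "the whole difference is contained within the block" is approximated here by requiring at least `min_flank` matching bases on both sides of the difference; `min_flank` is a hypothetical parameter that does not appear in the text.

```python
def classify_block(normal: str, tumor: str, min_flank: int = 5) -> str:
    """Catalogue a block as 'SNV', 'small SV' or 'large SV' (sketch of F1-F3)."""
    if normal == tumor:
        return "no variant"
    # F1: same length and exactly one substituted nucleotide.
    if len(normal) == len(tumor):
        mismatches = sum(a != b for a, b in zip(normal, tumor))
        if mismatches == 1:
            return "SNV"
    # Length of the shared prefix and shared suffix flanking the difference.
    limit = min(len(normal), len(tumor))
    prefix = 0
    while prefix < limit and normal[prefix] == tumor[prefix]:
        prefix += 1
    suffix = 0
    while suffix < limit - prefix and normal[-1 - suffix] == tumor[-1 - suffix]:
        suffix += 1
    # F2: difference flanked on both sides, so assumed to lie inside the block.
    if prefix >= min_flank and suffix >= min_flank:
        return "small SV"
    # F3: the difference runs off one end of the block.
    return "large SV"


print(classify_block("ACGTACGT", "ACGAACGT", min_flank=2))      # SNV
print(classify_block("ACGTTTACGT", "ACGTACGT", min_flank=3))    # small SV
print(classify_block("ACGTACGTAC", "ACGTAGGCTT", min_flank=3))  # large SV
```

Blocks reported as "large SV" would then go through the iterative extension described in F3.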
- the method further comprises optionally mapping second state blocks, and subsequently mapping first state blocks, on a reference genome.
- the suffix structure is a suffix tree or a suffix array.
- X1 is equal or above 95%.
- X1 is equal or above 99%.
- X2 is from 30 to 35.
- X2 is from 30 to 32.
- X2 is equal to 30.
- X3 is equal or above 4.
- X3 is equal or above 6.
- X3 is equal or above 8.
- X3 is directly proportional to the depth of coverage in the sequencing experiment. This means that, the deeper the coverage, the more restrictive (higher) is the value for X3 (the more reads with the same variation between first and second states must be present for the node to be accepted as a breakpoint node).
- X4 is between 5-10%.
- X4 is between 5-7%.
- X4 is 5%.
- X5 is between 20-25%. In a particular embodiment of the first aspect of the invention, optionally in combination with any embodiment above or below, X5 is 20%.
- the first set of reads corresponds to pathological cells of a patient, and the second set of reads corresponds to non-pathological cells of the same patient;
- the first set of reads corresponds to cancer cells of a patient
- the second set of reads corresponds to non-cancer cells of the same patient.
- the first set of reads corresponds to virus-infected cells of a patient
- the second set of reads corresponds to non-infected cells of the same patient.
- the first set of reads and the second set of reads correspond to the same cell of the same patient in two different developmental stages.
- the first set of reads corresponds to cells of a patient which have been exposed to a drug
- the second set of reads corresponds to cells of the same patient which have not been exposed to a drug
- the first set of reads corresponds to cells of a tissue
- the second set of reads corresponds to cells of a tissue
- although the embodiments of the invention comprise processes performed in computer apparatus, the invention also extends to computer apparatus and to computer programs, particularly computer programs on or in a carrier, adapted for putting the invention into practice.
- a computer program product comprising program instructions for causing a computer system to perform the method for the identification of nucleic acid variants between two genomic states as defined above.
- the computer program product is embodied on a storage medium.
- the computer program product is carried on a carrier signal.
- the carrier may be any entity or device capable of carrying the program.
- the goal in the case of the two aggressive forms of tumors was to test the method of the invention for the detection of somatic mutations accumulated in cancer cells as compared to normal (non-tumor) cells, and to assess how the method compares to other methods found in the state of the art.
- the identification of somatic variants associated with cancer typically implies the sequencing of tumor and normal genome samples from the same patient, followed by multiple sequence comparisons. This process implies the alignment of normal and pathological reads to a reference genome, identifying sequence changes and finally isolating the somatic fraction of variants (i.e. those detected only in the tumor cell reads).
- the method of the invention is capable of identifying SNVs and SVs of all sorts and sizes in a single run with great accuracy.
- the method of the invention is even proven to be capable of identifying breakpoints associated with complex chromosomal rearrangements (chromothripsis and chromoplexy) in two forms of aggressive cancers.
- characterization carried out by the method of the invention comprise the steps outlined below: A) Input data.
- the method takes high quality sequencing data directly from FASTQ files of tumor and normal samples of the same individual. Alternatively, it is also able to accept BAM files, from which it extracts all the sequencing reads.
- the user can define a cutoff, so that reads having over a certain threshold of their bases with a phred quality score ≤ q20 are discarded.
- X1 = 90% has been found to be especially suited for the purposes tested. This means that only reads with at least 90% of their bases with a phred quality score higher than 20 are kept.
- the next step implies the building of a suffix structure.
- This structure can, for instance, be a suffix tree or a suffix array.
- a quaternary suffix tree (or "quad-tree") structure is first generated using all high quality normal and tumor reads (see FIG.1 and FIG.2 for a simplified version). All these sequences are
- Each node of the tree has, at most, 4 branches, each one representing one of the four nucleotides.
- the empirical value for X2 that has been mostly used in the tests disclosed is 30. This means that all reads with less than 30 bases are discarded. In the case of the presence of undefined base pairs ("N"), these are removed and the original sequence is split forming new shorter reads, which are inserted into the tree only if they are longer than 30 (X2) base pairs.
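The read filtering described so far (quality cutoff X1, splitting at undefined bases, minimum length X2) can be sketched as below. The function name and the list-of-integers representation of quality scores are assumptions for illustration. Fragments of exactly X2 bases are kept here, following "reads with less than 30 bases are discarded"; the stricter reading, "longer than 30", would use `>` instead of `>=`.

```python
from typing import List


def filter_read(sequence: str, qualities: List[int],
                x1: float = 0.90, x2: int = 30, min_phred: int = 20) -> List[str]:
    """Return the fragments of one read that survive the three filtering steps."""
    if not sequence or len(sequence) != len(qualities):
        return []
    # Keep the read only if at least x1 of its bases score above min_phred.
    good = sum(q > min_phred for q in qualities)
    if good / len(sequence) < x1:
        return []
    # Split at undefined nucleotides ("N"); the N itself is dropped.
    fragments = sequence.upper().split("N")
    # Discard fragments with fewer than x2 bases.
    return [fragment for fragment in fragments if len(fragment) >= x2]


read = "A" * 35 + "N" + "C" * 10
print(filter_read(read, [30] * len(read)))  # one 35-base fragment survives
print(filter_read(read, [10] * len(read)))  # [] (fails the quality cutoff)
```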
- N undefined base pairs
- Each accepted sequence is inserted into the tree (both in forward and reverse), from the root, in original form (i.e. starting from nucleotide 1 to the end of the read), together with all derived suffixes larger than 30 bp (recursively starting from nucleotide 2 to the end, 3 to the end, etc., as defined in the definitions).
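A minimal sketch of this insertion step is given below, using an uncompressed trie in which every node carries one counter per state. The class and function names are illustrative, "reverse" is interpreted here as the reverse complement, and a real implementation would need a far more compact structure than nested Python dictionaries.

```python
from typing import Dict, List

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


class Node:
    def __init__(self) -> None:
        self.children: Dict[str, "Node"] = {}
        # counts[0]: first state (e.g. tumor), counts[1]: second state (e.g. normal)
        self.counts: List[int] = [0, 0]


def insert_read(root: Node, read: str, state: int, x2: int = 30) -> None:
    """Insert a read, its reverse complement, and all suffixes longer than x2."""
    for strand in (read, reverse_complement(read)):
        # Start positions 0 .. N-x2-1 give the N-x2 suffixes longer than x2.
        for start in range(len(strand) - x2):
            node = root
            for base in strand[start:]:
                node = node.children.setdefault(base, Node())
                node.counts[state] += 1


# Toy example with a small x2 so that short sequences are accepted.
root = Node()
insert_read(root, "ACGTAC", state=0, x2=3)  # first state
insert_read(root, "ACGAAC", state=1, x2=3)  # second state
acg = root.children["A"].children["C"].children["G"]
print(acg.counts)                # [2, 1]
print(acg.children["T"].counts)  # [2, 0]  only seen in the first state
print(acg.children["A"].counts)  # [0, 1]  only seen in the second state
```

In the toy example the node reached by ACG is shared by both states, and its children T and A are where the two states diverge.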
- the next step consists in identifying all tumor specific reads. Because it was expected that variants generate new and distinct sequences in the mutated genome (first state) compared to the non-mutated control genome (second state), the method first searches and collects sequences (reads) that are only present in the tumor sample. These sequences are identified from the tree, as nodes and branches with an unbalanced representation of Normal (Count Normal Reads - CNR) and Tumor (Count Tumor Reads - CTR) reads. It was expected that nodes or branches covering a variation in the tumor sequence (first state) will theoretically have no representation in normal reads (second state).
- nodes or branches with a CNR to CTR ratio below a certain threshold are selected.
- This threshold can be adjusted by the user to account for expected levels of contamination of tumor cell reads among normal cell reads, i.e. reads of state one contaminating the sample of state two. This inconvenience may happen if a small group of cancer cells contaminate the sample of non-tumor cells, which is not uncommon.
- a X4 larger than 0 always results in a higher sensitivity, but at the cost of lower specificity.
- X4 was set to 0 for the in silico analysis and to 0.05 for the real tumor samples analyzed here, where it was assumed a maximum of 5% contamination of tumor reads in the control sample.
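The node selection rule can be sketched as a predicate on the two counts of a node. Treating X3 as a minimum number of tumor reads, and using "less than or equal to" for the X4 comparison (so that X4 = 0 still accepts nodes with no normal reads), are assumptions made for this illustration.

```python
def is_tumor_specific(ctr: int, cnr: int, x3: int = 4, x4: float = 0.05) -> bool:
    """True if a node looks tumor specific given its tumor (CTR) and normal (CNR) counts."""
    if ctr < x3:
        return False  # too few tumor reads support this node
    return cnr / ctr <= x4  # tolerated fraction of normal reads


print(is_tumor_specific(ctr=40, cnr=0))           # True
print(is_tumor_specific(ctr=40, cnr=1))           # True  (ratio 0.025)
print(is_tumor_specific(ctr=40, cnr=4))           # False (ratio 0.1)
print(is_tumor_specific(ctr=40, cnr=1, x4=0.0))   # False (no contamination allowed)
print(is_tumor_specific(ctr=3, cnr=0))            # False (below x3)
```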
- a final condition is imposed, which works as a reverse of X4.
- a certain percentage of second state (control) reads can be present among first state (test, cancer) reads, which happens when genomes of state one contaminate state two (for example, it is common in cancer, where tumor cells can contaminate the normal sample during the extraction from the patient). This can also be corrected by only accepting nodes which are not over a threshold X5.
- the next step consists in grouping those that are suspected to cover the same variant.
- candidate sequences are organized by identity: two sequences belong to the same group if they overlap at least X2 base pairs. X2 was set to 30 in the examples disclosed.
- Reverse complementary sequences are also evaluated during this grouping in order to be able to cover the variant in both
- Sequence blocks (groups) with sequences in only one of the orientations (or with less than 4 tumor reads as explained above) are discarded. Once these groups are generated, the tree was interrogated, also on the 30 base pair-overlap basis, to extract the normal (non-mutated) reads of the same region and add them to the block.
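The grouping step can be sketched as single-linkage clustering: two candidate sequences are linked if one ends with at least X2 bases that the other (or its reverse complement) starts with. The quadratic pairwise comparison and the union-find bookkeeping are simplifications chosen for this illustration. The sketch does not apply the orientation and minimum-read filters mentioned above.

```python
from typing import Dict, List

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def overlap_length(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def same_variant(a: str, b: str, x2: int = 30) -> bool:
    """True if a overlaps b, or the reverse complement of b, by at least x2 bases."""
    return any(
        max(overlap_length(a, c), overlap_length(c, a)) >= x2
        for c in (b, reverse_complement(b))
    )


def group_sequences(seqs: List[str], x2: int = 30) -> List[List[str]]:
    parent = list(range(len(seqs)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            if same_variant(seqs[i], seqs[j], x2):
                parent[find(j)] = find(i)

    groups: Dict[int, List[str]] = {}
    for i, seq in enumerate(seqs):
        groups.setdefault(find(i), []).append(seq)
    return list(groups.values())


# Toy example with x2 = 6; the third sequence only matches as a reverse complement.
candidates = ["ACGTTGCAAG", "TGCAAGGTCA", "TAGGTGACCT", "GGGGGGGGGG"]
for group in group_sequences(candidates, x2=6):
    print(group)
# ['ACGTTGCAAG', 'TGCAAGGTCA', 'TAGGTGACCT']
# ['GGGGGGGGGG']
```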
- each block will represent a region in the genome containing the mutated and the non-mutated version (see a detailed example of a breakpoint block in FIG.3).
- the consensus mutated and normal sequences from these blocks.
- the corresponding normal consensus sequence can be used at the end of the procedure and mapped onto a reference genome to obtain the coordinates of the variant.
- the method also includes step F:
- the next step consists in identifying and classifying the variation included there.
- Normal and Tumor consensus sequences derived from these blocks are recursively compared to identify differences.
- a first evaluation will search for small variants, which consist of those that are completely included within the consensus sequences (SNV and small SVs: insertions, deletions and inversions). All the blocks that do not match this criterion are then considered candidates for large SVs, i.e. likely to cover break points of intra or interchromosomal transitions, part of large deletions, insertions or inversions.
- each tumor consensus sequence is extended on both ends by interrogating the tree for unambiguous tumor reads that overlap at least X2 (30 in this case) base pairs with the tumor consensus, reconstructing a (maximum) of 200 base pair region around the break and allowing the detection of newly generated sequence at the point of the break.
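The extension step can be sketched as a greedy, one-base-at-a-time walk. It is shown for the right end only; the left end could be handled by running the same function on reverse complements. Here a plain list of reads stands in for the tree queries, and "ambiguous" is taken to mean that overlapping reads propose more than one next base. Both are simplifying assumptions.

```python
from typing import List


def extend_right(consensus: str, reads: List[str],
                 x2: int = 30, max_len: int = 200) -> str:
    """Extend consensus to the right while the next base is unambiguous."""
    while len(consensus) < max_len:
        next_bases = set()
        for read in reads:
            # Longest read prefix (at least x2 bases) matching the consensus end,
            # leaving at least one further base in the read to extend with.
            for k in range(min(len(read) - 1, len(consensus)), x2 - 1, -1):
                if consensus.endswith(read[:k]):
                    next_bases.add(read[k])
                    break
        if len(next_bases) != 1:
            break  # nothing to extend with, or an ambiguous path
        consensus += next_bases.pop()
    return consensus


# Toy example with x2 = 3.
print(extend_right("ACGTAC", ["GTACGG", "TACGGA"], x2=3))  # ACGTACGGA
print(extend_right("ACGTAC", ["TACG", "TACT"], x2=3))      # ACGTAC (ambiguous)
```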
- inventors identify the coordinates of the changes by mapping onto a reference genome the normal consensus sequences corresponding to each of the variants, avoiding potential mapping conflicts derived from the presence of the variant, as usually happens when using reference-based approaches. Sequences mapping (with the same score) to several positions in the genome are discarded. The calibration and default parameters for the method were adjusted using a high quality set of ~1000 SNVs identified with the Sidron software in a CLL sample (Puente X.S. "Whole-genome sequencing identifies recurrent mutations in chronic lymphocytic leukaemia" Nature 2011, vol. 475, pp. 101-105).
- a personalized genome was simulated using the hg19 reference genome downloaded from UCSC (with no repeat-masking) (http://www.ucsc.edu), and modifying it to match a randomly chosen human haplotype from the 1000 Genomes database.
- These 7,194,026 variants consist of 4,745,917 SNPs and
- the catalog of somatic variants further added to this personalized genome includes 8240 SNVs (more than 100bp apart), 20 known tumor translocations (Egan, J.B. et al. "Whole genome analyses of a well-differentiated liposarcoma reveals novel SYT1 and DDR2
- M003, M004 and MB1 were obtained with informed consent and an ethical vote (Institutional Review Board) following ICGC guidelines (www.icgc.org). M003, M004 and MB1 were accessed through the European Genome-phenome Archive (EGA, https://www.ebi.ac.uk/ega/) under access numbers EGAS00001000510 and EGAS00001000085.
- EGA European Genome-phenome Archive
- Variant genes in tumor samples were identified by analyzing all the changes identified with ANNOVAR (Wang K., et al. "ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data" Nucleic acids research 2010, vol. 38, pp. e164). The resulting genes potentially modified at the coding or splicing level were further analyzed with IntOGen (Gonzalez-Perez, A. et al. "IntOGen-mutations identifies cancer drivers across tumor types" Nature methods 2013, vol. 10, pp. 1081-1082) in order to infer their potential role in oncogenesis.
- PCR primers were designed on sequence blocks of 2000 bp around the target variant using Primer3 (http://frodo.wi.mit.edu/primer3) (Schgasser, A. et al. "Primer3--new capabilities and interfaces". Nucleic acids research 2012, vol. 40, pp. e115). PCR reactions were performed for tumor and control samples. Each target locus was amplified using 50 ng of DNA. The amplification was performed using the Qiagen Multiplex PCR Kit (Qiagen), and the reaction mix contained 2x QIAGEN Multiplex PCR Master Mix, 10x primer mix (2 µM of each primer) and RNase-free water to a total reaction volume of 25 µl.
- PCR conditions were as follows: 96°C 10', 2 cycles of 96°C 30"-60°C 30"-72°C 1'30", 2 cycles of 96°C 30"-58°C 30"-72°C 1'30", 2 cycles of 96°C 30"-56°C 30"-72°C 1'30", 35 cycles of 96°C 30"-54°C 30"-72°C 1'30", and 70°C 10'.
- PCR products were run on a capillary electrophoresis gel (QIAxcel Advanced System, Qiagen) with the QIAxcel DNA screening kit (Qiagen), and the multiband PCR products were purified using NucleoSpin Gel and PCR Clean-up (Macherey-Nagel).
- Sanger sequencing: PCR products were cleaned using ExoSAP-IT (USB) and sequenced using ABI Prism BigDye terminator v3.1 (Applied Biosystems) with 5 pmol of each primer. Sequencing reactions were run on an ABI-3730 Sanger sequencing platform (Applied Biosystems). Sequences were examined with the Mutation Surveyor DNA Variant Analysis Software (Softgenetics). G-banding, FISH and M-FISH analysis
- the inventors generated normal and tumor test genomes by first applying a random human haplotype (Schgasser et al., ibid) and a predesigned catalog of somatic variants to the human reference genome, and then simulating whole genome sequencing at different depths of coverage. To assess the applicability of the method in the current context of cancer genome analysis, its performance was compared with existing strategies for somatic variant calling. For this, it was selected a
- Variants are distributed as follows: 8240 SNVs and 1798 SVs (738 deletions, 715 insertions and 345 inversions).
- the table shows the number of breakpoints that define SVs.
- the method was able to identify 4409 somatic SNVs and 1094 small SVs in this tumor genome.
- a random set of 111 of these somatic calls were amplified and verified by Sanger sequencing using the same DNA used for whole genome sequencing. This process allowed us to verify >94% of SNVs (76 of 81) and >80% of SVs (28 of 35) identified by our method. These specificity rates are in agreement with the corresponding values obtained from the in silico analysis.
- MB1 was previously described as presenting chromothripsis, a complex structural alteration of the genome hypothesized to arise from a single catastrophic event that generates multiple breakpoints, often affecting a single chromosome.
- the method of the invention uncovered a total of 102 breakpoints corresponding to "large" SVs (i.e.
- translocations. From the assessment of a random set of 39 of these breaks through PCR amplification and Sanger sequencing, 36 (92%) could be positively verified. Among all the breakpoints detected, 25 agreed with the intervals of chromosomal translocations that had previously led to the identification of chromothripsis in this tumor, including 3 of the 4 verified at base pair resolution. In addition, 65 novel breakpoints could be identified in the same tumor, comprising 53 intra- and 12 interchromosomal translocations. From a subset of 27 of these translocations (16 intra- and 11 interchromosomal), 25 were verified (92.5%).
- Cibulskis K., et al. "Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples" Nat. Biotech. 2013, vol. 31, pp. 213-219
- Wang K., et al. "ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data" Nucleic acids research 2010, vol. 38, pp. e164
- Patro R., et al. "Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms" Nature Biotech. 2014, vol. 32, pp. 462-464
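
The consensus-extension step described above (unambiguous tumor reads overlapping the consensus by at least X2 = 30 base pairs, with the reconstructed region capped at 200 base pairs) can be illustrated with a minimal Python sketch. This is not the inventors' implementation: the text says the reads are retrieved by interrogating a tree, whereas this sketch simply scans a list of reads, and the function names `next_bases` and `extend_consensus`, the one-base-at-a-time strategy, and the definition of "unambiguous" as "all overlapping reads agree on the next base" are assumptions made for illustration only.

```python
def next_bases(seq, reads, min_overlap):
    """Return the set of bases that overlapping reads propose after `seq`.

    A read overlaps when its first k bases (k >= min_overlap) equal the
    last k bases of `seq` and it still has at least one base left to offer.
    """
    bases = set()
    for read in reads:
        longest = min(len(seq), len(read) - 1)
        for k in range(min_overlap, longest + 1):
            if seq.endswith(read[:k]):
                bases.add(read[k])
    return bases


def extend_consensus(consensus, reads, min_overlap=30, max_len=200):
    """Extend `consensus` on both ends, one base at a time.

    An end is extended only while the overlapping reads agree on a single
    next base (assumed meaning of "unambiguous"); growth stops once the
    sequence reaches `max_len` or neither end can be extended.
    """
    seq = consensus
    # Reversing sequence and reads turns left extension into right extension.
    reversed_reads = [read[::-1] for read in reads]
    grew = True
    while grew and len(seq) < max_len:
        grew = False
        right = next_bases(seq, reads, min_overlap)
        if len(right) == 1:
            seq += right.pop()
            grew = True
        if len(seq) < max_len:
            left = next_bases(seq[::-1], reversed_reads, min_overlap)
            if len(left) == 1:
                seq = left.pop() + seq
                grew = True
    return seq
```

Calling `extend_consensus(tumor_consensus, tumor_reads)` uses the defaults quoted in the text (30 and 200). A conflicting read at one end stops the extension on that side only, so the other end can still grow.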
Landscapes
- Life Sciences & Earth Sciences (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Engineering & Computer Science (AREA)
- Biophysics (AREA)
- General Health & Medical Sciences (AREA)
- Analytical Chemistry (AREA)
- Chemical & Material Sciences (AREA)
- Bioinformatics & Computational Biology (AREA)
- Biotechnology (AREA)
- Evolutionary Biology (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Medical Informatics (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Theoretical Computer Science (AREA)
- Genetics & Genomics (AREA)
- Molecular Biology (AREA)
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP14189691 | 2014-10-21 | ||
PCT/EP2015/074253 WO2016062713A1 (en) | 2014-10-21 | 2015-10-20 | A computational method for the identification of variants in nucleic acid sequences |
Publications (1)
Publication Number | Publication Date |
---|---|
EP3210145A1 (en) | 2017-08-30 |
Family
ID=51986982
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
EP15784335.0A Withdrawn EP3210145A1 (en) | 2014-10-21 | 2015-10-20 | A computational method for the identification of variants in nucleic acid sequences |
Country Status (2)
Country | Link |
---|---|
EP (1) | EP3210145A1 (en) |
WO (1) | WO2016062713A1 (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10395759B2 (en) | 2015-05-18 | 2019-08-27 | Regeneron Pharmaceuticals, Inc. | Methods and systems for copy number variant detection |
KR20190039693A (en) * | 2016-06-30 | 2019-04-15 | 난토믹스, 엘엘씨 | Synthetic WGS Bioinformatics Validation |
EP3267346A1 (en) * | 2016-07-08 | 2018-01-10 | Barcelona Supercomputing Center-Centro Nacional de Supercomputación | A computer-implemented and reference-free method for identifying variants in nucleic acid sequences |
CN109920485B (en) * | 2018-12-29 | 2023-10-31 | 浙江安诺优达生物科技有限公司 | Method for carrying out mutation simulation on sequencing sequence and application thereof |
CN112735516A (en) * | 2020-12-29 | 2021-04-30 | 上海派森诺生物科技股份有限公司 | Group variation detection analysis method without reference genome |
2015
- 2015-10-20 EP EP15784335.0A patent/EP3210145A1/en not_active Withdrawn
- 2015-10-20 WO PCT/EP2015/074253 patent/WO2016062713A1/en active Application Filing
Non-Patent Citations (2)
Title |
---|
None * |
See also references of WO2016062713A1 * |
Also Published As
Publication number | Publication date |
---|---|
WO2016062713A1 (en) | 2016-04-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Kumar et al. | Next-generation sequencing and emerging technologies | |
Moncunill et al. | Comprehensive characterization of complex structural variations in cancer by directly comparing genome sequence reads | |
De Roeck et al. | NanoSatellite: accurate characterization of expanded tandem repeat length and sequence through whole genome long-read sequencing on PromethION | |
Wadapurkar et al. | Computational analysis of next generation sequencing data and its applications in clinical oncology | |
Roth et al. | PyClone: statistical inference of clonal population structure in cancer | |
Ren et al. | RNA-seq analysis of prostate cancer in the Chinese population identifies recurrent gene fusions, cancer-associated long noncoding RNAs and aberrant alternative splicings | |
Zhao et al. | Robustness of RNA sequencing on older formalin-fixed paraffin-embedded tissue from high-grade ovarian serous adenocarcinomas | |
Sakarya et al. | RNA-Seq mapping and detection of gene fusions with a suffix array algorithm | |
AU2015367290A1 (en) | Sequencing controls | |
WO2016062713A1 (en) | A computational method for the identification of variants in nucleic acid sequences | |
JP2023504529A (en) | Systems and methods for automating RNA expression calls in cancer prediction pipelines | |
Gilpatrick et al. | Targeted nanopore sequencing with Cas9 for studies of methylation, structural variants, and mutations | |
Ergin et al. | RNA sequencing and its applications in cancer and rare diseases | |
EP3482329B1 (en) | A computer-implemented and reference-free method for identifying variants in nucleic acid sequences | |
Fasterius et al. | A novel RNA sequencing data analysis method for cell line authentication | |
Hallast et al. | Assembly of 43 human Y chromosomes reveals extensive complexity and variation | |
Bacher et al. | Mutational profiling in patients with MDS: ready for every-day use in the clinic? | |
Ahsan et al. | A survey of algorithms for the detection of genomic structural variants from long-read sequencing data | |
Mun et al. | A study of transposable element-associated structural variations (TASVs) using a de novo-assembled Korean genome | |
US20240141425A1 (en) | Correcting for deamination-induced sequence errors | |
Wan et al. | RNA sequencing and its applications in cancer diagnosis and targeted therapy | |
Seifi et al. | Application of next-generation sequencing in clinical molecular diagnostics | |
Gindin et al. | Analytical principles of cancer next generation sequencing | |
KR101977976B1 (en) | Method for increasing read data analysis accuracy in amplicon based NGS by using primer remover | |
Cheng et al. | Whole genome error-corrected sequencing for sensitive circulating tumor DNA cancer monitoring |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PUAI | Public reference made under article 153(3) epc to a published international application that has entered the european phase | Free format text: ORIGINAL CODE: 0009012 |
| 17P | Request for examination filed | Effective date: 20170522 |
| AK | Designated contracting states | Kind code of ref document: A1; Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR |
| AX | Request for extension of the european patent | Extension state: BA ME |
| DAV | Request for validation of the european patent (deleted) | |
| DAX | Request for extension of the european patent (deleted) | |
| GRAP | Despatch of communication of intention to grant a patent | Free format text: ORIGINAL CODE: EPIDOSNIGR1 |
| INTG | Intention to grant announced | Effective date: 20180813 |
| STAA | Information on the status of an ep patent application or granted ep patent | Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN |
| 18D | Application deemed to be withdrawn | Effective date: 20190103 |