WO2017085458A1 - A method for determining an alignment of segments of a genome - Google Patents
- Publication number
- WO2017085458A1 (application PCT/GB2016/053428)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- variants
- genome
- alternative
- variant
- nucleotides
- Prior art date
Links
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
- G16B30/10—Sequence alignment; Homology search
Definitions
- The present disclosure relates to genome sequencing and, more specifically but not exclusively, to a method for aligning variants of genome reads and comparing two sets of variants.
- Next generation sequencing, or high throughput sequencing, allows large amounts of sequencing data to be generated rapidly. Interpretation of such genetic data can have numerous clinical applications, particularly in medical fields such as cancer.
- Next generation sequencing produces short sequencing reads which are aligned using short read alignment techniques. Alignment of short sequencing reads to a reference genome allows detection of genetic variants such as a single nucleotide polymorphism (SNP), insertion or deletion (Indel) and copy number variations (CNV). Sequence alignment is useful for discovering structural, functional, and evolutionary relationships between biological sequences.
- SNP: single nucleotide polymorphism
- Indel: insertion or deletion
- CNV: copy number variation
- A method for determining an alignment of segments of a genome for use in genome diagnostics comprises obtaining a plurality of segments of a sequence of nucleotides associated with a subject genome, the plurality of segments including one or more variants, determining a plurality of alternative alignments of the variants, wherein each of the plurality of alternative alignments defines one way of aligning the plurality of segments to form the string of nucleotides, scoring the plurality of alternative alignments, and selecting one of the plurality of alternative alignments based on the score.
- The obtained plurality of segments may be reads of the subject genome.
- The subject genome may be the genome of a patient or person being tested.
- The method may further comprise placing each variant on a phase of a chromosome prior to determining the plurality of alternative alignments.
- The method may further comprise introducing a break into the chromosome if it cannot be determined whether or not two variants are on the same phase.
- Variants may be included on all phases.
- A plurality of alternatives of the chromosome may be provided, each one with the variant positioned on one of the phases.
- The method may further comprise converting the one or more variants into a regular form by inserting one or more gaps into the variant sequence prior to determining the alternative alignments.
- A regular form may be a known or standardised form for representing variants and/or strings of nucleotides.
- The one or more gaps may be inserted to increase the number of matching nucleotides between the variant sequence and a reference sequence.
- The reference sequence may be a known genome sequence, which is used as a means of comparison.
- The one or more gaps may be inserted when a nucleotide has either been inserted or deleted in the sequence.
- An alignment with a minimum number of gaps may be used for further processing.
- The method may further comprise left aligning the plurality of alternative alignments prior to scoring the plurality of alternative alignments. The alignments may be scored based on the number of variants.
- The selected one of the plurality of alternative alignments may be the alignment with the fewest variants.
- The method may further comprise matching variants of the subject genome with variants of a known reference genome.
- The known reference genome may be a genome known to have a certain characteristic, e.g. a known defect.
- The matching of variants may further comprise obtaining a string of nucleotides associated with a subject genome.
- The string of nucleotides may include one or more variants.
- The string of nucleotides may be derived from the selected one of the plurality of alternative alignments.
- The derivation may be that the string of nucleotides forms part of the alignment.
- The derivation may alternatively be that the string of nucleotides is the whole of the alignment.
- The method may further comprise expressing the string of nucleotides in a plurality of different formats to obtain a plurality of alternative nucleotide strings, and comparing the plurality of alternative nucleotide strings with at least one nucleotide string of the known reference genome to identify similarities and/or differences between the plurality of alternative nucleotide strings and the at least one nucleotide sequence of the known reference genome.
- The plurality of different formats may be alternative notations for representing nucleotides.
- The at least one nucleotide sequence of the known reference genome may be a variant indicative of a characteristic of the known reference genome. The characteristic may be indicative of a strength or weakness of the genome.
- The at least one nucleotide string of the known reference genome may be stored in a database of known reference genomes.
- Computer readable media including computer readable instructions which are operable, in use, to instruct a computer system to perform any method disclosed herein.
- Apparatus comprising a memory arranged to store computer readable instructions arranged to instruct a computer to perform any method disclosed herein and a processor arranged to process the computer readable instructions.
- Figure 1 illustrates the process of the present disclosure.
- Figure 2 illustrates the system arranged to perform the process of Figure 1.
- Described herein is a process for determining whether reference reads of a human genome sequence include variants with respect to a reference human genome. When variants are identified they may be indicative of a cause of a rare disease or a pathogenic variant.
- The standard alignment method used in aligning biological sequences is based on the Smith-Waterman algorithm with an affine gap penalty model. Although it is usual to obtain a single alignment using this algorithm, it has been shown that it can be extended to obtain multiple alignments, each with the same cumulative distance as the minimum distance, for example using algorithms such as Gotoh scan.
- The process described herein first compares variants in nucleotide sequences of patient reads. This is carried out to ensure that the reads are correctly aligned, by ensuring correct phasing and proximity. Once the reads are aligned, variant matching with reference databases can be performed to identify relevant variants in the human genome sequence. In particular, a variant is identified where a section of a nucleotide sequence differs from the corresponding section in the reference human genome.
- A known algorithm, referred to herein as the equal-best algorithm, is used both for comparing the variants in the nucleotide sequence and for performing the variant matching with reference databases.
- The term "equal-best algorithm" refers to known algorithmic processes that can be used to obtain multiple alignments, each with the same cumulative distance as the minimum distance.
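The idea of obtaining every alignment that shares the minimum distance can be illustrated with a small sketch. It is not the equal-best algorithm of this disclosure: for brevity it assumes a global alignment with unit costs (mismatch, insertion and deletion each cost 1) instead of an affine gap penalty, and the function name `equal_best_alignments` is hypothetical.

```python
def equal_best_alignments(ref, alt):
    """Return (distance, alignments): every gapped alignment of `alt`
    against `ref` that attains the minimum unit-cost edit distance."""
    n, m = len(ref), len(alt)
    # dist[i][j] = minimum edit distance between ref[i:] and alt[j:]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            if i == n and j == m:
                continue
            options = []
            if i < n and j < m:
                options.append(dist[i + 1][j + 1] + (ref[i] != alt[j]))
            if i < n:
                options.append(dist[i + 1][j] + 1)  # base deleted from alt
            if j < m:
                options.append(dist[i][j + 1] + 1)  # base inserted in alt
            dist[i][j] = min(options)

    results = []

    def walk(i, j, ref_row, alt_row):
        # Follow every move that stays on a minimum-distance path.
        if i == n and j == m:
            results.append(("".join(ref_row), "".join(alt_row)))
            return
        if i < n and j < m and dist[i][j] == dist[i + 1][j + 1] + (ref[i] != alt[j]):
            walk(i + 1, j + 1, ref_row + [ref[i]], alt_row + [alt[j]])
        if i < n and dist[i][j] == dist[i + 1][j] + 1:
            walk(i + 1, j, ref_row + [ref[i]], alt_row + ["-"])
        if j < m and dist[i][j] == dist[i][j + 1] + 1:
            walk(i, j + 1, ref_row + ["-"], alt_row + [alt[j]])

    walk(0, 0, [], [])
    return dist[0][0], results
```

For example, aligning `CTG` against `CTGTG` under these assumed costs yields three equally good placements of the two-base deletion, which is the kind of ambiguity the later scoring step has to resolve.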
- At step 1, reads of a subject genome are obtained.
- The reads are short relative to the length of the genome and it is therefore necessary to align the phasings of each of the reads. This process is discussed in detail below.
- A phasing algorithm is applied to identify phase couplings as well as phase breaks to create phased sections. While this step in the process is optional, phasing information allows accurate analysis, and the application of a phasing algorithm is recommended for the most accurate analysis. Various mathematical models are used to determine the most likely phase to which the variants should be aligned. This process therefore provides a more accurate determination of the linked arrangement of variants on either the paternal or maternal chromosome. Numerous phasing algorithms can optionally be applied, including but not limited to HapCUT and ProbHap.
- The phasing algorithm places each variant in one of the phases of the chromosome. If the analysis is germ-line then there are only two phases. If the analysis is somatic then there may be more than two phases at different sections in the chromosome.
- The phasing algorithm may introduce a "break" from time to time.
- A break is placed when the algorithm is not able to determine whether two adjacent variants are on the same phase or on different phases. This can be due to various factors, including proximity to other variants as well as a lack of read evidence.
- The second approach is conservative: two adjacent variants for which no phasing information is available are treated as separate and are not combined. Results obtained this way are "definite", but the inability to combine adjacent variants often leads to missing some matches in the database or between two sets of variants. The conservative mode is more commonly used because the eventual results are definite matches.
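The conservative treatment of phase breaks can be sketched as a grouping step. This is only an illustration under stated assumptions: each variant is a dictionary with a `pos` key and an optional `phase_set` key (a hypothetical field, loosely modelled on a phase-set identifier emitted by a phasing tool), and a variant without phasing information always forms its own section.

```python
def conservative_sections(variants):
    """Group position-sorted variants into sections.

    Adjacent variants are combined only when both carry the same
    phase-set identifier; unphased variants are never combined.
    """
    sections = []
    for variant in sorted(variants, key=lambda v: v["pos"]):
        phase_set = variant.get("phase_set")
        previous = sections[-1][-1] if sections else None
        if (
            previous is not None
            and phase_set is not None
            and previous.get("phase_set") == phase_set
        ):
            sections[-1].append(variant)   # same phased section
        else:
            sections.append([variant])     # phase break or no phasing
    return sections
```

With this rule, two unphased neighbours end up in different sections, which mirrors the trade-off described above: the results are definite, but some combined matches may be missed.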
- Phase sections are combined into analysis sections by applying proximity considerations, at step 20.
- Proximity considerations allow the long sequence to be broken up into many smaller sequences. The proximity consideration is simply based on how far apart two adjacent variants in a sequence are. In conservative mode, each phase section is broken down into smaller sections. If phasing information is available, each smaller section will correspond to a phased section. If no phasing information is available then each section will correspond to one variant. Note that some padding will be applied when extracting the variant sequences, usually up to the adjacent variant on either side.
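A proximity split of this kind might look like the following minimal sketch. The threshold `max_gap` and its default value are assumptions made for illustration; the text does not state how far apart two variants must be before they are separated.

```python
def split_by_proximity(positions, max_gap=50):
    """Split variant positions into sections wherever two adjacent
    variants are more than `max_gap` bases apart (hypothetical threshold)."""
    sections = []
    for pos in sorted(positions):
        if sections and pos - sections[-1][-1] <= max_gap:
            sections[-1].append(pos)   # close enough: same section
        else:
            sections.append([pos])     # too far apart: start a new section
    return sections
```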
- A pre-processing step is performed by converting the first variant into a form in which the variants, including deletions, are marked distinctly. Pre-processing is applied prior to the application of the segmental variant comparison algorithm in order to place the data in the best form for obtaining accurate results.
- Gaps are introduced into the variant sequence to increase the number of matching nucleotides between the variant sequence and the reference sequence.
- A gap is inserted when a residue has either been deleted or inserted in the sequence.
- Sequences are aligned to provide the minimum "gapped edit distance" with respect to the reference sequence.
- Each variant within a segment should be converted into a minimum distance form.
- The minimum distance form is one where the alignment produces the lowest match distance.
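To make the notion of a minimum gapped distance concrete, the sketch below computes a global alignment distance with an affine gap penalty using the standard three-matrix (Gotoh-style) recurrence. The cost values (`mismatch=1`, `gap_open=2`, `gap_extend=1`) are illustrative assumptions, not values taken from this disclosure, and the exact definition of "gapped edit distance" used here may differ.

```python
INF = float("inf")

def affine_gap_distance(ref, alt, mismatch=1, gap_open=2, gap_extend=1):
    """Minimum global alignment cost where a gap of length k costs
    gap_open + k * gap_extend (assumed cost model)."""
    n, m = len(ref), len(alt)
    # M: last column is a match/mismatch; X: gap in alt; Y: gap in ref
    M = [[INF] * (m + 1) for _ in range(n + 1)]
    X = [[INF] * (m + 1) for _ in range(n + 1)]
    Y = [[INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = gap_open + gap_extend * i
    for j in range(1, m + 1):
        Y[0][j] = gap_open + gap_extend * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = mismatch if ref[i - 1] != alt[j - 1] else 0
            M[i][j] = min(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + sub
            # Opening a new gap pays gap_open; extending an existing one does not.
            X[i][j] = min(M[i - 1][j] + gap_open, Y[i - 1][j] + gap_open, X[i - 1][j]) + gap_extend
            Y[i][j] = min(M[i][j - 1] + gap_open, X[i][j - 1] + gap_open, Y[i][j - 1]) + gap_extend
    return min(M[n][m], X[n][m], Y[n][m])
```

Under these assumed costs a single two-base gap is cheaper than two separate one-base gaps, so the minimum distance form of a two-base deletion keeps the deleted bases contiguous.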
- The segmental variant comparison algorithm is applied to each enumerated section and a comparison result is obtained.
- The segmental variant comparison algorithm uses the equal-best algorithm to compare two sets of variants and to determine which variants in a first segment are matched by the variants in a second segment.
- The approach is based on using the equal-best algorithm to align the two nucleotide sequences to obtain all the different alignments/interpretations of the second segment with respect to the first segment, such that the different interpretations all have the same "gapped edit distance", which is also the optimum distance.
- The equal-best algorithm can detect single nucleotide polymorphisms (SNPs), insertions and deletions, as well as mixed variants.
- Each alternative alignment from the equal-best algorithm is then converted to a variant-marked form in the same way as the first variant was pre-processed.
- Each variant-marked sequence is then compared to the reference sequence.
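One way to picture the conversion of an alignment into marked variants is the sketch below. It assumes that an alignment is given as two equal-length rows using `-` for gaps, and it reports each run of non-matching columns as a `(position, ref, alt)` tuple; this tuple representation is an assumption made for illustration and is not the marking scheme of the disclosure.

```python
def variants_from_alignment(ref_row, alt_row, start=0):
    """Convert a gapped alignment into (pos, ref, alt) tuples.

    `pos` is the reference coordinate of the first differing column;
    an empty `ref` denotes an insertion, an empty `alt` a deletion.
    """
    variants = []
    pos = start      # reference coordinate of the current column
    current = None   # variant under construction: [pos, ref, alt]
    for r, a in zip(ref_row, alt_row):
        if r == a:
            if current is not None:
                variants.append(tuple(current))
                current = None
        else:
            if current is None:
                current = [pos, "", ""]
            if r != "-":
                current[1] += r
            if a != "-":
                current[2] += a
        if r != "-":
            pos += 1  # only reference bases advance the coordinate
    if current is not None:
        variants.append(tuple(current))
    return variants
```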
- TP is the number of true positives (the number of positives that are correctly identified as such)
- FN is the number of false negatives (the number of negatives that are incorrectly identified as such)
- FP is the number of false positives (the number of positives that are incorrectly identified as such)
- k1, k2 and k3 are "a priori" coefficients.
- An alignment is selected that involves the fewest variants when written down. It is still possible to get more than one alignment (or set of variants) with the same minimum distance. In such cases it is acceptable to pick any one of the remaining alignments.
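The scoring and selection step can be sketched as follows. The scoring formula itself is not reproduced in this text, so the linear form `k1*TP + k2*FN + k3*FP` and the default coefficients of -1 are assumptions, chosen only because they reproduce the example scores listed later in this document (1 TP, 1 FN, 1 FP scoring -3; 2 TP, 2 FP scoring -4) and because they amount to preferring the fewest variants.

```python
def score(tp, fn, fp, k1=-1, k2=-1, k3=-1):
    """Assumed linear score; k1, k2, k3 are hypothetical a priori coefficients."""
    return k1 * tp + k2 * fn + k3 * fp

def select_alignment(candidates):
    """candidates: list of (label, tp, fn, fp) tuples.

    Returns the candidate with the highest score. When several
    candidates tie, the first one listed is returned, in line with
    the statement that any one of them may be picked.
    """
    return max(candidates, key=lambda c: score(c[1], c[2], c[3]))
```

A usage example with the four candidate rows from the worked example, treating a count that is not listed as zero (an assumption):

```python
rows = [("CTGTGtG-", 1, 1, 1), ("CTGTG-tG", 1, 1, 1),
        ("CTG-tGtG", 2, 0, 2), ("C-TGtGtG", 2, 0, 2)]
best = select_alignment(rows)
```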
- Publicly available databases record reports of variants in patient samples and record useful associated data such as clinical significance. Examples of such publicly available databases are ClinVar and dbSNP. It is useful to compare a variant in a patient variant call format (VCF) file with known variants in databases to provide further information regarding the variant, including its clinical significance. Variants in the database can be pre-processed in the same way as described previously for variants in the subject sequence.
- VCF: variant call format
- The matching of variants is complicated by the multiple forms in which variants can be written. Various different notations are well known in the art. The representation of a variant in a patient variant file may not match the entry as written in dbSNP, owing to circumstances such as the complexity of the variant and the presence of other variants at nearby positions in the chromosome. A variation can be represented in different forms: insertions and deletions can be left-aligned, right-aligned or in-between; the variant can be enlarged to show the surrounding context; and complex variants can be written as a single variant or as a number of simple variants.
- The equal-best algorithm for database matching uses the reference genome along with the patient VCF (and any phasing information in the VCF file) to reconstruct the nucleotide sequence for a region. This allows for a more complete discovery of known variants.
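Reconstructing the nucleotide sequence of a region from the reference and a list of variants can be sketched as below. The sketch assumes non-overlapping variants that all lie on a single phase, given as `(pos, ref, alt)` tuples with 1-based positions in the style of VCF records; the function name and argument layout are hypothetical, and no real VCF parsing is shown.

```python
def apply_variants(reference, region_start, variants):
    """Rebuild the sample sequence for one region.

    reference    : reference bases for the region
    region_start : 1-based coordinate of reference[0]
    variants     : non-overlapping (pos, ref, alt) tuples on one phase
    """
    pieces = []
    cursor = region_start  # next reference coordinate not yet copied
    for pos, ref, alt in sorted(variants):
        offset = pos - region_start
        if reference[offset:offset + len(ref)] != ref:
            raise ValueError(f"REF allele does not match reference at {pos}")
        pieces.append(reference[cursor - region_start:offset])  # unchanged bases
        pieces.append(alt)                                      # variant allele
        cursor = pos + len(ref)
    pieces.append(reference[cursor - region_start:])
    return "".join(pieces)
```

The reconstructed string can then be aligned against the reference region, so that the comparison no longer depends on how each variant happened to be written in the input file.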
- Variants in a corresponding region are processed so that a minimal distance, left-aligned form of each variant replaces the original entry. This is done by extending the region around each variant using the reference genome and aligning the extended reference sequence to the extended alternate sequence using the Smith-Waterman and Gotoh algorithms. Variants that have been pre-processed therefore have two properties: they are in minimal distance form and they are left-aligned.
- Variants are stored within the database.
- Variants are re-expressed from the database in a "regularized" form before matching with the patient variants.
- The pre-processing can be done depending on the positions of the corresponding patient variant sections and not on the whole database.
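A commonly used way to obtain a left-aligned, minimal representation of a single variant is the trim-and-extend normalization sketched below. It is offered as a stand-in for illustration: the disclosure describes re-aligning extended sequences with Smith-Waterman and Gotoh, whereas this sketch uses the simpler allele-trimming procedure. Positions are 0-based indices into the `reference` string, which is an assumption of the sketch.

```python
def normalize(reference, pos, ref, alt):
    """Left-align and trim one variant (pos is a 0-based index)."""
    changed = True
    while changed:
        changed = False
        # Drop a shared trailing base.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # If an allele became empty, extend both alleles one base to the left.
        if (not ref or not alt) and pos > 0:
            pos -= 1
            base = reference[pos]
            ref, alt = base + ref, base + alt
            changed = True
    # Drop shared leading bases, keeping at least one base per allele.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt
```

Applying the same normalization to both the patient variants and the database entries means that two spellings of the same deletion within a repeat collapse to one form before they are compared.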
- The equal-best algorithm is then applied to each enumerated variant section.
- The new variants produced by the equal-best algorithm are matched against the processed database using chromosome, position, reference and alternate. If, during the process of regularisation, one variant splits into two or more, all of the component variants have to be matched for the corresponding database entry to be deemed a match.
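The matching rule, including the requirement that every component of a split entry be present, can be sketched as a set lookup. The data layout is an assumption: the database is modelled as a dictionary from an entry identifier to the list of `(chromosome, position, reference, alternate)` components produced by regularisation, and the identifiers in the usage example are made up.

```python
def match_database(patient_variants, database):
    """Return the identifiers of database entries matched by the patient.

    patient_variants : iterable of (chrom, pos, ref, alt) tuples
    database         : dict mapping entry id -> list of component tuples
    An entry matches only if all of its components are present.
    """
    patient = set(patient_variants)
    return [
        entry_id
        for entry_id, components in database.items()
        if components and all(c in patient for c in components)
    ]
```

```python
patient = [("1", 1000, "C", "T"), ("1", 1002, "C", "CT")]
database = {
    "entry_a": [("1", 1000, "C", "T")],                        # simple entry
    "entry_b": [("1", 1000, "C", "T"), ("1", 1002, "C", "CT")],  # split entry, both present
    "entry_c": [("1", 1000, "C", "T"), ("1", 1005, "G", "A")],   # split entry, one missing
}
matched = match_database(patient, database)
```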
- Any suitable database of genetic variation may be used. Examples of such databases are dbSNP and ClinVar.
- An example entry in the dbSNP database is as follows, containing the chromosome, position, an ID, variants and information about the variant:
- V2 CTGTGtG-: 1 TP, 1 FN, 1 FP; score -3
- V2 CTGTG-tG: 1 TP, 1 FN, 1 FP; score -3
- V2 CTG-tGtG: 2 TP, 2 FP; score -4
- V2 C-TGtGtG: 2 TP, 2 FP; score -4
- V2 CTGTG-tG G/GT G/GT GG/G 3 Output: one of two below
- Chromosome 11, position 102398434, ref C, alt T, tag rs17884405
- Chromosome 11, position 102398436, ref C, alt CT, tag rs398097780
- Figure 2 illustrates the system used to implement the process of Figure 1.
- The process itself is carried out by a computer system 100.
- The process is performed by software stored in memory 120, which is run by processor 110.
- The computer system includes an input 130 arranged to receive chromosome reads. This input could be a network connection for uploading reads that have been produced by another system, a direct link to a system that has carried out the reads, or any other means for importing data into a computer system.
- The system also has an output 140, which outputs the genetic matches that are identified at the end of the process.
- The output may be the display of the results on a computer screen.
- The computer system 100 is arranged to communicate with a database on a server 200, which stores all of the known genetic variants with which the patient's variants are being compared. Whilst Figure 2 illustrates the database on server 200, in alternative arrangements the database may be stored in the memory 120 of the computer system 100.
- The various methods described above may be implemented by one or more computer program products or computer readable media provided on one or more devices.
- The computer program product or computer readable media may include computer code arranged to instruct a computer or a plurality of computers to perform the functions of one or more of the various methods described above.
- The computer program and/or the code for performing such methods may be provided to an apparatus, such as a computer, on a computer readable medium or computer program product.
- The computer readable medium may be transitory or non-transitory.
- The computer readable medium could be, for example, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, or a propagation medium for data transmission, for example for downloading the code over the Internet.
- The computer readable medium could take the form of a physical computer readable medium such as semiconductor or solid state memory, magnetic tape, a removable computer diskette, a random access memory (RAM), a read-only memory (ROM), a rigid magnetic disk, or an optical disk, such as a CD-ROM, CD-R/W or DVD.
- An apparatus such as a computer may be configured in accordance with such code to perform one or more processes in accordance with the various methods discussed herein.
- Such an apparatus may take the form of a data processing system.
- Such a data processing system may be a distributed system.
- Such a data processing system may be distributed across a network.
- Some of the processes may be performed by software on a user device, while other processes may be performed by software on a server, or a combination thereof.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU2016355090A AU2016355090A1 (en) | 2015-11-18 | 2016-11-04 | A method for determining an alignment of segments of a genome |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
GB1520355.7 | 2015-11-18 | ||
GB1520355.7A GB2544506A (en) | 2015-11-18 | 2015-11-18 | A method for determining an alignment of segments of a genome |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2017085458A1 true WO2017085458A1 (en) | 2017-05-26 |
Family
ID=55132989
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/GB2016/053428 WO2017085458A1 (en) | 2015-11-18 | 2016-11-04 | A method for determining an alignment of segments of a genome |
Country Status (3)
Country | Link |
---|---|
AU (1) | AU2016355090A1 (en) |
GB (1) | GB2544506A (en) |
WO (1) | WO2017085458A1 (en) |
-
2015
- 2015-11-18 GB GB1520355.7A patent/GB2544506A/en not_active Withdrawn
-
2016
- 2016-11-04 AU AU2016355090A patent/AU2016355090A1/en not_active Abandoned
- 2016-11-04 WO PCT/GB2016/053428 patent/WO2017085458A1/en active Application Filing
Also Published As
Publication number | Publication date |
---|---|
GB2544506A (en) | 2017-05-24 |
GB201520355D0 (en) | 2015-12-30 |
AU2016355090A1 (en) | 2018-06-07 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 16793994 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
ENP | Entry into the national phase |
Ref document number: 2016355090 Country of ref document: AU Date of ref document: 20161104 Kind code of ref document: A |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 16793994 Country of ref document: EP Kind code of ref document: A1 |