NL2014199B1 - A computer implemented method for generating a variant call file

Publication number
NL2014199B1
Authority
NL
Netherlands
Prior art keywords
data
variant
value
sequence
type
Prior art date
Application number
NL2014199A
Other languages
Dutch (nl)
Other versions
NL2014199A (en
Inventor
Karten Johannes
Original Assignee
Genalice B V
Priority date
Filing date
Publication date
Application filed by Genalice B V
Priority to NL2014199A
Publication of NL2014199A
Application granted
Publication of NL2014199B1

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10: Sequence alignment; Homology search
    • G16B50/00: ICT programming tools or database systems specially adapted for bioinformatics

Abstract

The invention relates to a computer implemented method (200) for generating a variant call file from an input file with aligned data sequences that have been aligned with a reference data sequence. The method comprises an aggregating process (214) generating aggregated data linked to positions of the reference data sequence from aligned data sequences, and auxiliary aggregated data for variants linked to positions of the reference data sequence. The auxiliary aggregated data is indicative of sequence parts of aligned data sequences whose bases perfectly match the reference data sequence from an end of the aligned data sequence to a variant. In a subsequent realignment process (216), "incorrect" mappings are detected and the associated auxiliary aggregated data is used to correct the aggregated data of the positions on the reference corresponding to the "incorrect" mapping.

Description

A computer implemented method for generating a variant call file
TECHNICAL FIELD
The invention relates to a computer implemented method for generating a variant call file from an input file comprising aligned data sequences.
BACKGROUND
The present disclosure relates generally to the generation of a variant call file from aligned data sequences. The variant call file makes it possible to visualise variants in a genome relative to a reference genome.
In software tools available on the market, the aligned sequences are read from an input file, for example a BAM file or a SAM file. The data in the input file is processed in its entirety to obtain statistics on the numbers of aligned sequences, bases and variants that are mapped on each of the positions on the reference genome. The statistics are then stored in a VCF file or another intermediate file format. These statistics comprise a lot of noise, as each error made by a DNA sequencing machine will be present in the statistics. These errors make it more difficult to find variants when the statistics are represented by graphs on a display. An error made by the sequencing machine has the same contribution in the statistics as a real mutation in an aligned sequence. Furthermore, the bases at the beginning or end of an aligned sequence could be mapped erroneously due to the lack of information needed to map said bases correctly. Therefore, in one or more subsequent processes, the data in the file with statistics is processed to filter out these types of errors. Due to the large size of the data file and the huge amount of data in the file, the throughput time to obtain the final result will be long. The reading and writing of the files with intermediate results alone will already take a considerable amount of time.
SUMMARY
It is an object of the invention to provide an improved method for generating a variant call file (VCF) from an input file comprising aligned data sequences. The improvement is at least one of: increase of processing efficiency, upgrading signal to noise quality of variants, correction of incorrect mapping, generating quality information for variants in a VCF file, reduced processing power, reduced processing time and reduced CPU usage.
According to the invention, this object is achieved by a method having the features of Claim 1. Advantageous embodiments and further ways of carrying out the invention may be attained by the measures mentioned in the dependent claims.
According to a first aspect of the invention, there is provided a method for generating a variant call file from an input file with aligned sequences that have been aligned with a reference data sequence. The method comprises a first aggregating process generating aggregated data linked to positions of the reference data sequence from aligned data sequences and storing the aggregated data in a variant call store. The method further comprises a second aggregating process and a realignment process. The second aggregating process generates auxiliary aggregated data for variants linked to positions of the reference data sequence. The auxiliary aggregated data is indicative of sequence parts of aligned data sequences whose bases perfectly match the reference data sequence from an end of the aligned data sequence to a variant. The realignment process comprises detecting in the aggregated data a position with a variant of a first type and a variant of a second type, wherein the variant of the second type comprises auxiliary aggregated data. In response to detection, the following three actions are performed: a) the aggregated data corresponding to the variant of the second type is added to the aggregated data corresponding to the variant of the first type, b) aggregated data of nearby positions is updated using the auxiliary aggregated data of the variant of the second type, and c) the aggregated data corresponding to the variant of the second type is removed.
The idea is to speed up the generation of a VCF file by reducing the amount of data that has to be read from or written to a file on disc or main memory of a processing unit, and thereby to decrease the throughput time. It is commonly known that reading and writing information to a disc is a relatively slow process. As the data files comprising information related to DNA sequences are very large, typically in the range of 100 Gbytes, the reading/writing will consume a lot of time. In BAM files, the aligned data sequences are in order of the position number of their first aligned base on the reference. This makes it possible to read only the aligned data sequences having their first aligned base in a defined range on the reference and not the entire input file. This also allows parallel processing of ranges. A VCF file comprises aggregated information of the bases of the aligned data sequences on each respective position on the reference. A VCF file of DNA sequences typically has a file size in the range of 400 Mbytes. In the VCF file, for each position it is identified how many times a base with value A, C, T and G is mapped at said position and also the number of times there is a variant such as “insert”, “deletion” and “break”. It has further been found that the filtering processes need not be performed on the entire aggregated data in a VCF file at once, but could be performed directly on aggregated data of only a limited range of positions. The segmentation of the positions of the reference is based on this understanding. For each position in a segment, all sequences having a variant mapped on said position have to be read from the file with aligned sequences. “Variant” in the context of the present application means any possible occurrence that is mapped on a position of the reference; this includes, next to the value of the base at said position on the reference, the other possible three base values (updates), inserts and deletions.
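As an illustration of the aggregated data described above, the following minimal Python sketch counts, per reference position, how often each base value is mapped there. It is not the patented implementation: the function name aggregate_counts and the (start, bases) read representation are hypothetical, and the sketch assumes reads without inserts, deletions or breaks.

```python
from collections import Counter, defaultdict

def aggregate_counts(reads):
    """Count base values per reference position.

    reads: iterable of (start, bases) pairs, where start is the 1-based
    reference position of the first base. Assumption of this sketch:
    reads contain no inserts, deletions or breaks.
    """
    counts = defaultdict(Counter)
    for start, bases in reads:
        for offset, base in enumerate(bases):
            counts[start + offset][base] += 1
    return counts

# Three hypothetical aligned reads overlapping on positions 2..5.
reads = [(1, "ACGT"), (2, "CGTA"), (3, "TTA")]
counts = aggregate_counts(reads)
```

In this example, position 3 receives two G bases and one T, which is the kind of per-position tally a later filtering step would operate on.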
By the order of the aligned sequences in the input file and knowledge about the maximal length of an aligned sequence, it is possible to read only a part of the input file to obtain the aggregated data and auxiliary aggregated data for a position or a range of positions on the reference sequence. The auxiliary aggregated data, which is not necessarily stored in the VCF file, could be used as intermediate data to correct mapping results which could be correct but which are regarded as incorrect when taking into account the mapping results of reads that are mapped at the same position on the reference. This is possible by administering the number of bases at the beginning and end of a read that perfectly match the reference. By generating the auxiliary aggregated data simultaneously with the aggregated data, it is possible to correct these “incorrect” mappings in one run and not in a subsequent process wherein the reads have to be read a second time from a BAM file.
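The bookkeeping of perfectly matching bases at the beginning and end of a read can be sketched as follows. This is a hedged illustration only: the function name perfect_match_flanks is hypothetical, the start position is 0-based, and the read is assumed to be mapped without gaps and to lie entirely within the reference.

```python
def perfect_match_flanks(read, reference, start):
    """Return (leading, trailing): the number of bases at the beginning
    and at the end of the read that perfectly match the reference.

    Assumptions of this sketch: ungapped mapping, 0-based start, and the
    read lies entirely within the reference.
    """
    window = reference[start:start + len(read)]
    leading = 0
    for r, w in zip(read, window):
        if r != w:
            break
        leading += 1
    trailing = 0
    for r, w in zip(reversed(read), reversed(window)):
        if r != w:
            break
        trailing += 1
    return leading, trailing

reference = "ACGTACGTAC"
# Hypothetical read with one mismatching base at offset 4.
flanks = perfect_match_flanks("GTACCTAC", reference, 2)
```

Here four bases match before the mismatch and three bases match after it, so the two values delimit the sequence parts from each end of the read to the variant.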
In an embodiment, the second aggregating process generates auxiliary aggregated data of a first type indicative of the sequence parts of the aligned data sequences preceding the variants that are mapped on the same position of the reference data sequence and in a further embodiment, the second aggregating process generates auxiliary aggregated data of a second type indicative of the sequence parts of the aligned data sequences following the variants that are mapped on the same position of the reference data sequence.
In an embodiment, the second aggregating process generates auxiliary aggregated data of a third type indicating an aggregated quality value for similar variants mapped at a position on the reference data sequence and/or counting information indicating the quality of the similar variants is above and below a predefined quality value. These features make it possible to take into account the quality of variants and to correct the quality values of a variant at a particular position on the reference when correcting the mapping of reads due to “incorrect” mapping.
In an embodiment, the variant of the first type and the variant of the second type is at least one combination of a group comprising the following combinations: “update”-“insert”; “update”-“delete”; “insert”-“update”; “delete”-“update”; “insert”-“delete”; “delete”-“insert”; “insert”-“reference” and “delete”-“reference”. The correction can be made on different possible “incorrect” mappings.
In an embodiment, the first aggregating process and the second aggregating process are combined to form one process to generate simultaneously the aggregated data and the auxiliary aggregated data. In this way, the aligned sequences have to be read once from a file to generate a VCF file wherein the aggregated data in the VCF file is corrected for “incorrect” mappings.
In an embodiment, the method further comprises an upgrade process. The upgrade process comprises: - aggregating information of variants of the type “insert” at a particular position on the reference sequence to obtain aggregated base value information for each position of the variants of the type “insert” and a total value corresponding to the number of variants of the type “insert” at said particular position on the reference sequence, the aggregated base value information comprising a count value for each possible base value for each position of the variants of the type “insert”; - determining an additional minimum count value based on the total value and a user defined error ratio; - determining a most common base value at a position of the variants of the type “insert”; and, - updating the first aggregated base value information if the count value corresponding to a specific base value different from the most common base value at the position of the variants of the type “insert” is smaller than the minimum count value by removing the specific base value and increasing the count value of the most common base value with the count value corresponding to the specific base value.
In an embodiment, the method further comprises an upgrade process comprising: - aggregating information of variants of the type “insert” at a particular position on the reference sequence to obtain aggregated base value information for each position of the variants of the type “insert” and a total value corresponding to the number of variants of the type “insert” at said particular position on the reference sequence, the aggregated base value information comprising a count value for each possible base value for each position of the variants of the type “insert”; - determining an additional minimum count value based on the total value and a user defined error ratio; - determining a most common base value at a position of the variants of the type “insert”; and, - updating the first aggregated base value information if the count value corresponding to a specific base value different from the most common base value at the position of the variants of the type “insert” is smaller than the minimum count value by removing the specific base value and increasing the count value of the most common base value with the count value corresponding to the specific base value.
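The upgrade steps listed above can be sketched for a single position within the inserts as follows. This is a minimal sketch under stated assumptions: the name upgrade_insert_column is hypothetical, and the minimum count is taken to be the total multiplied by the user defined error ratio, which is one plausible reading since the text does not give the exact formula.

```python
from collections import Counter

def upgrade_insert_column(column, total, error_ratio):
    """Fold rare base values into the most common base value.

    column: Counter mapping base value -> count for one position of the
    variants of the type "insert" at one reference position.
    total: number of variants of the type "insert" at that position.
    Assumption: minimum count = total * error_ratio.
    """
    min_count = total * error_ratio
    most_common_base, _ = column.most_common(1)[0]
    upgraded = Counter(column)
    for base, count in column.items():
        if base != most_common_base and count < min_count:
            upgraded[most_common_base] += count  # credit the dominant base
            del upgraded[base]                   # remove the rare base value
    return upgraded

noisy = Counter({"A": 95, "C": 3, "G": 2})
cleaned = upgrade_insert_column(noisy, total=100, error_ratio=0.05)

mixed = Counter({"A": 60, "T": 38, "C": 2})
kept = upgrade_insert_column(mixed, total=100, error_ratio=0.05)
```

In the first example both rare base values fall below the minimum count of 5 and are merged into A. In the second example T is kept as a separate base value because its count is well above the minimum.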
By generating auxiliary aggregated base value information for variants of the type “insert” and/or “update” simultaneously with the other aggregated data it becomes possible to reduce the number of variants at a particular position of the reference in the VCF file to be generated and to improve the signal to noise ratio of a variant at said particular position. This increases the possibility that a mutation will be recognized in the VCF file as important/relevant mutation for a particular disease or abnormality in the DNA material.
In an embodiment, wherein the method comprises a combined process running on a single processing unit, the combined process includes at least the first aggregating process, the second aggregating process and the realignment process, the method further comprises: - segmenting the reference data sequence in non-overlapping segments of N positions; - P parallel combined processes, where P>1; wherein a combined process retrieves from the input file the aligned data sequences which have a first base with a position number in a range which starts M positions before the first position of a segment and ends with the last position of the non-overlapping segment for processing by the first aggregating process and the second aggregating process to generate variant call data for the N positions associated with the segment. These features make it possible to distribute the processing of the aligned data sequences over a multitude of CPUs, wherein each CPU has to read the aligned data sequences once to generate the VCF data corresponding to a non-overlapping segment.
In a further embodiment, the processing of the non-overlapping segments is performed by P processing units and subsequent non-overlapping segments have increasing integer index numbers, the method further comprises: - distributing the processing of the non-overlapping segments over the P processing units, wherein the non-overlapping segments are cyclically assigned to the P processing units, a processing unit with index number i processes successively segments with index number i, i + P, i + 2P, i + 3P, ... , i + nP. In a further embodiment, the method uses P buffers to store variant call data, the processing unit with index number i stores the variant call data corresponding to a non-overlapping segment in a buffer with index number i, the method further comprises: - retrieving cyclically variant call data corresponding to a non-overlapping segment from the buffers with index number 1 to P; and, - adding the variant call data corresponding to the non-overlapping segment to the VCF file.
These features make it possible to generate efficiently a VCF file wherein the aggregated data is stored in order of position number of the reference sequence. This provides random access possibilities to the VCF file to post-process VCF data corresponding to a specific range of positions of the reference without reading the entire VCF file.
In an embodiment, the combined process further comprises filtering the aggregated data. This feature allows irrelevant variants to be removed before storing the VCF data in a file.
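The cyclic assignment of segments to processing units can be written down in a few lines. The sketch below is illustrative only; the function name segments_for_unit and the parameter num_segments are hypothetical, and both segments and processing units are numbered from 1 as in the text.

```python
def segments_for_unit(i, P, num_segments):
    """Segment indices (1-based) processed by processing unit i of P:
    i, i + P, i + 2P, ... up to num_segments."""
    if not 1 <= i <= P:
        raise ValueError("unit index must be in 1..P")
    return list(range(i, num_segments + 1, P))

# Hypothetical example: 10 segments distributed over 3 processing units.
assignment = {i: segments_for_unit(i, 3, 10) for i in range(1, 4)}
```

With this assignment, visiting the units in the order 1, 2, 3, 1, 2, 3, ... and taking one finished segment from each in turn yields the segments in increasing index order.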
According to a second aspect, there is provided a computer implemented system comprising a processor, an Input/Output device to connect to the network system, a database and a data storage comprising instructions, which when executed by the processor cause the computer implemented system to perform any of the methods described above.
Other features and advantages will become apparent from the following detailed description, taken in conjunction with the accompanying drawings which illustrate, by way of example, various features of embodiments.
BRIEF DESCRIPTION OF THE DRAWINGS
These and other aspects, properties and advantages will be explained hereinafter based on the following description with reference to the drawings, wherein like reference numerals denote like or comparable parts, and in which:
Fig. 1 illustrates the hardware structure of a computer implemented system;
Fig. 2 illustrates a block diagram of a system performing the method for generating a variant call file;
Fig. 3 illustrates the relation between segmentation of the reference sequence and read aligned data sequences;
Fig. 4 illustrates the processing of aligned data sequences of a segment on a processing unit;
Figs. 5-9 illustrate an example for correcting inconsistent mapping at the beginning of aligned data sequences;
Figs. 10-13 illustrate four other examples of inconsistent mapping;
Figs. 14-17 illustrate an example for upgrading the quality of variants of the type “insert”;
Fig. 18 illustrates an example for generating auxiliary aggregate data indicative for the quality of a variant; and,
Fig. 19 is a block diagram illustrating a computer implemented system.
DETAILED DESCRIPTION
Prior to describing exemplary embodiments in detail, terms to be used herein will be defined.
First, the term "read sequence" (or abbreviated as "read") refers to short-length genome sequence data output from a genome sequencer. The length of a read sequence generally varies from 35 to 500 base pairs (bp) depending on the kind of genome sequencer, and a read is generally expressed in four letters, A, C, G and T, in the case of DNA.
The term "reference genome" (or abbreviated as “reference”) refers to a database describing the genome as a sequence of bases. When mapping a read on the reference genome, the position on the reference is searched for at which one or more sequence parts of the read have the best similarity. In the present description, the term “reference data sequence” refers to the reference genome.
The term "base" is the minimum unit constituting a target genome sequence and a read. As described above, DNA is expressed with four letters such as A, C, G and T, each of which is called a base, and so is the read sequence.
In the present description, the term “aligned data sequence” refers to a read that is mapped on the reference genome with the best similarity.
The term “variant” refers to any possible event (of one or more bases) which is located by the mapping process on a position of the reference, this includes next to the value of the base at said position on the reference, the other possible three base values (SNP), updates, inserts, deletions and break fusion.
Fig. 1 illustrates the hardware structure of a general purpose computing system 100. An input file with aligned data sequences is stored on a storage medium 112, for example a hard disc. The input file could be a Binary Alignment/Map (BAM) file, a Sequence Alignment/Map (SAM) file or any other file format comprising aligned data sequences. The aligned data sequences should be sorted by position and the input file should be indexed to allow random access to the aligned data sequences. Typical file sizes for files with aligned data sequences are in the range of 100 Gbyte. The system 100 further comprises a system bus 116, a number of central processing units (CPU) 120 - 120’ and associated CPU caches 118 - 118’. A CPU cache is used by the CPU of a computing system to reduce the average time to access data from the main memory. The CPU cache is a smaller and faster memory which stores copies of the data from frequently used main memory locations. The system 100 further comprises a storage medium 126 to store a Variant Call Format (VCF) file. The storage media 112 and 126 could be the same device. A VCF file is a text file used in bioinformatics for storing gene sequence variations with respect to a reference data sequence, i.e. reference genome. By using the variant call format, only the variations need to be stored along with a reference genome. Furthermore, the number of occurrences of each variation at positions on the reference and the number of bases of the aligned data sequences having the same base value as the reference sequence at the mapped position are stored in the VCF file. The information stored in a VCF file can be seen as aggregated data which is aggregated from the aligned data sequences.
One of the basic ideas of the method for generating a VCF file from an input file with aligned data sequences is to reduce the amount of data that has to be exchanged over the system bus when processing the aligned data sequences to obtain the “final” VCF file. “Final” in the present context means a VCF file which has been obtained by aggregating data from a file with aligned data sequences, which aggregated data is post-processed by at least one of the following operations: 1) searching for and correcting inconsistent mapping results, 2) upgrading the signal quality of variants and 3) removing noise. Another basic idea is that the searching for and correcting of inconsistent mapping results can be performed on the aggregated data if some auxiliary data is aggregated for the first and last variant of the aligned data sequences. Once all data has been aggregated for a position on the reference, this auxiliary aggregated data is used to correct the aggregated data of reference bases. A reference base is a base of an aligned data sequence that, after mapping on a position of the reference, has the same base value as the base of the reference on said position. Operations 1) and 2) will be described below in more detail.
It has been found that it is possible to generate a VCF file where each aligned data sequence is transferred once from the main memory 114 to the cache memory 118, the aligned data sequence is processed by the CPU to obtain variant call data with almost no interchange of intermediate data from the CPU cache 118 to the main memory 114, and finally the variant call data is transferred once from the CPU cache to the main memory. In this way, during processing of the aligned data sequence, the CPU 120 is minimally busy with the execution of wait cycles which are necessary when reading/writing data from/to the relatively “slow” main memory. Arrows with reference numerals 122 and 122’ indicate the transfer of aligned data sequences to the CPU cache and arrows with reference numerals 124 and 124’ indicate the transfer of variant call data from the CPU cache to the main memory.
Figure 4 illustrates how the aligned data sequences are read from the input file and processed by the CPU in cache, and how the result of the processing, i.e. the variant call data, is supplied to the main memory to be written to an output file, i.e. the VCF file. In Fig. 4 it is assumed that aggregated data has to be generated from position iN+1 on the reference data sequence. Furthermore, it is assumed in the present description that the first base of an aligned data sequence has an associated position on the reference data sequence and that the aligned data sequences are stored in the input file in order of the associated position on the reference data sequence of their first base. To aggregate all data from the input file for position iN+1, all aligned data sequences that have a base mapped on said position have to be processed. If the maximum length of an aligned data sequence in the input file is M+1 bases, then all aligned data sequences with their first base in the range [iN+1-M, iN+1] have to be read from the input file and processed. Therefore, the top of Fig. 4 indicates that reading from the input file starts at position iN+1-M. After reading the aligned data sequences with their first base in the range [iN+1-M, iN+36k-M], i.e. 36k positions of the reference, aggregated data and auxiliary aggregated data have been collected for positions on the reference in the range [iN+1, iN+36k]. Subsequently, the first 32k positions are further processed by performing a realignment process, an upgrade process and a filter process to obtain the variant call data that will be stored in a VCF file.
After processing the first 32k positions with aggregated data, the aligned data sequences with their first base in the range [iN+1+36k-M, iN+68k-M] are read from the input file and the data corresponding to these aligned data sequences is aggregated. The “incomplete” aggregated data for the positions in the range [iN+1+32k, iN+36k-M] now becomes complete. Subsequently, the second 32k positions, i.e. the range [iN+1+32k, iN+64k], are further processed. This routine of reading/aggregating 32k positions and further processing is repeated until the end of the range of positions for which variant call data has to be generated.
The size of 32k positions to be repetitively read from the input file and 36k possible positions for aggregated data in cache depends on the input data and the cache size available for a processor performing the method. The higher the number of bases/variants that are mapped on the same position, the more memory is needed to store the aggregated data. To avoid intermediate data temporarily leaving the cache to make room for other data, the size has to be selected such that the throughput time to process a predefined amount of data is minimal. This could be done empirically by performing test runs on a limited set of aligned sequences from the input file.
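The read/process routine described with reference to Fig. 4 can be sketched as a schedule of cycles, each pairing a range of read start positions with the range of positions that is further processed afterwards. The sketch is a hedged illustration: the name processing_cycles is hypothetical, "k" is taken as 1000, the 4k difference between 36k and 32k is exposed as a lookahead parameter, and the check that M does not exceed this lookahead is an assumption added so that every position is complete before it is further processed.

```python
def processing_cycles(first, last, M, step=32_000, lookahead=4_000):
    """Return a list of ((read_lo, read_hi), (final_lo, final_hi)) cycles.

    first, last: first and last reference position to generate data for.
    M: maximum read length minus one.
    Reads whose first base lies in [read_lo, read_hi] are aggregated,
    then positions [final_lo, final_hi] are further processed.
    """
    if M > lookahead:
        raise ValueError("sketch assumes M <= lookahead")
    cycles = []
    read_lo = max(first - M, 1)
    final_lo = first
    while final_lo <= last:
        final_hi = min(final_lo + step - 1, last)
        read_hi = min(final_hi + lookahead - M, last)
        cycles.append(((read_lo, read_hi), (final_lo, final_hi)))
        read_lo = read_hi + 1
        final_lo = final_hi + 1
    return cycles

# Hypothetical numbers: N = 100000, i = 1, M = 100.
cycles = processing_cycles(first=100_001, last=200_000, M=100)
```

For the first cycle this gives the read range [iN+1-M, iN+36k-M] and the processed range [iN+1, iN+32k]; the second cycle continues with [iN+1+36k-M, iN+68k-M] and [iN+1+32k, iN+64k], matching the ranges named in the text.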
It might also be possible that the size is automatically determined by a control process which measures the throughput time for a cycle of reading/aggregating and further processing the aggregated data. If the amount of processed aligned data sequences per time period drops below an adaptive threshold, the size will be decreased. Such a drop is an indication that the processor has to insert wait cycles for retrieving intermediate data from main memory. Consequently, the number of positions to be processed each cycle has to be decreased and the adaptive threshold has to be adapted accordingly. If the amount of processed aligned data sequences per time period is above the adaptive threshold for a predetermined period of time, the control process increases the size and measures the amount of processed aligned data sequences per time period again. If the amount of processed aligned data sequences per time period increases, the new size is used and the adaptive threshold is changed accordingly. If not, the old size and threshold will be used.
Generally, the larger the available cache size, the larger the amount of aligned data sequences to be read and processed each cycle. The more bases/variants are mapped on a position, the more memory will be used to store the aggregated data, and the smaller the amount of aligned data sequences that can be read and processed each cycle without unnecessary wait cycles at the CPU due to transferring data between cache and main memory.
The mechanism described above improves the computing efficiency of a CPU performing the method according to the present application and thus the throughput time to generate a VCF file from an input file comprising aligned data sequences when only one CPU is used. The throughput time could be further reduced by using more than one CPU. This is illustrated by Fig. 2.
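One possible shape of such a control process is sketched below. It is an illustration under several assumptions that the text leaves open: the class name BlockSizeController is hypothetical, the "predetermined period of time" is counted in cycles (patience), the size changes by a fixed step, and the adaptive threshold is simply set to the most recently measured throughput when it is adapted.

```python
class BlockSizeController:
    """Adapts the number of positions processed per cycle from measured
    throughput (aligned data sequences per time period)."""

    def __init__(self, size, step, patience, threshold):
        self.size = size            # positions processed per cycle
        self.step = step            # assumed fixed increment/decrement
        self.patience = patience    # assumed: period counted in cycles
        self.threshold = threshold  # adaptive threshold
        self.good_cycles = 0
        self.trial = None           # (old_size, old_threshold, baseline)

    def update(self, throughput):
        """Feed the throughput of the last cycle; return the next size."""
        if self.trial is not None:
            old_size, old_threshold, baseline = self.trial
            self.trial = None
            if throughput > baseline:
                self.threshold = throughput  # keep new size (assumed rule)
            else:
                self.size, self.threshold = old_size, old_threshold
            return self.size
        if throughput < self.threshold:
            # Drop below threshold: likely wait cycles, so shrink the size.
            self.size = max(self.step, self.size - self.step)
            self.threshold = throughput      # assumed adaptation rule
            self.good_cycles = 0
        else:
            self.good_cycles += 1
            if self.good_cycles >= self.patience:
                # Try a larger size and judge it on the next measurement.
                self.trial = (self.size, self.threshold, throughput)
                self.size += self.step
                self.good_cycles = 0
        return self.size

ctrl = BlockSizeController(size=32_000, step=4_000, patience=2, threshold=100.0)
sizes = [ctrl.update(t) for t in (120.0, 125.0, 110.0, 90.0)]
```

In this run the controller tries 36000 positions after two good cycles, reverts to 32000 because the throughput did not increase, and then shrinks to 28000 when the throughput falls below the threshold.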
Fig. 2 shows a block diagram of a computer implemented system 200 performing the method for generating a variant call file from an input file comprising aligned data sequences. The input file is stored on a data storage device 202, for example a hard disc. The system 200 further comprises P processing units or workers 204,204’, a controller 208, a data collector 206 and a data storage device 210 for storing the generated VCF file. Each processing unit 204 comprises a CPU and cache memory as shown in Fig. 1. The concept is that every processing unit each time generates the variant call data for a part of the positions of the reference data sequence. The reference data sequence is segmented in non-overlapping segments of N-positions. The controller 208 distributes the processing of the non-overlapping segments over the P processing units 204, 204’. The non-overlapping segments are cyclically assigned to the P processing units. A processing unit 204 with index number i processes successively segments with index number i, i + P, i + 2P, i + 3P, ... , i+nP. A processing unit 204, 204’ performs a combined process. A combined process retrieves from the data storage device 202 with the input file the aligned data sequences which have a first base with a position number in a range which starts M positions before the first position of a segment and ends at least with the last position of the non-overlapping segment, processes 212-220 the aligned data sequences and at the end stores the generated variant call data in a corresponding buffer 222. The system comprises thus for each processing unit a buffer. The controller 208 further informs the collector 206 where to find the buffer of the processing units and the size of a segment. Processing unit with index number i stores the variant call data corresponding to a non-overlapping segment in a buffer with index number i. 
The collector 206 cyclically retrieves variant call data corresponding to a non-overlapping segment from the buffers with index number 1 to P and adds the variant call data corresponding to the non-overlapping segment to the VCF file. The collector 206 starts retrieving variant call data from the buffer of the next processing unit 204 after the current processing unit has completed the generation of variant call data for the positions associated with a segment.
The combined process is the combination of the following processes: - a reading process 212 reads the aligned data sequences from the data storage device; - an aggregating process 214 generates aggregated data linked to positions of the reference data sequence and generates auxiliary aggregated data; - a realignment process 216 processes the auxiliary aggregated data and corrects the aggregated data accordingly; - an upgrade process 218 improves the signal quality of variants of the type “insert” or “update” which have a length of at least 2 bases; - a filter process 220 removes noise from the variant call data; and - a buffering process 222 stores the generated variant call data in a buffer associated with the processing unit 204.
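The cyclic retrieval from the P buffers can be sketched as a simple round-robin over queues. This is a simplified illustration: the function name collect_in_order is hypothetical, each buffer is modelled as a deque holding the results of one processing unit in production order, and the sketch assumes all results are already available, whereas a real collector would wait for the next processing unit to finish.

```python
from collections import deque

def collect_in_order(buffers):
    """Yield variant call data in segment order by visiting the buffers
    cyclically (buffer 1, 2, ..., P, 1, 2, ...).

    buffers: list of P deques; buffers[i] holds the results of the
    processing unit with index number i + 1 in production order.
    Assumption: all results are already present in the buffers.
    """
    if not buffers:
        return
    i = 0
    while buffers[i]:
        yield buffers[i].popleft()
        i = (i + 1) % len(buffers)

# Hypothetical results of 3 processing units for 5 segments.
buffers = [
    deque(["seg1", "seg4"]),
    deque(["seg2", "seg5"]),
    deque(["seg3"]),
]
vcf_order = list(collect_in_order(buffers))
```

Because segments are assigned cyclically, visiting the buffers in index order restores the position order of the reference without any sorting step.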
Fig. 3 illustrates which aligned data sequences have to be read from the input file to generate the variant call data for the positions on the reference data sequence corresponding to a segment.
To obtain the aggregated data for a particular position on the reference sequence, e.g. the genome, all aligned data sequences in the input file having a base mapped on that position have to be processed and thus read from the input file. In the input file, which is indexed for random access, all aligned data sequences are ordered by the position number on the reference sequence of the first base of the sequence which is mapped on the reference sequence. Assume that the maximum length of an aligned data sequence is M+1 bases. In that case, to generate the aggregated data for position X on the reference sequence, the aligned data sequences whose first base is mapped on a position in the range [X-M, X] have to be read from the input file and processed.
Segment SEG_2 corresponds to the positions [N+1, 2N] on the reference sequence. To generate the VCF data for these positions, the aligned data sequences having an index in the range [N+1-M,2N] have to be read from the input file.
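The read range for a segment follows directly from the two intervals above. The sketch below is illustrative only; the function name is hypothetical, and clamping the lower bound to position 1 is an assumption made here for the first segment, based on the 1-based positions used in the figures.

```python
def read_window(segment, n, m):
    """Range of first-base positions to read for a 1-based segment index.

    Segment k covers positions [(k - 1) * n + 1, k * n]; reads starting up to
    m positions earlier can still overlap the segment (maximum length m + 1).
    """
    first = (segment - 1) * n + 1
    last = segment * n
    return max(1, first - m), last
```

For example, with N = 1000 and M = 100, SEG_2 covers positions [1001, 2000] and the reads indexed in [901, 2000] would be loaded.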
By segmenting the reference sequence and using more than one CPU, a part of the aligned data sequences is read twice from the input file, namely the aligned data sequences whose first base has a position lower than the first position of the segment. In the example above, these are the aligned data sequences indexed in the M positions before the first position of the segment.
By reading each time only the aligned data sequences in a limited range of positions, as disclosed with reference to Fig. 4, the input data (i.e. the aligned data sequences), the intermediate results during processing of the input data and the final data for the variant call format (VCF) file are not swapped between the cache of the CPU and the main memory. The processing can therefore be very fast, with minimal waste of CPU cycles due to reading data from main memory into cache and writing data from cache to main memory. Another advantage is that, with the correct size of the range of positions from which aligned data sequences have to be read, the system bus between the main memory and the CPU cache is only used to load the aligned data sequences at the beginning of the processing of that range and to move the result of the processing, i.e. the variant call data, from cache to main memory when the cache has to be used for new data.
As described above, a processing unit 204 performs a number of processes on the aligned data sequences to obtain the variant call data for storage in a VCF file. The aggregating process 214 comprises a first aggregating process and a second aggregating process. The first aggregating process generates aggregated data linked to positions of the reference data sequence from the aligned data sequences. This is a commonly known process wherein, for each position of the reference data sequence, the number of mapped bases having the same base value as the reference base at that position is counted, the so-called ref base count. Furthermore, for each variant the number of occurrences is counted. The second aggregating process generates auxiliary aggregated data for variants linked to positions of the reference data sequence. The auxiliary aggregated data is indicative of sequence parts of aligned data sequences whose bases perfectly match the reference data sequence from an end of the aligned data sequence to a variant. This second aggregating process and the use of the auxiliary aggregated data will be explained in further detail with reference to Figs. 5-8.
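The first aggregating process can be illustrated with a short sketch. The representation of a read as a list of `(position, kind, value)` events, with `kind` being `"ref"` for a base equal to the reference, is an assumption made here for illustration; it is not the format of the input file.

```python
from collections import Counter


def aggregate(reads):
    """Count the ref base count per position and the occurrences of each variant.

    Each read is a list of (position, kind, value) events, where kind is
    "ref" for a base matching the reference, or a variant type such as
    "insert", "update" or "delete" (hypothetical representation).
    """
    ref_base_count = Counter()
    variant_count = Counter()
    for read in reads:
        for position, kind, value in read:
            if kind == "ref":
                ref_base_count[position] += 1
            else:
                variant_count[(position, kind, value)] += 1
    return ref_base_count, variant_count
```

Each read is visited once, and both tables are keyed by reference position, so their size depends on the number of positions and distinct variants rather than on the number of reads.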
Fig. 5 shows two aligned data sequences read1 and read2 and the corresponding mapping on the reference data sequence. Positions 1 - 25, ref pos, and the corresponding base values, ref value, of the reference data sequence are shown. The base value at position 1 is “C”. The base value at position 24 is “G”. Fig. 5 shows only a part of the bases of each read. The arrows show how a base of the read is mapped on the reference. Read1 comprises, between the base that is mapped on position 11 and the base that is mapped on position 12, a base with the value “C”. This base is a variant of the type “insert”. Read2 has the same insert between the bases that have been mapped on positions 11 and 12 of the reference.
Fig. 6 shows three other aligned data sequences read3, read4 and read5 which comprise bases that are mapped on the same positions as in Fig. 5. However, in these cases the mapping software has detected a variant of the type “update” at position 11 instead of a variant of the type “insert”. Because the bases before the base with value “C” carry no characteristic information (they all have the same base value “A”), the mapping software has decided that the base with value “C” is a variant of the type “update”. However, the base with value “C” could also be a variant of the type “insert”. Thus the mapping software is not always able to make the correct mapping. Fig. 7 shows how the mapping of read3 in Fig. 6 changes if the variant of the type “update” is changed into a variant of the type “insert”. Whereas in Fig. 6 the first base with value “A” is mapped on position 6 of the reference, in Fig. 7 the first base is mapped on position 7 of the reference.
Given the reads in Figs. 5 and 6, the first aggregating process will count two variants of the type “insert” between positions 11 and 12 and three variants of the type “update” at position 11. The second aggregating process generates auxiliary aggregated data for variants linked to positions of the reference data sequence. The auxiliary aggregated data is indicative of sequence parts of aligned data sequences whose bases perfectly match the reference data sequence from an end of the aligned data sequence to a variant. As the base values are known when the reference is known, the base values of the sequence parts do not have to be stored in the auxiliary data. Only the length of the sequence part before the first variant has to be stored for each read which comprises a variant of the type “update” at said position.
Fig. 8 shows three possible embodiments to store the auxiliary aggregated data. In Fig. 8a, a table is generated wherein for each position of the reference the number of bases before the first variant that have been mapped on that position is counted. Given the three reads in Fig. 6, one base of a sequence part before the “update” is mapped on position 6, two bases are mapped on position 7, etc. In Fig. 8b, another table is given. The first column corresponds to the distance, in number of bases, between the variant and a base of the sequence part preceding the variant. For each distance the number of bases is counted. Counted are the bases having a distance of 1, 2, 3, 4 and 5 from the variant. It will be clear that the counted value does not increase as the distance increases. In Fig. 8c the auxiliary aggregated data is a table which stores the length of the sequence part before the first variant. The first column corresponds to the length and the second column corresponds to the number of sequence parts prior to the variant having said length. Which of the embodiments gives the best performance depends on the implementation.
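The three storage forms can be derived from the same input, namely the length of the perfectly matching sequence part before the variant for each read. The sketch below is illustrative; the function name is hypothetical, and the prefix lengths 5, 4 and 3 used in the note after it are inferred from the counts described for positions 6, 7 and 8, since the figures are not reproduced here.

```python
from collections import Counter


def auxiliary_tables(variant_pos, prefix_lengths):
    """Build the three table forms described for Fig. 8a, 8b and 8c.

    prefix_lengths holds, per read, the number of matching bases mapped
    directly before the variant at variant_pos.
    """
    per_position = Counter()               # Fig. 8a: reference position -> bases
    per_distance = Counter()               # Fig. 8b: distance to variant -> bases
    per_length = Counter(prefix_lengths)   # Fig. 8c: prefix length -> reads
    for length in prefix_lengths:
        for distance in range(1, length + 1):
            per_position[variant_pos - distance] += 1
            per_distance[distance] += 1
    return per_position, per_distance, per_length
```

For a variant at position 11 and prefix lengths 5, 4 and 3, the per-position table would hold one base at position 6, two at position 7 and three at positions 8 to 10. The length table (Fig. 8c form) needs only one entry per distinct length.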
After the aggregation process, a realignment process will start. In the realignment process, the occurrence of possible incorrect mappings is checked. The process detects in the aggregated data a position with a variant of a first type and a variant of a second type. If the variant of the first type does not comprise aggregated data and the variant of the second type comprises auxiliary aggregated data, the process determines whether the variant of the second type could be changed into a variant of the first type. If this is the case, the auxiliary aggregated data is used to correct the aggregated data. In response to detection of such a possible change from a variant of the second type into a variant of the first type, a) the aggregated data corresponding to the variant of the second type is added to the aggregated data corresponding to the variant of the first type, b) the aggregated data is updated using the auxiliary aggregated data of the variant of the second type, and c) the aggregated data corresponding to the variant of the second type is removed. Fig. 9 shows the aggregated data of the reads in Figs. 5 and 6 before and after the realignment process. Before the realignment process it shows, at position 11, 2 counted bases with value “A”, 3 variants of the type “update” and 2 variants of the type “insert”. After the realignment process, there are 5 bases with value “A” mapped on position 11 and 5 variants of the type “insert”. Furthermore, due to the shifting by one position of the sequence parts before the variant, the counted values at positions 6, 7 and 8 are each reduced by one base.
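Steps a) to c) can be sketched for the update-to-insert case of Figs. 5, 6 and 9. This is a simplified illustration, not the claimed implementation: the function name and dictionary layout are hypothetical, the check that the change is actually possible is assumed to have been made already, and the auxiliary data is assumed to be stored as a length table (Fig. 8c form).

```python
def realign_update_to_insert(ref_count, update_count, insert_count, pos, prefix_lengths):
    """Convert the "update" variants at pos into "insert" variants after pos.

    prefix_lengths maps the length of the matching part before the variant
    to the number of reads with that length. All tables are changed in place.
    """
    moved_reads = sum(prefix_lengths.values())
    # a) add the counts of the second type ("update") to the first type ("insert")
    insert_count[pos] = insert_count.get(pos, 0) + update_count.get(pos, 0)
    # b) each prefix shifts one position: it leaves its first position ...
    for length, reads in prefix_lengths.items():
        ref_count[pos - length] -= reads
    # ... and its last base now matches the reference at pos
    ref_count[pos] = ref_count.get(pos, 0) + moved_reads
    # c) remove the aggregated data of the second type
    update_count.pop(pos, None)
```

The reads themselves are not needed for the correction; only the aggregated tables are touched.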
The realignment process could be used for both the first variant of an aligned data sequence and the last variant of an aligned data sequence. For both, a specific table with aggregated data is generated which is indicative of sequence parts of aligned data sequences whose bases perfectly match the reference data sequence from an end of the aligned data sequence to a variant.
Fig. 10 shows another example of possible incorrect mapping. In this embodiment, read 1001 has a variant of the type “insert” between positions 11 and 12. The mapping of read 1002 comprises a variant of the type “delete” between the bases mapped on positions 10 and 12 of the reference. However, due to the base values of the sequence part before the “delete”, it is possible to amend the mapping in the aggregated data. Before realignment, the first base was mapped on position 6 of the reference. After realignment, the first base is mapped on position 8 of the reference.
Fig. 11 shows yet another example of possible incorrect mapping. In this embodiment, read 1101 comprises a variant of the type “delete” between positions 10 and 12. The mapping of read 1102 comprises a variant of the type “update” of the base mapped on position 11 of the reference. However, due to the base values of the sequence part before the “update”, it is possible to amend the mapping in the aggregated data. Before realignment, the first base was mapped on position 6 of the reference. After realignment, the first base is mapped on position 5 of the reference.
Fig. 12 shows a further example of possible incorrect mapping. In this embodiment, read 1201 comprises a variant of the type “insert”, i.e. one base with value “A” between the bases at positions 19 and 20 after mapping on the reference. The mapping of read 1202 seems to be a perfect match. However, as the first fourteen bases of read 1202 have the value “A”, these bases could also be mapped on the reference such that the last base of the sequence of bases with value “A” is mapped as a variant of the type “insert”. By generating, for each base on a particular position on the reference, auxiliary aggregated data indicating the length of the sequence of bases of the reads, up to an end of the read, that perfectly matches before/after the base that perfectly matches the reference value of said particular position, it is possible to correct the aggregated data without reading the reads a second time. A data structure as shown in Fig. 8 could be used to record the auxiliary data. In the case of read 1202, for position 19 it will be recorded in the data structure that there was a read with thirteen matching bases before the base with value “A” that was mapped on position 19. After processing all reads having a variant mapped on said position and generating the corresponding aggregated data, the program could determine from the aggregated quality data that it is very likely that a base with value “A” is inserted between positions 19 and 20 on the reference. The auxiliary aggregated data due to read 1202 could then be used in a realignment process to decrease the number of bases mapped on position 6 by 1 and to increase the number of variants of the type “insert” at said position by 1. This correction increases the importance of the “insert” and the likelihood that the insert will be recognized as a permutation compared to the reference data in a variant call viewer. In the context of the present application, position 6 is regarded as a nearby position of positions 19 and 20 of the reference.
It might be clear to the person skilled in the art that the correction described above could also be used when an insert is a sequence of two or more elements with the same value.
Fig. 13 shows another example of possible incorrect mapping, comparable with the incorrect mapping in Fig. 12. This example illustrates that a repetition of two different base values “C” and “A” at the beginning or end of a sequence could result in incorrect mapping despite the fact that all bases of the read perfectly match the reference data. Read 1302 differs from the reference 1301 in that it does not comprise the bases at positions 18 and 19 of the reference. However, as the values at positions 18 and 19 are repeated one or more times before position 17, reads which start with this repetition of base values could be corrected, if necessary, in a similar way as described in the example above. At the bottom of Fig. 13, the positions of the bases of read 1302 after correction are shown.
The actions to perform the realignment process by using auxiliary aggregated data are based on reducing the memory footprint needed to store the auxiliary aggregated data which is necessary to perform the realignment. As a result, the realignment process can be executed efficiently on a CPU, as all data needed to perform the realignment can be stored in cache at the same time. In this way, ineffective wait cycles of the CPU can be reduced significantly.
It should be noted that the combination of the second aggregation process and the realignment process could also be used with existing variant call file generation tools. In that case, first a VCF file is generated by existing tools with “raw” aggregated variant call data, where “raw” means that no realignment is performed on the aggregated data. Subsequently, the input file with aligned data sequences is
In the examples of realignment given above, the part of the aligned data sequences before the first variant is misaligned. The values of the bases of these parts perfectly match, i.e. are similar to, the values of the bases of the reference data sequence. It will be clear that the method could also be applied to the part of the aligned data sequences following the last variant, wherein the base values of the parts perfectly match the reference data sequence. For both situations auxiliary aggregated data should be generated. Further, for both situations, the auxiliary aggregated data is indicative of sequence parts of aligned data sequences whose bases perfectly match the reference data sequence from an end of the aligned data sequence to a variant, wherein the variant could be the first and/or the last variant in an aligned data sequence.
As described above, the combined process includes an upgrade process 218 which improves the signal quality of variants of the type “insert” or “update” which have a length of at least 2 bases. This upgrade process will be described with reference to Figs. 14-17.
It is commonly known that a sequencer introduces errors in the generated data sequences, i.e. reads. For example, the probability that a base has an incorrect base value (A, C, G, T) is 2%. All these incorrect base values will result in a variant. These errors will also occur in variants of the type “insert” or “update”, independent of the number of bases, i.e. the length, of the variant mapped at a particular position on the reference data sequence. Fig. 14 discloses an example of a table with 25 variants of the type “insert”, all mapped at the same position on the reference data sequence. Each variant has a length of 20 bases. We assume that the error rate of the sequencer is 2%. Accordingly, 20 bases in the table have obtained an incorrect value. The bases with an incorrect value are bold and underlined. The first action of the upgrade process is aggregating information of variants of the type “insert” at a particular position on the reference sequence to obtain aggregated base value information for each position of the insert sequence and a total value corresponding to the number of insert sequences at said particular position on the reference sequence. The aggregated base value information comprises a count value for each possible base value for each position of the insert sequences. Fig. 15 shows a table of aggregated base value information for the sequences shown in Fig. 14. At position 1, 25 bases are counted with a value “A”; at position 2, 24 bases are counted with a value “A” and 1 base with a value “T”; and so on. The total value corresponding to the number of insert sequences at each position is 25; this corresponds to the sum of all the count values #A, #C, #G and #T at each position.
After generating the aggregated base value information, an additional minimum count value is determined based on the total value and a user defined error ratio. The user defined error ratio defines the percentage of bases in a stack of bases that could be erroneous. In the current example we use an error ratio of 5%. Given a stack of 25 variants of the type “insert”, the additional minimum count value is 25 × 0.05 = 1.25. Thus, if a count value is below the additional minimum count value, the value of the corresponding bases could be incorrect and may be changed into another base value. Furthermore, for each position a most common base value is determined. This will be the base value having the most occurrences at that position of the insert.
Finally, the aggregated base value information is updated if the count value corresponding to a specific base value, different from the most common base value at the position of the insert sequence, is smaller than the minimum count value. If that is the case, the corresponding bases of the insert sequences at said position having the specific base value will receive the most common base value, the count value corresponding to the specific base value will be set to zero and the count value of the most common base value is increased by the count value that corresponded to the specific base value at said position.
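The three actions of the upgrade process (aggregating per-position base counts, deriving the minimum count value, and replacing rare base values by the most common one) can be sketched as follows. This is a minimal illustration under assumptions: the function name is hypothetical, all insert sequences at the position are assumed to have the same length, and ties for the most common base value are resolved arbitrarily.

```python
from collections import Counter


def upgrade_inserts(inserts, error_ratio):
    """Replace rare base values in a stack of insert sequences.

    A base value whose count at a position is below total * error_ratio is
    replaced by the most common base value at that position.
    """
    total = len(inserts)
    min_count = total * error_ratio
    length = len(inserts[0])
    # aggregated base value information: one counter per insert position
    columns = [Counter(seq[i] for seq in inserts) for i in range(length)]
    upgraded = []
    for seq in inserts:
        bases = []
        for i, base in enumerate(seq):
            most_common = columns[i].most_common(1)[0][0]
            if base != most_common and columns[i][base] < min_count:
                base = most_common
            bases.append(base)
        upgraded.append("".join(bases))
    return upgraded
```

With 25 inserts and an error ratio of 5%, a base value seen once at a position (1 < 1.25) would be replaced, whereas a base value seen twice would be kept as a possibly genuine difference.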
Fig. 16 shows a table with the aggregated data corresponding to the variants of the type “insert” shown in Fig. 14 that would be stored in a VCF file if the upgrade process were not performed. This would mean that at the corresponding position on the reference data sequence, 11 different “insert” sequences would be found. The sequence with the most occurrences has the count value 15. Fig. 17 shows a table with the aggregated data that will be stored in the VCF file in case the upgrade process is performed on the 25 “insert” sequences shown in Fig. 14. Now only one “insert” sequence with a count of 25 will be stored in the VCF file. It will be clear that when reviewing the VCF file, 25 identical variants at a position are easier to detect as a variant than 15 identical variants accompanied by 10 differing variants.
In the examples given above, auxiliary aggregated data is generated with respect to base values and the position of a variant with respect to the beginning and end of the read. The auxiliary aggregated data with respect to the position of the first/last variant of a read with respect to the beginning and end of the read is the first type of auxiliary aggregated data and is generated by the second aggregating process. Furthermore, auxiliary aggregated data is generated for variants of the type “insert”. This auxiliary aggregated data is the second type of auxiliary aggregated data, which is generated in the upgrade process. The method according to the present invention also generates auxiliary aggregated data about the quality of a variant. This third type of auxiliary aggregated data is used to filter out irrelevant permutations before generating the VCF file. How the third type of auxiliary data could be used to filter the aggregated data to obtain a VCF file with fewer irrelevant permutations is known to the person skilled in the art and will therefore not be described in further detail.
For each variant the following third type of auxiliary aggregated data is generated: 1) the total number of times (FREQ_ALL) a particular variant at a position on the reference occurs in a set of mapped reads; 2) the number of times (FREQ_HIGH) said variant has a quality above a predefined quality threshold value; and 3) an aggregated quality value (Q_VARIANT) for said variant. It will be clear that the number of times (FREQ_LOW) a variant has a quality value below the predefined quality value could be derived by the subtraction FREQ_LOW = FREQ_ALL - FREQ_HIGH.
How to derive the aggregated quality value will be described with reference to Fig. 18. Fig. 18 shows a table for a variant that is mapped on a particular position on the reference. A “variant” can be any constellation of one or more bases from a sequence that are mapped on/linked to a specific position on the reference. This includes a base of the sequence having the same value as the base at said position of the reference, a SNP, an insert and a mutation, i.e. any other base value different from the value of the base at said position of the reference. The first column comprises a count number identifying the variants with the same value from the reads mapped on said particular position. The second column comprises the quality score (Q_SEQ) given to a variant during sequencing. The third column comprises the mapping quality (Q_MAP) given to the variant during the mapping process. Both quality values are present in the input file with mapped reads. A quality value such as Q20 identifies the probability that the given value/mapping is incorrect. A value Qxx corresponds to a probability of 1/10^(xx/10). Thus Q20 = 0.01, Q30 = 0.001 and Q40 = 0.0001.
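The relation between a quality value Qxx and the corresponding error probability can be checked with a one-line conversion. The function name is chosen here for illustration.

```python
def error_probability(q):
    """Probability that an observation with quality value Q<q> is incorrect."""
    return 10 ** (-q / 10)
```

For instance, `error_probability(20)` evaluates to approximately 0.01 and `error_probability(40)` to approximately 0.0001.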
The values Q_SEQ and Q_MAP of a variant of a read are used to determine a new quality value Q_CROSS for said variant of the read. The new quality value Q_CROSS is the lower of the qualities Q_SEQ and Q_MAP. The rationale of this selection criterion is the following. Q_SEQ is an indication of the likelihood that the value of the variant is incorrect and Q_MAP is an indication of the likelihood that the mapping of the variant is incorrect. So the likelihood that both the value and the mapping of the variant are correct is (1-Q_SEQ)*(1-Q_MAP), and consequently the likelihood that at least one of the value and the mapping is incorrect is 1-(1-Q_SEQ)*(1-Q_MAP). The latter can be approximated by taking the worse of the two quality values Q_SEQ and Q_MAP, i.e. the one representing the lowest quality. Thus, the observation with the lowest quality, which corresponds to the observation with the highest likelihood of being incorrect, determines the new quality Q_CROSS for said variant of the read. Thus a variant with a Q_MAP of Q60 and a Q_SEQ of Q10 will have a Q_CROSS of Q10. This is illustrated in the table of Fig. 18 by the last variant, with index number 5. It should be noted that there is no relation or correlation between Q_MAP and Q_SEQ, as they are determined in two totally different processes. Therefore, it is not obvious to determine a new quality value Q_CROSS for a variant in the way described above.
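The selection of Q_CROSS and the exact combined error likelihood it stands in for can be compared numerically. In this sketch the quality values are assumed to be given as the numbers xx of Qxx, and the function names are hypothetical.

```python
def q_cross(q_seq, q_map):
    """Lowest of the two quality values, as numbers xx of Qxx."""
    return min(q_seq, q_map)


def combined_error(q_seq, q_map):
    """Likelihood that the value and/or the mapping is incorrect."""
    p_seq = 10 ** (-q_seq / 10)
    p_map = 10 ** (-q_map / 10)
    return 1 - (1 - p_seq) * (1 - p_map)
```

For Q_SEQ = Q10 and Q_MAP = Q60, the exact combined likelihood is about 0.100001, which is practically the error likelihood 0.1 of Q10 alone. The approximation is weakest when both qualities are equal, where the exact likelihood is almost twice that of either one.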
In a subsequent action, Q_CROSS is used to determine an aggregated quality value Q_VARIANT for all variants of a particular type mapped at a particular position of the reference. At the end of processing all variants of a particular type mapped at a particular position of the reference, Q_VARIANT will correspond to the highest value of Q_CROSS, i.e. the Q_CROSS value of the variant with the highest likelihood that both the base value determined by the sequencing process and the mapping on the reference are correct. The rationale behind this concept of determining a quality value for all variants of a particular type mapped at a particular position of the reference is that first the least reliable observation is used to assign a value Q_CROSS to a variant, and subsequently the variant with the most reliable observation determines the quality value for all variants of said type.
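The third type of auxiliary aggregated data for one variant at one position can be accumulated in a single pass over the observations. The sketch below is illustrative only: the function name and the returned dictionary are hypothetical, quality values are given as the numbers xx of Qxx, and comparing Q_CROSS (rather than Q_SEQ or Q_MAP separately) against the threshold for FREQ_HIGH is an assumption, since the description does not state which quality is compared.

```python
def aggregate_quality(observations, threshold):
    """Accumulate FREQ_ALL, FREQ_HIGH, FREQ_LOW and Q_VARIANT.

    observations is an iterable of (q_seq, q_map) pairs for equal variants
    mapped at the same position of the reference.
    """
    freq_all = 0
    freq_high = 0
    q_variant = None
    for q_seq, q_map in observations:
        q_cross = min(q_seq, q_map)      # least reliable of the two observations
        freq_all += 1
        if q_cross > threshold:          # assumption: threshold applies to Q_CROSS
            freq_high += 1
        if q_variant is None or q_cross > q_variant:
            q_variant = q_cross          # most reliable variant determines Q_VARIANT
    return {
        "FREQ_ALL": freq_all,
        "FREQ_HIGH": freq_high,
        "FREQ_LOW": freq_all - freq_high,
        "Q_VARIANT": q_variant,
    }
```

Only four numbers per variant are kept, so the data can be updated while the read is still in cache.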
By generating the third type of auxiliary aggregated data simultaneously with the first and second type of auxiliary aggregated data, each aligned data sequence has to be read only once from the main memory into the cache memory to generate the aggregated data. This mechanism significantly reduces the processing power needed to generate a VCF file. It will be clear that the method to generate the third type of aggregated data could also be applied in existing programs that generate a VCF file from files with aligned data sequences. In that case, the method provides a way to determine a quality value for each type of variant in a VCF file, which quality value could subsequently be used to filter out variants with a low reliability before analysing or viewing data in the VCF file.
Referring to Fig. 19, there is illustrated a block diagram of exemplary components of a computer implemented system 1400. The computer implemented system 1400 can be any type of computer having sufficient memory and computing resources to perform the described method. As illustrated in Fig. 19, the computer implemented system 1400 comprises a processor 1410, data storage 1420, an Input/Output unit 1430 and a database 1440. The data storage 1420, which could be any suitable memory to store data, comprises instructions that, when executed by the processor 1410, cause the computer implemented system 1400 to perform the actions corresponding to any of the methods described in the present application. The database could be used to store the reference index data structure, the data patterns to be matched and the results of the methods.
The method could be embodied as a computer program product comprising instructions that can be loaded by a computer arrangement, causing said computer arrangement to perform any of the methods described above. The computer program product could be stored on a computer readable medium.
While the invention has been described in terms of several embodiments, it is contemplated that alternatives, modifications, permutations and equivalents thereof will become apparent to those skilled in the art upon reading the specification and upon study of the drawings. The invention is not limited to the illustrated embodiments. Changes can be made without departing from the scope and spirit of the appended claims.

Claims (15)

1. A computer-implemented method (200) for generating a variant call file from an input file with aligned data sequences that are aligned with a reference data sequence, the method comprising: - a first aggregating process that generates, from the aligned data sequences, aggregated data linked to positions of the reference data sequence; and - storing (206) the aggregated data in a variant call file, characterized in that the method further comprises: - a second aggregating process that generates auxiliary aggregated data for variants linked to positions of the reference data sequence, the auxiliary aggregated data being indicative of sequence parts of the aligned data sequences whose bases perfectly match the reference data sequence from an end of the aligned data sequence to a variant; and - a realignment process (216) comprising: - detecting in the aggregated data a position with a variant of a first type and a variant of a second type, wherein the variant of the second type comprises auxiliary aggregated data; and - in response to the detection, a) adding the aggregated data corresponding to the variant of the second type to the aggregated data corresponding to the variant of the first type, b) updating the aggregated data of nearby positions using the auxiliary aggregated data of the variant of the second type, and c) removing the aggregated data corresponding to the variant of the second type.
2. The method according to claim 1, wherein the second aggregating process generates auxiliary aggregated data of a first type which is indicative of the sequence parts of the aligned data sequences preceding the variants mapped on the same position of the reference data sequence.
3. The method according to claim 1 or 2, wherein the second aggregating process generates auxiliary aggregated data of a second type which is indicative of the sequence parts of the aligned data sequences following the variants mapped on the same position of the reference data sequence.
4. The method according to any one of claims 1-3, wherein the second aggregating process generates auxiliary aggregated data of a third type indicating an aggregated quality value for equal variants mapped on a position of the reference data sequence and/or count information indicating that the quality of the same variants is above and below a predefined quality value.
5. The method according to any one of claims 1-4, wherein the variant of the first type and the variant of the second type are at least one combination from a group comprising the following combinations: “update”-“insert”; “update”-“delete”; “insert”-“update”; “delete”-“update”; “insert”-“delete”; “delete”-“insert”; “insert”-“reference” and “delete”-“reference”.
6. The method according to any one of claims 1-5, wherein the first aggregating process and the second aggregating process are merged into one process (214) to generate the aggregated data and the auxiliary aggregated data simultaneously.
7. The method according to any one of claims 1-6, wherein the method further comprises an upgrade process comprising: - aggregating information of variants of the type “insert” at a particular position on the reference sequence to obtain aggregated base value information for each position of the variants of the type “insert” and a total value corresponding to the number of variants of the type “insert” at said particular position on the reference sequence, the aggregated base value information comprising a count value for each possible base value for each position of the variants of the type “insert”; - determining an additional minimum count value based on the total
value and a user-defined error ratio; - determining a most common base value at a position of the variants of the "insert" type; and - updating the first aggregated base value information when the count value associated with a specific base value different from the most common base value at the position of the "insert" variants is less than the minimum count value by removing the specific base value and increasing the count value of the most common base value by the count value associated with the specific base value. 8. De werkwijze volgens één van de conclusies 1 - 7, waarbij de werkwijze verder omvat een opwaardering proces omvattende: - het aggregeren van informatie van varianten van het type "update" op een bepaalde positie op de referentiesequentie om geaggregeerde base-waarde informatie voor elke positie van de varianten van het type "update" en een totale waarde die overeenkomt met het aantal varianten van het type "update" op genoemde bepaalde positie op de referentiesequentie te verkrijgen, de geaggregeerde base-waarde informatie omvattende een telwaarde voor elke mogelijke base-waarde voor elke positie van de varianten van het type "update"; - het bepalen van een extra minimum telwaarde gebaseerd op de totale waarde en een gebruiker-gedefinieerde foutverhouding; - het bepalen van een meest voorkomende base-waarde op een positie van de varianten van het type "update"; en, - het bijwerken van de eerste geaggregeerde base-waarde informatie, wanneer de telwaarde die behoort bij een specifieke base-waarde verschillend van de meest voorkomende base-waarde op de positie van de varianten van het type "update" kleiner is dan de minimum telwaarde door het verwijderen van de specifiek base- waarde en het verhogen van de telwaarde van de meest voorkomende base-waarde met de telwaarde die behoort bij de specifieke base-waarde.The method of any one of claims 1 to 7, wherein the method further comprises an upgrade process comprising: - aggregating information of 
"update" variants at a particular position on the reference sequence to provide aggregated base-value information for obtain any position of the variants of the "update" type and a total value corresponding to the number of variants of the "update" type at said specific position on the reference sequence, the aggregated base-value information including a count value for each possible base -value for each position of the "update" variants; - determining an additional minimum count value based on the total value and a user-defined error ratio; - determining a most common base value at a position of the "update" variants; and - updating the first aggregated base value information, when the count value associated with a specific base value different from the most common base value at the position of the "update" variants is less than the minimum count value by removing the specific base value and increasing the count value of the most common base value by the count value associated with the specific base value. 9. 
De werkwijze volgens één van de conclusies 1 - 8, waarbij de werkwijze een gecombineerd proces uitvoert op een enkele verwerkingseenheid, het gecombineerde proces omvat ten minste het eerste aggregatieproces, het tweede aggregatieproces en het herschikkingsproces, de werkwijze omvat verder: - het segmenteren van de referentiegegevenssequentie in niet-overlappende segmenten van N posities; - P parallelle gecombineerd processen, waarbij P>1; waarbij een gecombineerd proces de uitgelijnde gegevens sequenties die een eerste base met een positienummer in een bereik dat M posities begint vóór de eerste positie van een segment en eindigt met de laatste positie van het niet-overlappende segment verkrijgt uit het invoerbestand om te worden verwerkt door het eerste aggregatieproces, het tweede aggregatieproces teneinde variant-call-gegevens voor de N posities geassocieerd met het segment te genereren.The method according to any of claims 1 to 8, wherein the method performs a combined process on a single processing unit, the combined process comprises at least the first aggregation process, the second aggregation process and the rearrangement process, the method further comprising: - segmenting of the reference data sequence in non-overlapping segments of N positions; - P parallel combined processes, where P> 1; wherein a combined process obtains the aligned data sequences that a first base with a position number in a range that starts M positions before the first position of a segment and ends with the last position of the non-overlapping segment from the input file to be processed by the first aggregation process, the second aggregation process to generate variant call data for the N positions associated with the segment. 10. 
De werkwijze volgens conclusie 9, waarbij de verwerking van de niet-overlappende segmenten wordt uitgevoerd door P verwerkingseenheden en op elkaar volgende niet-overlappende segmenten oplopende integer indexnummers hebben, de werkwijze verder omvat: - het verdelen van de verwerking van de niet-overlappende segmenten over de P verwerkingseenheden, waarbij de niet-overlappende segmenten cyclisch worden toegewezen aan de P verwerkingseenheden, een verwerkingseenheid met indexnummer i verwerkt achtereenvolgens segmenten met indexnummer i, i + P,The method of claim 9, wherein the processing of the non-overlapping segments is performed by P processing units and successive non-overlapping segments have ascending integer index numbers, the method further comprising: - dividing the processing of the non-overlapping segments overlapping segments over the P processing units, the non-overlapping segments being cyclically assigned to the P processing units, a processing unit with index number i sequentially processes segments with index number i, i + P, 11. 
De werkwijze volgens conclusie 10, waarbij de werkwijze P buffers gebruikt om variant-call-gegevens op te slaan, verwerkingseenheid met indexnummer i slaat de variant-call-gegevens die corresponderen met een niet-overlappende segment op in een buffer met indexnummer i, de werkwijze omvat verder: - het cyclisch verkrijgen van variant-call-gegevens die corresponderen met een niet-overlappende segment van buffer met indexnummer 1 tot P; en, - het toevoegen van de variant-call-gegevens die corresponderen met de niet-overlappende segmenten aan het VCF bestand.The method of claim 10, wherein the method uses P buffers to store variant call data, processor with index number i stores the variant call data corresponding to a non-overlapping segment in a buffer with index number i the method further comprises: cyclically obtaining variant call data corresponding to a non-overlapping segment of buffer with index number 1 to P; and - adding the variant call data corresponding to the non-overlapping segments to the VCF file. 12. De werkwijze volgens één van de conclusies 9-11, waarbij het gecombineerd proces verder omvat het filteren van de aggregatiedata.The method of any one of claims 9-11, wherein the combined process further comprises filtering the aggregation data. 13. 
Een op een computer geïmplementeerd systeem (1100) omvattende een processor (1110), een invoer / uitvoer inrichting (1130), een databank (1140) en een gegevensopslag (1120) verbonden met de processor, de gegevensopslag omvat instructies die, wanneer uitgevoerd door de processor (1110), ertoe leiden dat het op de computer geïmplementeerde systeem de werkwijze uitvoert volgens een van de conclusies 1-12.A computer-implemented system (1100) comprising a processor (1110), an input / output device (1130), a database (1140) and a data storage (1120) connected to the processor, the data storage including instructions that, when performed by the processor (1110), cause the system implemented on the computer to perform the method according to any of claims 1-12. 14. Een computerprogramma omvattende instructies die door een computerinrichting kunnen worden geladen, en die er toe leiden dat de computerinrichting een van de werkwijzen volgens conclusies 1-12 uitvoert.A computer program comprising instructions that can be loaded by a computer device, and that cause the computer device to perform one of the methods according to claims 1-12. 15. Een voor een processor leesbaar medium voorzien van een computerprogramma omvattende instructies die door een computerinrichting kunnen worden geladen, en die ertoe leiden dat de computerinrichting een van de werkwijzen volgens conclusies 1-12 uitvoert. *******A processor-readable medium provided with a computer program comprising instructions that can be loaded by a computer device and which cause the computer device to perform one of the methods according to claims 1-12. *******
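The upgrade process of claims 7 and 8 can be illustrated with a minimal Python sketch. It is not the patented implementation. The function name `upgrade_counts`, the dictionary layout, and in particular the rule `min_count = total * error_ratio` are assumptions made for illustration; the claims only state that the minimum count value is "based on" the total value and a user-defined error ratio.

```python
def upgrade_counts(counts, error_ratio, total=None):
    """Fold rare base values into the most common one at a single position.

    counts      -- dict mapping a base value ("A", "C", "G", "T") to its count
    error_ratio -- user-defined error ratio (hypothetical parameter name)
    total       -- number of variants at the reference position; defaults to
                   the sum of the counts (an assumption of this sketch)
    """
    if total is None:
        total = sum(counts.values())
    if not counts or total == 0:
        return dict(counts)

    # Assumed rule: the claims do not give the exact formula.
    min_count = total * error_ratio

    # Most common base value at this position (first one wins on a tie).
    most_common = max(counts, key=counts.get)

    updated = dict(counts)
    for base, n in counts.items():
        if base != most_common and n < min_count:
            # Remove the rare base value and credit its count to the
            # most common base value, as described in the last claim step.
            updated[most_common] += n
            del updated[base]
    return updated


if __name__ == "__main__":
    print(upgrade_counts({"A": 95, "C": 3, "G": 2}, error_ratio=0.05))
    print(upgrade_counts({"A": 60, "C": 38, "G": 2}, error_ratio=0.05))
```

In the second call, "C" is kept because its count is well above the assumed threshold, while "G" is treated as a probable sequencing error and merged into "A".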
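The segmentation and scheduling scheme of claims 9 to 11 can be sketched in the same hedged way. The sketch below runs sequentially and only models the bookkeeping: how segments of N positions get a read window that starts M positions earlier, how segments are assigned cyclically to P processing units, and how per-unit buffers are drained in cyclic order so that the output keeps segment order. All function names are hypothetical, and indices are 0-based here, whereas claim 11 numbers the buffers 1 to P.

```python
from collections import deque


def segment_windows(ref_length, n, m):
    """Return (index, fetch_from, first, last) per segment, 0-based inclusive.

    Reads whose first base lies in [fetch_from, last] are handed to the
    combined process that produces variant call data for [first, last].
    """
    windows = []
    for index, first in enumerate(range(0, ref_length, n)):
        last = min(first + n, ref_length) - 1
        fetch_from = max(0, first - m)  # start M positions before the segment
        windows.append((index, fetch_from, first, last))
    return windows


def assign_cyclically(num_segments, p):
    """Unit i processes segments i, i + P, and so on (round-robin)."""
    return {unit: list(range(unit, num_segments, p)) for unit in range(p)}


def collect_in_order(buffers, num_segments):
    """Drain the P buffers cyclically so results come out in segment order."""
    p = len(buffers)
    ordered = []
    for segment in range(num_segments):
        ordered.append(buffers[segment % p].popleft())
    return ordered


if __name__ == "__main__":
    windows = segment_windows(ref_length=10, n=4, m=2)
    plan = assign_cyclically(len(windows), p=2)
    # Each unit appends the result of each of its segments to its own buffer.
    buffers = {u: deque(f"calls[{s}]" for s in segs) for u, segs in plan.items()}
    print(windows)
    print(plan)
    print(collect_in_order(buffers, len(windows)))
```

Because each unit fills only its own buffer and the writer visits the buffers in a fixed cycle, no sorting step is needed before the variant call data is appended to the VCF file.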
NL2014199A 2015-01-27 2015-01-27 A computer implemented method for generating a variant call file. NL2014199B1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
NL2014199A NL2014199B1 (en) 2015-01-27 2015-01-27 A computer implemented method for generating a variant call file.

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
NL2014199A NL2014199B1 (en) 2015-01-27 2015-01-27 A computer implemented method for generating a variant call file.
PCT/NL2016/050062 WO2016122318A1 (en) 2015-01-27 2016-01-27 A computer implemented method for generating a variant call file

Publications (2)

Publication Number Publication Date
NL2014199A NL2014199A (en) 2016-09-28
NL2014199B1 NL2014199B1 (en) 2017-01-06

Family

ID=52815236

Family Applications (1)

Application Number Priority Date Filing Date Title
NL2014199A NL2014199B1 (en) 2015-01-27 2015-01-27 A computer implemented method for generating a variant call file.

Country Status (2)

Country Link
NL (1) NL2014199B1 (en)
WO (1) WO2016122318A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10395759B2 (en) 2015-05-18 2019-08-27 Regeneron Pharmaceuticals, Inc. Methods and systems for copy number variant detection
US10600499B2 (en) 2016-07-13 2020-03-24 Seven Bridges Genomics Inc. Systems and methods for reconciling variants in sequence data relative to reference sequence data

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9552458B2 (en) * 2012-03-16 2017-01-24 The Research Institute At Nationwide Children's Hospital Comprehensive analysis pipeline for discovery of human genetic variation
WO2014186604A1 (en) * 2013-05-15 2014-11-20 Edico Genome Corp. Bioinformatics systems, apparatuses, and methods executed on an integrated circuit processing platform

Also Published As

Publication number Publication date
NL2014199A (en) 2016-09-28
WO2016122318A1 (en) 2016-08-04

Similar Documents

Publication Publication Date Title
CN103959254B (en) The method and apparatus of the migration/duplication of the data after optimizing duplicate removal
NL2014199B1 (en) A computer implemented method for generating a variant call file.
US10552460B2 (en) Sensor data management apparatus, sensor data management method, and computer program product
CN103793628A (en) System and method for aligning genome sequence considering entire read
Hozza et al. How big is that genome? Estimating genome size and coverage from k-mer abundance spectra
JP5267748B2 (en) Operation management system, operation management method, and program
CN112000848A (en) Graph data processing method and device, electronic equipment and storage medium
CN111081315A (en) Method for detecting homologous pseudogene variation
Srivastava et al. Accurate, fast and lightweight clustering of de novo transcriptomes using fragment equivalence classes
CN109739819A (en) Snapshot lossless compression method, device, equipment and the readable storage medium storing program for executing that can be recalled
CN108595912A (en) Detect the method, apparatus and system of chromosomal aneuploidy
US20220139506A1 (en) Method for automatically collecteing and matching of laboratory data
JP6203313B2 (en) Feature selection device, feature selection method, and program
US8341376B1 (en) System, method, and computer program for repartitioning data based on access of the data
CN103793626A (en) System and method for aligning genome sequence
US20160357844A1 (en) Database apparatus, search apparatus, method of constructing partial graph, and search method
US20120330563A1 (en) Assembly Error Detection
KR101584857B1 (en) System and method for aligning genome sequnce
WO2012159320A1 (en) Method and device for clustering large-scale image data
JP6730587B2 (en) Cache miss estimation program, cache miss estimation method, and information processing apparatus
CN109978006B (en) Face image clustering method and device
CN112687339B (en) Method and device for counting sequence errors in plasma DNA fragment sequencing data
WO2020230397A1 (en) Method for detecting outlier among theoretical masses
CN107229663B (en) Data processing method and device and data table processing method and device
CN112397148A (en) Sequence comparison method, sequence correction method and device thereof

Legal Events

Date Code Title Description
PD Change of ownership

Owner name: GENALICE HOLDING B.V.; NL

Free format text: DETAILS ASSIGNMENT: CHANGE OF OWNER(S), ASSIGNMENT; FORMER OWNER NAME: GENALICE B.V.

Effective date: 20170404

PD Change of ownership

Owner name: NORLIN GENALICE LIMITED; GB

Free format text: DETAILS ASSIGNMENT: CHANGE OF OWNER(S), ASSIGNMENT; FORMER OWNER NAME: GENALICE HOLDING B.V.

Effective date: 20171116

PD Change of ownership

Owner name: 42 GENETICS LTD; GB

Free format text: DETAILS ASSIGNMENT: CHANGE OF OWNER(S), ASSIGNMENT; FORMER OWNER NAME: NORLIN GENALICE LIMITED

Effective date: 20210106