CN115552536A - Method and system for efficient data compression in MPEG-G - Google Patents


Publication number
CN115552536A
Authority
CN
China
Prior art keywords
data
annotation
genome
index
reads
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202180034395.5A
Other languages
Chinese (zh)
Inventor
C·艾伯蒂
马西莫·拉瓦西
保洛·里贝卡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Genomsys SA
Original Assignee
Genomsys SA

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • G16B50/50 Compression of genetic data
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics

Abstract

A computer-implemented method for storing or transmitting a representation of genome sequencing data in a genome file format that includes annotation data associated with the genome sequencing data, the genome sequencing data comprising reads of a nucleotide sequence, the method comprising the steps of: aligning (10) the reads to one or more reference sequences, thereby creating aligned reads, classifying (14) the aligned reads according to a classification rule based on a mapping of the aligned reads on the one or more reference sequences, thereby creating a class (18) of aligned reads, entropy encoding the classified aligned reads into a number of descriptor blocks, structuring the descriptor blocks with header information, thereby creating an access unit (119) of a first classification containing genome sequencing data, the method further comprising encoding annotation data (12) into different access units (122) of a second classification and encoding index data into a master annotation index (MAI, 123, 211), wherein the index data represents an encoded form of annotation string data obtained by employing at least one compressed string indexing algorithm (28) on the annotation string data (212), and wherein the MAI associates an encoded annotation string with the access unit of the second classification, the method further comprising associating the encoded annotation string with the access unit of the first classification and the MAI of the second classification.

Description

Method and system for efficient data compression in MPEG-G
Technical Field
The present invention relates to the field of data compression for MPEG-G.
The Moving Picture Experts Group (MPEG) is a working group of data compression experts formed by ISO and IEC to set standards for audio and video compression and transmission.
In the early 1990s this working group developed standards for efficient video compression. MPEG technology essentially consists of reducing the entropy of video and audio source data so that higher compression ratios can be achieved for efficient storage and transmission.
Given the large amount of expertise in data compression within the MPEG experts group, it was decided to develop a standard for compression of genomic information to overcome the limitations of the solutions (e.g., CRAM and BAM file formats) existing in the art.
Therefore, even though MPEG-G relates to the compression of genomic data, the main idea of exploiting data redundancy is taken from the field of video and audio compression, which is the closest technical field to the present application.
The present invention constructs syntax elements for genomic data in a manner similar to the syntax elements used for the compression of video and audio data in MPEG.
However, because genomic data is very different from audio and video data, the data classification and syntax elements differ from those used in the MPEG video and audio standards: the redundancies present in genomic data must be exploited, and these redundancies differ from those of multimedia data.
Thus, the present invention seeks to compress genomic data efficiently in order to obtain files that are reduced in size and also easy to access randomly in the compressed domain.
The present invention builds on the encoding and decoding methods, systems and computer programs disclosed in patent applications WO 2018/068827A1, WO 2018/068828A1, WO 2018/068829A1, WO 2018/068830A1, the disclosures of which relating to entropy encoding of genomic data may be necessary for understanding some aspects of the invention; the disclosures of the foregoing documents are therefore to be considered as being incorporated by reference in the present invention.
The present disclosure provides a novel method of representing annotations and metadata associated to genome sequencing data that reduces the utilized storage space by providing new indexing functionality not available with known prior art representation methods, provides a single syntax for several metadata formats and improves data access performance.
The methods disclosed in the present invention provide a higher compression ratio of genome sequencing data and associated annotations by:
Representing the genome sequencing data and associated annotations in terms of the syntax of numerical and textual descriptors as defined in the present disclosure
Compressing non-indexed descriptors separately from indexed text descriptors
Applying transformations such as differential coding, run-length coding, byte separation, and entropy coders (e.g., CABAC, Huffman coding, arithmetic coding, range coding) to non-indexed descriptors
Applying a compressed full-text string indexing algorithm, such as a compressed string pattern-matching data structure, a compressed suffix array, an FM-index, or a hash table, to indexed text descriptors, thereby eliminating the redundancy of holding both an index and a compressed payload, as existing methods require.
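As an illustration of two of the transformations named above, the following minimal sketch applies differential coding followed by run-length coding to a list of mapping positions. The function names and the sample values are hypothetical and are not taken from the disclosure; an actual encoder would pass the result on to an entropy coder.

```python
from typing import List, Tuple


def delta_encode(values: List[int]) -> List[int]:
    """Keep the first value, then store each value as the difference from its predecessor."""
    return [v - p for v, p in zip(values, [0] + values[:-1])]


def delta_decode(deltas: List[int]) -> List[int]:
    """Invert delta_encode by accumulating the differences."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out


def rle_encode(values: List[int]) -> List[Tuple[int, int]]:
    """Collapse consecutive repeats into (value, run_length) pairs."""
    runs: List[Tuple[int, int]] = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            runs.append((v, 1))
    return runs


def rle_decode(runs: List[Tuple[int, int]]) -> List[int]:
    """Invert rle_encode by expanding each run."""
    return [v for v, n in runs for _ in range(n)]


# Hypothetical mapping positions of reads sorted along a reference.
positions = [1000, 1010, 1020, 1030, 1030, 1030, 1045]
deltas = delta_encode(positions)
runs = rle_encode(deltas)
```

Sorted positions become small, repetitive differences, which is what makes the subsequent run-length and entropy coding stages effective.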
An advantage of compressing non-indexed descriptors separately from indexed text descriptors is that these two types of data, once grouped separately, exhibit lower entropy than when they are encoded together, and therefore a higher compression ratio can be achieved.
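The effect of grouping can be checked with a simple order-0 entropy estimate. The sketch below is an illustration only: the two sample streams are invented, and the estimate ignores the cost of transmitting the probability model itself.

```python
import math
from collections import Counter
from typing import Sequence


def entropy_bits(symbols: Sequence) -> float:
    """Total order-0 Shannon information content of a sequence, in bits."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum(c * math.log2(c / n) for c in counts.values())


# Hypothetical descriptor streams (not taken from the disclosure).
numeric = [0, 0, 1, 0, 0, 1, 0, 0]
text = list("geneAgeneB")
mixed = numeric + text

separate_cost = entropy_bits(numeric) + entropy_bits(text)
mixed_cost = entropy_bits(mixed)
```

Because entropy is concave, the cost of the separately modelled streams never exceeds the cost of the mixed stream under this estimate; here the separate streams need roughly 28 bits against roughly 46 bits for the mixed one.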
By using a compressed full-text string indexing algorithm, the methods described in the present invention do not require both a compressed payload with genomic information and an index of that information to support selective access, thus achieving a better compression ratio. The compressed full-text string index serves as both the index and the compressed information: it can be used to perform selective access and to retrieve the desired information through decompression. The present invention thereby overcomes the need to have both an index and a compressed payload, as currently required by existing solutions in the art.
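The idea that one structure can act as both index and payload can be illustrated with a toy FM-index. The sketch below is uncompressed and uses naive rank queries, so it shows only the principle, not the algorithm of the disclosure; the class name, the sentinel character and the sample string are assumptions.

```python
def bwt(text: str, sentinel: str = "\0") -> str:
    """Burrows-Wheeler transform via sorted rotations (fine for short inputs)."""
    s = text + sentinel
    order = sorted(range(len(s)), key=lambda i: s[i:] + s[:i])
    return "".join(s[i - 1] for i in order)


class FMIndex:
    """Toy FM-index: supports pattern counting and full text recovery."""

    def __init__(self, text: str) -> None:
        self.last = bwt(text)
        # first[c] = number of characters in the text that sort strictly before c.
        self.first = {}
        total = 0
        for c in sorted(set(self.last)):
            self.first[c] = total
            total += self.last.count(c)

    def _occ(self, c: str, i: int) -> int:
        """Occurrences of c in last[:i]; real indexes use sampled rank tables."""
        return self.last.count(c, 0, i)

    def count(self, pattern: str) -> int:
        """Number of occurrences of pattern, found by backward search."""
        lo, hi = 0, len(self.last)
        for c in reversed(pattern):
            if c not in self.first:
                return 0
            lo = self.first[c] + self._occ(c, lo)
            hi = self.first[c] + self._occ(c, hi)
            if lo >= hi:
                return 0
        return hi - lo

    def text(self) -> str:
        """Recover the original text from the transform alone (LF-mapping)."""
        out, row = [], 0  # row 0 is the rotation that starts with the sentinel
        for _ in range(len(self.last) - 1):
            c = self.last[row]
            out.append(c)
            row = self.first[c] + self._occ(c, row)
        return "".join(reversed(out))


index = FMIndex("BRCA1|TP53|BRCA2")
```

Here `index.count("BRCA")` answers a query without a separate index file, and `index.text()` reconstructs the stored strings without a separate payload.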
The method also allows previously unrelated genome annotation-related concepts to be hierarchically described and stored in a compressed form. This makes it possible to encode relationships between such concepts that could not previously be described, thus allowing a novel way of describing and interchanging data.
Background
Genomic or proteomic information generated by DNA, RNA or protein sequencing machines is transformed during different stages of data processing to produce a wide variety of data. In prior art solutions, these data are stored in computer files having different and unrelated structures. This information is therefore very difficult to archive, communicate, and process.
Reference to genomic or proteomic sequences in the present invention includes, for example, but is not limited to, nucleotide sequences, deoxyribonucleic acid (DNA) sequences, ribonucleic acid (RNA) and amino acid sequences.
Sequence alignment refers to the process of arranging sequence reads by finding regions of similarity that can be the result of functional, structural, or evolutionary relationships between sequences. When alignment is performed with reference to a pre-existing nucleotide sequence, referred to as a "reference sequence", the process is referred to as "mapping". Prior art solutions store this information in "SAM", "BAM" or "CRAM" files. The process of performing an alignment of sequences is also referred to as "aligning".
The concept of aligning sequences to reconstruct a partial or whole genome is depicted in fig. 2 of WO2018068827A1, the disclosure of which is incorporated herein by reference.
There is clearly a need to provide appropriate genome sequencing data and metadata representation (genome file format) by organizing and partitioning the data, so that compression of the data and metadata is maximized, and several functionalities are efficiently achieved, such as selective access and support for efficient incremental updates and other data processing functionalities at different stages of the genome data lifecycle.
Furthermore, when genome sequencing data generated by high-throughput sequencing machines is analyzed by processing pipelines and analysts, annotations are generated that describe several distinct properties of different regions of the genome; these are currently represented in a wide variety of text formats. While the different types of results and annotations are conceptually related to each other and ideally need to be accessed and used together, current solutions keep these metadata apart, as independent text files separate from the encoded data relating to the same genomic reads. These formats do not support any type of linkage between elements of one document and conceptually related elements of other documents that may share a common biological meaning.
In the best case, this lack of explicit linkage means that processing and using genomic data and annotation information requires time-consuming and highly inefficient parsing of potentially large text files when searching for specific information and associated metadata. In the worst case, the impossibility of describing such connections prevents the development of efficient bioinformatic workflows and databases for downstream applications such as biomedical research or personalized medicine.
For example, RNA sequencing reads aligned onto a gene (which typically consists of a set of intervals on a reference genome) need to be counted in order to measure the extent of expression of the gene in the biological conditions used for the experiment. Different biological conditions (resulting in different sets of reads generated by different experiments) are typically compared in the context of a particular experiment targeted to find a path linking a genotype to a phenotype. The process of generating information about individual reads and their alignment to the reference genome and aggregating that information into results with more general genetic and biological implications is called "secondary analysis".
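The counting step described above can be sketched as follows. This is a simplified illustration assuming closed (inclusive) coordinates on a single reference sequence; the function name and the coordinates are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), both inclusive (assumed convention)


def count_reads_on_gene(gene_intervals: List[Interval], reads: List[Interval]) -> int:
    """Count reads that overlap at least one interval of the gene."""
    return sum(
        any(r_start <= g_end and g_start <= r_end for g_start, g_end in gene_intervals)
        for r_start, r_end in reads
    )


# A hypothetical gene made of two intervals, and reads from two conditions.
gene = [(100, 200), (300, 400)]
condition_a = [(90, 110), (150, 180), (250, 260), (390, 420)]
condition_b = [(500, 550), (310, 320)]

count_a = count_reads_on_gene(gene, condition_a)
count_b = count_reads_on_gene(gene, condition_b)
```

Comparing `count_a` with `count_b` is the kind of aggregation across experiments that requires reads and annotations to be accessible together.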
Different types of annotations (meta-information) generated by secondary analysis using genome sequencing reads may be conceptually linked to genome sequencing reads aligned to one or more intervals of a genome sequence used as a reference.
A genomic interval can be uniquely identified by specifying the sequence of nucleotides in a reference assembly (e.g., a chromosome, a gene, a set of contiguous bases, or a single base in the genome), the strand of the molecule, which can be forward or reverse, and the start and end positions of the range of bases (i.e., nucleotides) contained in the interval.
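The three identifying components of a genomic interval map naturally onto a small record type. The sketch below is illustrative only: the field names, the "+"/"-" strand encoding and the inclusive coordinate convention are assumptions, not definitions from the disclosure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenomicInterval:
    sequence: str  # name of the reference sequence, e.g. a chromosome
    strand: str    # "+" for forward, "-" for reverse (assumed encoding)
    start: int     # first base of the interval, inclusive (assumed convention)
    end: int       # last base of the interval, inclusive

    def __post_init__(self) -> None:
        if self.strand not in ("+", "-"):
            raise ValueError("strand must be '+' or '-'")
        if self.start > self.end:
            raise ValueError("start must not exceed end")

    def length(self) -> int:
        """Number of bases covered; a single-base interval has length 1."""
        return self.end - self.start + 1

    def overlaps(self, other: "GenomicInterval") -> bool:
        """True when both intervals share at least one base on the same strand."""
        return (
            self.sequence == other.sequence
            and self.strand == other.strand
            and self.start <= other.end
            and other.start <= self.end
        )


exon = GenomicInterval("chr1", "+", 1000, 1200)
variant = GenomicInterval("chr1", "+", 1100, 1100)
```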
Features associated with a genomic interval, such as variants, the number of aligned reads at a given location (also denoted as "coverage"), the portion of the genome that binds to a protein, the nature and location of the gene, and the region associated with a particular gene function, can be uniquely identified and associated with the genomic interval. The interval can be as short as a single base, or it can span thousands of nucleotides or more.
A comprehensive analysis of genomic sequencing data can be built by integrating numerous experiments. Each experiment is generally characterized by a different sequencing-derived protocol, which is used to sample a different function or compartment of the cell. The results produced in each experiment by the primary analysis (i.e., alignment of reads relative to a reference) and the secondary analysis (i.e., integration and statistical studies performed on the results of the alignment) can be displayed graphically using a software application called a genome browser, which enables one-dimensional navigation of the genome along the positions of nucleotides. The information generated by the secondary analysis and associated with each position or interval of the genome is typically displayed as a separate curve (or "track") per sequencing experiment, representing, for example, the presence and structure of transcripts, sequence variants in an individual or population, the coverage of sequencing reads, or the strength of protein binding at each position of the genome.
The state-of-the-art genome annotation formats generated by analysis tools represent all of the aforementioned results (also referred to as "features") using several diverse, independently defined and maintained formats. Such formats are typically characterized by poor and inconsistent syntax and semantics, which leads to a proliferation of slightly different and incompatible file formats for each type of analysis result. A drawback of all existing solutions is that scientists working on integrated analysis of genomic data are forced to systematically transcode between formats, using complex chains of text-processing tools and programs, whenever they need to jointly access and study sets of experiments. This proliferation of formats leads to poor interoperability and reproducibility of results across groups of scientists who use even slightly different representations and associated semantics.
The formats most commonly used in the art to represent genome annotations generated by genome sequencing data analysis are:
The Variant Call Format (VCF), which represents variants relative to a reference genome that may be present in a single individual or in a population of individuals;
The Browser Extensible Data (BED) format, which supports the representation of data lines typically displayed in annotation tracks presented in genome browsers. http://genome.ucsc.edu/FAQ/FAQformat#format1
The Generic Feature Format (GFF), which represents genomic features in a tab-delimited text file with 9 columns.
The Gene Transfer Format (GTF), which is an extension of GFF and is backward compatible with GFF.
The BigWig format, which is used to represent dense continuous data to be displayed as a graph in a genome browser.
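To make the nine-column layout of GFF concrete, the following sketch parses a single GFF3 feature line. The column names follow the GFF3 convention (seqid, source, type, start, end, score, strand, phase, attributes); the sample line is invented, and the parser skips details such as percent-decoding of attribute values.

```python
from typing import Dict, NamedTuple, Optional


class GffRecord(NamedTuple):
    seqid: str
    source: str
    type: str
    start: int
    end: int
    score: Optional[float]
    strand: str
    phase: Optional[int]
    attributes: Dict[str, str]


def parse_gff3_line(line: str) -> GffRecord:
    """Split one tab-delimited GFF3 feature line into its nine columns."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 columns, found {len(cols)}")
    # Column 9 holds semicolon-separated key=value pairs.
    attributes = dict(item.split("=", 1) for item in cols[8].split(";") if item)
    return GffRecord(
        seqid=cols[0],
        source=cols[1],
        type=cols[2],
        start=int(cols[3]),
        end=int(cols[4]),
        score=None if cols[5] == "." else float(cols[5]),
        strand=cols[6],
        phase=None if cols[7] == "." else int(cols[7]),
        attributes=attributes,
    )


record = parse_gff3_line("chr1\texample\tgene\t1000\t2000\t.\t+\t.\tID=gene1;Name=BRCA2")
```

Each format listed above needs its own parser of this kind, which is the transcoding burden the preceding paragraphs describe.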
Moreover, the fact that it is not possible to describe such diverse data by means of a unified hierarchical structure means that it is also not possible at all to describe the relationships between features belonging to different classes, which makes progress in the field more difficult.
Disclosure of Invention
In order to solve the above-mentioned problems of the prior art, the subject matter of claims 1, 9, 12, 14 and 16 is presented. Advantageous modifications are indicated in the dependent claims.
More specifically, the present disclosure provides a computer-implemented method for encoding, storing and/or transmitting a representation of genome sequencing data in a genome file format comprising annotation data associated with the genome sequencing data, the genome sequencing data comprising reads of nucleotide sequences, the method comprising the steps of:
aligning the reads to one or more reference sequences, thereby creating aligned reads,
classifying the aligned reads according to a classification rule based on a mapping of the aligned reads on the one or more reference sequences, thereby creating a class of aligned reads,
entropy encoding the classified aligned reads into a number of descriptor blocks,
structuring the descriptor blocks with header information, thereby creating an access unit of a first classification containing genome sequencing data,
the method further includes encoding annotation data into different access units of a second classification and encoding index data into a master annotation index (MAI), wherein the index data represents an encoded form of annotation string data obtained by employing at least one compressed string indexing algorithm on the annotation string data, and wherein the MAI associates an encoded annotation string with the access units of the second classification.
Preferably, the method further comprises jointly encoding the access units of the first classification, the access units of the second classification and the MAI.
The method may further comprise the steps of: storing or transmitting encoded genomic sequencing data on or to a computer-readable storage medium; or the encoded genome sequencing data may be made available to the user in any other manner known in the art, such as by transmitting the genome sequencing data over a data network or another data infrastructure.
In the context of the present disclosure, a descriptor may be implemented, for example, as a genome annotation descriptor as defined in the detailed description below.
Further preferably, the access unit of the second classification containing genome annotation data further comprises information data identifying a genomic interval, wherein the genomic interval identifies a nucleotide sequence in the one or more reference sequences such that annotation data contained in the access unit of the second classification is associated with the associated encoded read of the genomic sequence contained in the access unit of the first classification containing genome sequencing data.
According to a (further) preferred embodiment, the encoding of the annotation data and the index data comprises the steps of:
encoding genome annotation data into a genome annotation descriptor, wherein the genome annotation descriptor comprises a numerical descriptor and a textual descriptor, the encoding comprising the steps of:
-selecting a subset of text descriptors from said text descriptors according to configuration parameters, in particular provided by a user;
-transforming said subset of text descriptors by employing a first string transformation method to produce a string index;
-transforming and encoding said string index by employing a string index transformation method, thereby generating master annotation index data;
-transforming said numeric descriptors and text descriptors not included in said subset of text descriptors by employing at least one second transformation method different from the first transformation method;
-encoding said numeric descriptors and text descriptors not included in said subset of text descriptors into separate access units of a second classification by employing at least one first entropy encoder for numeric descriptors and at least one second entropy encoder for text descriptors not included in said subset of text descriptors.
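The routing of descriptors implied by the steps above can be sketched as follows: text descriptors named in the configuration go to the string index, and the remaining text and numeric descriptors go to separate streams for their respective transformations and entropy coders. The record layout, the field names and the function name are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple, Union

Value = Union[int, float, str]
Streams = Dict[str, List[Value]]


def route_descriptors(
    records: Iterable[Dict[str, Value]], indexed_fields: Set[str]
) -> Tuple[Streams, Streams, Streams]:
    """Split annotation records into indexed text, non-indexed text and numeric streams."""
    indexed: Streams = {}
    plain_text: Streams = {}
    numeric: Streams = {}
    for record in records:
        for name, value in record.items():
            if isinstance(value, str):
                # The configuration decides which text descriptors are indexed.
                target = indexed if name in indexed_fields else plain_text
            else:
                target = numeric
            target.setdefault(name, []).append(value)
    return indexed, plain_text, numeric


# Hypothetical annotation records and user configuration.
records = [
    {"name": "BRCA2", "note": "exon", "start": 100},
    {"name": "TP53", "note": "intron", "start": 200},
]
indexed, plain_text, numeric = route_descriptors(records, indexed_fields={"name"})
```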
Further preferably, the first string transformation method includes the steps of:
-inserting a string terminator character after each text descriptor for signaling the termination of each text descriptor;
-concatenating the text descriptors;
-interleaving the genome annotation record index data to associate the text descriptor with the location of the genome annotation record within the access unit of the second classification.
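The three steps above can be sketched as a byte-level transformation. The terminator value, the 4-byte big-endian width of the record position, and the placement of the position immediately after the terminator are assumptions made for this illustration; the disclosure does not fix them in this passage.

```python
import struct
from typing import List, Tuple

TERMINATOR = b"\x00"   # assumed terminator byte
POSITION_FORMAT = ">I"  # assumed 4-byte big-endian record position


def transform(descriptors: List[Tuple[str, int]]) -> bytes:
    """Terminate, concatenate and interleave (text, record_position) pairs."""
    out = bytearray()
    for text, record_position in descriptors:
        encoded = text.encode("utf-8")
        if TERMINATOR in encoded:
            raise ValueError("text descriptor must not contain the terminator")
        out += encoded + TERMINATOR + struct.pack(POSITION_FORMAT, record_position)
    return bytes(out)


def record_position_at(buffer: bytes, match_offset: int) -> int:
    """Given the offset of a match inside a descriptor, read the position stored after it."""
    end = buffer.index(TERMINATOR, match_offset)
    (record_position,) = struct.unpack_from(POSITION_FORMAT, buffer, end + 1)
    return record_position


buffer = transform([("BRCA2", 7), ("TP53", 42)])
offset = buffer.find(b"TP53")
```

A search hit on a descriptor can thus be resolved to the location of its genome annotation record by scanning forward to the terminator, which is the association the interleaving step provides.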
According to a (further) preferred embodiment, the string index transformation method is one of string pattern matching, suffix array, FM-index, hash table.
Preferably, the at least one second transformation method is one of: differential encoding, run-length encoding, byte separation, and entropy encoders (e.g., CABAC, Huffman encoding, arithmetic encoding, range encoding).
According to a (further) preferred embodiment, the master annotation index contains in its header the number of AU types and the number of indices of each AU type.
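A header carrying these two pieces of information could be serialized as sketched below. The field widths (one byte for the number of AU types, one byte per AU type identifier, two bytes per index count) are assumptions chosen for the example and are not taken from the disclosure or from ISO/IEC 23092.

```python
import struct
from typing import Dict, Tuple


def pack_mai_header(indexes_per_au_type: Dict[int, int]) -> bytes:
    """Serialize the number of AU types, then (AU type, number of indexes) pairs."""
    out = struct.pack(">B", len(indexes_per_au_type))
    for au_type, n_indexes in sorted(indexes_per_au_type.items()):
        out += struct.pack(">BH", au_type, n_indexes)
    return out


def unpack_mai_header(data: bytes) -> Tuple[Dict[int, int], int]:
    """Parse the header; return the per-type counts and the number of bytes consumed."""
    (n_types,) = struct.unpack_from(">B", data, 0)
    offset = 1
    result: Dict[int, int] = {}
    for _ in range(n_types):
        au_type, n_indexes = struct.unpack_from(">BH", data, offset)
        result[au_type] = n_indexes
        offset += struct.calcsize(">BH")
    return result, offset


# Hypothetical content: AU type 1 has 2 indexes, AU type 3 has 1 index.
header = pack_mai_header({1: 2, 3: 1})
```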
Further preferably, the method described above further comprises encoding the classified unaligned reads.
The object of the present invention is further solved by a method for decoding and extracting nucleotide sequences and genome annotation data encoded according to the method described above, said method comprising the steps of:
parsing the genomic data multiplex into a genomic syntax element layer;
parsing the compressed annotation data;
parsing the master annotation index;
expanding the genomic layer into classified reads of nucleotide sequences;
selectively decoding the classified reads of nucleotide sequences on one or more reference sequences to generate uncompressed reads of nucleotide sequences;
selectively decoding the annotation data associated with the classified reads.
Preferably, the method further comprises decoding information data relating to a genomic interval, wherein the genomic interval identifies a nucleotide sequence in the one or more reference sequences such that the annotation data is associated with the relevant encoded reads of the genomic sequence.
Further preferably, the method further comprises decoding data encoded according to the method described above for storing or transmitting a representation of the genome sequencing data in a genome file format comprising annotation data associated with the genome sequencing data.
According to another aspect of the present disclosure, a genome encoder for compressing genome sequence data in a genome file format comprising annotation data associated with the genome sequencing data is presented, wherein the genome sequence data comprises reads of a nucleotide sequence, and wherein the encoder comprises:
-an alignment unit for aligning the reads to one or more reference sequences, thereby creating aligned reads;
-a data classification unit for classifying the aligned reads according to a classification rule based on a mapping of the aligned reads on the one or more reference sequences, thereby creating classes of aligned reads;
-an entropy encoding unit for entropy encoding the classified aligned reads into a number of descriptor blocks;
-an access unit coding unit for structuring the descriptor blocks with header information, thereby creating an access unit of a first classification containing genome sequencing data;
-a genomic annotation encoding unit for encoding annotation data into different access units of the second classification and encoding index data into a master annotation index (MAI), wherein the index data represents an encoded form of annotation string data obtained by employing at least one compressed string indexing algorithm on the annotation string data, and wherein the MAI associates an encoded annotation string with the access units of the second classification.
Preferably, the encoder comprises means for jointly encoding said access unit of the first classification, said access unit of the second classification and said MAI.
According to a (further) preferred embodiment, the genomic encoder comprises encoding means for performing the steps of the encoding method described above.
The present disclosure further relates to a genome decoder apparatus for decoding a nucleotide sequence and genome annotation data encoded by an encoder as described above, the decoder comprising:
-means for parsing the genomic data multiplex into a genomic syntax element layer;
-means for parsing the compressed annotation data;
-means for parsing a master annotation index;
-means for expanding the genomic layer into classified reads of nucleotide sequences;
-means for selectively decoding the classified reads of nucleotide sequences on one or more reference sequences to generate uncompressed reads of nucleotide sequences;
-means for selectively decoding the annotation data associated with the classified reads.
Preferably, the genomic decoder further comprises decoding means for performing the steps of the decoding method described above.
According to another aspect of the disclosure, a computer-readable medium is presented, comprising instructions, which when executed by at least one processor, cause the at least one processor to perform the method described above.
Terms
In this disclosure, the following terms and expressions are used:
Bitstream syntax: structure of data encoded as a bit sequence (i.e., bitstream) in digital data storage or communication applications. The term refers to the format of an encoded bitstream that is typically generated by an encoding application (i.e., an encoder) and processed as input by a decoding application (i.e., a decoder) to reconstruct uncompressed data when compression is used. The bitstream syntax uses several syntax elements to represent the information encoded in the bitstream.
Syntax element: a component of a bitstream syntax representing one or more characteristics of the encoded information. In the bitstream generated by the encoder, the syntax elements may be compressed or uncompressed.
Source model: in information theory, the expression "source model" denotes the definition of the set of events generated by a source, their contexts, and the probability associated with each event and the corresponding context. In data compression, knowledge of the source of the information to be encoded is used to define a source model, which makes it possible to reduce the entropy of the model and, therefore, the number of bits required to represent (i.e., encode) the information generated by the source.
Sequencing data: a set of sequencing reads generated by a sequencing protocol.
Sequencing reads (i.e., reads): in sequencing, reads are deduced sequences (or base pair probabilities) that correspond to all or part of the base pairs of a nucleic acid molecule.
Genome interval: a series of bases (i.e., nucleotides) between the beginning and ending positions on a nucleotide sequence, such as a chromosome, a gene, a transcriptome, or any other nucleotide sequence.
Genomic feature: a set of genomic intervals sharing biological properties.
Annotation data: quantitative, qualitative, or sequencing information associated with a genomic feature. These include variants, browser tracks, functional annotations, methylation patterns and levels, sequencing coverage and statistics, feature expression matrices, contact matrices, and the affinity of proteins for nucleic acids.
Functional annotation: information associated with genomic features, particularly with respect to the hierarchy of concepts related to biological transcription and translation of genomic information (genes, transcripts, exons, coding sequences, etc.). The formats currently used to represent this information include GFF, GTF, BED and all their derivatives.
Multiplexer: an encoding module that receives as input a number of different types of access units and generates a structured bitstream for streaming or file storage purposes.
Genome annotation record: a data structure consisting of a collection of genome annotation descriptors representing a genomic interval and annotation data relating to genome function annotations, browser tracks, genome variants, gene expression information, contact matrices, and other annotations associated with the genomic interval. One genome annotation record can be logically linked to other genome annotation records and associated annotations.
String data structure: a data structure for indexing strings to allow rapid searching in a compressed domain.
Master Index Table (MIT): an index described in ISO/IEC 23092-1, WO2018068827A1 and WO2018152143A1, used to associate genomic intervals and classes of encoded genomic sequencing reads with the access units that carry the compressed reads, and associated metadata, mapped on those intervals.
Genomic compressed data blocks, access units, genomic data layers, genomic data multiplex
The data structure further disclosed by the invention relies on the following concept:
A data block is defined as a collection of descriptor vector elements of the same type (e.g., positions, distances, reverse-complement flags, positions and types of mismatches) that make up a layer. A layer is typically made up of a large number of data blocks. A data block may be partitioned into genomic data packets, as described in co-pending patent application No. WO2018068830A1, incorporated herein by reference; these packets are transmission units whose size is typically specified according to communication channel requirements. This partitioning is desirable to achieve transmission efficiency using typical network communication protocols.
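Partitioning a data block into transmission-sized packets can be sketched in a few lines. The function name and the payload size are hypothetical; the packet headers described in the referenced application are omitted.

```python
from typing import List


def packetize(block: bytes, max_payload: int) -> List[bytes]:
    """Split a data block into consecutive payloads of at most max_payload bytes."""
    if max_payload <= 0:
        raise ValueError("max_payload must be positive")
    return [block[i:i + max_payload] for i in range(0, len(block), max_payload)]


# A hypothetical 10-byte block sent over a channel that accepts 4-byte payloads.
packets = packetize(b"ABCDEFGHIJ", max_payload=4)
```

Concatenating the payloads in order restores the original block, so the partitioning is transparent to the decoder.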
An access unit is defined as a subset of the genomic data that can be fully decoded either independently of other access units, by using only globally available data (e.g., decoder configuration), or by using information contained in other access units. An access unit consists of a header and the result of multiplexing data blocks of different layers. Several packets of the same type are encapsulated in a block, and several blocks are multiplexed in one access unit. These concepts are depicted in figures 5 and 6 of WO 2018068827. For clarity, in this disclosure, access units containing compressed genomic sequencing data are referred to as access units of a first classification, while access units containing compressed annotation data are referred to as access units of a second classification.
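The "header plus multiplexed blocks" structure can be sketched as a small container type. The byte layout below (a 4-byte identifier, a 1-byte classification, a 1-byte block count, then each block preceded by a 1-byte descriptor identifier and a 4-byte length) is an assumption made for illustration and does not reproduce the ISO/IEC 23092 syntax.

```python
import struct
from dataclasses import dataclass, field
from typing import Dict

HEADER = ">IBB"        # assumed: access unit id, classification, number of blocks
BLOCK_HEADER = ">BI"   # assumed: descriptor id, payload length


@dataclass
class AccessUnit:
    au_id: int
    classification: int  # 1 = sequencing data, 2 = annotation data
    blocks: Dict[int, bytes] = field(default_factory=dict)  # descriptor id -> payload

    def to_bytes(self) -> bytes:
        """Write the header, then multiplex the blocks one after another."""
        out = struct.pack(HEADER, self.au_id, self.classification, len(self.blocks))
        for descriptor_id, payload in sorted(self.blocks.items()):
            out += struct.pack(BLOCK_HEADER, descriptor_id, len(payload)) + payload
        return out

    @classmethod
    def from_bytes(cls, data: bytes) -> "AccessUnit":
        """Parse the header, then demultiplex each block by its declared length."""
        au_id, classification, n_blocks = struct.unpack_from(HEADER, data, 0)
        offset = struct.calcsize(HEADER)
        blocks: Dict[int, bytes] = {}
        for _ in range(n_blocks):
            descriptor_id, length = struct.unpack_from(BLOCK_HEADER, data, offset)
            offset += struct.calcsize(BLOCK_HEADER)
            blocks[descriptor_id] = data[offset:offset + length]
            offset += length
        return cls(au_id, classification, blocks)


# A hypothetical annotation access unit holding two descriptor blocks.
unit = AccessUnit(au_id=5, classification=2, blocks={0: b"\x01\x02", 3: b"abc"})
encoded = unit.to_bytes()
```

Because each block carries its own length, a decoder can skip blocks it does not need, which is one way selective access within an access unit could be supported.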
A genomic data layer is defined as a collection of blocks of genomic data that encode the same type of data (e.g., blocks of locations in the same layer that encode perfectly matched reads on a reference genome).
A genomic data stream is a packetized version of a genomic data layer, in which the encoded genomic data is carried as the payload of genomic data packets that include additional service data in a header. An example of packetizing three genomic data layers into three genomic data streams is shown in figure 7 of WO 2018068827.
A genomic data multiplex is defined as a sequence of genomic access units used to convey genomic data related to one or more processes of genomic sequencing, analysis, or processing. Figure 7 of WO2018068827 provides a schematic diagram of a genomic multiplex carrying three genomic data streams decomposed into access units. The access units encapsulate data blocks belonging to the three streams, partitioned into genomic packets to be transmitted over the transport network.
Drawings
FIG. 1 shows the relationship between the present invention and the encoding device described in ISO/IEC 23092.
FIG. 2 shows a coding device for genome annotation that works according to the principles of the present invention and extends the coding device described in ISO/IEC 23092.
FIG. 3 shows a decoding device for genome annotation that operates according to the principles of the present invention and extends the decoding device described in ISO/IEC 23092.
FIG. 4 shows a decoding device for genome annotation that operates according to the principles of the present invention and extends the decoding device described in ISO/IEC 23092 to allow partial decoding driven by text interrogation.
FIG. 5 shows an example of a possible layout of an uncompressed index of a string index that may be used to illustrate the string index algorithm presented in this disclosure.
FIG. 6 shows how string index algorithms of two families are combined in order to maximize compression and speed beyond what would be possible by using only one family.
FIG. 7 shows the relationship between the present invention and the decoding apparatus described in ISO/IEC 23092.
FIG. 8 illustrates how the conceptual organization of data described in the present invention provides for a text query to be performed.
FIG. 9 illustrates how the conceptual organization of data described in the present invention provides for a search over a genomic interval to be performed.
Detailed Description
Important aspects of the disclosed solution are:
1. Classifying sequence reads into different classes based on the results of the alignment relative to a reference sequence, so that encoded data can be selectively accessed according to criteria related to the results of the alignment. This implies a file format specification that "contains" the structured data elements in compressed form. This approach can be seen as being in contrast to prior art approaches (e.g., SAM and BAM) in which data is structured in an uncompressed form and then the entire file is compressed. A first significant advantage of the described method is that various forms of selective access to data elements in the compressed domain can be efficiently and naturally provided, which is not possible or very inconvenient in prior art methods.
2. Decomposing the classified reads into layers of homogeneous metadata in order to reduce the entropy of the information as much as possible. The decomposition of genomic information into specific homogeneous data and metadata "layers" provides the significant advantage of being able to define different models of information sources characterized by low entropy. Such models may not only differ between layers, but may also differ within each layer. This structuring enables the use of the most appropriate specific compression for each type of data or metadata and parts thereof, with a significant gain in coding efficiency over prior art methods.
3. Structuring the layers into access units, i.e., genomic information that can be decoded independently by using only globally available parameters (e.g., decoder configuration) or by using information contained in other access units. When compressed data within a layer is partitioned into data blocks that are included in access units, different models of information sources characterized by low entropy may be defined.
4. Structuring the information such that any relevant subset of data used by a genomic analysis application can be efficiently and selectively accessed by means of an appropriate interface. These features enable faster access to data and more efficient processing. The primary index table and the local index table enable selective access to information carried by the encoded (i.e., compressed) data layers without the need to decode the entire compressed data. Furthermore, the association mechanism between the various data layers is specified to enable selective access to any possible combination of a subset of semantically associated data and/or metadata layers without the need to decode all layers.
5. Joint storage of the primary index table and the access units.
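How an index table can give selective access to access units without decoding the whole file can be sketched as follows. This is a minimal illustration under stated assumptions: the entry layout (start, end, offset, size), the keying by reference name and read class, and the names IndexEntry, IndexTable and lookup are hypothetical and do not reproduce the index syntax of the disclosure or of ISO/IEC 23092.

```python
import bisect
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class IndexEntry:
    start: int    # first reference position covered by the access unit
    end: int      # last reference position covered (inclusive)
    offset: int   # byte offset of the access unit in the file
    size: int     # byte size of the access unit


class IndexTable:
    """Hypothetical index: (reference, class) -> entries sorted by start."""

    def __init__(self) -> None:
        self._entries: Dict[Tuple[str, str], List[IndexEntry]] = {}

    def add(self, ref: str, read_class: str, entry: IndexEntry) -> None:
        entries = self._entries.setdefault((ref, read_class), [])
        entries.append(entry)
        entries.sort(key=lambda e: e.start)

    def lookup(self, ref: str, read_class: str,
               q_start: int, q_end: int) -> List[IndexEntry]:
        """Return the access units overlapping [q_start, q_end]."""
        entries = self._entries.get((ref, read_class), [])
        starts = [e.start for e in entries]
        # Only entries starting at or before q_end can overlap the query.
        hi = bisect.bisect_right(starts, q_end)
        return [e for e in entries[:hi] if e.end >= q_start]
```

A decoder using such a table would seek to the returned byte offsets and decode only those access units, leaving the rest of the compressed data untouched.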
The encoding scheme for the genome reads is shown in the encoder of fig. 1.
Classification of sequence reads
Sequence reads generated by a sequencing machine are classified by the disclosed invention into five different "classes" based on the results of the alignment relative to one or more reference sequences. The classes are defined according to the presence of substitutions, insertions, deletions, and clipped bases relative to the one or more reference sequences, based on the matching to, or mapping on, a reference genome.
Five results are possible when aligning a DNA sequence of nucleotides relative to a reference sequence:
1. A region in the reference sequence is found to match the sequence read without any errors (perfect mapping). Such nucleotide sequences will be referred to as "perfectly matching reads" or denoted as "class P".
2. A region in the reference sequence is found to match the sequence read, with mismatches consisting only of positions where the sequencing machine was unable to call any base (or nucleotide). Such mismatches are denoted as "N". Such sequences will be referred to as "N mismatching reads" or "class N".
3. A region in the reference sequence is found to match the sequence read, with mismatches consisting of positions where the sequencing machine was unable to call any base (or nucleotide) or called a base different from the one reported in the reference genome. This type of mismatch is called a Single Nucleotide Variation (SNV) or a Single Nucleotide Polymorphism (SNP). Such sequences will be referred to as "M mismatching reads" or "class M".
4. The fourth class consists of sequencing reads that exhibit the same mismatch types as class M plus the presence of insertions or deletions (i.e., indels). Insertions are represented by sequences of one or more nucleotides that are not present in the reference but are present in the read sequence. According to the literature, when an inserted sequence is at the edge of the read, it is referred to as "soft clipped" (i.e., the nucleotides do not match the reference but are kept in the aligned read, in contrast to "hard-clipped" nucleotides, which are discarded). Deletions are "holes" (missing nucleotides) in the read aligned relative to the reference. Such sequences will be referred to as "I mismatching reads" or "class I".
5. The fifth class contains all reads that do not find any valid mapping on the reference genome according to the specified alignment constraints. Such sequences are referred to as unmapped and belong to "class U".
Unmapped reads may be assembled into a single sequence using de novo assembly algorithms. Once the new sequence has been created, the unmapped reads may be further mapped relative to it and classified into one of the four classes P, N, M, and I.
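The five-way classification above can be sketched as follows. The function name, its arguments, and the representation of an alignment (the reference window covered by the read plus a flag for indels or soft clips) are simplifying assumptions for illustration; a real encoder would derive this information from the aligner output.

```python
from typing import Optional


def classify_read(read: str, ref_window: Optional[str],
                  has_indels_or_clips: bool = False) -> str:
    """Return 'P', 'N', 'M', 'I' or 'U' for one read.

    ref_window is the stretch of reference the read was aligned to,
    or None when no valid mapping was found (hypothetical convention).
    """
    if ref_window is None:
        return "U"                      # unmapped
    if has_indels_or_clips:
        return "I"                      # insertions, deletions or soft clips
    if len(read) != len(ref_window):
        raise ValueError("gapless alignment requires equal lengths")
    mismatches = [(r, g) for r, g in zip(read, ref_window) if r != g]
    if not mismatches:
        return "P"                      # perfect match
    if all(r == "N" for r, _ in mismatches):
        return "N"                      # only uncalled bases differ
    return "M"                          # at least one substitution
```

For instance, classify_read("ANGT", "ACGT") yields "N", whereas classify_read("ANGA", "ACGT") yields "M" because a called base differs from the reference.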
Once the classification of the reads is completed with the definition of the class, further processing includes defining a set of distinct syntax elements representing the remaining information, enabling the reconstruction of the DNA read sequence when represented as mapping on a given reference sequence. A DNA segment related to a given reference sequence may be adequately expressed by:
syntax elements used in the encoding of genomic reads
The start position (pos) on the reference genome.
A flag that signals whether a read must be treated as reverse complement (rcomp) with respect to a reference.
Distance to the mate in the case of paired reads (pair).
The value of the read length, in case the sequencing technology produces variable-length reads. With a constant read length, the read length associated with each read can be omitted and stored once in the main file header.
Additional flags that describe specific characteristics of the read (duplicate reads, first or second read of a pair, etc.).
For each mismatch:
o Mismatch position (nmis for class N, snpp for class M, and indp for class I)
o Mismatch type (not present in class N, snpt in class M, indt in class I)
Optional soft-clipped nucleotide strings (when present) (indc in class I).
This classification creates groups of descriptors (syntax elements) that can be used to unambiguously represent genomic sequence reads.
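As an illustration of how such descriptors suffice to rebuild a read from the reference, the following minimal sketch reconstructs a class P or class M read. The field names mirror the descriptors listed above (pos, rcomp, and substitutions standing in for snpp/snpt), but the conventions are assumptions: 0-based positions, substitution offsets relative to the read start, and substitutions expressed in reference orientation before any reverse complement is applied. Pairing, flags, and indels are left out.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


@dataclass
class ReadDescriptors:
    pos: int                     # 0-based start position on the reference
    length: int                  # read length
    rcomp: bool = False          # read is the reverse complement of the reference
    # (offset within read, substituted base); simplified snpp/snpt pair
    substitutions: List[Tuple[int, str]] = field(default_factory=list)


def reconstruct(reference: str, d: ReadDescriptors) -> str:
    """Rebuild the read sequence from the reference and its descriptors."""
    bases = list(reference[d.pos:d.pos + d.length])
    for offset, base in d.substitutions:
        bases[offset] = base
    seq = "".join(bases)
    if d.rcomp:
        seq = seq.translate(_COMPLEMENT)[::-1]
    return seq
```

With the reference "ACGTACGTAC", the descriptors ReadDescriptors(pos=2, length=4, substitutions=[(1, "A")]) rebuild the read "GAAC" without the read bases themselves ever being stored.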
For each layer of the genomic data structure disclosed in the present invention, different encoding algorithms may be employed depending on the particular characteristics of the data or metadata carried by that layer and its statistical properties. A "coding algorithm" is to be understood as the association of a particular "source model" of the descriptors with a particular "entropy coder". A particular "source model" may be specified and selected to achieve the most efficient encoding of the data in terms of minimization of source entropy. The selection of the entropy coder may be driven by coding efficiency considerations and/or probability distribution characteristics and associated implementation issues. Each selection of a particular coding algorithm will be referred to as a "coding mode" that applies to an entire "layer" or to all "data blocks" contained in an access unit. Each "source model" associated with a coding mode is characterized by:
Definition of the syntax elements (e.g., read position, read pairing information, mismatches with respect to a reference sequence, etc.) emitted by each source.
Definition of the associated probability model.
Definition of an associated entropy coder.
For each data layer, the source model employed in one access unit is independent of the source models used by other access units of the same data layer. This enables each access unit to use the most efficient source model for each data layer in terms of minimization of entropy.
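The idea of choosing a source model per access unit so as to minimize entropy can be sketched as follows. The two candidate models ("raw" values and "delta" between consecutive values) and the use of zero-order empirical entropy as the selection criterion are assumptions made for illustration; the disclosure does not prescribe these particular models, and an actual encoder would pair the chosen model with a real entropy coder.

```python
import math
from collections import Counter
from typing import Callable, Dict, List, Sequence, Tuple


def empirical_entropy(symbols: Sequence[int]) -> float:
    """Zero-order entropy in bits per symbol."""
    n = len(symbols)
    if n == 0:
        return 0.0
    counts = Counter(symbols)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def delta(values: Sequence[int]) -> List[int]:
    """First value followed by differences between consecutive values."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


# Hypothetical candidate source models for one descriptor.
SOURCE_MODELS: Dict[str, Callable[[Sequence[int]], List[int]]] = {
    "raw": lambda v: list(v),
    "delta": delta,
}


def choose_source_model(values: Sequence[int]) -> Tuple[str, Dict[str, float]]:
    """Pick, for one access unit, the model with the lowest entropy."""
    scores = {name: empirical_entropy(model(values))
              for name, model in SOURCE_MODELS.items()}
    return min(scores, key=scores.get), scores
```

For evenly spaced read positions such as [100, 105, 110, 115, 120, 125], the delta model wins because almost all emitted symbols are identical; for a constant descriptor such as [7, 7, 7, 7], the raw model wins. Each access unit can make this choice independently.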
Genome annotation
Genome annotations, browser tracks, variant information, gene expression matrices, and the other annotations referred to in the present invention are associated with, for example, but not limited to, nucleotide sequences, deoxyribonucleic acid (DNA) sequences, ribonucleic acid (RNA) sequences, and amino acid sequences. Although the annotation of a reference genome in the form of a nucleotide sequence is described in considerable detail herein, it will be appreciated that the compression methods and systems may also be implemented for the annotation of other genomic or proteomic sequences, albeit with a few variations, as will be appreciated by those skilled in the art.
Genome functional annotations are defined as notes, added at the identified locations of genes and of coding or non-coding regions of the genome, that explain or comment on the function of those genes and their transcripts.
Genomic variants (or variants) describe the differences between a genomic sample and a reference genome. Variants are generally classified as small-scale (e.g., substitutions, insertions, and deletions) and large-scale, i.e., structural variants (e.g., copy number changes and chromosomal rearrangements).
A genome browser track is a plot associated with aligned genome sequencing reads and displayed in a genome browser. Each point in the plot corresponds to a position in the reference genome and expresses information associated with that position. Typical information represented as browser tracks includes the presence and structure of transcripts, sequence variants in individuals or populations, coverage of sequencing reads, strength of protein binding at each position of the genome, and the like.
A gene expression matrix is a two-dimensional array in which rows represent genomic features (typically genes or transcripts), columns represent various samples or experimental conditions such as tissues, and each entry counts the number of times a gene is expressed in a particular sample (these counts are also referred to as the "expression level" of a particular gene).
Contact matrices are generated by Hi-C experiments; each entry (i, j) measures the strength of the physical interaction between two genomic regions i and j at the DNA level. At the finest granularity, i and j represent two positions on the genome, which is represented as a single sequence of all concatenated chromosomes.
Limitations of the current state of the art
So far, the classes of annotation data listed above have been represented using different and incompatible text formats, which are typically compressed using general-purpose text compressors such as gzip, bzip2, and the like. In most cases, parsers process this information by first decompressing the entire file and then parsing the decoded text to find (and, if present, extract) the desired piece of information. The formats for each category of data are modified quite frequently, independently and sometimes substantially, by different users or groups of users, generating several "variants" or "dialects" of the same format. This leads to serious interoperability problems and requires each file format variant to be "sanitized" first in order to be able to exchange data.
Another limitation of current formats is the lack of support for establishing links between different types of annotation data represented in compressed form. For example, correlating a set of variants to a given gene requires:
1) Decompressing and parsing the variant file (e.g., decompressing BCF to VCF)
2) Decompressing and parsing the gene annotation file (e.g., GTF/GFF)
3) Using the genomic positions of the variants and of the genes, obtained from the two parsing operations over the entire files, to establish the links, which would require yet another ad hoc format that did not exist at the time of writing.
A disadvantage of the prior art formats is that the data are stored in different files. This is inefficient in terms of data compression and does not support any efficient process for performing queries on compressed files. Retrieving all variants associated with a given gene XYZ, and possibly also the expression of said gene in a sample set, cannot be done without decompressing all the files concerned and parsing all their content. Today, the described process of associating variants with genes can only be achieved by combining several inefficient operations of data decompression, parsing, and processing, and by describing the relationships between different features by means of novel ad hoc formats that are not currently available or standardized.
Use case: variant detection in a clinical setting
By way of example (but not by way of limitation), the methods disclosed herein address the shortcomings of current solutions when attempting to determine variants of clinical relevance with a variant detection pipeline and to display the results in a manner that allows a clinician to easily review and verify them. The goal is to use genome resequencing to identify variants that may be associated with a disease of interest or with the manifestation of a particular phenotype. Variants are determined by first aligning genome sequencing reads to a reference genome and then detecting genomic variants, such as Single Nucleotide Polymorphisms (SNPs), via a suitable variant detection program that uses the alignment information accumulated ("piled up") over all reads at each position. Variant detection is a complex operation that requires a multi-stage pipeline of specialized tools. False positive or false negative results may occur due to several technical issues, such as fluctuations in coverage or the location of variants in repetitive genomic regions. Because of these problems, in clinical settings, variants of potential clinical importance are often manually validated by human operators prior to inclusion in medical reports. However, data processing and validation require access to, and correlation of, several information elements (genomic sequence, genomic annotation, read alignments, sequencing coverage, sequencing pileup in the regions flanking variants), each typically stored in a separate file and represented using a different file format. In particular, current techniques cannot explicitly state relationships such as "this set of sequencing reads aligned to this range of positions (i.e., interval) in the genome supports this variant contained in this genomic feature", because the different entities (aligned reads, variants, genomic features) are represented in separate and distinct files. Today, this result can only be achieved by:
1) The various files are decompressed to retrieve the original text representation of the information for the entire sample.
2) The text file is parsed and features of interest (e.g., genomic intervals, gene names, annotation names, etc.) are searched.
3) The (slightly) different names used in different files to identify the same features are mapped to one another where necessary (different naming conventions exist for the same genomic features).
4) The retrieved information is aggregated in a single container and exposed to the end user or processing application in a proprietary format.
These various steps may take a very long time depending on the size of the parsed text file, which may be in the range of several gigabytes to several hundred gigabytes.
The present invention aims to address these limitations by providing:
1. A unified compressed representation of annotations capable of representing the information content of: browser tracks, genomic variants, gene expression data, contact matrices, and other metadata associated with genomic sequencing data
2. High compression performance of the unified representation yielding higher compression ratios than state of the art solutions
3. An embedded indexing feature that provides explicit browsing capabilities for annotations and metadata in the compressed domain. The indexing feature supports complex queries that return a hierarchy of related data structures containing biologically linked annotations, browser tracks, genomic variants, gene expression information, contact matrices, and other annotations associated with intervals of aligned genomic sequencing data
4. A mechanism to explicitly join indexed and compressed sequencing raw data and associated metadata with indexed and compressed annotations. Such mechanisms enable selective access to annotations and associated related sequence reads in the compressed domain by interrogating the compressed raw data or the compressed annotation data.
In this example of variant detection in a clinical environment, data processing and visual display are achieved by encoding two distinct compressed data structures (which may or may not be contained in the same file) joined by a bi-directional indexing mechanism. The data structures contain:
1. Genome sequencing reads and related alignment information
2. Annotation information (annotations, browser tracks, genomic variants, gene expression information, contact matrices, and other annotation data) as described in this disclosure.
In particular, the encoded information is contained in a hierarchical structure as described in this disclosure, in which the following links are established:
1. Each variant is linked to the genes or genomic features that contain it (if any), with details about the function and ontology of each gene
2. Each variant is linked to the reads that support it, i.e., the reads from which the variant was detected
3. Each variant is linked to a pileup profile obtained from the reads that support the variant.
4. Any other kind of annotation information previously described.
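A bi-directional link between annotation entries and sequencing data can be pictured with a minimal in-memory sketch. The class name LinkIndex and the identifier scheme ("variant:...", "gene:...", "au:..." for an access unit of reads) are hypothetical and serve only to show that each link can be followed from either side; the disclosure embeds such links in the compressed index rather than in Python dictionaries.

```python
from collections import defaultdict
from typing import DefaultDict, List, Set


class LinkIndex:
    """Stores each link in both directions so either side can be queried."""

    def __init__(self) -> None:
        self._forward: DefaultDict[str, Set[str]] = defaultdict(set)
        self._reverse: DefaultDict[str, Set[str]] = defaultdict(set)

    def link(self, source: str, target: str) -> None:
        self._forward[source].add(target)
        self._reverse[target].add(source)

    def targets(self, source: str) -> List[str]:
        """Everything a given entry points to (e.g., variant -> gene, reads)."""
        return sorted(self._forward.get(source, ()))

    def sources(self, target: str) -> List[str]:
        """Everything pointing to a given entry (e.g., gene -> variants)."""
        return sorted(self._reverse.get(target, ()))
```

After linking "variant:v1" to "gene:G1" and to "au:17", a query starting from the variant returns its gene and the access unit holding its supporting reads, while a query starting from the gene returns all variants linked to it.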
The current state of the art allows the different sources of information needed for genomic data annotation and variant detection to be represented separately (aligned reads with SAM/BAM/CRAM files, genomic annotations with GTF/GFF3 files, variants with VCF/BCF files, and various index file formats needed to perform range searches). It does not support an explicit representation of bi-directional relationships between different entities. Furthermore, the software analysis workflow (or "pipeline") that performs variant detection needs to operate on different file formats depending on the analysis stage, rather than on a single data structure as provided by the present disclosure. It is possible to present the different sources of information in a single genome browser, but this requires manipulation of several different file formats, and there is no way to indicate to the genome browser that features belonging to different files are related.
Technical advantages for variant detection analysis
In one embodiment, the present invention provides important technical advantages for use cases of variant detection analysis as described below.
The advantages of the current method over state of the art solutions in terms of efficient data retrieval for variant detection analysis are as follows.
1. Because the method provides an explicit representation of the relationships between sequencing reads and genomic features, applications such as genome browsers need to support and manage only a single data container and associated bitstream format, rather than a large number of formats that may not be interoperable.
2. Through the use of a genome browser or other similar means, clinicians and scientists can explore the relationships between variants, the reads supporting them, and the names and functions of the gene or genes involved. In particular, the integration between different types of information allows clinicians and scientists to verify the correctness of variant detection (e.g., to exclude false detections due to the presence of duplicate reads and/or repetitive reference regions, or due to the lack of realignment whenever multiple indels are present at different locations) or to assess the possible importance of variants by virtue of the function of the genes involved or their presence in a database of known variants.
3. Via the possibility of conducting a text search of the meta-information contained in the file, the clinician or scientist can correlate the presence/absence of multiple variants based on gene function (e.g., by retrieving all variants contained in genes with similar function or genes with multiple functional copies, or by retrieving all variants contained in known databases with similar clinical effect).
4. The analysis pipeline can make selective access on a single encoded data structure throughout all stages (from alignment to variant detection), resulting in simpler and economical software development/data access patterns and lower operating costs.
5. Because the relationships are established explicitly when the data is encoded, and all relationships are encoded in the browsable index without the need to decompress and parse the entire file and possibly disconnected files, it is possible to discard irrelevant features (e.g., variants that are present in the database but not in the re-sequenced individual, or variants that are not relevant to the pathology under consideration), thus enabling higher compression.
6. All processing steps 1 to 5 requiring data access can be performed with an indexing mechanism embedded in the compressed data to support retrieval with a single query of both sequencing reads and all associated annotations from a single compressed file structure. The sequencing reads and associated annotations may also be de-coupled and encapsulated in a separate file to enable transmission of only the required portions of the data.
Limitations of state-of-the-art solutions for variant detection
The state of the art supports different pieces of information needed to represent the described use case by using different data structures and formats (aligned reads utilize SAM/BAM/CRAM file format, genome annotations utilize GTF/GFF3 file format, variants utilize VCF/BCF file format, and various types of independent index file formats for conducting range-only searches). These state of the art techniques do not support explicit representation and concatenation of relationships between different pieces of information. The pipeline performing variant detection needs to operate on different file formats depending on the analysis stage rather than on a single compressed data structure that is selectively accessible as proposed in the current approach. With state of the art techniques it is possible to feed different pieces of genomic information to a genome browser, but this requires a complex pre-processing stage consisting of manipulating and parsing several different file formats in an uncompressed form. Furthermore, correlations between annotations, biological features, and sequencing data cannot be specified to the genome browser for proper display.
Use case: establishing and interrogating a population-level repository of genomic variant data
By way of example (but not by way of limitation), the methods disclosed herein address the shortcomings of existing solutions when attempting to compile large databases of genomic variants. This scenario is similar to that considered in the previous case, i.e., a setting in which a researcher or clinician attempts to validate and collect genomic variants based on sequencing techniques. However, we now assume that the researcher or clinician is interested in cataloging a large number of variants (ideally, all variants in each genome) for a potentially very large number of individuals (consider, for instance, initiatives that attempt to cover an increasing portion of the population, with the ultimate goal of covering all of it). In this example, variant detection is first performed on one sample, generally following the analysis steps described in the previous use case; the process is then repeated for all samples. The researcher will then typically ask questions about the results of the data analysis, such as "How many individuals possess this particular variant?", "Is this variant consistently supported in all individuals considered?", or "How many people in a sample have any of the variants contained in a given dataset of clinically relevant variants, and what is the list of such variants for each individual?" Currently, there are several ways to store lists of variants, usually as VCF/BCF files; however, such population-level files are very large, which makes querying them technically challenging, and only very limited querying capabilities (i.e., retrieving the variants in a specified genomic interval) are available.
Technical advantages
The advantages of the current approach over the state of the art solutions are as follows:
1. The possibility of storing a large set of variants in a more compact way. This is due to the fact that the approach disclosed herein explicitly separates and describes the information sources about variants, thus making it possible to specify better compression techniques tailored to each information source.
2. The possibility of performing more complex queries in the compressed domain. This is also due to the fact that the data is divided into several streams with specified semantics depending on the individual, which makes selective access and filtering possible in addition to the genome coordinate based range access.
3. The possibility to link information about the detection of variants with other kinds of information such as: functional annotations present at the location of the variant; support sequencing reads for each variant; the intensity of a certain signal at that location derived from other sequencing techniques (e.g., from ChIP-seq experiments); and so on.
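The population-level questions quoted in this use case can be expressed as simple set operations once the information about which individual carries which variant is kept separate from the other variant attributes. The sketch below uses a plain mapping from variant identifier to the set of carrier identifiers; this layout, the function names, and all identifiers are assumptions for illustration and do not represent the compressed streams of the disclosure.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set

# Hypothetical store: variant identifier -> identifiers of carrier individuals.
VariantStore = Dict[str, Set[str]]


def count_carriers(store: VariantStore, variant: str) -> int:
    """How many individuals possess this particular variant?"""
    return len(store.get(variant, set()))


def variants_per_individual(store: VariantStore,
                            panel: Iterable[str]) -> Dict[str, List[str]]:
    """For each individual carrying any variant of a panel, list those variants."""
    result: Dict[str, List[str]] = defaultdict(list)
    for variant in sorted(panel):
        for individual in store.get(variant, ()):
            result[individual].append(variant)
    return dict(result)
```

The number of individuals carrying at least one clinically relevant variant is then the number of keys returned by variants_per_individual, and the per-individual lists answer the follow-up question directly.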
Limitations of state-of-the-art solutions
Although it is possible to store large databases by means of currently available formats such as VCF/BCF, the process is cumbersome due to the complexity of the format. Moreover, because general-purpose compression methods are used and different information sources are mixed together in the same record, compression is less efficient and the resulting files are relatively large. Furthermore, formats such as VCF/BCF are not designed with complex queries in mind: they can only be queried by genomic range in order to retrieve all variants present in a genomic interval. Further filtering must be performed separately, e.g., selecting variants depending on whether they are present in a given individual. Finally, as described in the previous use case, information about genomic variants cannot be cross-referenced with other sources of information, such as a list of supporting sequencing reads or a list of functional genomic features.
Use case: correlating information from complex omics experiments
By way of example (but not by way of limitation), the methods disclosed herein address the shortcomings and inefficiencies of current solutions when attempting to determine the biological mechanisms through which a particular phenotype originates. This is achieved by encoding several pieces of information (e.g., several experiments based on "omics" sequencing) in the same compressed data structure. Identification of complex molecular mechanisms requires the combination of several experimental techniques, each probing a different cell compartment (e.g., ChIP-seq experiments investigating chromatin structure, bisulfite-sequencing experiments determining genomic methylation, and RNA-seq experiments determining how transcription is modulated).
The underlying molecular mechanisms of a genotype are determined by analyzing the interactions and correlations between patterns that occur simultaneously in different cell compartments when the same biological condition is sequenced. Chromatin marks are determined as peaks in ChIP-seq tracks obtained by accumulating alignments to the reference genome; methylation patterns are obtained by a special alignment pipeline capable of processing BS-seq data, since bisulfite treatment generates reads with modified bases whose sequence is not present in the original genome; RNA-seq data is processed by a specialized alignment pipeline capable of performing spliced alignment, since the cellular machinery derives RNA sequences by joining together one or more blocks of the genomic sequence ("exons") and discarding the sequences that occur between the blocks ("introns"), which results in sequences that are not present in the original genome; and so on, depending on the particular omics experiment under consideration.
The data generated by each "omics" experiment typically requires a complex analysis pipeline that is individually tailored to the type of sequence (ChIP-seq, BS-seq, RNA-seq, etc.) generated by the particular biological protocol employed. Each pipeline typically requires multiple types of data (genome sequence, genome annotation, sequencing reads, read alignments, sequencing coverage, sequencing pileup) to be considered and correlated, each type of data typically being stored in a different file and represented using a different file format. In particular, the current technology cannot explicitly state relationships such as "in a given biological condition, this set of sequencing reads aligned to this range of positions in the genome supports this ChIP-seq peak, which correlates with a specific pattern of RNA expression and genome/histone methylation", since the different entities (aligned reads, ChIP-seq peaks, methylation patterns, genomic features, different biological conditions) are represented separately in different files.
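The notion of a track obtained by accumulating alignments, and of peaks within it, can be sketched as follows. Alignments are given as half-open intervals [start, end) on a reference of known length, and a peak is taken to be any maximal run of positions whose depth reaches a fixed threshold. The fixed threshold is a deliberate simplification assumed here for illustration; actual ChIP-seq peak callers rely on statistical models.

```python
from typing import Iterable, List, Tuple


def coverage(length: int, alignments: Iterable[Tuple[int, int]]) -> List[int]:
    """Per-position read depth from half-open alignment intervals."""
    diff = [0] * (length + 1)
    for start, end in alignments:
        diff[start] += 1      # depth rises where an alignment begins
        diff[end] -= 1        # and falls just past its last position
    depth, track = 0, []
    for change in diff[:length]:
        depth += change
        track.append(depth)
    return track


def call_peaks(track: List[int], min_depth: int) -> List[Tuple[int, int]]:
    """Maximal half-open intervals where depth >= min_depth."""
    peaks, start = [], None
    for i, depth in enumerate(track):
        if depth >= min_depth and start is None:
            start = i
        elif depth < min_depth and start is not None:
            peaks.append((start, i))
            start = None
    if start is not None:
        peaks.append((start, len(track)))
    return peaks
```

Keeping the alignment intervals that fall within each reported peak is what makes it possible to record which reads support which peak.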
Technical advantages of data processing and visual display
In one embodiment, genomic data processing and visual display are improved by means of the present invention by representing in the same compressed data structure:
1. Genome sequencing reads and related alignment information
2. Annotation information (gene model, stacking profile, methylation pattern, detected ChIP-seq peaks, expression levels derived from RNA sequencing) as described in this disclosure.
In particular, the joint compressed data structure contains a hierarchical organization as described in this disclosure, which joins:
1. linking methylation patterns, chIP-seq peaks, and RNA expression in different biological conditions to the genes or genomic features it contains (if present), with details about the function and ontology of each gene
2. Linking methylation patterns, chIP-seq peaks, and RNA expression in different biological conditions to the reads it supports, i.e., to the reads that support each of the features described
3. Each feature is coupled to a build-up profile obtained from the reads supporting the feature.
The advantages of the present approach over existing solutions in terms of efficient data retrieval for correlating information from several "omics" experiments are listed below.
1. Because the present method provides explicit representations of the relationships between sequencing reads and omics features, as well as of the relationships between different omics features, applications such as genome browsers need to support and manage only a single data container and associated bitstream format rather than a multitude of non-interoperable formats
2. Via a browser or other means, researchers can explore the relationships between different "omics" features, the reads that support them, and the names and functions of the genes that contain them. In particular, the integration between different types of information allows researchers to infer correlations or causal relationships between different "omics" features highlighted by the experiments, thereby marking genomic regions of interest for subsequent experimental validation
3. Through the possibility of conducting a textual search of the annotations contained in the file, a researcher can correlate the presence or absence of multiple "omics" features based on gene function (e.g., by retrieving all features contained in genes with similar function or in genes with multiple functional copies)
4. The analysis pipeline can run through all stages (from alignment to variant detection) while operating on a single compressed data structure for all types of omics data, resulting in simpler software development and data access patterns
5. Because the relationships are established explicitly when the file is encoded, and all relationships are held in the same file rather than spread over separate, unconnected files, it is possible to discard irrelevant features (e.g., "omics" features occurring outside the region of interest), thus enabling higher compression.
Limitations of existing solutions for linkage of different genomic features
The prior art allows the user to represent the different sources of information needed for this use case only separately (aligned reads using SAM/BAM/CRAM files, genome annotations using GTF/GFF3 files, ChIP-seq peaks, RNA expression levels and other omics features using other file types, plus the various index file formats needed to perform range searches). It does not support an explicit representation of the relationships between different entities. A pipeline that analyzes each kind of omics data needs to operate on different file formats depending on the analysis stage, rather than on the single compressed data structure proposed in the present approach. It is possible to present different information sources in a single genome browser, but this requires manipulating several different file formats, and there is no way to tell the genome browser that features belonging to different files are related.
Concepts and terms
Access unit
With reference to WO 2018/068827A1, WO/2018/068828A1, and WO/2018/068830A1, throughout this disclosure access units (AUs) are defined as logical data structures containing encoded representations of genomic information to facilitate bitstream access and manipulation. An access unit is the smallest data organization that can be decoded by a decoding device implementing the invention described in this disclosure. An access unit is characterized by header information and a payload of compressed data structured into a sequence of blocks, each block possibly compressed using a different compression scheme.
The invention described herein introduces new access unit types containing genome annotation data such as genome features, functional annotations, browser tracks, genome variants, gene expression information, contact matrices, genotype data, and the like.
In the context of the present disclosure, the following definitions apply:
Genome annotation record: a data structure consisting of a collection of genome annotation descriptors that describe genomic features such as genome functional annotations, browser tracks, genome variants, gene expression information, contact matrices, genotype data, and other annotations associated with genome intervals. Each genome annotation record is identified by a unique identifier as shown in Table 1.
Genomic feature: a genomic feature is defined herein as any segment of biologically meaningful information that is associated with genomic sequencing data. By way of example (but not by way of limitation), genomic features include: genome annotations, browser tracks, genome variants, gene expression information, contact matrices.
Access unit start position: the smallest mapped position on a reference sequence (e.g., a chromosome) for which the access unit encodes genomic data or metadata.
Access unit end position: the largest mapped position on a reference sequence (e.g., a chromosome) for which the access unit encodes genomic data or metadata.
Access unit range: a genome range between the access unit start position and the access unit end position.
Access unit size: the number of genome annotation records contained in the access unit.
Access unit footprint: a genome range between the access unit start position and the access unit end position.
In the context of the present disclosure, one or more access units are organized in a structure referred to as a genomic dataset. The genomic dataset is a compressed unit containing a header and one or more access units. The set of access units that make up the genomic dataset constitutes the genomic dataset payload.
A collection of one or more genomic datasets is referred to as a dataset group.
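The access unit definitions above can be illustrated with a minimal sketch. The class and field names (`AnnotationRecord`, `AccessUnit`, `record_id`, `start`, `end`) are hypothetical and are not part of the bitstream syntax; the sketch only shows how the start position, end position, range and size follow from the records held in an access unit.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class AnnotationRecord:
    record_id: int  # hypothetical unique identifier of the record
    start: int      # first mapped position on the reference sequence
    end: int        # last mapped position on the reference sequence (inclusive)


@dataclass
class AccessUnit:
    records: List[AnnotationRecord] = field(default_factory=list)

    @property
    def start_position(self) -> int:
        # Smallest mapped position for which this access unit encodes data.
        return min(r.start for r in self.records)

    @property
    def end_position(self) -> int:
        # Largest mapped position for which this access unit encodes data.
        return max(r.end for r in self.records)

    @property
    def au_range(self) -> Tuple[int, int]:
        # Genome range between the start and end positions.
        return (self.start_position, self.end_position)

    @property
    def size(self) -> int:
        # Number of genome annotation records in the access unit.
        return len(self.records)


au = AccessUnit([
    AnnotationRecord(0, 100, 250),
    AnnotationRecord(1, 180, 400),
    AnnotationRecord(2, 90, 120),
])
print(au.au_range, au.size)
```

Note that the records need not be sorted by position for these definitions to hold; the range is determined by the extreme positions only.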
Read class: ISO/IEC 23092 and WO 2018/068827A1, WO/2018/068828A1 and WO/2018/068830A1 and WO2018152143A1 specify how genomic sequence reads are classified and encoded according to the results of the alignment of the reads on a reference genome. Each read or read pair is assigned to a different class depending on the type and number of mapping errors.
AU class: each AU contains reads belonging to a single class.
Annotation data type: in the context of the present disclosure, the annotation data type characterizes a collection of genomic annotation information belonging to one of these categories: genome features, functional annotations, browser tracks, genome variants, gene expression information, contact matrices, genotype data, genome sample information.
Genome annotation descriptors
In the context of the present disclosure, a genome annotation descriptor is a syntax element that represents a portion of the information (and elements of the syntax structure of the file format and/or bitstream) necessary to reconstruct (i.e., decode) the encoded reference sequence, sequence reads, associated mapping information, annotations, browser tracks, genome variants, gene expression information, contact matrices, and other annotations associated to the genome sequencing data. The genome annotation descriptors common to all annotation data types disclosed in the present invention are listed in table 1.
Other descriptors specific to each annotation data type are disclosed in syntax and semantic tables specific to each annotation data type.
A text descriptor is a descriptor represented as a character string, and a numerical descriptor is a descriptor represented by a numerical value.
Genome annotation descriptors can be of three types:
Numerical descriptors, expressed as numerical values
Text descriptors, expressed as strings
Properties, which are data structures defined in this disclosure (section entitled "Properties")
Figure BDA0003936497030000211
Table 1: Descriptors common to all annotation data types
According to the method disclosed in the present invention, genome annotations, browser tracks, genome variants, gene expression information, contact matrices, and other annotation data types associated with genome sequencing data are encoded using a subset of the descriptors listed in Table 1, which are then entropy encoded using a number of entropy encoders chosen according to the specific statistical properties of each descriptor. This means that descriptors of different types are grouped separately and encoded with different entropy encoders, thereby achieving higher compression. Blocks of compressed descriptors with homogeneous statistical properties are structured in access units, which represent the smallest encoded representation of one or more genomic features that can be manipulated by a device implementing the invention described in this disclosure.
The genome annotation descriptors are organized into blocks and streams as defined below.
A block is defined as a unit of data consisting of a header and a payload, where the payload is composed of portions of compressed descriptors of the same type.
A descriptor stream is defined as a sequence of encoded descriptor blocks used to decode a descriptor of a particular data class.
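The grouping of descriptors into separately coded blocks can be sketched as follows. This is only an illustration under stated assumptions: the descriptor names (`start`, `end`, `name`) are hypothetical, and the general-purpose compressors `zlib` and `bz2` from the Python standard library stand in for the entropy coders of the disclosure, which are not reproduced here. The point shown is that each descriptor type gets its own stream and coder, and that one stream can be decoded without touching the others.

```python
import bz2
import struct
import zlib

# Illustrative records; positions are made-up values.
records = [
    {"start": 100, "end": 250, "name": "BRCA1"},
    {"start": 300, "end": 420, "name": "BRCA2"},
    {"start": 500, "end": 640, "name": "TP53"},
]


def delta(values):
    # Simple transform: store differences between consecutive values.
    prev, out = 0, []
    for v in values:
        out.append(v - prev)
        prev = v
    return out


def undelta(deltas):
    total, out = 0, []
    for d in deltas:
        total += d
        out.append(total)
    return out


def pack(values):
    return struct.pack(">%di" % len(values), *values)


def unpack(data):
    return list(struct.unpack(">%di" % (len(data) // 4), data))


# One stream per descriptor type, so each has homogeneous statistics.
streams = {
    "start": pack(delta([r["start"] for r in records])),
    "end": pack(delta([r["end"] for r in records])),
    "name": b"\n".join(r["name"].encode("ascii") for r in records),
}

# Stand-in coders (assumption): a different coder per descriptor type.
CODERS = {
    "start": (zlib.compress, zlib.decompress),
    "end": (zlib.compress, zlib.decompress),
    "name": (bz2.compress, bz2.decompress),
}

blocks = {name: CODERS[name][0](data) for name, data in streams.items()}

# Selective access: decode only the blocks that are needed.
names = CODERS["name"][1](blocks["name"]).decode("ascii").split("\n")
starts = undelta(unpack(CODERS["start"][1](blocks["start"])))
print(names, starts)
```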
The present disclosure specifies a genomic information representation format in which the relevant information is efficiently compressed so that it can be easily accessed, transmitted, stored, and browsed, and in which the weight of any redundant information is reduced.
The main innovative aspects of the disclosed invention are as follows.
1. Annotations, browser tracks, genomic variants, gene expression information, contact matrices, and other metadata associated with genomic sequencing data are compressed in a unified hierarchical data structure. The data structure enables rapid transmission, economical storage, and selective access to the encoded data according to criteria such as genomic interval or position, gene name, variant position and genotype, variant identifier, comments in the annotation, annotation type, or a pair of genomic intervals (in the case of matrix data connecting genomic positions to other positions).
2. Annotations, browser tracks, genome variants, gene expression information, contact matrices, and other annotation data associated with genome sequencing data are represented by genome annotation descriptors grouped into blocks with homogeneous statistical properties, enabling the identification of distinct information sources characterized by low information entropy.
3. The possibility to model each individual information source with a distinct source model matching the statistical properties of each annotation descriptor, and the possibility to change the source model within each descriptor block of each annotation descriptor, for each annotation data type and for each individually accessible data unit (access unit). Appropriate transforms, binarizations, context-adaptive probability models and associated entropy coders are employed according to the statistical properties of each source model of the annotation descriptors.
4. The definition of correspondences and dependencies between descriptor blocks, to enable selective access to sequencing data and associated metadata without requiring all descriptor blocks to be decoded when only part of the information is required.
5. The transmission of configuration parameters governing both the encoding and the decoding process by means of a data structure embedded in the compressed genomic data in the form of header information. Such configuration parameters may be updated during the encoding process in order to improve compression performance. Such updates are transmitted within the compressed content in the form of an updated configuration data structure.
Hereinafter, each of the above aspects will be described in further detail.
Genome annotation descriptors per specific annotation data type
Genomic variants
Data on genomic variants are encoded using the common descriptors introduced above and the specific descriptors listed below.
Figure BDA0003936497030000231
Functional annotations
Data on functional annotations describe genes and the spliced transcripts they contain (including the exons of which each transcript is composed), together with their biological functions, as well as information about each transcript, for example its breakdown into UTRs, start and stop codons, and coding sequences, as applicable. These data are encoded using the common descriptors introduced above and the specific descriptors listed below.
Figure BDA0003936497030000232
sizeof() is a function that returns the number of bits necessary to represent each attribute value according to the type_ID defined in the attribute type.
Tracks
Track data represent values associated with each position in the genome; a typical example is the coverage of sequencing reads at each position, as generated by RNA or ChIP sequencing experiments. The data may be provided at different pre-calculated zoom levels, which is desirable when the information is displayed in a genome browser. The data are encoded using the common descriptors introduced above and the specific descriptors listed below.
Figure BDA0003936497030000241
Genotype information
Genotype information data express the collection of genomic variants present at each position of the genome of an individual or of a population of individuals. They are encoded using the common descriptors introduced above and the specific descriptors listed below.
Figure BDA0003936497030000242
Sample information
The information about the sample describes meta-information about the particular biological sample on which the sequencing experiment has been performed, such as the date and location of collection, the date of sequencing, and the like. The sample information data is encoded using the specific descriptors listed below.
Descriptor | Type | Description
sample_name | st(v) |
UUID | uint | Unique identifier for linking with datasets in part 1
bitmask | b(n_meta) |
values[n_meta] | uint | n_meta is defined in the parameter set
desc_len | uint |
description | u(desc_len) |
n_attributes | |
attributes[n_attributes] | attribute | For example, the URL of the DOI to be disclosed
Expression information
Expression information associates a certain genomic range (usually corresponding to a gene, a transcript, or another feature in the genome) with one or more numerical values, each of which corresponds to a biological condition that has been tested in a separate experiment.
Expression data are encoded using the specific descriptors listed below.
Syntax | Type | Description
ID | uint | Range: AU
feature_position | uint | Position of the feature in the parameter set list
sample_id_start | uint |
sample_id_len | uint |
format_mask[n_format] | b(1) | n_format is defined in the parameter set
Contact matrix information
Contact matrix data are encoded using the specific descriptors listed below.
Figure BDA0003936497030000251
Figure BDA0003936497030000261
Bit stream structure
The present invention introduces a compressed representation of annotation data associated with genome sequencing data in the form of a bitstream syntax described below. The syntax is described in terms of a concatenation of data structures consisting of elements characterized as data types.
Syntax notation
In the following description, the following syntax notation is employed.
Figure BDA0003936497030000262
Extension of ISO/IEC 23092-1
The present disclosure extends the data structure specified in ISO/IEC 23092-1 to support the transmission of encoded genome annotations in the bitstream syntax specified in ISO/IEC 23092-1.
Dataset group
The dataset group syntax is the same as the syntax specified in ISO/IEC 23092-1.
Figure BDA0003936497030000263
Figure BDA0003936497030000271
Dataset
In ISO/IEC 23092-1, a dataset is a data structure that contains a header, the main configuration parameters in a parameter set, an index structure, and a set of access units that encode genomic data. The dataset type is extended to carry different types of genome annotation data, specified by different "dataset_type" values.
Figure BDA0003936497030000272
Figure BDA0003936497030000273
Figure BDA0003936497030000281
reference_type value | Value name | Semantics
0 | MPEGG_REF | Reference sequence
1 | MPEGG_ANNOTATION_REF | Reference data for annotation
The dataset header is a block that describes the contents of the dataset.
Figure BDA0003936497030000282
Figure BDA0003936497030000291
Reference
This data structure extends the reference data structure specified in ISO/IEC 23092 to support the bitstream syntax specified in this disclosure.
Figure BDA0003936497030000301
Annotation index
The present disclosure describes how to encode (i.e., compress) the portions of annotation data that consist of textual information elements associated with genome sequencing reads, with other non-textual genome annotations derived from the genome, and with sequences, so as to enable searching of textual elements in the compressed domain. Examples include:
information about functional genomic features (e.g., gene name, gene description, gene annotation, gene ontology, variant name, variant description, clinical significance of variant)
Nucleic acid sequences represented as sequences of symbols (usually one for each nucleotide) (e.g., subsequences of a reference genome, sequences of RNA molecules transcribed from a reference genome, or sequencing reads from a genome)
Protein sequences represented as sequences of symbols (usually one for each amino acid) (for example sequences corresponding to the translation of messenger RNA molecules)
Information about sample metadata and methods (name, date/time/location of collection, experimental techniques for performing sequencing, analytical techniques for performing functional annotation and variant detection, etc.).
The information is compressed using a suitable data structure such as, by way of example and not limitation, a compressed string pattern matching data structure. Representative compressed string pattern matching data structures are, for example (but not limited to), compressed suffix arrays, FM-indexes, and certain classes of hash tables. Such (compressed) data structures are used to perform string pattern matching and to carry, in compressed form, the text portion of the annotation data, which is added to the compressed bitstream either in a file header or as the payload of an access unit. For clarity, all algorithms belonging to one of these data structure classes will be referred to as "string index algorithms" in this disclosure.
As an example (but not by way of limitation), the present disclosure describes how to encode the text portions of different annotation data types and of genome reads by using a combination of compressed string index algorithms. There are several families of string index algorithms, and each family can be tuned through several parameters that set the balance between compression performance and query speed. A set of predetermined compressed string index algorithms is used for compression, each specified by selecting a family of compressed string index algorithms and the parameters of that family. The algorithms in the set are ranked by the level of compression achieved, and one particular algorithm may be selected at encoding time depending on the desired trade-off between compression ratio and query speed. This selection is specified in a parameter set of the compressed bitstream.
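The selection of one algorithm from a predetermined, ranked set might look like the sketch below. Everything named here is an assumption made for illustration: the family names, the `sample_rate` parameter and its values, and the idea of using the list position as the mode identifier written to the parameter set are hypothetical and are not taken from the disclosure.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class IndexAlgorithm:
    family: str       # hypothetical family name
    sample_rate: int  # hypothetical parameter: larger = smaller index, slower queries


# Predetermined set, ordered from fastest queries to highest compression.
PREDETERMINED_ALGORITHMS = [
    IndexAlgorithm("fm_index", 8),
    IndexAlgorithm("fm_index", 32),
    IndexAlgorithm("compressed_suffix_array", 64),
    IndexAlgorithm("compressed_suffix_array", 256),
]


def select_algorithm(compression_level: int) -> Tuple[int, IndexAlgorithm]:
    """Pick an algorithm by compression level.

    Returns the mode identifier to be written to the parameter set
    together with the selected algorithm description.
    """
    if not 0 <= compression_level < len(PREDETERMINED_ALGORITHMS):
        raise ValueError("unknown compression level")
    return compression_level, PREDETERMINED_ALGORITHMS[compression_level]


mode_id, algorithm = select_algorithm(1)
print(mode_id, algorithm)
```

Because the set is fixed in advance, the encoder only has to signal a small identifier, and the decoder can recover both the family and its parameters from it.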
By way of example (but not by way of limitation), the chosen compressed string index algorithm is applied to the following concatenation, either individually or jointly:
the name of the gene(s),
the description of the gene(s),
the sequence of the genomic transcript and its protein products (if present),
the name of the variant(s),
a description of the variants,
the name of the sample,
genome sequencing reads represented as sequences of symbols (one for each nucleotide), and any other textual information associated with the genome interval,
additional information encoding the relationship of the textual information to the genomic interval.
Applying a compressed string index algorithm to this information produces a compressed and indexed representation that can be queried for the presence of any substring. In particular, combinations of exact substring searches may be used to perform inexact substring searches, such as retrieving all occurrences of substrings that deviate from a specified pattern by at most a specified number of mismatches or errors. This process makes it possible to query the genome annotations, in a single query, for any pieces of textual information that are considered or generated during the analysis and re-analysis of sequencing data. This is possible under the following conditions:
1. the genomic information associated with a genomic interval is represented as a data structure called a genome annotation record, which contains information about the nucleotide sequence contained in the interval
2. Genome annotation records associated with genome intervals associated with contiguous locations on a reference genome are compressed in the same access unit
3. All text portions of the annotation information are compressed using a compressed string index algorithm chosen from the available set.
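One well-known way of building an inexact search from exact substring searches is the pigeonhole (seed-and-verify) strategy, sketched below. This is an illustration and not necessarily the procedure intended by the disclosure: it handles mismatches only (no insertions or deletions), the function names are hypothetical, and plain `bytes.find` on uncompressed text stands in for exact queries against a compressed string index.

```python
from typing import Iterator, List


def find_all(text: bytes, piece: bytes) -> Iterator[int]:
    # Stand-in for an exact substring query against the string index.
    pos = text.find(piece)
    while pos != -1:
        yield pos
        pos = text.find(piece, pos + 1)


def approx_search(text: bytes, pattern: bytes, k: int) -> List[int]:
    """Positions where `pattern` occurs with at most k mismatches.

    Pigeonhole principle: if the pattern is cut into k + 1 pieces, any
    occurrence with at most k mismatches contains one piece exactly.
    """
    m = len(pattern)
    if k < 0 or k >= m:
        raise ValueError("require 0 <= k < len(pattern)")
    n_pieces = k + 1
    bounds = [i * m // n_pieces for i in range(n_pieces + 1)]
    hits = set()
    for a, b in zip(bounds, bounds[1:]):
        for p in find_all(text, pattern[a:b]):
            start = p - a  # candidate start of the whole pattern
            if start < 0 or start + m > len(text):
                continue
            window = text[start:start + m]
            mismatches = sum(x != y for x, y in zip(window, pattern))
            if mismatches <= k:
                hits.add(start)
    return sorted(hits)


print(approx_search(b"ACGTACGAACGT", b"ACGT", 1))
```

The exact searches only propose candidate positions; each candidate is then verified against the full pattern, so no false positives are returned.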
The following text and data structures describe embodiments of this method for indexing and searching of compressed and embedded genome annotation data in access units of MPEG-G compatible (ISO/IEC 23092) bitstreams.
The following table shows the textual information indexed and compressed using the string indexing algorithm per genome annotation type according to the methods described herein. For each access unit, each type of text descriptor is concatenated using string delimiter and record index information as shown in FIG. 5 and compressed using a string index algorithm.
Figure BDA0003936497030000321
Indexing criteria for per-genome annotation access unit types
This table describes the indexing criteria and indexing tools applied to the access units for each genome annotation data type.
Figure BDA0003936497030000322
Figure BDA0003936497030000331
The Master Annotation Index (MAI) is an indexing tool that provides, for annotation data, the indexing capabilities that the Master Index Table (MIT) defined in ISO/IEC 23092-1 and in WO 2018/068827A1, WO/2018/068828A1, and WO/2018/068830A1 provides for sequence reads.
Figure BDA0003936497030000332
Figure BDA0003936497030000341
Table 2: Master annotation index
Master annotation index header
Figure BDA0003936497030000342
Table 3: Master annotation index header
Semantics
num_MAI_AU_types is the number of AU types indexed by the MAI. A value of 0 signals that no MAI index is provided.
MAI_AU_type[i] is the i-th AU type indexed by the MAI. The array MAI_AU_type[] shall contain unique values: each AU type value can appear only once in the array.
num_MAI_indexes[i] is the number of MAI indexes for the AU type MAI_AU_type[i].
Indexed strings
When encoding an access unit of each genome annotation data type, the textual descriptors belonging to the data encoded in the access unit are concatenated and compressed using a compressed string index algorithm as defined in the present disclosure.
The following table lists which strings are encoded in the MAI for each data type. The list specified in the table determines the value numStrings required in some of the following descriptions of the MAI. numStrings is the number of text fields per genome annotation record that are indexed using the method described in the present invention.
Figure BDA0003936497030000343
Figure BDA0003936497030000351
String index
The string index chunk is the portion of the master annotation index that encodes one or more strings of each record, for a variable number of access units that each contain a variable number of records.
The string index also allows string pattern matching queries to be performed on the original text and the matching text to be retrieved.
The list of strings encoded within the string index is referred to hereinafter as the "compressed index".
The list of strings obtained by decoding the compressed index from the string index is hereinafter referred to as "uncompressed index".
The string index provides the following functionality:
1. The occurrences of any substring within the list of encoded strings are counted, as specified in the description below.
2. For each occurrence of the substring found at point 1 above, the position of the substring within the uncompressed index is retrieved, as specified in the description below.
3. Given a start and an end position within the uncompressed index, the respective decoded payload is retrieved, as specified in the description below, where the payload may contain any number of strings, portions of strings, or metadata associated with strings.
4. For each occurrence of the substring found at point 1 above, the entire string containing that substring is retrieved, together with the position of the entire string within the uncompressed index, as specified in the description below.
5. For each occurrence of the substring found at point 1 above, the index of the access unit that contains the substring is retrieved, as specified in the description below.
6. For each occurrence of the substring found at point 1 above, the record index of the record that contains the substring is retrieved, where the record index is the 0-based index of the record within the access unit containing it, as specified in the description below.
7. Given an access unit index, the position within the uncompressed index of the first string of the access unit corresponding to that access unit index is retrieved, as specified in the description below.
8. Given an access unit index, a record index within the access unit corresponding to that access unit index, and a string index within the record corresponding to that record index, the position of the string contained in that record of that access unit at that string index is retrieved, as specified in the description below.
The inputs to this process are:
the variable numAUs, which specifies the number of access units whose strings are encoded within this string index
the variable codingMode, which specifies the algorithm that has been used to encode the string index.
The number of strings encoded for each record shall be the same for all records, and it shall correspond to the variable numStrings, as specified in the description below.
Figure BDA0003936497030000361
Table 4: String index chunk.
The uncompressed index encoded within compressed_index contains a list of strings and associated optional record indices, ordered per access unit (following the same order of access units as in Table 4) and, within each access unit, per record (following the same order of records as within the access unit). The total number of strings in the uncompressed index is totNumRecords × numStrings, where totNumRecords is the total number of records for all access units identified by au_id[], and numStrings is the number of strings per record compressed using the compressed string index algorithm.
The uncompressed index is specified as follows:
Figure BDA0003936497030000362
Figure BDA0003936497030000371
Table 5: Uncompressed index encoded in the compressed_index element of the string_index() element.
An example of an uncompressed index as specified in this disclosure (with numStrings equal to 3) is provided in FIG. 5.
Semantics
record_index[i] (recidx) is an optional field whose presence is signaled by setting the most significant bit of all bytes of record_index[i]. Setting the most significant bit also prevents false positive results when searching for a substring, because the most significant bit is never set in any byte of a string[i][j] field, as specified in this disclosure for the string[i][j] element.
When a record _ index [ i ] exists and is N bytes long, it represents a non-negative integer value, as specified in the following expression:
Figure BDA0003936497030000372
where recordIndexValue[i] is the 0-based index, within the respective access unit, of the record to which the string[i][] strings belong.
In the context of the present disclosure, record_index[i] is referred to as "genome annotation record index data".
string[i][j] is the j-th encoded string of the i-th record. The strings are ordered per access unit (following the same order of access units as in Table 4) and, within each access unit, per record (following the same order of records as within the access unit).
string_terminator is a single byte equal to 0x0A (i.e., '\n').
Searching substring positions using string indices
The string index is used to search for a location within the uncompressed index of a given substring, as specified in the following pseudo-code:
Figure BDA0003936497030000373
Figure BDA0003936497030000381
Table 6: Search for substring locations using the string index.
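The pseudo-code of Table 6 is not reproduced in this text. As a rough illustration of what the search returns, the sketch below runs a naive search over an already uncompressed index. It is an assumption-laden stand-in: the function name `si_search_substrings` mirrors SI_search_substrings() but its signature is hypothetical, and an actual implementation would query the compressed index (for example an FM-index) without decompressing it. The sample bytes assume two strings per record, each followed by the 0x0A terminator, with the record index bytes placed after the strings of the record.

```python
from typing import List


def si_search_substrings(uncompressed_index: bytes, substring: bytes) -> List[int]:
    """Return every position at which `substring` occurs (overlaps included)."""
    if not substring:
        raise ValueError("substring must not be empty")
    positions = []
    pos = uncompressed_index.find(substring)
    while pos != -1:
        positions.append(pos)
        pos = uncompressed_index.find(substring, pos + 1)
    return positions


# Two records with two strings each; \x80 and \x81 are record indices 0 and 1.
index = b"BRCA1\nDNA repair\n\x80BRCA2\nDNA repair\n\x81"
print(si_search_substrings(index, b"BRCA"))
print(len(si_search_substrings(index, b"repair")))  # occurrence count
```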
Decoding a subset of string indices
The string index is decoded between given start and end positions (including start and end positions), as specified in the following pseudo code:
Figure BDA0003936497030000382
Table 7: Decoding a substring at a given position using the string index.
Searching an entire string using a string index
Given a location within the uncompressed index, e.g., one from the list of locations returned by SI_search_substrings() as specified in this disclosure, the respective entire string and its starting location within the uncompressed index are decoded with the string index, as specified in the following pseudo-code:
Figure BDA0003936497030000383
Figure BDA0003936497030000391
Table 8: Search for the entire string using the string index.
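The two operations described above (decoding between a start and an end position, and recovering the entire string around a hit) can be sketched on an uncompressed byte buffer as follows. The function names and signatures are hypothetical, and the sketch assumes that strings are ASCII, end with the 0x0A terminator, and are delimited on the left by either a terminator or a record index byte (most significant bit set). A real decoder would extract the bytes from the compressed index instead.

```python
from typing import Tuple

TERMINATOR = 0x0A  # string_terminator


def _is_string_byte(b: int) -> bool:
    # String bytes are neither terminators nor record index bytes.
    return b != TERMINATOR and not (b & 0x80)


def si_decode(index: bytes, start: int, end: int) -> bytes:
    """Return the payload between start and end, both inclusive."""
    if not 0 <= start <= end < len(index):
        raise ValueError("invalid range")
    return index[start:end + 1]


def si_search_string(index: bytes, pos: int) -> Tuple[int, str]:
    """Return (start position, entire string) for the string containing pos."""
    if not 0 <= pos < len(index) or not _is_string_byte(index[pos]):
        raise ValueError("pos must point at a byte of an encoded string")
    start = pos
    while start > 0 and _is_string_byte(index[start - 1]):
        start -= 1
    end = index.index(bytes([TERMINATOR]), pos)  # terminator closing the string
    return start, index[start:end].decode("ascii")


index = b"BRCA1\nDNA repair\n\x80BRCA2\nDNA repair\n\x81"
print(si_decode(index, 0, 4))
print(si_search_string(index, 10))
```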
Searching for access unit ID and record index using string index
Given the position, within the uncompressed index, of a byte belonging to a string encoded in the compressed index, e.g., one position from the list of positions returned by SI_search_substrings() as specified in this disclosure, the access unit ID of the access unit containing the string, the index of the record containing the string, and the index of the string within the record are decoded using the string index, as specified in the following pseudo-code:
Figure BDA0003936497030000401
Figure BDA0003936497030000411
Table 9: Search for the access unit and record index using the string index.
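A possible shape of this lookup is sketched below on an uncompressed buffer. Several things are assumptions, since Tables 4, 5 and 9 are not reproduced in this text: the record index bytes are assumed to follow the strings of their record and to use 7 payload bits per byte; the list `au_offsets`, giving the position of the first string of each access unit in parallel with `au_ids`, is a hypothetical helper; and all function names are illustrative.

```python
from bisect import bisect_right
from typing import List, Tuple

TERMINATOR = b"\n"  # string_terminator (0x0A)


def encode_record_index(value: int) -> bytes:
    # Assumed layout: 7 payload bits per byte, MSB of every byte set.
    groups = [value & 0x7F]
    value >>= 7
    while value:
        groups.append(value & 0x7F)
        value >>= 7
    return bytes(0x80 | g for g in reversed(groups))


def decode_record_index(data: bytes) -> int:
    value = 0
    for b in data:
        value = (value << 7) | (b & 0x7F)
    return value


def build_index(access_units) -> Tuple[bytes, List[int], List[int]]:
    # access_units: list of (au_id, records); a record is a list of strings.
    buf, au_ids, au_offsets = bytearray(), [], []
    for au_id, records in access_units:
        au_ids.append(au_id)
        au_offsets.append(len(buf))
        for rec_idx, strings in enumerate(records):
            for s in strings:
                buf += s.encode("ascii") + TERMINATOR
            buf += encode_record_index(rec_idx)
    return bytes(buf), au_ids, au_offsets


def si_search_au_and_record(index, au_ids, au_offsets, num_strings, pos):
    """Return (access unit ID, record index, string index) for position pos."""
    if not 0 <= pos < len(index) or index[pos] & 0x80 or index[pos] == 0x0A:
        raise ValueError("pos must point at a byte of an encoded string")
    slot = bisect_right(au_offsets, pos) - 1  # access unit containing pos
    # Every record holds num_strings strings, so the number of terminators
    # seen since the start of the access unit gives the string index.
    terminators = index.count(TERMINATOR, au_offsets[slot], pos)
    string_idx = terminators % num_strings
    first = pos
    while not index[first] & 0x80:  # skip to the record index bytes
        first += 1
    last = first
    while last < len(index) and index[last] & 0x80:
        last += 1
    return au_ids[slot], decode_record_index(index[first:last]), string_idx


access_units = [
    (7, [["BRCA1", "DNA repair"], ["TP53", "tumor suppressor"]]),
    (9, [["MYC", "transcription factor"]]),
]
index, au_ids, au_offsets = build_index(access_units)
hit = index.find(b"suppressor")
print(si_search_au_and_record(index, au_ids, au_offsets, 2, hit))
```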
Searching for the location of a first string of access units using a string index
The location of the first string of the given access unit within the uncompressed index is retrieved using the string index, as specified in the following pseudo code:
Figure BDA0003936497030000412
Figure BDA0003936497030000421
Table 10: Search for the location of the first string of an access unit using the string index.
Searching for the location of a string of a record using a string index
The location, within the uncompressed index, of the string at a given index within a record, where the record is at a given index within a given access unit, is retrieved with the string index, as specified in the following pseudo code:
Figure BDA0003936497030000422
Figure BDA0003936497030000431
Figure BDA0003936497030000441
Table 11: Search for the location of a string of a record using the string index.
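The two position lookups described above (first string of an access unit, and a given string of a given record) can be sketched as follows. As before, this operates on an uncompressed buffer and rests on assumptions that are not confirmed by the text: record index bytes follow the strings of their record, records are numbered consecutively from 0 within each access unit, and `au_offsets` is a hypothetical list holding the position of the first string of each access unit.

```python
from typing import List, Tuple

TERMINATOR = b"\n"  # string_terminator (0x0A)


def build_index(access_units) -> Tuple[bytes, List[int], List[int]]:
    # Record indices below 128 fit in one byte with the MSB set (assumed layout).
    buf, au_ids, au_offsets = bytearray(), [], []
    for au_id, records in access_units:
        au_ids.append(au_id)
        au_offsets.append(len(buf))
        for rec_idx, strings in enumerate(records):
            for s in strings:
                buf += s.encode("ascii") + TERMINATOR
            buf.append(0x80 | rec_idx)
    return bytes(buf), au_ids, au_offsets


def si_first_string_of_au(au_ids, au_offsets, au_id) -> int:
    """Position of the first string of the given access unit."""
    return au_offsets[au_ids.index(au_id)]


def si_string_position(index, au_ids, au_offsets, num_strings,
                       au_id, record_idx, string_idx) -> int:
    """Position of string `string_idx` of record `record_idx` in an access unit."""
    if not 0 <= string_idx < num_strings or record_idx < 0:
        raise ValueError("invalid record or string index")
    slot = au_ids.index(au_id)
    au_end = au_offsets[slot + 1] if slot + 1 < len(au_offsets) else len(index)
    pos = au_offsets[slot]
    for _ in range(record_idx * num_strings + string_idx):
        pos = index.index(TERMINATOR, pos) + 1           # step over one string
        while pos < len(index) and index[pos] & 0x80:    # step over record index
            pos += 1
    if pos >= au_end:
        raise IndexError("record not present in this access unit")
    return pos


access_units = [
    (7, [["BRCA1", "DNA repair"], ["TP53", "tumor suppressor"]]),
    (9, [["MYC", "transcription factor"]]),
]
index, au_ids, au_offsets = build_index(access_units)
print(si_first_string_of_au(au_ids, au_offsets, 9))
print(si_string_position(index, au_ids, au_offsets, 2, 7, 1, 1))
```

The linear walk is only for illustration; a compressed index would typically answer the same question through sampled positions rather than by scanning.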
String index construction
In accordance with the principles of the present invention, a string index is constructed from text descriptors using a string transformation method as follows:
for each annotation, separate the non-indexed descriptor from the indexed text descriptor
Concatenating indexed text descriptors separated by terminators and interleaved with information about genome annotation record locations within an access unit
Numeric descriptors are represented as numeric values and text descriptors are represented as strings.
To compress the generated string index, the result of the transformation is then further transformed using compressed full text string index algorithms such as compressed suffix arrays, FM-indices, and some classes of hash tables.
Interleaving the indexed text with genome annotation record locations enables the compressed genome annotation data to be browsed according to criteria such as the presence of a string in a record or in the genome interval with which a record is associated. Browsing is performed by specifying a text string or substring and retrieving all genome annotation records that contain that text as part of the encoded annotation.
An example of an implementation of this construction method is provided in fig. 5, where each record contains 3 text descriptors.
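The construction steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the normative layout: the terminator byte, the two marker bytes that delimit the interleaved record index, the function name build_uncompressed_index, and the sample records are all hypothetical.

```python
TERMINATOR = b"\x00"    # assumed string terminator (a non-printable byte)
RECORD_START = b"\x01"  # hypothetical marker opening an interleaved record index
RECORD_END = b"\x02"    # hypothetical marker closing an interleaved record index


def build_uncompressed_index(records):
    """Concatenate the indexed text descriptors of every record.

    `records` is a list of lists of strings; each inner list holds the
    indexed text descriptors of one genome annotation record.
    """
    out = bytearray()
    for record_index, descriptors in enumerate(records):
        # Interleave the position of the record within the access unit.
        out += RECORD_START + str(record_index).encode("ascii") + RECORD_END
        # Append each indexed text descriptor followed by a terminator.
        for text in descriptors:
            out += text.encode("utf-8") + TERMINATOR
    return bytes(out)


# Two made-up records, each with 3 text descriptors as in the example of Fig. 5.
records = [
    ["rs0001", "GENE_A", "pathogenic"],
    ["rs0002", "GENE_B", "benign"],
]
index = build_uncompressed_index(records)
```

The resulting byte string is what a compressed full-text index (compressed suffix array, FM-index, and so on) would then be built over.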
Configuration encoding parameters, set according to input provided by the user to reflect the user's requirements and the methods described in this disclosure, select which text descriptors associated with each genome annotation type are used to construct the string index, as described above and in Fig. 5. These configuration parameters are encoded in the bitstream and/or transmitted from the encoder to the decoder.
Efficient decoding of genome annotations
By constructing a compressed string index as described above, it is possible to reconstruct genome annotations related to one string descriptor by following the process below.
The goal of this process is to decode all access units that contain annotation data related to a string identifier specified by the user that is searching for, for example, a variant name or description thereof, a genomic feature name or description thereof, or any other textual descriptor associated with the encoded genomic annotation.
The desired name or description is searched for by calling the function SI_search_substrings() specified above. If the specified string "str" is present in the compressed index, this call returns one or more locations (referred to as "pos" in this example), as specified in the section "search substring locations with string index". The string index described above in this disclosure is then used to decode the access unit ID of the access unit containing the string "str", the index of the record containing "str", and the index of the string within the record, as described in the following points:
1. The input byte location pos identifies the string str that contains the byte at location pos within the uncompressed index.
2. The ID of the access unit containing str is determined by comparing pos with the values of au_offset[] as specified in Table 4, and retrieving the corresponding value of au_ID[] as specified in Table 4:
- If pos < au_offset[1], the resulting access unit ID is au_ID[0].
- If pos >= au_offset[num_AUs - 1], the resulting access unit ID is au_ID[num_AUs - 1], where num_AUs is as specified in Table 4.
- Otherwise, the resulting access unit ID is au_ID[i], for the value of i such that au_offset[i] <= pos < au_offset[i + 1].
3. By repeatedly calling the function SI_decode() described in this disclosure, the compressed index is decoded backwards from the position pos - 1 until an entire record index recordIndex (where the record index is as specified in Table 5) is decoded or until the beginning of the compressed index is reached. If the beginning of the compressed index is reached, recordIndex is set to 0. While decoding backwards, the number of string terminators, stringIndex, is counted (where the string terminators are as specified in Table 5). Any non-printable character may be used as a string terminator.
4. Given the number of indexed strings per record, numStrings, as specified in this disclosure, and the access unit determined at point 2, the index within the access unit of the record that contains str is equal to recordIndex + stringIndex / numStrings (integer division).
5. Given the number of indexed strings per record, numStrings, as specified in this disclosure, and the record determined at point 4, the index of the string str within the record is equal to stringIndex % numStrings.
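Points 2, 4 and 5 above reduce to a sorted-array lookup and an integer division. The following Python sketch illustrates them; it assumes that the access unit offsets are sorted in increasing order and that the values obtained by the backward decoding of point 3 are already available. The function names are hypothetical, and num_strings stands for the number of indexed strings per record.

```python
from bisect import bisect_right


def access_unit_id(pos, au_offset, au_id):
    """Point 2: return the ID of the access unit whose byte range contains pos."""
    # bisect_right returns the position of the first offset strictly greater
    # than pos; the entry just before it is the access unit containing pos.
    i = bisect_right(au_offset, pos) - 1
    # Clamp to the first access unit, matching the rule "pos < au_offset[1]".
    return au_id[max(i, 0)]


def record_and_string_index(record_index, string_index, num_strings):
    """Points 4 and 5: locate the record, then the string inside the record."""
    record = record_index + string_index // num_strings
    string_in_record = string_index % num_strings
    return record, string_in_record
```

For example, with offsets [0, 100, 250] and IDs [7, 9, 12], position 100 falls in the access unit with ID 9, because the lower bound of each range is inclusive.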
Access unit
This clause extends the access unit syntax specified in ISO/IEC 23092-1 to support the encoding of genome annotation data types.
Figure BDA0003936497030000461
AU header
Figure BDA0003936497030000462
Figure BDA0003936497030000471
Dynamic attributes
1. Most genome annotation formats contain loosely specified fields that complement a minimal set of information defined as mandatory. In some formats, such as VCF, GFF and GTF, those fields carry valuable information, such as the pathogenicity of a given variant or the taxonomy information needed for the elements of a functional annotation. Therefore, they cannot simply be discarded or treated as secondary information. Indeed, some of those fields may represent the most valuable filtering criteria for clinical use.
2. For this reason, all those fields, across the several access unit and dataset types described later, are grouped into a set of dynamic attributes. The presence of a given attribute is signaled in a dedicated section of the parameter set by an object of type "attribute" as specified in this disclosure.
3. Each attribute corresponds to a new descriptor.
4. The presence of a value for a given record is signaled via a record-level bit mask, using the position of the given attribute in the parameter set.
5. Attributes are specified in terms of:
- value type;
- array type: a single scalar value, a fixed-size array, or an array whose size depends on the number of alleles, the ploidy, or a combination thereof (e.g., as for the GL field in the genotype column of a VCF file);
- array size, required for fixed-size arrays.
This approach provides a unified approach across all different annotation data types, regardless of their nature, and provides space for future indexing/filtering tools based on the presence of specific attributes.
Figure BDA0003936497030000481
Variants
The data structures described in this section encode information about variants, while information about samples (e.g., genotyping) is encoded in separate datasets.
Parameters of variants
This structure in the parameter set contains the main parameters associated with variant coding.
Figure BDA0003936497030000482
Figure BDA0003936497030000491
Genome annotation records of variants
Records are sorted by increasing value of pos. Positions are then differentially encoded.
NB: ref_len, ref, alt_len, alt and q_int may be encoded as "payload" in the unified record structure; info is encoded as an attribute.
Genome annotation records for variants are encoded using a common genome annotation descriptor and a variant-specific genome annotation descriptor (as described in this disclosure).
Compression of descriptors of variants
Info values are compressed as attributes, as described in this disclosure.
ref and alt information
Figure BDA0003936497030000501
Figure BDA0003936497030000511
Functional annotations (GTF, GFF)
Parameters for functional annotations
This structure in the parameter set contains global configuration parameters related to the encoding of the functional annotation data type.
Figure BDA0003936497030000512
Genome annotation records for functional annotations
Genome annotation records for functional annotations are encoded using common genome annotation descriptors and genome annotation descriptors specific to functional annotations (as described in this disclosure).
Compression of descriptors of functional annotations
Figure BDA0003936497030000521
Tracks
Parameters of tracks
This structure in the parameter set contains global parameters related to the encoding of browser tracks.
Figure BDA0003936497030000522
Figure BDA0003936497030000531
Genome annotation records for tracks
Genome annotation records for tracks are encoded using common genome annotation descriptors and track-specific genome annotation descriptors (as described in the present disclosure).
Compression of descriptors of tracks
Figure BDA0003936497030000532
Genotype information
A dataset of type genotype contains encoded genotyping information for an individual or a population.
Parameters for genotype information
This structure in the parameter set contains global configuration parameters that are relevant to the coding of genotype information.
Figure BDA0003936497030000533
Figure BDA0003936497030000541
The format_ID identifies the format field present in the encoded record. The semantics of each identifier are provided in Table 12. If the value 0x00 (GT) is present, it is always the first in the list.
Genotype format field
Figure BDA0003936497030000542
Figure BDA0003936497030000551
Table 12: format_ID values used in generic_parameters()
A = one value per alternative allele
R = one value for each possible allele, including the reference
G = one value per genotype
Genome annotation records for genotype information
Genome annotation records of genotype information are encoded using a common genome annotation descriptor and a genome annotation descriptor specific to the genotype information (as described in this disclosure).
Compression of genotype information
All information is compressed into attributes, as described in this disclosure. Special cases such as the GT and LD fields are first split into subsequences identified by subsequenceIDs, as described below.
Figure BDA0003936497030000552
Figure BDA0003936497030000561
Sample information
Parameters for sample information
This structure in the parameter set contains global configuration parameters related to the encoding of information about the sample.
Figure BDA0003936497030000562
Figure BDA0003936497030000571
Genome annotation records for sample information
Genome annotation records for sample information are encoded using genome annotation descriptors specific to the sample information (as described in this disclosure).
Expression information
This dataset only encodes the actual expression matrix. Features and samples are stored in access units of type AU_announce and access units of type AU_SAMPLE.
Parameters for expression information
This structure in the parameter set contains global configuration parameters related to the encoding of the expression information.
Figure BDA0003936497030000572
The format_ID identifies the format field present in the encoded record. The semantics of each identifier are provided in Table 12.
Genome annotation records for expression information
A genome annotation record for expression information is encoded using a genome annotation descriptor (as described in this disclosure) specific to the expression information.
Compression
The compression strategy is the same as for the genotype dataset: all information is mapped into attributes and compressed as described in the section entitled "Compression of attributes". This allows each element of the matrix to have more than one value, thus combining information of different types and semantics, such as counts, TPM and probabilities, in a single record.
A special method is used for sparse matrices, where for each record only non-zero values are recorded, along with the array of corresponding locations and the total number of entries.
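The sparse representation described above can be illustrated with a short Python sketch. It assumes that "total number of entries" means the full length of the record (row) before zeros are dropped; the function names are hypothetical.

```python
def encode_sparse(values):
    """Keep only the non-zero values of one record, with their positions."""
    positions = [i for i, v in enumerate(values) if v != 0]
    nonzero = [values[i] for i in positions]
    # The total number of entries is needed to restore trailing zeros.
    return len(values), positions, nonzero


def decode_sparse(n_entries, positions, nonzero):
    """Rebuild the full record from its sparse representation."""
    values = [0] * n_entries
    for position, value in zip(positions, nonzero):
        values[position] = value
    return values
```

A record such as [0, 0, 5, 0, 3, 0] is thus stored as the triple (6, [2, 4], [5, 3]), and decoding that triple restores the original record.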
Contact matrix information
The contact matrix (i.e., contact map) is generated by a Hi-C experiment and represents the spatial organization of DNA molecules in the nucleus. Both dimensions are genomic positions. The contact matrix value at each coordinate is a count of how many times the two corresponding locations in the nucleotide sequence have been measured as interacting.
Parameters for contact matrix information
This structure in the parameter set contains global configuration parameters related to the encoding of information about the contact matrix.
Figure BDA0003936497030000581
The format_ID identifies the format field present in the encoded record. The semantics of each identifier are provided in Table 12.
Genome annotation records for contact matrix information
Genome annotation records for contact matrix information are encoded using genome annotation descriptors (as described in this disclosure) specific to the contact matrix information.
Compression
The compression strategy is the same as for the expression information dataset.
Attributes
Figure BDA0003936497030000591
Compression of attributes
Attributes are compressed using n_attributes + 1 subsequences, where n_attributes is specified in the parameter set:
SubsequenceID | Name | Description
0 | attr_mask | Bit mask signaling the presence of each attribute
1 | attr1 | First attribute value
2 | attr2 | Second attribute value
...
n | attrn | n-th attribute value
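The split into n_attributes + 1 subsequences can be sketched in Python as follows. This is only an illustration: records are modeled as dictionaries, the bit assigned to an attribute is taken to be its position in the parameter set (here the list attribute_names), and the attribute names AF and DP in the usage note are made-up examples.

```python
def split_attributes(records, attribute_names):
    """Split records into one mask subsequence plus one subsequence per attribute.

    Subsequence 0 holds the per-record bit mask (attr_mask); subsequence k
    (k >= 1) holds the values of the k-th attribute for the records that
    actually carry it.
    """
    subsequences = {k: [] for k in range(len(attribute_names) + 1)}
    for record in records:
        mask = 0
        for bit, name in enumerate(attribute_names):
            if name in record:
                mask |= 1 << bit                      # signal presence
                subsequences[bit + 1].append(record[name])
        subsequences[0].append(mask)
    return subsequences
```

With attribute_names = ["AF", "DP"] and the records {"AF": 0.1, "DP": 30}, {"DP": 12} and {}, the mask subsequence is [3, 2, 0], the first attribute subsequence is [0.1], and the second is [30, 12]. Each subsequence is then a homogeneous stream that can be compressed on its own.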
Data type
This section describes how structured values are represented in this disclosure.
Type of value
This is a structure for representing numerical values whose size is expressed in bits.
Figure BDA0003936497030000592
Figure BDA0003936497030000601
Type identifier
Figure BDA0003936497030000602
Table 13: Data types with their identifiers and parameters
array_type_ID | Corresponding array size
0 | Scalar, i.e., only one value
1 | Fixed array size
2 | Array of length equal to the number of alternative alleles
3 | Array of length equal to the total number of alleles (alternative alleles plus the reference)
4 | Genotype probability field: its size depends on the combination of the total number of alleles and the ploidy
Table 14: Array types with their identifiers
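As an illustration of Table 14, the following Python sketch computes the array length for each array_type_ID. The function name and its parameters are hypothetical. For type 4, the sketch assumes the convention used by the VCF format for per-genotype fields, where the number of unordered genotypes is the number of multisets of size ploidy drawn from all alleles; the disclosure itself only states that the size depends on the number of alleles and the ploidy.

```python
from math import comb


def array_size(array_type_id, n_alt_alleles=0, ploidy=2, fixed_size=None):
    """Return the number of values stored for one attribute of one record."""
    if array_type_id == 0:          # scalar
        return 1
    if array_type_id == 1:          # fixed-size array
        if fixed_size is None:
            raise ValueError("fixed_size is required for array_type_id 1")
        return fixed_size
    if array_type_id == 2:          # one value per alternative allele
        return n_alt_alleles
    if array_type_id == 3:          # one value per allele, reference included
        return n_alt_alleles + 1
    if array_type_id == 4:          # one value per genotype (assumed VCF rule)
        n_alleles = n_alt_alleles + 1
        return comb(n_alleles + ploidy - 1, ploidy)
    raise ValueError(f"unknown array_type_id {array_type_id}")
```

Under this assumption, a diploid sample at a site with one alternative allele has 3 possible genotypes, and a site with two alternative alleles has 6.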
Data block
A data block is a structure that contains compressed descriptors and is encapsulated in an access unit. Each block contains a single type of descriptor, identified by an identifier contained in the block header.
Block syntax
Figure BDA0003936497030000611
Block header
Figure BDA0003936497030000612
Block payload
Figure BDA0003936497030000613
Examples of supported queries
Figure BDA0003936497030000614
Figure BDA0003936497030000621
Evidence of the technical advantages of the present invention
The present invention eliminates several problems that exist when using state-of-the-art technology. In particular:
1. Currently, there is no uniform representation of genome annotations; in practice, several diverse formats are used. In general, it is implicitly assumed that features are linked according to their physical proximity on the genome; for example, variants or isoforms are associated with the genes that contain them. The unified representation of the data described in the present invention makes it possible to express complex relationships between different concepts that go beyond simple physical inclusion, such as "a promoter is located in this interval and its methylation state (usually outside the gene) is associated with gene A, gene B and gene C, which form an operon (i.e., a collection of genes, each having a different position in the genome)".
2. The invention makes it possible to link genome annotations unambiguously to the data of the existing Parts 1-5 of the MPEG-G standard, which represent sequencing reads aligned to the genome. Many annotation features (e.g., functional gene models, variants, or tracks expressing, for example, methylation status or binding to a protein) are supported by, and sometimes derived from, the presence of sequencing reads at the relevant location. Currently, it is not possible to express concepts such as "this new transcript, consisting of this list of exons, is supported by this collection of RNA-sequencing reads" or "this new variant at this position is supported by this collection of DNA-sequencing reads". The present invention makes it possible to express these concepts, which are very important in clinical practice, without difficulty.
3. Currently, there is no single format that can represent all the different existing sources of genome annotations. Thus, pipelines and genome browsers need to use several different formats to load all the required information. The present technology eliminates the need to implement complex parsers for such domain-specific bioinformatics formats, which are often poorly defined and lack explicit standards.
4. Since the information is divided into different types of access units, the present invention provides a mechanism for efficient compression: each information stream can be modeled as a homogeneous source with lower entropy, thus making compression more efficient. At the same time, the proposed method still allows different information to be integrated into a single hierarchical architecture, and makes it possible to express the relationships between different genome annotation concepts, genome sequences and sequencing reads. Furthermore, compressing different genomic features separately allows selective decompression of the desired features if the user is interested in only a subset of the data.
5. Employing a set of compressed string index algorithms, from which one may be picked at encoding time, to compress the textual information allows the user to select the desired balance between the compression of the string index and the speed at which it is queried. Notably, the use of more than one family of compressed string index algorithms is necessary to achieve the desired optimization and is a necessary feature of the present invention, as employing a single family would not be sufficient for that purpose.
As an example, but not by way of limitation, we illustrate the concept by combining two different families of compressed suffix arrays. Family [1] uses, e.g., Raman, Rajeev, Venkatesh Raman, and S. Srinivasa Rao. 2002. "Succinct indexable dictionaries with applications to encoding k-ary trees and multisets." Proceedings of the 13th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2002), 233-242. Family [2] uses, e.g., Juha Kärkkäinen, Dominik Kempa, Simon J. Puglisi. "Hybrid Compression of Bitvectors for the FM-Index." Proceedings of the 2014 Data Compression Conference (DCC 2014), IEEE Computer Society, 2014, 302-311. As shown in Fig. 6, by varying other parameters of the compressed suffix arrays it is possible to obtain different implementations belonging to family [1] (pink points) and family [2] (blue-green points), each offering different values of compression ratio and query speed. However, family [1] is inherently better at providing higher compression rates (and slower query speeds), and family [2] is inherently better at providing faster query speeds (and lower compression rates). By combining the two families and selecting the implementations identified by the black rectangle as the set of possible compressed suffix arrays, we can provide both an option with a better compression rate and an option with a better query speed, which is not possible using only one family of compressed suffix arrays.
Indexing capability
Figure BDA0003936497030000631
Figure BDA0003936497030000641
Genome annotation encoding device
Fig. 2 shows an encoding apparatus according to the principles of the present invention. The encoding device receives as input genome annotations 20 such as variants, browser tracks, functional annotations, methylation patterns and levels, sequencing coverage and statistics, feature expression matrices, contact matrices, affinity of proteins for nucleic acids, etc. The annotation data is parsed by the descriptor encoder unit 22, and the non-indexed descriptors are separated from the indexed text descriptors 212. The non-indexed descriptors common to all annotations are fed to the transformation unit 21. The non-indexed descriptors specific to each annotation type are fed to the transformation unit 27. The indexed text descriptors are fed to the descriptor string transformation unit 26. The outputs of the transformation units 21 and 27 are fed to different entropy encoders 24 according to the particular statistical properties of each transformed descriptor. At least one first entropy encoder (24) is employed for numeric descriptors and at least one second entropy encoder (214) is employed for text descriptors not included in the subset (29) of text descriptors.
The output of each entropy encoder is fed to an annotation data access unit encoder 23 to produce annotation data access units 25. The uncompressed master annotation index 210, which is the output of the descriptor string index transformation unit 26, is fed to the annotation data index encoder 28 to generate the master annotation index data 29. An annotation data index is associated with one or more annotation data access units. Fig. 1 shows the annotation data access units (122) being jointly encoded (118) with the master annotation index data (123) and the access units (119) of a first classification containing compressed genome sequencing data.
The transformations applied by the descriptor transformation units 21 and 27 used in the encoding apparatus include:
- run-length encoding: a sequence of numbers is represented by each value together with a counter of its consecutive occurrences;
- differential encoding: each number is represented as a difference relative to the previously encoded value;
- byte separation: for numbers represented by a large number of bytes, each byte is processed and compressed separately, together with the bytes of other numbers that occupy the same position and therefore have similar properties.
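The three numeric transformations listed above can be sketched in Python as follows. These are generic textbook versions given for illustration only; the function names, the big-endian byte order and the default width of 4 bytes are assumptions, not details taken from this disclosure.

```python
def run_length_encode(values):
    """Represent a sequence as (value, run length) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([v, 1])       # start a new run
    return [tuple(run) for run in runs]


def delta_encode(values):
    """Represent each number as its difference from the previous one."""
    previous = 0
    deltas = []
    for v in values:
        deltas.append(v - previous)
        previous = v
    return deltas


def delta_decode(deltas):
    """Invert delta_encode by accumulating the differences."""
    total = 0
    values = []
    for d in deltas:
        total += d
        values.append(total)
    return values


def byte_separate(values, width=4):
    """Split fixed-width unsigned integers into one stream per byte position."""
    encoded = [v.to_bytes(width, "big") for v in values]
    # Stream k collects the k-th byte (most significant first) of every value.
    return [bytes(e[k] for e in encoded) for k in range(width)]
```

Sorted genomic positions, for instance, turn into small differences under delta_encode, and the high-order byte streams produced by byte_separate are mostly constant, which is what makes the subsequent entropy coding more effective.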
The transformations applied by annotation data index encoder 28 include:
- Burrows-Wheeler transform
- compressed string pattern matching
- compressed suffix array
- FM-index
- hash algorithm
An advantage of applying the transformation to the numerical descriptors is to improve compression efficiency without information loss, as is well known to those skilled in the art.
The transformation makes the encoding of string descriptors more efficient, since the transformed representation allows substrings to be browsed and searched more efficiently. Once the original text is transformed, the existence of a substring can be verified without having to decompress the entire text.
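To illustrate how a substring can be verified without reconstructing the text, the following Python sketch builds a Burrows-Wheeler transform and counts pattern occurrences with the backward search used by FM-indices. It is a deliberately naive, uncompressed version (the suffix sort and the rank queries are linear scans) and is not the compressed index of this disclosure. The use of "\x00" as a unique end marker and "\x01" as a string terminator, and the sample names, are assumptions of the sketch.

```python
from collections import Counter

END = "\x00"  # assumed unique end-of-text marker, smaller than any other symbol


def bwt(text):
    """Burrows-Wheeler transform of text (which must not contain END)."""
    s = text + END
    suffix_array = sorted(range(len(s)), key=lambda i: s[i:])
    # The BWT symbol is the one preceding each sorted suffix (wrapping around).
    return "".join(s[i - 1] for i in suffix_array)


def count_occurrences(bwt_text, pattern):
    """Count occurrences of pattern using backward search on the BWT only."""
    counts = Counter(bwt_text)
    first = {}                      # first[c]: number of symbols smaller than c
    total = 0
    for c in sorted(counts):
        first[c] = total
        total += counts[c]
    lo, hi = 0, len(bwt_text)       # current range of matching sorted suffixes
    for c in reversed(pattern):
        if c not in first:
            return 0
        lo = first[c] + bwt_text[:lo].count(c)
        hi = first[c] + bwt_text[:hi].count(c)
        if lo >= hi:
            return 0
    return hi - lo


# Three made-up feature names, each followed by the terminator "\x01".
transformed = bwt("APOBEC1\x01APOBEC3A\x01TP53\x01")
```

Here count_occurrences(transformed, "APOBEC") reports two matches and count_occurrences(transformed, "BRCA") reports none, in both cases by consulting only the transformed string. A practical FM-index replaces the linear rank scans with compressed rank structures, which is where the trade-off between size and query speed arises.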
Genome annotation decoding device
A decoding device implemented according to the principles of the present disclosure extends the functionality of an ISO/IEC 23092-compliant decoding device as depicted in fig. 3.
Fig. 3 shows a decoding device according to the principles of the present disclosure. The genome annotation access unit decoder 31 receives the access unit 30 from the stream demultiplexer 70 and extracts the entropy-encoded payload of the access unit. The entropy decoders 32, 33, 34 receive the extracted entropy-encoded payload and decode the different types of genome annotation descriptors into their binary representations 35. The binary representation of the descriptors common to all genome annotations is then fed to the inverse transform unit 36. The binary representation of the descriptors specific to each annotation data type is fed to the inverse transform unit 314. The master annotation index 38 is fed to an indexed access unit information retrieval unit 37, which locates the text fields belonging to each AU in the string index. This position information 313 is then fed to the indexed information decoding unit 39, which decodes the text fields from the string index. The decoded text fields are then fed to the descriptor decoder unit 310 to reconstruct the decoded genome annotation 311.
Genome annotation text search device
A text search device implemented according to the principles of the present disclosure extends the functionality of an ISO/IEC 23092-compliant decoding device as depicted in fig. 4.
Fig. 4 shows a text search device according to the principles of the present disclosure. The genome annotation access unit decoder 41 receives the access unit 40 from the stream demultiplexer 70 and extracts the entropy-encoded payload of the access unit. The entropy decoders 42, 43, 44 receive the extracted entropy-encoded payload and decode the different types of genome annotation descriptors into their binary representations 45. Depending on the configuration of the decoding apparatus, different types or different classifications of access units can be selectively extracted. The binary representation of the descriptors common to all genome annotations is then fed to the inverse transform unit 46. The binary representation of the descriptors specific to the annotation data type is fed to the inverse transform unit 414. The master annotation index 48 is fed to an indexed access unit information retrieval unit 47, which locates the text fields in the string index that match the text query 413. This position information 415 is then fed to an indexed information decoding unit 49, which decodes the text fields from the string index. The decoded text fields are then fed to a descriptor decoder unit 410 to reconstruct the decoded genome annotation 411.
FIG. 8 illustrates how the conceptual organization of data described in the present invention enables a text query to be performed.
The master index table associates:
- genome segments (sequence ID + start position + end position + data class)
with
- access units containing compressed genome sequencing reads and associated alignment information and metadata.
The annotation index associates:
- a string index containing text information about features in compressed and searchable form
with
- access units containing compressed genome annotations and information about the genomic interval to which they belong.
A single query on the text string "APOBEC" can retrieve all annotations containing the text "APOBEC" and the associated encoded sequencing reads.
FIG. 9 illustrates how the conceptual organization of data described in the present invention enables a search over a genomic interval to be performed.
The master index table associates:
- genome segments (sequence ID + start position + end position + data class)
with
- access units containing compressed genome sequencing reads and associated alignment information and metadata.
The annotation index associates:
- genome segments (sequence ID + start position + end position + data class)
with
- access units containing compressed genome annotations
and with
- a string index containing textual information about features in compressed and searchable form.
A single query over the genomic interval N can retrieve the encoded sequencing reads and all associated annotations.
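The interval query can be pictured with the following Python sketch, in which both the master index table and the annotation index are reduced to hypothetical lists of (sequence ID, start, end, access unit ID) entries. The data class field, the actual index structures and the decoding of the returned access units are left out, and all names are illustrative.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if the closed intervals [a_start, a_end] and [b_start, b_end] intersect."""
    return a_start <= b_end and b_start <= a_end


def query_interval(master_index, annotation_index, seq_id, start, end):
    """Return (read AU IDs, annotation AU IDs) overlapping the queried interval."""
    def hits(index):
        return [au for (s, a, b, au) in index
                if s == seq_id and overlaps(a, b, start, end)]
    return hits(master_index), hits(annotation_index)
```

Because both indexes are keyed by the same genome segments, one interval yields the access units holding the sequencing reads and the access units holding the annotations in a single pass, and only those access units need to be decoded.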
The inventive techniques disclosed herein may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, they may be stored on a computer-readable medium and executed by a hardware processing unit. The hardware processing unit may include one or more processors, digital signal processors, general-purpose microprocessors, application-specific integrated circuits, or other discrete logic circuitry.
The techniques of this disclosure may be implemented in a variety of devices or apparatuses, including mobile phones, desktop computers, servers, tablets, and similar devices.

Claims (16)

1. A computer-implemented method for storing or transmitting a representation of genome sequencing data in a genome file format comprising annotation data associated with the genome sequencing data, the genome sequencing data comprising reads of a nucleotide sequence, the method comprising the steps of:
aligning (10) the reads to one or more reference sequences, thereby creating aligned reads,
classifying (14) the aligned reads according to a classification rule based on a mapping of the aligned reads on the one or more reference sequences, thereby creating classes of aligned reads (18),
entropy encoding the classified aligned reads into a number of descriptor blocks,
structuring the descriptor blocks with header information, thereby creating access units of a first classification containing genome sequencing data (119),
the method further comprises encoding annotation data (12) into different access units (122) of a second classification and encoding index data into a master annotation index (MAI, 123, 211), wherein the index data represents an encoded form of annotation string data obtained by employing at least one compressed string indexing algorithm (28) on annotation string data (212), and wherein the MAI associates an encoded annotation string with the access units of the second classification,
the method further comprises jointly encoding the access units of the first classification, the access units of the second classification, and the MAI.
2. The method of claim 1, wherein the access units of the second classification containing genome annotation data further comprise information data identifying a genome interval (80), wherein the genome interval identifies a nucleotide sequence in the one or more reference sequences, such that the annotation data contained in the access units of the second classification are associated with the relevant encoded reads of the genome sequence contained in access units of the first classification containing genome sequencing data.
3. The method according to claim 2, characterized in that said coding of said annotation data and index data comprises the steps of:
encoding (22) genome annotation data (20) into a genome annotation descriptor (29, 212), wherein the genome annotation descriptor comprises a numerical descriptor and a textual descriptor, the encoding comprising the steps of:
-selecting a subset of text descriptors (212) from the text descriptors according to configuration parameters (213) particularly provided by a user;
-transforming (26) the subset of text descriptors (212) by employing a first string transformation method to produce a string index (210);
-transforming and encoding (28) said string index (210) by employing a string index transformation method, thereby generating master annotation index data (211);
-transforming (21, 27) said numeric descriptors and text descriptors not contained in said subset (29) of text descriptors by employing at least one second transformation method (21, 27) different from said first transformation method;
-encoding (24, 23) the numeric descriptors and the text descriptors not included in the subset (29) of text descriptors into separate access units (25) of the second classification by employing at least one first entropy encoder (24) for the numeric descriptors and at least one second entropy encoder (214) for the text descriptors not included in the subset (29) of text descriptors.
4. A method according to claim 3, characterized in that said first string transformation method (26) comprises the steps of:
-inserting a string terminator (55) character after each text descriptor (51, 52, 53) for signaling the termination of each text descriptor (51, 52, 53);
-concatenating the text descriptors (51, 52, 53);
-interleaving genome annotation record index data (54) to associate the text descriptor (51, 52, 53) with a location of a genome annotation record within the access unit of the second classification.
5. The method of claim 4, wherein the string index transformation method (28) is one of string pattern matching, suffix array, FM-index, hash table.
6. A method according to claim 3, characterized in that said at least one second transformation method (21, 27) is one of the following: differential encoding, run-length encoding, byte separation, and entropy encoders such as CABAC, Huffman encoding, arithmetic encoding, range encoding, and the like.
7. The method of any one of the preceding claims, wherein the Master Annotation Index (MAI) contains in its header the number of AU types and the number of indices per AU type.
8. The method of any one of the preceding claims, further comprising encoding of the classified unaligned reads.
9. A method for decoding and extracting nucleotide sequences and genome annotation data encoded according to the method of claim 1, comprising the steps of:
parsing (70) the genomic data multiplex (710) into a genomic grammar element layer (71);
parsing the compressed annotation data (712);
parsing the Master Annotation Index (MAI) (713);
expanding the genomic layer into sorted reads of nucleotide sequences;
selectively decoding the sorted reads of nucleotide sequences on one or more reference sequences to generate uncompressed reads of nucleotide sequences;
selectively decoding the annotation data associated with the classified reads.
10. The method of claim 9, further comprising decoding informational data related to a genomic interval (80), wherein the genomic interval identifies a nucleotide sequence in the one or more reference sequences such that the annotation data is associated with an associated encoded read of the genomic sequence.
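Selective decoding by genomic interval can be pictured as choosing only the access units whose covered region overlaps the queried interval. The sketch below assumes that each access unit is described by a reference name and a half-open coordinate range; the AccessUnitEntry type, its fields and the function name are hypothetical and serve only to illustrate the selection step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessUnitEntry:
    au_id: int
    reference: str
    start: int  # inclusive start coordinate on the reference
    end: int    # exclusive end coordinate on the reference


def select_access_units(entries, reference, start, end):
    """Return the ids of access units overlapping [start, end) on reference."""
    return [
        entry.au_id
        for entry in entries
        if entry.reference == reference
        and entry.start < end
        and start < entry.end
    ]
```

Only the selected access units then need to be entropy-decoded, which is what makes access to a single genomic interval cheaper than decoding the whole file.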
11. The method of claim 10, further comprising decoding data encoded according to the method of any one of claims 3-8.
12. A genome encoder (110) for compressing genome sequence data in a genome file format, the genome file format comprising annotation data associated with the genome sequencing data, the genome sequence data comprising reads of a nucleotide sequence, the encoder comprising:
-an alignment unit (10) for aligning the reads to one or more reference sequences, thereby creating aligned reads;
-a data classification unit for classifying (14) the aligned reads according to a classification rule based on a mapping of the aligned reads on the one or more reference sequences, thereby creating classes of aligned reads,
-an entropy encoding unit (112) for entropy encoding the classified aligned reads into a number of descriptor blocks (115),
-an access unit encoding unit (116) for structuring the descriptor blocks with header information, thereby creating access units (119, 219) of a first classification containing genome sequencing data,
-a genome annotation encoding unit (117) for encoding annotation data (12) into different access units (122) of a second classification and encoding index data into a master annotation index (MAI, 123), wherein the index data represents an encoded form of annotation string data obtained by employing at least one compressed string indexing algorithm (28) on the annotation string data (210), and wherein the MAI associates an encoded annotation string with the access units of the second classification,
-means for jointly encoding said access units of the first class, said access units of the second class and said MAI.
13. The genomic encoder of claim 12, further comprising encoding means for performing the steps of the encoding method of any of claims 1-8.
14. A genome decoder apparatus for decoding nucleotide sequences and genome annotation data encoded by an encoder according to claim 12, the decoder comprising:
-means for parsing (70) the genomic data multiplex (710) into a layer (71) of genomic syntax elements;
-means for parsing the compressed annotation data;
-means for parsing a master annotation index;
-means for expanding the genomic layer into classified reads of nucleotide sequences;
-means for selectively decoding the classified reads of nucleotide sequences on one or more reference sequences to generate uncompressed reads of nucleotide sequences;
-means for selectively decoding the annotation data associated with the classified reads.
15. The genome decoder according to claim 14, further comprising decoding means for performing the steps of the decoding method according to claim 10 or 11.
16. A computer-readable medium comprising instructions that, when executed by at least one processor, cause the at least one processor to perform the method of any one of claims 1-11.
CN202180034395.5A 2020-04-15 2021-03-17 Method and system for efficient data compression in MPEG-G Pending CN115552536A (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
EP20169717.4 2020-04-15
EP20169717.4A EP3896698A1 (en) 2020-04-15 2020-04-15 Method and system for the efficient data compression in mpeg-g
PCT/EP2021/056766 WO2021209216A1 (en) 2020-04-15 2021-03-17 Method and system for the efficient data compression in mpeg-g

Publications (1)

Publication Number Publication Date
CN115552536A (en) 2022-12-30

Family

ID=70292776

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202180034395.5A Pending CN115552536A (en) 2020-04-15 2021-03-17 Method and system for efficient data compression in MPEG-G

Country Status (7)

Country Link
US (1) US20230274800A1 (en)
EP (2) EP3896698A1 (en)
JP (1) JP2023521991A (en)
KR (1) KR20230003493A (en)
CN (1) CN115552536A (en)
CA (1) CA3174759A1 (en)
WO (1) WO2021209216A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116504396B (en) * 2023-06-26 2023-09-08 贵阳市第四人民医院 Traditional Chinese and western medicine combined internal medicine inspection data analysis system

Family Cites Families (7)

Publication number Priority date Publication date Assignee Title
WO2018068827A1 (en) 2016-10-11 2018-04-19 Genomsys Sa Efficient data structures for bioinformatics information representation
EP3526712B1 (en) 2016-10-11 2021-03-24 Genomsys SA Method and system for the transmission of bioinformatics data
CN110168651A (en) * 2016-10-11 2019-08-23 基因组系统公司 Method and system for selective access storage or transmission biological data
WO2018068829A1 (en) 2016-10-11 2018-04-19 Genomsys Sa Method and apparatus for compact representation of bioinformatics data
CA3039689A1 (en) 2016-10-11 2018-04-19 Genomsys Sa Method and system for storing and accessing bioinformatics data
EP3526706A4 (en) * 2016-10-11 2020-08-12 Genomsys SA Method and apparatus for the access to bioinformatics data structured in access units
KR20190113971A (en) 2017-02-14 2019-10-08 게놈시스 에스에이 Compression representation method and apparatus of bioinformatics data using multiple genome descriptors

Also Published As

Publication number Publication date
US20230274800A1 (en) 2023-08-31
CA3174759A1 (en) 2021-10-21
KR20230003493A (en) 2023-01-06
EP3896698A1 (en) 2021-10-20
EP4136640A1 (en) 2023-02-22
WO2021209216A1 (en) 2021-10-21
JP2023521991A (en) 2023-05-26

Similar Documents

Publication Publication Date Title
CN110121577B (en) Method for encoding/decoding genome sequence data, and genome encoder/decoder
EP3526709B1 (en) Efficient data structures for bioinformatics information representation
CN110168652B (en) Method and system for storing and accessing bioinformatic data
AU2018221458B2 (en) Method and apparatus for the compact representation of bioinformatics data using multiple genomic descriptors
CN110178183B (en) Method and system for transmitting bioinformatic data
US20230274800A1 (en) Method and System for the Efficient Data Compression in MPEG-G
CN110663022B (en) Method and apparatus for compact representation of bioinformatic data using genomic descriptors
NZ753247B2 (en) Efficient data structures for bioinformatics information representation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination