NZ757185B2

NZ757185B2 - Method and apparatus for the compact representation of bioinformatics data using multiple genomic descriptors

Info

Publication number: NZ757185B2
Application number: NZ757185A
Authority: NZ
Inventors: Claudio Alberti; Mohamed Khoso Baluch; Daniele Renzi; Giorgio Zoia
Original assignee: Genomsys Sa
Priority date: 2017-02-14
Filing date: 2018-02-14
Publication date: 2021-08-31

Abstract

Method and apparatus for the compression of genome sequence data produced by genome sequencing machines. Sequence reads are coded by aligning them with respect to pre-existing or constructed reference sequences; the coding process is composed of a classification of the reads into data classes followed by the coding of each class in terms of a multiplicity of descriptor blocks. Specific source models and entropy coders are used for each data class in which the data is partitioned, and each associated descriptor block.

Description

WO 2018/152143 PCT/US2018/018092

METHOD AND APPARATUS FOR THE COMPACT REPRESENTATION OF BIOINFORMATICS DATA USING MULTIPLE GENOMIC DESCRIPTORS

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority to and the benefit of Patent Applications PCT/US2017/017842 filed on February 14, 2017, and PCT/US2017/041591 filed on July 11, 2017.

TECHNICAL FIELD

This disclosure provides a novel method of representation of genome sequencing data which reduces the utilized storage space and improves access performance by providing new functionality that is not available with known prior art methods of representation.

BACKGROUND

An appropriate representation of genome sequencing data is fundamental to enable efficient genomic analysis applications such as genome variants calling and all other analysis performed with various purposes by processing the sequencing data and metadata.

Human genome sequencing has become affordable by the emergence of high-throughput low cost sequencing technologies. Such opportunity opens new perspectives in several fields ranging from the diagnosis and treatment of cancer to the identification of genetic illnesses, from pathogen surveillance for the identification of antibodies to the creation of new vaccines, drugs and the customization of personalized treatments.

Hospitals, genomics data analysis providers, bioinformaticians and large biological information storage centers are looking for affordable, fast, reliable and interconnected genomic information processing solutions which would enable scaling genomic medicine to a world-wide scale. Since one of the bottlenecks in the sequencing process has become data storage, methods for representing genome sequencing data in a compressed form are increasingly investigated.

The most used genome information representations of sequencing data are based on zipping FASTQ and SAM formats. The objective is to compress the traditionally used file formats (respectively FASTQ and SAM for non-aligned and aligned data). Such files are constituted by plain text characters and are compressed, as mentioned above, by using general purpose approaches such as LZ (from Lempel and Ziv, the authors who published the first versions) schemes (the well-known zip, gzip, etc.). When general purpose compressors such as gzip are used, the result of compression is usually a single blob of binary data. The information in such monolithic form is quite difficult to archive, transfer and elaborate, particularly when, as in the case of high throughput sequencing, the volume of data is extremely large. The BAM format is characterized by poor compression performance due to the focus on compression of the inefficient and redundant SAM format rather than on extracting the actual genomic information conveyed by SAM files, and due to the adoption of general purpose text compression algorithms such as gzip rather than exploiting the specific nature of each data source (the genomic data itself).

A more sophisticated approach to genomic data compression that is less used, but more efficient than BAM, is CRAM. CRAM provides more efficient compression through the adoption of differential encoding with respect to a reference (it partially exploits the data source redundancy), but it still lacks features such as incremental updates, support for streaming and selective access to specific classes of compressed data.

These approaches generate poor compression ratios and data structures that are difficult to navigate and manipulate once compressed. Downstream analysis can be very slow due to the necessity of handling large and rigid data structures even to perform simple operations or to access selected regions of the genomic dataset. CRAM relies on the concept of the CRAM record. Each CRAM record represents a single mapped or unmapped read by coding all the elements necessary to reconstruct it. CRAM presents the following drawbacks and limitations that are solved and overcome by the invention described in this document:

1. CRAM does not support data indexing and random access to data subsets sharing specific features. Data indexing is out of the scope of the specification (see section 12 of CRAM specification v 3.0) and it is implemented as a separate file. Conversely the approach of the invention described in this document employs a data indexing method that is integrated with the encoding process and indexes are embedded in the encoded (i.e. compressed) bit stream.

2. CRAM is built by core data blocks that can contain any type of mapped reads (perfectly matching reads, reads with substitutions only, reads with insertions or deletions (also referred to as "indels")). There is no notion of data classification and grouping of reads in classes according to the result of mapping with respect to a reference sequence. This means that all data need to be inspected even if only reads with specific features are searched. Such limitation is solved by the invention by classifying and partitioning data in classes before coding.

3. CRAM is based on the concept of encapsulating each read into a "CRAM record". This implies the need to inspect each complete "record" when reads characterized by specific biological features (e.g. reads with substitutions, but without "indels", or perfectly mapped reads) are searched. Conversely, in the present invention there is the notion of data classes coded separately in separated information blocks and there is no notion of record encapsulating each read. This enables more efficient access to sets of reads with specific biological characteristics (e.g. reads with substitutions, but without "indels", or perfectly mapped reads) without the need of decoding each (block of) read(s) to inspect its features.

4. In a CRAM record each record field is associated to a specific flag and each flag must always have the same meaning as there is no notion of context since each CRAM record can contain any type of data. This coding mechanism introduces redundant information and prevents the usage of efficient context based entropy coding. Instead in the present invention, there is no notion of flag denoting data because this is intrinsically defined by the information "block" the data belongs to. This implies a largely reduced number of symbols to be used and a consequent reduction of the information source entropy which results into a more efficient compression. Such improvement is possible because the use of different "blocks" enables the encoder to reuse the same symbol across each block with different meanings according to the context.

5. In CRAM substitutions, insertions and deletions are represented by using different descriptors, an option that increases the size of the information source alphabet and yields a higher source entropy. Conversely, the approach of the disclosed invention uses a single alphabet and encoding for substitutions, insertions and deletions. This makes the encoding and decoding process simpler and produces a lower entropy source model whose coding yields bitstreams characterized by high compression performance.

The present invention aims at compressing genomic sequences by classifying and partitioning sequencing data so that the redundant information to be coded is minimized and features such as selective access and support for incremental updates are directly enabled in the compressed domain.

One of the aspects of the presented approach is the definition of classes of data and metadata structured in different blocks and encoded separately. The more relevant improvements of such approach with respect to existing methods consist in:

1. the increase of compression performance due to the reduction of the information source entropy constituted by providing an efficient source model for each class of data or metadata;

2. the possibility of performing selective accesses to portions of the compressed data and metadata for any further processing purpose directly in the compressed domain;

3. the possibility to incrementally (i.e. without the need of decoding and re-encoding) update compressed data and metadata with new sequencing data and/or metadata and/or new analysis results associated to specific sets of sequencing reads.
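The first improvement can be illustrated numerically. The sketch below is not the coder of this disclosure; it only computes the empirical Shannon entropy of a toy symbol stream, first as one mixed stream and then split into per-class blocks. The function names, the class label attached to each symbol and the toy data are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(symbols):
    """Empirical Shannon entropy, in bits per symbol, of a sequence."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum(c / total * log2(c / total) for c in counts.values())


def per_class_entropy(labelled):
    """Average bits per symbol when each class is modelled on its own.

    `labelled` is a list of (class_label, symbol) pairs; the result is the
    entropy of each class block weighted by its share of the symbols.
    """
    blocks = defaultdict(list)
    for label, symbol in labelled:
        blocks[label].append(symbol)
    total = len(labelled)
    return sum(len(b) / total * entropy(b) for b in blocks.values())


# Toy data (hypothetical): class "P" symbols are all 0, class "M" symbols vary.
stream = [("P", 0)] * 8 + [("M", 1), ("M", 2), ("M", 1), ("M", 3)]
mixed = entropy([symbol for _, symbol in stream])
split = per_class_entropy(stream)
```

For a fixed labelling, `split` never exceeds `mixed`, because conditioning on the class cannot increase entropy. A real coder must still account for the cost of signalling which block each symbol belongs to.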

BRIEF DESCRIPTION OF THE DRAWINGS

Figure 1 shows how the position of the mapped reads pairs are encoded in the "pos" block as difference from the absolute position of the first mapped read.

Figure 2 shows how two reads in a pair originate from the two DNA strands.

Figure 3 shows how the reverse complement of read 2 is encoded if strand 1 is used as reference.

Figure 4 shows the four possible combinations of reads composing a reads pair and the respective coding in the "rcomp" block.

Figure 5 shows how to calculate the pairing distance in case of constant reads length for three read pairs.

Figure 6 shows how the pairing errors encoded in the "pair" block enable the decoder to reconstruct the correct read pairing using the encoded "MPPPD".

Figure 7 shows the encoding of a pairing distance when a read is mapped on a different reference than its mate. In this case additional descriptors are added to the pairing distance. One is a signaling flag, the second is a reference identifier and then the pairing distance.

Figure 8 shows the encoding of "n type" mismatches in a "nmis" block.

Figure 9 shows a mapped read pair which presents substitutions with respect to a reference sequence.

Figure 10 shows how to calculate the positions of substitutions either as absolute or differential values.

Figure 11 shows how to calculate the symbols encoding substitutions types when no IUPAC codes are used. The symbols represent the distance in a circular substitution vector between the molecule present in the read and the one present on the reference at that position.

Figure 12 shows how to encode the substitutions into the "snpt" block.

Figure 13 shows how to calculate substitution codes when IUPAC ambiguity codes are used.

Figure 14 shows how the "snpt" block is encoded when IUPAC codes are used.

Figure 15 shows how for reads of class I the substitution vector used is the same as for class M with the addition of special codes for insertions of the symbols A, C, G, T, N.

Figure 16 shows some examples of encoding of mismatches and indels in case of IUPAC ambiguity codes. The substitution vector is much longer in this case and therefore the possible calculated symbols are more than in the case of five symbols.

Figure 17 shows a different source model for mismatches and indels where each block contains the position of the mismatches or inserts of a single type. In this case no symbols are encoded for the mismatch or indel type.

Figure 18 shows an example of mismatches and indels encoding. When no mismatches or indels of a given type are present for a read, a 0 is encoded in the corresponding block. The 0 acts as reads separator and terminator in each block.

Figure 19 shows how a modification in the reference sequence can transform M reads in P reads. This operation can reduce the information entropy of the data structure especially in case of high coverage data.

Figure 20 shows a genomic encoder 2010 according to one embodiment of this invention.

Figure 21 shows a genomic decoder 218 according to one embodiment of this invention.

Figure 22 shows how an "internal" reference can be constructed by clustering reads and assembling the segments taken from each cluster.

Figure 23 shows how a strategy of constructing a reference consists in storing the most recent reads once a specific sorting (e.g. lexicographic order) has been applied to the reads.

Figure 24 shows how a read belonging to the class of "unmapped" reads (Class U) can be coded using six descriptors stored or carried in the corresponding blocks.

Figure 25 shows an alternative coding of reads belonging to Class U where a signed pos descriptor is used to code the mapping position of a read on the constructed reference.

Figure 26 shows how reference transformations can be applied to remove mismatches from reads. In some cases reference transformations generate new mismatches or change the type of mismatches found when referring to the reference before the transformation has been applied.

Figure 27 shows how reference transformations can change the class reads belong to when all or a subset of mismatches are removed (i.e. the read belonging to class M before transformation is assigned to Class P after the transformation of the reference has been applied).

Figure 28 shows how half mapped read pairs (class HM) can be used to fill unknown regions of a reference sequence by assembling longer contigs with unmapped reads.

Figure 29 shows how encoders of data of class N, M and I are configured with vectors of thresholds and generate separate subclasses of M and I data classes.

Figure 30 shows how all classes of data can use the same transformed reference for encoding or a different transformation can be used for each class N, M and I, or any combination thereof.

Figure 31 shows the structure of a Genomic Dataset Header.

Figure 32 shows the generic structure of a Master Index Table where each row contains genomic intervals of the several classes of data P, N, M, I, U, HM and further pointers to Metadata and annotations. The columns refer to specific positions on the reference sequences related to the encoded genomic data.

Figure 33 shows an example of one row of the MIT containing genomic intervals related to reads of Class P. Genomic regions related to different reference sequences are separated by a special flag ('S' in the example).

Figure 34 shows the generic structure of the Local Index Table (LIT) and how it is used to store pointers to the physical location of the encoded genomic information in the stored or transmitted data.

Figure 35 shows an example of LIT used to access Access Units no. 7 and 8 in the block payload.

Figure 36 shows the functional relationship among the several rows of the MIT and the LIT contained in the genomic blocks headers.

Figure 37 shows how an Access Unit is composed by several blocks of genomic data carried by different genomic streams containing data belonging to different classes. Each block is further composed by data packets used as data transmission units.

Figure 38 shows how Access Units are composed by a header and multiplexed blocks belonging to one or more blocks of homogeneous data. Each block can be composed by one or more packets containing the actual descriptors of the genomic information.

Figure 39 shows Multiple alignments without splicing. The left-most read has N alignments. N is the first value of mmap to be decoded and signals the number of alignments of the first read.

The following N values of the mmap descriptor are decoded and are used to calculate P which is the number of alignments of the second read.

Figure 40 shows how the pos, pair and mmap descriptors are used to encode multiple alignments without splices. The left-most read has N alignments.

Figure 41 shows multiple alignments with splices.

Figure 42 shows the use of the pos, pair, mmap and msar descriptors to represent multiple alignments with splices.

SUMMARY

The features of the claims below solve the problem of existing prior art solutions by providing a method for encoding genome sequence data, said genome sequence data comprising reads of sequences of nucleotides, said method comprising the steps of: aligning said reads to one or more reference sequences thereby creating aligned reads; classifying said aligned reads according to specified matching rules with said one or more reference sequences, thereby creating classes of aligned reads; encoding said classified aligned reads as a multiplicity of blocks of descriptors, wherein encoding said classified aligned reads as a multiplicity of blocks of descriptors comprises selecting said descriptors according to said classes of aligned reads; structuring said blocks of descriptors with header information thereby creating successive Access Units.

In another aspect the coding method further comprises: further classifying said reads that do not satisfy said specified matching rules into a class of unmapped reads; constructing a set of reference sequences using at least some unmapped reads; aligning said class of unmapped reads to the set of constructed reference sequences; encoding said classified aligned reads as a multiplicity of blocks of descriptors; encoding said set of constructed reference sequences; structuring said blocks of descriptors and said encoded reference sequences with header information thereby creating successive Access Units.

In another aspect the coding method further comprises identifying genomic reads without mismatch in the reference sequence as first "Class P".

In another aspect the coding method further comprises identifying genomic reads as a second "Class N" when mismatches are found only in the positions where the sequencing machine was not able to call any "base" and the number of mismatches in each read does not exceed a given threshold.

In another aspect the coding method further comprises identifying genomic reads as a third "Class M" when mismatches are found in the positions where the sequencing machine was not able to call any "base", named "n type" mismatches, and/or it called a different "base" than the reference sequence, named "s type" mismatches, and the number of mismatches does not exceed given thresholds for the number of mismatches of "n type", of "s type" and a threshold obtained from a given function (f(n,s)) calculated on the number of "n type" and "s type" mismatches.

In another aspect the coding method further comprises identifying genomic reads as a fourth "Class I" when they can possibly have the same type of mismatches of "Class M", and in addition at least one mismatch of type: "insertion" ("i type"), "deletion" ("d type"), soft clips ("c type"), and wherein the number of mismatches for each type does not exceed the corresponding given threshold and a threshold provided by a given function (w(n,s,i,d,c)) calculated on the number of "n type", "s type", "i type", "d type" and "c type" mismatches.

In another aspect the coding method further comprises identifying genomic reads as a fifth "Class U" as comprising all reads that do not find any classification in the Classes P, N, M, I, as previously defined.
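The classification rules above can be sketched as a small decision function. This is an illustrative reading of the text, not the claimed implementation: the `Thresholds` container, the function name and, in particular, the default forms of f(n,s) and w(n,s,i,d,c) (plain sums here) are assumptions, since the text leaves both functions unspecified.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Hypothetical container for the per-type and per-function thresholds."""
    max_n: int
    max_s: int
    max_i: int
    max_d: int
    max_c: int
    max_f: float
    max_w: float


def default_f(n, s):
    # Assumed form of f(n,s); the text does not define it.
    return n + s


def default_w(n, s, i, d, c):
    # Assumed form of w(n,s,i,d,c); the text does not define it.
    return n + s + i + d + c


def classify_read(n, s, i, d, c, t, f=default_f, w=default_w):
    """Return "P", "N", "M", "I" or "U" from the mismatch counts of one read."""
    has_indel_or_clip = i > 0 or d > 0 or c > 0
    if n == 0 and s == 0 and not has_indel_or_clip:
        return "P"  # no mismatch at all
    if s == 0 and not has_indel_or_clip and n <= t.max_n:
        return "N"  # only uncalled bases, within threshold
    if (not has_indel_or_clip and n <= t.max_n and s <= t.max_s
            and f(n, s) <= t.max_f):
        return "M"  # "n type" and/or "s type" mismatches only
    if (has_indel_or_clip and n <= t.max_n and s <= t.max_s
            and i <= t.max_i and d <= t.max_d and c <= t.max_c
            and w(n, s, i, d, c) <= t.max_w):
        return "I"  # at least one insertion, deletion or soft clip
    return "U"      # everything that fits no other class
```

With `Thresholds(2, 3, 1, 1, 1, 4, 6)`, a read with one uncalled base and two substitutions falls in Class M, while a read with five substitutions exceeds the "s type" threshold and falls in Class U.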

In another aspect the coding method further comprises that the reads of the genomic sequence to be encoded are paired.

In another aspect the coding method further comprises that said classifying further comprises identifying genomic reads as a sixth "Class HM" as comprising all reads pairs where one read belongs to Class P, N, M or I and the other read belongs to "Class U".

In another aspect the coding method further comprises the steps of: identifying if the two mate reads are classified in the same class (each of: P, N, M, I, U), then assigning the pair to the same identified class; identifying if the two mate reads are classified in different classes, and in case none of them belongs to the "Class U", then assigning the pair of reads to the class with the highest priority defined according to the following expression: P < N < M < I, in which "Class P" has the lowest priority and "Class I" has the highest priority; identifying if only one of the two mate reads has been classified as belonging to "Class U" and classifying the pair of reads as belonging to the "Class HM" sequences.
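The pair assignment rules can be sketched in a few lines. The function and constant names are illustrative assumptions; the priority order P < N < M < I and the handling of Class U and Class HM follow the text.

```python
# Priority order stated in the text: P lowest, I highest.
CLASS_PRIORITY = {"P": 0, "N": 1, "M": 2, "I": 3}


def classify_pair(first, second):
    """Assign a class to a read pair from the classes of its two mates."""
    if first == second:
        return first  # same class for both mates, including U/U
    if "U" in (first, second):
        return "HM"   # exactly one mate is unmapped: half mapped pair
    # Different mapped classes: the higher-priority class wins.
    return max(first, second, key=CLASS_PRIORITY.__getitem__)
```

For example, a pair made of a Class P read and a Class M read is assigned to Class M, and a pair made of a Class N read and a Class U read is assigned to Class HM.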

In another aspect the coding method further comprises that each class of reads N, M, I is further partitioned into two or more subclasses (296, 297, 298) according to a vector of thresholds (292, 293, 294) defined respectively for each class N, M and I by the number of "n type" mismatches (292), the function f(n,s) (293) and the function w(n,s,i,d,c) (294). The coding method further comprises: identifying if the two mate reads are classified in the same subclass, then assigning the pair to the same sub-class; identifying if the two mate reads are classified into sub-classes of different Classes, then assigning the pair to the subclass belonging to the Class of higher priority according to the following expression: N < M < I, where N has the lowest priority and I has the highest priority; identifying if the two mate reads are classified in the same class, and such class is N or M or I, but in different sub-classes, then assigning the pair to the sub-class with the highest priority according to the following expressions:

N1 < N2 < ...

M1 < M2 < ...

I1 < I2 < ... < Ih

where the highest index has the highest priority.
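A minimal sketch of the sub-class logic follows. It assumes that each vector of thresholds is sorted in ascending order and that a read falls into the first sub-class whose threshold it does not exceed; neither detail is stated in the text, and the names are hypothetical. A sub-class is represented as a `(class, index)` tuple.

```python
from bisect import bisect_left

# Priority order stated in the text: N lowest, I highest.
CLASS_PRIORITY = {"N": 0, "M": 1, "I": 2}


def subclass_index(value, thresholds):
    """1-based index of the first threshold that `value` does not exceed.

    `value` is the number of "n type" mismatches for class N, f(n,s) for
    class M or w(n,s,i,d,c) for class I. Returns None when `value` is
    above every threshold in the (ascending) vector.
    """
    k = bisect_left(thresholds, value)
    return k + 1 if k < len(thresholds) else None


def pair_subclass(first, second):
    """Pick the sub-class of a pair from two (class, index) tuples.

    Comparing (class priority, index) covers the three rules at once:
    equal sub-classes, different classes, same class with different index.
    """
    return max(first, second, key=lambda sc: (CLASS_PRIORITY[sc[0]], sc[1]))
```

With the vector `[1, 3, 5]`, a value of 2 maps to sub-class 2, and a pair whose mates are in `("N", 2)` and `("M", 1)` is assigned to `("M", 1)` because Class M has priority over Class N.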

In another aspect the information on the mapping position of each read is encoded by means of a pos descriptor block.

In another aspect the information on the strandedness (i.e. the DNA strand the read was sequenced from) of each read is encoded by means of a rcomp descriptor block.

In another aspect the pairing information of paired-end reads is encoded by means of a pair descriptor block.

In another aspect the additional alignment information, such as whether the read is mapped in proper pair, it fails platform/vendor quality checks, it is a PCR or optical duplicate or it is a supplementary alignment, is encoded by means of a flags descriptor block.

In another aspect the information on unknown bases is encoded by means of a nmis descriptor block.

In another aspect the information on the position of substitutions is encoded by means of a snpp descriptor block.

In another aspect the information on the type of substitutions is encoded by means of a specific snpt descriptor block.

In another aspect the information on the position of mismatches of type substitutions, insertions or deletions is encoded by means of an indp descriptor block.

In another aspect the information on the type of mismatches such as substitutions, insertions or deletions is encoded by means of an indt descriptor block.

In another aspect the information on clipped bases of a mapped read is encoded by means of an indc descriptor block.

In another aspect the information on unmapped reads is encoded by means of a ureads descriptor block.

In another aspect the information on the type of reference sequence used for encoding is encoded by means of a rtype descriptor block.

In another aspect the information on multiple alignments of the mapped reads is encoded by means of a mmap descriptor block.

In another aspect the information on spliced alignments and multiple alignments of the same read is encoded by means of a msar descriptor and mmap descriptor block.

In another aspect the information on read alignment scores is encoded by means of a mscore descriptor block.

In another aspect the information on the groups reads belong to is encoded by means of a specific rgroup descriptor block.
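The descriptor blocks listed above can be pictured as separate containers, one per class and descriptor, rather than one record per read. The sketch below only illustrates that layout: the descriptor names and meanings come from the text, while the input shape, the function name and the use of plain Python lists as blocks are assumptions.

```python
from collections import defaultdict

# Descriptor names and meanings as listed in the text.
DESCRIPTORS = {
    "pos": "mapping position of each read",
    "rcomp": "strandedness of each read",
    "pair": "pairing information of paired-end reads",
    "flags": "additional alignment information",
    "nmis": "unknown bases",
    "snpp": "position of substitutions",
    "snpt": "type of substitutions",
    "indp": "position of substitutions, insertions or deletions",
    "indt": "type of substitutions, insertions or deletions",
    "indc": "clipped bases of a mapped read",
    "ureads": "unmapped reads",
    "rtype": "type of reference sequence used for encoding",
    "mmap": "multiple alignments of the mapped reads",
    "msar": "spliced alignments",
    "mscore": "read alignment scores",
    "rgroup": "groups reads belong to",
}


def build_blocks(reads):
    """Group descriptor values into one block per (class, descriptor).

    `reads` is an iterable of (class_label, {descriptor_name: value});
    this input shape is a hypothetical convenience, not a defined format.
    """
    blocks = defaultdict(list)
    for class_label, values in reads:
        for name, value in values.items():
            if name not in DESCRIPTORS:
                raise ValueError(f"unknown descriptor: {name}")
            blocks[(class_label, name)].append(value)
    return dict(blocks)
```

Each resulting block holds homogeneous values of a single descriptor for a single class, which is the property that allows a dedicated source model and entropy coder to be chosen per block.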

In another aspect the coding method further comprises that said blocks of descriptors comprise a master index table, containing one section for each Class and sub-class of aligned reads, said section comprising the mapping positions on said one or more reference sequences of the first read of each Access Unit of each Class or sub-class of data; jointly coding said master index table and said access unit data.
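As a rough illustration of how such a table supports selective access, the sketch below keeps, for each class, the sorted mapping positions of the first read of each Access Unit and uses a binary search to locate a candidate Access Unit. It is simplified to a single reference sequence, and the tuple layout and function names are assumptions rather than the format of this disclosure.

```python
from bisect import bisect_right


def build_master_index(access_units):
    """Build {class: (first_positions, au_ids)} with positions ascending.

    `access_units` is an iterable of (class_label, first_read_position,
    au_id) tuples, a hypothetical flat description of the Access Units.
    """
    index = {}
    for cls, first_pos, au_id in sorted(access_units, key=lambda u: (u[0], u[1])):
        positions, ids = index.setdefault(cls, ([], []))
        positions.append(first_pos)
        ids.append(au_id)
    return index


def find_access_unit(index, cls, position):
    """Id of the last Access Unit of `cls` starting at or before `position`.

    Returns None when the class has no Access Unit starting that early.
    """
    positions, ids = index.get(cls, ([], []))
    k = bisect_right(positions, position)
    return ids[k - 1] if k else None
```

Only the section of the requested class is searched, so reads of other classes never need to be inspected. A complete lookup would also have to consider reads that start in an earlier Access Unit and extend over the requested position.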

In another aspect the coding method further comprises that said blocks of descriptors further comprise information related to the type of reference used (pre-existing or constructed) and the segments of the read that do not match on the reference sequence.

In another aspect the coding method further comprises that said reference sequences are first transformed into different reference sequences by applying substitutions, insertions, deletions and clipping, then the encoding of said classified aligned reads as a multiplicity of blocks of descriptors refers to the transformed reference sequences.

In another aspect the coding method further comprises that the same transformation is applied to the reference sequences for all classes of data.

In another aspect the coding method further comprises that different transformations are applied to the reference sequences per each class of data.

In another aspect the coding method further comprises that the reference sequences transformations are encoded as blocks of descriptors and structured with header information thereby creating successive Access Units.

In another aspect the coding method further comprises that the encoding of said classified aligned reads and the related reference sequences transformations as a multiplicity of blocks of descriptors comprises the step of associating a specific source model and a specific entropy coder to each descriptor block.

In another aspect the coding method further comprises that said entropy coder is one of a context adaptive arithmetic coder, a variable length coder or a Golomb coder.
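To make one of these options concrete, the sketch below implements a Rice code, i.e. the special case of a Golomb code whose divisor is a power of two. The text names a Golomb coder without giving parameters, so the choice of the Rice variant, the bit convention (unary run of ones ended by a zero) and the use of bit strings instead of packed bytes are all assumptions for readability.

```python
def rice_encode(value, k):
    """Encode a non-negative integer with divisor 2**k as a bit string."""
    if value < 0 or k < 0:
        raise ValueError("value and k must be non-negative")
    quotient = value >> k
    remainder = value & ((1 << k) - 1)
    unary = "1" * quotient + "0"  # quotient in unary, zero-terminated
    binary = format(remainder, f"0{k}b") if k else ""
    return unary + binary


def rice_decode(bits, k):
    """Decode one value produced by rice_encode with the same k."""
    quotient = bits.index("0")    # length of the leading run of ones
    remainder = int(bits[quotient + 1:quotient + 1 + k], 2) if k else 0
    return (quotient << k) | remainder
```

Small values get short codes, which suits descriptors such as differentially coded positions where small numbers dominate. For instance, `rice_encode(9, 2)` yields `"11001"`: quotient 2 in unary followed by remainder 1 on two bits.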

The present invention further provides a method for decoding encoded genomic data comprising the steps of: parsing Access Units containing said encoded genomic data to extract multiple blocks of descriptors by employing header information; decoding said multiplicity of blocks of descriptors to extract aligned reads according to specific matching rules defining their classification with respect to one or more reference sequences.

In another aspect the decoding method further comprises the decoding of unmapped genomic reads.

In another aspect the decoding method further comprises the decoding of classified genomic reads.

In another aspect the decoding method further comprises decoding a master index table containing one section for each class of reads and the associated relevant mapping positions.

In another aspect the decoding method further comprises decoding information related to the type of reference used: pre-existing, transformed or constructed.

In another aspect the decoding method further comprises decoding information related to one or more transformations to be applied to the pre-existing reference sequences.

In another aspect the decoding method further comprises genomic reads that are paired.

In another aspect the decoding method further comprises the case wherein said genomic data are entropy decoded.

The present invention further provides a genomic encoder (2010) for the compression of genome sequence data 209, said genome sequence data 209 comprising reads of sequences of nucleotides, said genomic encoder (2010) comprising: an aligner unit (201), configured to align said reads to one or more reference sequences thereby creating aligned reads; a constructed-reference generator unit (202), configured to produce constructed reference sequences; a data classification unit (204), configured to classify said aligned reads according to specified matching rules with the one or more pre-existing reference sequences or constructed reference sequences thereby creating classes of aligned reads (208); one or more blocks encoding units (205-207), configured to encode said classified aligned reads as blocks of descriptors by selecting said descriptors according to said classes of aligned reads; a multiplexer (2016) for multiplexing the compressed genomic data and metadata.

In another aspect the genomic encoder further comprises a reference sequence transformation unit (2019) configured to transform the pre-existing references and data classes (208) into transformed data classes (2018).

In another aspect the genomic encoder further comprises a data classification unit (204) that contains encoders of data classes N, M and I configured with vectors of thresholds generating sub-classes of data classes M and I.

In another aspect the genomic encoder further comprises the feature that the reference transformation unit (2019) applies the same reference transformation (300) for all classes and sub-classes of data.

In another aspect the genomic encoder further comprises the feature that the reference transformation decoder (2019) applies different reference transformations (301, 302, 303) for the different classes and sub-classes of data.

In another aspect the genomic encoder further comprises the features suitable for executing all the aspects of the previously mentioned coding methods.

The present invention further provides a genomic decoder (218) for the decompression of a compressed genomic stream (211), said genomic decoder (218) comprising: a demultiplexer (210) for demultiplexing compressed genomic data and metadata; parsing means (212-214) configured to parse said compressed genomic stream into genomic blocks of descriptors (215); one or more block decoders (216-217), configured to decode the genomic blocks into classified reads of sequences of nucleotides (2111); genomic data classes decoders (219) configured to selectively decode said classified reads of sequences of nucleotides on one or more reference sequences so as to produce uncompressed reads of sequences of nucleotides.

In another aspect the genomic decoder further comprises a reference transformation decoder (2113) configured to decode reference transformation descriptors (2112) and produce a transformed reference (2114) to be used by genomic data class decoders (219).

In another aspect the genomic decoder further comprises that the one or more reference sequences are stored in the compressed genome stream (211).

In another aspect the genomic decoder further comprises that the one or more reference sequences are provided to the decoder via an out of band mechanism.

In another aspect the genomic decoder further comprises that the one or more reference sequences are built at the decoder.

In another aspect the genomic decoder further comprises that the one or more reference sequences are transformed at the decoder by a reference transformation decoder (2113).

The present invention further provides a computer-readable medium comprising instructions that when executed cause at least one processor to perform all the aspects of the previously mentioned coding methods.

The present invention further provides a computer-readable medium comprising instructions that when executed cause at least one processor to perform all the aspects of the previously mentioned decoding methods.

The present invention further provides a support data storing genomic data encoded according to all the aspects of the previously mentioned coding methods.

DETAILED DESCRIPTION

The genomic or proteomic sequences referred to in this invention include, for example, and not as a limitation, nucleotide sequences, Deoxyribonucleic acid (DNA) sequences, Ribonucleic acid (RNA), and amino acid sequences. Although the description herein is in considerable detail with respect to genomic information in the form of a nucleotide sequence, it will be understood that the methods and systems for compression can be implemented for other genomic or proteomic sequences as well, albeit with a few variations, as will be understood by a person skilled in the art.

Genome sequencing information is generated by High Throughput Sequencing (HTS) machines in the form of sequences of nucleotides (a.k.a. "bases") represented by strings of letters from a defined vocabulary. The smallest vocabulary is represented by five symbols: {A, C, G, T, N} representing the 4 types of nucleotides present in DNA namely Adenine, Cytosine, Guanine, and Thymine. In RNA Thymine is replaced by Uracil (U). N indicates that the sequencing machine was not able to call any base and so the real nature of the position is undetermined. In case the IUPAC ambiguity codes are adopted by the sequencing machine, the alphabet used for the symbols is (A, C, G, T, U, W, S, M, K, R, Y, B, D, H, V, N or -).

The nucleotide sequences produced by sequencing machines are called "reads". Sequence reads can be between a few dozen and several thousand nucleotides long. Some technologies produce sequence reads in "pairs" where one read is from one DNA strand and the second is from the other strand. In genome sequencing the term "coverage" is used to express the level of redundancy of the sequence data with respect to a "reference sequence". For example, to reach a coverage of 30x on a human genome (3.2 billion bases long) a sequencing machine shall produce a total of 30 x 3.2 billion bases so that on average each position in the reference is "covered" 30 times.
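The coverage arithmetic above can be illustrated with a minimal Python sketch. The function names and the 150-base read length are assumptions introduced only for this illustration; they do not appear in the disclosure.

```python
def total_bases_for_coverage(coverage: int, genome_length: int) -> int:
    """Total number of sequenced bases needed for a given average coverage."""
    return coverage * genome_length


def reads_needed(coverage: int, genome_length: int, read_length: int) -> int:
    """Number of fixed-length reads needed (rounded up) for a given coverage."""
    total = total_bases_for_coverage(coverage, genome_length)
    return -(-total // read_length)  # integer ceiling division


HUMAN_GENOME_LENGTH = 3_200_000_000  # 3.2 billion bases, as in the text
ASSUMED_READ_LENGTH = 150            # hypothetical read length, not from the text

bases = total_bases_for_coverage(30, HUMAN_GENOME_LENGTH)
reads = reads_needed(30, HUMAN_GENOME_LENGTH, ASSUMED_READ_LENGTH)
print(bases, reads)
```

With these assumed values, a 30x coverage corresponds to 96 billion sequenced bases.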

Throughout this disclosure, a reference sequence is any sequence on which the nucleotide sequences produced by sequencing machines are aligned/mapped. One example of sequence could actually be a "reference genome", a sequence assembled by scientists as a species' representative example of a set of genes. For example GRCh37, the Genome Reference Consortium human genome (build 37) is derived from thirteen anonymous volunteers from Buffalo, New York. However, a reference sequence could also consist of a synthetic sequence conceived and constructed to merely improve the compressibility of the reads in view of their further processing. This is described in more detail in section "Descriptors of Class U and construction of "internal" references for unmapped reads of "Class U" and "Class HM"" and depicted in figures 22 and 23.

Sequencing devices can introduce errors in the sequence reads such as:
1. the decision of skipping a base call due to the lack of confidence in calling any specific base. This is called an unknown base and labeled as "N" (denoted as mismatch of "n type");
2. the use of a wrong symbol (i.e. representing a different nucleic acid) to represent the nucleic acid actually present in the sequenced sample; this is usually called "substitution error" (denoted as mismatch of "s type");
3. the insertion in one sequence read of additional symbols that do not refer to any actually present nucleic acid; this is usually called "insertion error" (denoted as mismatch of "i type");
4. the deletion from one sequence read of symbols that represent nucleic acids that are actually present in the sequenced sample; this is usually called "deletion error" (denoted as mismatch of "d type");
5. the recombination of one or more fragments into a single fragment which does not reflect the reality of the originating sequence; this usually results in the aligner's decision to clip bases (denoted as mismatch of "c type").

The term "coverage" is used in literature to quantify the extent to which a reference genome or part thereof can be covered by the available sequence reads. Coverage is said to be:
- partial (less than 1X) when some parts of the reference genome are not mapped by any available sequence read;
- single (1X) when all nucleotides of the reference genome are mapped by one and only one symbol present in the sequence reads;
- multiple (2X, 3X, NX) when each nucleotide of the reference genome is mapped multiple times.

This invention aims at defining a genomic information representation format in which the relevant information is efficiently accessible and transportable and the weight of the redundant information is reduced.

The main innovative aspects of the disclosed invention are the following.
1. Sequence reads are classified and partitioned into data classes according to the results of the alignment with respect to reference sequences. Such classification and partitioning enables the selective access to encoded data according to criteria related to the alignment results and to the matching accuracy.
2. The classified sequence reads and the associated metadata are represented by homogeneous blocks of descriptors to obtain distinct information sources characterized by a low information entropy.
3. The possibility of modeling each separated information source with a distinct source model adapted to the statistical characteristics of each class, and the possibility of changing the source model within each class of reads and within each descriptor block for each separately accessible data unit (Access Units). The adoption of the appropriate context adaptive probability models and associated entropy coders according to the statistical properties of each source model.
4. The definition of correspondences and dependencies among the descriptors blocks to enable the selective access to the sequencing data and associated metadata without the need to decode all the descriptors blocks if not all information is required.
5. The coding of each sequence data class and associated metadata blocks with respect to "pre-existing" (also denoted as "external") reference sequences or with respect to "transformed" reference sequences obtained by applying appropriate transformations to "pre-existing" reference sequences so as to reduce the entropy of the descriptors blocks information sources.

Said descriptors represent the reads partitioned into the different data classes. Following any encoding of reads using the corresponding descriptors with reference to a "pre-existing" reference or a "transformed" "pre-existing" reference sequence, the occurrence of the various mismatches can be used to define the appropriate transformations to the reference sequences in order to find a final coded representation with low entropy and achieve higher compression efficiency.

6. The construction of one or more reference sequences (also referred to as "internal" references to distinguish them from the "pre-existing", also referred to here as "external", reference sequences) used to encode the class of reads that present a degree of matching accuracy with respect to the pre-existing reference sequences not satisfying a set of constraints. Such constraints are set with the objective that the coding cost of representing in compressed form the class of reads aligned with respect to the "internal" reference sequences, plus the cost of representing the "internal" reference sequences themselves, is lower than encoding the unaligned class of reads verbatim, or using the "external" reference sequences without or with transformations.

In the following, each of the above aspects will be further described in detail.

Classification of the sequence reads according to matching rules

The sequence reads generated by sequencing machines are classified by the disclosed invention into six different "classes" according to the matching results of the alignment with respect to one or more "pre-existing" reference sequences.

When aligning a DNA sequence of nucleotides with respect to a reference sequence the following cases can be identified:

1. A region in the reference sequence is found to match the sequence read without any error (i.e. perfect mapping). Such sequence of nucleotides is referred to as "perfectly matching read" or denoted as "Class P".

2. A region in the reference sequence is found to match the sequence read with a type and a number of mismatches determined only by the number of positions in which the sequencing machine generating the read was not able to call any base (or nucleotide). Such type of mismatches are denoted by an "N", the letter used to indicate an undefined nucleotide base. In this document this type of mismatch is referred to as "n type" mismatch. Such sequences belong to "Class N" reads. Once the read is classified to belong to "Class N" it is useful to limit the degree of matching inaccuracy to a given upper bound and set a boundary between what is considered a valid matching and what is not. Therefore, the reads assigned to Class N are also constrained by setting a threshold (MAXN) that defines the maximum number of undefined bases (i.e. bases called as "N") that a read can contain. Such classification implicitly defines the required minimum matching accuracy (or maximum degree of mismatch) that all reads belonging to Class N share when referred to the corresponding reference sequence, which constitutes a useful criterion for applying selective data searches to the compressed data.

3. A region in the reference sequence is found to match the sequence read with types and number of mismatches determined by the number of positions in which the sequencing machine generating the read was not able to call any nucleotide base, if present (i.e. "n type" mismatches), plus the number of mismatches in which a different base, than the one present in the reference, has been called. Such type of mismatch denoted as "substitution" is also called Single Nucleotide Variation (SNV) or Single Nucleotide Polymorphism (SNP). In this document this type of mismatch is also referred to as "s type" mismatch. The sequence read is then referred to as "M mismatching reads" and assigned to "Class M". Like in the case of "Class N", also for all reads belonging to "Class M" it is useful to limit the degree of matching inaccuracy to a given upper bound, and set a boundary between what is considered a valid matching and what is not. Therefore, the reads assigned to Class M are also constrained by defining a set of thresholds, one for the number "n" of mismatches of "n type" (MAXN) if present, and another for the number of substitutions "s" (MAXS). A third constraint is a threshold defined by any function of both numbers "n" and "s", f(n,s). Such third constraint makes it possible to generate classes with an upper bound of matching inaccuracy according to any meaningful selective access criterion. For instance, and not as a limitation, f(n,s) can be (n+s)^(1/2) or (n+s) or any linear or non-linear expression that sets a boundary to the maximum matching inaccuracy level that is admitted for a read belonging to "Class M". Such boundary constitutes a very useful criterion for applying the desired selective data searches to the compressed data when analyzing sequence reads for various purposes because it makes it possible to set a further boundary to any possible combination of the number of "n type" mismatches and "s type" mismatches (substitutions) beyond the simple threshold applied to the one type or to the other.

4. A fourth class is constituted by sequencing reads presenting at least one mismatch of any type among "insertion", "deletion" (a.k.a. indels) and "clipped", plus, if present, any mismatches type belonging to class N or M. Such sequences are referred to as "I mismatching reads" and assigned to "Class I". Insertions are constituted by an additional sequence of one or more nucleotides not present in the reference, but present in the read sequence. In this document this type of mismatch is referred to as "i type" mismatch. In literature when the inserted sequence is at the edges of the sequence it is also referred to as "soft clipped" (i.e. the nucleotides are not matching the reference but are kept in the aligned reads contrarily to "hard clipped" nucleotides which are discarded). In this document this type of mismatch is referred to as "c type" mismatch. Keeping or discarding nucleotides is a decision taken by the aligner stage and not by the classifier of reads disclosed in this invention, which receives and processes the reads as they are determined by the sequencing machine or by the following alignment stage. Deletions are "holes" (missing nucleotides) in the read with respect to the reference. In this document this type of mismatch is referred to as "d type" mismatch. Like in the case of classes "N" and "M" it is possible and appropriate to define a limit to the matching inaccuracy. The definition of the set of constraints for "Class I" is based on the same principles used for "Class M" and is reported in Table 1 in the last table lines. Beside a threshold for each type of mismatch admissible for Class I data, a further constraint is defined by a threshold determined by any function of the number of the mismatches "n", "s", "d", "i" and "c", w(n,s,d,i,c). Such additional constraint makes it possible to generate classes with an upper bound of matching inaccuracy according to any meaningful user defined selective access criterion. For instance, and not as a limitation, w(n,s,d,i,c) can be (n+s+d+i+c)^(1/5) or (n+s+d+i+c) or any linear or non-linear expression that sets a boundary to the maximum matching inaccuracy level that is admitted for a read belonging to "Class I". Such boundary constitutes a very useful criterion for applying the desired selective data searches to the compressed data when analyzing sequence reads for various purposes because it enables setting a further boundary to any possible combination of the number of mismatches admissible in "Class I" reads beyond the simple threshold applied to each type of admissible mismatch.

5. A fifth class includes all reads that do not find any mapping considered valid (i.e. not satisfying the set of matching rules defining an upper bound to the maximum matching inaccuracy as specified in Table 1) for each data class when referring to the reference sequence. Such sequences are said to be "Unmapped" when referring to the reference sequences and are classified as belonging to the "Class U".

Classification of read pairs according to matching rules

The classification specified in the previous section concerns single sequence reads. In the case of sequencing technologies that generate reads in pairs (e.g. Illumina Inc.) in which two reads are known to be separated by an unknown sequence of variable length, it is appropriate to consider the classification of the entire pair to a single data class. A read that is coupled with another is said to be its "mate".

If both paired reads belong to the same class the assignment to a class of the entire pair is obvious: the entire pair is assigned to the same class for any class (i.e. P, N, M, I, U). In the case the two reads belong to a different class, but none of them belongs to the "Class U", then the entire pair is assigned to the class with the highest priority defined according to the following expression: P < N < M < I, in which "Class P" has the lowest priority and "Class I" has the highest priority.

In case only one of the reads belongs to "Class U" and its mate to any of the Classes P, N, M, I, a sixth class is defined as "Class HM" which stands for "Half Mapped". The definition of such specific class of reads is motivated by the fact that it is used for attempting to determine gaps or unknown regions existing in reference genomes (a.k.a. little known or unknown regions). Such regions are reconstructed by mapping pairs at the edges using the pair read that can be mapped on the known regions. The unmapped mate is then used to build the so called "contigs" of the unknown region as it is shown in Figure 28. Therefore providing a selective access to only such type of read pairs greatly reduces the associated computation burden, enabling much more efficient processing of such data originated by large amounts of data sets that, using the state of the art solutions, would need to be entirely inspected.

The table below summarizes the matching rules applied to reads in order to define the class of data each read belongs to. The rules are defined in the first five columns of the table in terms of presence or absence of type of mismatches (n, s, d, i and c type mismatches). The sixth column provides rules in terms of maximum threshold for each mismatch type and any function f(n,s) and w(n,s,d,i,c) of the possible mismatch types.

Number and types of mismatches found when matching a read with a reference sequence (first five columns), set of matching accuracy constraints, and assignment class:

Number of unknown bases (n) | Number of substitutions (s) | Number of deletions (d) | Number of insertions (i) | Number of clipped bases (c) | Set of matching accuracy constraints | Assignment Class
0 | 0 | 0 | 0 | 0 | 0 | P
n > 0 | 0 | 0 | 0 | 0 | n ≤ MAXN | N
n > 0 | 0 | 0 | 0 | 0 | n > MAXN | U
n ≥ 0 | s > 0 | 0 | 0 | 0 | n ≤ MAXN and s ≤ MAXS and f(n,s) ≤ MAXM | M
n ≥ 0 | s > 0 | 0 | 0 | 0 | n > MAXN or s > MAXS or f(n,s) > MAXM | U
n ≥ 0 | s ≥ 0 | d ≥ 0* | i ≥ 0* | c ≥ 0* | n ≤ MAXN and s ≤ MAXS and d ≤ MAXD and i ≤ MAXI and c ≤ MAXC and w(n,s,d,i,c) ≤ MAXTOT | I
n ≥ 0 | s ≥ 0 | d ≥ 0 | i ≥ 0 | c ≥ 0 | n > MAXN or s > MAXS or d > MAXD or i > MAXI or c > MAXC or w(n,s,d,i,c) > MAXTOT | U

* At least one mismatch of type d, i, c must be present (i.e. d>0 or i>0 or c>0).

Table 1. Type of mismatches and set of constraints that each sequence read must satisfy to be classified in the data classes defined in this invention disclosure.
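The matching rules for single reads and for read pairs can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the names `Thresholds`, `classify_read` and `classify_pair` are hypothetical, f(n,s) and w(n,s,d,i,c) are taken to be the plain sums (one of the options the text allows), the comparison operators follow the reading "count ≤ threshold", and the pair priority order is assumed to be P < N < M < I with Class I highest.

```python
from dataclasses import dataclass


@dataclass
class Thresholds:
    """Hypothetical container for the per-type and combined thresholds."""
    max_n: int
    max_s: int
    max_d: int
    max_i: int
    max_c: int
    max_m: float    # bound on f(n, s)
    max_tot: float  # bound on w(n, s, d, i, c)


def f(n: int, s: int) -> float:
    return n + s  # assumed choice; any linear or non-linear function is allowed


def w(n: int, s: int, d: int, i: int, c: int) -> float:
    return n + s + d + i + c  # assumed choice


def classify_read(n: int, s: int, d: int, i: int, c: int, t: Thresholds) -> str:
    """Return 'P', 'N', 'M', 'I' or 'U' for one aligned read."""
    if d > 0 or i > 0 or c > 0:  # at least one indel or clipped base
        ok = (n <= t.max_n and s <= t.max_s and d <= t.max_d
              and i <= t.max_i and c <= t.max_c
              and w(n, s, d, i, c) <= t.max_tot)
        return "I" if ok else "U"
    if s > 0:  # substitutions, possibly with unknown bases
        ok = n <= t.max_n and s <= t.max_s and f(n, s) <= t.max_m
        return "M" if ok else "U"
    if n > 0:  # only unknown bases
        return "N" if n <= t.max_n else "U"
    return "P"  # perfect match


# Assumed priority among mapped classes: P lowest, I highest.
_PRIORITY = {"P": 0, "N": 1, "M": 2, "I": 3}


def classify_pair(class_a: str, class_b: str) -> str:
    """Assign a whole read pair to a single class."""
    if class_a == class_b:
        return class_a
    if "U" in (class_a, class_b):
        return "HM"  # exactly one mate is unmapped: Half Mapped
    return max(class_a, class_b, key=_PRIORITY.__getitem__)
```

For example, with `Thresholds(3, 5, 2, 2, 10, 6, 12)` a read with one unknown base and two substitutions falls in Class M, and a pair made of a Class M read and a Class U read falls in Class HM.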

Matching rules partition of sequence read data Classes N, M and I into subclasses with different degrees of matching accuracy

The data classes of type N, M and I as defined in the previous sections can be further decomposed into an arbitrary number of distinct sub-classes with different degrees of matching accuracy. Such option is an important technical advantage in providing a finer granularity and as a consequence a much more efficient selective access to each data class. As an example and not as a limitation, to partition the Class N into a number k of subclasses (Sub-Class N1, ..., Sub-Class Nk) it is necessary to define a vector with the corresponding components MAXN1, MAXN2, ..., MAXN(k-1), MAXNk, with the condition that MAXN1 < MAXN2 < ... < MAXN(k-1) < MAXN, and assign each read to the lowest ranked sub-class that satisfies the constraints specified in Table 1 when evaluated for each element of the vector. This is shown in Figure 29 where data classification unit 291 contains Class P, N, M, I, U, HM encoders and encoders for annotations and metadata. Class N encoder is configured with a vector of thresholds, MAXN1 to MAXNk (292), which generates k subclasses of N data (296).
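The assignment of a read to the lowest ranked sub-class whose threshold it satisfies can be sketched as a binary search over the increasing threshold vector. This is an illustrative sketch only: `assign_subclass` is a hypothetical name, and the constraint is assumed to have the form "value ≤ threshold".

```python
from bisect import bisect_left
from typing import Optional, Sequence


def assign_subclass(value: float, thresholds: Sequence[float]) -> Optional[int]:
    """Return the 1-based index of the lowest sub-class whose threshold
    is >= value, or None when no threshold is satisfied.

    `thresholds` must be strictly increasing, e.g. [MAXN1, ..., MAXNk].
    """
    idx = bisect_left(thresholds, value)  # first position with threshold >= value
    return idx + 1 if idx < len(thresholds) else None


# Hypothetical threshold vector for Class N with k = 3 sub-classes.
maxn_vector = [1, 3, 5]
print([assign_subclass(n, maxn_vector) for n in (1, 2, 3, 5, 6)])
```

The same function can be reused for Class M and Class I by passing the value of f(n,s) or w(n,s,d,i,c) together with the corresponding threshold vector.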

In the case of the classes of type M and I the same principle is applied by defining a vector with the same properties for MAXM and MAXTOT respectively and using each vector component as threshold for checking if the functions f(n,s) and w(n,s,d,i,c) satisfy the constraint. Like in the case of sub-classes of type N, the assignment is given to the lowest sub-class for which the constraint is satisfied. The number of sub-classes for each class type is independent and any combination of subdivisions is admissible. This is shown in figure 29 where a Class M encoder 293 and a Class I encoder 294 are configured respectively with a vector of thresholds MAXM1 to MAXMj and MAXTOT1 to MAXTOTh. The two encoders generate respectively j subclasses of M data (297) and h subclasses of I data (298).

When two reads in a pair are classified in the same sub-class, then the pair belongs to the same sub-class.

When two reads in a pair are classified into sub-classes of different classes, then the pair belongs to the sub-class of the class of higher priority according to the following expression: N < M < I.

When two reads belong to different sub-classes of one of classes N or M or I, then the pair belongs to the sub-class with the highest priority according to the following expressions: N1 < N2 < ... < Nk; M1 < M2 < ... < Mj; I1 < I2 < ... < Ih, where the highest index has the highest priority.

Transformations of the "external" reference sequences

The mismatches found for the reads classified in the classes N, M and I can be used to create "transformed" references to be used to compress more efficiently the read representation. Reads classified as belonging to the Classes N, M or I (with respect to the "pre-existing" (i.e. "external") reference sequence denoted as RS0) can be coded with respect to the "transformed" reference sequence RS1 according to the occurrence of the actual mismatches with the "transformed" reference. For example if read_M(i,n), belonging to Class M (denoted as the i-th read of class M), contains mismatches with respect to the reference sequence RSn, then after "transformation" read_M(i,n) = read_P(i,n+1) can be obtained with A(Refn) = Refn+1, where A is the transformation from reference sequence RSn to reference sequence RSn+1.

Figure 19 shows an example of how reads containing mismatches (belonging to Class M) with respect to reference sequence 1 (RS1) can be transformed into perfectly matching reads with respect to the reference sequence 2 (RS2) obtained from RS1 by modifying the bases corresponding to the mismatch positions. They remain classified and they are coded together with the other reads in the same data class access unit, but the coding is done using only the descriptors and descriptor values needed for a Class P read. This transformation can be denoted as: RS2 = A(RS1).

When the representation of the transformation A which generates RS2 when applied to RS1, plus the representation of the reads versus RS2, corresponds to a lower entropy than the representation of the reads of class M versus RS1, it is advantageous to transmit the representation of the transformation A and the corresponding representation of the reads versus RS2 because a higher compression of the data representation is achieved.
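The effect of a reference transformation on substitution-only reads can be illustrated with a small Python sketch. The sequences, the function names and the representation of the transformation A as a list of (position, base) pairs are assumptions made for illustration; they are not the encoding defined by the disclosure.

```python
from typing import List, Tuple


def mismatches(read: str, ref: str, pos: int) -> List[Tuple[int, str]]:
    """Substitution mismatches of a read aligned at `pos`, as (ref position, read base)."""
    return [(pos + k, base) for k, base in enumerate(read) if ref[pos + k] != base]


def transform_reference(ref: str, substitutions: List[Tuple[int, str]]) -> str:
    """Apply a transformation A, given as (position, new base) pairs, to a reference."""
    out = list(ref)
    for position, base in substitutions:
        out[position] = base
    return "".join(out)


rs1 = "ACGTACGTAC"                   # hypothetical reference RS1
reads = [("GTTC", 2), ("TTCG", 3)]   # hypothetical (read, mapping position) pairs

# Both reads show the same substitution at reference position 4 (A -> T).
before = [mismatches(r, rs1, p) for r, p in reads]

# Transformation A: change position 4 of RS1 to 'T', giving RS2 = A(RS1).
transformation_a = [(4, "T")]
rs2 = transform_reference(rs1, transformation_a)
after = [mismatches(r, rs2, p) for r, p in reads]

print(before)  # each read has one mismatch versus RS1
print(after)   # no mismatches versus RS2
```

Here one transformation entry replaces two per-read mismatch descriptions, which is the kind of saving the paragraph above describes; whether it pays off in practice depends on the cost of coding A itself.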

The coding of the transformation A for transmission in the compressed bitstream requires the definition of two additional descriptors as defined in the table below.

Descriptors | Semantic | Comments
rftp | Reference transformation position | Position of difference between reference and contig used for prediction
rftt | Reference transformation type | Type of difference between reference and contig used for prediction. Same syntax described for the snpt descriptor defined below in this document.

Figure 26 shows an example on how a reference transformation is applied to reduce the number of mismatches to be coded on the mapped reads.

It has to be observed that, in some cases, the transformation applied to the reference:
- may introduce mismatches in the representations of the reads that were not present when referring to the reference before applying the transformation;
- may modify the types of mismatches (a read may contain A instead of G while all other reads contain C instead of G), but mismatches remain in the same position.

Different data classes and subsets of data of each data class may refer to the same "transformed" reference sequences or to reference sequences obtained by applying different transformations to the same pre-existing reference sequence.

Figure further shows an example of how reads can change the type of coding from a data class to another by means of the appropriate set of descriptors (e.g. using the descriptors of a Class P to code a read from Class M) after a reference transformation is applied and the read is represented using the "transformed" reference. This occurs for example when the transformation changes all bases corresponding to the mismatches of a read into the bases actually present in the read, thus virtually transforming a read belonging to Class M (when referring to the original non "transformed" reference sequence) into a virtual read of Class P (when referring to the "transformed" reference). The definition of the set of descriptors used for each class of data is provided in the following sections.

Figure 30 shows how the different classes of data can use the same "transformed" reference A0(R0) (300) to re-encode the reads, or different transformations AN (301), AM (302), AI (303) can be separately applied to each class of data.

Definition of the information necessary to represent sequence reads into blocks of descriptors

Once the classification of reads is completed with the definition of the Classes, further processing consists in defining a set of distinct descriptors which represent the remaining information enabling the reconstruction of the read sequence when represented as being mapped on a given reference sequence. The data structure of these descriptors requires the storage of global parameters and metadata to be used by the decoding engine. These data are structured in a Genomic Dataset Header described in the table below. A dataset is defined as the ensemble of coding elements needed to reconstruct the genomic information related to a single genomic sequencing run and all the following analysis. If the same genomic sample is sequenced twice in two distinct runs, the obtained data will be encoded in two distinct datasets.

Element | Type | Description
Unique ID | Byte array | Unique identifier for the encoded content
Major_Brand, Minor_Version | Byte array | Major + Minor version of the encoding algorithm
Header Size | Integer | Size in bytes of the entire encoded content
Reads Length | Integer | Size of reads in case of constant reads length. A special value (e.g. 0) is reserved for variable reads length
Ref count | Integer | Number of reference sequences used
Access Units counters | Byte array (e.g. integers) | Total number of encoded Access Units per reference sequence
Ref ids | Byte array | Unique identifiers for reference sequences
for (i=0; i
Parameters set | Byte array | Encoding parameters used to configure the encoding process and sent to the decoder.

Table 1 Genomic Dataset Header structure.

A sequence read (i.e. a DNA segment) referred to a given reference sequence can be fully expressed by:
- The starting position on the reference sequence (pos).
- A flag signaling if the read has to be considered as a reverse complement versus the reference (rcomp).
- A distance to the mate pair in case of paired reads (pair).
- The value of the read length in case the sequencing technology produces variable length reads (len). In case of constant reads length the read length associated to each read can obviously be omitted and can be stored in the main file header.
- For each mismatch:
  o mismatch position (nmis for class N, snpp for class M, and indp for class I)
  o mismatch type (not present in class N, snpt in class M, indt in class I)
- Flags indicating specific characteristics of the sequence read such as
  o template having multiple segments in sequencing
  o each segment properly aligned according to the aligner
  o unmapped segment
  o next segment in the template unmapped
  o signalization of first or last segment
  o quality control failure
  o PCR or optical duplicate
  o secondary alignment
  o supplementary alignment
- Soft clipped nucleotides string when present (indc).
- Flag indicating the reference used for alignment and compression (e.g. "internal" reference for class U) if applicable (descriptor rtype).

For class U, descriptor indc identifies those parts of the reads (typically the edges) that do not match, with a specified set of matching accuracy constraints, with the "internal" references.

Descriptor ureads is used to encode verbatim the reads that cannot be mapped on any available reference, be it a pre-existing (i.e. "external", like an actual reference genome) or an "internal" reference sequence.

This classification creates groups of descriptors that can be used to univocally represent genome sequence reads. The table below summarizes the descriptors needed for each class of reads aligned with "external" (i.e. "pre-existing") or "internal" (i.e. "constructed") references.

P N M I HM pos X X X X X X pair X X X X rcomp X X X X X X flags X X X X X X rlen X X X X X X nmis X snpp X X snpt X X indp X X indt X X indc X X X ureads X X rtype rgroup X X X X X X mmap X X X X X msar X X X X X mscore X X X X X

Table 2 Defined descriptors blocks per class of data.

Reads belonging to class P are characterized and can be perfectly reconstructed by only a position, a reverse complement information and an offset between mates in case they have been obtained by a sequencing technology yielding mated pairs, some flags and a read length.

The next section further details how these descriptors are defined for classes P, N, M and I, while for class U they are described in a following section. Class HM is applied to read pairs only and it is a special case for which one read belongs to class P, N, M or I and the other to class U.

Position descriptor

In the position (pos) block only the mapping position of the first encoded read is stored as absolute value on the reference sequence. All the other position descriptors assume a value expressing the difference with respect to the previous position. Such modeling of the information source defined by the sequence of read position descriptors is in general characterized by a reduced entropy, particularly for sequencing processes generating high coverage results.

For example, figure 1 shows how after describing the starting position of the first alignment as position "10000" on the reference sequence, the position of the second read starting at position 10180 is described as "180". With high coverages (> 50x) most of the descriptors of the position vector present very high occurrences of low values such as 0 and 1 and other small integers.

Figure 1 shows how the positions of three read pairs are described in a pos Block.
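The differential coding of mapping positions can be sketched in a few lines of Python. This is a minimal sketch that assumes the reads are already sorted by mapping position; `encode_positions` and `decode_positions` are hypothetical names, and the entropy coding stage that would follow is not shown.

```python
from itertools import accumulate
from typing import List


def encode_positions(positions: List[int]) -> List[int]:
    """First position stored as absolute value, the others as differences."""
    if not positions:
        return []
    deltas = [b - a for a, b in zip(positions, positions[1:])]
    return [positions[0]] + deltas


def decode_positions(descriptors: List[int]) -> List[int]:
    """Rebuild absolute positions with a running sum."""
    return list(accumulate(descriptors))


pos_block = encode_positions([10000, 10180, 10180, 10181])
print(pos_block)                    # [10000, 180, 0, 1]
print(decode_positions(pos_block))  # [10000, 10180, 10180, 10181]
```

The first two values reproduce the example in the text (10000 followed by 180); the remaining positions are made up to show the small values that dominate at high coverage.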

Reverse complement descriptor

Each read of the read pairs produced by sequencing technologies can be originated from either genome strand of the sequenced organic sample. However, only one of the two strands is used as reference sequence. Figure 2 shows how in a read pair one read (read 1) can be originated from one strand and the other (read 2) can be originated from the other strand.

When the strand 1 is used as reference sequence, read 2 can be encoded as reverse complement of the corresponding fragment on strand 1. This is shown in figure 3.

In case of coupled reads, four are the possible combinations of direct and reverse complement mate pairs. This is shown in figure 4. The rcomp block encodes the four possible combinations.
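As an illustration, the reverse complement of a read and a compact code for the four direct/reverse combinations of a pair can be sketched as below. The numeric values returned by `rcomp_code` are an assumption made for this sketch; the text only states that the rcomp block distinguishes four combinations, not how they are numbered.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Complement every base and reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


def rcomp_code(read1_reversed: bool, read2_reversed: bool) -> int:
    """Hypothetical 2-bit code: bit 1 for read 1, bit 0 for read 2."""
    return (int(read1_reversed) << 1) | int(read2_reversed)


print(reverse_complement("ACGGT"))  # ACCGT
print(sorted(rcomp_code(a, b) for a in (False, True) for b in (False, True)))
```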

The same encoding is used for the reverse complement information of reads belonging to classes P, N, M and I. In order to enable selective access to the different data classes, the reverse complement information of reads belonging to the four classes is encoded in different blocks as depicted in Table 2.

Pairing information descriptor

The pairing descriptor is stored in the pair block. Such block stores descriptors encoding the information needed to reconstruct the originating read pairs when the employed sequencing technology produces read pairs. Although at the date of the disclosure of the invention the vast majority of sequencing data is generated using a technology generating paired reads, it is not the case of all technologies. This is the reason for which the presence of this block is not necessary to reconstruct all sequencing data information if the sequencing technology of the genomic data considered does not generate paired reads information.

Definitions:
- mate pair: read associated to another read in a read pair (e.g. Read 2 is the mate pair of Read 1 in the previous example)
- pairing distance: number of nucleotide positions on the reference sequence which separate one position in the first read (pairing anchor, e.g. last nucleotide of first read) from one position of the second read (e.g. the first nucleotide of second read)
- most probable pairing distance (MPPD): this is the most probable pairing distance expressed in number of nucleotide positions.
- position pairing distance (PPD): the PPD is a way to express a pairing distance in terms of the number of reads separating one read from its respective mate present in a specific position descriptor block.
- most probable position pairing distance (MPPPD): is the most probable number of reads separating one read from its mate pair present in a specific position descriptor block.
- position pairing error (PPE): is defined as the difference between the MPPD or the MPPPD and the actual position of the mate.
- pairing anchor: position of first read last nucleotide in a pair used as reference to calculate the distance of the mate pair in terms of number of nucleotide positions or number of read positions.

Figure 5 shows how the pairing distance among read pairs is calculated.

The pair descriptor block is the vector of pairing errors calculated as number of reads to be skipped to reach the mate pair of the first read of a pair with respect to the defined decoding pairing distance.

Figure 6 shows an example of how pairing errors are calculated, both as absolute values and as a differential vector (characterized by lower entropy for high coverages).
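The two representations of the pairing errors can be illustrated with a minimal sketch. It assumes that the pairing error is computed as the actual position pairing distance minus a fixed MPPPD chosen by the encoder; the function name, the sign convention and the sample values are hypothetical and not taken from the figures.

```python
def pairing_errors(mate_distances, mpppd):
    """Return pairing errors as (absolute, differential) vectors.

    mate_distances: for each pair, the number of reads separating the first
        read from its mate in the position descriptor block (the PPD).
    mpppd: assumed most probable position pairing distance.
    """
    # Assumed sign convention: actual distance minus most probable distance.
    absolute = [d - mpppd for d in mate_distances]
    # Differential vector: first value as-is, then differences between neighbours.
    differential = absolute[:1] + [b - a for a, b in zip(absolute, absolute[1:])]
    return absolute, differential


absolute, differential = pairing_errors([3, 4, 4, 6], mpppd=3)
```

When neighbouring pairs have similar pairing distances, the differential vector clusters around zero, which is the property the text associates with lower entropy.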

The same descriptors are used for the pairing information of reads belonging to classes N, M, P and I. In order to enable the selective access to the different data classes, the pairing information of reads belonging to the four classes is encoded in different blocks as depicted in figure 8 (class N), figures 10, 12 and 14 (class M) and figures 15 and 16 (class I).

Pairing information in case of reads mapped on different reference sequences

In the process of mapping sequence reads on a reference sequence it is not uncommon to have the first read in a pair mapped on one reference sequence (e.g. one chromosome) and the second on a different reference sequence (e.g. chromosome 4). In this case the pairing information described above has to be integrated with additional information related to the reference sequence used to map one of the reads. This is achieved by coding:

1. A reserved value (flag) indicating that the pair is mapped on two different sequences (different values indicate if read1 or read2 are mapped on the sequence that is not currently encoded).
2. A unique reference identifier referring to the reference identifiers encoded in the main header structure as described in Table 1.
3. The third element contains the mapping information on the reference identified at point 2 and expressed as offset with respect to the last encoded position.

Figure 7 provides an example of this scenario.

In figure 7, since Read 4 is not mapped on the currently encoded reference sequence, the genomic encoder signals this information by crafting additional descriptors in the pair block. In the example shown below Read 4 of pair 2 is mapped on reference no. 4 while the currently encoded reference is no. 1. This information is encoded using 3 components:

1) One special reserved value is encoded as pairing distance (in this case 0xffffff).
2) A second descriptor provides a reference ID as listed in the main header (in this case 4).
3) The third element contains the mapping information on the concerned reference (170).
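The three components can be sketched as a small helper. This is an illustrative sketch only: the function name and the input positions are hypothetical, the reserved value is the one quoted for the Figure 7 example, and the offset is assumed to be the mate position minus the last encoded position.

```python
# Reserved pairing distance quoted for the Figure 7 example.
RESERVED_OTHER_REFERENCE = 0xFFFFFF


def encode_cross_reference_pair(reference_id, mate_position, last_encoded_position):
    """Return the three pair-block values for a mate mapped on another reference."""
    # Assumption: the offset is relative to the last encoded position.
    offset = mate_position - last_encoded_position
    return [RESERVED_OTHER_REFERENCE, reference_id, offset]


# Hypothetical positions chosen so that the offset equals 170.
descriptors = encode_cross_reference_pair(4, mate_position=270, last_encoded_position=100)
```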

Mismatch descriptors for class N reads

Class N includes all reads in which only "n type" mismatches are present, i.e. at the place of an A, C, G or T base an N is found as called base. All other bases of the read perfectly match the reference sequence.

Figure 8 shows how:

- the positions of N in read 1 are coded as absolute position in read 1 or as differential position with respect to the previous N in the same read.
- the positions of N in read 2 are coded as absolute position in read 2 + read 1 length or as differential position with respect to the previous N.

In the nmis block, the coding of each reads pair is terminated by a special "separator" symbol.
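A minimal sketch of this position coding follows. The separator value and the function name are hypothetical, and the sketch assumes that in differential mode the first N of read 2 is coded relative to the last N of read 1, once read 2 positions have been shifted by the length of read 1.

```python
SEPARATOR = -1  # hypothetical separator symbol terminating each read pair


def encode_nmis(n_pos_read1, n_pos_read2, read1_length, differential=False):
    """Encode the positions of N bases of one read pair for the nmis block."""
    # Read 2 positions are shifted by the length of read 1.
    positions = sorted(n_pos_read1) + [p + read1_length for p in sorted(n_pos_read2)]
    if differential:
        # First position is absolute, the others are distances from the previous N.
        values = positions[:1] + [b - a for a, b in zip(positions, positions[1:])]
    else:
        values = positions
    return values + [SEPARATOR]
```

For a pair with N at positions 2 and 5 of a 10-base read 1 and at position 1 of read 2, the absolute form is `[2, 5, 11, -1]` and the differential form is `[2, 3, 6, -1]`.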

Descriptors coding Substitutions (Mismatches or SNPs), Insertions and Deletions

A substitution is defined as the presence, in a mapped read, of a different nucleotide base with respect to the one that is present in the reference sequence at the same position.

Figure 9 shows examples of substitutions in a mapped read pair. Each substitution is encoded as "position" (snpp block) and "type" (snpt block). Depending on the statistical occurrence of substitutions, insertions or deletions, different source models of the associated descriptors can be defined and the generated symbols coded in the associated block.

Source model 1: Substitutions as Positions and Types

Substitutions Positions Descriptors

A substitution position is calculated like the values of the nmis block, i.e.

- In read 1 substitutions are encoded as absolute position in read 1 or as differential position with respect to the previous substitution in the same read.
- In read 2 substitutions are encoded as absolute position in read 2 + read 1 length or as differential position with respect to the previous substitution.

Figure 10 shows how substitutions (where, at a given mapping position, a symbol in a read is different from the symbol in the reference sequence) are coded as:

1. the position of the mismatch with respect to the beginning of the read or with respect to the previous mismatch (differential encoding)
2. the type of mismatch represented as a code calculated as described in Figure 10

In the snpp block, the coding of each reads pair is terminated by a special "separator" symbol.

Substitutions Types Descriptors

For class M (and I as described in the next sections), mismatches are coded by an index (moving from right to left) from the actual symbol present in the reference to the corresponding substitution symbol present in the read {A, C, G, T, N, Z}. For example if the aligned read presents a C instead of a T which is present at the same position in the reference, the mismatch index will be denoted as "4". The decoding process reads the encoded descriptor, the nucleotide at the given position on the reference and moves from left to right to retrieve the decoded symbol. E.g. a "2" received for a position where a G is present in the reference will be decoded as N. Figure 11 shows all the possible substitutions and the respective encoding symbols. Different and context adaptive probability models can be assigned to each substitution index according to the statistical properties of each substitution type for each data class to minimize the entropy of the descriptors.
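One way to read this index scheme is as a circular distance on the substitution vector. The sketch below makes that assumption explicit; the function names are hypothetical. Under this reading a C in place of a reference T yields index 4, and index 2 received where the reference holds a G decodes to N.

```python
SUBSTITUTION_VECTOR = ["A", "C", "G", "T", "N", "Z"]


def encode_substitution(ref_base, read_base, vector=SUBSTITUTION_VECTOR):
    """Circular distance from the reference symbol to the read symbol (assumed model)."""
    return (vector.index(read_base) - vector.index(ref_base)) % len(vector)


def decode_substitution(ref_base, index, vector=SUBSTITUTION_VECTOR):
    """Move `index` steps from the reference symbol, wrapping around the vector."""
    return vector[(vector.index(ref_base) + index) % len(vector)]
```

Because the vector is passed as a parameter, the same two functions can be reused with a longer vector such as one extended with IUPAC ambiguity codes.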

In case of adoption of the IUPAC ambiguity codes the substitution mechanism is exactly the same, however the substitution vector is extended as: S = {A, C, G, T, N, Z, M, R, W, S, Y, K, V, H, D, B}.

Figure 12 provides an example of encoding of substitutions types in the snpt block. Some examples of substitutions encoding when IUPAC ambiguity codes are adopted are provided in Figure 13. A further example of substitution indexes is provided in Figure 14.

Encoding of insertions and deletions

For class I, mismatches and deletions are coded by an index (moving from right to left) from the actual symbol present in the reference to the corresponding substitution symbol present in the read: {A, C, G, T, N, Z}. For example if the aligned read presents a C instead of a T present at the same position in the reference, the mismatch index will be "4". In case the read presents a deletion where an A is present in the reference, the coded symbol will be "5". The decoding process reads the coded descriptor, the nucleotide at the given position on the reference and moves from left to right to retrieve the decoded symbol. E.g. a "2" received for a position where a G is present in the reference will be decoded as N. Inserts are coded as 6, 7, 8, 9, 10, respectively for inserted A, C, G, T, N.

Figure 15 shows an example of how to encode substitutions, inserts and deletions in a reads pair of class I. In order to support the entire set of IUPAC ambiguity codes, the substitution vector S = {A, C, G, T, N, Z} shall be replaced by S = {A, C, G, T, N, Z, M, R, W, S, Y, K, V, H, D, B} as described in the previous paragraph for mismatches. In this case the insertion codes need to have different values, namely 16, 17, 18, 19, 20 in case the substitution vector has 16 elements. The mechanism is illustrated in Figure 16.
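The class I codes can be sketched by extending the circular index with a deletion symbol Z and with insertion codes that start right after the last index of the substitution vector. This is a sketch under the assumption that insertion codes are the vector length plus the position of the inserted base in {A, C, G, T, N}, which reproduces both the 6 to 10 and the 16 to 20 ranges quoted in the text; all names are hypothetical.

```python
BASE_VECTOR = ["A", "C", "G", "T", "N", "Z"]
IUPAC_VECTOR = BASE_VECTOR + ["M", "R", "W", "S", "Y", "K", "V", "H", "D", "B"]
INSERTABLE = ["A", "C", "G", "T", "N"]


def encode_edit(kind, ref_base=None, read_base=None, vector=BASE_VECTOR):
    """Return the code for a substitution ('sub'), deletion ('del') or insertion ('ins')."""
    if kind == "ins":
        # Assumed rule: insertion codes follow the last substitution index.
        return len(vector) + INSERTABLE.index(read_base)
    if kind == "del":
        read_base = "Z"  # Z marks a deleted base
    if kind not in ("sub", "del"):
        raise ValueError(f"unknown edit kind: {kind}")
    return (vector.index(read_base) - vector.index(ref_base)) % len(vector)
```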

Source model 2: one block per substitution type and indels

For some data statistics a coding model different from the one described in the previous section can be developed for substitutions and indels, resulting in a source with lower entropy. Such coding model is an alternative to the techniques described above for mismatches only and for mismatches and indels.

In this case one data block is defined for each possible substitution symbol (one per symbol of the reduced alphabet without IUPAC codes, 16 with IUPAC codes), plus one block for deletions and 4 more blocks for insertions. For simplicity of the explanation, but not as a limitation for the application of the model, the following description will focus on the case where no IUPAC codes are supported.

Figure 17 shows how each block contains the position of the mismatches or inserts of a single type. If no mismatches or inserts of that type are present in the encoded read pair, a 0 is encoded in the corresponding block. To enable the decoder to start the decoding process for the blocks described in this section, the header of each Access Unit contains a flag signaling the first block to be decoded. In the example of figure 18 the first element to be decoded is position 2 in the C block. When no mismatches or indels of a given type are present in a read pair, a 0 is added to the corresponding blocks. On the decoding side, when the decoding pointer for each block points to a value of 0 the decoding process moves to the next read pair.

Encoding of additional signaling flags

Each data class introduced above (P, M, N, I) may require the encoding of additional information on the nature of the encoded reads. This information may be related for example to the sequencing experiment (e.g. indicating a probability of duplication of one read) or can express some characteristic of the read mapping (e.g. first or second in pair). In the context of this invention this information is encoded in a separate block for each data class. The main advantage of such approach is the possibility to selectively access this information only in case of need and only in the required reference sequence region. Other examples of the use of such flags are:

- read paired
- read mapped in proper pair
- read or mate unmapped
- read or mate from reverse strand
- first/second in pair
- not primary alignment
- read fails platform/vendor quality checks
- read is PCR or optical duplicate
- supplementary alignment

Descriptors for class U and construction of "internal" references for unmapped reads of "Class U" and "Class HM"

In the case of the reads belonging to Class U or the unmapped pair of Class HM, since they cannot be mapped to any "external" reference sequence satisfying the specified set of matching accuracy constraints for belonging to any of the classes P, N, M, or I, one or more "internal" reference sequences are "constructed" and used for the compressed representation of the reads belonging to these data classes.

Several approaches are possible to construct appropriate "internal" references such as for instance and not as limitation:

- the partitioning of the unmapped reads into clusters containing reads that share a common contiguous genomic sequence of at least a minimal size (signature). Each cluster can be uniquely identified by its signature as shown in figure 22.
- the sorting of reads in any meaningful order (e.g. lexicographic order) and the use of the last N reads as "internal" reference for the encoding of the N+1. This method is shown in figure 23.
- performing a so called "de-novo assembly" on a subset of the reads of class U so as to be able to align and encode all or a relevant sub-set of the reads belonging to said class according to the specified matching accuracy constraints or a new set of constraints.

If the read being coded can be mapped on the "internal" reference satisfying the specified set of matching accuracy constraints, the information necessary to reconstruct the read after compression is coded using descriptors that can be of the following types:

1. Start position of the matching portion on the internal reference in terms of read number in the internal reference (pos block). This position can be encoded either as absolute or differential value with respect to the previously encoded read.
2. Offset of the start position from the beginning of the corresponding read in the internal reference (pair block). E.g. in case of constant read length the actual position is pos * length + pair.
3. Possibly present mismatches coded as mismatch position (snpp block) and type (snpt block).
4. Those parts of the reads (typically the edges identified by pair) that do not match with the internal reference (or do so, but with a number of mismatches above a defined threshold) are encoded in the indc block. A padding operation can be performed to the edges of the part of the internal reference used in order to reduce the entropy of the mismatches encoded in the indc block, as shown in figure 24. The most appropriate padding strategy can be chosen by the encoder according to the statistical properties of the genomic data being processed. Possible padding strategies include:
   a. No padding
   b. Constant padding pattern chosen according to its frequency in the currently encoded data.
   c. Variable padding pattern according to the statistical properties of the current context defined in terms of the latest N encoded reads.
   The specific type of padding strategy will be signaled by special values in the indc block header.
5. A flag that indicates if the read has been encoded using an internal self-generated, external or no reference (rtype block).
6. Reads which are encoded verbatim (ureads).
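For the constant read length case, the relation between the pos and pair descriptors and the actual position on the internal reference can be sketched as follows. The function names are hypothetical and the internal reference is assumed to be the plain concatenation of equally long reads.

```python
def position_on_internal_reference(pos, pair, read_length):
    """Actual nucleotide position: pos * length + pair (constant read length)."""
    return pos * read_length + pair


def split_position(actual_position, read_length):
    """Inverse mapping: recover (pos, pair) from an actual position."""
    # divmod returns (read number in the internal reference, offset within that read)
    return divmod(actual_position, read_length)
```

With 100-base reads, a match starting 7 bases into the fourth read of the internal reference (pos = 3, pair = 7) corresponds to actual position 307, and `split_position(307, 100)` returns the original pair of descriptors.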

Figure 24 provides an example of such coding procedure.

Figure 25 shows an alternative encoding of unmapped reads on the internal reference where pos + pair descriptors are replaced by a signed pos. In this case pos would express the distance in terms of positions on the reference sequence of the left most nucleotide position of read n with respect to the position of the left most nucleotide of read n-1.

In case reads of class U present variable length, an additional descriptor rlen is used to store each read length.

This coding approach can be extended to support N start positions per read so that reads can be split over two or more reference positions. This can be particularly useful to encode reads generated by those sequencing technologies (e.g. from Pacific Bioscience) producing very long reads (50K+ bases) which usually present repeated patterns generated by loops in the sequencing methodology. The same approach can be used as well to encode chimeric sequence reads defined as reads that align to two distinct portions of the genome with little or no overlap.

The approach described above can be clearly applied beyond the simple class U and could be applied to any block containing descriptors related to reads positions (pos blocks).

Alignment score descriptor

The mscore descriptor provides a score per alignment. In the context of this invention it is used to represent the mapping/alignment score per read generated by genomic sequence reads aligners.

The score is expressed using an exponent and fractional part. The number of bits used to represent the exponent and the fractional part are transmitted as configuration parameters. As an example, but not as a limitation, Table 2 shows how this is specified in IEEE 754 for an 11-bit exponent and a 52-bit fractional part.

The score of each alignment can be represented by:

- One sign bit (S)
- 11 bits for the exponent (E)
- 53 bit for the mantissa (M)

  1     11          52
 +-+---------+------------+
 |S|   Exp   |  Mantissa  |
 +-+---------+------------+
 63 62       51           0

Table 2. Alignment scores can be expressed as 64-bit double precision floating point values

The base (radix) to be used for the calculation of scores is 10, therefore:

$\text{score} = (-1)^{S} \times 10^{E} \times M$

Reads groups

During the sequencing process different types of sequenced reads can be produced. As an example but not as a limitation types can be related to different sequenced samples, different experiments, different configuration of the sequencing machine. After sequencing and alignment this information is preserved, according to the disclosed invention, by means of a dedicated descriptor named rgroup. rgroup is a label associated to each encoded read and enables a decoding apparatus to partition the decoded reads in groups after decoding.
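The score formula can be evaluated directly from the three fields. The sketch below is only an illustration of the arithmetic with radix 10 as stated in the text; the function name is hypothetical and the fields are passed as already decoded numbers rather than extracted from a 64-bit word.

```python
def alignment_score(sign, exponent, mantissa):
    """Evaluate score = (-1)^S * 10^E * M for already decoded fields."""
    return (-1) ** sign * 10 ** exponent * mantissa


positive = alignment_score(0, 2, 1.5)   # 150.0
negative = alignment_score(1, 0, 2.0)   # -2.0
```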

Descriptors for multiple alignments

The following descriptors are specified for the support of multiple alignments. In case of presence of spliced reads, this invention defines a global flag spliced_reads_flag to be set.

mmap descriptor

The mmap descriptor is used to signal on how many positions the read or the left-most read of a pair has been aligned. A Genomic Record containing multiple alignments is associated with one multi-byte mmap descriptor. The first two bytes of a mmap descriptor represent an unsigned integer N which refers to the read as a single segment (if no splices are present in the encoded dataset) or instead to all the segments into which the read has been spliced for the several possible alignments (if splices are present in the dataset). The value of N says how many values of the pos descriptor are coded for the template in this record. N is followed by one or more unsigned integers Mi as described below.

Multiple alignments strandedness The rcomp descriptor described in this invention is used to specify the strandedness of each read alignment using the syntax specified in this invention.

Scores of multiple alignments

In case of multiple alignments one mscore as specified in this invention is assigned to each alignment.

Multiple alignments without splices

If no splices are present in the Access Unit, spliced_reads_flag is unset.

In paired-end sequencing, the mmap descriptor is composed of a 16-bit unsigned integer N followed by one or more 8-bit unsigned integers Mi, with i assuming values from 1 to the number of complete first (here, the left-most) read alignments. For each first read alignment, spliced or not, Mi is used to signal how many segments are used to align the second read (in this case, without splices, this is equal to the number of alignments), and then how many values of the pair descriptor are coded for that alignment of the first read.

The values of Mi shall be used to calculate $P = \sum_{i} M_i$ which indicates the number of alignments of the second read.

A special value of Mi (Mi = 0) indicates that the i-th alignment of the left-most read is paired with an alignment of the right-most read which is already paired with a k-th alignment of the left-most read with k < i (then there is no new alignment detected, which is consistent with the equation above).

As an example, in the simplest cases:

1. If there is a single alignment for the left-most read and two alternative alignments for the right-most, N will be 1 and M1 will be 2.
2. If two alternative alignments are detected for the left-most read but only one for the right-most, N will be 2, M1 will be 1 and M2 will be 0.

When Mi is 0, the associated value of pair shall link to an existing second read alignment; a syntax error will be raised otherwise and the alignment considered broken.

Example: if the first read has two mapping positions and the second read only one, N is 2, M1 is 1 and M2 is 0 as said earlier. If this is followed by another alternative secondary mapping for the entire template, N will be 3, and M3 will be 1.

Figure 39 illustrates the meaning of N, P and Mi in case of multiple alignments without splices and Figure 40 shows how the pos, pair and mmap descriptors are used to encode the multiple alignments information.
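The values N, Mi and P for the unspliced paired-end case can be derived from a list of template alignments. The sketch below is an illustration under stated assumptions: each template alignment is given as a pair of hypothetical identifiers (first read alignment, second read alignment) in encoding order, and Mi counts only the second read alignments that were not already introduced by an earlier first read alignment.

```python
def mmap_values(template_alignments):
    """Return (N, [M1..MN], P) for unspliced paired-end multiple alignments.

    template_alignments: list of (first_read_alignment, second_read_alignment)
    identifiers in encoding order (hypothetical input format).
    """
    first_order = []      # first read alignments in order of appearance
    new_mates = {}        # first read alignment -> count of newly seen mates
    seen_second = set()   # second read alignments already introduced
    for first, second in template_alignments:
        if first not in new_mates:
            first_order.append(first)
            new_mates[first] = 0
        if second not in seen_second:
            seen_second.add(second)
            new_mates[first] += 1
    m_values = [new_mates[f] for f in first_order]
    return len(first_order), m_values, sum(m_values)
```

For instance, two first read alignments sharing a single second read alignment give `(2, [1, 0], 1)`, and adding a further template alignment with a new mate gives `(3, [1, 0, 1], 2)`.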

With respect to Figure 40 the following applies:

- The right-most read has $P = \sum_{i} M_i$ alignments.
- Some values of Mi can be 0 when the i-th alignment of the left-most read is paired with an alignment of the right-most read that is already paired with a k-th alignment of the left-most read with k < i.
- One reserved value of the pair descriptor can be present to signal alignments belonging to other AUs ranges. If present it is always the first pair descriptor for the current record.

Multiple alignments with splices

If the dataset is encoded with spliced reads, the msar descriptor enables representation of splices length and strandedness. After having decoded the mmap and the msar descriptors, the decoder knows how many reads or read pairs have been encoded to represent the multiple mappings and how many segments are composing each read or read pair mapping. This is shown in Figure 41 and Figure 42.

With reference to Figure 41 the following applies:

- The left-most read has $N_1$ alignments with $N$ splices ($N_1 \le N$).
- $N$ represents the number of splices present in all alignments of the left-most read and it is encoded as first value of the mmap descriptor.
- The right-most read has $P = \sum_{i} M_i$ splices, where $M_i$ is the number of splices of the right-most read which are associated in a pair with the i-th alignment of the left-most read ($1 \le i \le N_1$). In other words $P$ represents the number of splices of the right-most read and is calculated using the $N$ values following the first value of the mmap descriptor.
- $N_1$ and $N_2$ represent the number of alignments of the first and second read and are calculated using the $N + P$ values of the msar descriptor.

With reference to Figure 42 the following applies:

- The left-most read has $N_1$ alignments with $N$ splices ($N_1 \le N$). If $N_1 = N$ and $N_2 = P$ no splices would be present.
- The right-most read has $P = \sum_{i} M_i$ splices and $N_2$ ($N_2 \le P$) alignments.
- The number of pair descriptors can be calculated as $NP = \max(N_1, N_2) + M_0$, where $M_0$ is the number of $M_i$ with value 0.
- $NP$ has to be incremented by 1 in case one special pair descriptor indicates the presence of alignments in other AUs.

Alignment score

The mscore descriptor allows signaling the mapping score of an alignment. In single-end sequencing it will have $N_1$ values per template; in paired-end sequencing it will have a value for each alignment of the entire template (number of different alignments of the first read, possibly plus the number of further second read alignments, i.e. when $M_i - 1 > 0$).

Number of scores $= \max(N_1, N_2) + M_0$, where $M_0$ represents the total number of $M_i = 0$.

In this invention more than one score value can be associated to each alignment. The number of alignments is signaled by a configuration parameter as_depth.

Descriptors for multiple alignments without splices

Semantic | mmap | mscore | Effect
Read (paired-end read) with multiple mappings, not spliced | Single read: N. Read pair: N, Mi where 1 ≤ i ≤ N | Single read: N values. Read pair: MAX(N, Σ(Mi)) + Σ(Mi == 0) values. Introducing a separator would enable having an arbitrary number of scores. Otherwise a 0 should be used if not present. These are floating point values as specified in this invention. | Single read: the read has multiple mappings and is encoded as a sequence of N consecutive segments belonging to the class with the highest ID. N pos descriptors are used. Read pair: the read pair has multiple mappings and is encoded as a sequence of N segments for the first read and P = Σ(Mi) pairings to the alignments of the second read. N pos descriptors are used. NP pair descriptors are used (NP as defined above). The optional pairing descriptor is used when alignments are present on different reference sequences than the one currently encoded. N+1 mmap descriptors.
Read (pair) uniquely mapped | 0 | 0 |

Table 3. Determination of the number of descriptors needed to represent multiple alignments in one Genomic Record in case of multiple alignments without splices.

Descriptors for multiple alignments with splices

Table 4 shows the determination of the number of descriptors needed to represent multiple alignments in one Genomic Record in case of multiple alignments with splices.

Semantic | mmap | mscore | Effect
Read (paired-end read) with multiple mappings, with splices | Single read: N. Read pair: N, Mi where 1 ≤ i ≤ N | Single read: N1 values. Read pair: MAX(N1, N2) + Σ(Mi == 0) values. N1 and N2 are calculated using the N + P msar descriptors, where P = Σ(Mi). These are floating point values as described in this invention. | Single read: the read has multiple mappings and it is encoded as a sequence of N consecutive segments belonging to the class with the highest ID. N pos descriptors are used. Read pair: the read pair has multiple mappings and it is encoded as a sequence of N segments for the first read and P = Σ(Mi) pairings to the alignments of the second read. N pos descriptors are used.
Read (pair) uniquely mapped | 0 | 0 |

Table 4. Descriptors used to represent multiple alignments and associated scores.

Multiple alignments on different sequences

It may happen that the alignment process finds alternative mappings to another reference sequence than the one where the primary mapping is positioned.

For read pairs that are uniquely aligned, a pair descriptor shall be used to represent the absolute read positions when there is for example a chimeric alignment with the mate on another chromosome. The pair descriptor shall be used to signal the reference and the position of the next record containing further alignments for the same template. The last record (e.g. the third if alternative mappings are coded in 3 different AUs) shall contain the reference and position of the first record.

In case one or more alignments for the left-most read in a pair are present on a different reference sequence than the one related to the currently encoded AU, then a reserved value is used for the pair descriptor. The reserved value is followed by the reference sequence identifier and the position of the left-most alignment among all those contained in the next AU (i.e. the first decoded value of the pos descriptor for that record).

Multiple alignments with insertions, deletions, unmapped portions

When an alternative secondary mapping does not preserve the contiguity of the reference region where the sequence is aligned, it may be impossible to reconstruct the exact mapping generated by the aligner because the actual sequence (and then the descriptors related to mismatches such as substitutions or indels) is only coded for the primary alignment. The msar descriptor shall be used to represent how secondary alignments map on the reference sequence in case they contain indels and/or soft clips. If msar is represented by the special symbol for a secondary alignment, the decoder will reconstruct the secondary alignment from the primary alignment and the secondary alignment mapping position.

msar descriptor

The msar (Multiple Segments Alignment Record) descriptor supports spliced reads and alternative secondary alignments that contain indels or soft clips. msar is intended to convey information on:

- a mapped segment length
- a different mapping contiguity (i.e. presence of insertions, deletions or clipped bases) for a secondary alignment and/or spliced read

msar uses the syntax of the extended CIGAR string described below plus the additional symbol described in Table 5.

Table 5. Special symbol used for the msar descriptor in addition to the syntax described in table 6.

Symbol | Semantics | Description
 | The secondary alignment does not contain indels or soft clips | This is used when the reconstruction of a secondary alignment does not require additional information than the primary alignment and the alignment position

Extended cigar syntax

This section specifies an extended CIGAR (E-CIGAR) syntax for strings to be associated to sequences and related mismatches, indels, clipped bases and information on multiple alignments and spliced reads.

Edit operations described in this invention are listed in Table 6.

Operation | Semantics | E-CIGAR representation | Equivalent SAM CIGAR representation
Increment both pointer-to-reference R and pointer-to-read r by n positions (match) | n matching bases | | nM in older versions (not equivalent), n= in recent versions
Replace nucleotide in the read with base C from the reference, increment pointer-to-reference R and pointer-to-read r by 1 | substitution of character C (C is present in the read and not in the reference) | | M in older versions, X in recent versions (not equivalent)
Increment pointer-to-read r by n positions (insert from the read) | n bases are inserted in the read (not present in the reference) | n+ | nI
Increment pointer-to-reference R by n positions (deletion of sequence in the read) | n bases are deleted in the read (but present in the reference) | | nD
Increment pointer-to-reference R by n positions (insertion in the read). Can occur only at beginning or end of read | n soft clips | | nS
Hard trim. Can occur only at beginning or end of read | n hard clips | | nH
Increment pointer-to-reference R by n positions, splice consensus observed (splice in the read) | An undirected splice of n bases | | nN
Increment pointer-to-reference R by n positions, splice consensus observed on the forward strand (forward splice in the read) | A forward splice of n bases | n/ | Not existing
Increment pointer-to-reference R by n positions, splice consensus observed on the reverse strand (reverse splice in the read) | A reverse splice of n bases | n% | Not existing

Table 6. Syntax of the MPEG-G E-CIGAR string.
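The pointer bookkeeping behind such edit strings can be illustrated on the equivalent SAM CIGAR operations listed in the last column of Table 6. The sketch below does not parse E-CIGAR symbols; it applies the standard SAM convention for which operations advance the reference pointer and which advance the read pointer, and the function name is hypothetical.

```python
import re

# SAM convention: which operations advance (reference pointer, read pointer).
CONSUMES = {
    "M": (True, True), "=": (True, True), "X": (True, True),
    "I": (False, True), "S": (False, True),
    "D": (True, False), "N": (True, False),
    "H": (False, False),
}


def pointer_advance(sam_cigar):
    """Return how far the reference and read pointers move for a SAM CIGAR string."""
    reference = read = 0
    for length, op in re.findall(r"(\d+)([MIDNSHX=])", sam_cigar):
        n = int(length)
        advances_reference, advances_read = CONSUMES[op]
        if advances_reference:
            reference += n
        if advances_read:
            read += n
    return reference, read
```

A splice such as `5M100N5M` moves the reference pointer by 110 positions while the read pointer moves by only 10, which is why splices need an explicit length in the string.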

Source models, entropy coders and coding modes

For each data class, sub-class and associated descriptor block of the genomic data structure disclosed in this invention different coding algorithms may be adopted according to the specific features of the data or metadata carried by each block and its statistical properties. The "coding algorithm" has to be intended as the association of a specific "source model" of the descriptor block with a specific "entropy coder". The specific "source model" can be specified and selected to obtain the most efficient coding of the data in terms of minimization of the source entropy.

The selection of the entropy coder can be driven by coding efficiency considerations and/or probability distribution features and associated implementation issues. Each selection of a specific "coding algorithm", also referred to as "coding mode", can be applied to an entire "descriptor block" associated to a data class or sub-class for the entire data set, or different "coding modes" can be applied for each portion of descriptors partitioned into Access Units.

Each "source model" associated to a coding mode is characterized by:

- The definition of the descriptors emitted by each source (i.e. the set of descriptors used to represent a class of data such as reads position, reads pairing information, mismatches with respect to a reference sequence as defined in Table 2).
- The definition of the associated probability model.
- The definition of the associated entropy coder.

Further advantages

The classification of sequence data into the defined data classes and sub-classes permits the implementation of efficient coding modes exploiting the lower information source entropy characterizing the sequences of descriptors by modelling them as single separate data sources (e.g. distance, position, etc.).

Another advantage of the invention is the possibility to access only the subset of type of data of interest. For example one of the most important applications in genomics consists in finding the differences of a genomic sample with respect to a reference (SNV) or a population (SNP).

Today such type of analysis requires the processing of the complete sequence reads whereas by adopting the data representation disclosed by the invention the mismatches are already isolated into one to three data classes only (depending on the interest in considering also "n type" and "i type" mismatches).

A further advantage is the possibility of performing efficient transcoding from data and metadata compressed with reference to a specific "external" reference sequence to another different "external" reference sequence when new reference sequences are published or when re-mapping is performed on the already mapped data (e.g. using a different mapping algorithm) obtaining new alignments.

Figure 20 shows an encoding apparatus 207 according to the principles of this invention. The encoding apparatus 207 receives as input a raw sequence data 209, for example produced by a genome sequencing apparatus 200. Genome sequencing apparatus 200 are known in the art, like the Illumina HiSeq 2500 or the Thermo-Fisher Ion Torrent devices. The raw sequence data 209 is fed to an aligner unit 201, which prepares the sequences for encoding by aligning the reads to a reference sequence 2020. Alternatively, a dedicated module 202 can be used to generate a reference sequence from the available reads by using different strategies as described in this document in section "Construction of internal references for unmapped reads of Class U and Class HM". After having been processed by the reference generator 202, reads can be mapped on the obtained longer sequence. The aligned sequences are then classified by data classification module 204. A further step of reference transformation is then applied on the reference in order to reduce the entropy of the data generated by the data classification unit 204. This implies processing the external reference 2020 into a reference transformation unit 2019 which produces transformed data classes 2018 and reference transformation descriptors 2021. The transformed data classes 2018 are then fed to blocks encoders 205-207 together with the reference transformation descriptors 2021. The genomic blocks 2011 are then fed to arithmetic encoders 2012-2014 which encode the blocks according to the statistical properties of the data or metadata carried by the block. The result is a genomic stream 2015.

Figure 21 shows a decoding apparatus 218 according to the principles of this disclosure. A decoding apparatus 218 receives a multiplexed genomic bitstream 2110 from a network or a storage element. The multiplexed genomic bitstream 2110 is fed to a demultiplexer 210, to produce separate streams 211 which are then fed to entropy decoders 212-214, to produce genomic blocks 215 and reference transformation descriptors 2112. The extracted genomic blocks are fed to block decoders 216-217 to further decode the blocks into classes of data, and the reference transformation descriptors are fed to a reference transformation unit 2113. Class decoders 219 further process the genomic descriptors 2111 and the transformed reference 2114, and merge the results to produce uncompressed reads of sequences, which can then be further stored in the formats known in the art, for instance a text file or zip compressed file, or FASTQ or SAM/BAM files.

Class decoders 219 are able to reconstruct the original genomic sequences by leveraging the information on the original reference sequences carried by one or more genomic streams and the reference transformation descriptors 2112 carried in the encoded bitstream. In case the reference sequences are not transported by the genomic streams, they must be available at the decoding side and accessible by the class decoders.

The inventive techniques herewith disclosed may be implemented in hardware, software, firmware or any combination thereof. When implemented in software, these may be stored on a computer medium and executed by a hardware processing unit. The hardware processing unit may comprise one or more processors, digital signal processors, general purpose microprocessors, application specific integrated circuits or other discrete logic circuitry.

The techniques of this disclosure may be implemented in a variety of devices or apparatuses, including mobile phones, desktop computers, servers, tablets and similar devices.

File Format: Selective Access to Regions of Genomic Data Using the Master Index Table

In order to support selective access to specific regions of the aligned data, the data structure described in this document implements an indexing tool called Master Index Table (MIT). This is a multi-dimensional array containing the loci at which specific reads map on the associated reference sequences. The values contained in the MIT are the mapping positions of the first read in each pos block so that non-sequential access to each Access Unit is supported. The MIT contains one section per each class of data (P, N, M, I, U and HM) and per each reference sequence. The MIT is contained in the Genomic Dataset Header of the encoded data. Figure 31 shows the structure of the Genomic Dataset Header, figure 32 shows a generic visual representation of the MIT and figure 33 shows an example of MIT for the class P of encoded reads.

The values contained in the MIT depicted in figure 33 are used to directly access the region of interest (and the corresponding AU) in the compressed domain. For example, with reference to figure 33, if it is required to access the region comprised between position 150,000 and 250,000 on reference 2, a decoding application would skip to the second reference in the MIT and would look for the two values k1 and k2 so that k1 < 150,000 and k2 > 250,000, where k1 and k2 are 2 indexes read from the MIT. In the example of figure 33 this would result in the 3rd and 4th positions of the second vector of the MIT. These returned values will then be used by the decoding application to fetch the positions of the appropriate data from the pos block Local Index Table as described in the next section.
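The lookup described above can be illustrated with a minimal Python sketch. It is not taken from the disclosure: the dictionary layout, the name `mit_lookup` and the numeric values in `EXAMPLE_MIT` are assumptions chosen so that the query on reference 2 yields the 3rd and 4th positions, and only one class of reads is modelled.

```python
from bisect import bisect_left, bisect_right

# Hypothetical MIT section for one class of reads: reference number ->
# sorted mapping positions of the first read of each Access Unit.
EXAMPLE_MIT = {
    1: [0, 40_000, 90_000, 130_000, 210_000],
    2: [0, 60_000, 120_000, 260_000, 400_000],
}


def mit_lookup(mit, reference, start, end):
    """Return the 1-based positions (k1, k2) bracketing [start, end]."""
    starts = mit[reference]
    # Number of entries below `start` equals the 1-based position of the
    # last such entry; clamp to 1 when the region begins before the first AU.
    k1 = max(bisect_left(starts, start), 1)
    # 1-based position of the first entry above `end`, clamped to the
    # length of the vector when the region extends past the last AU.
    k2 = min(bisect_right(starts, end) + 1, len(starts))
    return k1, k2


print(mit_lookup(EXAMPLE_MIT, 2, 150_000, 250_000))  # (3, 4)
```

Binary search is used here only because the positions are assumed to be stored in ascending order; a linear scan would return the same pair.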

Together with pointers to the block containing the data belonging to the four classes of genomic data described above, the MIT can be used as an index of additional metadata and/or annotations added to the genomic data during its life cycle.

Local Index Table

Each genomic data block is prefixed with a data structure referred to as local header. The local header contains a unique identifier of the block, a vector of Access Units counters per each reference sequence, a Local Index Table (LIT) and optionally some block specific metadata.

The LIT is a vector of pointers to the physical position of the data belonging to each Access Unit in the block payload. Figure 34 depicts the generic block header and payload, where the LIT is used to access specific regions of the encoded data in a non-sequential way.

In the previous example, in order to access region 150,000 to 250,000 of reads aligned on the reference sequence no. 2, the decoding application retrieved positions 3 and 4 from the MIT. These values shall be used by the decoding process to access the 3rd and 4th elements of the corresponding section of the LIT. In the example shown in figure 35, the Total Access Units counters contained in the block header are used to skip the LIT indexes related to AUs related to reference 1 (5 in the example). The indexes containing the physical positions of the requested AUs in the encoded stream are therefore calculated as:

position of the data blocks belonging to the requested AU = data blocks belonging to AUs of reference 1 to be skipped + position retrieved using the MIT, i.e.

First block position: 5 + 3 = 8
Last block position: 5 + 4 = 9

The blocks of data retrieved using the indexing mechanism called Local Index Table are part of the Access Units requested.
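The index arithmetic of this example (5 Access Units of reference 1 to skip, positions 3 and 4 retrieved from the MIT) can be sketched in Python as follows. The function names, the counter list and the byte offsets stored in `LIT` are hypothetical and serve only to illustrate the calculation.

```python
def lit_positions(au_counters, reference, k1, k2):
    """Map MIT positions (k1, k2) of a reference to 1-based LIT indexes."""
    # Access Units of all preceding references must be skipped.
    skipped = sum(au_counters[:reference - 1])
    return skipped + k1, skipped + k2


def fetch_pointers(lit, first, last):
    """Return the LIT pointers for 1-based indexes first..last inclusive."""
    return lit[first - 1:last]


AU_COUNTERS = [5, 6]                 # hypothetical: 5 AUs on reference 1, 6 on reference 2
LIT = [i * 1024 for i in range(11)]  # hypothetical byte offsets, one per AU

first, last = lit_positions(AU_COUNTERS, 2, 3, 4)
print(first, last)                       # 8 9
print(fetch_pointers(LIT, first, last))  # [7168, 8192]
```

The per-reference counters are summed rather than hard-coded so that the same calculation applies to any reference number, under the assumption that LIT entries are grouped by reference in the same order as the counters.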

Figure 36 shows how the blocks contained in the MIT table correspond to blocks of the LIT per each class or sub-class of data.

Figure 37 shows how the data blocks retrieved using the MIT and the LIT compose one or more Access Units as defined in the following section.

In an embodiment of this invention, the LIT can be integrated as a substructure of the MIT. The advantage of such approach is the speed of access to the indexed data in case of sequential parsing of the compressed file. If the LIT is integrated in the MIT in the file header, a decoding device would only need to parse a small portion of compressed data to retrieve the requested information in case of selective access. Another advantage is evident, to a person skilled in the art, in case of streaming on a network, when the indexing information contained in the MIT and LIT would be delivered among the first data blocks, therefore enabling the receiving device to perform operations such as sorting and selective access before the entire data transfer is completed.

Access Units

The genomic data classified in data classes and structured in compressed or uncompressed blocks are organized into different Access Units.

Genomic Access Units (AU) are defined as sections of genome data (in a compressed or uncompressed form) that reconstruct nucleotide sequences and/or the relevant metadata, and/or sequence of DNA/RNA (e.g. the virtual reference) and/or annotation data generated by a genome sequencing machine and/or a genomic processing device or analysis application. An example of Access Unit is provided in figure 37.

An Access Unit is a block of data that can be decoded either independently from other Access Units by using only globally available data (e.g. decoder configuration) or by using information contained in other Access Units.

Access Units are differentiated by:
type, characterizing the nature of the genomic data and data sets they carry and the way they can be accessed,
order, providing a unique order to Access Units belonging to the same type.

Access Units of any type can be further classified into different "categories".

Hereafter follows a non-exhaustive list of definitions of different types of genomic Access Units:

1) Access Units of type 0 do not need to refer to any information coming from other Access Units to be accessed or decoded and accessed. The entire information carried by the data or data sets they contain can be independently read and processed by a decoding device or processing application.

2) Access Units of type 1 contain data that refer to data carried by Access Units of type 0. Reading or decoding and processing the data contained in Access Units of type 1 requires having access to one or more Access Units of type 0. Access Units of type 1 encode genomic data related to sequence reads of "Class P".

3) Access Units of type 2 contain data that refer to data carried by Access Units of type 0. Reading or decoding and processing the data contained in Access Units of type 2 requires having access to one or more Access Units of type 0. Access Units of type 2 encode genomic data related to sequence reads of "Class N".

4) Access Units of type 3 contain data that refer to data carried by Access Units of type 0. Reading or decoding and processing the data contained in Access Units of type 3 requires having access to one or more Access Units of type 0. Access Units of type 3 encode genomic data related to sequence reads of "Class M".

5) Access Units of type 4 contain data that refer to data carried by Access Units of type 0. Reading or decoding and processing the data contained in Access Units of type 4 requires having access to one or more Access Units of type 0. Access Units of type 4 encode genomic data related to sequence reads of "Class I".

6) Access Units of type 5 contain reads that cannot be mapped on any available reference sequence ("Class U") and are encoded using an internally constructed reference sequence. Access Units of type 5 contain data that refer to data carried by Access Units of type 0. Reading or decoding and processing the data contained in Access Units of type 5 requires having access to one or more Access Units of type 0.

7) Access Units of type 6 contain read pairs where one read can belong to any of the four classes P, N, M, I and the other cannot be mapped on any available reference sequence ("Class HM"). Access Units of type 6 contain data that refer to data carried by Access Units of type 0. Reading or decoding and processing the data contained in Access Units of type 6 requires having access to one or more Access Units of type 0.

8) Access Units of type 7 contain metadata (e.g. quality scores) and/or annotation data associated to the data or data sets contained in the access unit of type 1. Access Units of type 7 may be classified and labelled in different blocks.

9) Access Units of type 8 contain data or data sets classified as annotation data. Access Units of type 8 may be classified and labelled in blocks.

10) Access Units of additional types can extend the structure and mechanisms described here. As an example, but not as a limitation, the results of genomic variant calling, structural and functional analysis can be encoded in Access Units of new types. The data organization in Access Units described herein does not prevent any type of data from being encapsulated in Access Units, the mechanism being completely transparent with respect to the nature of encoded data.

Access Units of type 0 are ordered (e.g. numbered), but they do not need to be stored and/or transmitted in an ordered manner (technical advantage: parallel processing/parallel streaming, multiplexing).

Access Units of type 1, 2, 3, 4, 5 and 6 do not need to be ordered and do not need to be stored and/or transmitted in an ordered manner (technical advantage: parallel processing / parallel streaming).
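The dependency between Access Unit types described above can be summarised in a short Python sketch. It is an illustration under stated assumptions, not part of the disclosure: the names `is_decodable` and `decode_plan` are hypothetical, and only types 0 to 6 are modelled because the text states no decoding dependency for types 7 and 8.

```python
# Types 1..6 refer to data carried by Access Units of type 0;
# type 0 is self-contained.
DEPENDS_ON_TYPE_0 = frozenset({1, 2, 3, 4, 5, 6})


def is_decodable(au_type, available_types):
    """Tell whether an AU of `au_type` can be decoded given the AU types at hand."""
    if au_type == 0:
        return True
    if au_type in DEPENDS_ON_TYPE_0:
        return 0 in available_types
    raise ValueError(f"dependency not modelled for type {au_type}")


def decode_plan(au_types):
    """Put type 0 AUs first; the remaining AUs keep their arrival order."""
    # sorted() is stable, so AUs of types 1..6 are not reordered among
    # themselves, reflecting that no ordering is required for them.
    return sorted(au_types, key=lambda t: t != 0)


print(is_decodable(3, {0}))       # True
print(is_decodable(3, {1, 2}))    # False
print(decode_plan([3, 0, 5, 0]))  # [0, 0, 3, 5]
```

Because types 1 to 6 depend only on type 0, a decoder following this plan could process them in parallel once the type 0 Access Units are available.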

Figure 37 shows how Access Units are composed by a header and one or more blocks of homogeneous data. Each block can be composed by one or more blocks. Each block contains several packets and the packets are a structured sequence of the descriptors introduced above to represent e.g. reads positions, pairing information, reverse complement information, mismatches positions and types etc. Each Access Unit can have a different number of packets in each block, but within an Access Unit all blocks have the same number of packets.

Each data packet can be identified by the combination of 3 identifiers X Y Z where:
X identifies the access unit it belongs to
Y identifies the block it belongs to (i.e. the data type it encapsulates)
Z is an identifier expressing the packet order with respect to other packets in the same block

Figure 38 shows an example of Access Units and packets labelling where AU_T_N is an access unit of type T with identifier N which may or may not imply a notion of order according to the Access Unit Type. Identifiers are used to uniquely associate Access Units of one type with those of other types required to completely decode the carried genomic data.

Access Units of any type can be further classified and labelled in different "categories" according to different sequencing processes. For example, but not as a limitation, classification and labelling can take place when
1. sequencing the same organism at different times (Access Units contain genomic information with a "temporal" connotation),
2. sequencing organic samples of different nature of the same organisms (e.g. skin, blood, hair for human samples). These are Access Units with "biological" connotation.
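As an illustration of the X Y Z packet labelling, the following Python sketch groups packets by Access Unit and block and checks that all blocks of an Access Unit carry the same number of packets. The class `PacketId`, the helper names and the sample labels are assumptions made for this example; the label `AU_1_0` follows the AU_T_N pattern and "pos" and "rcomp" are descriptor names used elsewhere in this document.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PacketId:
    access_unit: str  # X: the Access Unit the packet belongs to
    block: str        # Y: the block (data type) the packet belongs to
    order: int        # Z: packet order within the block


def packets_by_block(packet_ids):
    """Group packet orders by (access unit, block), sorted by Z."""
    grouped = defaultdict(list)
    for pid in packet_ids:
        grouped[(pid.access_unit, pid.block)].append(pid.order)
    return {key: sorted(orders) for key, orders in grouped.items()}


def same_packet_count(grouped):
    """Check that, within each Access Unit, all blocks hold equally many packets."""
    counts = defaultdict(set)
    for (access_unit, _block), orders in grouped.items():
        counts[access_unit].add(len(orders))
    return all(len(sizes) == 1 for sizes in counts.values())


EXAMPLE_PACKETS = [
    PacketId("AU_1_0", "pos", 1),
    PacketId("AU_1_0", "rcomp", 0),
    PacketId("AU_1_0", "pos", 0),
    PacketId("AU_1_0", "rcomp", 1),
]

grouped = packets_by_block(EXAMPLE_PACKETS)
print(grouped[("AU_1_0", "pos")])  # [0, 1]
print(same_packet_count(grouped))  # True
```

Sorting by Z restores the packet order inside each block even when packets arrive interleaved, which is the situation the three identifiers are meant to handle.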

Claims

1. A method for encoding genome sequence data, said genome sequence data comprising reads of sequences of nucleotides, said method comprising the steps of: aligning said reads to one or more reference sequences thereby creating aligned reads, classifying said aligned reads according to specified matching rules with said one or more reference sequences, thereby creating classes of aligned reads, encoding said classified aligned reads as a multiplicity of blocks of descriptors, wherein encoding said classified aligned reads as a multiplicity of blocks of descriptors comprises selecting said descriptors according to said classes of aligned reads, structuring said blocks of descriptors with header information thereby creating successive Access Units.

2. The encoding method of claim 1 further comprising: further classifying said reads that do not satisfy said specified matching rules into a class of unmapped reads, constructing a set of reference sequences using at least some unmapped reads, aligning said class of unmapped reads to the set of constructed reference sequences, encoding said classified aligned reads as a multiplicity of blocks of descriptors, encoding said set of constructed reference sequences, structuring said blocks of descriptors and said encoded reference sequences with header information thereby creating successive Access Units.

3. The method of claim 2, wherein said classifying comprises identifying genomic reads without any mismatch in the reference sequence as first “Class P” when no mismatches are present in the mapped read with respect to the reference sequence used for mapping, wherein said classifying further comprises identifying genomic reads as a second “Class N” when mismatches are only found in the positions where the sequencing machine was not able to call any “base” and the number of mismatches in each read does not exceed a given threshold, wherein said classifying further comprises identifying genomic reads as a third “Class M” when mismatches are found in the positions where the sequencing machine was not able to call any “base”, named “n type” mismatches, and/or it called a different “base” than the reference sequence, named “s type” mismatches, and the number of mismatches does not exceed given thresholds for the number of mismatches of “n type”, of “s type” and a threshold obtained from a given function (f(n,s)), wherein said classifying further comprises identifying genomic reads as a fourth “Class I” when they can possibly have the same type of mismatches of “Class M”, and in addition at least one mismatch of type: “insertion” (“i type”), “deletion” (“d type”), soft clips (“c type”), and wherein the number of mismatches for each type does not exceed the corresponding given threshold and a threshold provided by a given function (w(n,s,i,d,c)), and wherein said classifying further comprises identifying genomic reads as a fifth “Class U” as comprising all reads that do not find any classification in the Classes P, N, M, I.

4. The encoding method of claim 3 wherein the reads of the genomic sequence to be encoded are paired, and wherein said classifying further comprises identifying genomic reads as a sixth “Class HM” as comprising all read pairs where one read belongs to Class P, N, M or I and the other read belongs to “Class U”.

5. The encoding method of claim 4 further comprising the steps of: identifying if the two mate reads are classified in the same class (each of: P, N, M, I, U), then assigning the pair to the same identified class, identifying if the two mate reads are classified in different classes, and in case none of them belongs to the “Class U”, then assigning the pair of reads to the class with the highest priority defined according to the following expression: P < N < M < I in which “Class P” has the lowest priority and “Class I” has the highest priority; identifying if only one of the two mate reads has been classified as belonging to “Class U” and classifying the pair of reads as belonging to the “Class HM” sequences.

6. The method of claim 5 where each Class of reads N, M, I is further partitioned into two or more subclasses (296, 297, 298) according to a vector of thresholds (292, 293, 294) defined respectively for each class N, M and I, by the number of “n type” mismatches (292), the function f(n,s) (293) and the function w(n,s,i,d,c) (294).

7. The encoding method of claim 6 further comprising the steps of: identifying if the two mate reads are classified in the same subclass, then assigning the pair to the same sub-class, identifying if the two mate reads are classified into sub-classes of different Classes, then assigning the pair to the subclass belonging to the Class of higher priority according to the following expression: N < M < I where N has the lowest priority and I has the highest priority; identifying if the two mate reads are classified in the same class, and such class is N or M or I, but in different sub-classes, then assigning the pair to the sub-class with the highest priority according to the following expressions:
N_1 < N_2 < … < N_k
M_1 < M_2 < … < M_j
I_1 < I_2 < … < I_h
where the highest index has the highest priority.

8. The method of claim 7 wherein information on the mapping position of each read is encoded by means of a “pos” descriptor block, wherein information on the strandedness (i.e. the DNA strand the read was sequenced from) of each read is optionally encoded by means of a “rcomp” descriptor block, wherein pairing information of paired-end reads is optionally encoded by means of a “pair” descriptor block, wherein additional alignment information such as if the read is mapped in proper pair, it fails platform/vendor quality checks, it is a PCR or optical duplicate or it is a supplementary alignment is optionally encoded by means of a “flags” descriptor block, wherein information on unknown bases is optionally encoded by means of a “nmis” descriptor block, wherein information on the position of substitutions is optionally encoded by means of a “snpp” descriptor block, wherein information on the type of substitutions is optionally encoded by means of a specific “snpt” descriptor block, wherein information on the position of mismatches of type substitutions, insertions or deletions is optionally encoded by means of a “indp” descriptor block, wherein information on the type of mismatches such as substitutions, insertions or deletions is optionally encoded by means of a “indt” descriptor block, wherein information on clipped bases of a mapped read is optionally encoded by means of a “indc” descriptor block, wherein information on unmapped reads is optionally encoded by means of a “ureads” descriptor block, wherein information on the type of reference sequence used for encoding is optionally encoded by means of a “rtype” descriptor block, wherein information on multiple alignments of the mapped reads is optionally encoded by means of a “mmap” descriptor block, wherein information on spliced alignments and multiple alignments of the same read is optionally encoded by means of a “msar” descriptor block and a “mmap” descriptor block, wherein information on read alignment scores is optionally encoded by means of a “mscore” descriptor block, wherein information on the groups reads belong to is optionally encoded by means of a “rgroup” descriptor block.

9. The method of claim 8 wherein Access Units of class P are built using blocks of descriptors of type “pos”, “rcomp” and “flags”, wherein said Access Units of class P optionally encode pairing information of paired-end using a block of “pair” descriptors, wherein Access Units of class N are built using the same blocks of descriptors of an Access Unit of class P plus a “nmis” descriptor block for the information on the position of unknown bases, wherein Access Units of class M are built using the same blocks of descriptors of Access Units of class P plus blocks of the “snpp” and “snpt” descriptors for the information on position and type of substitutions, wherein Access Units of class I are built using the same blocks of descriptors of Access Units of class P plus blocks of the “indp”, “indt” and “indc” descriptors for the information on position and type of substitutions, insertions, deletions and clipped bases, and wherein information on multiple alignments is conveyed using blocks of the “mmap” and “msar” descriptor.

10. The method of claim 9 wherein Access Units of class HM are built using the same blocks of descriptors of Access Units of class I for the mapped reads, and using blocks of the “ureads” descriptor for the unmapped reads.

11. The method of claim 9 wherein information on spliced alignments is conveyed using an extended cigar string comprising:
the symbol = to indicate matching bases
the symbol + to indicate insertions
the symbol - to indicate deletions
the symbol / to indicate a splice on the forward strand
the symbol % to indicate a splice on the reverse strand
the symbol * to indicate an undirected splice
a textual character from the IUPAC codes for DNA to indicate a substitution
the symbol (n) to indicate n soft clipped bases where n is an integer number
the symbol [n] to indicate n hard clipped bases where n is an integer number

12. The method of claim 11 wherein said blocks of descriptors comprise a “master index table”, containing one section for each Class and sub-class of aligned reads, said section comprising the mapping positions on said one or more reference sequences of the first read of each Access Unit of each Class or sub-class of data; jointly coding said “master index table” and said Access Unit data.

13. The method of claim 12, wherein said blocks of descriptors further comprise information on the type of reference used (pre-existing or constructed), and the segments of the read that do not map on the reference sequence.

14. The method of claim 13, wherein said reference sequences are first transformed into different reference sequences by applying substitutions, insertions, deletions and clipping, then the encoding of said classified aligned reads as a multiplicity of blocks of descriptors refers to the transformed reference sequences.

15. The method of claim 14 where the reference sequences transformations are encoded as blocks of descriptors and structured with header information thereby creating successive Access Units.

16. The method of claim 15, wherein the encoding of said classified aligned reads and the related reference sequences transformations as multiplicity of blocks of descriptors comprises the step of associating a specific source model and a specific entropy coder to each descriptor block, and wherein said entropy coder preferably is one of a context adaptive arithmetic coder, a variable length coder or a Golomb coder.

17. A method for decoding encoded genomic data comprising the steps of: parsing Access Units containing said encoded genomic data to extract multiple blocks of descriptors by employing header information, decoding said multiplicity of blocks of descriptors to extract reads according to specific matching rules defining their classification with respect to one or more reference sequences.

18. The decoding method of claim 17 further comprising decoding a master index table containing one section for each class of reads and the associated relevant mapping positions.

19. The decoding method of claim 18 further comprising decoding information related to the type of reference used: pre-existing, transformed or constructed, and/or further comprising decoding information related to one or more transformations to be applied to the pre-existing reference sequences.

20. The decoding method of claim 19 wherein said blocks of descriptors are entropy decoded.

21. The decoding method of claim 20 wherein: Class P reads are obtained by decoding blocks of descriptors of type: “pos”, “rcomp”, “flags” and “rlen”, Class N reads are obtained by decoding blocks of descriptors of type: “pos”, “rcomp”, “flags”, “rlen” and “nmis”, Class M reads are obtained by decoding blocks of descriptors of type: “pos”, “rcomp”, “flags”, “rlen”, “snpp” and “snpt”, Class I reads are obtained by decoding blocks of descriptors of type: “pos”, “rcomp”, “flags”, “rlen”, “indp”, “indt” and “indc”, Class U reads are obtained by decoding blocks of descriptors of type: “pos”, “rcomp”, “flags”, “rlen”, “snpp”, “snpt”, “indc”, “ureads” and “rtype”.

22. The decoding method of claim 21 wherein paired reads of: Class P, N, M and I are obtained by also decoding blocks of descriptors of type: “pair”, Class HM are obtained by decoding blocks of descriptors of type: “pos”, “rcomp”, “flags”, “rlen”, “indp”, “indt”, “indc”, and “ureads”.

23. A genomic encoder (2010) for the compression of genome sequence data (209), said genome sequence data (209) comprising reads of sequences of nucleotides, said genomic encoder (2010) comprising: an aligner unit (201), configured to align said reads to one or more reference sequences thereby creating aligned reads, a constructed-reference generator unit (202), configured to produce constructed reference sequences, a data classification unit (204), configured to classify said aligned reads according to specified matching rules with the one or more pre-existing reference sequences or constructed reference sequences thereby creating classes of aligned reads (208); one or more blocks encoding units (205-207), configured to encode said classified aligned reads as blocks of descriptors by selecting said descriptors according to said classes of aligned reads, a multiplexer (2016) for multiplexing the compressed genomic data and metadata.

24. The genomic encoder of claim 23 further comprising a reference sequence transformation unit (2019) configured to transform the pre-existing references and data classes (208) into transformed data classes (2018).

25. The genomic encoder of claim 24 where the data classification unit (204) contains encoders of data classes N, M and I configured with vectors of thresholds generating sub-classes of data classes N, M and I.

26. A genomic decoder (218) for the decompression of a compressed genomic stream (211), said genomic decoder (218) comprising: a demultiplexer (210) for demultiplexing compressed genomic data and metadata, parsing means (212-214) configured to parse said compressed genomic stream into genomic blocks of descriptors (215), one or more block decoders (216-217), configured to decode the genomic blocks of descriptors into classified reads of sequences of nucleotides (2111), genomic data classes decoders (219) configured to selectively decode said classified reads of sequences of nucleotides on one or more reference sequences so as to produce uncompressed reads of sequences of nucleotides.

27. The genomic decoder of claim 26 further comprising a reference transformation decoder (2113) configured to decode reference transformation descriptors (2112) and produce a transformed reference (2114) to be used by genomic data class decoders (219).

28. The genomic decoder of claim 27, wherein the one or more reference sequences are stored in the compressed genome stream (211), or wherein the one or more reference sequences are provided to the decoder via an out of band mechanism.

29. The genomic decoder of claim 28, wherein the one or more reference sequences are built at the decoder, or wherein one or more reference sequences are transformed at the decoder by a reference transformation decoder (2113).

30. A computer-readable medium comprising instructions that when executed cause at least one processor to perform the encoding method of claim 7.