US20200042735A1 - Method and system for selective access of stored or transmitted bioinformatics data - Google Patents

Method and system for selective access of stored or transmitted bioinformatics data

Info

Publication number
US20200042735A1
Authority
US
United States
Prior art keywords
genomic
data
type
reads
class
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US16/341,426
Inventor
Mohamed Khoso Baluch
Giorgio Zoia
Daniele Renzi
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Genomsys SA
Original Assignee
Genomsys SA
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from PCT/EP2016/074307 (WO2018068829A1)
Priority claimed from PCT/EP2016/074297 (WO2018068827A1)
Priority claimed from PCT/EP2016/074311 (WO2018068830A1)
Priority claimed from PCT/EP2016/074301 (WO2018068828A1)
Application filed by Genomsys SA
Assigned to GENOMSYS SA: assignment of assignors' interest (see document for details). Assignors: BALUCH, Mohamed Khoso; RENZI, Daniele; ZOIA, Giorgio
Publication of US20200042735A1
Current legal status: Abandoned

Classifications

    • G PHYSICS
      • G06 COMPUTING; CALCULATING OR COUNTING
        • G06F ELECTRIC DIGITAL DATA PROCESSING
          • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
            • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
              • G06F16/22 Indexing; Data structures therefor; Storage structures
                • G06F16/2282 Tablespace storage structures; Management thereof
              • G06F16/23 Updating
                • G06F16/2365 Ensuring data consistency and integrity
              • G06F16/28 Databases characterised by their database models, e.g. relational or object models
                • G06F16/284 Relational databases
                  • G06F16/285 Clustering or classification
          • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
            • G06F21/60 Protecting data
              • G06F21/602 Providing cryptographic facilities or services
              • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
                • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
                  • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes
          • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
            • G06F3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
              • G06F3/048 Interaction techniques based on graphical user interfaces [GUI]
          • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
      • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
        • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
          • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
            • G16B20/10 Ploidy or copy number detection
            • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
          • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
            • G16B30/10 Sequence alignment; Homology search
            • G16B30/20 Sequence assembly
          • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
            • G16B40/10 Signal processing, e.g. from mass spectrometry [MS] or from PCR
          • G16B45/00 ICT specially adapted for bioinformatics-related data visualisation, e.g. displaying of maps or networks
          • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
            • G16B50/10 Ontologies; Annotations
            • G16B50/30 Data warehousing; Computing architectures
            • G16B50/40 Encryption of genetic data
            • G16B50/50 Compression of genetic data
          • G16B99/00 Subject matter not provided for in other groups of this subclass
    • H ELECTRICITY
      • H03 ELECTRONIC CIRCUITRY
        • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
          • H03M7/00 Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
            • H03M7/30 Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
              • H03M7/3084 Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction using adaptive string matching, e.g. the Lempel-Ziv method
                • H03M7/3086 Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction using adaptive string matching, e.g. the Lempel-Ziv method employing a sliding window, e.g. LZ77
              • H03M7/70 Type of the data to be coded, other than image and sound

Definitions

  • the present application provides new methods for the efficient storage, transmission and multiplexing of bioinformatics data, and in particular genomic sequencing data, in compressed form that enable efficient selective access and selective protection of the different data categories composing the genomic datasets.
  • genome sequencing data is fundamental to enable efficient processing, storage and transmission of genomic data to make possible and facilitate analysis applications such as genome variants calling and all analysis performed, with various purposes, by processing the sequencing data and metadata.
  • genome sequencing information is generated by High Throughput Sequencing (HTS) machines in the form of sequences of nucleotides (a.k.a. bases) represented by strings of letters from a defined vocabulary.
  • sequencing machines do not read out entire genomes or genes; they produce short random fragments of nucleotide sequences known as sequence reads.
  • a quality score is associated with each nucleotide in a sequence read. This number represents the confidence level given by the machine to the read of a specific nucleotide at a specific location in the nucleotide sequence.
  • This raw sequencing data generated by NGS machines is commonly stored in FASTQ files (see also FIG. 1).
  • the smallest vocabulary to represent sequences of nucleotides obtained by a sequencing process is composed of five symbols, {A, C, G, T, N}, representing the four types of nucleotides present in DNA, namely Adenine, Cytosine, Guanine, and Thymine, plus the symbol N to indicate that the sequencing machine was not able to call any base with a sufficient level of confidence, so the type of base in that position remains undetermined in the reading process.
  • In RNA, Thymine is replaced by Uracil (U).
  • the nucleotide sequences produced by sequencing machines are called “reads”. In the case of paired reads, the term “template” is used to designate the original sequence from which the read pair has been extracted. Sequence reads can be composed of a number of nucleotides ranging from a few dozen up to several thousand. Some technologies produce sequence reads in pairs, where each read can originate from one of the two DNA strands.
  • the term “coverage” is used to express the level of redundancy of the sequence data with respect to a reference genome. For example, to reach a coverage of 30× on a human genome (3.2 billion bases long) a sequencing machine shall produce a total of about 30 × 3.2 billion bases so that on average each position in the reference is “covered” 30 times.
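The coverage arithmetic in the example above can be checked with a short calculation. The following is a minimal sketch only: the function names and the 150-base read length are assumptions made for illustration and do not come from the application.

```python
# Illustrative sketch: function names and the read length are assumptions.
HUMAN_GENOME_LENGTH = 3_200_000_000  # bases, as quoted in the text


def bases_required(genome_length: int, coverage: int) -> int:
    """Total bases to sequence so each position is covered `coverage` times on average."""
    return genome_length * coverage


def reads_required(genome_length: int, coverage: int, read_length: int) -> int:
    """Number of fixed-length reads needed to reach the coverage (rounded up)."""
    total = bases_required(genome_length, coverage)
    return -(-total // read_length)  # ceiling division on integers


def average_coverage(total_bases: int, genome_length: int) -> float:
    """Average number of times each reference position is covered."""
    return total_bases / genome_length


if __name__ == "__main__":
    print(bases_required(HUMAN_GENOME_LENGTH, 30))       # 96000000000
    print(reads_required(HUMAN_GENOME_LENGTH, 30, 150))  # 640000000
```

With these assumptions, 30× coverage of a 3.2-billion-base genome corresponds to 96 billion sequenced bases, which gives a sense of the data volumes discussed in the following paragraphs.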
  • the most used genome information representations of sequencing data are based on FASTQ and SAM file formats which are commonly made available in zipped form in the attempt of reducing the original size.
  • the traditional file formats respectively FASTQ and SAM for non-aligned and aligned sequencing data, are constituted by plain text characters and are thus compressed by using general purpose approaches such as LZ (from Lempel and Ziv) schemes (the well-known zip, gzip etc).
  • when general purpose compressors such as gzip are used, the result of the compression is usually a single blob of binary data.
  • the information in such monolithic form is quite difficult to archive, transfer and process, particularly in the case of high throughput sequencing when the volumes of data are extremely large.
  • each stage of a genomic information processing pipeline produces data represented by a completely new data structure (file format) despite the fact that in reality only a small fraction of the generated data is new with respect to the previous stage.
  • FIG. 1 shows the main stages of a typical genomic information processing pipeline with the indication of the associated file format representation.
  • the transfer of genomic data is slow and inefficient because the currently used data formats are organized into monolithic files of up to several hundred Gigabytes in size, which need to be entirely transferred to the receiving end in order to be processed. This implies that the analysis of a small segment of the data requires the transfer of the entire file, with significant costs in terms of consumed bandwidth and waiting time. Often online transfer is prohibitive for the large volumes of data to be transferred, and the transport of the data is performed by physically moving storage media such as hard disk drives or storage servers from one location to another.
  • the present invention provides a solution to this need.
  • the invention aims at providing an appropriate genomic sequencing data and metadata representation by organizing and partitioning the data so that the compression of data and metadata is maximized and several functionalities, such as selective access and support for incremental updates, are efficiently enabled.
  • a key aspect of the invention is a specific definition of classes of data and metadata to be represented by an appropriate source model, coded (i.e. compressed) separately by being structured in specific layers.
  • the present application discloses a method and system addressing the problem of efficient manipulation, storage and transmission of very large amounts of genomic sequencing data, by employing a structured access units approach combined with multiplexing techniques.
  • the present application overcomes all the limitations of the prior art approaches related to the functionality of genomic data accessibility, selective data protection, efficient processing of data subsets, transmission and streaming functionality combined with an efficient compression.
  • SAM: Sequence Alignment Mapping
  • CRAM: see the CRAM specification, https://samtools.github.io/hts-specs/CRAMv3.pdf.
  • CRAM provides more efficient compression through the adoption of differential encoding with respect to an existing reference (it partially exploits the data source redundancy), but it still lacks features such as incremental updates, support for streaming and selective access to specific classes of compressed data.
  • CRAM relies on the concept of the CRAM record. Each CRAM record encodes a single mapped or unmapped read by encoding all the elements necessary to reconstruct it.
  • Besides CRAM, the other approaches to genomic data compression and processing also present strong limitations with respect to most of the desired functionality and do not support features that are provided by this invention disclosure, as described and specified in the remainder of this document.
  • the first two categories share the disadvantage of not exploiting the specific characteristics of the data source (genomic sequence reads) and process the genomic data as strings of text to be compressed without taking into account the specific properties of this kind of information (e.g. redundancy among reads, reference to an existing sample).
  • Two of the most advanced toolkits for genomic data compression, namely CRAM and Goby (“Compression of structured high-throughput sequencing data”, F. Campagne, K. C. Dorff, N. Chambwe, J. T. Robinson, J. P. Mesirov, T. D. Wu), make poor use of arithmetic coding as they implicitly model data as independent and identically distributed by a Geometric distribution.
  • Goby is slightly more sophisticated since it converts all the fields to a list of integers and each list is encoded independently using arithmetic coding without using any context. In the most efficient mode of operation, Goby is able to perform some inter-list modeling over the integer lists to improve compression. These prior art solutions yield poor compression ratios and data structures that are difficult if not impossible to selectively access and manipulate once compressed. Downstream analysis stages can be inefficient and very slow due to the necessity of handling large and rigid data structures even to perform simple operations or to access selected regions of the genomic dataset.
  • A simplified view of the relation among the file formats used in genome processing pipelines is depicted in FIG. 1.
  • file inclusion does not imply the existence of a nested file structure, but it only represents the type and amount of information that can be encoded for each format (i.e. SAM contains all information in FASTQ, but organized in a different file structure).
  • CRAM contains the same genomic information as SAM/BAM, but it has more flexibility in the type of compression that can be used, therefore it is represented as a superset of SAM/BAM.
  • Genomic Information Storage Format (Genomic File Format)
  • Transport Mechanism that enable efficient compression, support selective access and protection functionality in the compressed domain, of local and remotely stored data and support the incremental addition of heterogeneous metadata in the compressed domain at all levels of the different stages of the genomic data processing.
  • the present invention provides a solution to the limitations of the state of the art by employing the method, devices and computer programs as claimed in the accompanying set of claims.
  • FIG. 1 shows the main steps of a typical genomic pipeline and the related file formats.
  • FIG. 2 shows the mutual relationship among the most used genomic file formats
  • FIG. 3 shows how genomic sequence reads are assembled in an entire or partial genome via de-novo assembly or reference based alignment.
  • FIG. 4 shows how reads mapping positions on the reference sequence are calculated.
  • FIG. 5 shows how reads pairing distances are calculated.
  • FIG. 6 shows how pairing errors are calculated.
  • FIG. 7 shows how the pairing distance is encoded when a read mate pair is mapped on a different chromosome.
  • FIG. 8 shows how sequence reads can be generated from the first or second DNA strand of a genome.
  • FIG. 9 shows how a read mapped on strand 2 has a corresponding reverse complemented read on strand 1.
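The relation between a read on strand 2 and its counterpart on strand 1 is the standard reverse complement operation. A minimal sketch follows, using the five-symbol vocabulary {A, C, G, T, N} described earlier; the function name is an assumption, and treating N as its own complement is a common convention rather than something stated in the application.

```python
# Illustrative sketch: the function name is hypothetical.
# N (undetermined base) is mapped to itself, a common convention.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(read: str) -> str:
    """Return the reverse complement of a DNA read over {A, C, G, T, N}."""
    return read.translate(_COMPLEMENT)[::-1]


if __name__ == "__main__":
    print(reverse_complement("AACGT"))  # ACGTT
```

Applying the operation twice returns the original read, so a decoder only needs a flag per read (as in the rcomp layer of FIG. 10) to know whether to apply it.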
  • FIG. 10 shows the four possible combinations of reads composing a reads pair and the respective encoding in the rcomp layer.
  • FIG. 11 shows how “n type” mismatches are encoded in a nmis layer.
  • FIG. 12 shows an example of substitutions in a mapped read pair.
  • FIG. 13 shows how substitution positions can be calculated either as absolute or differential values.
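The difference between absolute and differential position values can be sketched as a simple delta transform. This is an illustration under stated assumptions, not the encoding defined in the application: positions are assumed to be sorted in ascending order, and the first differential value is assumed to be measured from position 0.

```python
from typing import List


# Illustrative sketch: the origin (0) for the first delta is an assumption.
def to_differential(positions: List[int]) -> List[int]:
    """Convert sorted absolute positions into distances from the previous position."""
    deltas = []
    previous = 0
    for position in positions:
        deltas.append(position - previous)
        previous = position
    return deltas


def to_absolute(deltas: List[int]) -> List[int]:
    """Invert to_differential by accumulating the distances."""
    positions = []
    current = 0
    for delta in deltas:
        current += delta
        positions.append(current)
    return positions


if __name__ == "__main__":
    print(to_differential([3, 10, 12, 40]))  # [3, 7, 2, 28]
```

Differential values tend to be smaller and more repetitive than absolute ones, which generally suits a subsequent entropy coding stage.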
  • FIG. 14 shows how symbols encoding substitutions without IUPAC codes are calculated.
  • FIG. 15 shows how substitution types are encoded in the snpt layer.
  • FIG. 16 shows how symbols encoding substitutions with IUPAC codes are calculated.
  • FIG. 17 shows an alternative source model for substitution where only positions are encoded, but one layer per substitution type is used.
  • FIG. 18 shows how to encode substitutions, insertions and deletions in a reads pair of class I when IUPAC codes are not used.
  • FIG. 19 shows how to encode substitutions, insertions and deletions in a reads pair of class I when IUPAC codes are used.
  • FIG. 20 shows the structure of the Genomic Dataset Header of the genomic information data structure disclosed by this invention.
  • FIG. 21 shows how the Master Index Table contains the positions on the reference sequences of the first read in each Access Unit.
  • FIG. 22 shows an example of partial MIT showing the mapping positions of the first read in each pos AU of class P.
  • FIG. 23 shows how the Local Index Table in the layer header is a vector of pointers to the AUs in the payload.
  • FIG. 24 shows an example of Local Index Table.
  • FIG. 25 shows the functional relation between Master Index Table and Local Index Tables
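The way the two index tables cooperate for selective access can be sketched as follows. This is a simplified illustration, not the data layout of the application: the Master Index Table is modelled as a sorted list holding the mapping position of the first read of each Access Unit, the Local Index Table as a parallel list of byte offsets into the payload, and the selection considers read start positions only. All names and numeric values are hypothetical.

```python
from bisect import bisect_right
from typing import List


# Illustrative sketch: table layouts, names and values are assumptions.
def select_access_units(mit_positions: List[int], start: int, end: int) -> List[int]:
    """Indices of Access Units that may hold reads starting in [start, end].

    mit_positions[i] is the mapping position of the first read in AU i,
    and the list is assumed to be sorted in ascending order.
    """
    first = max(bisect_right(mit_positions, start) - 1, 0)
    last = bisect_right(mit_positions, end) - 1
    return list(range(first, last + 1))


def payload_offsets(lit_offsets: List[int], au_indices: List[int]) -> List[int]:
    """Look up the payload pointer of each selected AU in the Local Index Table."""
    return [lit_offsets[i] for i in au_indices]


if __name__ == "__main__":
    mit = [1, 15000, 32000, 51000]   # first-read positions (hypothetical)
    lit = [0, 4096, 9000, 15500]     # byte offsets of each AU (hypothetical)
    selected = select_access_units(mit, 16000, 33000)
    print(selected, payload_offsets(lit, selected))  # [1, 2] [4096, 9000]
```

Only the selected Access Units need to be fetched and decoded, which is the point of indexing by genomic position rather than storing a monolithic file.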
  • FIG. 26 shows how Access Units are composed of blocks of data belonging to several layers. Layers are composed of Blocks subdivided into Packets.
  • FIG. 27 shows how a Genomic Access Unit of type 1 (containing positional, pairing, reverse complement and read length information) is packetized and encapsulated in a Genomic Data Multiplex.
  • FIG. 28 shows how Access Units are composed of a header and multiplexed blocks belonging to one or more layers of homogeneous data. Each block can be composed of one or more packets containing the actual descriptors of the genomic information.
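The containment hierarchy described in these figures (Access Unit, Block, Packet) can be sketched with a few data classes. This is an illustration only: the field names, the layer names "pair" and "rlen", and the fixed packet size are assumptions, and header contents are omitted.

```python
from dataclasses import dataclass, field
from typing import List


# Illustrative sketch: field names and the packet size are assumptions.
@dataclass
class Packet:
    payload: bytes


@dataclass
class Block:
    layer: str  # e.g. "pos" or "rcomp"; other layer names here are hypothetical
    packets: List[Packet] = field(default_factory=list)

    def payload(self) -> bytes:
        """Reassemble the block data from its packets."""
        return b"".join(p.payload for p in self.packets)


@dataclass
class AccessUnit:
    au_type: int
    blocks: List[Block] = field(default_factory=list)

    def layers(self) -> List[str]:
        return [b.layer for b in self.blocks]


def packetize(layer: str, data: bytes, packet_size: int) -> Block:
    """Split one layer's descriptor bytes into fixed-size packets."""
    packets = [Packet(data[i:i + packet_size]) for i in range(0, len(data), packet_size)]
    return Block(layer, packets)


if __name__ == "__main__":
    au = AccessUnit(1, [packetize(name, b"descriptor-bytes", 6)
                        for name in ("pos", "pair", "rcomp", "rlen")])
    print(au.layers(), len(au.blocks[0].packets))  # ['pos', 'pair', 'rcomp', 'rlen'] 3
```

Keeping each layer in its own block is what allows a receiver to pick out, protect, or skip one kind of descriptor without touching the others.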
  • FIG. 29 shows the structure of Access Units of type 0, which do not need to refer to any information coming from other access units to be accessed or decoded.
  • FIG. 30 shows the structure of Access Units of type 1.
  • FIG. 31 shows the structure of Access Units of type 2 which contain data that refer to an access unit of type 1. These are the positions of N bases in the encoded reads.
  • FIG. 32 shows the structure of Access Units of type 3 which contain data that refer to an access unit of type 1. These are the positions and types of mismatches in the encoded reads.
  • FIG. 33 shows the structure of Access Units of type 4 which contain data that refer to an access unit of type 1. These are the positions and types of mismatches in the encoded reads.
  • FIG. 34 shows the first five types of Access Units.
  • FIG. 35 shows that Access Units of type 1 refer to Access Units of type 0 to be decoded.
  • FIG. 36 shows that Access Units of type 2 refer to Access Units of type 0 and 1 to be decoded.
  • FIG. 37 shows that Access Units of type 3 refer to Access Units of type 0 and 1 to be decoded.
  • FIG. 38 shows that Access Units of type 4 refer to Access Units of type 0 and 1 to be decoded.
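The decoding dependencies listed for FIGS. 35 to 38 can be expressed as a small lookup table. The sketch below is illustrative: the table transcribes only the dependencies stated above for types 0 to 4, and the function name is hypothetical.

```python
from typing import Dict, List, Set

# Dependencies as stated for FIGS. 35-38: type 1 refers to type 0,
# and types 2, 3 and 4 each refer to types 0 and 1.
DEPENDS_ON: Dict[int, Set[int]] = {
    0: set(),
    1: {0},
    2: {0, 1},
    3: {0, 1},
    4: {0, 1},
}


def required_types(au_type: int) -> List[int]:
    """Return every AU type needed to decode `au_type`, itself included.

    Ascending order is a valid decoding order here because each type
    depends only on lower-numbered types.
    """
    needed: Set[int] = set()
    stack = [au_type]
    while stack:
        current = stack.pop()
        if current not in needed:
            needed.add(current)
            stack.extend(DEPENDS_ON[current])
    return sorted(needed)


if __name__ == "__main__":
    print(required_types(3))  # [0, 1, 3]
```

A decoder asked for reads with mismatches therefore fetches three Access Units for the region of interest and can ignore the remaining types.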
  • FIG. 39 shows the Access Units required to decode sequence reads with mismatches mapped on the second segment of the reference sequence (AU 0-2).
  • FIG. 40 shows how raw genomic sequence data that becomes available can be incrementally added to pre-encoded genomic data.
  • FIG. 41 shows how a data structure based on Access Units enables genomic data analysis to start before the sequencing process is completed.
  • FIG. 42 shows how new analysis performed on existing data can imply that reads are moved from AUs of type 4 to one of type 3.
  • FIG. 43 shows how newly generated analysis data are encapsulated in a new AU of type 8 and a corresponding index is created in the MIT.
  • FIG. 44 shows how to transcode data due to the publication of a new reference sequence (genome).
  • FIG. 45 shows how reads mapped to a new genomic region with better quality (e.g. no indels) are moved from an AU of type 4 to an AU of type 3.
  • FIG. 46 shows how, in case a new mapping location is found (e.g. with fewer mismatches), the related reads can be moved from one AU to another of the same type.
  • FIG. 47 shows how selective encryption can be applied on Access Units of Type 4 only, as they contain the sensitive information to be protected.
  • FIG. 48 shows the data encapsulation in a genomic multiplex where one or more genomic datasets 482 - 483 contain Genomic streams 484 and streams of Genomic Datasets Mapping Table Lists 481 , Genomic Dataset Mapping Tables 485 , and Reference Identifiers Mapping Tables 487 .
  • Each genomic stream is composed of a Header 488 and Access Units 486 .
  • Access Units encapsulate Blocks 489 which are composed of Packets 4810 .
  • FIG. 49 shows how raw genomic sequence data ( 499 ) or aligned genomic data (produced by element 491 ) are processed to be encapsulated in a Genomic Multiplex.
  • the alignment ( 491 ) and reference genome construction ( 492 ) stages can be necessary to prepare the data for encoding.
  • Data classes ( 498 ) generated by a data classification unit ( 494 ) can be further classified with respect to one or more transformed references generated by a reference transformation unit ( 4919 ).
  • the transformed classes ( 4918 ) are then sent to layer encoders ( 495 - 497 ).
  • the generated layers ( 4911 ) are encoded by entropy coders ( 4912 - 4914 ) which generate Genomic Streams of Access Units ( 4915 ) fed to the Genomic Multiplexer ( 4916 ).
  • FIG. 50 shows how a genomic demultiplexer ( 500 ) extracts Genomic Streams ( 501 ) from the Genomic Multiplex ( 5010 ), one decoder per AU type ( 502 - 504 ) extracts the genomic layers which are then decoded ( 506 - 507 ) into various data classes ( 5011 ) which are used by class decoders ( 509 ) to reconstruct genomic formats such as for example FASTQ and SAM/BAM.
  • a genomic stream containing one or more reference transformations is decoded by an entropy decoder ( 504 ) to produce reference transformation descriptors ( 5012 ).
  • Reference transformation descriptors are processed by a reference transformation unit ( 5013 ) to transform one or more “external” references to generate one or more transformed references ( 5014 ) to be used by the class decoders ( 509 ).
  • FIG. 51 shows the process of encoding sequence reads belonging to class U with a self-generated reference sequence, using six layers of descriptors. Four layers are the same as those used for the other classes P, N, M, I, while two layers are specific to class U reads.
  • FIG. 52 shows how a label is built to aggregate genomic regions belonging to two different references.
  • FIG. 53 shows how an existing label can be updated in case new results of analysis require to add an additional region R4 to the existing ones (R1, R2 and R3).
  • FIG. 54 shows how the labeling mechanism can be used to implement access control and data protection on specific genomic regions or sub regions.
  • the simple case uses one access control rule (AC) and one protection mechanism (e.g. encryption) for all genomic regions identified by one label.
  • FIG. 55 shows how the different genomic regions identified by the same label can be protected by several different access control rules (AC) and several different encryption keys.
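The label-based protection described for FIGS. 54 and 55 can be sketched as a mapping from regions to access rules and key identifiers. This is a loose illustration, not the mechanism claimed: the class names, the user names, the key identifiers and the representation of an access control rule as a set of allowed users are all assumptions, and no actual encryption is performed.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List


# Illustrative sketch: every name and value below is hypothetical.
@dataclass(frozen=True)
class Region:
    reference_id: str
    start: int
    end: int


@dataclass(frozen=True)
class ProtectedRegion:
    region: Region
    allowed_users: FrozenSet[str]  # stands in for an access control rule (AC)
    key_id: str                    # identifies the encryption key for this region


@dataclass
class Label:
    name: str
    entries: List[ProtectedRegion] = field(default_factory=list)


def accessible_regions(label: Label, user: str) -> List[Region]:
    """Regions under this label that the user's access rule permits."""
    return [e.region for e in label.entries if user in e.allowed_users]


def keys_needed(label: Label, user: str) -> List[str]:
    """Key identifiers the user needs to decrypt the permitted regions."""
    return sorted({e.key_id for e in label.entries if user in e.allowed_users})


if __name__ == "__main__":
    r1 = Region("ref1", 1000, 2000)
    r2 = Region("ref1", 5000, 6000)
    r3 = Region("ref2", 100, 900)
    label = Label("example", [
        ProtectedRegion(r1, frozenset({"clinician", "researcher"}), "key-A"),
        ProtectedRegion(r2, frozenset({"clinician"}), "key-B"),
        ProtectedRegion(r3, frozenset({"clinician"}), "key-B"),
    ])
    print(keys_needed(label, "researcher"))  # ['key-A']
```

Giving every entry the same rule and key reduces this to the simple case of FIG. 54, while distinct rules and keys per region correspond to FIG. 55.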
  • FIG. 56 shows an alternative encoding of reads of class U in which a signed POS descriptor is used to encode the mapping position of a read on the computed reference.
  • FIG. 57 shows how half mapped read pairs can help in filling unknown regions of the reference sequence by assembling longer contigs with unmapped reads.
  • FIG. 58 shows the hierarchical structure of headers for genomic data stored following the structure described in this invention.
  • FIG. 59 shows how a device implementing the labeling mechanism described by this invention enables concurrent access to data related to several genomic regions when they are stored in different records of a database. This can happen either in presence of controlled access or not.
  • FIG. 60 shows how vectors of thresholds are used in encoders of classes N, M and I to generate separated subclasses of data
  • FIG. 61 provides an example of how reference transformations can change the class reads belong to when all or a subset of mismatches are removed (i.e. the read belonging to class M before transformation is assigned to class P after the transformation of the reference has been applied).
  • FIG. 62 shows how reference transformations can be applied to remove mismatches (MMs) from reads.
  • reference transformations may generate new mismatches or change the type of mismatches found when referring to the reference before the transformation has been applied.
  • FIG. 63 shows that the same reference transformation A0 can be used for all classes of data, or that different transformations AN, AM, AI can be used for each class N, M, I.
  • labels comprising: an identifier of a reference genomic sequence ( 521 ), an identifier of said genomic regions ( 522 ), and an identifier of the data class ( 523 ) of said genomic data
  • genomic data are sequences of genomic reads.
  • data classes can be of the following type or a subset of them:
  • genomic data are paired sequences of genomic reads.
  • said data class of paired reads can be of the following types or a subset of them:
  • said identifier of said genomic regions is comprised in a master index table.
  • genomic data and said labels are entropy coded.
  • said master index table ( 4812 ) is comprised in a genomic dataset header ( 4813 ).
  • said regions of genomic data are dispersed among separate Access Units ( 524 , 486 ).
  • the location of said regions of genomic data, in a file is indicated in a local index table ( 525 ).
  • said labels are user specified.
  • said regions are protected and/or encrypted in a separate manner, without encrypting the whole genomic file.
  • said labels are stored in a genomic label list (GLL)
  • the method further comprises encoding genomic data with selective access to regions of genomic data as previously defined.
  • the method further comprises decoding a stream or a file of genomic data with selective access to regions of genomic data as previously defined.
  • the present invention further provides an apparatus for encoding genomic data as previously defined.
  • the present invention further provides an apparatus for decoding genomic data as previously defined.
  • the present invention further provides storage means for storing genomic data encoded as previously defined.
  • the present invention further provides a computer-readable medium comprising instructions that when executed cause at least one processor to perform the encoding method previously defined.
  • the present invention further provides a computer-readable medium comprising instructions that when executed cause at least one processor to perform the decoding method previously defined.
  • the present invention describes a labelling mechanism providing selective access and selective access control to genomic regions or sub-regions or aggregations of regions or sub-regions of compressed genomic data stored in a file format and/or the relevant access units to be used to store, transport, access and process genomic or proteomic information in the form of sequences of symbols representing molecules.
  • such molecules include, for example, nucleotides, amino acids and proteins.
  • One of the most important pieces of information represented as a sequence of symbols is the data generated by high-throughput genome sequencing devices.
  • the genome of any living organism is usually represented as a string of symbols expressing the chain of nucleic acids (bases) characterizing that organism.
  • Current state of the art genome sequencing technology is able to produce only a fragmented representation of the genome in the form of several (up to billions) strings of nucleic acids associated to metadata (identifiers, level of accuracy etc.). Such strings are usually called “sequence reads” or “reads”.
  • the typical steps of the genomic information life cycle comprise Sequence reads extraction, Mapping and Alignment, Variant detection, Variant annotation and Functional and Structural Analysis (see FIG. 1 ).
  • Sequence reads extraction is the process, performed by either a human operator or a machine, of representing fragments of genetic information in the form of sequences of symbols representing the molecules composing a biological sample.
  • in the case of nucleic acids such molecules are called “nucleotides”.
  • sequences of symbols produced by the extraction are commonly referred to as “reads”.
  • This information is usually encoded in prior art as FASTA files including a textual header and a sequence of symbols representing the sequenced molecules.
  • for the DNA of a living organism the alphabet is composed of the symbols (A,C,G,T,N).
  • for the RNA of a living organism the alphabet is composed of the symbols (A,C,G,U,N).
  • when IUPAC ambiguity codes are used, the alphabet of the symbols composing the reads is (A, C, G, T, U, W, S, M, K, R, Y, B, D, H, V, N or -).
  • a sequence of quality scores can be associated with each sequence read.
  • prior art solutions encode the resulting information as a FASTQ file. Sequencing devices can introduce errors in the sequence reads such as:
  • Coverage is used in literature to quantify the extent to which a reference genome or part thereof can be covered by the available sequence reads. Coverage is said to be:
  • Sequence alignment refers to the process of arranging sequence reads by finding regions of similarity that may be a consequence of functional, structural, or evolutionary relationships among the sequences.
  • reference genome a pre-existing nucleotides sequence referred to as “reference genome”
  • Sequence alignment can also be performed without a pre-existing sequence (i.e. reference genome); in such cases the process is known in prior art as “de novo” alignment.
  • Prior art solutions store this information in SAM, BAM or CRAM files.
  • The concept of aligning sequences to reconstruct a partial or complete genome is depicted in FIG. 3.
  • Variant detection is the process of translating the aligned output of genome sequencing machines, (sequence reads generated by NGS devices and aligned), to a summary of the unique characteristics of the organism being sequenced that cannot be found in other pre-existing sequence or can be found in a few pre-existing sequences only. These characteristics are called “variants” because they are expressed as differences between the genome of the organism under study and a reference genome. Prior art solutions store this information in a specific file format called VCF file.
  • Variant annotation is the process of assigning functional information to the genomic variants identified by the process of variant calling. This implies the classification of variants according to their relationship to coding sequences in the genome and according to their impact on the coding sequence and the gene product. This is in prior art usually stored in a MAF file.
  • the invention disclosed in this document consists in the definition of a selective and controlled data access applied to a compressed data structure for representing, processing, manipulating and transmitting genome sequencing data that differs from prior art solutions in at least the following aspects:
  • the key elements of the invention are:
  • the method described in this document aims at exploiting the available a-priori knowledge on genomic data to define an alphabet of syntax elements with reduced entropy.
  • in genomics the available knowledge is represented by an existing genomic sequence, usually but not necessarily of the same species as the one to be processed.
  • human genomes of different individuals differ only by a fraction of 1%.
  • such a small amount of data contains enough information to enable early diagnosis, personalized medicine, customized drug synthesis etc.
  • This invention aims at defining a genomic information representation format where the relevant information is efficiently accessible, access can be selectively controlled and data protected, the information is efficiently transportable and all such processing is performed handling compressed data structures.
  • the present invention application provides a specific data structure specification that implements appropriate data reordering into accessible units of homogeneous and/or semantically significant data enabling seamless access and processing required by state of the art genome data analysis applications.
  • the present invention adopts a data structure based on the concept of Access Unit, “Labels” and the multiplexing of the relevant data, concepts which are absent from all state of the art genomic data formats.
  • Genomic data are structured and encoded into different Access Units. Hereafter follows a description of the genomic data that are contained into different Access Units and can be identified by “Labels” associating genomic data to specific genomic regions or sub-regions or aggregations of regions or sub-regions versus reference genomes.
  • sequence reads generated by sequencing machines are classified by the disclosed invention into five different “classes” according to the matching results of the alignment with respect to one or more pre-existing reference sequences.
  • the classification specified in the previous section concerns single sequence reads.
  • in the case of sequencing technologies that generate reads in pairs (e.g. Illumina Inc.), in which two reads are known to be separated by an unknown sequence of variable length, it is appropriate to consider the classification of the entire pair into a single data class.
  • a read that is coupled with another is said to be its “mate”.
  • the entire pair is assigned to the same class for any class (i.e. P, N, M, I, U). In the case the two reads belong to a different class, but none of them belongs to the “Class U”, then the entire pair is assigned to the class with the highest priority defined according to the following expression:
  • the table below summarizes the matching rules applied to reads in order to define the class of data each read belongs to.
  • the rules are defined in the first five columns of the table in terms of presence or absence of type of mismatches (n, s, d, i and c type mismatches).
  • the sixth column provides rules in terms of maximum threshold for each mismatch type and any function f(n,s) and w(n,s,d,i,c) of the possible mismatch types.
  • the data classes of type N, M and I as defined in the previous sections can be further decomposed into an arbitrary number of distinct sub-classes with different degrees of matching accuracy. Such an option is an important technical advantage, providing a finer granularity and as a consequence a much more efficient selective access to each data class.
  • to define k sub-classes of Class N (up to Sub-Class $N_k$) it is necessary to define a vector with the corresponding components $MAXN_1, MAXN_2, \dots, MAXN_{k-1}, MAXN_k$, with the condition that $MAXN_1 < MAXN_2 < \dots < MAXN_{k-1} < MAXN$, and to assign each read to the lowest ranked sub-class that satisfies the constraints specified in Table 1 when evaluated for each element of the vector.
  • a data classification unit 601 contains Class P, N, M, I, U and HM encoders and encoders for annotations and metadata.
  • the Class N encoder is configured with a vector of thresholds, $MAXN_1$ to $MAXN_k$ ( 602 ), which generates k subclasses of N data ( 606 ).
  • the same principle is applied by defining a vector with the same properties for MAXM and MAXTOT respectively and using each vector component as a threshold for checking whether the functions f(n,s) and w(n,s,d,i,c) satisfy the constraint.
  • the assignment is given to the lowest sub-class for which the constraint is satisfied.
  • the number of sub-classes for each class type is independent and any combination of subdivisions is admissible. This is shown in FIG.
  • a Class M encoder and a Class I encoder are configured respectively with a vector of thresholds $MAXM_1$ to $MAXM_j$ ( 603 ) and $MAXTOT_1$ to $MAXTOT_h$ ( 604 ).
  • the two encoders generate respectively j subclasses of M data ( 607 ) and h subclasses of I data ( 608 ). When two reads in a pair are classified in the same sub-class, then the pair belongs to the same sub-class.
  • N has the lowest priority and I has the highest priority.
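The threshold-vector rule for sub-classes can be illustrated with a minimal sketch. It assumes that the constraint for sub-class i is "mismatch count not greater than the i-th threshold" (Table 1 is not reproduced here, so this is an assumption), and the function name `assign_subclass` and the sample thresholds are hypothetical.

```python
from bisect import bisect_left
from typing import Sequence


def assign_subclass(mismatch_count: int, thresholds: Sequence[int]) -> int:
    """Return the 1-based index of the lowest sub-class whose threshold
    is not exceeded by mismatch_count.

    thresholds is assumed to be in increasing order (e.g. MAXN_1 ... MAXN_k).
    A ValueError is raised when the count exceeds the largest threshold,
    i.e. the read does not fit any sub-class of this class.
    """
    if list(thresholds) != sorted(thresholds):
        raise ValueError("thresholds must be in increasing order")
    # First index i such that thresholds[i] >= mismatch_count.
    idx = bisect_left(thresholds, mismatch_count)
    if idx == len(thresholds):
        raise ValueError("count exceeds the largest threshold")
    return idx + 1


# Hypothetical vector of thresholds for Class N with k = 3 sub-classes.
MAXN = [2, 5, 10]
subclasses = [assign_subclass(n, MAXN) for n in (0, 2, 3, 10)]
```

The same function could be reused with the MAXM and MAXTOT vectors by passing the value of f(n,s) or w(n,s,d,i,c) in place of the plain mismatch count.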
  • the mismatches found for the reads classified in the classes N, M and I can be used to create “transformed references” to be used to compress more efficiently the read representation.
  • Reads classified as belonging to the Classes N, M or I (with respect to the pre-existing (i.e. “external”) reference sequence denoted as RS 0 ) can be coded with respect to the “transformed” reference sequence RS 1 according to the occurrence of the actual mismatches with the transformed reference.
  • FIG. 61 shows an example of how reads containing mismatches (belonging to Class M) with respect to reference sequence 1 (RS 1 ) can be transformed into perfectly matching reads with respect to the reference sequence 2 (RS 2 ) obtained from RS 1 by modifying the bases corresponding to the mismatch positions. They remain classified and they are coded together with the other reads in the same data class access unit, but the coding is done using only the descriptors and descriptor values needed for a Class P read. This transformation can be denoted as:
  • FIG. 62 shows an example on how a reference transformation is applied to reduce the number of mismatches to be coded on the mapped reads.
  • FIG. 61 shows an example on how reads can change the type of coding from a data class to another by means of the appropriate set of descriptors (e.g. using the descriptors of a Class P to code a read from Class M) after a reference transformation is applied and the read is represented using the transformed reference.
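The effect of a reference transformation on a read with substitutions can be sketched as follows. This is only an illustration under simple assumptions: positions are 0-based, only substitutions are considered, and the helper names `transform_reference` and `mismatches` as well as the sample sequences are hypothetical.

```python
from typing import Dict, List, Tuple


def transform_reference(reference: str, substitutions: Dict[int, str]) -> str:
    """Return a copy of reference with the given 0-based positions replaced."""
    bases = list(reference)
    for pos, base in substitutions.items():
        bases[pos] = base
    return "".join(bases)


def mismatches(read: str, start: int, reference: str) -> List[Tuple[int, str]]:
    """List (reference position, read base) for every substituted base."""
    return [
        (start + i, base)
        for i, base in enumerate(read)
        if reference[start + i] != base
    ]


rs1 = "ACGTACGTAC"          # hypothetical reference before transformation
read = "GTTCG"              # hypothetical read mapped at position 2
before = mismatches(read, 2, rs1)

# Build the transformed reference by applying the observed substitutions.
rs2 = transform_reference(rs1, dict(before))
after = mismatches(read, 2, rs2)
```

In this sketch the read has one mismatch against `rs1` and none against `rs2`, so it could be coded with the descriptors of a perfectly matching read once the transformed reference is used.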
  • the definition of the set of descriptors used for each class of data is provided in the following sections.
  • genomic data requires the storage of global parameters and metadata to be used by the decoding engine. These data are organized in the following structures: For file based storage:
  • The hierarchical relationship among these headers is shown in FIG. 58.
  • a dataset is defined as the ensemble of coding elements needed to reconstruct the genomic information related to a single genomic sequencing run and all the following analysis. If the same genomic sample is sequenced twice in two distinct runs, the obtained data will be encoded in two distinct datasets.
  • Master index table (Byte array): a multidimensional array supporting random access to Access Units. It contains the alignment positions of the first read in each block (Access Unit), i.e. the smallest position of the first read on the reference genome for each block of the six classes; one per pos class (six) per reference.
  • Label List (Byte array, e.g. integers): a list of Labels, each one represented as a multidimensional array in order to support selective access to specific genomic regions or sub-regions or aggregations of regions or sub-regions. It is a sub-part of the Genomic Dataset Header indicating:
  • the number of Labels
  • for each Label: the Label ID and the number of reference sequences concerned by the label
  • for each reference sequence: the reference identifier and the number of regions covered by the label
  • for each region: the class ID, the start position in the genomic range and the end position in the genomic range
  • Start position and end position can be replaced by “block numbers”, composing, together with reference sequence ID and class ID, a three dimensional vector addressing the coordinates of the Master Index Table.
  • Parameters set (Byte array): encoding parameters used to configure the encoding process and sent to the decoder.
  • Descriptors (a.k.a. syntax elements) are described in the following sections of this document and are the building blocks of the genomic information representation described by this invention. They are organized in layers (a.k.a. descriptors streams) of homogeneous elements partitioned according to the specific statistical properties of each descriptor. This has the advantage of reducing the entropy of each layer and improving compression efficiency.
  • Each layer is prepended by the Descriptors Layer Header described below.
  • Descriptors_Layer_Header {
  • Descriptors_Layer_ID: Descriptors layer ID, table specified in this specification
  • Num_Of_Blocks: Number of Blocks in the Descriptors Layer
  • Label size: Size of the human readable label
  • Label: (Human-Readable) Label
  • Flag: Flag used to interpret the following metadata
  • Local Index Table: the Local Index Table structure as described in this invention
  • Metadata: Data structure carrying metadata to be used for application-specific processing such as data analysis and content protection.
  • }
  • Every Descriptors Layer is composed by one or multiple Genomic Data Blocks.
  • One or more Blocks from different Layers compose an Access Unit, depending on the Class of data.
  • An Access Unit is a set of Genomic Blocks that can be decoded either independently from other Access Units by using only globally available data (e.g. decoder configuration) or by using information contained in other Access Units.
  • Block_Header (field: semantic) {
  • Descriptors_Layer_ID: Unambiguously identifies the descriptors stream. Same as Descriptors_Layer_ID in the Descriptors Layer Header.
  • Block size (BS): Number of bytes composing the Block, including this header and payload, and excluding padding (total Block size will be BS + padding size).
  • }
  • further processing consists in defining a set of distinct syntax elements which represent the remaining information enabling the reconstruction of the DNA read sequence when represented as being mapped on a given reference sequence.
  • a sequence read (e.g. a DNA segment) referred to a given reference sequence can be fully expressed by:
  • This classification creates groups of descriptors (syntax elements) that can be used to univocally represent genome sequence reads.
  • the table below summarizes the syntax elements needed for each class of reads aligned with “pre-existing” (i.e. “external”) or “constructed” (i.e. “internal”) references.
  • Reads belonging to class P are characterized and can be perfectly reconstructed by only a position, a reverse complement information and an offset between mates in case they have been obtained by a sequencing technology yielding mated pairs, some flags and a read length.
  • Class HM is applied to read pairs only and it is a special case where one read belongs to class P, N, M or I and the other to class U.
  • the mapping position of the first encoded read is stored in the AU header as an absolute position on the reference genome. All the other positions are expressed as a difference with respect to the previous position and are stored in a specific layer.
  • This modeling of the information source, defined by the sequence of read positions, is in general characterized by a reduced entropy particularly for sequencing processes generating high coverage results.
  • FIG. 4 shows how, after encoding the starting position of the first alignment as position “10000” on the reference sequence, the position of the second read starting at position 10180 is coded as “180”. With high coverage data (>50×) most of the descriptors of the position vector will show very high occurrences of low values such as 0 and 1 and other small integers.
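The differential coding of mapping positions can be sketched in a few lines. This is a minimal illustration, assuming the reads are already sorted by mapping position; the function names and the third sample position are hypothetical, while the values 10000 and 10180 come from the example above.

```python
from itertools import accumulate
from typing import List, Tuple


def encode_positions(positions: List[int]) -> Tuple[int, List[int]]:
    """Split positions into the absolute first position (kept in the AU
    header) and the differences between consecutive positions (pos layer)."""
    if not positions:
        raise ValueError("at least one position is required")
    deltas = [cur - prev for prev, cur in zip(positions, positions[1:])]
    return positions[0], deltas


def decode_positions(first: int, deltas: List[int]) -> List[int]:
    """Rebuild absolute positions by accumulating the differences."""
    return list(accumulate([first] + deltas))


first, deltas = encode_positions([10000, 10180, 10181])
restored = decode_positions(first, deltas)
```

With densely covered regions the differences tend to be small integers, which is what makes the resulting layer cheaper to entropy code than the absolute positions.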
  • FIG. 10 shows how the positions of three read pairs are encoded in a pos Layer.
  • the same source model is used for the positions of reads belonging to classes N, M, P and I.
  • the positions of reads belonging to the four classes are encoded in separate layers as depicted in Table I.
  • Each read of the read pairs produced by sequencing technologies can originate from either genome strand of the sequenced organic sample. However, only one of the two strands is used as reference sequence.
  • FIG. 8 shows how in a read pair one read (read 1) can originate from one strand and the other (read 2) can originate from the other strand.
  • read 2 can be encoded as reverse complement of the corresponding fragment on strand 1. This is shown in FIG. 9 .
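The reverse complement operation referred to above can be sketched as follows. This is a generic illustration restricted to the alphabet (A, C, G, T, N); the function name and the sample fragment are hypothetical.

```python
# Translation table mapping each base to its complement; N stays N.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(sequence: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return sequence.translate(_COMPLEMENT)[::-1]


# A read originating from the opposite strand can be represented by the
# fragment of the reference strand plus a flag saying "reverse complemented".
fragment_on_strand_1 = "AACGT"
read_2 = reverse_complement(fragment_on_strand_1)
```

Applying the function twice returns the original sequence, so a single flag per read is enough for a decoder to recover the read from the reference strand.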
  • the reverse complement information of reads belonging to classes P, N, M, I are coded in different layers as depicted in Table 3.
  • the pairing descriptor is stored in the pair layer.
  • Such layer stores descriptors encoding the information needed to reconstruct the originating reads pairs, when the employed sequencing technology produces reads by pairs.
  • although the vast majority of sequencing data is generated by technologies producing paired reads, this is not the case for all technologies. For this reason the presence of this layer is not necessary to reconstruct all sequencing data information if the sequencing technology of the genomic data considered does not generate paired reads information.
  • FIG. 5 shows how the pairing distance among read pairs is calculated.
  • the pair descriptor layer is the vector of pairing errors calculated as number of reads to be skipped to reach the mate pair of the first read of a pair with respect to the defined decoding pairing distance.
  • FIG. 6 shows an example of how pairing errors are calculated, both as absolute value and as differential vector (characterized by lower entropy for high coverages).
  • the same descriptors are used for the pairing information of reads belonging to classes N, M, P and I.
  • the pairing information of reads belonging to the four classes are encoded in different layer as depicted in.
  • when mapping sequence reads on a reference sequence it is not uncommon to have the first read in a pair mapped on one reference (e.g. chromosome 1) and the second on a different reference (e.g. chromosome 4).
  • the pairing information described above has to be integrated by additional information related to the reference sequence used to map one of the reads. This is achieved by coding
  • a reserved value indicating that the pair is mapped on two different sequences (different values indicate if read1 or read2 are mapped on the sequence that is not currently encoded)
  • a unique reference identifier referring to the reference identifiers encoded in the Genomic Dataset Header structure as described in Table 2.
  • FIG. 7 provides an example of this scenario.
  • the third element contains the mapping information on the concerned reference ( 170 ).
  • Class N includes all reads in which only “n type” mismatches are present, i.e. in place of an A, C, G or T base an N is found as the called base. All other bases of the read perfectly match the reference sequence.
  • FIG. 11 shows how:
  • a substitution is defined as the presence, in a mapped read, of a different nucleotide with respect to the one that is present in the reference sequence at the same position (see FIG. 12 ).
  • a substitution position is calculated as for the values of the nmis layer, i.e.: In read 1 substitutions are encoded
  • mismatches are coded by an index (moving from right to left) from the actual symbol present in the reference to the corresponding substitution symbol present in the read {A, C, G, T, N, Z}.
  • the mismatch index will be denoted as “4”.
  • the decoding process reads the encoded syntax element, the nucleotide at the given position on the reference and moves from left to right to retrieve the decoded symbol. E.g. a “2” received for a position where a G is present in the reference will be decoded as “N”.
  • FIG. 14 shows all the possible substitutions and the respective encoding symbols when IUPAC ambiguity codes are not used and
  • FIG. 15 provides an example of encoding of substitutions types in the snpt layer.
  • substitution indexes change as shown in FIG. 16 .
  • an alternative method of substitution encoding consists in storing only the mismatches positions in separate layers, one per nucleotide, as depicted in FIG. 17 .
  • mismatches and deletions are coded by an index (moving from right to left) from the actual symbol present in the reference to the corresponding substitution symbol present in the read: {A, C, G, T, N, Z}.
  • the mismatch index will be “4”.
  • the coded symbol will be “5”.
  • the decoding process reads the coded syntax element, the nucleotide at the given position on the reference and moves from left to right to retrieve the decoded symbol. E.g. a “3” received for a position where a G is present in the reference will be decoded as “Z” which indicates the presence of a deletion in the sequence read.
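The index-based coding of substitutions and deletions can be sketched as a circular walk over the symbol vector. This is a minimal illustration, assuming that the vector {A, C, G, T, N, Z} wraps around at its end (an assumption, although it is consistent with the two decoding examples given in the text); the function names are hypothetical.

```python
ALPHABET = "ACGTNZ"  # Z marks a deletion in the read


def encode_substitution(ref_base: str, read_symbol: str) -> int:
    """Number of steps from the reference symbol to the read symbol,
    walking the alphabet circularly (assumed wrap-around)."""
    distance = ALPHABET.index(read_symbol) - ALPHABET.index(ref_base)
    return distance % len(ALPHABET)


def decode_substitution(ref_base: str, code: int) -> str:
    """Recover the read symbol from the reference symbol and the code."""
    return ALPHABET[(ALPHABET.index(ref_base) + code) % len(ALPHABET)]


# The two decoding examples from the text: reference base G.
decoded_2 = decode_substitution("G", 2)
decoded_3 = decode_substitution("G", 3)
```

Because the code is relative to the reference symbol, the decoder needs only the reference base at the mismatch position and the coded index to recover the symbol found in the read.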
  • Inserts are coded as 6, 7, 8, 9, 10 respectively for inserted A, C, G, T, N.
  • FIG. 18 and FIG. 19 show examples of how to encode substitutions, inserts and deletions in a reads pair of class I.
  • syntax elements that can be of the following types:
  • FIG. 51 provides an example of such encoding procedure.
  • FIG. 56 shows an alternative encoding of unmapped reads on the internal reference where pos+pair syntax elements are replaced by a signed pos.
  • pos would express the distance, in terms of positions on the reference sequence, of the leftmost nucleotide position of read n with respect to the position of the leftmost nucleotide of read n-1.
  • This coding approach can be extended to support N start positions per read so that reads can be split over two or more reference positions. This can be particularly useful to encode reads generated by sequencing technologies (e.g. from Pacific Biosciences) producing very long reads (50K+ bases), which usually present repeated patterns generated by loops in the sequencing methodology. The same approach can be used as well to encode chimeric sequence reads, defined as reads that align to two distinct portions of the genome with little or no overlap.
  • MIT Master Index Table
  • the MIT contains one section per each class of data (P, N, M, I, U and HM) and per each reference sequence.
  • the MIT is contained in the Genomic Dataset Header of the encoded data.
  • FIG. 20 shows the structure of the Genomic Dataset Header
  • FIG. 21 shows a generic visual representation of the MIT
  • FIG. 22 shows an example of MIT for the class P of encoded reads.
  • the values contained in the MIT depicted in FIG. 22 are used to directly access the region of interest (and the corresponding AU) in the compressed domain.
  • a decoding application would skip to the second reference in the MIT and would look for the two values k1 and k2 so that k1 < 150,000 and k2 > 250,000.
  • k1 and k2 are 2 indexes read from the MIT. In the example of FIG. 22 this would result in positions 3 and 4 of the second vector of the MIT.
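The region lookup in the MIT can be sketched with a binary search. This is a minimal illustration, assuming that each MIT vector holds the sorted mapping positions of the first read of each block for one class and one reference; the function name and all position values are hypothetical and do not reproduce FIG. 22.

```python
from bisect import bisect_right
from typing import Dict, List, Sequence


def blocks_for_region(block_starts: Sequence[int], start: int, end: int) -> List[int]:
    """Return the 1-based indexes of the blocks whose first-read positions
    place them over the region [start, end]."""
    if start > end:
        raise ValueError("start must not exceed end")
    # Last block starting at or before the region start (at least block 0).
    first = max(bisect_right(block_starts, start) - 1, 0)
    # Last block starting at or before the region end.
    last = bisect_right(block_starts, end) - 1
    return list(range(first + 1, last + 2))


# Hypothetical MIT: one vector of block start positions per reference, class P.
mit: Dict[str, List[List[int]]] = {
    "P": [
        [0, 40_000, 90_000, 130_000, 170_000],          # reference 1
        [0, 50_000, 100_000, 200_000, 300_000],         # reference 2
    ]
}
wanted = blocks_for_region(mit["P"][1], 150_000, 250_000)
```

With these made-up values the region 150,000 to 250,000 on the second reference resolves to blocks 3 and 4, so only those Access Units would need to be fetched and decoded.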
  • the MIT can be used as an index of additional metadata and/or annotations added to the genomic data during its life cycle.
  • Each data layer described above is prefixed with a data structure referred to as local header.
  • the local header contains a unique identifier of the layer, a vector of Access Units counters per each reference sequence, a Local Index Table (LIT) and optionally some layer specific metadata.
  • the LIT is a vector of pointers to the physical position of the data belonging to each AU in the layer payload.
  • FIG. 23 depicts the generic layer header and payload where the LIT is used to access specific regions of the encoded data in a non-sequential way.
  • in order to access region 150,000 to 250,000 of reads aligned on the reference sequence no. 2, the decoding application retrieves positions 3 and 4 from the MIT. These values shall be used by the decoding process to access the 3rd and 4th elements of the corresponding section of the LIT.
  • the Total Access Units counters contained in the layer header are used to skip the LIT indexes related to AUs related to reference 1 (5 in the example).
  • the indexes containing the physical positions of the requested AUs in the encoded stream are therefore calculated as:
  • the blocks of data retrieved using the indexing mechanism called Local Index Table, are part of the Access Units requested.
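One plausible reading of the LIT addressing described above is sketched here. It assumes that the LIT stores the pointers of all references back to back, so that the Access Unit counters of the preceding references are simply added to the position obtained from the MIT; this layout, the function names and the byte offsets are assumptions, not the layout specified by the source.

```python
from typing import List, Sequence


def lit_indexes(au_counters: Sequence[int], reference_index: int,
                mit_positions: Sequence[int]) -> List[int]:
    """Map 1-based MIT positions of one reference to 1-based LIT indexes.

    au_counters holds the number of Access Units per reference, in order;
    reference_index is 0-based.
    """
    skipped = sum(au_counters[:reference_index])
    return [skipped + position for position in mit_positions]


def au_offsets(lit: Sequence[int], indexes: Sequence[int]) -> List[int]:
    """Look up the physical offsets stored in the LIT for the given indexes."""
    return [lit[index - 1] for index in indexes]


# Reference 1 has 5 Access Units (as in the example), reference 2 has 6.
counters = [5, 6]
indexes = lit_indexes(counters, reference_index=1, mit_positions=[3, 4])

# Hypothetical LIT: byte offsets of the 11 blocks in the layer payload.
lit = [0, 120, 250, 400, 520, 700, 810, 950, 1100, 1230, 1400]
offsets = au_offsets(lit, indexes)
```

Under these assumptions positions 3 and 4 of the second reference map to the 8th and 9th LIT entries, whose stored offsets tell the decoder where to start reading in the layer payload.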
  • FIG. 26 shows how the data blocks retrieved using the MIT and the LIT compose one or more Access Units.
  • The genomic data classified in data classes and structured in compressed or uncompressed layers are organized into different Access Units.
  • Genomic Access Units are defined as sections of genome data (in a compressed or uncompressed form) that reconstructs nucleotide sequences and/or the relevant metadata, and/or sequence of DNA/RNA (e.g. the virtual reference) and/or annotation data generated by a genome sequencing machine and/or a genomic processing device or analysis application.
  • An example of Access Unit is provided in FIG. 26 .
  • An Access Unit is a block of data that can be decoded either independently from other Access Units by using only globally available data (e.g. decoder configuration) or by using information contained in other Access Units.
  • Access Units are differentiated by:
  • Access units of any type can be further classified into different “categories”.
  • Access Units of type 0 are ordered (e.g. numbered), but they do not need to be stored and/or transmitted in an ordered manner (technical advantage: parallel processing/parallel streaming, multiplexing)
  • Access Units of type 1, 2, 3, 4, 5 and 6 do not need to be ordered and do not need to be stored and/or transmitted in an ordered manner (technical advantage: parallel processing/parallel streaming).
  • FIG. 26 shows how Access Units are composed by a header and one or more layers of homogeneous data.
  • Each layer can be composed by one or more blocks.
  • Each block contains several packets and the packets are a structured sequence of the descriptors introduced above to represent e.g. reads positions, pairing information, reverse complement information, mismatches positions and types etc.
  • Each Access unit can have a different number of packets in each block, but within an Access Unit all blocks have the same number of packets.
  • Each data packet can be identified by the combination of 3 identifiers X Y Z where:
  • FIG. 28 shows an example of Access Units and packets labelling where AU T N is an access unit of type T with identifier N which may or may not imply a notion of order according to the Access Unit Type. Identifiers are used to uniquely associate Access Units of one type with those of other types required to completely decode the carried genomic data.
  • Access Units of any type can be further classified and labelled in different “categories” according to different sequencing processes. For example, but not as a limitation, classification and labelling can take place when
  • the access units of type 1, 2, 3, 4, 5 and 6 are built according to the result of a matching function applied on genome sequence fragments (a.k.a. reads) with respect to the reference sequence encoded in Access Units of type 0 they refer to.
  • access units (AUs) of type 1 may contain the positions and the reverse complement flags of those reads which result in a perfect match (or maximum possible score corresponding to the selected matching function) when a matching function is applied to specific regions of the reference sequence encoded in AUs of type 0. Together with the data contained in AUs of type 0, such matching function information is sufficient to completely reconstruct all genome sequence reads represented by the data set carried by the access units of type 1.
  • the Access Units of type 1 described above would contain information related to genomic sequence reads of class P (perfect matches).
  • the matching functions applied with respect to access units of type 1 to classify the content of AU for the type 2, 3 and 4 can provide results such as:
  • An additional mechanism is provided by the disclosed invention enabling user-defined selective access to data classes referring to specific genomic regions or sub-regions or aggregations of regions or sub-regions.
  • a “Label” is an identifier which is assigned to a specific genomic region or sub-region or aggregations of regions or sub-regions. Labels identify genomic regions by specifying: the reference sequence id (“Ref ids”), the index of the MIT corresponding to the desired region of the reference sequence, and the data classes. An example is provided in FIG. 52 .
  • a single, a subset, or all data classes can be referenced by a Label, enabling selective access to only a sub-set of the data associated to a specific genomic region or sub-regions or aggregations of regions or sub-regions.
  • a Label list should be created by a Genomic Labels Generator ( 4917 in FIG. 49 ), in a storage scenario and/or in a streaming scenario, to expose the available Labels to the analysis applications that apply selective access to the stored or streamed data.
  • one or more Access Units can be identified using a specific “Label” by means of a Block Header field (“Label ID”), which serves as an identifier for the “Label” in the “Label List” which the current block belongs to.
  • start_pos and “end_pos” fields can be replaced by the block numbers referring to all “blocks” belonging to a specific “Label”, as follows:
  • the “Label List” is created by a Genomic Labels Generator ( 4917 ) and sent to the genomic multiplexer (see also FIG. 49 ).
  • the demultiplexer parses the Label List syntax and exposes the available Labels to the data access application, which according to the specific data access required selects the Access Units corresponding to the subset of “Labels”.
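The selection step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the syntax of Table 4 or Table 5: the `Region`, `Label` and `AccessUnit` classes, the `select_access_units` helper and the class names "P" and "M" are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    ref_id: int      # reference sequence identifier
    class_id: str    # data class (illustrative value, e.g. "P")
    start_pos: int   # first position covered on the reference
    end_pos: int     # last position covered on the reference


@dataclass(frozen=True)
class Label:
    label_id: int
    name: str
    regions: tuple   # Region objects aggregated under this label


@dataclass(frozen=True)
class AccessUnit:
    ref_id: int
    class_id: str
    start_pos: int
    end_pos: int


def select_access_units(label, access_units):
    """Return the AUs whose reference and class match a region of the label
    and whose position range overlaps that region."""
    selected = []
    for au in access_units:
        for region in label.regions:
            same_ref_and_class = (au.ref_id == region.ref_id
                                  and au.class_id == region.class_id)
            overlaps = (au.start_pos <= region.end_pos
                        and au.end_pos >= region.start_pos)
            if same_ref_and_class and overlaps:
                selected.append(au)
                break  # do not add the same AU twice
    return selected


# Hypothetical label aggregating two regions on two reference sequences.
gene_xy = Label(1, "GeneXY", (Region(0, "P", 1000, 1999),
                              Region(1, "P", 500, 899)))
aus = [
    AccessUnit(0, "P", 0, 1499),     # overlaps the first region
    AccessUnit(0, "M", 0, 1499),     # same range, different class
    AccessUnit(0, "P", 2000, 2999),  # outside the labelled region
    AccessUnit(1, "P", 800, 1200),   # overlaps the second region
]
hits = select_access_units(gene_xy, aus)
```

Only the Access Units returned in `hits` would need to be decoded by the data access application; the others can be skipped without being parsed.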
  • Generic random access can be achieved by specifying a three dimensional vector determining the MIT and LIT coordinates of interest (reference id, position range and classes) and ignoring the information carried by the Label List.
  • FIG. 51 shows how labels are used to aggregate and uniquely identify several genomic regions by using indexes contained in the MIT.
  • FIG. 59 shows how a device ( 592 ) implementing the labelling mechanism disclosed by this invention can enable concurrent access to several records of data ( 596 ) stored in a database ( 595 ). Selective protection of one or more regions identified by the same label is supported as well by means of a dedicated module ( 591 ) in charge of parsing the query ( 591 ) and dispatching the required metadata to the security module ( 594 ) in charge of enforcing access control.
  • the labels decoder ( 593 ) is in charge of translating the label syntax into object identifiers that can be protected (and therefore access is controlled by the security module 594 ) or not.
  • FIG. 39 shows how access to the genomic information mapped with mismatches on the second segment of the reference sequence (AU 0-2) requires the decoding of AUs 0-2, 1-2 and 3-2 only.
  • This is an example of selective access according to both a criterion related to a mapping region (i.e. position on the reference sequence) and a criterion related to the matching function applied to the encoded sequence reads with respect to the reference sequence (e.g. mismatches only in this example).
  • a further technical advantage is that the querying on the data is much more efficient in terms of data accessibility and execution speed because it can be based on accessing and decoding only selected “categories”, specific regions of longer genomic sequences and only specific layers for access units of type 1, 2, 3, 4 that match the criteria of the applied queries and any combination thereof.
  • FIG. 52 shows how access can be limited to the genomic information associated with specific genomic regions or sub-regions or aggregations of regions or sub-regions identified by user-defined “Labels”.
  • the syntax of a label is based on a three coordinates system where each region or sub-region associated to a label can be uniquely identified by:
  • a further technical advantage is that querying the data is much more efficient in terms of data accessibility and execution speed because it can be based on accessing and decoding only selected “categories” of the labelled specific regions and only specific layers for access units of type 1, 2, 3, 4 that correspond to the “Labels” of the applied queries, and any combination thereof.
  • Another technical advantage of this labelling mechanism is the possibility of efficiently retrieving encoded genomic information that has been scattered among several Access Units due to its characteristics such as position on the reference genome, type of mismatches with respect to the reference ( 524 ).
  • Filtering genomic data according to the characteristics of the mapped reads (e.g. perfectly matching, substitutions only, etc.) today can take hours when using the traditional formats such as BAM and CRAM. This is due to the fact that the data are sparse within the compressed format and require decompression and filtering using pipelines of commands.
  • the present invention describes a data structure that enables data filtering in a matter of seconds. Memory usage can also be reduced by a factor that is proportional to the file size (from 10× to 100×) since the present invention does not require the decoding (i.e. memory allocation) of the entire file.
  • the Label parameter “Label_lenght_in_blocks” and for each block the parameters: “ref_num”, “class_ID”, “block_num” are determined by the multiplexer based on the position on the reference of the “GeneXY” and “GeneWZ” regions and the class of data for which the selective access is desired. The complete syntax is reported in Table 5.
  • the Label parameters “ref ID”, “class_ID”, “start_pos” and “end_pos” are determined by the multiplexer based on the position on the reference of the “GeneXY” and “GeneWZ” regions and the class of data for which the selective access is desired. The complete syntax is reported in Table 4.
  • the access units of type 7 and 8 allow for easy insertion of annotations without the need of depacketizing/decoding/decompressing the whole file thereby adding to the efficient handling of the file which is a limitation of prior art approaches.
  • Existing compression solutions may have to access and process a large amount of compressed data before the desired genomic data can be accessed. This will cause inefficient RAM bandwidth utilization and more power consumption also in hardware implementations. Power consumption and memory access issues may be alleviated by using the approach based on Access Units described here.
  • the data indexing mechanism described in the Master Index Table, together with the use of Access Units and the possibility of identifying Access Units with user-defined “Labels” associated to specific genomic regions or sub-regions or aggregations of regions or sub-regions, enables incremental update of the encoded content as described below. This mechanism is shown with an example in FIG. 53 .
  • New genomic information can be periodically added to existing genomic data for several reasons. For example when:
  • This mechanism is illustrated in FIG. 40 where pre-existing data encoded in 3 AUs of type 1 and 4 AUs per each type from 2 to 4 are updated with 3 AUs per type with encoding data coming for example from a new sequence run for the same individual.
  • The mechanism of creating or updating “Labels” and the “Label List” is illustrated in FIG. 52 and FIG. 53 .
  • the incremental update of a pre-existing data set may be useful when analyzing data as soon as they are generated by a sequencing machine and before the actual sequencing is completed.
  • An encoding engine (compressor) can assemble several AUs in parallel by “clustering” sequence reads that map on the same region of the selected reference sequence. Once the first AU contains a number of reads above a pre-configured threshold/parameter, the AU is ready to be sent to the analysis application. Together with the newly encoded Access Unit, the encoding engine (the compressor) shall make sure that all Access Units the new AU depends on have already been sent to the receiving end or are sent together with it. For example, an AU of type 3 will require the appropriate AUs of type 0 and type 1 to be present at the receiving end in order to be properly decoded.
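The dependency rule above can be sketched as a small bookkeeping routine on the encoder side. This is only an illustration: the `DEPENDS_ON` table and the `units_to_send` helper are hypothetical, and the sketch assumes that dependent Access Units share the same identifier N (as in the AUs 0-2, 1-2 and 3-2 of FIG. 39).

```python
# Assumed dependency table: the text only states that an AU of type 3 needs
# the matching AUs of type 0 and type 1; the other entries are illustrative.
DEPENDS_ON = {0: (), 1: (0,), 3: (0, 1)}


def units_to_send(au_type, au_id, already_sent):
    """Return the (type, id) pairs to transmit so the new AU is decodable.

    Dependencies not yet at the receiving end come first, the new AU last.
    `already_sent` is updated in place with everything that gets sent.
    """
    batch = []
    for dep_type in DEPENDS_ON[au_type]:
        dep = (dep_type, au_id)
        if dep not in already_sent:
            batch.append(dep)
    batch.append((au_type, au_id))
    already_sent.update(batch)
    return batch


sent = set()
first = units_to_send(1, 2, sent)   # AU 1-2 also pulls in AU 0-2
second = units_to_send(3, 2, sent)  # AU 0-2 and AU 1-2 are already there
```

With this bookkeeping the receiving variant calling application never receives an Access Unit that it cannot yet decode.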
  • a receiving variant calling application would be able to start calling variants on the AU received before the sequencing process has been completed at the transmitting side.
  • a schematic of this process is depicted in FIG. 41 .
  • Compressed genomic data can require transcoding, for example, in the following situations:
  • prior art compression solutions may have to access and process a large amount of compressed data before the desired genomic data can be accessed. This causes inefficient RAM bandwidth utilization and higher power consumption, also in hardware implementations. Power consumption and memory access issues may be alleviated by using the approach based on Access Units described here.
  • a further advantage of the adoption of the genomic access units described here is the facilitation of parallel processing and suitability for hardware implementations.
  • Current solutions such as SAM/BAM and CRAM are conceived for single-threaded software implementation.
  • a person skilled in the art knows that the majority of genomic information related to an organism's genetic profile lies in the differences (variants) with respect to a known sequence (e.g. a reference genome or a population of genomes).
  • An individual genetic profile to be protected from unauthorized access will therefore be encoded in Access Units of type 3 and 4 as described in this document.
  • the implementation of controlled access to the most sensitive genomic information produced by a sequencing and analysis process can therefore be realized by encrypting only the payload of AUs of type 3 and 4 (see FIG. 47 for an example). This will generate significant savings in terms of both processing power and bandwidth, since the resource-consuming encryption process is applied to a subset of the data only.
  • the labelling mechanism enables different mechanisms of data protection and access control.
  • FIG. 54 shows how one protection mechanism (e.g. encryption) and one access control rule (AC) can be applied to several genomic regions identified by the same label.
  • data protection can be implemented by applying a different access control rule and a different protection mechanism (encryption) to each region identified by a label. This is shown in FIG. 55 .
  • selective protection of genomic regions or sub-regions or aggregations of regions or sub-regions identified by different “Labels” can be easily implemented by applying encryption only to the compressed data corresponding to a “Label”, for both file and streamed scenarios.
  • two genomic regions labelled as “GeneXY” and “GeneWZ” like in the example of section “Selective Access to Specific Genomic Regions identified by User Specified “Labels” in “storage” and “streaming” scenarios” can be differentiated by only encrypting data labelled by “GeneXY” and leaving in clear the compressed data labelled as “GeneWZ”.
  • Encryption rules can be carried by the metadata fields (in both storage and streaming scenarios) and associated to each element of the “Label List”
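The label-based protection described above can be sketched as follows. The `protect_blocks` helper and the `toy_cipher` function are hypothetical; in particular `toy_cipher` is a stand-in that offers no security and is used only so that the sketch runs with the standard library alone.

```python
def protect_blocks(blocks, protected_labels, encrypt):
    """Apply `encrypt` only to payloads whose label is in `protected_labels`.

    `blocks` is a list of (label_name, payload_bytes) pairs. The result keeps
    the same order and adds a flag telling whether the payload was encrypted.
    """
    out = []
    for label_name, payload in blocks:
        if label_name in protected_labels:
            out.append((label_name, True, encrypt(payload)))
        else:
            out.append((label_name, False, payload))
    return out


def toy_cipher(payload):
    # Stand-in for a real cipher, used only to make the sketch runnable.
    # It offers no security; a real system would use a vetted algorithm.
    return bytes(b ^ 0x5A for b in payload)


blocks = [("GeneXY", b"\x01\x02"), ("GeneWZ", b"\x03\x04")]
protected = protect_blocks(blocks, {"GeneXY"}, toy_cipher)
```

Only the compressed data labelled “GeneXY” goes through the cipher, while the data labelled “GeneWZ” is left in clear, which is where the savings in processing power come from.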
  • Genomic Access Units can be transported over a communication network within a Genomic Data Multiplex.
  • a Genomic Data Multiplex is defined as a sequence of packetized genomic data and metadata represented according to the data classification disclosed as part of this invention, transmitted in network environments where errors, such as packet losses, may occur.
  • Genomic Data Multiplex is conceived to ease and render more efficient the transport of genomic coded data over different environments (typically network environments) and has the following advantages not present in state of the art solutions:
  • An example of Genomic Data Multiplexing is shown in FIG. 49 .
  • Genomic Dataset is defined as a structured set of Genomic Data including, for example, genomic data of a living organism, one or more sequences and metadata generated by several steps of genomic data processing, or the result of the genomic sequencing of a living organism.
  • One Genomic Data Multiplex may include multiple Genomic Datasets (as in a multi-channel scenario) where each dataset refers to a different organism.
  • the multiplexing mechanism of the several datasets into a single Genomic Data Multiplex is governed by information contained in data structures called Genomic Datasets List (GDL), Genomic Dataset Mapping Tables List (GDMTL) and Genomic Dataset Mapping Table (GDMT).
  • Genomic Dataset List is defined as a data structure listing all Genomic Datasets available in a Genomic Data Multiplex. Each of the listed Genomic Datasets is identified by a unique value called Genomic Dataset ID (GID).
  • Each Genomic Dataset listed in the GDL is associated to:
  • the GDL is sent as payload of a single Transport Packet at the beginning of a Genomic Data Stream transmission; it can then be periodically re-transmitted in order to enable random access to the Stream.
  • the syntax of the GDL data structure is provided in the table below with an indication of the data type associated to each syntax element.
  • section_length bitstring field specifying the number of bytes composing the section, starting immediately following the section_length field, and including the CRC.
  • multiplex_id bitstring field which serves as a label to identify this multiplexed stream from any other multiplex within a network.
  • version_number bitstring field indicating the version number of the whole Genomic Dataset List Section. The version number shall be incremented by 1 whenever the definition of the Genomic Dataset Mapping Table changes. When the applicable_section_flag is set to ‘1’, then the version_number shall be that of the currently applicable Genomic Dataset List. When the applicable_section_flag is set to ‘0’, then the version_number shall be that of the next applicable Genomic Dataset List.
  • applicable_section_flag A 1 bit indicator, which when set to ‘1’ indicates that the Genomic Dataset Mapping Table sent is currently applicable. When the bit is set to ‘0’, it indicates that the table sent is not yet applicable and shall be the next table to become valid.
  • list_ID This is a bitstring field identifying the current genomic dataset list.
  • genomic_dataset_ID genomic_dataset_ID is a bitstring field which specifies the genomic dataset to which the genomic_dataset_map_SID is applicable. This field shall not take any single value more than once within one version of the Genomic Dataset Mapping Table.
  • genomic_dataset_map_SID genomic_dataset_map_SID is a bitstring field identifying the Genomic Data Stream carrying the Genomic Dataset Mapping Table (GDMT) associated to this Genomic Dataset. No genomic_dataset_ID shall have more than one genomic_dataset_map_SID associated. The value of the genomic_dataset_map_SID is defined by the user.
  • reference_id_map_SID reference_id_map_SID is a bitstring field identifying the Genomic Data Stream carrying the Reference ID Mapping Table (RIDMT) associated to this Genomic Dataset. No genomic_dataset_ID shall have more than one reference_id_map_SID associated. The value of the reference_id_map_SID is defined by the user.
  • genomic_Label_list_SID genomic_Label_list_SID is a bitstring field identifying the Genomic Data Stream carrying the Genomic Label List (GLL) associated to this Genomic Dataset. No genomic_dataset_ID shall have more than one genomic_Label_list_SID associated. The value of the genomic_Label_list_SID is defined by the user.
  • Checksum This is a bitstring field that contains an integrity check value for the entire GDL.
  • One typical algorithm used for this purpose is the CRC32 algorithm, producing a 32 bit value; other algorithms include the hashing functions MD5 and SHA-256.
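The integrity check can be illustrated with Python's standard `zlib.crc32` and `hashlib` functions. This is a sketch under assumptions: placing the checksum in the last 4 bytes, the big-endian byte order and the helper names `append_crc32` and `verify_crc32` are choices made for this example and are not specified by the text.

```python
import hashlib
import zlib


def append_crc32(table_bytes):
    """Append a 4-byte big-endian CRC32 of the table to the table itself."""
    crc = zlib.crc32(table_bytes) & 0xFFFFFFFF
    return table_bytes + crc.to_bytes(4, "big")


def verify_crc32(section):
    """Check that the trailing 4 bytes match the CRC32 of the preceding bytes."""
    body, stored = section[:-4], int.from_bytes(section[-4:], "big")
    return (zlib.crc32(body) & 0xFFFFFFFF) == stored


gdl = b"example GDL payload"      # placeholder bytes, not real GDL syntax
section = append_crc32(gdl)
corrupted = b"X" + section[1:]    # first byte altered in transit
sha256_digest = hashlib.sha256(gdl).digest()  # 32-byte alternative check value
```

A receiver would call `verify_crc32` on each received table and discard it on a mismatch, waiting for the next periodic re-transmission.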
  • the Genomic Dataset Mapping Table (GDMT) is produced and transmitted at the beginning of a streaming process (and possibly periodically re-transmitted, updated or identical in order to enable the update of correspondence points and the relevant dependencies in the streamed data).
  • the GDMT is carried by a single Packet following the Genomic Dataset List and lists the SIDs identifying the Genomic Data Streams composing one Genomic Dataset.
  • the GDMT is the complete collection of all identifiers of Genomic Data Streams (e.g., the genomic sequence, reference genome, metadata, etc) composing one Genomic Dataset carried by a Genomic Multiplex.
  • a genomic dataset mapping table is instrumental in enabling random access to genomic sequences by providing the identifier of the stream of genomic data associated to each genomic dataset.
  • the syntax of the GDMT data structure is provided in the table below with an indication of the data type associated to each syntax element.
  • the syntax elements composing the GDMT described above have the following meaning and function.
  • Genomic_dataset_ID bitstring field identifying a Genomic Dataset.
  • mapping_table_ID bitstring field identifying the current Genomic Dataset Mapping Table.
  • genomic_dataset_ef_length bitstring field specifying the number of bytes of the optional extension_field associated with this Genomic Dataset.
  • data_type bitstring field specifying the type of genomic data carried by the packets identified by the genomic_data_SID.
  • genomic_data_SID bitstring bit field specifying the Stream ID of the packets carrying the encoded genomic data associated with one component of this Genomic Dataset (e.g. read p positions, read p pairing information etc. as defined in this invention)
  • gd_component_ef_length bitstring field specifying the number of bytes of the optional extension_field associated with the genomic Stream identified by genomic_data_SID.
  • Checksum This is a bitstring field that contains an integrity check value for the entire GDMT.
  • One typical algorithm used for this purpose is the CRC32 algorithm, producing a 32 bit value, or hashing functions such as MD5 and SHA-256.
  • extension_fields are optional descriptors that might be used to further describe either a Genomic Dataset or one Genomic Dataset component.
  • the data_type field can have the following values
  • This structure carries information about all the datasets mapping tables related to a Genomic Datasets Multiplex.
  • the Reference ID Mapping Table (RIDMT) is produced and transmitted at the beginning of a streaming process.
  • the RIDMT is carried by a single Packet following the Genomic Dataset List.
  • the RIDMT specifies a mapping between the numeric identifiers of reference sequences (REFID) contained in the Block header of an access unit and the (typically literal) reference identifiers contained in the Genomic Dataset Header specified in Table 2.
  • the RIDMT can be periodically re-transmitted in order to:
  • the syntax of the RIDMT data structure is provided in the table below with an indication of the data type associated to each syntax element.
  • table_length, genomic_dataset_ID, version_number, applicable_section_flag These elements have the same meaning as for the GDMT.
  • reference_id_mapping_table_ID bitstring field identifying the current Reference ID Mapping Table.
  • ref_string_length bitstring field specifying the number of characters (bytes) composing ref_string, excluding the end of string (‘\0’) character.
  • ref_string[i] byte field encoding each character of the string representation of a reference sequence (e.g. “chr1” for chromosome 1). The end of string (‘\0’) character is not necessary, as it is implicitly inferred from the ref_string_length field.
  • REFID This is a bitstring field uniquely identifying a reference sequence.
  • Checksum This is a bitstring field that contains an integrity check value for the entire RIDMT.
  • One typical algorithm used for this purpose is the CRC32 algorithm, producing a 32 bit value, or any hash function producing longer strings of bits.
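The mapping between numeric REFID values and literal reference names can be sketched as a length-prefixed encoding. The field widths used here (2-byte REFID, 1-byte ref_string_length, big-endian) and the helper names are assumptions made for illustration; the table header fields and the Checksum are left out.

```python
import struct


def encode_ridmt_entries(mapping):
    """Serialize {REFID: ref_string} as REFID, ref_string_length, ref_string.

    Assumed widths: 2-byte REFID and 1-byte ref_string_length, big-endian.
    No end-of-string terminator is written: the length field delimits it.
    """
    out = bytearray()
    for refid, name in mapping.items():
        raw = name.encode("ascii")
        out += struct.pack(">HB", refid, len(raw)) + raw
    return bytes(out)


def decode_ridmt_entries(data):
    """Inverse of encode_ridmt_entries."""
    mapping, offset = {}, 0
    while offset < len(data):
        refid, length = struct.unpack_from(">HB", data, offset)
        offset += 3  # size of the REFID and length fields
        mapping[refid] = data[offset:offset + length].decode("ascii")
        offset += length
    return mapping


table = {0: "chr1", 1: "chr2", 2: "chrX"}
encoded = encode_ridmt_entries(table)
decoded = decode_ridmt_entries(encoded)
```

Block headers can then carry only the compact numeric REFID, and the receiver resolves it to the literal name through the decoded table.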
  • a label is an identifier which is assigned to a specific genomic region or sub-region or aggregation of regions or sub-regions.
  • Labels identify genomic regions by specifying the reference sequence id, the position range with respect to the reference sequence and the data classes that they identify.
  • the Genomic Label List (GLL) is created during the packetization process by the multiplexer and transmitted.
  • the depacketizer of the demultiplexer parses the GLL syntax and exposes the available “Labels” to the data access application, which has the possibility to select and access the desired sub-set of data.
  • the GLL is (optionally) produced and transmitted at the beginning of a stream and typically re-transmitted periodically in order to enable multiple synchronization points ( 4811 ), and provides the list of “Labels” associated to the Multiplex and Dataset identified by the multiplex_id and dataset_id fields.
  • the syntax of the GLL data structure is provided in the table below with an indication of the data type associated to each syntax element.
  • table_length Bitstring field specifying the number of bytes composing the list, starting after the table_length field, and including the Checksum field.
  • multiplex_ID Byte which serves as a label to identify the Genomic Multiplex from any other multiplex within a network.
  • dataset_ID Byte which serves as a label to identify the Genomic Dataset from any other dataset within the multiplex identified by multiplex_id.
  • num_Labels Bitstring representing the total number of Labels in this GLL.
  • Label_id Bitstring identifying the i-th Label.
  • num_ref Bitstring identifying the number of references concerned by the current label.
  • ref_id Bitstring identifying the j-th reference sequence the i-th Label refers to.
  • num_regions Bitstring identifying the number of regions conveyed by the i-th Label.
  • class_id Bitstring identifying the class of the k-th region in the j-th reference in the i-th Label.
  • start_pos Bitstring indicating the position in the j-th reference sequence of the first read of
  • a Genomic Data Multiplex contains one or several Genomic Data Streams where each stream can transport
  • a Genomic Data Stream containing genomic data is essentially a packetized version of a Genomic Data Layer where each packet is prepended with a header describing the packet content and how it is related to other elements of the Multiplex.
  • the Genomic Data Stream format and the File Format described in this document are mutually convertible. Whereas a file can be reconstructed in full only after all data have been received, in the case of streaming a decoding tool can reconstruct, access and start processing the partial data at any time.
  • a Genomic Data Stream is composed by several Genomic Data Blocks each containing one or more Genomic Data Packets.
  • Genomic Data Blocks are containers of genomic information composing one genomic AU. GDB can be split into several Genomic Data Packets, according to the communication channel requirements.
  • Genomic access units are composed by one or more Genomic Data Blocks belonging to different Genomic Data Streams.
  • Genomic Data Packets are transmission units composing one GDB. Packet size is typically set according to the communication channel requirements.
  • FIG. 27 shows the relationship among Genomic Multiplex, Streams, Access Units, Blocks and Packets when encoding data belonging to the P class as defined in this invention.
  • three Genomic Streams encapsulate information on position, pairing and reverse complement of sequence reads.
  • Genomic Data Blocks are composed by a header, a payload of compressed data and padding information.
  • the table below provides an example of implementation of a GDB header with a description of each field and a typical data type.
  • AUID (bitstring): Unambiguous ID, linearly increasing (not necessarily by 1, even though recommended). Needed to implement proper random access, as described in the Master Index Table defined in this invention.
  • Label ID (bitstring): Unambiguous ID, linearly increasing by 1, identifying the genomic region/classes (Label) this block belongs to. It corresponds to the i-th index in the main for loop in the Genomic Label List described above.
  • Reference ID (REFID) (bitstring, optional): Unambiguous ID, identifying the reference sequence the AU containing this block refers to. This is needed, along with the POS field, to have proper random access, as described in the Master Index Table.
  • POS (bitstring): Position on the reference sequence of the first read in the block.
  • Optional fields (bytestring): Additional optional fields, presence signaled by BS.
  • Padding (bitstring): Optional, presence signaled by PDF. Fixed bitstring value that can be inserted in order to meet the channel requirements. If present, the first byte indicates how many bytes compose the padding. It is discarded by the decoder.
  • AUID Master Index Table
  • LIT Local Index Table
  • AUID and BS enable the receiving end to dynamically re-create a LIT locally, without the need to send extra-data.
  • AUID, BS and POS will enable to recreate a MIT locally without the need to send additional data.
  • a Genomic Data Block can be split into one or more Genomic Data Packets, depending on network layer constraints such as maximum packet size, packet loss rate, etc.
  • a Genomic Data Packet is composed by a header and a payload of encoded or encrypted genomic data as described in the table below.
  • Genomic Data Packet syntax elements:
  • Stream ID (SID) (bitstring): Unambiguously identifies the data type carried by this packet. A Genomic Dataset Mapping Table is needed at the beginning of the stream in order to map Stream IDs to data types. Used also for updating correspondence points and relevant dependencies.
  • Access Unit Marker Bit (MB) (bit): Set for the last packet of the access unit. Allows identifying the last packet of an AU.
  • Packet Counter Number (SN) (bitstring): Counter associated to each Stream ID, linearly increasing by 1. Needed to identify gaps/packet losses. Wraps around at 255.
  • Packet Size (PS) (bitstring): Number of bytes composing the packet, including header, optional fields and payload.
  • Extension Flag (EF) (bit): Set if extension fields are present.
  • Extension Fields (bytestring): Optional fields, presence signaled by PS.
  • Payload (bytestring): Block data (entire block or fragment).
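The use of the packet counter to identify losses can be sketched with the standard `struct` module. The header layout below (1-byte SID, 1-byte flags holding the marker bit, 1-byte SN, 2-byte PS, no extension fields) is an assumption made for this example, since the text does not fix the field widths; `build_packet`, `parse_packet` and `lost_between` are hypothetical helper names.

```python
import struct

HEADER = struct.Struct(">BBBH")  # assumed widths: SID, flags, SN, PS


def build_packet(sid, sn, payload, last_in_au=False):
    """Build a packet made of a 5-byte header followed by the payload."""
    flags = 0x80 if last_in_au else 0x00   # marker bit kept in the top bit
    size = HEADER.size + len(payload)      # PS counts header and payload
    return HEADER.pack(sid, flags, sn % 256, size) + payload


def parse_packet(packet):
    """Split a packet into its header fields and payload."""
    sid, flags, sn, size = HEADER.unpack_from(packet)
    return {"sid": sid, "mb": bool(flags & 0x80), "sn": sn,
            "payload": packet[HEADER.size:size]}


def lost_between(prev_sn, next_sn):
    """Number of packets missing between two consecutively received counters
    of the same stream, given that the counter wraps around after 255."""
    return (next_sn - prev_sn - 1) % 256


pkt = build_packet(7, 255, b"abc", last_in_au=True)
parsed = parse_packet(pkt)
```

For example, receiving counter 1 right after counter 254 on the same Stream ID means that the packets numbered 255 and 0 were lost, so `lost_between(254, 1)` returns 2.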
  • the Genomic Multiplex can be properly decoded only when at least one Genomic Dataset List, one Genomic Dataset Mapping Table and one Reference ID Mapping Table have been received, allowing to map every packet to a specific Genomic Dataset component.
  • Every Genomic Data Block may be split in fragments, which may be transmitted in the payload of Genomic Data Packets, depending on channel requirements, such as packet loss rate, protocol maximum packet size, etc.
  • a Genomic Data Packet is defined as follows.
  • Packet_header( ) {
  • Layer ID (LID): Unambiguously identifies the data type carried by this Packet. Unique for each sub-stream/data type. A Mapping Table is needed at the beginning of the stream in order to map Layer IDs to data types.
  • Reserved: To maintain byte-alignment.
  • Access Unit Marker Bit (MB): Set for the last Packet of the Access Unit. Allows identifying the end of an AU as a set of Blocks.
  • Sequence Number (SN): Packet counter, linearly increasing by 1. Needed to identify packet losses as gaps in SNs for each individual sub-stream. Associated to LID, i.e., a different SN for every LID.
  • Packet Size (PS): Number of bytes composing the Packet, including header, optional fields and payload.
  • Extension Flag (EF): Set if the extension field is present.
  • [optional] Extension field: Optional field, present if EF is set.
  • }
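The fragmentation of a block into packet payloads, with the marker bit set on the last one, can be sketched as follows. The helpers `fragment_block` and `reassemble` are hypothetical, and the sketch simplifies the model by letting a single block stand for a whole Access Unit.

```python
def fragment_block(block, max_payload):
    """Split block bytes into (marker_bit, fragment) pairs.

    The marker bit is set only on the last fragment. Here one block stands
    for a whole Access Unit, which is a simplification.
    """
    if max_payload <= 0:
        raise ValueError("max_payload must be positive")
    fragments = [block[i:i + max_payload]
                 for i in range(0, len(block), max_payload)] or [b""]
    last = len(fragments) - 1
    return [(index == last, frag) for index, frag in enumerate(fragments)]


def reassemble(packets):
    """Concatenate fragments up to and including the one with the marker bit."""
    data = bytearray()
    for marker, frag in packets:
        data += frag
        if marker:
            return bytes(data)
    raise ValueError("end of access unit not received")


block = bytes(range(10))
packets = fragment_block(block, 4)   # maximum payload size set by the channel
restored = reassemble(packets)
```

The maximum payload size would be chosen according to the channel requirements mentioned above, such as the protocol maximum packet size.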
  • FIG. 49 shows how, before being transformed into the data structures presented in this invention, raw genomic sequence data need to be mapped ( 491 ) on one or more reference sequences known a priori ( 4920 ).
  • If a reference sequence is not available, a “constructed” reference can be built from the raw sequence data ( 492 ).
  • Already aligned data can be re-aligned in order to reduce the information entropy.
  • a genomic classifier ( 494 ) creates the data classes according to the matching functions described in Table 1 and separates metadata (e.g. quality values) and annotation data from the genomic sequences.
  • a reference transformation ( 4919 ) can be applied on the external reference ( 4920 ) in order to further reduce the entropy of the generated classes of data ( 498 ).
  • the transformed data classes ( 4918 ) are fed to layers encoders ( 495 - 497 ) to produce genomic layers ( 491 ) which are then encoded by entropy encoders ( 4912 - 4914 ).
  • the genomic streams generated by the entropy encoders are then sent to Genomic Multiplexer ( 4916 ) which generates the Genomic Multiplex.
  • Genomic labels generated by a Genomic Labels Generator ( 4917 ) can be associated to the genomic streams ( 4915 ) by the Multiplexer ( 4916 ).

Abstract

The storage or transmission of genomic data is realized by employing a structured compressed genomic dataset in a file or in a stream of genomic data. Selective access to the data, or subsets of the data, corresponding to specific genomic regions is achieved by employing user-defined labels based on data classification and a specific indexing mechanism.

Description

    TECHNICAL FIELD
  • The present application provides new methods for the efficient storage, transmission and multiplexing of bioinformatics data, and in particular genomic sequencing data, in compressed form that enable efficient selective access and selective protection of the different data categories composing the genomic datasets.
  • BACKGROUND
  • An appropriate representation of genome sequencing data is fundamental to enable efficient processing, storage and transmission of genomic data to make possible and facilitate analysis applications such as genome variants calling and all analysis performed, with various purposes, by processing the sequencing data and metadata. Today, genome sequencing information is generated by High Throughput Sequencing (HTS) machines in the form of sequences of nucleotides (a. k. a. bases) represented by strings of letters from a defined vocabulary.
  • These sequencing machines do not read out entire genomes or genes, but they produce short random fragments of nucleotide sequences known as sequence reads. A quality score is associated to each nucleotide in a sequence read. This number represents the confidence level given by the machine to the read of a specific nucleotide at a specific location in the nucleotide sequence. The raw sequencing data generated by NGS machines are commonly stored in FASTQ files (see also FIG. 1).
  • The smallest vocabulary to represent sequences of nucleotides obtained by a sequencing process is composed by five symbols: {A, C, G, T, N} representing the four types of nucleotides present in DNA namely Adenine, Cytosine, Guanine, and Thymine plus the symbol N to indicate that the sequencing machine was not able to call any base with a sufficient level of confidence, so the type of base in such position remains undetermined in the reading process. In RNA Thymine is replaced by Uracil (U). The nucleotides sequences produced by sequencing machines are called “reads”. In case of paired reads the term “template” is used to designate the original sequence from which the read pair has been extracted. Sequence reads can be composed by a number of nucleotides in a range from a few dozen up to several thousand. Some technologies produce sequence reads in pairs where each read can be originated from one of the two DNA strands.
  • In the genome sequencing field the term “coverage” is used to express the level of redundancy of the sequence data with respect to a reference genome. For example, to reach a coverage of 30× on a human genome (3.2 billion bases long) a sequencing machine must produce a total of about 30×3.2 billion bases so that on average each position in the reference is “covered” 30 times.
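  • The coverage arithmetic above can be written out as a short sketch. The genome length and the 30× target come from the text; the function names and the 150-base read length are illustrative assumptions, not part of the disclosure.

```python
GENOME_LENGTH = 3_200_000_000   # human genome, about 3.2 billion bases (from the text)
TARGET_COVERAGE = 30            # desired average coverage, i.e. 30x (from the text)


def bases_required(genome_length: int, coverage: int) -> int:
    """Total number of sequenced bases needed for a given average coverage."""
    return genome_length * coverage


def reads_required(genome_length: int, coverage: int, read_length: int) -> int:
    """Number of fixed-length reads needed, rounded up (read_length is an assumption)."""
    total = bases_required(genome_length, coverage)
    return (total + read_length - 1) // read_length


print(bases_required(GENOME_LENGTH, TARGET_COVERAGE))
print(reads_required(GENOME_LENGTH, TARGET_COVERAGE, read_length=150))
```

  • With these inputs the machine has to emit 96 billion bases, which at a hypothetical 150 bases per read corresponds to 640 million reads.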
  • State of the Art Solutions
  • The most used genome information representations of sequencing data are based on the FASTQ and SAM file formats, which are commonly made available in zipped form in an attempt to reduce the original size. The traditional file formats, respectively FASTQ and SAM for non-aligned and aligned sequencing data, consist of plain text characters and are thus compressed by using general purpose approaches such as LZ (from Lempel and Ziv) schemes (the well-known zip, gzip, etc.). When general purpose compressors such as gzip are used, the result of the compression is usually a single blob of binary data. Information in such monolithic form is quite difficult to archive, transfer and process, particularly in the case of high throughput sequencing, when the volumes of data are extremely large.
  • After sequencing, each stage of a genomic information processing pipeline produces data represented by a completely new data structure (file format) despite the fact that in reality only a small fraction of the generated data is new with respect to the previous stage.
  • FIG. 1 shows the main stages of a typical genomic information processing pipeline with the indication of the associated file format representation.
  • Commonly used solutions present several drawbacks. Data archival is inefficient because a different file format is used at each stage of the genomic information processing pipeline, which implies the multiple replication of data, with the consequent rapid increase of the required storage space. This is inefficient and unnecessary, and it is becoming unsustainable given the increase of the data volume generated by HTS machines. This has consequences in terms of available storage space and generated costs, and it is also hindering the benefits of genomic analysis in healthcare from reaching a larger portion of the population. The impact of the IT costs generated by the exponential growth of sequence data to be stored and analysed is currently one of the main challenges that the scientific community and the healthcare industry have to face (see Scott D. Kahn “On the future of genomic data”, Science 331, 728 (2011) and Pavlichin, D. S., Weissman, T., and G. Yona. 2013. “The human genome contracts again” Bioinformatics 29(17): 2199-2202). At the same time, several initiatives are attempting to scale genome sequencing from a few selected individuals to large populations (see Josh P. Roberts “Million Veterans Sequenced”, Nature Biotechnology 31, 470 (2013)).
  • The transfer of genomic data is slow and inefficient because the currently used data formats are organized into monolithic files of up to several hundred Gigabytes of size which need to be entirely transferred at the receiving end in order to be processed. This implies that the analysis of a small segment of the data requires the transfer of the entire file with significant costs in terms of consumed bandwidth and waiting time. Often online transfer is prohibitive for the large volumes of the data to be transferred, and the transport of the data is performed by physically moving storage media such as hard disk drives or storage servers from one location to another.
  • These limitations occurring when employing state of the art approaches are overcome by the present invention.
  • Processing the data is slow and inefficient due to the fact that the information is not structured in such a way that the portions of the different classes of data and metadata required by commonly used analysis applications can be retrieved without accessing the data in its totality. As a consequence, common analysis pipelines can run for days or weeks, wasting precious and costly processing resources, because at each stage large volumes of data must be accessed, parsed and filtered even if the portions of data relevant for the specific analysis purpose are much smaller.
  • These limitations are preventing health care professionals from obtaining genomic analysis reports in a timely manner and from reacting promptly to disease outbreaks. The present invention provides a solution to this need.
  • There is another technical limitation that is overcome by the present invention.
  • In fact the invention aims at providing an appropriate genomic sequencing data and metadata representation by organizing and partitioning the data so that the compression of data and metadata is maximized and several functionalities, such as selective access and support for incremental updates, are efficiently enabled.
  • A key aspect of the invention is a specific definition of classes of data and metadata to be represented by an appropriate source model, coded (i.e. compressed) separately by being structured in specific layers. The most important achievements of this invention with respect to existing state of the art methods consist in:
      • the increase of compression performance due to the reduction of the information source entropy constituted by providing an efficient model for each class of data or metadata;
      • the possibility of performing selective accesses to portions of the compressed data and metadata for any further processing purpose directly in the compressed domain;
      • the possibility of defining user specified “labels” identifying genomic regions or sub-regions or aggregations of regions or sub-regions to enable efficient selective access to the compressed data by means of parsing a “labels list” contained in the genomic file header;
      • the possibility of implementing access control and protection to the different genomic regions or sub-regions identified by a label;
      • the possibility of incrementally (without the need of re-encoding) updating and adding encoded data and metadata with new sequencing data and/or metadata and/or new analysis results;
      • the possibility of efficiently processing data as soon as they are produced by the sequencing machine or alignment tools without the need of waiting for the end of the sequencing or alignment process.
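  • The first achievement in the list above (lower source entropy when each class of data has its own model) can be illustrated with a toy calculation. The sketch below is not the coding scheme of the invention: the layer names and values are hypothetical, and tagging every symbol with its layer name merely stands in for a record format in which each field must identify itself.

```python
import math
from collections import Counter


def entropy_bits(symbols):
    """Empirical Shannon entropy in bits per symbol."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


# Hypothetical toy descriptors: small position deltas and substitution types.
layers = {
    "pos": [0, 1, 0, 0, 2, 0, 1, 0],
    "snpt": ["A", "C", "A", "A", "G", "A", "A", "C"],
}

# One mixed stream: every symbol also carries its layer identity,
# which plays the role of a per-field flag in a record-based format.
mixed = [(name, value) for name, values in layers.items() for value in values]
mixed_entropy = entropy_bits(mixed)

# Separate layers: the layer identity is implicit, so only the values are coded.
total = sum(len(values) for values in layers.values())
layered_entropy = sum(
    len(values) / total * entropy_bits(values) for values in layers.values()
)

print(f"mixed stream : {mixed_entropy:.3f} bits/symbol")
print(f"by layer     : {layered_entropy:.3f} bits/symbol")
```

  • In this toy case the mixed stream costs exactly one extra bit per symbol, which is the entropy of the layer tag itself for two equally sized layers. Real gains depend on the actual statistics of each descriptor.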
  • The present application discloses a method and system addressing the problem of efficient manipulation, storage and transmission of very large amounts of genomic sequencing data, by employing a structured access units approach combined with multiplexing techniques.
  • The present application overcomes all the limitations of the prior art approaches related to the functionality of genomic data accessibility, selective data protection, efficient processing of data subsets, transmission and streaming functionality combined with an efficient compression.
  • Today the most used representation format for genomic data is the Sequence Alignment/Map (SAM) textual format and its binary counterpart BAM. SAM files are human readable ASCII text files, whereas BAM adopts a block based variant of gzip. BAM files can be indexed to enable a limited form of random access. This is supported by the creation of a separate index file.
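  • The general idea of block based compression with a separate index can be sketched as follows. This is a simplified illustration built on Python's standard `zlib` module, not the actual BAM container or index layout; the function names and the 4096-byte block size are assumptions.

```python
import zlib


def compress_in_blocks(data: bytes, block_size: int):
    """Compress fixed-size blocks independently; return (blob, index).

    index[i] holds the (offset, length) of compressed block i inside blob.
    """
    blob = bytearray()
    index = []
    for start in range(0, len(data), block_size):
        block = zlib.compress(data[start:start + block_size])
        index.append((len(blob), len(block)))
        blob.extend(block)
    return bytes(blob), index


def read_block(blob: bytes, index, i: int) -> bytes:
    """Decompress only block i, leaving the rest of the blob untouched."""
    offset, length = index[i]
    return zlib.decompress(blob[offset:offset + length])


text = b"ACGTNACGTTGCA" * 1000
blob, index = compress_in_blocks(text, block_size=4096)
print(len(index), "blocks,", len(blob), "compressed bytes")
print(read_block(blob, index, 2)[:13])
```

  • Random access of this kind is purely positional: the index says where a block starts, but nothing about which class of genomic data it contains, which is the limitation discussed in the following paragraphs.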
  • The BAM format is characterized by poor compression performance for the following reasons:
    • 1. It focuses on compressing the inefficient and redundant SAM file format rather than on extracting the actual genomic information conveyed by SAM files and using appropriate models for compressing it.
    • 2. It employs a general purpose text compression algorithm such as gzip rather than exploiting the specific nature of each data source (the genomic information itself).
    • 3. It lacks any concept and does not support any functionality related to data classification that would enable the implementation of mechanisms providing selective access to specific classes of genomic data.
  • A more sophisticated approach to genomic data compression that is less commonly used, but more efficient than BAM is CRAM (CRAM specification: https://samtools.github.io/hts-specs/CRAMv3.pdf). CRAM provides a more efficient compression for the adoption of differential encoding with respect to an existing reference (it partially exploits the data source redundancy), but it still lacks features such as incremental updates, support for streaming and selective access to specific classes of compressed data.
  • CRAM relies on the concept of the CRAM record. Each CRAM record encodes a single mapped or unmapped read by encoding all the elements necessary to reconstruct it.
  • CRAM presents the following drawbacks and limitations that are solved and removed by the invention described in this document:
    • 1. CRAM does not support data indexing and random access to data subsets sharing specific features. Data indexing is out of the scope of the specification (see section 12 of CRAM specification v 3.0) and it is implemented as a separate file. Conversely the approach of the invention described in this document employs a data indexing method that is integrated with the encoding process and indexes are embedded in the encoded (i.e. compressed) bit stream.
    • 2. CRAM does not support the aggregation of the data related to several sequencing runs so that selective access is efficient and segregation of runs (i.e. the process of extracting the genomic information from the actual organic sample) is preserved. CRAM does provide the possibility to label reads as belonging to different groups, but this is provided on a read by read basis and reads from different groups are then mixed in the file structure. In the present invention a method is described to structure the data so as to keep segregation among different sequencing runs so that efficient selective access is available.
    • 3. CRAM is built by core data blocks that can contain any type of mapped reads (perfectly matching reads, reads with substitutions only, reads with insertions or deletions (also referred to as “indels”)). There is no notion of data classification and grouping of reads in classes according to the result of mapping with respect to a reference sequence. This means that all data need to be inspected even if only reads with specific features are searched. Such limitation is solved by the invention by classifying and partitioning data in classes before coding.
    • 4. CRAM is based on the concept of encapsulating each read into a “CRAM record”. This implies the need to inspect each complete “record” when reads characterized by specific biological features (e.g. reads with substitutions, but without “indels”, or perfectly mapped reads) are searched. Conversely, in the present invention there is the notion of data classes coded separately in separate information layers and there is no notion of record encapsulating each read. This enables more efficient access to set of reads with specific biological characteristics (e.g. reads with substitutions, but without “indels”, or perfectly mapped reads) without the need of decoding each (block of) read(s) to inspect its features.
    • 5. In a CRAM record each field in a record is associated to a specific flag and each flag must always have the same meaning as there is no notion of context since each CRAM record can contain any different type of data. This coding mechanism introduces redundant information and prevents the usage of efficient context based entropy coding.
      • Conversely in the present invention there is no notion of flag denoting data because this is intrinsically defined by the information “layer” the data belongs to. This implies a largely reduced number of symbols to be used and a consequent reduction of the information source entropy which results into a more efficient compression. Such improvement is possible because the use of different “layers” enables the encoder to reuse the same symbol across each layer with different meanings according to the context. In CRAM each flag must always have the same meaning as there is no notion of contexts and each CRAM record can contain any type of data.
    • 6. In CRAM, substitutions, insertions and deletions are represented by using different syntax elements, an option that increases the size of the information source alphabet and yields a higher source entropy. Conversely the approach of the disclosed invention uses a single alphabet and encoding for substitutions, insertions and deletions. This makes the encoding and decoding process simpler and produces a lower entropy source model whose coding yields bitstreams characterized by high compression performance.
    • 7. CRAM does not provide any mechanism to uniquely identify specific regions or sub regions of the genomic data or aggregations thereof. Apart from the definition of loci in terms of start and end positions on the reference sequence, according to the CRAM specification there is no way to:
      • label a region and access it using the defined label instead of the genomic start and end position. Start and end positions of the same genomic region may change if a new reference sequence is published, while a defined label would hide such change to any end user. The encoding and decoding system would take care of adapting the actual region identified by the label to the newly published reference sequence
      • aggregate several regions or sub-regions under the same label so that any end user would be able to select the required data via a single query not involving complex nested queries. The entire aggregation mechanism would be embedded in the encoding and decoding system as described in this document.
    • 8. CRAM does not provide or support any mechanism to implement selective protection and access control relative to specific regions or sub regions of the genomic data or aggregations thereof, neither when such regions are pre-defined nor when they are specified by the user inserting appropriate “Labels”.
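  • The labeling idea of items 7 and 8 can be sketched as a small lookup structure. Everything below is hypothetical (the `Region` type, the `panel_A` label, the coordinates and the offset-based remapping); it only illustrates how one label can aggregate several regions and hide a change of coordinates from the end user.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    reference_id: str   # identifier of the reference sequence
    start: int          # first position, inclusive
    end: int            # last position, inclusive


# Hypothetical labels list: one label aggregates regions on two references.
labels = {
    "panel_A": [
        Region("ref1", 1_000, 1_999),
        Region("ref1", 5_000, 5_499),
        Region("ref2", 200, 799),
    ],
}


def regions_for(label: str):
    """Resolve a label to its regions with a single query."""
    return labels[label]


def remap_label(label: str, shift: dict):
    """Re-anchor a label after a new reference sequence is published.

    `shift` maps a reference id to a position offset. This is a stand-in:
    a real system would derive the mapping from the old and new references.
    """
    labels[label] = [
        Region(
            r.reference_id,
            r.start + shift.get(r.reference_id, 0),
            r.end + shift.get(r.reference_id, 0),
        )
        for r in labels[label]
    ]


for region in regions_for("panel_A"):
    print(region)
```

  • A user keeps querying `panel_A` before and after a call such as `remap_label("panel_A", {"ref1": 10})`; only the stored coordinates change.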
  • Besides CRAM, the other approaches to genomic data compression and processing also present strong limitations with respect to most of the desired functionality and do not support the features provided by this invention, as described and specified in the remainder of this document.
  • Genomic compression algorithms used in the state of the art can be classified into these categories:
      • Transform-based
        • LZ-based
        • Read reordering
      • Assembly-based
      • Statistical modeling
  • The first two categories share the disadvantage of not exploiting the specific characteristics of the data source (genomic sequence reads): they process the genomic data as strings of text to be compressed without taking into account the specific properties of this kind of information (e.g. redundancy among reads, reference to an existing sample). Two of the most advanced toolkits for genomic data compression, namely CRAM and Goby (“Compression of structured high-throughput sequencing data”, F. Campagne, K. C. Dorff, N. Chambwe, J. T. Robinson, J. P. Mesirov, T. D. Wu), make poor use of arithmetic coding, as they implicitly model data as independent and identically distributed according to a Geometric distribution. Goby is slightly more sophisticated since it converts all the fields to lists of integers and each list is encoded independently using arithmetic coding without using any context. In its most efficient mode of operation, Goby is able to perform some inter-list modeling over the integer lists to improve compression. These prior art solutions yield poor compression ratios and data structures that are difficult, if not impossible, to selectively access and manipulate once compressed. Downstream analysis stages can be inefficient and very slow due to the necessity of handling large and rigid data structures even to perform simple operations or to access selected regions of the genomic dataset.
  • A simplified view of the relation among the file formats used in genome processing pipelines is depicted in FIG. 2. In this diagram file inclusion does not imply the existence of a nested file structure, but it only represents the type and amount of information that can be encoded for each format (i.e. SAM contains all information in FASTQ, but organized in a different file structure). CRAM contains the same genomic information as SAM/BAM, but it has more flexibility in the type of compression that can be used, therefore it is represented as a superset of SAM/BAM.
  • The use of multiple file formats for the storage of genomic information is highly inefficient and costly. Having different file formats at different stages of the genomic information life cycle implies a linear growth of utilized storage space even if the incremental information is minimal. Further disadvantages of prior art solutions are listed below.
    • 1. Accessing, analysing or adding annotations (metadata) to raw data stored in compressed FASTQ files or any combination thereof requires the decompression and recompression of the entire file with extensive usage of computational resources and time.
    • 2. Retrieving specific subsets of information such as read mapping position, read variant position and type, indel position and type, or any other metadata and annotation contained in aligned data stored in BAM files requires access to the whole data volume associated with each read. Selective access to a single class of metadata is not possible with prior art solutions.
    • 3. Prior art file formats require that the whole file is received at the end user before processing can start. For example the alignment of reads could start before the sequencing process has been completed relying on an appropriate data representation. Sequencing, alignment and analysis could proceed and run in parallel.
    • 4. Prior art solutions do not support structuring and are not capable of distinguishing genomic data obtained by different sequencing processes according to their specific generation semantics (e.g. sequencing obtained at different times in the life of the same individual). The same limitation occurs for sequencing obtained from different types of biological samples of the same individual.
    • 5. The protection by means of access control mechanisms (e.g. encryption, watermarking, digital signature, hashing) of entire or selected portions of the data is not supported by prior art solutions. For example the protection of:
      • a. selected DNA regions
      • b. only those sequences containing variants
      • c. chimeric sequences only
      • d. unmapped sequences only
      • e. regions or sub-regions or aggregations of regions or sub-regions identified by user defined Labels
      • f. specific metadata (e.g. origin of the sequenced sample, identity of sequenced individual, type of sample)
      • is not supported in files and data formats of prior art solutions.
    • 6. The transcoding of sequencing data aligned to a given reference (i.e. a SAM/BAM file) to a new reference requires processing the entire data volume even if the new reference differs from the previous reference by only a single nucleotide position.
  • Therefore there is the clear need of an appropriate Genomic Information Storage Format (Genomic File Format) and Transport Mechanism that enable efficient compression, support selective access and protection functionality in the compressed domain, of local and remotely stored data and support the incremental addition of heterogeneous metadata in the compressed domain at all levels of the different stages of the genomic data processing.
  • The present invention provides a solution to the limitations of the state of the art by employing the method, devices and computer programs as claimed in the accompanying set of claims.
  • LIST OF FIGURES
  • FIG. 1 shows the main steps of a typical genomic pipeline and the related file formats.
  • FIG. 2 shows the mutual relationship among the most used genomic file formats
  • FIG. 3 shows how genomic sequence reads are assembled in an entire or partial genome via de-novo assembly or reference based alignment.
  • FIG. 4 shows how reads mapping positions on the reference sequence are calculated.
  • FIG. 5 shows how reads pairing distances are calculated.
  • FIG. 6 shows how pairing errors are calculated.
  • FIG. 7 shows how the pairing distance is encoded when a read mate pair is mapped on a different chromosome.
  • FIG. 8 shows how sequence reads can be generated from the first or second DNA strand of a genome.
  • FIG. 9 shows how a read mapped on strand 2 has a corresponding reverse complemented read on strand 1.
  • FIG. 10 shows the four possible combinations of reads composing a reads pair and the respective encoding in the rcomp layer.
  • FIG. 11 shows how “n type” mismatches are encoded in a nmis layer.
  • FIG. 12 shows an example of substitutions in a mapped read pair.
  • FIG. 13 shows how substitutions positions can be calculated either as absolute or differential values.
  • FIG. 14 shows how symbols encoding substitutions without IUPAC codes are calculated.
  • FIG. 15 shows how substitution types are encoded in the snpt layer.
  • FIG. 16 shows how symbols encoding substitutions with IUPAC codes are calculated.
  • FIG. 17 shows an alternative source model for substitution where only positions are encoded, but one layer per substitution type is used.
  • FIG. 18 shows how to encode substitutions, insertions and deletions in a reads pair of class I when IUPAC codes are not used.
  • FIG. 19 shows how to encode substitutions, insertions and deletions in a reads pair of class I when IUPAC codes are used.
  • FIG. 20 shows the structure of the Genomic Dataset Header of the genomic information data structure disclosed by this invention.
  • FIG. 21 shows how the Master Index Table contains the positions on the reference sequences of the first read in each Access Unit.
  • FIG. 22 shows an example of partial MIT showing the mapping positions of the first read in each pos AU of class P.
  • FIG. 23 shows how the Local Index Table in the layer header is a vector of pointers to the AUs in the payload.
  • FIG. 24 shows an example of Local Index Table.
  • FIG. 25 shows the functional relation between Master Index Table and Local Index Tables
  • FIG. 26 shows how Access Units are composed by blocks of data belonging to several layers. Layers are composed by Blocks subdivided in Packets.
  • FIG. 27 shows how a Genomic Access Unit of type 1 (containing positional, pairing, reverse complement and read length information) is packetized and encapsulated in a Genomic Data Multiplex.
  • FIG. 28 shows how Access Units are composed by a header and multiplexed blocks belonging to one or more layers of homogeneous data. Each block can be composed by one or more packets containing the actual descriptors of the genomic information.
  • FIG. 29 shows the structure of Access Units of type 0 which do not need to refer to any information coming from other access units to be decoded and accessed.
  • FIG. 30 shows the structure of Access Units of type 1.
  • FIG. 31 shows the structure of Access Units of type 2 which contain data that refer to an access unit of type 1. These are the positions of N bases in the encoded reads.
  • FIG. 32 shows the structure of Access Units of type 3 which contain data that refer to an access unit of type 1. These are the positions and types of mismatches in the encoded reads.
  • FIG. 33 shows the structure of Access Units of type 4 which contain data that refer to an access unit of type 1. These are the positions and types of mismatches in the encoded reads.
  • FIG. 34 shows the first five types of Access Units.
  • FIG. 35 shows that Access Units of type 1 refer to Access Units of type 0 to be decoded.
  • FIG. 36 shows that Access Units of type 2 refer to Access Units of type 0 and 1 to be decoded.
  • FIG. 37 shows that Access Units of type 3 refer to Access Units of type 0 and 1 to be decoded.
  • FIG. 38 shows that Access Units of type 4 refer to Access Units of type 0 and 1 to be decoded.
  • FIG. 39 shows the Access Units required to decode sequence reads with mismatches mapped on the second segment of the reference sequence (AU 0-2).
  • FIG. 40 shows how raw genomic sequence data that becomes available can be incrementally added to pre-encoded genomic data.
  • FIG. 41 shows how a data structure based on Access Units enables genomic data analysis to start before the sequencing process is completed.
  • FIG. 42 shows how new analysis performed on existing data can imply that reads are moved from AUs of type 4 to one of type 3.
  • FIG. 43 shows how newly generated analysis data are encapsulated in a new AU of type 8 and a corresponding index is created in the MIT.
  • FIG. 44 shows how to transcode data due to the publication of a new reference sequence (genome).
  • FIG. 45 shows how reads mapped to a new genomic region with better quality (e.g. no indels) are moved from AU of type 4 to AU of type 3
  • FIG. 46 shows how, in case a new mapping location is found (e.g. with fewer mismatches), the related reads can be moved from one AU to another of the same type.
  • FIG. 47 shows how selective encryption can be applied on Access Units of Type 4 only as they contain the sensitive information to be protected.
  • FIG. 48 shows the data encapsulation in a genomic multiplex where one or more genomic datasets 482-483 contain Genomic streams 484 and streams of Genomic Datasets Mapping Table Lists 481, Genomic Dataset Mapping Tables 485, and Reference Identifiers Mapping Tables 487. Each genomic stream is composed by a Header 488 and Access Units 486. Access Units encapsulate Blocks 489 which are composed by Packets 4810.
  • FIG. 49 shows how raw genomic sequence data (499) or aligned genomic data (produced by element 491) are processed to be encapsulated in a Genomic Multiplex. The alignment (491) and reference genome construction (492) stages can be necessary to prepare the data for encoding. Data classes (498) generated by a data classification unit (494) can be further classified with respect to one or more transformed reference generated by a reference transformation unit (4919). The transformed classes (4918) are then sent to layers encoders (495-497). The generated layers (4911) are encoded by entropy coders (4912-4914) which generate Genomic Streams of Access Units (4915) fed to the Genomic Multiplexer (4916).
  • FIG. 50 shows how a genomic demultiplexer (500) extracts Genomic Streams (501) from the Genomic Multiplex (5010), one decoder per AU type (502-504) extracts the genomic layers which are then decoded (506-507) into various data classes (5011) which are used by class decoders (509) to reconstruct genomic formats such as for example FASTQ and SAM/BAM. When present in the multiplexed bitstream (5010) a genomic stream containing one or more reference transformations is decoded by an entropy decoder (504) to produce reference transformation descriptors (5012). Reference transformation descriptors are processed by a reference transformation unit (5013) to transform one or more “external” references to generate one or more transformed references (5014) to be used by the class decoders (509).
  • FIG. 51 shows the process of encoding sequence reads belonging to class U using a self-generated reference sequence using six layers of descriptors. Four layers are the same used for other classes P, N, M, I while two layers are specific to class U reads.
  • FIG. 52 shows how a label is built to aggregate genomic regions belonging to two different references.
  • FIG. 53 shows how an existing label can be updated in case new results of analysis require to add an additional region R4 to the existing ones (R1, R2 and R3).
  • FIG. 54 shows how the labeling mechanism can be used to implement access control and data protection on specific genomic regions or sub regions. The simple case uses one access control rule (AC) and one protection mechanism (e. g. encryption) for all genomic regions identified by one label.
  • FIG. 55 shows how the different genomic regions identified by the same label can be protected by several different access control rules (AC) and several different encryption keys.
  • FIG. 56 shows an alternative encoding of reads of class U in which a signed POS descriptor is used to encode the mapping position of a read on the computed reference.
  • FIG. 57 shows how half mapped read pairs can help in filling unknown regions of the reference sequence by assembling longer contigs with unmapped reads.
  • FIG. 58 shows the hierarchical structure of headers for genomic data stored following the structure described in this invention.
  • FIG. 59 shows how a device implementing the labeling mechanism described by this invention enables concurrent access to data related to several genomic regions when they are stored in different records of a database. This can happen either in presence of controlled access or not.
  • FIG. 60 shows how vectors of thresholds are used in encoders of classes N, M and I to generate separated subclasses of data
  • FIG. 61 provides an example of how reference transformations can change the class reads belong to when all or a subset of mismatches are removed (i.e. the read belonging to class M before transformation is assigned to class P after the transformation of the reference has been applied).
  • FIG. 62 shows how reference transformations can be applied to remove mismatches (MMs) from reads. In some cases reference transformations may generate new mismatches or change the type of mismatches found when referring to the reference before the transformation has been applied.
  • FIG. 63 shows how the same reference transformation A0 can be used for all classes of data, or different transformations AN, AM, AI can be used for each class N, M, I.
  • SUMMARY
  • The features of the claims below solve the problem of existing prior art solutions by providing
  • a method for selective access of regions of genomic data by employing labels, said labels comprising: an identifier of a reference genomic sequence (521), an identifier of said genomic regions (522), and an identifier of the data class (523) of said genomic data
  • In another aspect of the method said genomic data are sequences of genomic reads.
  • In another aspect of the method data classes can be of the following type or a subset of them:
      • “Class P” comprising genomic reads which do not present any mismatch with respect to a reference sequence
      • “Class N” comprising genomic reads including only mismatches in positions where the sequencing machine was not able to call any “base” and the number of said mismatches does not exceed a given threshold
      • “Class M” comprising genomic reads in which mismatches are constituted by positions where the sequencing machine was not able to call any base, named “n type” mismatches, and/or it called a different base than the reference sequence, named “s type” mismatches, and said numbers of mismatches do not exceed given thresholds for the number of mismatches of “n type”, of “s type” and a threshold obtained from a given function (f(n,s))
      • “Class I” comprising genomic reads which can possibly have the same type of mismatches of “Class M”, and in addition at least one mismatch of type: “insertion” (“i type”), “deletion” (“d type”), soft clips (“c type”), and wherein the numbers of mismatches for each type do not exceed the corresponding given thresholds and a threshold provided by a given function (w(n,s,i,d,c))
      • “Class U” comprising all reads that do not find any classification in the classes P, N, M, I
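  • The class definitions above can be sketched as a small classifier. The threshold values, the `mapped` flag and the choice of plain sums for the functions f(n,s) and w(n,s,i,d,c) are assumptions made for illustration; the text only states that such thresholds and functions are given.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mismatches:
    n: int = 0  # "n type": positions where no base was called
    s: int = 0  # "s type": substitutions with respect to the reference
    i: int = 0  # "i type": insertions
    d: int = 0  # "d type": deletions
    c: int = 0  # "c type": soft clips


@dataclass(frozen=True)
class Thresholds:  # hypothetical values
    max_n: int = 3
    max_s: int = 5
    max_m: int = 6       # bound on f(n, s)
    max_i: int = 2
    max_d: int = 2
    max_c: int = 10
    max_total: int = 15  # bound on w(n, s, i, d, c)


def f(n, s):
    return n + s  # assumed form of the "given function" f


def w(n, s, i, d, c):
    return n + s + i + d + c  # assumed form of the "given function" w


def classify_read(mapped: bool, mm: Mismatches, t: Thresholds = Thresholds()) -> str:
    """Return one of "P", "N", "M", "I", "U" for a single read."""
    if not mapped:
        return "U"
    if mm.i + mm.d + mm.c == 0:
        if mm.n == 0 and mm.s == 0:
            return "P"
        if mm.s == 0 and mm.n <= t.max_n:
            return "N"
        if mm.n <= t.max_n and mm.s <= t.max_s and f(mm.n, mm.s) <= t.max_m:
            return "M"
        return "U"
    within = (mm.n <= t.max_n and mm.s <= t.max_s and mm.i <= t.max_i
              and mm.d <= t.max_d and mm.c <= t.max_c)
    if within and w(mm.n, mm.s, mm.i, mm.d, mm.c) <= t.max_total:
        return "I"
    return "U"


print(classify_read(True, Mismatches(n=1, s=2)))
```

  • Reads that exceed any threshold fall through to class U, matching the catch-all role that class has in the list above.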
  • In another aspect of the method said genomic data are paired sequences of genomic reads.
  • In another aspect of the method said data class of paired reads can be of the following types or a subset of them:
      • “Class P” comprising genomic read pairs which do not present any mismatch with respect to a reference sequence
      • “Class N” comprising genomic read pairs including only mismatches in positions where the sequencing machine was not able to call any “base” and said numbers of mismatches for each read do not exceed a given threshold
      • “Class M” comprising genomic read pairs in which mismatches are constituted by positions where the sequencing machine was not able to call any base, named “n type” mismatches, and/or it called a different base than the reference sequence, named “s type” mismatches, and said numbers of mismatches for each read do not exceed given thresholds for the number of mismatches of “n type”, of “s type” and a threshold obtained from a given function (f(n,s))
      • “Class I” comprising read pairs which can possibly have the same types of mismatches as “Class M” pairs, and in addition at least one mismatch of type: “insertion” (“i type”), “deletion” (“d type”), soft clips (“c type”), and wherein the number of mismatches for each type does not exceed the corresponding given threshold and a threshold provided by a given function (w(n,s,i,d,c))
      • “Class HM” comprising read pairs for which only one read mate does not satisfy the matching rules for being classified in any of the classes P, N, M, I
      • “Class U” comprising all read pairs for which both reads do not satisfy the matching rules for being classified in the classes P, N, M, I
  • In another aspect of the method said identifier of said genomic regions is comprised in a master index table.
  • In another aspect of the method said genomic data and said labels are entropy coded.
  • In another aspect of the method said master index table (4812) is comprised in a genomic dataset header (4813).
  • In another aspect of the method said regions of genomic data are dispersed among separate Access Units (524, 486).
  • In another aspect of the method the location of said regions of genomic data, in a file, is indicated in a local index table (525).
  • In another aspect of the method said labels are user specified.
  • In another aspect of the method said regions are protected and/or encrypted in a separate manner, without encrypting the whole genomic file.
  • In another aspect of the method said labels are stored in a genomic label list (GLL)
  • In another aspect the method further comprises encoding genomic data with selective access to regions of genomic data as previously defined.
  • In another aspect of the method said genomic label list is periodically retransmitted or updated in order to enable multiple synchronization points
  • In another aspect the method further comprises decoding a stream or a file of genomic data with selective access to regions of genomic data as previously defined.
  • The present invention further provides an apparatus for encoding genomic data as previously defined.
  • The present invention further provides an apparatus for decoding genomic data as previously defined.
  • The present invention further provides a storage means for storing genomic data encoded as previously defined.
  • The present invention further provides a computer-readable medium comprising instructions that when executed cause at least one processor to perform the encoding method previously defined.
  • The present invention further provides a computer-readable medium comprising instructions that when executed cause at least one processor to perform the decoding method previously defined.
  • DETAILED DESCRIPTION
  • The present invention describes a labelling mechanism providing selective access and selective access control to genomic regions or sub-regions or aggregations of regions or sub-regions of compressed genomic data stored in a file format and/or the relevant access units to be used to store, transport, access and process genomic or proteomic information in the form of sequences of symbols representing molecules.
  • These molecules include, for example, nucleotides, amino acids and proteins. One of the most important pieces of information represented as sequence of symbols are the data generated by high-throughput genome sequencing devices.
  • The genome of any living organism is usually represented as a string of symbols expressing the chain of nucleic acids (bases) characterizing that organism. Current state of the art genome sequencing technology is able to produce only a fragmented representation of the genome in the form of several (up to billions) strings of nucleic acids associated to metadata (identifiers, level of accuracy etc.). Such strings are usually called “sequence reads” or “reads”.
  • The typical steps of the genomic information life cycle comprise Sequence reads extraction, Mapping and Alignment, Variant detection, Variant annotation and Functional and Structural Analysis (see FIG. 1).
  • Sequence reads extraction is the process, performed by either a human operator or a machine, of representing fragments of genetic information in the form of sequences of symbols representing the molecules composing a biological sample. In the case of nucleic acids such molecules are called “nucleotides”. The sequences of symbols produced by the extraction are commonly referred to as “reads”. This information is usually encoded in prior art as FASTA files including a textual header and a sequence of symbols representing the sequenced molecules.
  • When the biological sample is sequenced to extract DNA of a living organism the alphabet is composed of the symbols (A,C,G,T,N).
  • When the biological sample is sequenced to extract RNA of a living organism the alphabet is composed of the symbols (A,C,G,U,N).
  • In case the IUPAC extended set of symbols, so called “ambiguity codes”, is also generated by the sequencing machine, the alphabet used for the symbols composing the reads is (A, C, G, T, U, W, S, M, K, R, Y, B, D, H, V, N or −).
  • When the IUPAC ambiguity codes are not used, a sequence of quality scores can be associated with each sequence read. In such case prior art solutions encode the resulting information as a FASTQ file. Sequencing devices can introduce errors in the sequence reads such as:
    • 1. identification of a wrong symbol (i.e. representing a different nucleic acid) to represent the nucleic acid actually present in the sequenced sample; this is usually called “substitution error” (mismatch);
    • 2. insertion in one sequence read of additional symbols that do not refer to any actually present nucleic acid; this is usually called “insertion error”;
    • 3. deletion from one sequence read of symbols that represent nucleic acids that are actually present in the sequenced sample; this is usually called “deletion error”;
    • 4. recombination of one or more fragments into a single fragment which does not reflect the reality of the originating sequence.
  • The term “coverage” is used in literature to quantify the extent to which a reference genome or part thereof can be covered by the available sequence reads. Coverage is said to be:
      • partial (less than 1×) when some parts of the reference genome are not mapped by any available sequence read
      • single (1×) when all nucleotides of the reference genome are mapped by one and only one symbol present in the sequence reads
      • multiple (2×, 3×, N×) when each nucleotide of the reference genome is mapped multiple times.
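  • As an illustration only, the three coverage cases above can be computed from a list of mapped reads. The sketch below is not part of the disclosed format; the function names and the (start, length) representation of a mapped read are assumptions made for the example, and the "mixed" label covers the case the text leaves unnamed (every base covered, but not uniformly).

```python
from typing import List, Tuple


def per_base_depth(ref_len: int, reads: List[Tuple[int, int]]) -> List[int]:
    """Count how many reads cover each reference position.

    `reads` holds hypothetical (start, length) pairs with 0-based starts.
    """
    depth = [0] * ref_len
    for start, length in reads:
        # Clamp each read to the reference boundaries before counting.
        for pos in range(max(start, 0), min(start + length, ref_len)):
            depth[pos] += 1
    return depth


def coverage_kind(depth: List[int]) -> str:
    """Name the coverage case for a per-base depth vector."""
    if not depth or min(depth) == 0:
        return "partial"   # some position is not mapped by any read
    if max(depth) == 1:
        return "single"    # every position is mapped exactly once
    if min(depth) >= 2:
        return "multiple"  # every position is mapped at least twice
    return "mixed"         # assumption: case not named in the text
```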
  • Sequence alignment refers to the process of arranging sequence reads by finding regions of similarity that may be a consequence of functional, structural, or evolutionary relationships among the sequences. When the alignment is performed with reference to a pre-existing nucleotide sequence referred to as “reference genome”, the process is called “mapping”. Sequence alignment can also be performed without a pre-existing sequence (i.e. reference genome); in such cases the process is known in prior art as “de novo” alignment. Prior art solutions store this information in SAM, BAM or CRAM files. The concept of aligning sequences to reconstruct a partial or complete genome is depicted in FIG. 3.
  • Variant detection (a.k.a. variant calling) is the process of translating the aligned output of genome sequencing machines (sequence reads generated by NGS devices and aligned) into a summary of the unique characteristics of the organism being sequenced that cannot be found in other pre-existing sequences or can be found in only a few pre-existing sequences. These characteristics are called “variants” because they are expressed as differences between the genome of the organism under study and a reference genome. Prior art solutions store this information in a specific file format called VCF.
  • Variant annotation is the process of assigning functional information to the genomic variants identified by the process of variant calling. This implies the classification of variants according to their relationship to coding sequences in the genome and according to their impact on the coding sequence and the gene product. This is in prior art usually stored in a MAF file.
  • The process of analysing DNA strands (variants, copy number variation (CNV), methylation, etc.) to define their relationship with the function and structure of genes (and proteins) is called functional or structural analysis. Several different solutions exist in the prior art for the storage of this data.
  • Genomic File Format
  • The invention disclosed in this document consists in the definition of a selective and controlled data access applied to a compressed data structure for representing, processing, manipulating and transmitting genome sequencing data that differs from prior art solutions in at least the following aspects:
      • It does not rely on any prior art representation formats of genomic information (e.g. FASTQ, SAM).
      • It supports efficient handling and selective random access to data produced by multiple sequencing runs structured as multiple genomic datasets. Partitioning data from different sequencing runs into the same data structure enables analysts to simultaneously perform queries on them with great advantage for population genetics studies.
      • It implements a new original classification of the genomic data and metadata according to their specific characteristics. Sequence reads are mapped to a reference sequence and grouped in distinct classes according to the results of the alignment process. This results in data classes with lower information entropy that can be more efficiently encoded by applying different specific compression algorithms such as Huffman coding, arithmetic coding (CABAC, CAVLC), Asymmetric Numeral Systems, Lempel-Ziv and its derivatives.
      • It implements a new method to associate data classes or subsets of data classes to specific genomic regions, or sub-regions or aggregations of regions or sub-regions, by means of user defined Labels that enable the selective access and protection of said compressed data classes corresponding to specific genomic regions or sub-regions or aggregations of regions or sub-regions.
      • It defines syntax elements and the related encoding/decoding process conveying the sequence reads and the alignment information into a representation which is more efficient to be processed for downstream analysis applications.
  • Classifying the reads according to the result of mapping and coding them using descriptors to be stored in layers (position layer, mate distance layer, mismatch type layer, etc.) presents the following advantages:
      • A reduction of the information entropy when the different syntax elements are modelled by a specific source model which yields higher compression performance.
      • A more efficient access to data that are already organized in groups/layers that have a specific meaning for the downstream analysis stages and that can be accessed separately and independently directly in the compressed domain.
      • The presence of a modular data structure that can be updated incrementally by accessing only the required information without the need of decoding (i.e. decompressing) the whole data content.
      • The genomic information produced by sequencing machines is intrinsically highly redundant due to the nature of the information itself and to the need of mitigating the errors intrinsic in the sequencing process. This implies that the relevant genetic information which needs to be identified and analyzed (the variations with respect to a reference) is only a small fraction of the produced data. Prior art genomic data representation formats are not conceived to “isolate” the meaningful information at a given analysis stage from the rest of the information so as to make it promptly available and understandable by the analysis applications.
      • The solution brought by the disclosed invention is to represent genomic data in such a way that any relevant portion of data is readily available to the analysis applications without the need of accessing and decompressing the entirety of data and the redundancy of the data is efficiently reduced by efficient compression to minimize the required storage space and transmission bandwidth.
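  • The entropy argument made above can be checked numerically with a small sketch. It is an illustration under stated assumptions, not the coding scheme of the invention: the layer names and the toy symbol streams in the usage note are hypothetical, and the measure used is the empirical zero-order Shannon entropy. Coding each layer with its own model costs, on average, no more bits per symbol than coding the mixed stream with a single model.

```python
import math
from collections import Counter
from typing import Dict, Sequence


def entropy_bits(symbols: Sequence[str]) -> float:
    """Empirical zero-order Shannon entropy, in bits per symbol."""
    total = len(symbols)
    counts = Counter(symbols)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def layered_entropy_bits(layers: Dict[str, Sequence[str]]) -> float:
    """Average bits per symbol when each layer gets its own source model.

    This is the length-weighted mean of the per-layer entropies.
    """
    total = sum(len(v) for v in layers.values())
    return sum(
        len(v) / total * entropy_bits(v) for v in layers.values() if v
    )
```

  • For example, with a hypothetical position layer `list("0001000100")` and a hypothetical mismatch type layer `list("ACGTACGT")`, `layered_entropy_bits` is never larger than `entropy_bits` of the two streams concatenated.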
  • The key elements of the invention are:
    • 1. The specification of a file format that “contains” structured and user-defined selectively accessible data elements called Access Units (AU) in compressed form. Such approach can be seen as the opposite of prior art approaches, SAM and BAM for instance, in which data are structured in non-compressed form and then the entire file is compressed. A first clear advantage of the approach is to be able to efficiently and naturally provide various forms of user-defined structured selective access to the data elements in the compressed domain which is impossible or extremely awkward in prior art approaches.
    • 2. The structuring of the genomic information into specific “layers” of homogeneous data and metadata presents the considerable advantage of enabling the definition of different models of the information sources characterized by low entropy. Such models not only can differ from layer to layer, but can also differ inside each layer when the compressed data within layers are partitioned into Data Blocks included into Access Units. This structuring enables the use of the most appropriate compression for each class of data or metadata and portion of them with significant gains in coding efficiency versus prior art approaches.
    • 3. The information is structured in Access Units (AU) so that any relevant subset of data used by genomic analysis applications is efficiently and selectively accessible by means of appropriate interfaces. These features enable faster access to data and yield a more efficient processing.
    • 4. The definition of a Master Index Table and Local Index Tables enabling selective access to the information carried by the layers of encoded (i.e. compressed) data without the need to decode the entire volume of compressed data.
    • 5. The possibility of accessing only the AUs that correspond to specific user defined genomic regions or sub-regions or aggregations of regions or sub-regions and data classes of interest by parsing a “Label List” present in the file header.
    • 6. The possibility of providing different types of access control to different AUs and portions of data contained into the AU according to the user defined “Labels” identifying associated genomic regions.
    • 7. The possibility of performing realignment of already aligned and compressed genomic data sets when they need to be re-aligned versus newly published reference genomes by performing an efficient transcoding of selected data portions in the compressed domain. The frequent release of new reference genomes currently requires resource-consuming and time-consuming transcoding processes to re-align already compressed and stored genomic data with respect to the newly published references, because the whole data volume needs to be processed.
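  • As an example and not as a limitation, the label-driven access of points 4 to 6 above can be pictured as a lookup from a user defined label to the Access Units that carry the labelled regions and data classes. The sketch below is only a model of that idea: `LabelEntry`, `access_units_for_label` and the dictionary standing in for the index tables are hypothetical names, and the actual syntax of the Label List and of the index tables is not reproduced here.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class LabelEntry:
    """One labelled item: reference sequence, genomic region, data class."""
    ref_id: int      # identifier of the reference genomic sequence
    region_id: int   # identifier of the genomic region
    class_id: str    # identifier of the data class, e.g. "P", "N", "M", "I"


# Hypothetical in-memory stand-ins for the Label List and the index tables.
LabelList = Dict[str, List[LabelEntry]]
AuIndex = Dict[LabelEntry, List[int]]


def access_units_for_label(
    label: str, label_list: LabelList, au_index: AuIndex
) -> List[int]:
    """Return the sorted ids of the Access Units tagged by `label`.

    Only these units would need to be fetched and decoded; the rest of the
    compressed data is left untouched.
    """
    au_ids = set()
    for entry in label_list.get(label, []):
        au_ids.update(au_index.get(entry, []))
    return sorted(au_ids)
```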
  • The method described in this document aims at exploiting the available a-priori knowledge on genomic data to define an alphabet of syntax elements with reduced entropy. In genomics the available knowledge is represented by an existing genomic sequence usually, but not necessarily, of the same species as the one to be processed. As an example, human genomes of different individuals differ by only a fraction of 1%. However, such a small amount of data contains enough information to enable early diagnosis, personalized medicine, customized drug synthesis, etc. This invention aims at defining a genomic information representation format where the relevant information is efficiently accessible, access can be selectively controlled and data protected, the information is efficiently transportable and all such processing is performed handling compressed data structures.
  • The technical features used in the present invention are:
    • 1. Partitioning genomic information generated by different sequencing runs into different genomic datasets in order to enable efficient data retrieval and processing when querying one or more of the available datasets.
    • 2. Partition of the genome sequence data and metadata in “classes” sharing common features;
    • 3. Definition of the structure of the genomic information carried by each of the data classes into which the genomic data is partitioned, as a set of “layers” of descriptors in order to reduce the information entropy as much as possible;
    • 4. Definition of a Master Index Table and Local Index Tables to enable selective access to the data classes and associated information by accessing only the desired layers of coded information (i.e. compressed) without the need to decode the entire coded genomic information;
    • 5. Usage of different source models and entropy coders to code the syntax elements belonging to different layers of the data classes defined as specified in point 2;
    • 6. Definition of specific mechanisms establishing a correspondence among dependent layers to enable selective access to the data without the need to decode all the layers if not necessary or desired;
    • 7. Definition of a mechanism for labelling compressed data corresponding to specific genomic regions or sub-regions or aggregations of regions or sub-regions and corresponding data “classes” or subsets of data classes by “Labels” enabling efficient selective access;
    • 8. Definition of mechanisms for the selective protection of specific genomic regions or sub-regions or aggregations of regions or sub-regions and corresponding data “classes” or subsets of data classes and any combination thereof.
    • 9. Coding of the datasets or data “classes” with respect to one or more pre-existing or constructed reference sequences that can be further transformed to reduce the entropy of the sequence data representation.
  • In order to solve all the mentioned problems of the prior art in terms of efficient selective access and selective access control to specific data “classes”, specific genomic regions or sub-regions or aggregations of regions or sub-regions, while preserving efficient transmission and storing by means of an efficient compressed representation, the present invention application provides a specific data structure specification that implements appropriate data reordering into accessible units of homogeneous and/or semantically significant data enabling seamless access and processing required by state of the art genome data analysis applications.
  • In particular the present invention adopts a data structure based on the concept of Access Unit, “Labels” and the multiplexing of the relevant data, concepts which are absent from all state of the art genomic data formats.
  • Genomic data are structured and encoded into different Access Units. Hereafter follows a description of the genomic data that are contained into different Access Units and can be identified by “Labels” associating genomic data to specific genomic regions or sub-regions or aggregations of regions or sub-regions versus reference genomes.
  • Genomic Data Classification According to Matching Rules
  • The sequence reads generated by sequencing machines are classified by the disclosed invention into five different “classes” according to the matching results of the alignment with respect to one or more pre-existing reference sequences.
  • When aligning a DNA sequence of nucleotides with respect to a reference sequence the following cases can be identified:
    • 1. A region in the reference sequence is found to match the sequence read without any error (i.e. perfect mapping). Such sequence of nucleotides is referred to as “perfectly matching read” or denoted as “Class P”.
    • 2. A region in the reference sequence is found to match the sequence read with a type and a number of mismatches determined only by the number of positions in which the sequencing machine generating the read was not able to call any base (or nucleotide). Such mismatches are denoted by an “N”, the letter used to indicate an undefined nucleotide base. In this document this type of mismatch is referred to as “n type” mismatch. Such sequences are referred to as “N mismatching reads” or “Class N”. Once the read is classified as belonging to “Class N” it is useful to limit the degree of matching inaccuracy to a given upper bound and set a boundary between what is considered a valid matching and what is not. Therefore, the reads assigned to Class N are also constrained by setting a threshold (MAXN) that defines the maximum number of undefined bases (i.e. bases called as “N”) that a read can contain. Such classification implicitly defines the required minimum matching accuracy (or maximum degree of mismatch) that all reads belonging to Class N share when referred to the corresponding reference sequence, which constitutes a useful criterion for applying selective data searches to the compressed data.
    • 3. A region in the reference sequence is found to match the sequence read with types and number of mismatches determined by the number of positions in which the sequencing machine generating the read was not able to call any nucleotide base, if present (i.e. “n type” mismatches), plus the number of mismatches in which a base different from the one present in the reference has been called. Such type of mismatch, denoted as “substitution”, is also called Single Nucleotide Variation (SNV) or Single Nucleotide Polymorphism (SNP). In this document this type of mismatch is also referred to as “s type” mismatch. The sequence read is then referred to as “M mismatching read” and assigned to “Class M”. As in the case of “Class N”, for all reads belonging to “Class M” it is useful to limit the degree of matching inaccuracy to a given upper bound, and set a boundary between what is considered a valid matching and what is not. Therefore, the reads assigned to Class M are also constrained by defining a set of thresholds, one for the number “n” of mismatches of “n type” (MAXN) if present, and another for the number of substitutions “s” (MAXS). A third constraint is a threshold defined by any function of both numbers “n” and “s”, f(n,s). Such third constraint makes it possible to generate classes with an upper bound of matching inaccuracy according to any meaningful selective access criterion. For instance, and not as a limitation, f(n,s) can be (n+s)^{1/2} or (n+s) or any linear or non-linear expression that sets a boundary to the maximum matching inaccuracy level that is admitted for a read belonging to “Class M”. Such boundary constitutes a very useful criterion for applying the desired selective data searches to the compressed data when analyzing sequence reads for various purposes, because it makes it possible to set a further boundary to any possible combination of the numbers “n” of “n type” mismatches and “s” of “s type” mismatches (substitutions) beyond the simple threshold applied to the one type or to the other.
    • 4. A fourth class is constituted by sequencing reads presenting at least one mismatch of any type among “insertion”, “deletion” (a.k.a. indels) and “clipped”, plus, if present, any mismatch type belonging to class N or M. Such sequences are referred to as “I mismatching reads” and assigned to “Class I”. Insertions are constituted by an additional sequence of one or more nucleotides not present in the reference, but present in the read sequence. In this document this type of mismatch is referred to as “i type” mismatch. In literature, when the inserted sequence is at the edges of the sequence it is also referred to as “soft clipped” (i.e. the nucleotides do not match the reference but are kept in the aligned reads, contrary to “hard clipped” nucleotides, which are discarded). In this document this type of mismatch is referred to as “c type” mismatch. Keeping or discarding nucleotides is a decision taken by the aligner stage and not by the classifier of reads disclosed in this invention, which receives and processes the reads as they are determined by the sequencing machine or by the following alignment stage. Deletions are “holes” (missing nucleotides) in the read with respect to the reference. In this document this type of mismatch is referred to as “d type” mismatch. As in the case of classes “N” and “M”, it is possible and appropriate to define a limit to the matching inaccuracy. The definition of the set of constraints for “Class I” is based on the same principles used for “Class M” and is reported in the last lines of Table 1. Besides a threshold for each type of mismatch admissible for class I data, a further constraint is defined by a threshold determined by any function of the numbers of mismatches “n”, “s”, “d”, “i” and “c”, w(n,s,d,i,c). Such additional constraint makes it possible to generate classes with an upper bound of matching inaccuracy according to any meaningful user defined selective access criterion. For instance, and not as a limitation, w(n,s,d,i,c) can be (n+s+d+i+c)^{1/5} or (n+s+d+i+c) or any linear or non-linear expression that sets a boundary to the maximum matching inaccuracy level that is admitted for a read belonging to “Class I”. Such boundary constitutes a very useful criterion for applying the desired selective data searches to the compressed data when analyzing sequence reads for various purposes, because it makes it possible to set a further boundary to any possible combination of the number of mismatches admissible in “Class I” reads beyond the simple threshold applied to each type of admissible mismatch.
    • 5. A fifth class includes all reads that do not find any matching considered valid (i.e. not satisfying the set of matching rules defining an upper bound to the maximum matching inaccuracy as specified in Table 1) for each data class when referring to the reference sequence. Such sequences are said to be “Unmapped” when referring to the reference sequences and are classified as belonging to the “Class U”.
  • Classification of Read Pairs According to Matching Rules
  • The classification specified in the previous section concerns single sequence reads. In the case of sequencing technologies that generate reads in pairs (e.g. Illumina Inc.), in which two reads are known to be separated by an unknown sequence of variable length, it is appropriate to consider the classification of the entire pair into a single data class. A read that is coupled with another is said to be its “mate”.
  • If both paired reads belong to the same class, the entire pair is assigned to that class, whichever it is (i.e. P, N, M, I, U). If the two reads belong to different classes, but neither of them belongs to “Class U”, then the entire pair is assigned to the class with the highest priority defined according to the following expression:

  • P<N<M<I
  • in which “Class P” has the lowest priority and “Class I” has the highest priority.
  • In case only one of the reads belongs to “Class U” and its mate to any of the Classes P, N, M, I a sixth class is defined as “Class HM” which stands for “Half Mapped”.
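  • The pair assignment rules stated above (same class, priority P<N<M<I, “Class HM” and “Class U”) can be summarized by the following minimal sketch. The function and constant names are hypothetical and chosen only for the example.

```python
# Priority among mapped classes as given in the text: P < N < M < I.
MAPPED_PRIORITY = {"P": 0, "N": 1, "M": 2, "I": 3}


def classify_pair(class_a: str, class_b: str) -> str:
    """Assign a read pair to a class from the classes of its two mates."""
    for cls in (class_a, class_b):
        if cls != "U" and cls not in MAPPED_PRIORITY:
            raise ValueError(f"unknown read class: {cls!r}")
    unmapped = (class_a == "U") + (class_b == "U")
    if unmapped == 2:
        return "U"    # both mates unmapped
    if unmapped == 1:
        return "HM"   # exactly one mate unmapped: Half Mapped
    # Both mates mapped: the pair takes the class of highest priority.
    return max(class_a, class_b, key=MAPPED_PRIORITY.__getitem__)
```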
  • The definition of such specific class of reads is motivated by the fact that it is used for attempting to determine gaps or unknown regions existing in reference genomes (a.k.a. little known or unknown regions). Such regions are reconstructed by mapping pairs at the edges using the read of the pair that can be mapped on the known regions. The unmapped mate is then used to build the so called “contigs” of the unknown region as shown in FIG. 57. Therefore providing selective access to only such type of read pairs greatly reduces the associated computation burden, enabling much more efficient processing of such data, which originate from large amounts of data sets that, using state of the art solutions, would need to be entirely inspected.
  • The table below summarizes the matching rules applied to reads in order to define the class of data each read belongs to. The rules are defined in the first five columns of the table in terms of presence or absence of type of mismatches (n, s, d, i and c type mismatches). The sixth column provides rules in terms of maximum threshold for each mismatch type and any function f(n,s) and w(n,s,d,i,c) of the possible mismatch types.
  • TABLE 1
    Types of mismatches and set of constraints that each sequence read must satisfy to be classified in the data classes defined in this invention disclosure. The first five columns give the number and types of mismatches found when matching a read with a reference sequence.
    | Number of unknown bases (“N”) | Number of substitutions | Number of deletions | Number of insertions | Number of clipped bases | Set of matching accuracy constraints | Assignment class |
    | 0 | 0 | 0 | 0 | 0 | 0 | P |
    | n > 0 | 0 | 0 | 0 | 0 | n ≤ MAXN | N |
    | n > 0 | 0 | 0 | 0 | 0 | n > MAXN | U |
    | n ≥ 0 | s > 0 | 0 | 0 | 0 | n ≤ MAXN and s ≤ MAXS and f(n, s) ≤ MAXM | M |
    | n ≥ 0 | s > 0 | 0 | 0 | 0 | n > MAXN or s > MAXS or f(n, s) > MAXM | U |
    | n ≥ 0 | s ≥ 0 | d ≥ 0* | i ≥ 0* | c ≥ 0* | n ≤ MAXN and s ≤ MAXS and d ≤ MAXD and i ≤ MAXI and c ≤ MAXC and w(n, s, d, i, c) ≤ MAXTOT | I |
    | n ≥ 0 | s ≥ 0 | d ≥ 0 | i ≥ 0 | c ≥ 0 | n > MAXN or s > MAXS or d > MAXD or i > MAXI or c > MAXC or w(n, s, d, i, c) > MAXTOT | U |
    *At least one mismatch of type d, i, c must be present (i.e. d > 0 or i > 0 or c > 0)
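  • The matching rules of Table 1 can be read as a decision procedure on the mismatch counts of a mapped read. The sketch below is one possible reading, offered as an illustration: the `Thresholds` container and the function names are hypothetical, and the default choices f(n,s)=n+s and w(n,s,d,i,c)=n+s+d+i+c are just one of the options mentioned in the text. It assumes the counts n, s, d, i, c have already been produced by an aligner.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Thresholds:
    """Hypothetical container for the MAX* values of Table 1."""
    max_n: int
    max_s: int
    max_d: int
    max_i: int
    max_c: int
    max_m: float    # bound on f(n, s)
    max_tot: float  # bound on w(n, s, d, i, c)


def default_f(n: int, s: int) -> float:
    return n + s


def default_w(n: int, s: int, d: int, i: int, c: int) -> float:
    return n + s + d + i + c


def classify_read(
    n: int, s: int, d: int, i: int, c: int, t: Thresholds,
    f: Callable[[int, int], float] = default_f,
    w: Callable[[int, int, int, int, int], float] = default_w,
) -> str:
    """Return "P", "N", "M", "I" or "U" for the given mismatch counts."""
    if d > 0 or i > 0 or c > 0:
        # Candidate for Class I: every per-type bound and w must hold.
        ok = (n <= t.max_n and s <= t.max_s and d <= t.max_d
              and i <= t.max_i and c <= t.max_c
              and w(n, s, d, i, c) <= t.max_tot)
        return "I" if ok else "U"
    if s > 0:
        # Candidate for Class M: bounds on n, s and f(n, s).
        ok = n <= t.max_n and s <= t.max_s and f(n, s) <= t.max_m
        return "M" if ok else "U"
    if n > 0:
        # Candidate for Class N: only undefined bases are present.
        return "N" if n <= t.max_n else "U"
    return "P"  # no mismatch of any type
```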
  • Matching Rules Partition of Sequence Read Data Classes N, M and I into Subclasses with Different Degrees of Matching Accuracy
  • The data classes of type N, M and I as defined in the previous sections can be further decomposed into an arbitrary number of distinct sub-classes with different degrees of matching accuracy. Such option is an important technical advantage in providing a finer granularity and as a consequence a much more efficient selective access to each data class. As an example and not as a limitation, to partition the Class N into a number k of subclasses (Sub-Class N1, . . . , Sub-Class Nk) it is necessary to define a vector with the corresponding components MAXN1, MAXN2, . . . , MAXN(k-1), MAXN(k), with the condition that MAXN1<MAXN2< . . . <MAXN(k-1)<MAXN(k), and assign each read to the lowest ranked sub-class that satisfies the constraints specified in Table 1 when evaluated for each element of the vector. This is shown in FIG. 60 where a data classification unit 601 contains Class P, N, M, I, U, HM encoders and encoders for annotations and metadata. The Class N encoder is configured with a vector of thresholds, MAXN1 to MAXNk (602), which generates k subclasses of N data (606).
  • In the case of the classes of type M and I the same principle is applied by defining a vector with the same properties for MAXM and MAXTOT respectively and using each vector component as a threshold for checking whether the functions f(n,s) and w(n,s,d,i,c) satisfy the constraint. As in the case of sub-classes of type N, the assignment is given to the lowest sub-class for which the constraint is satisfied. The number of sub-classes for each class type is independent and any combination of subdivisions is admissible. This is shown in FIG. 60 where a Class M encoder and a Class I encoder are configured respectively with a vector of thresholds MAXM1 to MAXMj (603) and MAXTOT1 to MAXTOTh (604). The two encoders generate respectively j subclasses of M data (607) and h subclasses of I data (608). When two reads in a pair are classified in the same sub-class, then the pair belongs to the same sub-class.
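  • Assigning a read to the lowest ranked sub-class whose threshold is satisfied amounts to a search in a strictly increasing vector. A minimal sketch follows, under the assumption that the value being tested is n for Class N, f(n,s) for Class M or w(n,s,d,i,c) for Class I; the function name is hypothetical. The standard library `bisect_left` returns the position of the first threshold that is greater than or equal to the value.

```python
from bisect import bisect_left
from typing import Optional, Sequence


def subclass_index(value: float, thresholds: Sequence[float]) -> Optional[int]:
    """Return the 1-based index of the lowest sub-class accepting `value`.

    `thresholds` is the vector MAX1 < MAX2 < ... < MAXk. The result is
    None when `value` exceeds every threshold, in which case the read
    does not fit the class at all.
    """
    if any(a >= b for a, b in zip(thresholds, thresholds[1:])):
        raise ValueError("thresholds must be strictly increasing")
    k = bisect_left(thresholds, value)  # first position with threshold >= value
    return k + 1 if k < len(thresholds) else None
```

  • With the hypothetical vector (1, 3, 5), a read with n=2 falls in Sub-Class N2 and a read with n=6 is rejected.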
  • When two reads in a pair are classified into sub-classes of different classes, then the pair belongs to the sub-class of the class of higher priority according to the following expression:

  • N<M<I
  • where N has the lowest priority and I has the highest priority.
  • When two reads belong to different sub-classes of one of classes N or M or I, then the pair belongs to the sub-class with the highest priority according to the following expressions:

  • N1<N2< . . . <Nk

  • M1<M2< . . . <Mj

  • I1<I2< . . . <Ih
  • where the highest index has the highest priority.
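  • As an example and not as a limitation, the sub-class assignment and the pair priority rules above can be sketched as follows. The function names are hypothetical, and the constraint of Table 1 is assumed here to be "score less than or equal to threshold", where the score stands for the number of mismatches or the value of f(n,s) or w(n,s,d,i,c).

```python
from bisect import bisect_left

# Class priority N < M < I, as stated in the text.
CLASS_PRIORITY = {"N": 0, "M": 1, "I": 2}


def assign_subclass(score, thresholds):
    """Return the 1-based index of the lowest sub-class whose threshold admits score.

    thresholds must be strictly increasing (MAX_1 < MAX_2 < ... < MAX_k).
    The test "score <= threshold" is an assumption standing in for Table 1.
    Returns None when the score exceeds the last threshold.
    """
    if any(a >= b for a, b in zip(thresholds, thresholds[1:])):
        raise ValueError("thresholds must be strictly increasing")
    idx = bisect_left(thresholds, score)  # first threshold >= score
    return idx + 1 if idx < len(thresholds) else None


def pair_subclass(first, second):
    """Select the sub-class of a pair from its two reads, each given as (class, index).

    The class of higher priority wins; within one class the higher index wins.
    """
    return max(first, second, key=lambda sc: (CLASS_PRIORITY[sc[0]], sc[1]))
```

  • For instance, with thresholds (1, 3, 5) a read with two mismatches falls in sub-class 2, and a pair whose reads are classified as N3 and M1 is assigned to M1.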
  • Transformations of the “External” Reference Sequences
  • The mismatches found for the reads classified in the classes N, M and I can be used to create “transformed references” to be used to compress the read representation more efficiently. Reads classified as belonging to the Classes N, M or I (with respect to the pre-existing (i.e. “external”) reference sequence denoted as RS0) can be coded with respect to the “transformed” reference sequence RS1 according to the occurrence of the actual mismatches with the transformed reference. For example, if readM(i,n), the i-th read of Class M with respect to the reference sequence RSn, contains mismatches with respect to RSn, then after the “transformation” readM(i,n)=readP(i,n+1) can be obtained with A(RSn)=RSn+1, where A is the transformation from reference sequence RSn to reference sequence RSn+1.
  • FIG. 61 shows an example of how reads containing mismatches (belonging to Class M) with respect to reference sequence 1 (RS1) can be transformed into perfectly matching reads with respect to the reference sequence 2 (RS2) obtained from RS1 by modifying the bases corresponding to the mismatch positions. They remain classified in their original class and they are coded together with the other reads in the same data class access unit, but the coding is done using only the descriptors and descriptor values needed for a Class P read. This transformation can be denoted as:

  • RS2=A(RS1)
  • When the representation of the transformation A which generates RS2 when applied to RS1, plus the representation of the reads versus RS2, corresponds to a lower entropy than the representation of the reads of class M versus RS1, it is advantageous to transmit the representation of the transformation A and the corresponding representation of the reads versus RS2, because a higher compression of the data representation is achieved.
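  • As an example and not as a limitation, a transformation A limited to base substitutions can be sketched as follows. The helper names and the toy sequences are hypothetical; positions are assumed to be 0-based.

```python
def apply_transformation(reference, substitutions):
    """Return the transformed reference RS2 = A(RS1).

    substitutions maps a 0-based reference position to the base to write there.
    """
    bases = list(reference)
    for position, base in substitutions.items():
        bases[position] = base
    return "".join(bases)


def mismatches(reference, read, start):
    """List (offset_in_read, read_base) where the read differs from the reference."""
    window = reference[start:start + len(read)]
    return [(i, b) for i, (r, b) in enumerate(zip(window, read)) if r != b]


rs1 = "ACGTACGTAC"
read = "GTTCGT"  # mapped at position 2, one substitution versus RS1
rs2 = apply_transformation(rs1, {4: "T"})

before = mismatches(rs1, read, 2)  # one mismatch to code versus RS1
after = mismatches(rs2, read, 2)   # no mismatch versus RS2
```

  • In this toy case the read needs one position and one type descriptor versus RS1 and none versus RS2, at the cost of transmitting one substitution for A.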
  • The coding of the transformation A for transmission in the compressed bitstream requires the definition of two additional syntax elements as defined in the table below.
  • Syntax element | Semantic | Comments
    rftp | Reference transformation position | Position of the difference between the reference and the contig used for prediction.
    rftt | Reference transformation type | Type of the difference between the reference and the contig used for prediction. Same syntax as described for the snpt descriptor defined below.
  • FIG. 62 shows an example on how a reference transformation is applied to reduce the number of mismatches to be coded on the mapped reads.
  • It has to be observed that, in some cases the transformation applied to the reference:
      • May introduce mismatches in the representations of the reads that were not present when referring to the reference before applying the transformation.
      • May modify the types of mismatches (a read may contain A instead of G while all other reads contain C instead of G), but mismatches remain in the same position.
      • Different data classes and subsets of data of each data class may refer to the same transformed reference sequence or to reference sequences obtained by applying different transformations to the same pre-existing reference sequence.
  • FIG. 61 shows an example on how reads can change the type of coding from a data class to another by means of the appropriate set of descriptors (e.g. using the descriptors of a Class P to code a read from Class M) after a reference transformation is applied and the read is represented using the transformed reference. This occurs for example when the transformation changes all bases corresponding to the mismatches of a read in the bases actually present in the read, thus virtually transforming a read belonging to Class M (when referring to the original non transformed reference sequence) into a virtual read of Class P (when referring to the transformed reference). The definition of the set of descriptors used for each class of data is provided in the following sections.
  • FIG. 63 shows how the different classes of data can use the same “transformed” reference R1=A0(R0) (630) to re-encode the reads, or how different transformations AN (631), AM (632), AI (633) can be separately applied to each class of data.
  • Genomic Data Headers for Global Parameters
  • The data structure of said genomic data requires the storage of global parameters and metadata to be used by the decoding engine. These data are organized in the following structures: For file based storage:
      • Datasets Multiplex Header
      • Dataset Header
      • Descriptors Layer Header
      • Block Header
  • The hierarchical relationship among these headers is shown in FIG. 58.
  • For transport in a streaming scenario:
      • Datasets Mapping Tables List
      • Datasets Mapping Table
      • Transport Block Header
      • Packet Header
  • A dataset is defined as the ensemble of coding elements needed to reconstruct the genomic information related to a single genomic sequencing run and all the following analysis. If the same genomic sample is sequenced twice in two distinct runs, the obtained data will be encoded in two distinct datasets.
  • Datasets Multiplex Header
  • This is the data structure prepended to one or more datasets aggregated in a “Multiplex”.
  • Syntax | Description
    Datasets_Multiplex_Header {
     Multiplex_id | Label to identify this Datasets Multiplex from any other Datasets Multiplex.
     Version_number | Version number of the Datasets Multiplex. The version number shall be incremented by a unit whenever the definition of the Datasets Multiplex changes.
     List_number | Number of the current datasets list.
     gd_number | Number of datasets composing the Datasets Multiplex.
     for (i=0; i<gd_number; i++) {
      genomic_dataset_ID | Field identifying the dataset. This field shall not take any single value more than once within one version of the Dataset List.
     }
     Metadata | Data structure carrying metadata to be used for application-specific processing such as data analysis and content protection.
    }
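  • As an example and not as a limitation, the fields of the Datasets Multiplex Header can be held in memory as sketched below. The Python field types are assumptions, since the syntax above does not specify them, and the class name is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetsMultiplexHeader:
    """In-memory view of the header fields; types are illustrative assumptions."""

    multiplex_id: str
    version_number: int
    list_number: int
    genomic_dataset_ids: list = field(default_factory=list)
    metadata: bytes = b""

    def __post_init__(self):
        # A genomic_dataset_ID shall not appear more than once in one list version.
        if len(set(self.genomic_dataset_ids)) != len(self.genomic_dataset_ids):
            raise ValueError("genomic_dataset_ID values must be unique")

    @property
    def gd_number(self):
        """Number of datasets composing the multiplex."""
        return len(self.genomic_dataset_ids)
```

  • Deriving gd_number from the list of identifiers, rather than storing it separately, keeps the count and the list consistent by construction.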
  • This is the data structure prepended to an encoded dataset.
  • TABLE 2
    Genomic Dataset Header structure.
    Element | Type | Description
    Dataset_ID | Byte array | Unique identifier for the encoded content
    Major_Brand | Byte array | Major + Minor version of the encoding algorithm
    Minor_Version | Byte array | Major + Minor version of the encoding algorithm
    Header Size | Integer | Size in bytes of the entire encoded content
    Reads Length | Integer | Size of reads in case of constant reads length. A special value (e.g. 0) is reserved for variable reads length
    Ref count | Integer | Number of reference sequences used
    Access Units counters | Byte array (e.g. encoded integers) | Total number of encoded Access Units per reference sequence
    Ref ids | Byte array | Unique identifiers for reference sequences
    Ref_count | Integer | Number of references
    for (i=0; i<Ref_count; i++) {
     Reference_genome:Ref_ID | string:string | Unambiguous ID, as a character string, identifying the reference sequence(s) used in this Dataset
    }
    for (i=0; i<Ref_count; i++) {
     Ref blocks | Byte array | Number of encoded blocks per each reference
    }
    Dataset label size | Integer | The size of the following element
    Dataset label | String | A string of characters used to identify the dataset
    Dataset type | Integer | The type of data encoded in the dataset (e.g. aligned, not aligned)
    Master index table | Byte array | This is a multidimensional array supporting random access to Access Units.
     Alignment positions of first read in each block (Access Unit), i.e. smaller position of the first read on the reference genome per each block of the six classes.
     1 per pos class (six) per reference.
    Label List | Byte array (e.g. integers) | This is a list of Labels, each one represented as a multidimensional array in order to support selective access to specific genomic regions or sub-regions or aggregations of regions or sub-regions.
     Sub-part of the Genomic Dataset Header indicating
      number of Labels
      for each Label:
       the Label ID
       the number of reference sequences concerned by the label
       for each reference sequence:
        the reference identifier
        the number of regions covered by the label
        for each region:
         the class ID
         the start position in the genomic range
         the end position in the genomic range
     Start position and end position can be replaced by “block numbers”, composing, together with reference sequence ID and class ID, a three dimensional vector addressing the coordinates of the Master Index Table.
    Parameters set | Byte array | Encoding parameters used to configure the encoding process and sent to the decoder.
  • Descriptors Layer Header
  • Descriptors (a.k.a. syntax elements) are described in the following sections of this document and are the building blocks of the genomic information representation described by this invention. They are organized in layers (a.k.a. descriptors streams) of homogeneous elements partitioned according to the specific statistical properties of each descriptor. This has the advantage of reducing the entropy of each layer and improving compression efficiency.
  • Each layer is prepended by the Descriptors Layer Header described below.
  • Syntax | Description
    Descriptors_Layer_Header {
     Descriptors_Layer_ID | Descriptors layer ID, table specified in this specification
     Num_Of_Blocks | Number of Blocks in the Descriptors Layer
     Label size | Size of the human readable label
     Label | (Human-Readable) Label
     Flag | Flag used to interpret the following metadata
     Local Index Table | The Local Index Table structure as described in this invention
     Metadata | Data structure carrying metadata to be used for application-specific processing such as data analysis and content protection.
    }
  • Block Header
  • Every Descriptors Layer is composed of one or multiple Genomic Data Blocks. One or more Blocks from different Layers compose an Access Unit, depending on the Class of data.
  • An Access Unit is a set of Genomic Blocks that can be decoded either independently from other Access Units by using only globally available data (e.g. decoder configuration) or by using information contained in other Access Units.
  • Syntax | Semantic
    Block_Header {
     Descriptors_Layer_ID | Unambiguously identifies the descriptors stream. Same as Descriptors_Layer_ID in the Descriptors Layer Header
     Block size (BS) | Number of bytes composing the Block, including this header and payload, and excluding padding (total Block size will be BS + padding size).
    }
  • Definition of the Information Necessary to Represent Sequence Reads into Layers of Descriptors
  • Once the classification of reads is completed with the definition of the Classes, further processing consists in defining a set of distinct syntax elements which represent the remaining information enabling the reconstruction of the DNA read sequence when represented as being mapped on a given reference sequence.
  • A sequence read (e.g. a DNA segment) referred to a given reference sequence can be fully expressed by:
      • The starting position on the reference sequence pos (292).
      • A flag signaling if the read has to be considered as a reverse complement versus the reference rcomp (293).
      • A distance to the mate pair in case of paired reads pair (294).
      • The value of the read length (295) in case the sequencing technology produces variable length reads. In case of constant reads length, the read length associated with each read can be omitted and stored once in the Genomic Dataset Header.
      • For each mismatch:
        • Mismatch position (nmis (300) for class N, snpp (311) for class M, and indp (321) for class I)
        • Mismatch type (not present in class N, snpt (312) in class M, indt (322) in class I)
      • Flags (296) indicating specific characteristics of the sequence read such as:
        • template having multiple segments in sequencing
        • each segment properly aligned according to the aligner
        • unmapped segment
        • next segment in the template unmapped
        • signalization of first or last segment
        • quality control failure
        • PCR or optical duplicate
        • secondary alignment
        • supplementary alignment
      • Soft clipped nucleotides string (323) when present for class I
      • Flag indicating the reference used for alignment and compression (e.g. internal reference for class U) if applicable (descriptor rtype).
      • For class U, descriptor indc identifies those parts of the reads (typically the edges) that do not match, with a specified set of matching accuracy constraints, with the “internal” reference sequences.
      • Descriptor ureads is used to encode verbatim the reads that cannot be mapped on any available reference, whether it is “external” (i.e. pre-existing, like an actual reference genome) or an “internal” reference sequence.
  • This classification creates groups of descriptors (syntax elements) that can be used to univocally represent genome sequence reads. The table below summarizes the syntax elements needed for each class of reads aligned with “pre-existing” (i.e. “external”) or “constructed” (i.e. “internal”) references.
  • TABLE 3
    Defined layers per class of data.
    P N M I U HM
    pos X X X X X X
    pair X X X X X
    rcomp X X X X X
    flags X X X X X
    rlen X X X X X
    nmis X
    snpp X X
    snpt X X
    indp X X
    indt X X
    indc X X
    ureads X X
    rtype X
  • Reads belonging to class P are characterized and can be perfectly reconstructed by only a position, a reverse complement information and an offset between mates in case they have been obtained by a sequencing technology yielding mated pairs, some flags and a read length.
  • The next section details how these descriptors are defined for classes P, N, M and I while for class U they are described in a later section.
  • Class HM is applied to read pairs only and it is a special case where one read belongs to class P, N, M or I and the other to class U.
  • Position Descriptors Layer
  • In each Access Unit, only the mapping position of the first encoded read is stored in the AU header as absolute position on the reference genome. All the other positions are expressed as a difference with respect to the previous position and are stored in a specific layer. This modeling of the information source, defined by the sequence of read positions, is in general characterized by a reduced entropy particularly for sequencing processes generating high coverage results. Once the absolute position of the first alignment has been stored, all positions of other reads are expressed as difference (distance) with respect to the first one.
  • For example FIG. 4 shows how after encoding the starting position of the first alignment as position “10000” on the reference sequence, the position of the second read starting at position 10180 is coded as “180”. With high coverage data (>50×) most of the descriptors of the position vector will show very high occurrences of low values such as 0 and 1 and other small integers. FIG. 10 shows how the positions of three read pairs are encoded in a pos Layer.
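  • As an example and not as a limitation, the differential coding of mapping positions within one Access Unit can be sketched as follows. The function names are hypothetical, and the reads are assumed to be already sorted by mapping position.

```python
def encode_positions(positions):
    """Return (absolute_first, deltas) for the mapping positions of one Access Unit.

    The first position is kept absolute (it is stored in the AU header);
    every other position is expressed as the difference from the previous one.
    """
    first = positions[0]
    deltas = [cur - prev for prev, cur in zip(positions, positions[1:])]
    return first, deltas


def decode_positions(first, deltas):
    """Rebuild the absolute positions from the first position and the deltas."""
    positions = [first]
    for delta in deltas:
        positions.append(positions[-1] + delta)
    return positions
```

  • With the positions 10000, 10180, 10180 and 10181 the pos layer would carry 180, 0 and 1, which illustrates why high coverage data yields many small values.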
  • The same source model is used for the positions of reads belonging to classes N, M, P and I. In order to enable any combination of selective access to the data, the positions of reads belonging to the four classes are encoded in separate layers as depicted in Table 3.
  • Reverse Complement Descriptor Layer
  • Each read of the read pairs produced by sequencing technologies can originate from either genome strand of the sequenced organic sample. However, only one of the two strands is used as reference sequence. FIG. 8 shows how in a reads pair one read (read 1) can originate from one strand and the other (read 2) can originate from the other strand.
  • When the strand 1 is used as reference sequence, read 2 can be encoded as reverse complement of the corresponding fragment on strand 1. This is shown in FIG. 9.
  • In case of coupled reads, four are the possible combinations of direct and reverse complement mate pairs. This is shown in FIG. 10. The rcomp layer codes the four possible combinations.
  • The same coding is used for the reverse complement information of reads belonging to classes P, N, M, I. In order to enable enhanced selective access to the data, the reverse complement information of reads belonging to the four classes are coded in different layers as depicted in Table 3.
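  • As an example and not as a limitation, the four combinations of direct and reverse complement mates can be represented with a two-bit code as sketched below. The numeric values are an assumption made for illustration only; the text states that four combinations are coded but does not fix their values here.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(sequence):
    """Return the reverse complement of a nucleotide string."""
    return sequence.translate(_COMPLEMENT)[::-1]


def encode_rcomp(read1_reverse, read2_reverse):
    """Pack the two strand flags of a pair into one value in 0..3 (hypothetical layout)."""
    return (int(read1_reverse) << 1) | int(read2_reverse)


def decode_rcomp(code):
    """Unpack a value in 0..3 into (read1_reverse, read2_reverse)."""
    return bool((code >> 1) & 1), bool(code & 1)
```

  • A decoder would apply reverse_complement to the fragment of the reference covered by a read whenever the corresponding flag is set.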
  • Pairing Descriptors Layer
  • The pairing descriptor is stored in the pair layer. Such layer stores descriptors encoding the information needed to reconstruct the originating reads pairs, when the employed sequencing technology produces reads by pairs. Although at the date of the disclosure of the invention the vast majority of sequencing data is generated by using a technology generating paired reads, it is not the case of all technologies. This is the reason for which the presence of this layer is not necessary to reconstruct all sequencing data information if the sequencing technology of the genomic data considered does not generate paired reads information.
  • Definitions
      • mate pair: read associated to another read in a read pair (e.g. Read 2 is the mate pair of Read 1 in the example of FIG. 4)
      • pairing distance: number of nucleotide positions on the reference sequence which separate one position in the first read (pairing anchor, e.g. last nucleotide of first read) from one position of the second read (e.g. the first nucleotide of the second read)
      • most probable pairing distance (MPPD): this is the most probable pairing distance expressed in number of nucleotide positions.
      • position pairing distance (PPD): the PPD is a way to express a pairing distance in terms of the number of reads separating one read from its respective mate present in a specific position descriptor layer.
      • most probable position pairing distance (MPPPD): is the most probable number of reads separating one read from its mate pair present in a specific position descriptor layer.
      • position pairing error (PPE): is defined as the difference between the MPPD or MPPPD and the actual position of the mate.
      • pairing anchor: position of first read last nucleotide in a pair used as reference to calculate the distance of the mate pair in terms of number of nucleotide positions or number of read positions.
  • FIG. 5 shows how the pairing distance among read pairs is calculated.
  • The pair descriptor layer is the vector of pairing errors calculated as number of reads to be skipped to reach the mate pair of the first read of a pair with respect to the defined decoding pairing distance.
  • FIG. 6 shows an example of how pairing errors are calculated, both as absolute value and as differential vector (characterized by lower entropy for high coverages).
  • The same descriptors are used for the pairing information of reads belonging to classes N, M, P and I. In order to enable the selective access to the different data classes, the pairing information of reads belonging to the four classes are encoded in different layer as depicted in.
  • Pairing Information in Case of Reads Mapped on Different References
  • In the process of mapping sequence reads on a reference sequence it is not uncommon to have the first read in a pair mapped on one reference (e.g. chromosome 1) and the second on a different reference (e.g. chromosome 4). In this case the pairing information described above has to be integrated by additional information related to the reference sequence used to map one of the reads. This is achieved by coding
  • 1. A reserved value (flag) indicating that the pair is mapped on two different sequences (different values indicate if read1 or read2 are mapped on the sequence that is not currently encoded)
    2. a unique reference identifier referring to the reference identifiers encoded in the Genomic Dataset Header structure as described in Table 2.
    3. a third element containing the mapping information on the reference identified at point 2 and expressed as offset with respect to the last encoded position.
  • FIG. 7 provides an example of this scenario.
  • In FIG. 7, since Read 4 is not mapped on the currently encoded reference sequence, the genomic encoder signals this information by crafting additional descriptors in the pair layer. In the example shown in FIG. 7 Read 4 of pair 2 is mapped on reference no. 4 while the currently encoded reference is no. 1. This information is encoded using 3 components:
  • 1) One special reserved value is encoded as pairing distance (in this case 0xffffff)
    2) A second descriptor provides a reference ID as listed in the Genomic Dataset Header (in this case 4)
    3) The third element contains the mapping information on the concerned reference (170).
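  • As an example and not as a limitation, the three components used when a mate is mapped on a different reference can be sketched as follows. The reserved value 0xffffff is taken from the example of FIG. 7; the function names are hypothetical, and only one reserved value is modelled although the text allows different values to distinguish read 1 from read 2.

```python
DIFFERENT_REFERENCE = 0xFFFFFF  # reserved pairing distance from the FIG. 7 example


def encode_pair_on_other_reference(reference_id, mapping_info):
    """Return the three pair-layer descriptors for a mate on another reference."""
    return [DIFFERENT_REFERENCE, reference_id, mapping_info]


def decode_pair(descriptors, index=0):
    """Decode one pairing entry starting at index.

    Returns (entry, next_index). An ordinary entry consumes one descriptor,
    an entry flagged with the reserved value consumes three.
    """
    value = descriptors[index]
    if value == DIFFERENT_REFERENCE:
        entry = {"reference_id": descriptors[index + 1],
                 "mapping": descriptors[index + 2]}
        return entry, index + 3
    return {"distance": value}, index + 1
```

  • For Read 4 of FIG. 7 the encoder would emit 0xffffff, 4 and 170.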
  • Mismatch Descriptors for Class N Reads
  • Class N includes all reads in which only “n type” mismatches are present, i.e. in place of an A, C, G or T base an N is found as called base. All other bases of the read perfectly match the reference sequence.
  • FIG. 11 shows how:
      • the positions of “N” in read 1 are coded as
        • absolute position in read 1 or
        • differential position with respect to the previous “N” in the same read.
      • the positions of “N” in read 2 are coded as
        • absolute position in read 2+read 1 length or
        • differential position with respect to the previous “N”.
  • In the nmis layer, the coding of each reads pair is terminated by a special “separator” symbol.
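  • As an example and not as a limitation, the coding of the “N” positions of a reads pair can be sketched as follows. The separator value, the 0-based positions and the choice of measuring the first differential value from position 0 are assumptions made for illustration.

```python
SEPARATOR = -1  # hypothetical value for the special "separator" symbol


def encode_nmis(read1, read2, differential=False):
    """Return the nmis descriptors of one reads pair, terminated by the separator.

    Positions in read 2 are offset by the length of read 1, so that the pair
    shares one coordinate space.
    """
    positions = [i for i, base in enumerate(read1) if base == "N"]
    positions += [len(read1) + i for i, base in enumerate(read2) if base == "N"]
    if differential:
        previous = 0
        deltas = []
        for position in positions:
            deltas.append(position - previous)
            previous = position
        positions = deltas
    return positions + [SEPARATOR]
```

  • For read 1 “ACNTN” and read 2 “NACG” the absolute coding gives 2, 4, 5 and the differential coding gives 2, 2, 1, each followed by the separator.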
  • Encoding Substitutions (Mismatches or SNPs)
  • A substitution is defined as the presence, in a mapped read, of a different nucleotide with respect to the one that is present in the reference sequence at the same position (see FIG. 12).
  • Each substitution can be encoded as
      • “position” (snpp layer) and “type” (snpt layer). See FIG. 13, FIG. 14, FIG. 16 and FIG. 15.
      • OR
      • “position” only but using one snpp layer per mismatch type. See FIG. 17
  • Substitutions Positions
  • A substitution position is calculated as for the values of the nmis layer, i.e.:
      • In read 1 substitutions are encoded
        • as absolute position in read 1 OR
        • as differential position with respect to the previous substitution in the same read.
      • In read 2 substitutions are encoded
        • as absolute position in read 2+read 1 length OR
        • as differential position with respect to the previous substitution.
  • FIG. 13 shows how substitutions positions are encoded in layer snpp. Substitutions positions can be calculated either as absolute or as differential values.
  • In the snpp layer, the encoding of each reads pair is terminated by a special “separator” symbol.
  • Substitutions Types Descriptors
  • For class M (and I as described in the next sections), mismatches are coded by an index (moving from right to left) from the actual symbol present in the reference to the corresponding substitution symbol present in the read {A, C, G, T, N, Z}. For example if the aligned read presents a C instead of a T which is present at the same position in the reference, the mismatch index will be denoted as “4”. The decoding process reads the encoded syntax element, the nucleotide at the given position on the reference and moves from left to right to retrieve the decoded symbol. E.g. a “2” received for a position where a G is present in the reference will be decoded as “N”. FIG. 14 shows all the possible substitutions and the respective encoding symbols when IUPAC ambiguity codes are not used and FIG. 15 provides an example of encoding of substitutions types in the snpt layer.
  • In case of presence of IUPAC ambiguity codes, the substitution indexes change as shown in FIG. 16.
  • In case the encoding of substitution types described above presents high information entropy, an alternative method of substitution encoding consists in storing only the mismatch positions in separate layers, one per nucleotide, as depicted in FIG. 17.
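  • As an example and not as a limitation, the index-based coding of substitution types without IUPAC ambiguity codes can be sketched as follows. The function names are hypothetical. The sketch counts circular steps from left to right over the vector {A, C, G, T, N, Z}, which is the convention that reproduces the two worked examples of the text (a C in place of a reference T gives 4; a 2 received on a reference G decodes to N).

```python
SYMBOLS = "ACGTNZ"  # substitution vector when IUPAC ambiguity codes are not used


def encode_substitution(reference_base, read_base):
    """Return the number of circular steps from the reference base to the read base."""
    return (SYMBOLS.index(read_base) - SYMBOLS.index(reference_base)) % len(SYMBOLS)


def decode_substitution(reference_base, index):
    """Return the read base found index circular steps after the reference base."""
    return SYMBOLS[(SYMBOLS.index(reference_base) + index) % len(SYMBOLS)]
```

  • Because the index is relative to the base present in the reference, the decoder needs the reference sequence at the mismatch position to recover the symbol.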
  • Encoding of Insertions and Deletions
  • For class I, mismatches and deletions are coded by an index (moving from right to left) from the actual symbol present in the reference to the corresponding substitution symbol present in the read: {A, C, G, T, N, Z}. For example if the aligned read presents a C instead of a T present at the same position in the reference, the mismatch index will be “4”. In case the read presents a deletion where an A is present in the reference, the coded symbol will be “5”. The decoding process reads the coded syntax element, the nucleotide at the given position on the reference and moves from left to right to retrieve the decoded symbol. E.g. a “3” received for a position where a G is present in the reference will be decoded as “Z” which indicates the presence of a deletion in the sequence read.
  • Inserts are coded as 6, 7, 8, 9, 10 respectively for inserted A, C, G, T, N.
  • In case of adoption of the IUPAC ambiguity codes, the substitution mechanism is exactly the same; however, the substitution vector is extended as: S={A, C, G, T, N, Z, M, R, W, S, Y, K, V, H, D, B} and insertions use different codes: 16, 17, 18, 19, 20.
  • FIG. 18 and FIG. 19 show examples of how to encode substitutions, inserts and deletions in a reads pair of class I.
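  • As an example and not as a limitation, the decoding of one class I symbol can be sketched as follows. The function name and the returned tuples are hypothetical. Indexes below the size of the substitution vector are read as circular steps from left to right starting at the reference base, in line with the worked examples; codes starting at the size of the vector are read as inserted A, C, G, T, N.

```python
BASIC = "ACGTNZ"            # substitution vector without IUPAC ambiguity codes
IUPAC = "ACGTNZMRWSYKVHDB"  # extended substitution vector
INSERTED = "ACGTN"          # order of the insertion codes


def decode_class_i(reference_base, code, alphabet=BASIC):
    """Decode one class I symbol into (kind, base).

    kind is "insertion", "deletion" or "substitution"; base is None for a deletion.
    """
    size = len(alphabet)
    if code >= size:  # 6..10 for BASIC, 16..20 for IUPAC
        offset = code - size
        if offset >= len(INSERTED):
            raise ValueError("code outside the insertion range")
        return ("insertion", INSERTED[offset])
    symbol = alphabet[(alphabet.index(reference_base) + code) % size]
    if symbol == "Z":
        return ("deletion", None)
    return ("substitution", symbol)
```

  • The same function serves both alphabets because the insertion codes always follow the last substitution index.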
  • The following structures of file format, access units and multiplexing are described referring to the coding elements disclosed here above. However, the access units, the file format and the multiplexing produce the same technical advantage also with other and different algorithms of source modeling and genomic data compression.
  • Construction of “Internal” References for Unmapped Reads of “Class U” and “Class HM”
  • In the case of the reads belonging to Class U or the unmapped pair of “Class HM” since they cannot be mapped to any “external” reference sequence satisfying the specified set of matching accuracy constraints for belonging to any of the classes P, N, M, or I, one or more “internal” reference sequences are constructed and used for the compressed representation of the reads belonging to these data classes.
  • Several approaches are possible to construct appropriate “internal” references such as for instance and not as limitation:
      • the partitioning of the non-mapped reads into clusters containing reads that share a common contiguous genomic sequence of at least a minimal size (signature). Each cluster can be uniquely identified by its signature.
      • the sorting of reads in any meaningful order (e.g. lexicographic order) and the use of the last N reads as “internal” reference for the encoding of the (N+1)-th read. This method is shown in FIG. 51.
      • performing a so called “de-novo assembly” on a subset of the reads of class U so as to be able to align and encode all or a relevant sub-set of the reads belonging to said class according to the specified matching accuracy constraints or a new set of constraints.
  • If the read being coded can be mapped on the “internal” reference satisfying the specified set of matching accuracy constraints, the information necessary to reconstruct the read after compression is coded using syntax elements that can be of the following types:
      • 1. Start position of the matching portion on the internal reference in terms of read number in the internal reference (pos layer). This position can be encoded either as absolute or differential value with respect to the previously encoded read.
      • 2. Offset of the start position from the beginning of the corresponding read in the internal reference (pair layer). E.g. in case of constant read length the actual position is pos*length+pair.
      • 3. Possibly present mismatches coded as mismatch position (snpp layer) and type (snpt layer)
      • 4. Those parts of the reads (typically the edges identified by pair) that do not match with the internal reference (or do so, but with a number of mismatches above a defined threshold) are encoded in the indc layer. A padding operation can be performed to the edges of the part of the internal reference used in order to reduce the entropy of the mismatches encoded in the indc layer, as shown in FIG. 51. The most appropriate padding strategy can be chosen by the encoder according to the statistical properties of the genomic data being processed. Possible padding strategies include:
        • a. No padding
        • b. Constant padding pattern chosen according to its frequency in the currently encoded data.
        • c. Variable padding pattern according to the statistical properties of the current context defined in terms of the latest N encoded reads
  • The specific type of padding strategy will be signaled by special values in the indc layer header.
      • 5. A flag that indicates if the read has been encoded using an internal self-generated, external or no-reference (rtype layer)
      • 6. Reads which are encoded verbatim (ureads).
  • FIG. 51 provides an example of such encoding procedure.
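  • As an example and not as a limitation, the use of the pos and pair descriptors to locate a read on an internal reference built from constant length reads can be sketched as follows. The function names are hypothetical, and pos is assumed to be a 0-based read number.

```python
def start_on_internal_reference(pos, pair, read_length):
    """Return the nucleotide position pos*length+pair on the internal reference."""
    return pos * read_length + pair


def matching_window(internal_reads, pos, pair, read_length):
    """Return the portion of the internal reference against which a read is coded.

    The internal reference is taken as the concatenation of the previously
    encoded reads; the window may run past its end, in which case the missing
    part would be covered by the indc layer or by padding.
    """
    reference = "".join(internal_reads)
    start = start_on_internal_reference(pos, pair, read_length)
    return reference[start:start + read_length]
```

  • For instance, with the internal reads “ACGTAC” and “GGTTAA”, pos 0 and pair 3 select the window “TACGGT”.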
  • FIG. 56 shows an alternative encoding of unmapped reads on the internal reference where pos+pair syntax elements are replaced by a signed pos. In this case pos would express the distance, in terms of positions on the reference sequence, of the leftmost nucleotide position of read n with respect to the position of the leftmost nucleotide of read n−1.
  • This coding approach can be extended to support N start positions per read so that reads can be split over two or more reference positions. This can be particularly useful to encode reads generated by those sequencing technologies (e.g. from Pacific Biosciences) producing very long reads (50K+ bases) which usually present repeated patterns generated by loops in the sequencing methodology. The same approach can be used as well to encode chimeric sequence reads defined as reads that align to two distinct portions of the genome with little or no overlap.
  • The approach described above can be clearly applied beyond the simple class U and could be applied to any layer containing syntax elements related to reads positions (pos layers).
  • File Format: Selective Access to Regions of Genomic Data by Using the Master Index Table
  • In order to support selective access to specific regions of the aligned data, the data structure described in this document implements an indexing tool called Master Index Table (MIT). This is a multi-dimensional array containing the loci at which specific reads map on the used reference sequences. The values contained in the MIT are the mapping positions of the first read in each pos layer so that non-sequential access to each Access Unit is supported. The MIT contains one section per each class of data (P, N, M, I, U and HM) and per each reference sequence. The MIT is contained in the Genomic Dataset Header of the encoded data. FIG. 20 shows the structure of the Genomic Dataset Header, FIG. 21 shows a generic visual representation of the MIT and FIG. 22 shows an example of MIT for the class P of encoded reads.
  • The values contained in the MIT depicted in FIG. 22 are used to directly access the region of interest (and the corresponding AU) in the compressed domain.
  • For example, with reference to FIG. 22, if it is required to access the region between positions 150,000 and 250,000 on reference 2, a decoding application would skip to the second reference in the MIT and would look for the two values k1 and k2 such that k1<150,000 and k2>250,000, where k1 and k2 are two indexes read from the MIT. In the example of FIG. 22 this would result in positions 3 and 4 of the second vector of the MIT. The decoding application then uses these returned values to fetch the positions of the appropriate data from the pos layer Local Index Table, as described in the next section.
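  • The lookup described above can be sketched with a binary search over one MIT vector. The MIT values below are hypothetical (they are not taken from FIG. 22) and were chosen so that the query returns positions 3 and 4; the function name and the clamping at the vector ends are also assumptions of this sketch.

```python
import bisect


def mit_lookup(mit_vector, start, end):
    """Return 1-based positions (i1, i2) in a sorted MIT vector.

    i1 is the position of the last value smaller than start and i2 the
    position of the first value greater than end. When no such value
    exists, the result is clamped to the first or last position.
    """
    i1 = bisect.bisect_left(mit_vector, start)      # number of values < start
    i2 = bisect.bisect_right(mit_vector, end) + 1   # first value > end, 1-based
    return max(i1, 1), min(i2, len(mit_vector))


# Hypothetical MIT vector for class P on reference 2: the mapping position
# of the first read of each Access Unit.
mit_ref2_class_p = [10_000, 80_000, 140_000, 260_000, 400_000]

k1_pos, k2_pos = mit_lookup(mit_ref2_class_p, 150_000, 250_000)  # (3, 4)
```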
  • Together with pointers to the layer containing the data belonging to the four classes of genomic data described above, the MIT can be used as an index of additional metadata and/or annotations added to the genomic data during its life cycle.
  • Local Index Table
  • Each data layer described above is prefixed with a data structure referred to as local header. The local header contains a unique identifier of the layer, a vector of Access Units counters per each reference sequence, a Local Index Table (LIT) and optionally some layer specific metadata. The LIT is a vector of pointers to the physical position of the data belonging to each AU in the layer payload. FIG. 23 depicts the generic layer header and payload where the LIT is used to access specific regions of the encoded data in a non-sequential way.
  • In the previous example, in order to access region 150,000 to 250,000 of reads aligned on the reference sequence no. 2, the decoding application retrieved positions 3 and 4 from the MIT. These values shall be used by the decoding process to access the 3rd and 4th elements of the corresponding section of the LIT. In the example shown in FIG. 24, the Total Access Units counters contained in the layer header are used to skip the LIT indexes related to AUs related to reference 1 (5 in the example). The indexes containing the physical positions of the requested AUs in the encoded stream are therefore calculated as:
  • position of the data blocks belonging to the requested AU=data blocks belonging to AUs of reference 1 to be skipped+position retrieved using the MIT, i.e.
    First block position: 5+3=8
    Last block position: 5+4=9
  • The blocks of data retrieved using the Local Index Table indexing mechanism are part of the requested Access Units.
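  • The arithmetic of the example above (5+3=8 and 5+4=9) can be written as a short sketch. The counter for reference 1 (5) and the MIT positions (3 and 4) come from the text; the counter for reference 2, the byte offsets in the LIT and the function name are hypothetical.

```python
def lit_block_positions(au_counters, ref_number, mit_first, mit_last):
    """Return the 1-based LIT positions of the first and last requested blocks.

    au_counters[i] is the number of Access Units of reference i+1, as stored
    in the layer header; ref_number is 1-based.
    """
    skipped = sum(au_counters[:ref_number - 1])  # AUs of the preceding references
    return skipped + mit_first, skipped + mit_last


au_counters = [5, 6]  # 5 AUs for reference 1 (from the text), 6 is hypothetical
first, last = lit_block_positions(au_counters, ref_number=2, mit_first=3, mit_last=4)

# Hypothetical LIT: one physical offset in the layer payload per Access Unit.
lit = [0, 120, 260, 410, 530, 700, 820, 990, 1150, 1300, 1480]
first_offset = lit[first - 1]
last_offset = lit[last - 1]
```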
  • FIG. 26 shows how the data blocks retrieved using the MIT and the LIT compose one or more Access Units.
  • Access Units
  • The genomic data classified in data classes and structured in compressed or uncompressed layers are organized into different Access Units.
  • Genomic Access Units (AU) are defined as sections of genome data (in a compressed or uncompressed form) that reconstruct nucleotide sequences and/or the relevant metadata, and/or sequences of DNA/RNA (e.g. the virtual reference) and/or annotation data generated by a genome sequencing machine and/or a genomic processing device or analysis application. An example of an Access Unit is provided in FIG. 26.
  • An Access Unit is a block of data that can be decoded either independently from other Access Units by using only globally available data (e.g. decoder configuration) or by using information contained in other Access Units.
  • Access Units are differentiated by:
      • type, characterizing the nature of the genomic data and data sets they carry and the way they can be accessed,
      • order, providing a unique order to Access Units belonging to the same type.
  • Access units of any type can be further classified into different “categories”.
  • Hereafter follows a non-exhaustive list of definitions of different types of genomic Access Units:
    • 1) Access Units of type 0 do not need to refer to any information coming from other Access Units to be read or decoded and processed. The entire information carried by the data or data sets they contain can be independently read and processed by a decoding device or processing application.
    • 2) Access Units of type 1 contain data that refer to data carried by Access Units of type 0. Reading or decoding and processing the data contained in Access Units of type 1 requires having access to one or more Access Units of type 0. Access Units of type 1 encode genomic data related to sequence reads of “Class P”.
    • 3) Access Units of type 2 contain data that refer to data carried by Access Units of type 0. Reading or decoding and processing the data contained in Access Units of type 2 requires having access to one or more Access Units of type 0. Access Units of type 2 encode genomic data related to sequence reads of “Class N”.
    • 4) Access Units of type 3 contain data that refer to data carried by Access Units of type 0. Reading or decoding and processing the data contained in Access Units of type 3 requires having access to one or more Access Units of type 0. Access Units of type 3 encode genomic data related to sequence reads of “Class M”.
    • 5) Access Units of type 4 contain data that refer to data carried by Access Units of type 0. Reading or decoding and processing the data contained in Access Units of type 4 requires having access to one or more Access Units of type 0. Access Units of type 4 encode genomic data related to sequence reads of “Class I”.
    • 6) Access Units of type 5 contain reads that cannot be mapped on any available reference sequence (“Class U”) and are encoded using an internally constructed reference sequence. Access Units of type 5 contain data that refer to data carried by Access Units of type 0. Reading or decoding and processing the data contained in Access Units of type 5 requires having access to one or more Access Units of type 0.
    • 7) Access Units of type 6 contain read pairs where one read can belong to any of the four classes P, N, M, I and the other cannot be mapped on any available reference sequence (“Class HM”). Access Units of type 6 contain data that refer to data carried by Access Units of type 0. Reading or decoding and processing the data contained in Access Units of type 6 requires having access to one or more Access Units of type 0.
    • 8) Access Units of type 7 contain metadata (e.g. quality scores) and/or annotation data associated to the data or data sets contained in the access unit of type 1. Access Units of type 7 may be classified and labelled in different layers.
    • 9) Access Units of type 8 contain data or data sets classified as annotation data. Access Units of type 8 may be classified and labelled in layers.
    • 10) Access Units of additional types can extend the structure and mechanisms described here. As an example, but not as a limitation, the results of genomic variant calling, structural and functional analysis can be encoded in Access Units of new types. The data organization in Access Units described herein does not prevent any type of data from being encapsulated in Access Units, since the mechanism is completely transparent with respect to the nature of the encoded data.
  • Access Units of type 0 are ordered (e.g. numbered), but they do not need to be stored and/or transmitted in an ordered manner (technical advantage: parallel processing/parallel streaming, multiplexing)
  • Access Units of type 1, 2, 3, 4, 5 and 6 do not need to be ordered and do not need to be stored and/or transmitted in an ordered manner (technical advantage: parallel processing/parallel streaming).
  • FIG. 26 shows how Access Units are composed of a header and one or more layers of homogeneous data. Each layer can be composed of one or more blocks. Each block contains several packets, and the packets are a structured sequence of the descriptors introduced above to represent e.g. read positions, pairing information, reverse complement information, mismatch positions and types, etc.
  • The number of packets per block can differ from one Access Unit to another, but within an Access Unit all blocks have the same number of packets.
  • Each data packet can be identified by the combination of 3 identifiers X Y Z where:
      • X identifies the access unit it belongs to
      • Y identifies the block it belongs to (i.e. the data type it encapsulates)
      • Z is an identifier expressing the packet order with respect to other packets in the same block
  • FIG. 28 shows an example of Access Units and packets labelling where AU T N is an access unit of type T with identifier N which may or may not imply a notion of order according to the Access Unit Type. Identifiers are used to uniquely associate Access Units of one type with those of other types required to completely decode the carried genomic data.
  • Access Units of any type can be further classified and labelled in different “categories” according to different sequencing processes. For example, but not as a limitation, classification and labelling can take place when
      • 1. sequencing the same organism at different times (Access Units contain genomic information with a “temporal” connotation),
      • 2. sequencing organic samples of different nature of the same organisms (e.g. skin, blood, hair for human samples). These are Access Units with “biological” connotation.
  • The access units of type 1, 2, 3, 4, 5 and 6 are built according to the result of a matching function applied on genome sequence fragments (a.k.a. reads) with respect to the reference sequence encoded in Access Units of type 0 they refer to.
  • For example access units (AUs) of type 1 (see FIG. 30) may contain the positions and the reverse complement flags of those reads which result in a perfect match (or maximum possible score corresponding to the selected matching function) when a matching function is applied to specific regions of the reference sequence encoded in AUs of type 0. Together with the data contained in AUs of type 0, such matching function information is sufficient to completely reconstruct all genome sequence reads represented by the data set carried by the access units of type 1.
  • With reference to the genomic data classification previously described in this document, the Access Units of type 1 described above would contain information related to genomic sequence reads of class P (perfect matches).
  • In the case of variable read length and paired reads, the data contained in the AUs of type 1 mentioned in the previous example have to be integrated with the data representing the information about read pairing and read length in order to completely reconstruct the genomic data, including the association of read pairs. With respect to the data classification previously introduced in the present document, the pair and rlen layers would be encoded in AUs of type 1.
  • The matching function applied with respect to Access Units of type 0 to classify the content of the AUs of types 1 to 6 can provide results such as:
      • 1. each sequence contained in the AU of type 1 perfectly matches sequences contained in the AU of type 0 in correspondence to the specified position;
      • 2. each sequence contained in the AU of type 2 perfectly matches a sequence contained in the AU of type 0 in correspondence to the specified position, except for the “N” symbols present (base not called by the sequencing device) in the sequence in the AU of type 2;
      • 3. each sequence contained in the AU of type 3 includes variants in the form of substituted symbols (variants) with respect to the sequence contained in the AU of type 0 in correspondence to the specified position;
      • 4. each sequence contained in the AU of type 4 includes variants in the form of substituted symbols (variants), insertions and/or deletions with respect to the sequence contained in the AU of type 0 in correspondence to the specified position.
      • 5. each sequence contained in the AU of type 5 does not map to any sequence contained in the AU of type 0;
      • 6. each sequence pair contained in the AU of type 6 presents one sequence that can belong to any class P, N, M and I (points 1 to 4 above) while the other sequence does not map to any sequence contained in the AU of type 0.
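  • The outcomes listed above can be illustrated with a toy matching function. This is only a sketch under simplifying assumptions: the read is compared base by base against the reference at a given position, the presence of insertions or deletions is passed in as a flag rather than computed, and read pairs (class HM) are not handled. All names are hypothetical.

```python
# AU type carrying each class, as listed in the type definitions of this document.
AU_TYPE_BY_CLASS = {"P": 1, "N": 2, "M": 3, "I": 4, "U": 5}


def classify_read(read, reference, pos, has_indels=False):
    """Return the data class of a read with respect to a reference sequence.

    pos is the 0-based mapping position, or None when the read is unmapped.
    Assumption: the read fits entirely inside the reference at pos.
    """
    if pos is None:
        return "U"                      # does not map to the reference
    if has_indels:
        return "I"                      # insertions and/or deletions present
    window = reference[pos:pos + len(read)]
    mismatched = [base for base, ref_base in zip(read, window) if base != ref_base]
    if not mismatched:
        return "P"                      # perfect match
    if all(base == "N" for base in mismatched):
        return "N"                      # differs only where the base was not called
    return "M"                          # at least one substitution


reference = "ACGTACGTAC"
classes = [
    classify_read("CGTA", reference, 1),
    classify_read("CNTA", reference, 1),
    classify_read("CGAA", reference, 1),
    classify_read("CGTTA", reference, 1, has_indels=True),
    classify_read("TTTT", reference, None),
]
au_types = [AU_TYPE_BY_CLASS[c] for c in classes]
```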
  • Identifying Access Units by Using “Labels” Associated to Specific Genomic Regions
  • An additional mechanism is provided by the disclosed invention enabling user-defined selective access to data classes referring to specific genomic regions or sub-regions or aggregations of regions or sub-regions.
  • A “Label” is an identifier which is assigned to a specific genomic region or sub-region or aggregations of regions or sub-regions. Labels identify genomic regions by specifying: the reference sequence id (“Ref ids”), the index of the MIT corresponding to the desired region of the reference sequence, and the data classes. An example is provided in FIG. 52.
  • A single, a subset, or all data classes can be referenced by a Label, enabling selective access to only a sub-set of the data associated to a specific genomic region or sub-regions or aggregations of regions or sub-regions.
  • A Label List should be created by a Genomic Labels Generator (4917 FIG. 49), in a storage scenario and/or in a streaming scenario, to expose the available Labels to the analysis applications that apply selective access to the stored or streamed data.
  • A Label List might include the following elements:
      • the number of Labels
      • for each Label in the list:
        • the Label ID
        • the number of reference sequences concerned by the label
        • for each reference sequence
          • the reference identifier
          • the number of regions covered by the label,
          • for each region:
            • the class ID
            • the start position in the genomic range
            • the end position in the genomic range
  • The table below reports a pseudo-syntax for a generic “Label List”.
  • TABLE 4
    Syntax of the generic “Label List” data format.
    Syntax                                      Description
    Label_list( ) {
      num_Labels                                total number of labels in the list
      for (i = 0; i < num_Labels; i++) {
        Label_id                                label identifier
        num_ref                                 number of references concerned by the current label
        for (j = 0; j < num_ref; j++) {
          ref_id                                current reference
          num_regions                           number of different regions of this reference identified by the label
          for (k = 0; k < num_regions; k++) {
            class_id                            type of class of this region
            start_pos                           start position of this region
            end_pos                             end position of this region
          }
        }
      }
    }
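  • As a reading aid, the generic Label List of Table 4 can be mirrored by a small in-memory structure. The class names, the dictionary layout and the sample coordinates below are hypothetical; only the field names (Label_id, ref_id, class_id, start_pos, end_pos) follow the table.

```python
from dataclasses import dataclass, field


@dataclass
class Region:
    class_id: str      # data class of the region (e.g. "P", "M")
    start_pos: int     # start position in the genomic range
    end_pos: int       # end position in the genomic range


@dataclass
class Label:
    label_id: str
    # ref_id -> regions of that reference covered by the label
    regions_by_ref: dict = field(default_factory=dict)


def labels_covering(label_list, ref_id, position):
    """Return the ids of the labels with a region of ref_id containing position."""
    return [
        label.label_id
        for label in label_list
        if any(r.start_pos <= position <= r.end_pos
               for r in label.regions_by_ref.get(ref_id, []))
    ]


# Hypothetical Label List with two labels.
label_list = [
    Label("GeneXY", {1: [Region("P", 1_000, 5_000), Region("M", 1_000, 5_000)]}),
    Label("GeneWZ", {2: [Region("P", 20_000, 30_000)]}),
]
num_labels = len(label_list)
hits = labels_covering(label_list, ref_id=1, position=2_500)
```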
  • In case Genomic Data are compressed and streamed, one or more Access Units can be identified using a specific “Label” by means of a Block Header field (“Label ID”), which serves as an identifier for the “Label” in the “Label List” which the current block belongs to. Such field enables a dynamic mapping of blocks to “Labels”, typical for streaming scenarios.
  • In the Genomic File Format, the “start_pos” and “end_pos” fields can be replaced by the block numbers referring to all “blocks” belonging to a specific “Label”, as follows:
  • TABLE 5
    Efficient implementation of the “Label List” Syntax data format in the case of a compressed file.
    Syntax                                              Data type   Description
    num_Labels                                          Bitstring   number of labels in the genomic dataset
    for (i = 0; i < num_Labels; i++) {
      Label_id                                          Bitstring   label identifier
      Label_length_in_blocks                            Bitstring   number of data blocks identified by one label
      for (j = 0; j < Label_length_in_blocks; j++) {
        ref_id                                          Bitstring   reference id for this block
        class_id                                        Bitstring   class id for this block
        block_num                                       Bitstring   block number in the Master Index Table
      }
    }
  • The use of block numbers instead of “start_pos” and “end_pos” offers a significant technical advantage because it enables direct access to the “Master Index Table” (MIT): the three-dimensional vector consisting of “ref_id”, “class_id” and “block_num” can be used as coordinates to directly address the MIT itself.
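  • A minimal sketch of this direct addressing, assuming the MIT is held in memory as a nested mapping indexed by reference id, class id and a 0-based block number. The layout, the values and the function name are hypothetical.

```python
# Hypothetical in-memory MIT: mit[ref_id][class_id][block_num] is the mapping
# position of the first read of the corresponding block.
mit = {
    1: {"P": [0, 60_000, 130_000]},
    2: {"P": [10_000, 80_000, 140_000, 260_000],
        "M": [12_000, 90_000, 150_000, 270_000]},
}

# A label in the Table 5 form: one (ref_id, class_id, block_num) triple per block.
gene_xy_blocks = [(2, "P", 2), (2, "M", 2), (2, "P", 3)]


def resolve_label(mit, blocks):
    """Look up each block of a label directly in the MIT, with no position search."""
    return [mit[ref_id][class_id][block_num] for ref_id, class_id, block_num in blocks]


first_positions = resolve_label(mit, gene_xy_blocks)  # [140000, 150000, 260000]
```

  • With “start_pos” and “end_pos” the decoder would first have to search the MIT vector for the matching entries; with block numbers each entry is reached by plain indexing.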
  • In storage scenarios, the “Label List” is created by a Genomic Labels Generator (4917) and sent to the genomic multiplexer (see also FIG. 49). The demultiplexer parses the Label List syntax and exposes the available Labels to the data access application, which according to the specific data access required selects the Access Units corresponding to the subset of “Labels”.
  • The possibility of using “Labels” to identify Access Units associated to specific genomic regions does not prevent using the indexing tools such as MIT and LIT without “Labels” to achieve random data access functionality. Generic random access can be achieved by specifying a three dimensional vector determining the MIT and LIT coordinates of interest (reference id, position range and classes) and ignoring the information carried by the Label List.
  • FIG. 51 shows how labels are used to aggregate and uniquely identify several genomic regions by using indexes contained in the MIT.
  • FIG. 59 shows how a device (592) implementing the labelling mechanism disclosed by this invention can enable concurrent access to several records of data (596) stored in a database (595). Selective protection of one or more regions identified by the same label is supported as well by means of a dedicated module (591) in charge of parsing the query (591) and dispatching the required metadata to the security module (594) in charge of enforcing access control. The labels decoder (593) is in charge of translating the label syntax into object identifiers that can be protected (and therefore access is controlled by the security module 594) or not.
  • Technical Effects
  • The technical effect of structuring genomic information in Access Units or Access Units identified by Labels as described here is that the genomic data:
  • 1. can be selectively queried in order to access:
      • specific “categories” of data (e.g. with a specific temporal or biological connotation) without having to decompress the entire genomic data or data sets and/or the related metadata.
      • specific regions of the genome for all “categories”, a subset of “categories”, a single “category” (with or without the associated metadata) without the need to decompress other regions of the genome
      • specific genomic regions or sub-regions or aggregations of regions or sub-regions identified by user defined “Labels” by only parsing the “Label List” main header and accessing (i.e. retrieving and decompressing) only the corresponding Access Units
  • 2. can be incrementally updated with new data that become available when:
      • new analysis is performed on the genomic data or data sets
      • new genomic data or data sets are generated by sequencing the same organisms (different biological samples, different biological sample of the same type, e.g. blood sample, but acquired at a different time, etc.)
  • 3. can be efficiently transcoded to a new data format in case of
      • new genomic data or data sets to be used as new reference (e.g. new reference genome carried by AU of type 0)
      • update of the encoding format specification
  • 4. can be protected with different levels of granularity in terms of both access control (e.g. encryption) and permissions enforcement. For example, the following scenarios are enabled:
      • the same access control rule and encryption keys can be applied to all the genomic regions or sub-regions identified by one label (see FIG. 54 for an example);
      • different access control rules and different encryption keys can be used to protect each single region or sub-regions aggregated under the same label (see FIG. 55 for an example).
  • With respect to prior art solutions such as SAM/BAM, the described technical features address the problem that data filtering has to happen at the application level, after the entire data set has been retrieved and decompressed from the encoded format.
  • The following are examples of application scenarios in which the combination of access unit structure, file format and Labelling mechanism provides a technical advantage.
  • Selective Access
  • In particular, the disclosed data structure based on Access Units of different types, possibly including user-defined “Labels”, makes it possible to:
      • extract only the read information (data or data sets) of the whole sequencing of all “categories” or a subset (i.e. one or more layers) or a single “category” without having to decompress also the associated metadata information (limitation of current state of the art: SAM/BAM that cannot even support distinction between different categories or layers);
      • extract all the reads aligned on specific regions of the assumed reference sequence for all categories, subsets of the categories, a single category (with or without the associated metadata) without the need of decompressing also other regions of the genome (limitation of current state of the art: SAM/BAM);
      • extract all the reads belonging to a single, a subset or all data “classes” aligned on specific genomic regions or sub-regions or aggregations of regions or sub-regions identified by user specified “Labels” for all categories, subsets of the categories, a single category (with or without the associated metadata) without the need of decompressing also other data associated to other regions of the genome (limitation of current state of the art: SAM/BAM).
  • FIG. 39 shows how access to the genomic information mapped with mismatches on the second segment of the reference sequence (AU 0-2) requires decoding AUs 0-2, 1-2 and 3-2 only. This is an example of selective access according to both a criterion related to a mapping region (i.e. position on the reference sequence) and a criterion related to the matching function applied to the encoded sequence reads with respect to the reference sequence (e.g. mismatches only in this example).
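  • The selection of AUs in this example can be sketched as a dependency lookup. The table below is an assumption built from this document: types 1 to 6 refer to type 0 according to the type definitions, and type 3 additionally requires type 1 as in the FIG. 39 example. The sketch also assumes that an AU only depends on AUs of the same reference segment.

```python
# Assumed dependencies between AU types (see the lead-in for their origin).
REQUIRES = {
    0: (),
    1: (0,),
    2: (0,),
    3: (0, 1),
    4: (0,),
    5: (0,),
    6: (0,),
}


def aus_to_decode(au_type, segment):
    """Return the (type, segment) pairs needed to decode one AU, itself included."""
    needed = {(dependency, segment) for dependency in REQUIRES[au_type]}
    needed.add((au_type, segment))
    return sorted(needed)


# Reads mapped with mismatches (type 3) on the second segment of the reference.
selected = aus_to_decode(au_type=3, segment=2)  # [(0, 2), (1, 2), (3, 2)]
```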
  • A further technical advantage is that the querying on the data is much more efficient in terms of data accessibility and execution speed because it can be based on accessing and decoding only selected “categories”, specific regions of longer genomic sequences and only specific layers for access units of type 1, 2, 3, 4 that match the criteria of the applied queries and any combination thereof.
  • The organization of access units of type 1, 2, 3, 4 into layers allows for efficient extraction of nucleotide sequences
      • with specific variations (e.g. mismatches, insertions, deletions) with respect to one or more reference genomes;
      • that do not map to any of the considered reference genomes;
      • that perfectly map on one or more reference genomes;
      • that map with one or more accuracy levels.
  • FIG. 52 shows how access can be restricted to the genomic information associated with specific genomic regions or sub-regions or aggregations of regions or sub-regions identified by user-defined “Labels”. The syntax of a label is based on a three-coordinate system where each region or sub-region associated with a label can be uniquely identified by:
      • 1. reference ID,
      • 2. data type (class)
      • 3. block number in the MIT (corresponding to a genomic region).
  • These three coordinates can be used to identify
      • the MIT location containing the genomic position of the region on the corresponding reference and
      • the LIT location containing the physical location of the data representing the respective genomic region or sub-region
  • As in the case of accessing data related to a specified genomic region, a further technical advantage is that querying the data is much more efficient in terms of data accessibility and execution speed, because it can be based on accessing and decoding only selected “categories” of the labelled regions and only the specific layers of access units of type 1, 2, 3, 4 that correspond to the “Labels” of the applied queries, and any combination thereof.
  • Another technical advantage of this labelling mechanism is the possibility of efficiently retrieving encoded genomic information that has been scattered among several Access Units due to its characteristics such as position on the reference genome, type of mismatches with respect to the reference (524).
  • Filtering genomic data according to the characteristics of the mapped reads (e.g. perfectly matching, substitutions only, etc.) can today take hours when using traditional formats such as BAM and CRAM. This is because the data are sparse within the compressed format and require decompression and filtering using pipelines of commands. The present invention describes a data structure that enables data filtering in a matter of seconds. Memory usage can also be reduced by a factor that is proportional to the file size (from 10× to 100×), since the present invention does not require the decoding (i.e. memory allocation) of the entire file.
  • Selective Access to Specific Genomic Regions Identified by User Specified “Labels” in “Storage” and “Streaming” Scenarios.
  • For example, suppose that sequencing data is compressed and selective access to “GeneXY” and “GeneWZ” is required. The two genomic regions corresponding to “GeneXY” and “GeneWZ” in the compressed file format or in the compressed stream must be labelled. One of two methods is used, depending on whether a compressed data file is generated for storage or a compressed data stream is generated for streaming.
  • In the case of a compressed data file, the multiplexer creates a “Label List” which includes two Labels with: “Label_ID”=GeneXY and “Label_ID”=GeneWZ. The Label parameter “Label_length_in_blocks” and, for each block, the parameters “ref_id”, “class_id” and “block_num” are determined by the multiplexer based on the position on the reference of the “GeneXY” and “GeneWZ” regions and the class of data for which the selective access is desired. The complete syntax is reported in Table 5.
  • In the case of a compressed stream, the multiplexer creates a “Label List” which includes two Labels with: “Label_ID”=GeneXY and “Label_ID”=GeneWZ. The Label parameters “ref_id”, “class_id”, “start_pos” and “end_pos” are determined by the multiplexer based on the position on the reference of the “GeneXY” and “GeneWZ” regions and the class of data for which the selective access is desired. The complete syntax is reported in Table 4.
  • The method used in the case of a compressed stream is generic and could also be used in the case of a compressed file for storage, but the corresponding implementation would be less efficient, because the use of block numbers, as described for the compressed file case, enables direct access to the “Master Index Table” (MIT).
  • In both cases mentioned above (streaming and storage), the mechanism for retrieving the genomic data identified by the labels is the same.
  • When parsing a label a decoding device will:
      • 1. Identify the reference sequence from the first element of the label
      • 2. Identify the class of data from the second element of the label
      • 3. Identify the block of the MIT (corresponding to a genomic region) from the third element of the label
      • 4. The two coordinates parsed in 1 and 2 enable the decoder to identify the required Genomic Streams (484),
      • 5. Each Genomic Stream starts with a header containing a LIT (525) containing pointers to the descriptors encoding data mapped to each genomic region. The third coordinate parsed in 3 is used to access the correct pointer in the LIT of each Genomic Stream.
      • 6. The decoder can efficiently retrieve all the descriptors to decode the genomic data identified by the decoded Genomic Label even if they are scattered among different Access Units (524).
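  • The parsing steps above can be sketched as follows. The sketch assumes that each Genomic Stream is available as a payload plus a LIT of byte offsets, keyed by (reference id, class id), and that block numbers are 0-based; these structures, the sample bytes and the function name are hypothetical.

```python
def fetch_label_blocks(streams, label_blocks):
    """Return the raw bytes of every block referenced by a label.

    streams maps (ref_id, class_id) to a dict with a "lit" list of byte
    offsets and a "payload" bytes object. Each block runs from its LIT
    pointer up to the next pointer, or to the end of the payload.
    """
    blocks = []
    for ref_id, class_id, block_num in label_blocks:
        stream = streams[(ref_id, class_id)]           # steps 1, 2 and 4
        lit = stream["lit"]                            # step 5
        start = lit[block_num]                         # step 3 selects the pointer
        if block_num + 1 < len(lit):
            end = lit[block_num + 1]
        else:
            end = len(stream["payload"])
        blocks.append(stream["payload"][start:end])    # step 6
    return blocks


# Hypothetical streams for reference 2, classes P and M.
streams = {
    (2, "P"): {"lit": [0, 4, 8], "payload": b"AAAABBBBCC"},
    (2, "M"): {"lit": [0, 3], "payload": b"XXXYY"},
}
retrieved = fetch_label_blocks(streams, [(2, "P", 1), (2, "M", 1), (2, "P", 2)])
```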
  • Incremental Update
  • The access units of type 7 and 8 allow for easy insertion of annotations without the need for depacketizing/decoding/decompressing the whole file, which makes file handling more efficient and overcomes a limitation of prior art approaches. Existing compression solutions may have to access and process a large amount of compressed data before the desired genomic data can be accessed. This causes inefficient RAM bandwidth utilization and higher power consumption, also in hardware implementations. Power consumption and memory access issues may be alleviated by using the approach based on Access Units described here.
  • The data indexing mechanism described in the Master Index Table (see FIG. 21), together with the use of Access Units and the possibility of identifying Access Units with user-defined “Labels” associated with specific genomic regions or sub-regions or aggregations of regions or sub-regions, enables incremental update of the encoded content as described below. This mechanism is shown with an example in FIG. 53.
  • Insertion of Additional Data
  • New genomic information can be periodically added to existing genomic data for several reasons. For example when:
      • An organism is sequenced at different moments in time;
      • Several different samples of the same individual are sequenced at the same time;
      • New data are generated by a sequencing process (streaming).
  • In the above mentioned situations, structuring data using the Access Units described here and the data structure described in the file format section enables the incremental integration of the newly generated data without the need to re-encode the existing data. The incremental update process can be implemented as follows:
      • 1. The newly generated AUs can simply be concatenated in the file with the pre-existing AUs and
      • 2. the indexing of the newly generated data or data sets is included in the Master Index Table described in the file format section of this document. One index shall position the newly generated AU on the existing reference sequence; other indexes consist of pointers to the newly generated AUs in the physical file to enable direct and selective access to them.
      • 3. The existing and/or newly generated AU can be identified with user defined “Labels” corresponding to specific genomic regions or sub-regions or aggregations of regions or sub-regions and a “Label List” can be included or updated.
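  • A minimal sketch of steps 1 and 2, assuming an in-memory container and two parallel lists that stand in for the positional index and for the pointers into the physical file. The names and sample bytes are hypothetical; the point is that existing AUs are never rewritten.

```python
import bisect


def append_au(container, positions, pointers, au_bytes, first_read_pos):
    """Append one encoded AU and index it; return its rank in the index.

    container: bytearray holding the concatenated AUs (the physical file).
    positions: sorted mapping positions of the first read of each AU.
    pointers:  byte offset of each AU in the container, parallel to positions.
    """
    offset = len(container)
    container.extend(au_bytes)                            # step 1: plain concatenation
    rank = bisect.bisect_right(positions, first_read_pos)
    positions.insert(rank, first_read_pos)                # step 2: keep the index sorted
    pointers.insert(rank, offset)
    return rank


container = bytearray(b"AU1AU2")
positions = [100, 500]
pointers = [0, 3]
rank = append_au(container, positions, pointers, b"AU3", first_read_pos=300)
```

  • The new AU lands at the end of the file but in the middle of the index, so a later region query still finds it without any re-encoding of the pre-existing AUs.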
  • This mechanism is illustrated in FIG. 40, where pre-existing data encoded in 3 AUs of type 1 and 4 AUs per each type from 2 to 4 are updated with 3 AUs per type encoding data coming, for example, from a new sequencing run for the same individual.
  • The mechanism of creating or updating “Labels” and the “Label List” is illustrated in FIG. 52 and FIG. 53.
  • In the specific use case of streaming genomic data and data sets in compressed form, the incremental update of a pre-existing data set may be useful when analyzing data as soon as they are generated by a sequencing machine and before the actual sequencing is completed. An encoding engine (compressor) can assemble several AUs in parallel by “clustering” sequence reads that map on the same region of the selected reference sequence. Once the first AU contains a number of reads above a pre-configured threshold/parameter, the AU is ready to be sent to the analysis application. Together with the newly encoded Access Unit, the encoding engine (the compressor) shall make sure that all Access Units the new AU depends on have already been sent to the receiving end or are sent together with it. For example, an AU of type 3 will require the appropriate AUs of type 0 and type 1 to be present at the receiving end in order to be properly decoded.
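  • The clustering step can be sketched as a generator that emits an AU as soon as one region has accumulated enough reads. Fixed-size regions, the threshold semantics (emit when the count reaches the threshold) and the final flush of incomplete AUs are assumptions of this sketch; dependency handling between AU types is left out.

```python
from collections import defaultdict


def stream_aus(mapped_reads, region_size, threshold):
    """Yield (region, reads) as soon as a region holds `threshold` reads.

    mapped_reads is an iterable of (mapping_position, read) pairs in the
    order produced by the sequencer. Regions still incomplete when the
    input ends are flushed in ascending order.
    """
    pending = defaultdict(list)
    for position, read in mapped_reads:
        region = position // region_size
        pending[region].append(read)
        if len(pending[region]) >= threshold:
            yield region, pending.pop(region)   # AU ready for the analysis application
    for region in sorted(pending):
        yield region, pending[region]


reads = [(5, "a"), (120, "b"), (7, "c"), (130, "d"), (250, "e")]
emitted = list(stream_aus(reads, region_size=100, threshold=2))
```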
  • By means of the described mechanism, a receiving variant calling application would be able to start calling variants on the AU received before the sequencing process has been completed at the transmitting side. A schematic of this process is depicted in FIG. 41.
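  • The streaming behaviour described above (clustering reads per region, releasing an AU once a threshold is reached, and sending its dependencies first) might look like the sketch below. The `StreamingEncoder` class, the fixed-width regions and the `DEPENDS_ON` table are assumptions for illustration; only the dependency of type 3 on types 0 and 1 is taken from the text.

```python
from collections import defaultdict

# Dependencies between AU types. Only the type 3 entry comes from the text
# (type 3 needs type 0 and type 1); the table layout is an assumption.
DEPENDS_ON = {3: (0, 1)}


class StreamingEncoder:
    def __init__(self, threshold: int, region_size: int):
        self.threshold = threshold          # reads per AU before it is sent
        self.region_size = region_size      # width of a clustering region (assumed fixed)
        self.pending = defaultdict(list)    # (au_type, region) -> reads collected so far
        self.sent = set()                   # (au_type, region) keys already transmitted

    def add_read(self, au_type: int, position: int, read: str) -> list:
        """Cluster one read; return the AUs to transmit, dependencies first."""
        region = position // self.region_size
        key = (au_type, region)
        self.pending[key].append(read)
        if len(self.pending[key]) < self.threshold:
            return []                       # AU not full yet, nothing to send
        out = []
        for dep_type in DEPENDS_ON.get(au_type, ()):
            dep_key = (dep_type, region)
            if dep_key not in self.sent:
                # Send the dependency together with the new AU
                # (possibly empty in this simplified sketch).
                out.append((dep_key, self.pending.pop(dep_key, [])))
                self.sent.add(dep_key)
        out.append((key, self.pending.pop(key)))
        self.sent.add(key)
        return out
```

  • A receiving application can start working on each returned batch immediately, since every AU in the batch arrives after the AUs it depends on.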
  • New Analysis of Results.
  • During the genome processing life cycle several iterations of genome analysis can be applied on the same data (e.g. different variant calling using different processing algorithm). The use of AUs as defined in this document and the data structure described in the file format section of this document enable incremental update of existing compressed data with the results of new analysis. For example, new analysis performed on existing compressed data can produce new data in these cases:
      • 1. A new analysis can modify existing results already associated with the encoded data. This use case is depicted in FIG. 42 and it is implemented by moving entirely or partially the content of one Access Unit from one type to another. In case new AUs need to be created (due to a pre-defined maximum size per AU), the related indexes in the Master Index Table must be created and the related vector is sorted when needed.
      • 2. New data are produced from new analysis and have to be associated with existing encoded data. In this case new AUs of type 7 can be produced and concatenated with the existing vector of AUs of the same type. This and the related update of the Master Index Table are depicted in FIG. 43.
  • The use cases described above and depicted in FIG. 42 and FIG. 43 are enabled by:
      • 1. The possibility to have direct access only to data with poor mapping quality (e.g. AUs of type 4);
      • 2. The possibility to remap reads to a new genomic region by simply creating a new Access Unit possibly belonging to a new type (e.g. reads included in a Type 4 AU can be remapped to a new region with fewer (type 2-3) mismatches and included in a newly created AU);
      • 3. The possibility to create AUs of type 8 (433) containing only the newly created analysis results and/or related annotations. In this case the newly created AUs only need to contain “pointers” to the existing AUs to which they refer.
      • 4. The possibility of performing in a single run new analysis on several genomic regions and sub-regions identified by the same Label without the need to repeat the analysis on each single genomic region or sub-region. Labels as described in this document would enable users to manipulate non-contiguous genomic segments as if they were a single genomic sequence.
      • 5. The possibility of updating with new analysis results several genomic regions or sub regions identified by a single Label. The new results (usually expressed in the form of metadata) would be linked to the label identifying the aggregation of potentially several genomic regions and sub regions without the need of creating several links from the results to each genomic region or sub region.
  • Transcoding
  • Compressed genomic data can require transcoding, for example, in the following situations:
      • Publication of new reference sequences;
      • Use of a different mapping algorithm (re-mapping).
  • When genomic data is mapped on an existing public reference genome, the publication of a new version of said reference sequence, or the desire to map the data using a different processing algorithm, today requires a process of re-mapping. When remapping compressed data using prior art file formats such as SAM or CRAM, the entire compressed data has to be decompressed into its “raw” form to be mapped again with reference to the newly available reference sequence or using a different mapping algorithm. This is true even if the newly published reference is only slightly different from the previous one or the different mapping algorithm produces a mapping that is very close (or identical) to the previous mapping.
  • The advantage of transcoding genomic data structured using Access Units described here is that:
    • 1. Mapping versus a new reference genome only requires re-encoding (decompressing and compressing) the data of AUs that map on the genome regions that have changes. Additionally the user may select those compressed reads that for any reason might need to be re-mapped even if they originally do not map on the changed region (this may happen if the user believes that the previous mapping is of poor quality). This use case is depicted in FIG. 44.
    • 2. In case the newly published reference genome differs from the previous one only in terms of entire regions shifted to different genomic locations (“loci”), the transcoding operation is particularly simple and efficient. In fact, in order to move all the reads mapped to the “shifted” region it is sufficient to change only the value of the absolute position contained in the header of the related AU (or set of AUs). Each AU header contains the absolute position on the reference sequence to which the first read contained in the AU is mapped, while the positions of all other reads are encoded differentially with respect to the first. Therefore, by simply updating the value of the absolute position of the first read, all the reads in the AU are moved accordingly. This mechanism cannot be implemented by state of the art approaches such as CRAM and BAM because genome data positions are encoded in the compressed payload, thus requiring complete decompression and re-compression of all genome data sets.
    • 3. When a different mapping algorithm is used, it is possible to apply it only to the portion of compressed reads that was deemed mapped with poor quality. For example, it can be appropriate to apply the new mapping algorithm only to reads which did not perfectly match the reference genome. With existing formats today it is not possible (or it is only partially possible, with some limitations) to extract reads according to their mapping quality (i.e. presence and number of mismatches). If new mapping results are returned by the new mapping tools, the related reads can be transcoded from one AU to another of the same type (FIG. 46) or from one AU of one type to an AU of another type (FIG. 45).
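  • The header-only transcoding of item 2 can be illustrated with a small sketch. The `AccessUnit` class below is a hypothetical stand-in in which the differentially encoded positions are kept as a plain list of offsets from the first read; in the actual format they would sit inside the compressed payload, which is not modelled here.

```python
from dataclasses import dataclass


@dataclass
class AccessUnit:
    first_read_pos: int      # absolute position stored in the AU header
    deltas: list             # positions of later reads, relative to the first read


def read_positions(au: AccessUnit) -> list:
    """Decode the absolute position of every read in the AU."""
    return [au.first_read_pos] + [au.first_read_pos + d for d in au.deltas]


def shift_region(aus: list, region_start: int, region_end: int, offset: int) -> int:
    """Move every AU whose first read falls in [region_start, region_end)."""
    moved = 0
    for au in aus:
        if region_start <= au.first_read_pos < region_end:
            au.first_read_pos += offset   # header-only update, payload untouched
            moved += 1
    return moved
```

  • Because the other positions are relative to the first read, a single integer update per AU relocates all of its reads, and the differential data is never decoded.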
  • Moreover, prior art compression solutions may have to access and process a large amount of compressed data before the desired genomic data can be accessed. This causes inefficient RAM bandwidth utilization and higher power consumption, also in hardware implementations. Power consumption and memory access issues may be alleviated by using the approach based on Access Units described here.
  • A further advantage of the adoption of the genomic access units described here is the facilitation of parallel processing and suitability for hardware implementations. Current solutions such as SAM/BAM and CRAM are conceived for single-threaded software implementation.
  • Selective Protection
  • The approach based on Access Units organized in several types and layers as described in this document enables the implementation of content protection mechanisms otherwise not possible with state of the art monolithic solutions.
  • A person skilled in the art knows that the majority of genomic information related to an organism's genetic profile lies in the differences (variants) with respect to a known sequence (e.g. a reference genome or a population of genomes). An individual genetic profile to be protected from unauthorized access will therefore be encoded in Access Units of type 3 and 4 as described in this document. The implementation of controlled access to the most sensitive genomic information produced by a sequencing and analysis process can therefore be realized by encrypting only the payload of AUs of type 3 and 4 (see FIG. 47 for an example). This will generate significant savings in terms of both processing power and bandwidth, since the resource-consuming encryption process is applied to a subset of the data only.
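  • A minimal sketch of this selective protection is given below. The document does not prescribe a cipher, so `toy_keystream_xor` is a deliberately simple stand-in that is not secure and is used only to keep the example self-contained; a real implementation would substitute an authenticated cipher. The tuple layout `(au_id, au_type, payload)` is also an assumption.

```python
import hashlib

PROTECTED_TYPES = {3, 4}   # AU types carrying the variants, per the text


def toy_keystream_xor(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """Stand-in cipher for illustration only; NOT secure, not from the source."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(b ^ s for b, s in zip(data, stream))


def protect(aus: list, key: bytes) -> list:
    """aus is a list of (au_id, au_type, payload); returns the same shape."""
    out = []
    for au_id, au_type, payload in aus:
        if au_type in PROTECTED_TYPES:
            nonce = au_id.to_bytes(8, "big")      # one keystream per AU
            payload = toy_keystream_xor(key, nonce, payload)
        out.append((au_id, au_type, payload))     # other types stay in clear
    return out
```

  • Since the stand-in is an XOR keystream, calling `protect` a second time with the same key restores the original payloads; AUs of other types pass through unchanged in both directions.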
  • Selective Protection of Specific Genomic Regions Identified by “Labels”
  • The labelling mechanism enables different mechanisms of data protection and access control. For example FIG. 54 shows how one protection mechanism (e.g. encryption) and one access control rule (AC) can be applied to several genomic regions identified by the same label. In a more sophisticated scenario, data protection can be implemented by applying a different access control rule and a different protection mechanism (encryption) to each region identified by a label. This is shown in FIG. 55.
  • Additionally, selective encryption of genomic regions or sub-regions or aggregations of regions or sub-regions identified by different “Labels” can be easily implemented by applying encryption only to the compressed data corresponding to a “Label”, in both file and streamed scenarios. For instance, two genomic regions labelled as “GeneXY” and “GeneWZ”, as in the example of section “Selective Access to Specific Genomic Regions identified by User Specified “Labels” in “storage” and “streaming” scenarios”, can be differentiated by encrypting only the data labelled as “GeneXY” and leaving in clear the compressed data labelled as “GeneWZ”. Encryption rules can be carried by the metadata fields (in both storage and streaming scenarios) and associated with each element of the “Label List”.
  • Transport of Genomic Access Units
  • Genomic Data Multiplex
  • Genomic Access Units can be transported over a communication network within a Genomic Data Multiplex. A Genomic Data Multiplex is defined as a sequence of packetized genomic data and metadata represented according to the data classification disclosed as part of this invention, transmitted in network environments where errors, such as packet losses, may occur.
  • The Genomic Data Multiplex is conceived to ease and render more efficient the transport of genomic coded data over different environments (typically network environments) and has the following advantages not present in state of the art solutions:
    • 1. It enables encapsulation of either a stream or a sequence of genomic data (described below) or a Genomic File Format generated by an encoding tool into one or more Genomic Data Multiplexes, in order to carry it over a network environment and then recover a valid and identical stream or file format, rendering the transmission of and access to the information more efficient.
    • 2. It enables selective retrieval of encoded genomic data from the encapsulated Genomic Data Streams, for decoding and presentation.
    • 3. It enables multiplexing several Genomic Datasets into a single container of information for transport and it enables de-multiplexing a subset of the carried information into a new Genomic Data Multiplex.
    • 4. It enables the multiplexing of data and metadata produced by different sources (with the consequent separate access) and/or sequencing/analysis processes, and the transmission of the resulting Genomic Data Multiplex over a network environment.
    • 5. It supports identification of errors such as packet losses.
    • 6. It supports proper reordering of data which may arrive out of order due to network delays, therefore rendering the transmission of genomic data more efficient when compared with state of the art solutions.
  • An example of Genomic Data Multiplexing is shown in FIG. 49.
  • Genomic Dataset
  • In the context of the present invention a Genomic Dataset is defined as a structured set of Genomic Data including, for example, genomic data of a living organism, one or more sequences and metadata generated by several steps of genomic data processing, or the result of the genomic sequencing of a living organism. One Genomic Data Multiplex may include multiple Genomic Datasets (as in a multi-channel scenario) where each dataset refers to a different organism. The multiplexing of the several datasets into a single Genomic Data Multiplex is governed by information contained in data structures called Genomic Dataset List (GDL), Genomic Dataset Mapping Tables List (GDMTL) and Genomic Dataset Mapping Table (GDMT).
  • Genomic Dataset List
  • A Genomic Dataset List (GDL) is defined as a data structure listing all Genomic Datasets available in a Genomic Data Multiplex. Each of the listed Genomic Datasets is identified by a unique value called Genomic Dataset ID (GID).
  • Each Genomic Dataset listed in the GDL is associated to:
      • one Genomic Data Stream carrying one Genomic Dataset Mapping Table (GDMT) and identified by a specific value of Stream ID (genomic_dataset_map_SID);
      • one Genomic Data Stream carrying one Reference ID Mapping Table (RIDMT) and identified by a specific value of Stream ID (reference_id_map_SID).
  • The GDL is sent as payload of a single Transport Packet at the beginning of a Genomic Data Stream transmission; it can then be periodically re-transmitted in order to enable random access to the Stream.
  • The syntax of the GDL data structure is provided in the table below with an indication of the data type associated to each syntax element.
  • Syntax Data type
    genomic_dataset_list( ) {
      list_length bitstring
      multiplex_id bitstring
      version_number bitstring
      applicable_section_flag bit
      list_ID bitstring
      for (i = 0; i < N; i++) { N = number of Genomic Datasets in this Genomic Multiplex
        genomic_dataset_ID bitstring
        genomic_dataset_map_SID bitstring
        reference_id_map_SID bitstring
        genomic_Label_list_SID bitstring
      }
      Checksum bitstring
    }
  • The syntax elements composing the GDL described above have the following meaning and function.
  • list_length bitstring field, specifying the number of bytes composing the
    list, starting immediately following the list_length field, and
    including the Checksum.
    multiplex_id bitstring field which serves as a label to identify this multiplexed
    stream from any other multiplex within a network.
    version_number bitstring field indicating the version number of the whole Genomic
    Dataset List Section. The version number shall be incremented by 1
    whenever the definition of the Genomic Dataset Mapping Table
    changes. When the applicable_section_flag is set to ‘1’, then the
    version_number shall be that of the currently applicable Genomic
    Dataset List. When the applicable_section_flag is set to ‘0’, then the
    version_number shall be that of the next applicable Genomic
    Dataset List.
    applicable_section_flag A 1 bit indicator, which when set to ‘1’ indicates that the Genomic
    Dataset Mapping Table sent is currently applicable. When the bit is
    set to ‘0’, it indicates that the table sent is not yet applicable and
    shall be the next table to become valid.
    list_ID This is a bitstring field identifying the current genomic dataset list.
    genomic_dataset_ID genomic_dataset_ID is a bitstring field which specifies the genomic
    dataset to which the genomic_dataset_map_SID is applicable. This
    field shall not take any single value more than once within one
    version of the Genomic Dataset Mapping Table.
    genomic_dataset_map_SID genomic_dataset_map_SID is a bitstring field identifying the
    Genomic Data Stream carrying the Genomic Dataset Mapping Table
    (GDMT) associated to this Genomic Dataset. No
    genomic_dataset_ID shall have more than one
    genomic_dataset_map_SID associated. The value of the
    genomic_dataset_map_SID is defined by the user.
    reference_id_map_SID reference_id_map_SID is a bitstring field identifying the Genomic
    Data Stream carrying the Reference ID Mapping Table (RIDMT)
    associated to this Genomic Dataset. No genomic_dataset_ID shall
    have more than one reference_id_map_SID associated. The value of
    the reference_id_map_SID is defined by the user.
    genomic_Label_list_SID genomic_Label_list_SID is a bitstring field identifying the Genomic
    Data Stream carrying the Genomic Label List (GLL) associated to this
    Genomic Dataset. No genomic_dataset_ID shall have more than
    one genomic_Label_list_SID associated. The value of the
    genomic_Label_list_SID is defined by the user.
    Checksum This is a bitstring field that contains an integrity check value for the
    entire GDL. One typical algorithm used for this purpose is
    the CRC32 algorithm producing a 32 bit value; other algorithms
    include the hashing functions MD5 and SHA-256.
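  • As an illustration of how the GDL syntax and its checksum fit together, the sketch below serializes and parses a GDL. The source only specifies “bitstring” fields, so every field width here (16-bit identifiers and length, 8-bit version, flag and list_ID) is an assumption, as are the function names `pack_gdl` and `unpack_gdl`. CRC32 is used because the text names it as a typical choice.

```python
import struct
import zlib


def pack_gdl(multiplex_id, version, applicable, list_id, datasets):
    """datasets: list of (genomic_dataset_ID, map_SID, ref_map_SID, label_SID)."""
    body = struct.pack(">HBBB", multiplex_id, version, 1 if applicable else 0, list_id)
    for entry in datasets:
        body += struct.pack(">HHHH", *entry)
    # list_length counts the bytes after the length field, checksum included.
    head = struct.pack(">H", len(body) + 4)
    crc = zlib.crc32(head + body)           # integrity value over the whole list
    return head + body + struct.pack(">I", crc)


def unpack_gdl(buf):
    (length,) = struct.unpack_from(">H", buf, 0)
    if len(buf) != 2 + length:
        raise ValueError("truncated GDL")
    (crc,) = struct.unpack_from(">I", buf, len(buf) - 4)
    if zlib.crc32(buf[:-4]) != crc:
        raise ValueError("GDL checksum mismatch")
    multiplex_id, version, applicable, list_id = struct.unpack_from(">HBBB", buf, 2)
    # N is not transmitted explicitly; derive it from the length
    # (5 fixed bytes after the length field, 4 checksum bytes, 8 bytes per dataset).
    count = (length - 4 - 5) // 8
    datasets = [struct.unpack_from(">HHHH", buf, 7 + 8 * i) for i in range(count)]
    return multiplex_id, version, bool(applicable), list_id, datasets
```

  • A receiver that joins mid-stream can wait for the next periodic retransmission of the GDL, verify the checksum, and only then start resolving Stream IDs.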
  • Genomic Dataset Mapping Table
  • The Genomic Dataset Mapping Table (GDMT) is produced and transmitted at the beginning of a streaming process (and possibly periodically re-transmitted, updated or identical in order to enable the update of correspondence points and the relevant dependencies in the streamed data). The GDMT is carried by a single Packet following the Genomic Dataset List and lists the SIDs identifying the Genomic Data Streams composing one Genomic Dataset. The GDMT is the complete collection of all identifiers of Genomic Data Streams (e.g., the genomic sequence, reference genome, metadata, etc) composing one Genomic Dataset carried by a Genomic Multiplex. A genomic dataset mapping table is instrumental in enabling random access to genomic sequences by providing the identifier of the stream of genomic data associated to each genomic dataset.
  • The syntax of the GDMT data structure is provided in the table below with an indication of the data type associated to each syntax element.
  • genomic_dataset_mapping_table( ) {
      table_length bitstring
      genomic_dataset_ID bitstring
      version_number bitstring
      applicable_section_flag bit
      mapping_table_ID bitstring
      genomic_dataset_ef_length bitstring
      for (i = 0; i < N; i++) { N = number of extension fields associated to this Genomic Dataset
        extension_field( ) data structure
      }
      for (i = 0; i < M; i++) { M = number of Genomic Data Streams associated to this specific Dataset
        data_type bitstring
        genomic_data_SID bitstring
        gd_component_ef_length bitstring
        for (k = 0; k < K; k++) { K = number of extension fields associated to each Genomic Data Stream
          extension_field( ) data structure
        }
      }
      Checksum bitstring
    }
  • The syntax elements composing the GDMT described above have the following meaning and function.
  • version_number, These elements have the same meaning as for the GDL
    applicable_section_flag
    table_length bitstring field specifying the number of bytes composing the table,
    starting after the table_length field, and including the Checksum field.
    genomic_dataset_ID bitstring field identifying a Genomic Dataset
    mapping_table_ID bitstring field identifying the current Genomic Dataset Mapping
    Table
    genomic_dataset_ef_length bitstring field specifying the number of bytes of the optional
    extension_field associated with this Genomic Dataset
    data_type bitstring field specifying the type of genomic data carried by the
    packets identified by the genomic_data_SID.
    genomic_data_SID bitstring field specifying the Stream ID of the packets carrying the
    encoded genomic data associated with one component of this
    Genomic Dataset (e.g. read positions, read pairing information
    etc. as defined in this invention)
    gd_component_ef_length bitstring field specifying the number of bytes of the optional
    extension_field associated with the genomic Stream identified by
    genomic_data_SID.
    Checksum This is a bitstring field that contains an integrity check value for the
    entire GDMT. One typical algorithm used for this purpose is
    the CRC32 algorithm producing a 32 bit value, or hashing functions
    such as MD5 and SHA-256.
  • extension_fields are optional descriptors that might be used to further describe either a Genomic Dataset or one Genomic Dataset component.
  • The data_type field can have the following values
  • data_type Description
    0 Dataset Header
    1 Layer Header
    2 to 15 User-defined extensions
    16 to N 16 + Descriptors_Layer_ID
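  • The data_type ranges in the table above can be turned into a small lookup, sketched here. The function names and the returned strings are illustrative assumptions; the value ranges (0, 1, 2 to 15, and 16 + Descriptors_Layer_ID) are those given in the table.

```python
def describe_data_type(data_type: int) -> str:
    """Interpret a GDMT data_type value according to the table above."""
    if data_type < 0:
        raise ValueError("data_type must be non-negative")
    if data_type == 0:
        return "Dataset Header"
    if data_type == 1:
        return "Layer Header"
    if data_type <= 15:
        return "User-defined extension"
    # Values from 16 upward encode 16 + Descriptors_Layer_ID.
    return f"Descriptors layer {data_type - 16}"


def build_stream_map(gdmt_entries):
    """gdmt_entries: iterable of (data_type, genomic_data_SID) pairs from one GDMT."""
    return {sid: describe_data_type(dt) for dt, sid in gdmt_entries}
```

  • A demultiplexer could build such a map once per GDMT version and then route each incoming packet by its Stream ID.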
  • Genomic Datasets Mapping Tables List
  • This structure carries information about all the datasets mapping tables related to a Genomic Datasets Multiplex.
  • Syntax Description
    Datasets_mapping_tables_list{
      Multiplex_id Datasets Multiplex ID, as in Datasets Multiplex Header.
      for (i=0; i<gd_number;i++) { Note: gd_number as in Datasets Multiplex Header.
        dataset_mapping_table_SID Stream ID of Dataset Mapping Table of i-th Dataset.
      }
    }
  • Reference ID Mapping Table
  • The Reference ID Mapping Table (RIDMT) is produced and transmitted at the beginning of a streaming process. The RIDMT is carried by a single Packet following the Genomic Dataset List. The RIDMT specifies a mapping between the numeric identifiers of reference sequences (REFID) contained in the Block header of an access unit and the (typically literal) reference identifiers contained in the Genomic Dataset Header specified in Table 2.
  • The RIDMT can be periodically re-transmitted in order to:
      • enable the update of correspondence points and the relevant dependencies in the streamed data,
      • support the integration of new reference sequences added to the pre-existing ones (e.g. synthetic references created by de-novo assembly processes)
  • The syntax of the RIDMT data structure is provided in the table below with an indication of the data type associated to each syntax element.
  • Syntax Data type
    reference_id_mapping_table( ) {
      table_length bitstring
      genomic_dataset_ID bitstring
      version_number bitstring
      applicable_section_flag bit
      reference_id_mapping_table_ID bitstring
      for (i = 0; i < N; i++) { N = number of reference sequences associated with the Genomic Dataset identified by genomic_dataset_ID
        ref_string_length bitstring
        for (i=0;i<ref_string_length;i++){
          ref_string[i] byte
        }
        REFID bitstring
      }
      Checksum bitstring (e.g. CRC-32 or MD5 hash)
    }
  • The syntax elements composing the RIDMT described above have the following meaning and function.
  • table_length, genomic_dataset_ID, These elements have the same meaning as for the
    version_number, applicable_section_flag GDMT
    reference_id_mapping_table_ID bitstring field identifying the current Reference ID
    Mapping Table
    ref_string_length bitstring field specifying the number of characters
    (bytes) composing ref_string, excluding the end of
    string (‘\0’) character.
    ref_string[i] byte field encoding each character of the string
    representation of a reference sequence (e.g. “chr1”
    for chromosome 1). The end of string (‘\0’) character
    is not necessary, as it is implicitly inferred from the
    ref_string_length field
    REFID This is a bitstring field uniquely identifying a reference
    sequence. This is encoded in the data Block header as
    REFID field.
    Checksum This is a bitstring field that contains an integrity check
    value for the entire RIDMT. One typical algorithm
    used for this purpose is the CRC32 algorithm
    producing a 32 bit value, or any hash function
    producing longer strings of bits.
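  • The core of the RIDMT, a list of length-prefixed reference names each paired with a numeric REFID, can be sketched as below. The one-byte length, the two-byte REFID and the function names are assumptions, since the source gives only “bitstring” for these fields; the table header and checksum are omitted to keep the example short.

```python
import struct


def pack_ref_entries(ref_to_id: dict) -> bytes:
    """Serialize {reference name: REFID} as length-prefixed strings plus REFID."""
    out = b""
    for name, refid in ref_to_id.items():
        raw = name.encode("ascii")
        # The length prefix replaces the '\0' terminator, as the table describes.
        out += struct.pack(">B", len(raw)) + raw + struct.pack(">H", refid)
    return out


def unpack_ref_entries(buf: bytes) -> dict:
    """Parse the entries back into {REFID: reference name} for block decoding."""
    id_to_ref = {}
    offset = 0
    while offset < len(buf):
        length = buf[offset]
        name = buf[offset + 1: offset + 1 + length].decode("ascii")
        (refid,) = struct.unpack_from(">H", buf, offset + 1 + length)
        id_to_ref[refid] = name
        offset += 1 + length + 2
    return id_to_ref
```

  • The decoder-side dictionary is keyed by REFID because that is the value found in the Block header; the literal name is only needed when presenting results.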
  • Genomic Label List
  • As described above, a label is an identifier which is assigned to specific genomic regions or sub-regions, or to aggregations of regions or sub-regions.
  • Labels identify genomic regions by specifying the reference sequence id, the position range with respect to the reference sequence and the data classes that they identify.
  • For such purpose, the Genomic Label List (GLL) is created during the packetization process by the multiplexer and transmitted.
  • The depacketizer of the demultiplexer parses the GLL syntax and exposes the available “Labels” to the data access application, which has the possibility to select and access the desired sub-set of data.
  • The GLL is (optionally) produced and transmitted at the beginning of a stream and typically re-transmitted periodically in order to enable multiple synchronization points (4811), and provides the list of “Labels” associated to the Multiplex and Dataset identified by the multiplex_id and dataset_id fields.
  • The syntax of the GLL data structure is provided in the table below with an indication of the data type associated to each syntax element.
  • TABLE 6
    Complete syntax of “Label List” data format for the streamed compressed data scenario.
    Syntax Description
    genomic_label_list( ) {
      table_length
      multiplex_id
      dataset_id
      num_labels total number of labels in the list
      for (i = 0; i < num_labels; i++) {
        Label_id label identifier
        num_ref number of references concerned by the current label
        for (j = 0; j < num_ref; j++) {
          ref_id current reference
          num_regions number of different regions of this reference identified by the label
          for (k = 0; k < num_regions; k++) {
            class_id type of class of this region
            start_pos start position of this region
            end_pos end position of this region
          }
        }
      }
      Checksum e.g. CRC-32 or MD5 hash
    }
  • The syntax elements composing the GLL described above have the following meaning and function.
  • TABLE 7
    Description of syntax elements of Table 6.
    table_length Bitstring field specifying the number of bytes composing the list, starting after the table_length field, and including the Checksum field
    multiplex_id Byte which serves as a label to identify the Genomic Multiplex from any other multiplex within a network
    dataset_id Byte which serves as a label to identify the Genomic Dataset from any other dataset within the multiplex identified by multiplex_id
    num_labels Bitstring representing the total number of Labels in this GLL
    Label_id Bitstring identifying the i-th Label
    num_ref Bitstring identifying the number of references concerned by the current label
    ref_id Bitstring identifying the j-th reference sequence the i-th Label refers to
    num_regions Bitstring identifying the number of regions conveyed by the i-th Label
    class_id Bitstring identifying the class of the k-th region in the j-th reference in the i-th Label
    start_pos Bitstring indicating the position in the j-th reference sequence of the first read of the k-th region in the i-th Label
    end_pos Bitstring indicating the position in the j-th reference sequence of the last read of the k-th region in the i-th Label
    Checksum Bitstring field that contains an integrity check value for the entire GLL. One typical algorithm used for this purpose is the CRC32 algorithm producing a 32 bit value, or hashing functions producing longer strings of bits (e.g. MD5, SHA-256).
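  • Once parsed, the nested label, reference and region loops of the GLL map naturally onto a nested dictionary. The sketch below shows how a data access application might resolve a Label to its regions and find the Labels covering a position. The dictionary layout, the use of text names such as “GeneXY” in place of bitstring Label_id values, and the inclusive treatment of end_pos are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    ref_id: int
    class_id: int
    start_pos: int
    end_pos: int


def regions_for_label(gll: dict, label_id: str) -> list:
    """gll maps Label_id -> {ref_id: [(class_id, start_pos, end_pos), ...]}."""
    out = []
    for ref_id, regions in gll[label_id].items():
        for class_id, start, end in regions:
            out.append(Region(ref_id, class_id, start, end))
    return out


def labels_at(gll: dict, ref_id: int, pos: int) -> list:
    """Labels with at least one region covering pos (end_pos taken as inclusive)."""
    hits = []
    for label_id in gll:
        if any(r.ref_id == ref_id and r.start_pos <= pos <= r.end_pos
               for r in regions_for_label(gll, label_id)):
            hits.append(label_id)
    return hits
```

  • A single Label can therefore stand for several non-contiguous regions on several references, which is what allows one access rule or one analysis run to cover all of them.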
  • Genomic Data Stream
  • A Genomic Data Multiplex contains one or several Genomic Data Streams where each stream can transport
      • data structures containing transport information (e.g. Genomic Dataset List, Genomic Dataset Mapping Table etc.)
      • data belonging to one of the Genomic Data Layers described in this invention.
      • Metadata related to the genomic data
      • Any other data
  • A Genomic Data Stream containing genomic data is essentially a packetized version of a Genomic Data Layer where each packet is prepended with a header describing the packet content and how it is related to other elements of the Multiplex.
  • The Genomic Data Stream format and the File Format described in this document are mutually convertible. Whereas a complete file can be reconstructed only after all data have been received, in a streaming scenario a decoding tool can reconstruct, access and start processing the partial data at any time.
  • A Genomic Data Stream is composed of several Genomic Data Blocks, each containing one or more Genomic Data Packets. Genomic Data Blocks (GDBs) are containers of genomic information composing one genomic AU. A GDB can be split into several Genomic Data Packets, according to the communication channel requirements.
  • Genomic access units are composed of one or more Genomic Data Blocks belonging to different Genomic Data Streams.
  • Genomic Data Packets (GDPs) are transmission units composing one GDB. Packet size is typically set according to the communication channel requirements.
  • FIG. 27 shows the relationship among Genomic Multiplex, Streams, Access Units, Blocks and Packets when encoding data belonging to the P class as defined in this invention. In this example three Genomic Streams encapsulate information on position, pairing and reverse complement of sequence reads.
  • Genomic Data Blocks are composed of a header, a payload of compressed data and padding information.
  • The table below provides an example of implementation of a GDB header with a description of each field and a typical data type.
  • TABLE 8
    Description of Genomic Data Block syntax elements.
    Field | Description | Data type
    Block Start Code Prefix (BSCP) | Reserved value used to unambiguously identify the beginning of a Genomic Data Block. | bitstring
    Block Header | Block Header as defined in this document | bitstring
    POS Flag (PSF) | If the POS Flag is set, the block contains the 40 bit POS field at the end of the block header and before the optional fields. | bit
    Padding Flag (PDF) | If the Padding Flag is set, the block contains additional padding bytes after the payload which are not part of the payload. | bit
    Block size (BS) | Number of bytes composing the block, including this header and payload, and excluding padding (total block size will be BS + padding size). | bitstring
    Access Unit ID (AUID) | Unambiguous ID, linearly increasing (not necessarily by 1, even though recommended). Needed to implement proper random access, as described in the Master Index Table defined in this invention. | bitstring
    Label ID (LID) | Unambiguous ID, linearly increasing by 1, identifying the genomic region/classes (Label) this block belongs to. It corresponds to the i-th index in the main for loop in the Genomic Label List described above. | bitstring
    (Optional) Reference ID (REFID) | Unambiguous ID, identifying the reference sequence the AU containing this block refers to. This is needed, along with the POS field, to have proper random access, as described in the Master Index Table. | bitstring
    (Optional) POS (POS) | Present if PSF is 1. Position on the reference sequence of the first read in the block. | bitstring
    (Extra optional fields) | Additional optional fields, presence signaled by BS. | bytestring
    (Optional) Padding | (Optional, presence signaled by PDF) Fixed bitstring value that can be inserted in order to meet the channel requirements. If present, the first byte indicates how many bytes compose the padding. It is discarded by the decoder. | bitstring
  • The use of AUID, POS and BS enables the decoder to reconstruct the data indexing mechanisms referred to as Master Index Table (MIT) and Local Index Table (LIT) in this invention. In a data streaming scenario the use of AUID and BS enables the receiving end to dynamically re-create a LIT locally, without the need to send extra data. The use of AUID, BS and POS makes it possible to recreate a MIT locally without the need to send additional data.
  • This has the technical advantages of
      • reducing the encoding overhead, which might be large if the entire LIT is transmitted;
      • avoiding the need for a complete mapping between genomic positions and Access Units, which is not normally available in a streaming scenario.
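  • How a receiver might rebuild both indexes from block headers alone is sketched below. The actual MIT and LIT layouts are defined in the file format section and are not reproduced here; the dictionary and sorted-list structures, the function names, and the choice to ignore padding are simplifying assumptions.

```python
import bisect


def rebuild_indexes(block_headers):
    """block_headers: iterable of dicts with 'AUID', 'BS' and optional 'POS',
    in arrival order for one stream. Padding is ignored in this sketch."""
    lit = {}      # AUID -> (byte offset in the received stream, block size)
    mit = []      # (POS, AUID) pairs, only for blocks that carry POS
    offset = 0
    for hdr in block_headers:
        lit.setdefault(hdr["AUID"], (offset, hdr["BS"]))
        if hdr.get("POS") is not None:
            mit.append((hdr["POS"], hdr["AUID"]))
        offset += hdr["BS"]              # BS gives the distance to the next block
    mit.sort()
    return lit, mit


def find_au(mit, pos):
    """Return the AUID of the last AU starting at or before pos, or None."""
    i = bisect.bisect_right(mit, (pos, float("inf"))) - 1
    return mit[i][1] if i >= 0 else None
```

  • Both structures grow as blocks arrive, so random access by position becomes available for the part of the stream already received without any index being transmitted.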
  • A Genomic Data Block can be split into one or more Genomic Data Packets, depending on network layer constraints such as maximum packet size, packet loss rate, etc. A Genomic Data Packet is composed of a header and a payload of encoded or encrypted genomic data as described in the table below.
  • TABLE 9
    Description of Genomic Data Packet syntax elements.
    Data type | Description | Data size
    Stream ID (SID) | Unambiguously identifies the data type carried by this packet. A Genomic Dataset Mapping Table is needed at the beginning of the stream in order to map Stream IDs to data types. Also used for updating correspondence points and relevant dependencies. | bitstring
    Access Unit Marker Bit (MB) | Set for the last packet of the access unit. Allows the last packet of an AU to be identified. | bit
    Packet Counter Number (SN) | Counter associated with each Stream ID, linearly increasing by 1. Needed to identify gaps/packet losses. Wraps around at 255. | bitstring
    Packet Size (PS) | Number of bytes composing the packet, including header, optional fields and payload. | bitstring
    Extension Flag (EF) | Set if extension fields are present. | bit
    Extension Fields | Optional fields, presence signaled by PS. | bytestring
    Payload | Block data (entire block or fragment). | bytestring
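  • The splitting of a block into packets can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the normative syntax: the header length `HEADER_BYTES` is a made-up constant (the table does not give field widths), "wrap around at 255" is read as an 8-bit counter that returns to 0 after 255, and the `Packet` type and `packetize` function are hypothetical names.

```python
from dataclasses import dataclass

HEADER_BYTES = 4  # hypothetical header length; the table gives no field widths


@dataclass
class Packet:
    sid: int        # Stream ID
    mb: bool        # Access Unit Marker Bit
    sn: int         # Packet Counter Number for this Stream ID
    ps: int         # Packet Size: header plus payload, in bytes
    payload: bytes  # entire block or a fragment of it


def packetize(block, sid, first_sn, max_payload, last_block_of_au=True):
    """Split one Genomic Data Block into packets of at most max_payload bytes."""
    if max_payload <= 0:
        raise ValueError("max_payload must be positive")
    # An empty block still yields one (empty) packet so the AU can be closed.
    chunks = [block[i:i + max_payload]
              for i in range(0, len(block), max_payload)] or [b""]
    packets = []
    for k, chunk in enumerate(chunks):
        is_last_chunk = k == len(chunks) - 1
        packets.append(Packet(
            sid=sid,
            mb=last_block_of_au and is_last_chunk,  # only the AU's last packet
            sn=(first_sn + k) % 256,                # assumed 8-bit wrap-around
            ps=HEADER_BYTES + len(chunk),
            payload=chunk,
        ))
    return packets


packets = packetize(b"x" * 10, sid=3, first_sn=254, max_payload=4)
```

  • Here a 10-byte block sent with a 4-byte payload limit produces three packets, and the counter passes from 255 back to 0 within the same stream.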
  • The Genomic Multiplex can be properly decoded only when at least one Genomic Dataset List, one Genomic Dataset Mapping Table and one Reference ID Mapping Table have been received, so that every packet can be mapped to a specific Genomic Dataset component.
  • Genomic Packet Header
  • Every Genomic Data Block may be split into fragments, which may be transmitted in the payload of Genomic Data Packets, depending on channel requirements such as packet loss rate and protocol maximum packet size.
  • A Genomic Data Packet is defined as follows.
  • Syntax | Description
    Packet_header( ) {
    Layer ID (LID) | Unambiguously identifies data type carried by this Packet. Unique for each sub-stream/data type. Mapping Table needed at beginning of stream in order to map Layer IDs to data types.
    Reserved | To maintain byte-alignment.
    Access Unit Marker Bit (MB) | Set for the last Packet of the Access Unit. Allows identifying the end of an AU as a set of Blocks.
    Sequence Number (SN) | Packet counter, linearly increasing by 1. Needed to identify packet losses as gaps in SNs for each individual sub-stream. Associated to LID, i.e., different SN for every LID.
    Packet Size (PS) | Number of bytes composing Packet, including header, optional fields and payload.
    Extension Flag (EF) | Set if extension field is present.
    [optional] Extension field | Optional field, present if EF is set.
    }
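  • The use of SN to identify packet losses per sub-stream can be sketched in Python as shown below. This is an illustration only: it assumes the counter wraps modulo 256, as stated for the Packet Counter Number in Table 9, and that packets are neither duplicated nor reordered. The function name `count_lost_packets` and the `(lid, sn)` tuple input are hypothetical.

```python
def count_lost_packets(headers, modulus=256):
    """Count missing packets per Layer ID from gaps in the Sequence Numbers.

    headers: iterable of (lid, sn) pairs in arrival order.
    Returns a dict mapping each LID to the number of packets inferred lost.
    """
    last_sn = {}
    lost = {}
    for lid, sn in headers:
        if lid in last_sn:
            # Consecutive packets differ by 1; anything larger is a gap.
            gap = (sn - last_sn[lid] - 1) % modulus
            lost[lid] = lost[lid] + gap
        else:
            lost[lid] = 0  # first packet seen for this sub-stream
        last_sn[lid] = sn
    return lost


received = [(1, 254), (2, 0), (1, 255), (1, 1), (2, 1)]
losses = count_lost_packets(received)
```

  • Because each LID has its own counter, the interleaving of sub-streams 1 and 2 does not hide the packet with SN 0 that is missing from sub-stream 1.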
  • Multiplex Encoding Process
  • FIG. 49 shows how, before being transformed into the data structures presented in this invention, raw genomic sequence data need to be mapped (491) onto one or more reference sequences known a priori (4920). If a reference sequence is not available, a “constructed” reference can be built from the raw sequence data (492). Already aligned data can be re-aligned in order to reduce the information entropy. After alignment, a genomic classifier (494) creates the data classes according to the matching functions described in Table 1 and separates metadata (e.g. quality values) and annotation data from the genomic sequences. A reference transformation (4919) can be applied to the external reference (4920) in order to further reduce the entropy of the generated classes of data (498). The transformed data classes (4918) are fed to layer encoders (495-497) to produce genomic layers (491), which are then encoded by entropy encoders (4912-4914). The genomic streams generated by the entropy encoders are then sent to the Genomic Multiplexer (4916), which generates the Genomic Multiplex. Genomic labels generated by a Genomic Labels Generator (4917) can be associated with the genomic streams (4915) by the Multiplexer (4916).
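  • The last stages of this process (classes, layers, entropy coding, multiplexing) can be sketched as a simple data flow. The Python below is a toy illustration under loud assumptions: reads are taken as already aligned and classified, each layer is just the newline-joined reads of one class, and `zlib` stands in for the entropy encoders. The function name `encode_multiplex` and the tuple layout of the result are hypothetical.

```python
import zlib
from collections import defaultdict


def encode_multiplex(classified_reads):
    """Group reads by class, encode each class as a layer, and multiplex.

    classified_reads: iterable of (class_label, read_string) pairs.
    Returns a list of (stream_id, class_label, compressed_layer) tuples.
    """
    layers = defaultdict(list)
    for class_label, read in classified_reads:
        layers[class_label].append(read)  # one layer per data class

    multiplex = []
    for stream_id, class_label in enumerate(sorted(layers)):
        layer = "\n".join(layers[class_label]).encode("ascii")
        # zlib is only a stand-in for the entropy encoders of FIG. 49.
        multiplex.append((stream_id, class_label, zlib.compress(layer)))
    return multiplex


mux = encode_multiplex([("P", "ACGT"), ("M", "ACGA"), ("P", "TTGA")])
```

  • Keeping one stream per class is what lets a receiver decode only the classes it needs, since each stream can be decompressed on its own.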

Claims (21)

1. A method for selective access of regions of genomic data by employing labels, said labels comprising: an identifier of a reference genomic sequence, an identifier of said genomic regions, and an identifier of the data class of said genomic data,
wherein said genomic data are sequences of genomic reads, and
wherein said data classes can be of the following type or a subset of them:
“Class P” comprising genomic reads which do not present any mismatch with respect to a reference sequence,
“Class N” comprising genomic reads including only mismatches in positions where the sequencing machine was not able to call any “base” and the number of said mismatches does not exceed a given threshold,
“Class M” comprising genomic reads in which mismatches are constituted by positions where the sequencing machine was not able to call any base, named “n type” mismatches, and/or it called a different base than the reference sequence, named “s type” mismatches, and said numbers of mismatches do not exceed given thresholds for the number of mismatches of “n type”, of “s type” and a threshold obtained from a given function (f(n,s)),
“Class I” when the genomic reads can possibly have the same type of mismatches of “Class M”, and in addition at least one mismatch of type: “insertion” (“i type”), “deletion” (“d type”), soft clips (“c type”), and wherein the numbers of mismatches for each type do not exceed the corresponding given thresholds and a threshold provided by a given function (w(n,s,i,d,c)),
“Class U” comprising all reads that do not find any classification in the classes P, N, M, I.
2. (canceled)
3. (canceled)
4. The method of claim 1, further comprising the case of said genomic data being paired sequences of genomic reads.
5. The method of claim 4 wherein said data class of paired reads can be of the following types or a subset of them:
“Class P” comprising genomic read pairs which do not present any mismatch with respect to a reference sequence,
“Class N” comprising genomic read pairs including only mismatches in positions where the sequencing machine was not able to call any “base” and said numbers of mismatches for each read do not exceed a given threshold,
“Class M” comprising genomic read pairs including only mismatches in positions where the sequencing machine was not able to call any “base” and said numbers of mismatches for each read do not exceed a given threshold, named “n type” mismatches, and/or it called a different base than the reference sequence, named “s type” mismatches, and said numbers of mismatches do not exceed given thresholds for the number of mismatches of “n type”, of “s type” and a threshold obtained from a given function (f(n,s)),
“Class I” comprising read pairs which can possibly have the same type of mismatches of “Class M” pairs, and in addition at least one mismatch of type: “insertion” (“i type”), “deletion” (“d type”), soft clips (“c type”), and wherein the number of mismatches for each type does not exceed the corresponding given threshold and a threshold provided by a given function (w(n,s,i,d,c)),
“Class HM” comprising read pairs for which only one read mate does not satisfy the matching rules for being classified in any of the classes P, N, M, I,
“Class U” comprising all read pairs for which both reads do not satisfy the matching rules for being classified in the classes P, N, M, I.
6. The method of claim 1, wherein said identifier of said genomic regions is comprised in a master index table.
7. The method of claim 6 wherein said genomic data and said labels are entropy coded.
8. The method of claim 7 wherein said master index table is comprised in a genomic dataset header.
9. The method of claim 8, wherein said regions of genomic data are dispersed among separate Access Units.
10. The method of claim 9 wherein the location of said regions of genomic data, in a file, is indicated in a local index table.
11. The method of claim 1, wherein said labels are user specified.
12. The method of claim 1, wherein said regions are protected and/or encrypted in a separate manner, without encrypting the whole genomic file.
13. The method of claim 1, wherein said labels are stored in a genomic label list (GLL).
14. A method for encoding genomic data with selective access to regions of genomic data as claimed in claim 1.
15. The method of claim 13, wherein said genomic label list is periodically retransmitted or updated in order to enable multiple synchronization points.
16. A method for decoding a stream or a file of genomic data with selective access to regions of genomic data as claimed in claim 1.
17. An apparatus for encoding genomic data as claimed in claim 14.
18. An apparatus for decoding genomic data as claimed in claim 16.
19. Storing means for storing genomic data encoded according to claim 14.
20. A computer-readable medium comprising instructions that when executed cause at least one processor to perform the encoding method of claim 14.
21. A computer-readable medium comprising instructions that when executed cause at least one processor to perform the decoding method of claim 16.
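
The single-read data classes recited in claim 1 can be illustrated with a short Python sketch. It is not the claimed method itself: the functions f(n,s) and w(n,s,i,d,c) are "given" functions in the claim, so plain sums are assumed here as defaults, Class N and Class M are assumed to share one threshold for "n type" mismatches, and the names `Mismatches`, `Thresholds` and `classify_read` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mismatches:
    n: int = 0  # "n type": no base called
    s: int = 0  # "s type": different base than the reference
    i: int = 0  # "i type": insertions
    d: int = 0  # "d type": deletions
    c: int = 0  # "c type": soft clips


@dataclass(frozen=True)
class Thresholds:
    max_n: int
    max_s: int
    max_i: int
    max_d: int
    max_c: int
    max_f: int  # bound on f(n, s)
    max_w: int  # bound on w(n, s, i, d, c)


def classify_read(m, t,
                  f=lambda n, s: n + s,                       # assumed f
                  w=lambda n, s, i, d, c: n + s + i + d + c):  # assumed w
    """Return "P", "N", "M", "I" or "U" for one aligned read."""
    indels_or_clips = m.i + m.d + m.c
    if indels_or_clips == 0 and m.s == 0:
        if m.n == 0:
            return "P"  # no mismatch at all
        return "N" if m.n <= t.max_n else "U"
    if indels_or_clips == 0:
        ok = m.n <= t.max_n and m.s <= t.max_s and f(m.n, m.s) <= t.max_f
        return "M" if ok else "U"
    ok = (m.n <= t.max_n and m.s <= t.max_s and m.i <= t.max_i
          and m.d <= t.max_d and m.c <= t.max_c
          and w(m.n, m.s, m.i, m.d, m.c) <= t.max_w)
    return "I" if ok else "U"


limits = Thresholds(max_n=3, max_s=5, max_i=2, max_d=2, max_c=2,
                    max_f=6, max_w=8)
labels = [classify_read(m, limits) for m in (
    Mismatches(), Mismatches(n=2), Mismatches(n=1, s=2),
    Mismatches(s=1, i=1), Mismatches(s=50),
)]
```

The class label computed this way is one of the three identifiers a label carries, alongside the reference sequence and the genomic region.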
US16/341,426 2016-10-11 2017-02-14 Method and system for selective access of stored or transmitted bioinformatics data Abandoned US20200042735A1 (en)

Applications Claiming Priority (9)

Application Number Priority Date Filing Date Title
EPPCT/EP2016/074297 2016-10-11
PCT/EP2016/074307 WO2018068829A1 (en) 2016-10-11 2016-10-11 Method and apparatus for compact representation of bioinformatics data
PCT/EP2016/074297 WO2018068827A1 (en) 2016-10-11 2016-10-11 Efficient data structures for bioinformatics information representation
EPPCT/EP2016/074311 2016-10-11
EPPCT/EP2016/074307 2016-10-11
PCT/EP2016/074311 WO2018068830A1 (en) 2016-10-11 2016-10-11 Method and system for the transmission of bioinformatics data
PCT/EP2016/074301 WO2018068828A1 (en) 2016-10-11 2016-10-11 Method and system for storing and accessing bioinformatics data
EPPCT/EP2016/074301 2016-10-11
PCT/US2017/017841 WO2018071054A1 (en) 2016-10-11 2017-02-14 Method and system for selective access of stored or transmitted bioinformatics data

Publications (1)

Publication Number Publication Date
US20200042735A1 (en) 2020-02-06

Family

ID=61905752

Family Applications (6)

Application Number Title Priority Date Filing Date
US16/341,426 Abandoned US20200042735A1 (en) 2016-10-11 2017-02-14 Method and system for selective access of stored or transmitted bioinformatics data
US16/337,639 Abandoned US20190214111A1 (en) 2016-10-11 2017-07-11 Method and systems for the representation and processing of bioinformatics data using reference sequences
US16/337,642 Active 2038-03-31 US11404143B2 (en) 2016-10-11 2017-07-11 Method and systems for the indexing of bioinformatics data
US16/485,623 Pending US20190385702A1 (en) 2016-10-11 2017-12-14 Method and systems for the reconstruction of genomic reference sequences from compressed genomic sequence reads
US16/485,649 Pending US20200051667A1 (en) 2016-10-11 2017-12-15 Method and systems for the efficient compression of genomic sequence reads
US16/485,670 Pending US20200051665A1 (en) 2016-10-11 2018-02-14 Method and apparatus for the compact representation of bioinformatics data using multiple genomic descriptors

Legal Events

Date Code Title Description
AS Assignment

Owner name: GENOMSYS SA, SWITZERLAND

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:BALUCH, MOHAMED KHOSO;ZOIA, GIORGIO;RENZI, DANIELE;SIGNING DATES FROM 20190405 TO 20190409;REEL/FRAME:048865/0629

STPP Information on status: patent application and granting procedure in general

Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION

STPP Information on status: patent application and granting procedure in general

Free format text: NON FINAL ACTION MAILED

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION