WO2022117205A1 - An apparatus and method for creating, reading and decoding a file encoded at a format which is readable to be processed according to multiple file formats - Google Patents

Publication number
WO2022117205A1
WO2022117205A1 PCT/EP2020/084595 EP2020084595W WO2022117205A1 WO 2022117205 A1 WO2022117205 A1 WO 2022117205A1 EP 2020084595 W EP2020084595 W EP 2020084595W WO 2022117205 A1 WO2022117205 A1 WO 2022117205A1
Authority
WO
WIPO (PCT)
Prior art keywords
file
data
formats
format
uncommon
Prior art date
Application number
PCT/EP2020/084595
Other languages
French (fr)
Inventor
Yair Toaff
Assaf Natanzon
Original Assignee
Huawei Technologies Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Priority to PCT/EP2020/084595 (WO2022117205A1)
Priority to CN202080107480.5A (CN116472526A)
Publication of WO2022117205A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/16 File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • G06F16/164 File meta data generation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/178 Techniques for file synchronisation in file systems
    • G06F16/1794 Details of file format conversion

Definitions

  • the present disclosure in some embodiments thereof, relates to data decoding and encoding formats and, more specifically, but not exclusively, to methods and apparatus for creating, reading and decoding a file in a format, which is readable to be processed according to multiple formats.
  • a file format is a standard way that information is encoded for storage in a computer file. It specifies how bits are used to encode information in a digital storage medium.
  • File formats may be either proprietary or free and may be either unpublished or open. Different file formats are designed for different types of data, some are for a very specific type of data and some are for several types of data.
  • For example, books can be encoded as PDF, HTML, EPUB, Word and the like. When representing the same data, all of these file formats must include common data.
  • each format must include three types of data: the text, the pictures of the book, and metadata, which is the set of instructions on how the text and pictures are ordered. Of these three types, the text and pictures are identical across formats and are therefore common data, whereas the metadata is specific to each format and changes from format to format.
  • a computer-implemented method for creating a file encoded in a format which is readable to be processed according to multiple file formats comprises: setting a file storage space for a file in a storage device; storing data common to the multiple file formats in a common data part of the file at the storage space; storing uncommon data of each of the multiple file formats in an uncommon data part of the file at the storage space; and storing a plurality of pointers in a metadata part of the file at the storage space, wherein each of the plurality of pointers associates metadata, which defines how to create the file to be readable in the multiple file formats, with the relevant common data or uncommon data.
  • the computerized implemented method further comprising compressing the common data and/or the uncommon data in blocks and storing an offset indicating the beginning of each block.
  • the computerized implemented method further comprising storing line numbers indicating the beginning of each block.
  • the uncommon data is stored separately from the common data in a separate file.
  • different types of data of the common data are stored in different files according to the type of data.
  • the multiple formats are genome file formats.
  • the multiple formats are at least one of FASTQ, FASTA and sequence alignment map (SAM).
  • the multiple formats are text related file formats.
  • a computerized implemented method for reading and decoding a file encoded at a format, which is readable at multiple formats comprises: reading a plurality of pointers in a metadata data part of the file at a storage space where a common data part and uncommon data part are stored, wherein each of the plurality of pointers associates between metadata which defines how to create the file to be readable in the multiple file formats and a relevant common data or uncommon data.
  • the common data and/or the uncommon data are compressed in blocks and an offset is stored indicating the beginning of each block.
  • an apparatus for creating a file encoded at a format, which is readable to be processed according to multiple file formats comprises a processor configured to execute a code for: setting a file storage space for a file in a storage device; storing data common to the multiple file formats in a common data part of the file at the storage space; storing uncommon data of each of the multiple file formats in an uncommon data part of the file at the storage space; and storing a plurality of pointers in a metadata data part of the file at the storage space, wherein each of the plurality of pointers associates between metadata which defines how to create the file to be readable in the multiple file formats and a relevant common data or uncommon data.
  • the processor is further configured to execute a code for compressing the common data and/or the uncommon data in blocks and storing an offset indicating the beginning of each block.
  • a computer program for execution by the processor comprising a call to the code executed by the processor.
  • an apparatus for reading and decoding a file encoded at a format, which is readable to be processed according to multiple file formats comprises a processor configured to execute a code for reading a plurality of pointers in a metadata data part of the file at a storage space where a common data part and uncommon data part are stored, wherein each of the plurality of pointers associates between metadata which defines how to create the file to be readable in the multiple file formats and a relevant common data or uncommon data.
  • a computer program for execution by the processor comprising a call to the code executed by the processor.
  • a computer program for creating a file encoded at a format, which is readable to be processed according to multiple file formats comprising: program instructions for executing, by a processor, a sequence of operations that comprises: setting a file storage space for a file in a storage device; storing data common to the multiple file formats in a common data part of the file at the storage space; storing uncommon data of each of the multiple file formats in an uncommon data part of the file at the storage space; and storing a plurality of pointers in a metadata data part of the file at the storage space, wherein each of the plurality of pointers associates between metadata which defines how to create the file to be readable to be processed according to the multiple file formats and a relevant common data or uncommon data.
  • a computer program for reading and decoding a file encoded at a format, which is readable to be processed according to multiple file formats.
  • the computer program comprising: program instructions for executing, by a processor, a sequence of operations that comprises: reading a plurality of pointers in a metadata data part of a file at a storage space where a common data part and uncommon data part are stored, wherein each of the plurality of pointers associates between metadata which defines how to create the file to be readable to be processed according to multiple file formats and a relevant common data or uncommon data.
  • FIG. 1 schematically shows a flowchart of an apparatus for creating a file encoded at a format, which is readable to be processed according to multiple file formats, and reading and decoding the created file, according to some embodiments of the present disclosure
  • FIG. 2 schematically shows a flowchart of a method for creating a file encoded at a format, which is readable to be processed according to multiple file formats, according to some embodiments of the present disclosure
  • FIG. 3A schematically shows an example of a block of four lines formatted to comply with a FASTQ format
  • FIG. 3B schematically shows an example of lines of sequences formatted to comply with a sequence alignment map (SAM) format
  • FIG. 3C is a diagram for creating a genome file encoded in a format, which is readable to be processed according to two genome file formats of FASTQ and SAM, according to some embodiments of the present disclosure
  • FIG. 4A schematically shows an example of text lines formatted to comply with a txt format
  • FIG. 4B schematically shows an example of text lines formatted to comply with an html format.
  • the present disclosure, in some embodiments thereof, relates to data decoding and encoding formats and, more specifically, but not exclusively, to methods and an apparatus for creating, reading and decoding a file encoded in a format, which is readable to be processed according to multiple formats.
  • the file encoded in a format, which is readable to be processed according to the multiple formats, is also referred to herein as a “combined file”.
  • Multiple file formats may be used for encoding a specific type of data. Each format has advantages and disadvantages.
  • When a file format is not supported by the user's computer, a file format converter is needed to convert the unsupported file format to another file format, which is supported by the user's computer. After the conversion there are two files, both stored on the storage space the user set for the files (such as a disk, hard disk and the like). The converted file is accessible only after the conversion is done.
  • genome data files are very large and may reach tens of Gigabytes for each file format.
  • a genome file encoded at a FASTQ format may reach 18GB.
  • the present disclosure discloses a method and apparatus for encoding a file in a format to be processed by a processor according to multiple file formats, by storing the common data of the multiple file formats only once and then adding a mapping of metadata to each corresponding file format of the multiple formats. This enables reading the multiple formats from the same file and saves storage space.
  • the mapping of metadata indicates the relative offsets in the file encoded to be processed according to multiple file formats, and the corresponding offsets in the different original files. This allows direct access to a random offset in the file. Relating to the example above, when considering a database of thousands (or even millions) of genome files, the present disclosure brings significant savings in storage volume and costs.
  • Embodiments may be a system, a method, and/or a computer program product.
  • the computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the embodiments.
  • the computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device.
  • the computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
  • a non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, and any suitable combination of the foregoing.
  • a computer readable storage medium is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.
  • Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network.
  • the network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers.
  • a network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.
  • Computer readable program instructions for carrying out operations of embodiments may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages.
  • the computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
  • the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
  • electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of embodiments.
  • These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
  • These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.
  • the computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
  • each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s).
  • the functions noted in the block may occur out of the order noted in the figures.
  • two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
  • FIG. 1 is a schematic flowchart of an apparatus for creating a file encoded at a format, which is readable to be processed according to multiple file formats, and reading and decoding the file encoded at a format, which is readable to be processed according to multiple file formats, according to some embodiments of the present disclosure.
  • Apparatus 100 includes a processor 101, which executes a code 104 for creating a file encoded at a format, which is readable to be processed according to multiple file formats.
  • Processor 101 also executes a code 105 for reading and decoding the file encoded at the format, which is readable to be processed according to multiple file formats, according to some embodiments of the present disclosure.
  • the processor 101 controls file system 102, which manages storage spaces at apparatus 100.
  • File system 102 sets a storage space 103 for files to be used by a user in the creation of the file format, which is readable to be processed according to multiple file formats.
  • FIG. 2 schematically shows a flowchart of a method for creating a file encoded at a format, which is readable to be processed according to multiple file formats, according to some embodiments of the present disclosure.
  • processor 101 executes a code 104, which causes the file system 102 to set a file storage space 103 for a file in a storage device (for the combined file).
  • the executed code 104 accesses given multiple files of the same data encoded each at a different format, and stores data common to the multiple file formats in a common data part of the combined file at the storage space 103.
  • different types of data of the common data may be stored in different files according to the type of data.
  • the two types of data may be stored in two different files, one for the sequence type of data and one for the quality type of data. Both files are linked to the combined file.
  • the executed code 104 stores uncommon data of each of the multiple file formats in an uncommon data part of the combined file at the storage space 103.
  • the uncommon data is stored separately from the common data in a separate file, which is linked to the combined file.
  • executed code 104 stores a plurality of pointers in a metadata data part of the combined file at the storage space 103, where each of the plurality of pointers associates between metadata which defines how to create the file to be readable according to the multiple file formats and a relevant common data or uncommon data.
  • the pointers may point to the parts of the combined file, where the common data and uncommon data parts are stored or to the files where the common data and/or the uncommon data are stored, in case the data is stored in different files.
  • the common data and the uncommon data may be compressed in blocks and an offset may be stored for indicating the beginning of each block. Alternatively, instead of an offset, line numbers may be stored for indicating the beginning of each block.
  • the compression in blocks of the common and uncommon data allows direct access to random offsets (i.e. to individual parts) of the file without needing to decompress the whole combined file.
  • For example, when offsets 1GB to 2GB of the combined file need to be read, the mapping file is accessed by code 105, which finds the start of this part (offset 1GB) and reads the data until the end of the part (2GB). This way, there is no need to open the whole file and only then extract the needed part.
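  • The block-wise compression with stored block offsets can be illustrated with a minimal sketch. It is not the disclosed implementation: the use of `zlib`, the 1 MiB block size and the names `compress_in_blocks` and `read_range` are assumptions made only for illustration.

```python
import bisect
import zlib

BLOCK_SIZE = 1 << 20  # hypothetical: 1 MiB of uncompressed data per block


def compress_in_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Compress data block by block and return (payload, index).

    Each index entry is (uncompressed_offset, compressed_offset,
    compressed_size), i.e. the stored offset marking where a block begins.
    """
    payload = bytearray()
    index = []
    for start in range(0, len(data), block_size):
        packed = zlib.compress(data[start:start + block_size])
        index.append((start, len(payload), len(packed)))
        payload += packed
    return bytes(payload), index


def read_range(payload: bytes, index, offset: int, size: int) -> bytes:
    """Return data[offset:offset + size], decompressing only the blocks touched."""
    starts = [entry[0] for entry in index]
    # Last block whose uncompressed start is <= offset.
    i = max(bisect.bisect_right(starts, offset) - 1, 0)
    end = offset + size
    out = bytearray()
    while i < len(index) and index[i][0] < end:
        u_off, c_off, c_size = index[i]
        block = zlib.decompress(payload[c_off:c_off + c_size])
        out += block[max(offset - u_off, 0):end - u_off]
        i += 1
    return bytes(out)
```

  • Reading a range then costs one index lookup plus the decompression of the few blocks that overlap the range, rather than the decompression of the whole file.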
  • Processor 101 executes code 105 for reading and decoding the file encoded at the format, which is readable to be processed according to multiple file formats (the combined file).
  • Code 105 reads a plurality of pointers in the metadata data part of the combined file at a storage space.
  • the combined file contains a common data part and uncommon data part.
  • Each of the plurality of pointers associates between metadata, which defines how to create the file to be readable to be processed according to the multiple file formats and a relevant common data or uncommon data.
  • the common data and/or uncommon data are stored in different files. In this case, the pointers point to the specific data at the relevant file where the common and/or uncommon data are stored.
  • Genomic data is created by cutting the scanned genome into small pieces and guessing the nucleotide sequence of each piece. For every such sequence, which is represented by a series of letters, there is another series of characters of the same length that represents the quality of the guess for each nucleotide in the sequence. These two series of characters are referred to as sequence and quality.
  • For genome data, there are multiple file formats, for example, FASTA, FASTQ and SAM. Each of these formats is needed for specific stages and targets of the genome processing.
  • the size of a single genome file can reach tens of Gigabytes (GB) (for example, a FASTQ file may reach a size of 18GB and more). Therefore, creating a file encoded at a format which is readable to be processed according to multiple formats, and which is stored only once (as one file), saves a meaningful amount of storage space in this case. This advantage is even more significant for databases of thousands (or even millions) of files. All of the genome files have common data, which is the sequence, and most of them also have the quality.
  • this header is unique for each sequence.
  • Most of the genome formats are text based, so they can be manipulated using common text manipulation tools for analyzing the genome. Direct access into the files is important in order for the text manipulation tools to work efficiently.
  • FASTQ format and SAM format both include a common data part.
  • the SAM file contains also an uncommon data part.
  • Both file formats contain the same three fields that are the majority of the data: header, sequence and quality.
  • the SAM file format contains also other unique fields.
  • a FASTQ file normally uses four lines per genome sequence. Line 1 begins with a '@' character and is followed by a sequence identifier and an optional description (like a FASTA title line). Line 2 contains the raw sequence letters. Line 3 begins with a '+' character and is optionally followed by the same sequence identifier (and any description) again. Line 4 encodes the quality values for the sequence in Line 2 and contains the same number of symbols as there are letters in the sequence.
  • FIG. 3A schematically shows an example of a block of four lines of a FASTQ format.
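  • As an illustration of the four-line layout, the following minimal sketch parses one FASTQ record into its header, sequence and quality fields. The function name `parse_fastq_record` and the sample record are hypothetical and are not taken from the figures.

```python
def parse_fastq_record(lines):
    """Parse one four-line FASTQ record into its three data fields."""
    header, sequence, separator, quality = (line.rstrip("\n") for line in lines)
    if not header.startswith("@"):
        raise ValueError("line 1 must start with '@'")
    if not separator.startswith("+"):
        raise ValueError("line 3 must start with '+'")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality must have the same length")
    return {"header": header[1:], "sequence": sequence, "quality": quality}


# Hypothetical record, for illustration only.
record = parse_fastq_record(["@read1\n", "GATTACA\n", "+\n", "IIIIIII\n"])
```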
  • the SAM file format consists of a header and an alignment section.
  • the header section must be prior to the alignment section if it is present. Headings begin with the '@' symbol, which distinguishes them from the alignment section.
  • the header section may contain information about the entire file and additional information for alignments. The alignments then associate themselves with specific header information.
  • the alignment section contains the information for each sequence about where and/or how it aligns to the reference genome. Each alignment line has 11 mandatory fields for essential alignment information such as mapping position.
  • the SAM file format may contain a variable number of optional fields for flexible or aligner specific information.
  • FIG. 3B schematically shows an example of sequence lines of a SAM format.
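  • A minimal sketch of how one SAM alignment line could be separated into the fields it shares with FASTQ and the fields that are unique to SAM. The field positions follow the 11 mandatory tab-separated SAM columns (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL); the function name `split_sam_line` and the sample line are hypothetical.

```python
def split_sam_line(line):
    """Split one SAM alignment line into (header, sequence, quality, other)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line has 11 mandatory fields")
    header = fields[0]     # QNAME, the read name
    sequence = fields[9]   # SEQ
    quality = fields[10]   # QUAL
    # Everything else (FLAG..TLEN plus optional fields) is SAM-only data.
    other = fields[1:9] + fields[11:]
    return header, sequence, quality, other


# Hypothetical alignment line, for illustration only.
sample = "read1\t0\tchr1\t100\t60\t7M\t*\t0\t0\tGATTACA\tIIIIIII\tNM:i:0\n"
header, sequence, quality, other = split_sam_line(sample)
```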
  • Both file formats contain header, sequence and quality fields. These fields make up about 99% of a FASTQ file and about 90% of a SAM file.
  • both formats can be read from one same file.
  • the common data part may be stored at different files according to the different types of data, of headers, sequence, and quality.
  • the uncommon data may be stored in a different file. This may make it easier to read the file according to one of the multiple formats. For example, to read the file according to the FASTQ format, the uncommon data, which is only the MD of the SAM file format, is not necessary.
  • processor 101 executes a code 104, which stores the common fields of both file formats, FASTQ and SAM, only once and adds the unique fields of each format. This gives a large space saving, since the common fields are the majority of the data. According to some embodiments of the present disclosure, by adding dedicated compression of these fields, a compression ratio of up to 1:20 may be reached.
  • By adding mapping meta data (MD) that records the offsets in the original files and the offset in the combined data (this MD saves the offsets of about 1MB chunks), direct read access to the data in both formats can be achieved without dumping the needed file to disk.
  • the creation of the file, which is encoded to be readable to be processed according to a FASTQ file format and according to a SAM file format, is done as follows. First, the whole header section of the SAM file is copied to a separate file (i.e. to the combined file). Then, for each sequence line in the SAM file, the data in the sequence line is split into header, sequence, quality and additional fields. The different parts of the sequence line are stored in accumulating buffers. After about 3000 sequence lines of the SAM file have been accumulated, the different buffers are compressed, and the compressed buffers are stored at the relevant parts of the combined file. Then, a mapping MD entry is created for these blocks, indicating the offset and line number in the SAM file and in the FASTQ file, and the offset and compressed size in each part of the created file.
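  • The creation steps can be sketched as follows. This is a minimal illustration under stated assumptions rather than the disclosed code 104: `zlib` stands in for the unspecified compression, the names `MappingEntry` and `build_combined` are hypothetical, each input line is assumed to end with a newline, and the FASTQ offsets assume that each record is rendered as "@header", sequence, a bare "+" and quality.

```python
import zlib
from dataclasses import dataclass

PARTS = ("header", "sequence", "quality", "other")
GROUP_LINES = 3000  # "about 3000 sequence lines" per group


@dataclass
class MappingEntry:
    sam_line: int      # index of the group's first line in the SAM file
    sam_offset: int    # byte offset of that line in the SAM file
    fastq_line: int    # index of the group's first line in the FASTQ file
    fastq_offset: int  # byte offset of that line in the FASTQ file
    parts: dict        # part name -> (offset, compressed size) in that part


def build_combined(sam_lines, group_lines=GROUP_LINES):
    """Return (header_section, stores, mapping) for the given SAM lines."""
    sam_lines = list(sam_lines)
    header_section = [l for l in sam_lines if l.startswith("@")]
    alignments = [l for l in sam_lines if not l.startswith("@")]
    stores = {name: bytearray() for name in PARTS}
    mapping = []
    sam_offset = sum(len(l.encode()) for l in header_section)
    fastq_offset = 0
    for start in range(0, len(alignments), group_lines):
        buffers = {name: [] for name in PARTS}  # accumulating buffers
        entry = MappingEntry(len(header_section) + start, sam_offset,
                             4 * start, fastq_offset, {})
        for line in alignments[start:start + group_lines]:
            f = line.rstrip("\n").split("\t")
            buffers["header"].append(f[0])
            buffers["sequence"].append(f[9])
            buffers["quality"].append(f[10])
            buffers["other"].append("\t".join(f[1:9] + f[11:]))
            sam_offset += len(line.encode())
            fastq_offset += len(f"@{f[0]}\n{f[9]}\n+\n{f[10]}\n".encode())
        # Compress each buffer and record where it lands in its part.
        for name in PARTS:
            packed = zlib.compress("\n".join(buffers[name]).encode())
            entry.parts[name] = (len(stores[name]), len(packed))
            stores[name] += packed
        mapping.append(entry)
    return header_section, stores, mapping
```

  • Each `MappingEntry` plays the role of one mapping MD entry: it ties a position in either original format to the compressed blocks that hold the corresponding fields.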
  • As an example of the reading and decoding method, consider reading the combined file according to the FASTQ format from offset 1MB with size 1MB.
  • the reading code 105 searches the MD for the entry that contains the 1MB offset.
  • the code 105 then restores the FASTQ data from the decompressed data of this chunk, starting at the correct offset. When needed, the same is done with the following MD entries until the needed size is covered.
  • the code 105 takes into account the headers part as a prefix and copies parts of it if needed. In this way, an 18GB FASTQ file and the related 35GB SAM file (which represents the same genome sample as the FASTQ file) are saved as a 3.5GB combined compressed file.
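  • The reading steps can be sketched in the same spirit. This is a minimal illustration, not the disclosed code 105: the MD layout (a list of dictionaries with a `fastq_offset` and per-part `(offset, compressed size)` pairs), the use of `zlib`, and the names `restore_fastq_chunk` and `read_fastq` are all assumptions.

```python
import bisect
import zlib


def restore_fastq_chunk(stores, entry) -> bytes:
    """Rebuild the FASTQ bytes of one MD entry from its compressed parts."""
    def part(name):
        off, size = entry["parts"][name]
        return zlib.decompress(stores[name][off:off + size]).decode().split("\n")

    records = zip(part("header"), part("sequence"), part("quality"))
    return "".join(f"@{h}\n{s}\n+\n{q}\n" for h, s, q in records).encode()


def read_fastq(stores, mapping, offset: int, size: int) -> bytes:
    """Return bytes [offset, offset + size) of the FASTQ view of the file."""
    starts = [entry["fastq_offset"] for entry in mapping]
    # Search the MD for the entry that the requested offset falls in.
    i = max(bisect.bisect_right(starts, offset) - 1, 0)
    end = offset + size
    out = bytearray()
    # Continue with the next MD entries until the needed size is covered.
    while i < len(mapping) and mapping[i]["fastq_offset"] < end:
        base = mapping[i]["fastq_offset"]
        chunk = restore_fastq_chunk(stores, mapping[i])
        out += chunk[max(offset - base, 0):end - base]
        i += 1
    return bytes(out)
```

  • Only the chunks that overlap the requested range are decompressed, so the FASTQ file never has to be written out in full.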
  • the codes 104 and 105 executed by processor 101 may be executed by a dedicated application, where the only file shown to the user is the combined file encoded in a format, which is readable in multiple formats. In this case, only one file is seen in the storage and it is read as the format chosen by the user each time.
  • the first option is to show only the combined file, while this file is read through code 105.
  • the second option is to embed code 105 in a dedicated application (or in the file system implementation) that shows the multiple file format as different files. This way it shows only the multiple file formats without the combined file, and the access is as accessing each file format regularly, while in the background the reading goes through code 105 embedded in the file system and/or in a dedicated application.
  • FIG. 3C schematically shows an example of a diagram for creating a genome file encoded in a format, which is readable to be processed according to two genome file formats of FASTQ and SAM, according to some embodiments of the present disclosure.
  • the creation of the combined file, which is readable according to the two genome file formats, FASTQ and SAM is done by processor 101, which executes the code 104, which accesses the FASTQ file format 310 and SAM file format 320.
  • the header of the FASTQ is a full header and is therefore used as is.
  • the SAM header includes only part of the full header of the FASTQ file format.
  • the code copies all the headers section of the SAM file format to a separated file, which is the common data part in the file denoted as combined data part 301.
  • the SAM file is divided into lines, where each line of the SAM file corresponds to a block of four lines of the FASTQ file.
  • the code 104 splits the sequence line to header, sequence, quality and additional fields.
  • the different parts of headers, sequence, quality and additional fields are stored in accumulating buffers 302, 303, 304 and 305 respectively in the combined data part 301. Since the headers, sequence and quality parts of the FASTQ and SAM file formats hold the same data, this data is stored once in the common data part of the combined file.
  • header1 of the FASTQ file format and header1 of the SAM file format are stored as header1 under the headers part 302 in the combined data part 301.
  • Sequence1 of the FASTQ file format and sequence1 of the SAM file format are stored as sequence1 under the sequence part 303, in the combined data part 301.
  • Quality1 of the FASTQ file format and quality1 of the SAM file format are stored as quality1 under the quality part 304 of the combined data part 301.
  • the uncommon data, denoted as other fields1 of the SAM file format, is stored in a separate part of uncommon data, as other fields1 under the other part 305.
  • the FASTQ file in this example does not include uncommon data parts. However, when a FASTQ file does include uncommon data, that data is stored in a separate part of uncommon data, which holds the FASTQ uncommon data and is distinguished from the uncommon data of the SAM file.
  • the combined file which is readable in both FASTQ format and SAM format, includes a part of meta data (MD) 330, which defines how to create the file to be readable according to the multiple file formats.
  • the MD part 330 includes a plurality of pointers, which associate between the MD and the relevant common data or uncommon data.
  • the MD part 330 contains pointers 333, which hold, for the first line of the SAM file format and the first block of the FASTQ format, the size and offset of header1 in the headers part 302 of the common data part 301 (that is, the offset within the headers part of the combined file, or within the headers file).
  • the MD part 330 contains pointers 334, which hold the sequence size and offset in the sequence part 303 of the common data part 301 for the first line of the SAM file format and the first block of the FASTQ format.
  • the MD part 330 also contains pointers 335, which hold the quality size and offset of quality 1 in the quality part 304 of the common data part, and pointers 336, which hold the ‘other’ size and offset of the uncommon data of the SAM file format, which is stored as other field 1 in the other part 305.
  • the MD part 330 contains pointers, which hold the sizes and offsets for each part of the common data and for each part of the uncommon data.
  • the MD part 330 includes pointers 331 and 332 to the FASTQ file format offset of each block and to the SAM file format offset of each line to enable direct access to the offset of the file.
  • the diagram of FIG. 3C schematically shows an implementation for a single sequence block.
  • another implementation is possible by splitting the SAM file format into groups of about 3000 lines (about 1MB of the SAM file). In this implementation, the groups are small enough, in terms of read overhead, for a read to be considered a direct access read, while allowing the data to be compressed more efficiently.
  • the combined file format may be encoded to be readable to be processed according to multiple file formats.
  • a genome file format may be encoded to be readable to be processed according to a FASTA file format, FASTQ file format and SAM file format.
  • the method for creating a file encoded at a format readable to be processed according to multiple formats, and for reading and decoding the file, may be applied to all kinds of related file formats.
  • text related formats such as HyperText Markup Language (html), epub and txt all contain the text itself with the addition of extra format data, and they may be encoded at a format readable to be processed according to multiple text related formats.
  • creating a file encoded at a format readable to be processed according to html file format and txt file format may be implemented according to some embodiments of the present disclosure.
  • a regular txt file format contains a title line and a text. The text may be very long and may get to 1 MB of text.
  • FIG. 4A schematically shows an example of a txt format.
  • An html file format also contains a title and a text with some additional MD regarding the basic formatting of the text.
  • the text may be very long and may get to 1 MB of text.
  • FIG. 4B schematically shows an example of an html page format.
  • processor 101 executes code 104 for creating a file format encoded as a format readable to be processed according to txt file format and html file format.
  • the processor sets a file storage space for a file in a storage device, and stores the common data, which is the title and the text of both the txt file format and the html file format, at a common data part of the file in the storage space.
  • the uncommon data which is the MD regarding the basic formatting of the text in the html file format is stored by the processor at the uncommon data part of the file at the storage space.
  • a plurality of pointers are stored by the processor at the MD part of the file at the storage space.
  • Each of the pointers associates between metadata, which defines how to create the file to be readable in the multiple file formats and a relevant common data or uncommon data.
  • for the combined file format readable to be processed according to a txt file format and an html file format, it is possible to implement the txt file format as the combined file. This is because the txt file contains all the common data of both files (txt and html).
  • the uncommon data part contains only the MD of the html file format, and the MD part contains pointers with the offset and size of the common data and the uncommon data, respectively, of the txt and html file formats.
  • composition or method may include additional ingredients and/or steps, but only if the additional ingredients and/or steps do not materially alter the basic and novel characteristics of the claimed composition or method.
  • the singular form “a”, “an” and “the” include plural references unless the context clearly dictates otherwise.
  • the term “a compound” or “at least one compound” may include a plurality of compounds, including mixtures thereof.
  • range format is merely for convenience and brevity and should not be construed as an inflexible limitation on the scope of embodiments. Accordingly, the description of a range should be considered to have specifically disclosed all the possible subranges as well as individual numerical values within that range. For example, description of a range such as from 1 to 6 should be considered to have specifically disclosed subranges such as from 1 to 3, from 1 to 4, from 1 to 5, from 2 to 4, from 2 to 6, from 3 to 6 etc., as well as individual numbers within that range, for example, 1, 2, 3, 4, 5, and 6. This applies regardless of the breadth of the range.
  • whenever a numerical range is indicated herein, it is meant to include any cited numeral (fractional or integral) within the indicated range.
  • the phrases “ranging/ranges between” a first indicate number and a second indicate number and “ranging/ranges from” a first indicate number “to” a second indicate number are used herein interchangeably and are meant to include the first and second indicated numbers and all the fractional and integral numerals there between.
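The FASTQ/SAM layout described in the list above (accumulating buffers for headers, sequence, quality and other fields, plus size and offset pointers in the MD part) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the disclosure: the class `CombinedGenome`, the use of in-memory `bytearray` buffers, and the choice to keep the SAM-only fields as one tab-joined string. The sketch also assumes that the SAM record stores the sequence and quality exactly as they appear in the FASTQ block.

```python
class CombinedGenome:
    """Hypothetical in-memory model of the combined data part and the MD part."""

    PART_NAMES = ("headers", "sequence", "quality", "other")

    def __init__(self):
        # One accumulating buffer per part (compare parts 302 to 305).
        self.parts = {name: bytearray() for name in self.PART_NAMES}
        # MD part: one entry per read, mapping part name -> (offset, size).
        self.pointers = []

    def add(self, header, sequence, quality, other):
        """Append one read; `other` holds the SAM-only fields, tab-joined."""
        entry = {}
        values = (header, sequence, quality, other)
        for name, value in zip(self.PART_NAMES, values):
            data = value.encode("ascii")
            buf = self.parts[name]
            entry[name] = (len(buf), len(data))  # offset, size
            buf += data                          # in-place append
        self.pointers.append(entry)

    def _get(self, index, name):
        offset, size = self.pointers[index][name]
        return self.parts[name][offset:offset + size].decode("ascii")

    def fastq_block(self, index):
        """Rebuild the four-line FASTQ block from the common data only."""
        header = self._get(index, "headers")
        sequence = self._get(index, "sequence")
        quality = self._get(index, "quality")
        return f"@{header}\n{sequence}\n+\n{quality}\n"

    def sam_line(self, index):
        """Rebuild the SAM alignment line from common and uncommon data."""
        other = self._get(index, "other").split("\t")
        # The first 8 'other' fields sit between QNAME and SEQ
        # (FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN);
        # any remaining ones are optional fields that follow QUAL.
        fields = ([self._get(index, "headers")] + other[:8]
                  + [self._get(index, "sequence"), self._get(index, "quality")]
                  + other[8:])
        return "\t".join(fields) + "\n"


# Hypothetical reads, invented for the example.
genome = CombinedGenome()
genome.add("read1", "GATTACA", "IIIIHHH", "0\tchr1\t100\t60\t7M\t*\t0\t0\tNM:i:0")
genome.add("read2", "ACGT", "FFFF", "4\t*\t0\t0\t*\t*\t0\t0")
```

Each read is stored once, and either representation is rebuilt on demand from the pointers, so no conversion step between the two formats is needed in this sketch.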

Abstract

An apparatus and method for creating, reading and decoding a file encoded in a format readable to be processed according to multiple file formats are disclosed. The method for creating the file comprises storing data common to the multiple file formats in a common data part of a combined file at the storage space set for the file in a storage device, and storing uncommon data of each of the multiple file formats in an uncommon data part of the combined file. Then, a plurality of pointers are stored in a metadata data part of the combined file; each pointer associates between metadata, which defines how to create the file to be readable to be processed according to the multiple file formats, and a relevant common data or uncommon data. The reading and decoding of the combined file is done by reading the plurality of pointers.

Description

AN APPARATUS AND METHOD FOR CREATING, READING AND DECODING A FILE ENCODED AT A FORMAT WHICH IS READABLE TO BE PROCESSED ACCORDING TO MULTIPLE FILE FORMATS
TECHNICAL FIELD
The present disclosure, in some embodiments thereof, relates to data decoding and encoding formats and, more specifically, but not exclusively, to methods and apparatus for creating, reading and decoding a file in a format, which is readable to be processed according to multiple formats.
BACKGROUND
A file format is a standard way that information is encoded for storage in a computer file. It specifies how bits are used to encode information in a digital storage medium. File formats may be either proprietary or free and may be either unpublished or open. Different file formats are designed for different types of data; some are for a very specific type of data and some are for several types of data.
Multiple file formats exist for many types of data. For example, books can be encoded in formats such as pdf, html, epub, word and the like. When representing the same data, all of these file formats must include common data. In the book example, each format must include three types of data: the text, the pictures of the book, and metadata, which is the instructions on how the text and pictures are ordered. Of these three types of data, the text and pictures are the same in every format and are therefore common data; the metadata, however, is proprietary to each format and changes from format to format.
SUMMARY
It is an object of the present disclosure to provide an apparatus and a method for creating a file encoded at a format, which is readable to be processed according to multiple formats, and for reading and decoding the file encoded at a format, which is readable to be processed according to multiple file formats, thereby preventing the need to convert any of the multiple file formats to another one of the multiple file formats.
By creating, reading and decoding a file encoded at a format, which is readable to be processed according to multiple file formats, storage space and costs are saved.
The foregoing and other objects are achieved by the features of the independent claims. Further implementation forms are apparent from the dependent claims, the description and the figures.
According to a first aspect of the present disclosure, a computerized implemented method for creating a file encoded at a format, which is readable to be processed according to multiple file formats, is disclosed. The method comprises: setting a file storage space for a file in a storage device; storing data common to the multiple file formats in a common data part of the file at the storage space; storing uncommon data of each of the multiple file formats in an uncommon data part of the file at the storage space; and storing a plurality of pointers in a metadata data part of the file at the storage space, wherein each of the plurality of pointers associates between metadata which defines how to create the file to be readable in the multiple file formats and a relevant common data or uncommon data. By storing the common data only once in one file, storage space and cost are saved, and in cases where the files are Gigabytes in size, the savings in storage and cost are meaningful.
In a further implementation of the first aspect the computerized implemented method further comprising compressing the common data and/or the uncommon data in blocks and storing an offset indicating the beginning of each block.
In a further implementation of the first aspect the computerized implemented method further comprising storing line numbers indicating the beginning of each block.
By compressing the common and/or the uncommon data, even more storage space and costs are saved, in addition to enabling direct access to different parts of the created file.
In a further implementation of the first aspect, the uncommon data is stored separately from the common data in a separate file. In a further implementation of the first aspect, different types of data of the common data are stored in different files according to the type of data, thereby enabling easier compression of the common data part and allowing easier direct access to the common data files in cases where a part of the common data is not needed.
In a further implementation of the first aspect the multiple formats are genome file formats.
In a further implementation of the first aspect the multiple formats are at least one of FASTQ, FASTA and sequence alignment map (SAM).
In a further implementation of the first aspect the multiple formats are text related file formats.
According to a second aspect, a computerized implemented method for reading and decoding a file encoded at a format, which is readable at multiple formats is disclosed. The method comprises: reading a plurality of pointers in a metadata data part of the file at a storage space where a common data part and uncommon data part are stored, wherein each of the plurality of pointers associates between metadata which defines how to create the file to be readable in the multiple file formats and a relevant common data or uncommon data.
In a further implementation of the second aspect the common data and/or the uncommon data are compressed in blocks and an offset is stored indicating the beginning of each block.
According to a third aspect, an apparatus for creating a file encoded at a format, which is readable to be processed according to multiple file formats, is disclosed. The apparatus comprises a processor configured to execute a code for: setting a file storage space for a file in a storage device; storing data common to the multiple file formats in a common data part of the file at the storage space; storing uncommon data of each of the multiple file formats in an uncommon data part of the file at the storage space; and storing a plurality of pointers in a metadata data part of the file at the storage space, wherein each of the plurality of pointers associates between metadata which defines how to create the file to be readable in the multiple file formats and a relevant common data or uncommon data.
In a further implementation of the third aspect the processor is further configured to execute a code for compressing the common data and/or the uncommon data in blocks and storing an offset indicating the beginning of each block.
In a further implementation of the third aspect, a computer program for execution by the processor is disclosed. The computer program comprising a call to the code executed by the processor.
In a fourth aspect, an apparatus for reading and decoding a file encoded at a format, which is readable to be processed according to multiple file formats, is disclosed. The apparatus comprises a processor configured to execute a code for reading a plurality of pointers in a metadata data part of the file at a storage space where a common data part and uncommon data part are stored, wherein each of the plurality of pointers associates between metadata which defines how to create the file to be readable in the multiple file formats and a relevant common data or uncommon data.
In a further implementation of the fourth aspect, a computer program for execution by the processor is disclosed. The computer program comprising a call to the code executed by the processor.
In a fifth aspect, a computer program for creating a file encoded at a format, which is readable to be processed according to multiple file formats is disclosed. The computer program comprising: program instructions for executing, by a processor, a sequence of operations that comprises: setting a file storage space for a file in a storage device; storing data common to the multiple file formats in a common data part of the file at the storage space; storing uncommon data of each of the multiple file formats in an uncommon data part of the file at the storage space; and storing a plurality of pointers in a metadata data part of the file at the storage space, wherein each of the plurality of pointers associates between metadata which defines how to create the file to be readable to be processed according to the multiple file formats and a relevant common data or uncommon data.
In a sixth aspect, a computer program for reading and decoding a file encoded at a format, which is readable to be processed according to multiple file formats is disclosed. The computer program comprising: program instructions for executing, by a processor, a sequence of operations that comprises: reading a plurality of pointers in a metadata data part of a file at a storage space where a common data part and uncommon data part are stored, wherein each of the plurality of pointers associates between metadata which defines how to create the file to be readable to be processed according to multiple file formats and a relevant common data or uncommon data.
Other systems, methods, features, and advantages of the present disclosure will be or become apparent to one with skill in the art upon examination of the following drawings and detailed description. It is intended that all such additional systems, methods, features, and advantages be included within this description, be within the scope of the present disclosure, and be protected by the accompanying claims.
Unless otherwise defined, all technical and/or scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which embodiments pertain. Although methods and materials similar or equivalent to those described herein can be used in the practice or testing of embodiments, exemplary methods and/or materials are described below. In case of conflict, the patent specification, including definitions, will control. In addition, the materials, methods, and examples are illustrative only and are not intended to be necessarily limiting.
BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWING(S)
Some embodiments are herein described, by way of example only, with reference to the accompanying drawings. With specific reference now to the drawings in detail, it is stressed that the particulars shown are by way of example and for purposes of illustrative discussion of embodiments. In this regard, the description taken with the drawings makes apparent to those skilled in the art how embodiments may be practiced.
In the drawings:
FIG. 1 schematically shows a flowchart of an apparatus for creating a file encoded at a format, which is readable to be processed according to multiple file formats, and reading and decoding the created file, according to some embodiments of the present disclosure;
FIG. 2 schematically shows a flowchart of a method for creating a file encoded at a format, which is readable to be processed according to multiple file formats, according to some embodiments of the present disclosure;
FIG. 3A schematically shows an example of a block of four lines formatted to comply with a FASTQ format;
FIG. 3B schematically shows an example of lines of sequences formatted to comply with a sequence alignment map (SAM) format;
FIG. 3C is a diagram for creating a genome file encoded in a format, which is readable to be processed according to two genome file formats of FASTQ and SAM, according to some embodiments of the present disclosure;
FIG. 4A schematically shows an example of text lines formatted to comply with a txt format; and
FIG. 4B schematically shows an example of text lines formatted to comply with an html format.
DETAILED DESCRIPTION
The present disclosure, in some embodiments thereof, relates to data decoding and encoding formats and, more specifically, but not exclusively, to an apparatus and methods for creating, reading and decoding a file encoded in a format, which is readable to be processed according to multiple formats.
For simplicity, the file encoded in a format, which is readable to be processed according to the multiple formats, is also referred to herein as a “combined file”.
Multiple file formats may be used for encoding a specific type of data. Each format has advantages and disadvantages. When a user receives data encoded in a specific file format, which is not supported by the computer of the user, a file format converter is needed to convert the unsupported file format to another file format, which is supported by the user computer. After the unsupported file format is converted to the supported file format, the user ends up with two files, which are stored on a storage space the user set for the files (such as a disk, hard disk and the like). The converted file is accessible only after the conversion is done.
In some cases, for example in the case of genome data, the files are very large and may reach tens of Gigabytes for each file format. For example, a genome file encoded at a FASTQ format may reach 18GB.
The present disclosure discloses a method and apparatus for encoding a file in a format to be processed by a processor according to multiple file formats, by storing the common data of the multiple file formats only once and then adding a mapping of metadata to each corresponding file format of the multiple formats. This enables reading the multiple formats from the same file and saves storage space. The mapping of metadata indicates the relative offset in the file encoded to be processed according to multiple file formats, and the corresponding original files in the different offsets. This allows direct access to a random offset in the file. Relating to the example above, when considering a database of thousands (or even millions) of genome files, the present disclosure brings significant savings in storage volume and costs.
Before explaining at least one embodiment in detail, it is to be understood that embodiments are not necessarily limited in their application to the details of construction and the arrangement of the components and/or methods set forth in the following description and/or illustrated in the drawings and/or the Examples. Implementations described herein are capable of other embodiments or of being practiced or carried out in various ways.
Embodiments may be a system, a method, and/or a computer program product. The computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the embodiments.
The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.
Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network. The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device. Computer readable program instructions for carrying out operations of embodiments may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). 
In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of embodiments.
Aspects of embodiments are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.
These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.
Reference is now made to FIG. 1, which is a schematic flowchart of an apparatus for creating a file encoded at a format, which is readable to be processed according to multiple file formats, and reading and decoding the file encoded at a format, which is readable to be processed according to multiple file formats, according to some embodiments of the present disclosure. Apparatus 100 includes a processor 101, which executes a code 104 for creating a file encoded at a format, which is readable to be processed according to multiple file formats. Processor 101 also executes a code 105 for reading and decoding the file encoded at the format, which is readable to be processed according to multiple file formats, according to some embodiments of the present disclosure. The processor 101 controls file system 102, which manages storage spaces at apparatus 100. File system 102 sets a storage space 103 for files to be used by a user in the creation of the file format, which is readable to be processed according to multiple file formats.
Reference is now made to FIG. 2, which schematically shows a flowchart of a method for creating a file encoded at a format, which is readable to be processed according to multiple file formats, according to some embodiments of the present disclosure. At 201, processor 101 executes a code 104, which causes the file system 102 to set a file storage space 103 for a file in a storage device (for the combined file). At 202, the executed code 104 accesses given multiple files of the same data, each encoded at a different format, and stores data common to the multiple file formats in a common data part of the combined file at the storage space 103. In some embodiments of the present disclosure, different types of data of the common data may be stored in different files according to the type of data. For example, when the common data is divided into types of data such as sequence and quality, the two types of data may be stored in two different files, one for the sequence type of data and one for the quality type of data. Both files are linked to the combined file. At 203, the executed code 104 stores uncommon data of each of the multiple file formats in an uncommon data part of the combined file at the storage space 103. In some embodiments of the present disclosure, the uncommon data is stored separately from the common data in a separate file, which is linked to the combined file. Then, at 204, the executed code 104 stores a plurality of pointers in a metadata data part of the combined file at the storage space 103, where each of the plurality of pointers associates between metadata which defines how to create the file to be readable according to the multiple file formats and a relevant common data or uncommon data. The pointers may point to the parts of the combined file where the common data and uncommon data parts are stored, or to the files where the common data and/or the uncommon data are stored, in case the data is stored in different files.
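Steps 202 to 204 can be illustrated with a minimal sketch that uses a txt/html pair as input. The on-disk layout chosen here (common part, then uncommon part, then a JSON metadata part, then an 8-byte footer holding the metadata length) is an assumption made for the example and is not specified by the disclosure; the function names `create_combined` and `read_as` and the sample title and text are likewise hypothetical.

```python
import json
import struct


def create_combined(common: bytes, uncommon: bytes, pointers: dict) -> bytes:
    """Lay out: common part | uncommon part | metadata part | 8-byte footer."""
    metadata = json.dumps({
        "common": [0, len(common)],               # offset, size of common part
        "uncommon": [len(common), len(uncommon)],  # offset, size of uncommon part
        "formats": pointers,                       # per-format pointer lists
    }).encode("utf-8")
    footer = struct.pack("<Q", len(metadata))      # metadata length, little-endian
    return common + uncommon + metadata + footer


def read_as(combined: bytes, fmt: str) -> bytes:
    """Rebuild one format by following its (part, offset, size) pointers."""
    (md_len,) = struct.unpack("<Q", combined[-8:])
    md = json.loads(combined[-8 - md_len:-8])
    base = {"common": md["common"][0], "uncommon": md["uncommon"][0]}
    out = bytearray()
    for part, offset, size in md["formats"][fmt]:
        start = base[part] + offset
        out += combined[start:start + size]
    return bytes(out)


# Hypothetical input: the same title and text as txt and as html.
title, text = b"My title", b"Some body text."
common = title + b"\n" + text                      # identical to the txt file
markup = [b"<html><head><title>", b"</title></head><body><p>",
          b"</p></body></html>"]
uncommon = b"".join(markup)                        # html-only formatting data
m0, m1, m2 = (len(m) for m in markup)
pointers = {
    "txt": [["common", 0, len(common)]],
    "html": [
        ["uncommon", 0, m0],
        ["common", 0, len(title)],
        ["uncommon", m0, m1],
        ["common", len(title) + 1, len(text)],
        ["uncommon", m0 + m1, m2],
    ],
}
combined = create_combined(common, uncommon, pointers)
```

Calling `read_as(combined, "txt")` or `read_as(combined, "html")` then yields either representation from the single stored file. In this sketch the title and text are stored once, and only the html markup is added as uncommon data.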
In addition, the common data and the uncommon data may be compressed in blocks, and an offset may be stored to indicate the beginning of each block. Alternatively, instead of an offset, line numbers may be stored to indicate the beginning of each block. Compressing the common and uncommon data in blocks allows direct access to random offsets (i.e. to individual parts) of the file without needing to decompress the whole combined file. For example, when offsets 1GB to 2GB in the combined file need to be read, the mapping file is accessed by code 105, which finds the start of this part (offset 1GB) and reads the data until the end of the part (2GB). This way, there is no need to open the whole file and only then extract the needed part. In addition, when the common data is stored in different files according to data type, the files can be compressed more easily.
According to some embodiments of the present disclosure, a method for reading and decoding a file encoded at a format, which is readable to be processed according to multiple file formats, is disclosed herein. Processor 101 executes code 105 for reading and decoding the file encoded at the format, which is readable to be processed according to multiple file formats (the combined file). Code 105 reads a plurality of pointers in the metadata data part of the combined file at a storage space. The combined file contains a common data part and an uncommon data part. Each of the plurality of pointers associates between metadata, which defines how to create the file to be readable to be processed according to the multiple file formats, and a relevant common data or uncommon data. In some embodiments of the present disclosure, the common data and/or uncommon data are stored in different files. In this case, the pointers point to the specific data in the relevant file where the common and/or uncommon data are stored.
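The block-wise compression and direct-access read described above can be sketched as follows. This is an illustration under stated assumptions rather than the disclosed implementation: `zlib` is used as a stand-in compressor, the block size is arbitrary, and the index layout and the function names `compress_in_blocks` and `read_range` are hypothetical.

```python
import bisect
import zlib

BLOCK_SIZE = 64 * 1024  # hypothetical uncompressed block size


def compress_in_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Compress data block by block and record where each block begins.

    Returns the compressed stream and an index whose entries are
    (uncompressed_offset, compressed_offset, compressed_size).
    """
    stream = bytearray()
    index = []
    for start in range(0, len(data), block_size):
        block = zlib.compress(data[start:start + block_size])
        index.append((start, len(stream), len(block)))
        stream += block
    return bytes(stream), index


def read_range(stream: bytes, index, start: int, end: int) -> bytes:
    """Return uncompressed bytes [start, end), touching only overlapping blocks."""
    start = max(start, 0)
    if start >= end or not index:
        return b""
    # Locate the block that contains `start` using the stored offsets.
    starts = [entry[0] for entry in index]
    first = bisect.bisect_right(starts, start) - 1
    out = bytearray()
    for u_off, c_off, c_size in index[first:]:
        if u_off >= end:
            break  # blocks past the requested range are never decompressed
        block = zlib.decompress(stream[c_off:c_off + c_size])
        out += block[max(start - u_off, 0):end - u_off]
    return bytes(out)
```

Smaller blocks reduce the amount of data decompressed per read, while larger blocks usually compress better; the block size is the knob that trades one against the other.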
Reference is now made to an exemplary case of creating a genome file encoded in a format that is readable to be processed according to two genome file formats, FASTQ and sequence alignment map (SAM), according to some embodiments of the present disclosure. Genomic data is created by cutting the scanned genome into small pieces and guessing the nucleotide sequence of each piece. For every such sequence, which is represented by a series of letters, there is another series of characters of the same length that represents the quality of the guess for each nucleotide in the sequence. These two series of characters are referred to as sequence and quality.
There are multiple file formats for representing genome data, for example, FASTA, FASTQ and SAM. Each of these formats is needed for specific stages and targets of the genome processing. The size of a single genome file can reach tens of gigabytes (GB) (for example, a FASTQ file may reach a size of 18GB and more). Therefore, the ability to create a file encoded in a format that is readable to be processed according to multiple formats, and that is stored only once (as one file) to save storage space, is a meaningful advantage in this case. This advantage is even more significant when it comes to databases of thousands (or even millions) of files. All of the genome files have common data, which is the sequence, and most of them also have the quality. Usually there is also a header for each sequence, created by the machine that analyzes the given DNA sample into the text file representing the sequence. This header is unique for each sequence. Most of the genome formats are text based, so they can be manipulated using common text manipulation tools for analyzing the genome. Direct access into the files is important in order for the text manipulation tools to work efficiently.
Consider the SAM and FASTQ representations of the same genome scan. In this case, the FASTQ format and the SAM format both include a common data part. The SAM file also contains an uncommon data part. Both file formats contain the same three fields that make up the majority of the data: header, sequence and quality. The SAM file format also contains other unique fields.

A FASTQ file format normally uses four lines per genome sequence. Line 1 begins with a '@' character and is followed by a sequence identifier and an optional description (like a FASTA title line). Line 2 contains the raw sequence letters. Line 3 begins with a '+' character and is optionally followed by the same sequence identifier (and any description) again. Finally, line 4 encodes the quality values for the sequence in line 2, and must contain the same number of symbols as there are letters in the sequence. FIG. 3A schematically shows an example of a block of four lines of a FASTQ format.

The structure of a SAM file format is, however, different. The SAM file format consists of a header section and an alignment section. The header section, if it is present, must precede the alignment section. Header lines begin with the '@' symbol, which distinguishes them from the alignment section. The header section may contain information about the entire file and additional information for alignments. The alignments then associate themselves with specific header information. The alignment section contains, for each sequence, the information about where and/or how it aligns to the reference genome. Each alignment line has 11 mandatory fields for essential alignment information such as mapping position. The SAM file format may contain a variable number of optional fields for flexible or aligner-specific information. FIG. 3B schematically shows an example of a sequence line of a SAM format.
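The overlap between the two formats can be made concrete with a small parsing sketch. The function names are hypothetical, and the field names in `SAM_MANDATORY` are the 11 mandatory columns of the SAM format. One assumption to note: the sketch treats the SAM sequence and quality columns as identical to the FASTQ ones, which holds for forward-strand alignments; SAM stores reverse-strand reads reverse-complemented, a case this sketch does not handle.

```python
SAM_MANDATORY = ("QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
                 "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL")
# Mandatory SAM columns that also appear in a FASTQ record.
COMMON = {"QNAME": "header", "SEQ": "sequence", "QUAL": "quality"}


def parse_fastq_record(lines):
    """Split one four-line FASTQ block into header, sequence and quality."""
    title, sequence, plus, quality = (line.rstrip("\n") for line in lines)
    if not title.startswith("@") or not plus.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    return {"header": title[1:], "sequence": sequence, "quality": quality}


def parse_sam_alignment(line):
    """Split one SAM alignment line into common fields and SAM-only fields."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < len(SAM_MANDATORY):
        raise ValueError("alignment line has fewer than 11 fields")
    record = {"other": {}, "optional": cols[len(SAM_MANDATORY):]}
    for name, value in zip(SAM_MANDATORY, cols):
        if name in COMMON:
            record[COMMON[name]] = value
        else:
            record["other"][name] = value
    return record
```

For a read that appears in both files, the `sequence` and `quality` values returned by the two functions are the same strings, and the SAM `header` is the identifier portion of the FASTQ title line.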
Both file formats contain header, sequence and quality fields. These fields make up about 99% of the FASTQ file format and about 90% of the SAM file format. According to some embodiments of the present disclosure, by keeping the common data in a common field and keeping the uncommon data of each format in additional fields, both formats can be read from one and the same file. In some embodiments of the present disclosure, the common data part may be stored in different files according to the different types of data: headers, sequence, and quality. In addition, the uncommon data may be stored in a different file. This may make it easier to read the file according to one of the multiple formats. For example, to read the file according to the FASTQ format, the uncommon data, which is only the MD of the SAM file format, is not necessary. According to some embodiments of the present disclosure, processor 101 executes code 104, which stores the common fields of both file formats, FASTQ and SAM, only once and adds the unique fields from the two different formats. This gives a huge space saving, since the common fields are the majority of the data. According to some embodiments of the present disclosure, by adding dedicated compression of these fields, a compression ratio of up to 1:20 may be reached.
In some embodiments of the present disclosure, mapping metadata (MD) is added that shows the offsets in the original files and the offset in the combined data (this MD saves the offsets of chunks of about 1MB). With this MD, direct read access to the data in both formats can be achieved without dumping the needed file to disk. By opening the combined file and reading the MD (which is very small), direct read access is gained according to both formats, at the cost of a little extra MD and of reading in chunks. The value is that, at the storage, the user sees multiple files in multiple formats, while under the hood only a single, proprietary, highly compressed format is kept.
According to some embodiments of the present disclosure, the creation of the file, which is encoded to be readable to be processed according to a FASTQ file format and according to a SAM file format, is done as follows. First, the whole header section of the SAM file format is copied to a separate file (i.e. to the combined file). Then, for each sequence line in the SAM file format, the data in the sequence line is split into header, sequence, quality and additional fields. The different parts of the sequence line are stored in accumulating buffers. After about 3000 sequence lines of the SAM file format have been accumulated, the different buffers are compressed, and the compressed buffers are stored at the relevant parts of the combined file. Then, a mapping MD entry is created for these blocks, indicating the offset and line number in the SAM file format and in the FASTQ file format, and the offset and compressed size in each part of the created file.
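The creation procedure above can be sketched as follows. This is a simplified illustration under several assumptions that are not in the disclosure: `zlib` is used for compression, the FASTQ title is taken to be the SAM read name only, the text is ASCII, SAM offsets are counted from the start of the alignment section, and the names `build_combined` and `PARTS` are hypothetical.

```python
import zlib

PARTS = ("header", "sequence", "quality", "other")


def build_combined(sam_lines, group_size=3000):
    """Split SAM lines into per-field buffers and compress them in groups.

    Returns (sam_header, parts, mapping):
      sam_header: the '@' header lines, kept as a prefix
      parts:      one bytes object of concatenated compressed blocks per field
      mapping:    one MD entry per group with line number, SAM and FASTQ
                  offsets, and (offset, compressed_size) in each part
    """
    sam_header = []
    parts = {name: bytearray() for name in PARTS}
    buffers = {name: [] for name in PARTS}
    mapping = []
    line_no = sam_offset = fastq_offset = 0
    group_start = (0, 0, 0)  # (line number, SAM offset, FASTQ offset)

    def flush():
        nonlocal group_start
        if not buffers["header"]:
            return
        entry = {"line": group_start[0], "sam_offset": group_start[1],
                 "fastq_offset": group_start[2], "parts": {}}
        for name in PARTS:
            packed = zlib.compress("\n".join(buffers[name]).encode())
            entry["parts"][name] = (len(parts[name]), len(packed))
            parts[name].extend(packed)
            buffers[name].clear()
        mapping.append(entry)
        group_start = (line_no, sam_offset, fastq_offset)

    for line in sam_lines:
        line = line.rstrip("\n")
        if line.startswith("@"):           # SAM header section
            sam_header.append(line)
            continue
        cols = line.split("\t")
        name, seq, qual = cols[0], cols[9], cols[10]
        buffers["header"].append(name)
        buffers["sequence"].append(seq)
        buffers["quality"].append(qual)
        buffers["other"].append("\t".join(cols[1:9] + cols[11:]))
        sam_offset += len(line) + 1        # assumes ASCII, '\n' line endings
        fastq_offset += len(f"@{name}\n{seq}\n+\n{qual}\n")
        line_no += 1
        if line_no % group_size == 0:
            flush()
    flush()                                # last, possibly partial, group
    return sam_header, {k: bytes(v) for k, v in parts.items()}, mapping
```

Each mapping entry records where a group starts in both original formats and where its compressed buffers sit in each part, which is the information the reading side needs for direct access.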
For the reading and decoding method of the combined file, consider, for example, reading the file according to a FASTQ format from offset 1MB with size 1MB. The reading code, code 105, searches the MD for the entry that contains the 1MB offset. The FASTQ data is restored by code 105 from the opened data of this chunk, starting from the correct offset. When needed, the same is done with the next MD entries until the needed size is covered. For reading the combined file according to a SAM format, code 105 takes into account the headers part as a prefix and copies parts of it if needed. In this way, an 18GB FASTQ file and the related 35GB SAM file (which represents the same genome sample as the FASTQ file) are saved as a 3.5GB combined compressed file. When considering a database of thousands (or even millions) of genomes, this can be a huge saving in storage volume and costs. By adding mapping metadata, the combined file, which is readable to be processed according to multiple formats, may be kept and accessed in random access in each of the multiple formats. This way, multiple files in multiple formats are seen at the storage, while under the hood only a single proprietary format is kept, which, when done correctly, is much smaller than the sum of all the multiple files.

According to some other embodiments of the present disclosure, the codes 104 and 105 executed by processor 101 may be executed by a dedicated application, where the only file shown to the user is the combined file encoded in a format that is readable in multiple formats. In this case, only one file is seen in the storage, and it is read in the format chosen by the user each time. In general, there are two options for showing the different multiple formats. The first option is to show only the combined file, while this file is read through code 105. The second option is to embed code 105 in a dedicated application (or in the file system implementation) that shows the multiple file formats as different files. This way, only the multiple file formats are shown, without the combined file, and each file format is accessed as a regular file, while in the background the reading goes through code 105 embedded in the file system and/or in a dedicated application.
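The FASTQ read path described above (find the MD entry that contains the requested offset, restore the chunk, continue with the next entries until the size is covered) can be sketched as follows. The chunk layout, the use of `zlib`, and the names `make_chunks` and `read_fastq` are assumptions made for illustration.

```python
import bisect
import zlib


def make_chunks(records, group_size):
    """Build chunks of (fastq_offset, header_blob, sequence_blob, quality_blob).

    `records` is a list of (name, sequence, quality) strings (ASCII assumed).
    """
    chunks, fastq_offset = [], 0
    for g in range(0, len(records), group_size):
        group = records[g:g + group_size]
        blobs = tuple(zlib.compress("\n".join(col).encode()) for col in zip(*group))
        chunks.append((fastq_offset, *blobs))
        # '@' + name + '\n' + seq + '\n' + '+\n' + qual + '\n'
        fastq_offset += sum(len(n) + len(s) + len(q) + 6 for n, s, q in group)
    return chunks


def read_fastq(chunks, offset, size):
    """Return `size` bytes of the virtual FASTQ file starting at `offset`."""
    end = offset + size
    starts = [chunk[0] for chunk in chunks]
    i = max(bisect.bisect_right(starts, offset) - 1, 0)  # entry containing offset
    out = bytearray()
    while i < len(chunks) and chunks[i][0] < end:
        start, h_blob, s_blob, q_blob = chunks[i]
        names, seqs, quals = (zlib.decompress(b).decode().split("\n")
                              for b in (h_blob, s_blob, q_blob))
        text = "".join(f"@{n}\n{s}\n+\n{q}\n"
                       for n, s, q in zip(names, seqs, quals)).encode()
        out += text[max(offset - start, 0):end - start]
        i += 1
    return bytes(out)
```

A SAM read path would work the same way with the SAM offsets, with the stored header section served as a prefix before the first alignment line.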
Reference is now made to FIG. 3C, which schematically shows an example of a diagram for creating a genome file encoded in a format that is readable to be processed according to two genome file formats, FASTQ and SAM, according to some embodiments of the present disclosure. The creation of the combined file, which is readable according to the two genome file formats, FASTQ and SAM, is done by processor 101, which executes the code 104, which accesses the FASTQ file format 310 and the SAM file format 320. In the case of the FASTQ and SAM file formats, the header of the FASTQ is a full header and is therefore used as is. The SAM header includes only part of the full header of the FASTQ file format. The code copies the whole headers section of the SAM file format to a separate file, which is the common data part in the file denoted as combined data part 301. The SAM file is divided into lines, where each line of the SAM file format corresponds to a block of four lines of the FASTQ file format. Then, for each sequence line in the SAM file, the code 104 splits the sequence line into header, sequence, quality and additional fields. The different parts of headers, sequence, quality and additional fields are stored in accumulating buffers 302, 303, 304 and 305, respectively, in the combined data part 301. Since the data is the same, and the headers, sequence and quality parts of the FASTQ and SAM file formats are the same, the data is stored once in the common data part of the combined file. For example, header1 of the FASTQ file format and header1 of the SAM file format (which contains only part of the full header of the FASTQ file format) are stored as header1 under the headers part 302 in the combined data part 301. Sequence1 of the FASTQ file format and sequence1 of the SAM file format are stored as sequence1 under the sequence part 303 in the combined data part 301. Quality1 of the FASTQ file format and quality1 of the SAM file format are stored as quality1 under the quality part 304 of the combined data part 301. The uncommon data, denoted as other fields1 of the SAM file format, is stored in a separate part of uncommon data, as other fields1 under the other part 305. In this case, the FASTQ does not include uncommon data parts. However, when it does include uncommon data, that data is stored in a separate part of uncommon data, which holds the FASTQ uncommon data and is distinguished from the SAM uncommon data.
The combined file, which is readable in both FASTQ format and SAM format, includes a metadata (MD) part 330, which defines how to create the file to be readable according to the multiple file formats. The MD part 330 includes a plurality of pointers, which associate the MD with the relevant common data or uncommon data. The MD part 330 contains pointers 333, which hold, for the first line of the SAM file format and the first block of the FASTQ format, the headers size and the offset of header1 (which is the offset in the headers part of the combined file, or in the headers file) in the headers part 302 of the common data part 301. In addition, the MD part 330 contains pointers 334, which hold the sequence size and offset in the sequence part 303 of the common data part 301 for the first line of the SAM file format and the first block of the FASTQ format. The MD part 330 also contains pointers 335, which hold the quality size and the offset of quality1 in the quality part 304 of the common data part, and pointers 336, which hold the 'other' size and offset of the uncommon data of the SAM file format, which is stored as other fields1 in the other part 305. The same is true for the second line of the SAM file format and the second block of the FASTQ file format, where the MD part 330 contains pointers that hold the sizes and offsets for each part of the common data and for each part of the uncommon data. In addition, the MD part 330 includes pointers 331 and 332 to the FASTQ file format offset of each block and to the SAM file format offset of each line, to enable direct access to the offset of the file. For simplicity, the diagram of FIG. 3C schematically shows an implementation for single sequence blocks. In some other embodiments of the present disclosure, another implementation is possible by splitting the SAM file format into groups of about 3000 lines (about 1MB of the SAM file). In this implementation, the groups are small enough (in terms of read overhead) for reads to be considered direct access, while allowing the data to be compressed more efficiently.
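One possible way to serialize the MD part is a table of fixed-size entries, so that the entry for the n-th block can be located by simple arithmetic. The sketch below is an assumption and not a layout given in the disclosure: each entry holds the FASTQ offset, the SAM offset, and an (offset, size) pair for each of the headers, sequence, quality and other parts, as ten unsigned 64-bit integers. The names `ENTRY`, `build_md` and `entry_at` are hypothetical.

```python
import struct

# fastq_offset, sam_offset, then (offset, size) for header, sequence, quality, other
ENTRY = struct.Struct("<10Q")
PART_NAMES = ("header", "sequence", "quality", "other")


def build_md(records):
    """Store each record's fields in its part and emit one MD entry per record.

    `records` yields (fastq_offset, sam_offset, header, sequence, quality, other),
    where the last four items are bytes. Returns (parts, md_bytes).
    """
    parts = {name: bytearray() for name in PART_NAMES}
    md = bytearray()
    for fastq_offset, sam_offset, *fields in records:
        pointers = []
        for name, data in zip(PART_NAMES, fields):
            pointers += [len(parts[name]), len(data)]  # offset, size
            parts[name].extend(data)
        md += ENTRY.pack(fastq_offset, sam_offset, *pointers)
    return {name: bytes(buf) for name, buf in parts.items()}, bytes(md)


def entry_at(md: bytes, n: int):
    """Decode the n-th MD entry without scanning the entries before it."""
    return ENTRY.unpack_from(md, n * ENTRY.size)
```

With grouped storage, the same entry shape could describe one compressed group of about 3000 lines instead of one record, which keeps the MD small.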
In some embodiments of the present disclosure, the combined file format may be encoded to be readable to be processed according to multiple file formats. For example, a genome file format may be encoded to be readable to be processed according to a FASTA file format, FASTQ file format and SAM file format.
According to some embodiments of the present disclosure, the method for creating a file encoded in a format readable to be processed according to multiple formats, and for reading and decoding the file, may be applied to all kinds of related file formats. For example, text related formats such as HyperText Markup Language (html), epub, and txt all contain the text itself with the addition of extra format data, and they may be encoded in a format readable to be processed according to multiple text related formats.
For example, creating a file encoded at a format readable to be processed according to html file format and txt file format may be implemented according to some embodiments of the present disclosure. A regular txt file format contains a title line and a text. The text may be very long and may get to 1 MB of text. FIG. 4A schematically shows an example of a txt format.
An html file format also contains a title and a text, with some additional MD regarding the basic formatting of the text. The text may be very long and may reach 1 MB of text. FIG. 4B schematically shows an example of an html page format. According to some embodiments of the present disclosure, processor 101 executes code 104 for creating a file encoded in a format readable to be processed according to the txt file format and the html file format. First, the processor sets a file storage space for a file in a storage device, and stores the common data, which is the title and the text of both the txt file format and the html file format, at a common data part of the file in the storage space. The uncommon data, which is the MD regarding the basic formatting of the text in the html file format, is stored by the processor at the uncommon data part of the file at the storage space. Finally, a plurality of pointers are stored by the processor at the MD part of the file at the storage space. Each of the pointers associates metadata, which defines how to create the file to be readable in the multiple file formats, with the relevant common data or uncommon data.
In some embodiments of the present disclosure, when the combined file is readable to be processed according to a txt file format and an html file format, it is possible to implement the txt file format as the combined file. This is because the txt file contains all the common data of both files (txt and html). The uncommon data part contains only the MD of the html file format, and the MD part contains pointers with the offset and size of the common data and of the uncommon data, respectively, of the txt and html file formats.
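The txt and html case can be sketched by keeping the plain text once and storing the html formatting as markup fragments tied to offsets in that text. This representation of the uncommon data, and the names `render_txt` and `render_html`, are assumptions made for illustration. HTML escaping of the text is ignored in the sketch.

```python
def render_txt(text: str) -> str:
    """The txt view is the common data as is."""
    return text


def render_html(text: str, insertions) -> str:
    """Rebuild the html view from the common text and the uncommon markup.

    `insertions` is a list of (offset, markup) pairs: `markup` is inserted
    before the character at `offset` in `text`. Pairs with equal offsets
    keep their given order, since the sort is stable.
    """
    out, pos = [], 0
    for offset, markup in sorted(insertions, key=lambda item: item[0]):
        out.append(text[pos:offset])
        out.append(markup)
        pos = offset
    out.append(text[pos:])
    return "".join(out)
```

For example, with the text `"Title\nHello world\n"` and the insertions `[(0, "<h1>"), (5, "</h1>"), (6, "<p>"), (17, "</p>")]`, the html view wraps the title and the body while the txt view returns the stored text unchanged.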
The descriptions of the various embodiments have been presented for purposes of illustration, but are not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application or technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
It is expected that during the life of a patent maturing from this application many relevant methods and apparatuses for creating and reading a file format, which is readable to be processed according to multiple formats will be developed and the scope of the term method and apparatus for creating and reading a file format, which is readable to be processed according to multiple formats is intended to include all such new technologies a priori.
As used herein the term “about” refers to ± 10 %.
The terms "comprises", "comprising", "includes", "including", "having" and their conjugates mean "including but not limited to". This term encompasses the terms "consisting of" and "consisting essentially of".
The phrase "consisting essentially of" means that the composition or method may include additional ingredients and/or steps, but only if the additional ingredients and/or steps do not materially alter the basic and novel characteristics of the claimed composition or method. As used herein, the singular form "a", "an" and "the" include plural references unless the context clearly dictates otherwise. For example, the term "a compound" or "at least one compound" may include a plurality of compounds, including mixtures thereof.
The word “exemplary” is used herein to mean “serving as an example, instance or illustration”. Any embodiment described as “exemplary” is not necessarily to be construed as preferred or advantageous over other embodiments and/or to exclude the incorporation of features from other embodiments.
The word “optionally” is used herein to mean “is provided in some embodiments and not provided in other embodiments”. Any particular embodiment may include a plurality of “optional” features unless such features conflict.
Throughout this application, various embodiments may be presented in a range format. It should be understood that the description in range format is merely for convenience and brevity and should not be construed as an inflexible limitation on the scope of embodiments. Accordingly, the description of a range should be considered to have specifically disclosed all the possible subranges as well as individual numerical values within that range. For example, description of a range such as from 1 to 6 should be considered to have specifically disclosed subranges such as from 1 to 3, from 1 to 4, from 1 to 5, from 2 to 4, from 2 to 6, from 3 to 6 etc., as well as individual numbers within that range, for example, 1, 2, 3, 4, 5, and 6. This applies regardless of the breadth of the range.
Whenever a numerical range is indicated herein, it is meant to include any cited numeral (fractional or integral) within the indicated range. The phrases “ranging/ranges between” a first indicate number and a second indicate number and “ranging/ranges from” a first indicate number “to” a second indicate number are used herein interchangeably and are meant to include the first and second indicated numbers and all the fractional and integral numerals there between.
It is appreciated that certain features of embodiments, which are, for clarity, described in the context of separate embodiments, may also be provided in combination in a single embodiment. Conversely, various features of embodiments, which are, for brevity, described in the context of a single embodiment, may also be provided separately or in any suitable sub combination or as suitable in any other described embodiment. Certain features described in the context of various embodiments are not to be considered essential features of those embodiments, unless the embodiment is inoperative without those elements.
Although embodiments have been described in conjunction with specific embodiments thereof, it is evident that many alternatives, modifications and variations will be apparent to those skilled in the art. Accordingly, it is intended to embrace all such alternatives, modifications and variations that fall within the spirit and broad scope of the appended claims.
It is the intent of the applicant(s) that all publications, patents and patent applications referred to in this specification are to be incorporated in their entirety by reference into the specification, as if each individual publication, patent or patent application was specifically and individually noted when referenced that it is to be incorporated herein by reference. In addition, citation or identification of any reference in this application shall not be construed as an admission that such reference is available as prior art to embodiments. To the extent that section headings are used, they should not be construed as necessarily limiting.

Claims

1. A computerized implemented method for creating a file encoded at a format, which is readable to be processed according to multiple file formats, comprising: setting a file storage space for a file in a storage device; storing data common to the multiple file formats in a common data part of the file at the storage space; storing uncommon data of each of the multiple file formats in an uncommon data part of the file at the storage space; and storing a plurality of pointers in a metadata data part of the file at the storage space, wherein each of the plurality of pointers associates between metadata which defines how to create the file to be readable to be processed according to the multiple file formats and a relevant common data or uncommon data.
2. The computerized implemented method of claim 1, further comprising compressing the common data and/or the uncommon data in blocks and storing an offset indicating the beginning of each block.
3. The computerized implemented method of claim 2, further comprising storing line numbers indicating the beginning of each block.
4. The computerized implemented method of claim 1, wherein the uncommon data, is stored separately from the common data in a separate file.
5. The computerized implemented method of claim 1, wherein different types of data of the common data are stored in different files according to the type of data.
6. The computerized implemented method of claim 1, wherein the multiple formats are genome file formats.
7. The computerized implemented method of claim 6, wherein the multiple formats are at least one of FASTQ, FASTA and sequence alignment map (SAM).
8. The computerized implemented method of claim 1, wherein the multiple formats are text related file formats.
9. A computerized implemented method for reading and decoding a file encoded at a format, which is readable to be processed according to multiple formats, comprising: reading a plurality of pointers in a metadata data part of the file at a storage space where a common data part and uncommon data part are stored, wherein each of the plurality of pointers associates between metadata which defines how to create the file to be readable to be processed according to the multiple file formats and a relevant common data or uncommon data.
10. The computerized implemented method of claim 9, wherein the common data and/or the uncommon data are compressed in blocks and an offset is stored indicating the beginning of each block.
11. An apparatus for creating a file encoded at a format, which is readable to be processed according to multiple file formats, comprising a processor configured to execute a code for: setting a file storage space for a file in a storage device; storing data common to the multiple file formats in a common data part of the file at the storage space; storing uncommon data of each of the multiple file formats in an uncommon data part of the file at the storage space; and storing a plurality of pointers in a metadata data part of the file at the storage space, wherein each of the plurality of pointers associates between metadata which defines how to create the file to be readable to be processed according to the multiple file formats and a relevant common data or uncommon data.
12. The apparatus of claim 11, wherein the processor is further configured to execute a code for compressing the common data and/or the uncommon data in blocks and storing an offset indicating the beginning of each block.
13. An apparatus for reading and decoding a file encoded at a format, which is readable to be processed according to multiple file formats comprising a processor configured to execute a code for: reading a plurality of pointers in a metadata data part of the file at a storage space where a common data part and uncommon data part are stored, wherein each of the plurality of pointers associates between metadata which defines how to create the file to be readable to be processed according to the multiple file formats and a relevant common data or uncommon data.
14. A computer program for execution by the processor of claim 11, the computer program comprising a call to said code.
15. A computer program for creating a file encoded at a format, which is readable to be processed according to multiple file formats, comprising: program instructions for executing, by a processor, a sequence of operations that comprises: setting a file storage space for a file in a storage device; storing data common to the multiple file formats in a common data part of the file at the storage space; storing uncommon data of each of the multiple file formats in an uncommon data part of the file at the storage space; and storing a plurality of pointers in a metadata data part of the file at the storage space, wherein each of the plurality of pointers associates between metadata which defines how to create the file to be readable to be processed according to the multiple file formats and a relevant common data or uncommon data.
16. A computer program for execution by the processor of claim 13, the computer program comprising a call to said code.
17. A computer program for reading and decoding a file encoded at a format, which is readable to be processed according to multiple file formats, comprising: program instructions for executing, by a processor, a sequence of operations that comprises: reading a plurality of pointers in a metadata data part of a file at a storage space where a common data part and uncommon data part are stored, wherein each of the plurality of pointers associates between metadata which defines how to create the file to be readable to be processed according to multiple file formats and a relevant common data or uncommon data.
PCT/EP2020/084595 2020-12-04 2020-12-04 An apparatus and method for creating, reading and decoding a file encoded at a format which is readable to be processed according to multiple file formats WO2022117205A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/EP2020/084595 WO2022117205A1 (en) 2020-12-04 2020-12-04 An apparatus and method for creating, reading and decoding a file encoded at a format which is readable to be processed according to multiple file formats
CN202080107480.5A CN116472526A (en) 2020-12-04 2020-12-04 Apparatus and method for creating, reading and decoding files encoded in a format readable thereby to be processed according to a plurality of file formats

Publications (1)

Publication Number Publication Date
WO2022117205A1 (en) 2022-06-09

Family

ID=73748057

Country Status (2)

Country Link
CN (1) CN116472526A (en)
WO (1) WO2022117205A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100281070A1 (en) * 2009-05-01 2010-11-04 Creative Technology Ltd Data file having more than one mode of operation
US20150227686A1 (en) * 2014-02-12 2015-08-13 International Business Machines Corporation Lossless compression of dna sequences
US20170168735A1 (en) * 2015-12-10 2017-06-15 International Business Machines Corporation Reducing time to read many files from tape

Also Published As

Publication number Publication date
CN116472526A (en) 2023-07-21

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20820826

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 202080107480.5

Country of ref document: CN

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20820826

Country of ref document: EP

Kind code of ref document: A1