CN106991134B - Large-scale data cloud storage method based on object storage - Google Patents

Info

Publication number
CN106991134B
Authority
CN
China
Prior art keywords
data
container object
block
file
output example
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710146689.2A
Other languages
Chinese (zh)
Other versions
CN106991134A (en
Inventor
李�根
宋卓
冯博伦
王振国
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Human And Future Biotechnology (changsha) Co Ltd
Original Assignee
Human And Future Biotechnology (changsha) Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Human And Future Biotechnology (changsha) Co Ltd filed Critical Human And Future Biotechnology (changsha) Co Ltd
Priority to CN201710146689.2A priority Critical patent/CN106991134B/en
Publication of CN106991134A publication Critical patent/CN106991134A/en
Application granted granted Critical
Publication of CN106991134B publication Critical patent/CN106991134B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/18File system types
    • G06F16/182Distributed file systems
    • G06F16/1824Distributed file systems implemented using Network-attached Storage [NAS] architecture
    • G06F16/183Provision of network file services by network file servers, e.g. by using NFS, CIFS
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/13File access structures, e.g. distributed indices
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00ICT programming tools or database systems specially adapted for bioinformatics
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/01Protocols
    • H04L67/10Protocols in which an application is distributed across nodes in the network
    • H04L67/1097Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/50Network services
    • H04L67/56Provisioning of proxy services
    • H04L67/561Adding application-functional data or data for application control, e.g. adding metadata
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/50Network services
    • H04L67/56Provisioning of proxy services
    • H04L67/565Conversion or adaptation of application format or content
    • H04L67/5651Reducing the amount or size of exchanged application data

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Databases & Information Systems (AREA)
  • Signal Processing (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioethics (AREA)
  • Library & Information Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a large-scale data cloud storage method based on object storage. The implementation steps include: a client reads a large data file to be stored and forms at least one data substream, continuously accumulates the data of each substream in memory into data blocks of fixed size, and, while compressing each data block together with its description information into output examples, sends the output examples to a cloud platform; the cloud platform establishes a root container object that contains block container objects and, while receiving the output examples sent by the client for the large data file to be stored, saves each received output example as an object in the corresponding container object, with the output examples of each data substream stored in one or more block container objects. The invention is based on the idea of splitting the data into substreams and blocks and compressing them concurrently. It allows compression and upload to the cloud to proceed simultaneously and allows a highly targeted compression scheme to be applied to the data blocks of each data substream, which can substantially reduce both the time cost of uploading data and the economic cost of storing it.

Description

Large-scale data cloud storage method based on object storage
Technical field
The present invention relates to cloud storage technology for large-scale data, and in particular to a large-scale data cloud storage method based on object storage.
Background art
The era of large-scale data and the era of the cloud have arrived together, and cloud computing platforms have become an effective platform for large-scale data processing. Typical industries such as biology, finance and communications each produce hundreds of GB or even several TB of data locally every day. Because wide area network bandwidth is limited, the speed at which this massive data can be transferred to a cloud platform has become the bottleneck that restricts these fields from using cloud computing resources for data processing. In addition, the high cost of storing large-scale data on a cloud platform is one of the main reasons that keep enterprises from moving to the cloud.
Compressed storage is an effective means of addressing data storage and transfer to the cloud. For cloud storage of large-scale data, the current common practice is to first compress the source data with compression software and then transfer the entire compressed package to a storage server for block storage or file storage. For massive data, compressing and uploading usually takes hours or days; and when a user needs to read data inside the compressed package, the package must be fully decompressed first, which severely reduces the read/write efficiency of large-scale data.
In addition, cloud storage platforms now generally use object storage technology. Object storage treats all data stored in it as objects, and each object consists of three parts: 1. an object name (which may be hierarchical); 2. the corresponding object data block; 3. meta-information describing the object's attributes. Because object storage stores the object name and object data in the system as a simple key-value mapping, and data is retrieved and uploaded through simple Get and Put semantics, this kind of system can easily scale access performance horizontally on a large scale.
Unlike a local file system, object storage is based on simple Get and Put data operation semantics, which makes efficient random read/write access to the data inside an object difficult. Therefore, if a locally compressed file is simply stored as one object, then extracting data from it requires a Get of the entire compressed package before it can be unpacked and its internal file data extracted. Because the compressed package of large-scale data is itself not small (several GB or more), Get and Put operations on such large objects cannot make effective use of the horizontal scaling advantage of object storage and greatly reduce the access efficiency of the compressed data.
Summary of the invention
The technical problem to be solved by the present invention: in view of the above problems in the prior art, to provide a large-scale data cloud storage method based on object storage that is built on the idea of splitting data into substreams and blocks and compressing them concurrently, that allows compression and upload to the cloud to proceed simultaneously (compressing while uploading), that allows a highly targeted compression scheme to be applied to the data blocks of each data substream, and that can substantially reduce the time cost of uploading data and the economic cost of storing it.
To solve the above technical problem, the present invention adopts the following technical solution:
In one aspect, the present invention provides a large-scale data cloud storage method based on object storage, whose implementation steps include:
1) a client reads a large data file to be stored, forms at least one data substream from the file stream that is read, continuously accumulates the data of each data substream in memory into data blocks of a specified size, and, while compressing each data block and its description information into output examples, sends the output examples to a cloud platform, wherein the description information includes the data substream to which the data block belongs, the data block size and the data block number;
2) the cloud platform first establishes, based on object storage, a root container object that contains block container objects, and then, while receiving the output examples sent by the client, saves each received output example as an object in the corresponding container object, with the output examples of each data substream stored in one or more block container objects.
Preferably, the data blocks of a specified size formed in step 1) by continuously accumulating the data of a data substream in memory are data blocks of fixed size; and the detailed steps in step 1) of compressing a data block and its description information into output examples include: dividing the data block into segments by data field, calling a specified encoder or encoder combination to compress each data field, and finally splitting the compression result by a fixed size to obtain at least one output example of fixed size.
Preferably, after at least one data substream is formed from the file stream in step 1), the following operations are executed concurrently for the data substreams: continuously accumulating the data of each data substream in memory into data blocks of fixed size, compressing the data blocks and description information of each data substream into output examples, and sending the output examples of each data substream to the cloud platform.
Preferably, when the output examples are sent to the cloud platform in step 1), the client sends them using a pipe-and-filter pattern, so that the input stream and the output stream of each pipe between the client and the cloud platform remain synchronized.
Preferably, the large data file to be stored in step 1) is specifically a FASTQ file, and forming at least one data substream from the file stream in step 1) specifically means forming three data substreams from the file stream that is read: a metadata stream, a base sequence stream and a quality score stream.
Preferably, the detailed steps of step 2) include:
2.1) the cloud platform receives the output examples sent by the client for the large data file to be stored and first establishes a root container object based on object storage; at least one block container object, which supports independent decompression and random reads, is nested under the root container object; sub-container objects in one-to-one correspondence with the data substream types are nested under each block container object; each root container object, block container object and sub-container object is stored in the cloud platform as one object; the content of the root container object, block container objects and sub-container objects is empty, and their metadata is stored in the object metadata of the cloud platform; the name of the root container object includes the file path of the compressed file, the metadata of a block container object contains information identifying the root container object it belongs to, and the metadata of a sub-container object contains information identifying the block container object it belongs to, so that the root container object, block container objects and sub-container objects form a container subsystem with a tree-shaped organizational structure;
2.2) the cloud platform saves each received output example as an object in the corresponding container object, with the output examples of each data substream stored in one or more block container objects.
Preferably, step 2.2) further includes storing, in the metadata of each output example, description information about the line numbers that the output example covers in the large data file to be stored, and the output example further includes the line number information of the corresponding data block.
In another aspect, the present invention also provides a large-scale data cloud storage method based on object storage, expressed solely from the cloud platform side, whose implementation steps include:
S1) the cloud platform establishes, based on object storage, a root container object that contains block container objects;
S2) while receiving the output examples sent by the client for the large data file to be stored, wherein each output example contains a data block and its description information and the description information includes the data substream to which the data block belongs, the data block size and the data block number, the cloud platform saves each received output example as an object in the corresponding container object, with the output examples of each data substream stored in one or more block container objects.
Preferably, the detailed steps of step S1) include: the cloud platform receives the output examples sent by the client for the large data file to be stored and first establishes a root container object based on object storage; at least one block container object, which supports independent decompression and random reads, is nested under the root container object; sub-container objects in one-to-one correspondence with the data substream types are nested under each block container object; each root container object, block container object and sub-container object is stored in the cloud platform as one object; the content of the root container object, block container objects and sub-container objects is empty, and their metadata is stored in the object metadata of the cloud platform; the name of the root container object includes the file path of the compressed file, the metadata of a block container object contains information identifying the root container it belongs to, and the metadata of a sub-container object contains information identifying the block container it belongs to, so that the root container object, block container objects and sub-container objects form a container subsystem with a tree-shaped organizational structure.
Preferably, step S2) further includes storing, in the metadata of each output example, description information about the line numbers that the output example covers in the large data file to be stored, and the output example further includes the line number information of the corresponding data block.
The large-scale data cloud storage method based on object storage of the present invention has the following advantages:
1. In the method of the present invention, the cloud platform establishes, based on object storage, a root container object that contains block container objects; while receiving the output examples sent by the client for the large data file to be stored (each output example contains a data block and its description information, and the description information includes the data substream to which the data block belongs, the data block size and the data block number), it saves each received output example as an object in the corresponding container object, with the output examples of each data substream stored in one or more block container objects. Being based on the idea of splitting data into substreams and blocks and compressing them concurrently, the method allows compression and upload to the cloud to proceed simultaneously (compressing while uploading), allows a highly targeted compression scheme to be applied to the data blocks of each data substream, and can substantially reduce the time cost of uploading data and the economic cost of storing it.
2. The present invention saves the received output examples as objects in a root container object that contains block container objects, with the output examples of each data substream stored in one or more block container objects. Because of the block container object concept, each block container object can be decompressed independently, which supports random reads: part of the content of the compressed file can be extracted directly without downloading and decompressing the entire compressed file, and the smaller the block container object, the faster the retrieval.
Brief description of the drawings
Fig. 1 is a schematic diagram of the basic flow of the method of an embodiment of the present invention.
Fig. 2 is a schematic diagram of the principle structure of the controllers and compressors in the method of the embodiment of the present invention.
Fig. 3 is a schematic diagram of the basic concepts of the container subsystem in the method of the embodiment of the present invention.
Fig. 4 is a schematic diagram of the basic structure of the container subsystem in the method of the embodiment of the present invention.
Detailed description of the embodiments
The large-scale data cloud storage method based on object storage of the present invention is described in further detail below, taking a gene sequencing data file (FASTQ file) as an example. A gene sequencer produces a massive number of short reads. The necessary conditions for identifying a gene sequencing data file (FASTQ file) are features such as: the only characters that can appear in a base sequence are {'A', 'T', 'G', 'C', 'N'}; each short read consists of 4 lines; each short read begins with the character "@"; and its third line is a "+" sign.
As shown in Fig. 1, the implementation steps of the large-scale data cloud storage method based on object storage in this embodiment include:
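The necessary conditions listed above can be checked mechanically. The following is a minimal sketch, not the patent's implementation; the function name `looks_like_fastq` and the sample reads are hypothetical, and the check covers only the features named in the text.

```python
VALID_BASES = set("ATGCN")  # the only characters allowed in a base sequence

def looks_like_fastq(lines):
    """Return True if `lines` satisfies the necessary conditions in the text:
    records of 4 lines, line 1 starts with '@', line 3 starts with '+',
    and line 2 contains only the characters A, T, G, C, N."""
    lines = [ln.rstrip("\n") for ln in lines]
    if not lines or len(lines) % 4 != 0:
        return False
    for i in range(0, len(lines), 4):
        header, bases, plus, _quality = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            return False
        if not set(bases) <= VALID_BASES:
            return False
    return True

# Hypothetical two-read sample used only for illustration.
sample = ["@read1", "ATGCN", "+", "IIIII", "@read2", "GGCCA", "+", "FFFFF"]
print(looks_like_fastq(sample))
```

These are necessary conditions only; a file that passes the check is not guaranteed to be valid FASTQ.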
1) a client reads a large data file to be stored, forms at least one data substream from the file stream that is read, continuously accumulates the data of each data substream in memory into data blocks of a specified size, and, while compressing each data block and its description information into output examples, sends the output examples to a cloud platform; the description information includes the data substream to which the data block belongs, the data block size and the data block number;
2) the cloud platform first establishes, based on object storage, a root container object that contains block container objects, and then, while receiving the output examples sent by the client, saves each received output example as an object in the corresponding container object, with the output examples of each data substream stored in one or more block container objects.
Combining steps 1) and 2), the method of this embodiment essentially accumulates and compresses on the client while transmitting at the same time (that is, compressing while uploading), and the storage of output examples on the server proceeds synchronously as well; in this embodiment, upload tasks to the cloud platform are processed in parallel through a thread queue synchronization mechanism. The method is based on the idea of splitting data into substreams and blocks and compressing them concurrently, allows compression and upload to the cloud to proceed simultaneously, allows a highly targeted compression scheme to be applied to the data blocks of each data substream, and can substantially reduce the time cost of uploading data and the economic cost of storing it. In this embodiment the cloud platform is specifically the Amazon AWS platform. The method is not limited to a specific cloud platform, however: to improve its generality, the client's output example sending module in an actual implementation wraps the official interfaces provided by the cloud platforms of different vendors (such as the Amazon AWS platform and the Alibaba Cloud OSS platform), so that users can use cloud platforms from different vendors.
The method of this embodiment involves fixed sizes in two places. (1) The data blocks of a specified size formed in step 1) by continuously accumulating the data of a data substream in memory are data blocks of fixed size. This is because, when a data block and its description information are compressed, some encoders achieve different compression ratios for different data block sizes; the data block size at which each encoder achieves its best compression (a high compression ratio together with a high compression speed) can be determined by testing. By forming data blocks of fixed size, this embodiment therefore ensures that the encoder used for compression achieves a good overall result, with a high compression ratio and a high compression speed. (2) In this embodiment, after each data block has been compressed, the compression result is split by a fixed size to obtain at least one output example of fixed size, so that the data is uploaded as a series of small output examples. This makes management on the cloud platform more convenient and makes it easy to support resumable transfer and random reads.
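The two fixed sizes can be illustrated with a short sketch. It assumes `zlib` as a stand-in for the patent's encoders, and the sizes (16 and 8 bytes) and function names are hypothetical values chosen only to keep the example small.

```python
import zlib

def accumulate_blocks(chunks, block_size):
    """Accumulate incoming byte chunks in memory and yield data blocks of
    exactly `block_size` bytes; only the final block may be shorter."""
    buffer = bytearray()
    for chunk in chunks:
        buffer.extend(chunk)
        while len(buffer) >= block_size:
            yield bytes(buffer[:block_size])
            del buffer[:block_size]
    if buffer:
        yield bytes(buffer)

def split_into_examples(compressed, example_size):
    """Split one compressed block into output examples of fixed size
    (the last example may be shorter)."""
    return [compressed[i:i + example_size]
            for i in range(0, len(compressed), example_size)]

# Fixed size (1): data blocks. Fixed size (2): output examples.
blocks = list(accumulate_blocks([b"ACGT" * 5, b"TTGA" * 5, b"N" * 3], block_size=16))
examples = split_into_examples(zlib.compress(blocks[0]), example_size=8)
```

Because every output example except the last has the same size, an interrupted upload can resume at the first example that is missing, which is one way to read the resumable transfer remark above.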
In this embodiment, the large data file to be stored in step 1) is specifically a FASTQ file, and forming at least one data substream from the file stream in step 1) specifically means forming three data substreams from the file stream that is read: a metadata stream, a base sequence stream and a quality score stream. In this embodiment, compressing the data blocks of each substream outputs a variable number of fixed-size files, and the files are numbered in increasing order to obtain the data block numbers.
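A minimal sketch of the three-way split follows. It assumes that line 1 of each 4-line record feeds the metadata stream, line 2 the base sequence stream and line 4 the quality score stream, and that the "+" line carries no extra information and can be dropped; the source does not spell out how line 3 is handled, so that part is an assumption.

```python
def split_fastq(lines):
    """Split FASTQ text lines into three substreams:
    metadata (line 1), base sequence (line 2), quality score (line 4).
    Line 3 ('+') is dropped here, which is an assumption of this sketch."""
    metadata, bases, quality = [], [], []
    for i, line in enumerate(lines):
        line = line.rstrip("\n")
        position = i % 4  # position of this line inside its 4-line record
        if position == 0:
            metadata.append(line)
        elif position == 1:
            bases.append(line)
        elif position == 3:
            quality.append(line)
    return metadata, bases, quality

# Hypothetical reads for illustration.
records = ["@read1 lane=1", "ATGC", "+", "IIII", "@read2 lane=1", "GGCN", "+", "FFF#"]
metadata_stream, base_stream, quality_stream = split_fastq(records)
```

Separating the streams groups data with similar statistics together, which is what allows a different encoder to be chosen for each substream.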
In this embodiment, the detailed steps in step 1) of compressing a data block and its description information into output examples include: dividing the data block into segments by data field, calling a specified encoder or encoder combination to compress each data field, and finally splitting the compression result by a fixed size to obtain at least one output example of fixed size. In this embodiment, the metadata stream is subdivided according to its characteristics into a repeated data field, an incremental data field, a random data field and so on, and each field is compressed using one of, or a combination of two or more of, UTF-8 encoding, repetition encoding, incremental encoding, BWT encoding (Burrows-Wheeler transform, block-sorting compression) and arithmetic coding; the base sequence stream and the quality score stream are first preprocessed with a BWT encoder and then compressed with an arithmetic encoder based on a context-based dynamic probability prediction model. For the metadata stream, base sequence stream and quality score stream alike, compressing each data block outputs a variable number of output examples, and each output example is named in the form data block number + example number.
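As an illustration of one of the encoders named above, the sketch below shows a simple form of incremental (delta) encoding applied to an incremental data field, together with a naming helper. The exact encoding used by the patent is not given in the text, and the underscore separator in `example_name` is a hypothetical choice; the source only says that names combine the data block number and the example number.

```python
def delta_encode(values):
    """Incremental encoding: keep the first value, then store each value
    as the difference from its predecessor."""
    encoded, previous = [], 0
    for value in values:
        encoded.append(value - previous)
        previous = value
    return encoded

def delta_decode(deltas):
    """Inverse of delta_encode: a running sum restores the original values."""
    values, total = [], 0
    for delta in deltas:
        total += delta
        values.append(total)
    return values

def example_name(block_number, example_number):
    """Name an output example as data block number + example number.
    The '_' separator is an assumption of this sketch."""
    return f"{block_number}_{example_number}"

# Hypothetical incremental field, e.g. read counters taken from FASTQ headers.
read_ids = [1001, 1002, 1003, 1005]
deltas = delta_encode(read_ids)
```

Small, repetitive deltas are easier for a downstream entropy coder to compress than the original increasing values, which is the usual reason for treating incremental fields separately.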
In this embodiment, after at least one data substream is formed from the file stream in step 1), the following operations are executed concurrently for the data substreams: continuously accumulating the data of each data substream in memory into data blocks of fixed size, compressing the data blocks and description information of each data substream into output examples, and sending the output examples of each data substream to the cloud platform. Referring to Fig. 2, this embodiment uses one controller for each data substream (the metadata stream has a metadata controller, the base sequence stream a base sequence controller, and the quality score stream a quality score controller). Each controller continuously accumulates the data of its substream in memory, forms data blocks of fixed size, and records the relevant information of each data block, such as the data substream it belongs to, the block size and the block number; this forms the description information for the data block, which is attached to the data block and passed with it to the subsequent compressor. Each data substream has its own independent compression module that receives the accumulated data blocks. Referring to Fig. 2, the metadata stream has a metadata compressor, the base sequence stream a base sequence compressor, and the quality score stream a quality score compressor. According to the characteristics of the data block, each compressor can, if needed, subdivide a data block into different data fields and call different encoders or encoder combinations. In this embodiment the compressors run with multi-threaded concurrency and, depending on the performance configuration parameters of the host, can compress N data blocks at the same time; each compressor contains a caller that selects an encoder or encoder combination according to the data characteristics.
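The controller and compressor arrangement can be sketched with a thread pool. This is an illustration under stated assumptions: `zlib` and `bz2` from the Python standard library stand in for the patent's encoders (bzip2 happens to be built on the Burrows-Wheeler transform, but it is not the encoder the patent describes), and the dictionary layout of a block is hypothetical.

```python
import bz2
import zlib
from concurrent.futures import ThreadPoolExecutor

# The "caller" in each compressor: pick an encoder by data substream.
# These stand-in encoders are assumptions, not the patent's encoders.
ENCODERS = {
    "metadata": zlib.compress,
    "bases": bz2.compress,
    "quality": bz2.compress,
}

def compress_block(block):
    """Compress one data block and carry its description information along:
    the substream it belongs to, the block size and the block number."""
    encoder = ENCODERS[block["substream"]]
    return {
        "substream": block["substream"],
        "number": block["number"],
        "size": len(block["data"]),
        "payload": encoder(block["data"]),
    }

def compress_all(blocks, max_workers=4):
    """Compress up to `max_workers` data blocks at the same time.
    Results come back in the same order as the input blocks."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(compress_block, blocks))

blocks = [
    {"substream": "metadata", "number": 0, "data": b"@read1\n@read2\n"},
    {"substream": "bases", "number": 0, "data": b"ATGC" * 50},
]
results = compress_all(blocks)
```

In this sketch `max_workers` plays the role of N, the number of data blocks compressed simultaneously, which the text ties to the host's performance configuration.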
The encoders used by the compression modules of different data substreams may have different compression speeds. In this embodiment, when output examples are sent to the cloud platform in step 1), the client sends them using a pipe-and-filter pattern, so that the input stream and output stream of each pipe between the client and the cloud platform remain synchronized. When all threads in a compression module's thread pool are occupied, data block pushes from the corresponding controller are blocked; for example, when the input stream of a pipe is faster than its output stream, the input stream is blocked, and when a pipe has no input stream, the pipe also blocks.
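The blocking behaviour described above resembles a bounded producer/consumer queue. The sketch below uses Python's `queue.Queue` as a stand-in for one pipe; the capacity value and the list that plays the role of the upload target are hypothetical.

```python
import queue
import threading

def run_pipeline(blocks, capacity=2):
    """Push blocks through one bounded pipe. The producer blocks while the
    pipe is full (input faster than output); the consumer blocks while the
    pipe is empty (no input)."""
    pipe = queue.Queue(maxsize=capacity)
    uploaded = []          # stand-in for objects sent to the cloud platform
    sentinel = object()    # marks the end of the input stream

    def producer():
        for block in blocks:
            pipe.put(block)        # blocks while the pipe is full
        pipe.put(sentinel)

    def consumer():
        while True:
            item = pipe.get()      # blocks while the pipe is empty
            if item is sentinel:
                break
            uploaded.append(item)

    threads = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return uploaded

sent = run_pipeline([b"block0", b"block1", b"block2", b"block3", b"block4"])
```

Bounding the pipe keeps memory use flat when compression outpaces the network, at the cost of stalling the faster stage.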
In this embodiment, the client reading the large data file to be stored in step 1) specifically means that the client reads a specified large data file to be stored, or traverses a specified directory and reads all large data files to be stored under it. When specifying what to store, the user can specify either a file or a directory, which makes the method of this embodiment more flexible to use and more widely applicable.
In this embodiment, the detailed steps of step 2) include:
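Accepting either a file or a directory can be sketched as follows. The function name `files_to_store` is hypothetical, and the sketch returns every file under the directory without filtering by type, which the source does not specify.

```python
import os

def files_to_store(path):
    """Return the files to upload: the file itself if `path` is a file,
    otherwise every file found by traversing the directory tree."""
    if os.path.isfile(path):
        return [path]
    found = []
    for directory, _subdirs, names in os.walk(path):
        for name in names:
            found.append(os.path.join(directory, name))
    return sorted(found)
```

Sorting the result gives a deterministic upload order, which is a convenience of this sketch rather than something the text requires.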
2.1) the cloud platform receives the output examples sent by the client for the large data file to be stored and first establishes a root container object based on object storage; at least one block container object, which supports independent decompression and random reads, is nested under the root container object; sub-container objects in one-to-one correspondence with the data substream types are nested under each block container object; each root container object, block container object and sub-container object is stored in the cloud platform as one object; the content of the root container object, block container objects and sub-container objects is empty, and their metadata is stored in the object metadata of the cloud platform; the name of the root container object includes the file path of the compressed file, the metadata of a block container object contains information identifying the root container object it belongs to, and the metadata of a sub-container object contains information identifying the block container object it belongs to, so that the root container object, block container objects and sub-container objects form a container subsystem with a tree-shaped organizational structure;
2.2) the cloud platform saves each received output example as an object in the corresponding container object, with the output examples of each data substream stored in one or more block container objects.
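Steps 2.1) and 2.2) can be sketched against a toy in-memory store. Everything here is an assumption made for illustration: `ToyObjectStore` stands in for a bucket, and the key layout (`block0/`, `metadata/` and so on) and metadata field names are hypothetical, since the source describes the hierarchy but not a concrete naming scheme.

```python
class ToyObjectStore:
    """In-memory stand-in for a bucket: key -> (content, metadata)."""
    def __init__(self):
        self.objects = {}

    def put(self, key, content=b"", metadata=None):
        self.objects[key] = (content, dict(metadata or {}))

    def get(self, key):
        return self.objects[key]

SUBSTREAMS = ("metadata", "bases", "quality")

def create_containers(store, file_path, block_count):
    """Step 2.1: create empty container objects whose metadata records
    which parent container each one belongs to."""
    root = file_path + "/"
    store.put(root, metadata={"type": "root"})
    for number in range(block_count):
        block_key = f"{root}block{number}/"
        store.put(block_key, metadata={"type": "block", "parent": root})
        for substream in SUBSTREAMS:
            store.put(f"{block_key}{substream}/",
                      metadata={"type": "sub", "parent": block_key})

def save_example(store, file_path, block, substream, name, payload, description):
    """Step 2.2: save one output example as an object in its container."""
    key = f"{file_path}/block{block}/{substream}/{name}"
    store.put(key, payload, description)
    return key

store = ToyObjectStore()
create_containers(store, "dir/sample.fastq", block_count=2)
key = save_example(store, "dir/sample.fastq", 0, "bases", "0_0",
                   b"\x01\x02", {"substream": "bases", "number": 0, "size": 2})
```

The containers hold no content of their own; the tree exists only through key names and parent links in metadata, which matches the flat key-value model of object storage.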
In this embodiment the container subsystem is formed from the root container object, block container objects and sub-container objects. Referring to Fig. 3, the container subsystem introduces the following three concepts to implement cloud object storage of compressed data: (1) The container, a form of organizing or linking cloud object storage. Because object storage does not support nesting, containers are used to organize and preserve the directory structure of the compressed file. A container corresponds to an Object (object/file) in a Bucket (storage space) of cloud object storage. (2) The example, an object that stores the compressed content of a data block. An example likewise corresponds to an Object (object/file) in a Bucket (storage space) of the cloud platform's object storage. (3) The metadata, which stores the description information of a compressed data block. Metadata corresponds to the Object metadata (object metadata) in a Bucket (storage space) of cloud object storage. Referring to Fig. 3, containers and examples are stored as Objects (objects/files), while metadata is stored as object metadata.
In the container subsystem of this embodiment, based on the container concept, a root container object represents one complete, independent compressed file, and container objects can be nested at multiple levels: a root container object can nest block container objects, and a block container object can nest sub-container objects. A root container object corresponds to one Object whose name can be expressed as a directory path, with '/' appended after the path. Because the name of the root container object can carry directory path information, a directory structure can be formed within a single Bucket, which makes it easier for users to manage compressed files and also supports packing and compressing a local directory or multiple source files as a whole into the corresponding Bucket directory. Ordinary object storage, by contrast, uses a flat data organization in which Objects are stored flat in the Bucket. Based on the block container concept, each block container object can be decompressed independently, which supports random reads: part of the content of the compressed file can be extracted directly, without downloading and decompressing the entire compressed file, and the smaller the block container object, the faster the retrieval.
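As a rough illustration of how path-bearing root container names yield a directory view over a flat Bucket, the sketch below filters a flat key list by prefix. The helper names `root_container_name` and `list_directory` are hypothetical, and the convention of appending '/' to `directory/filename` is an assumption drawn from the description above.

```python
def root_container_name(directory, filename):
    """Build a root container name: directory path, file name, trailing '/'."""
    return f"{directory.strip('/')}/{filename}/"


def list_directory(keys, directory):
    """Return the root containers directly under `directory`.

    `keys` is the flat list of object names in one bucket; no real
    nesting exists, so the directory view is derived from name prefixes.
    """
    prefix = directory.strip("/") + "/"
    found = set()
    for key in keys:
        if key.startswith(prefix):
            # The first path component after the prefix names the compressed file.
            first = key[len(prefix):].split("/", 1)[0]
            if first:
                found.add(prefix + first + "/")
    return sorted(found)
```

For example, a bucket holding `genomes/a.fq/`, `genomes/a.fq/block-0/`, and `genomes/b.fq/` would list two compressed files under `genomes`.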
Referring to Fig. 4, in this embodiment a root container object represents one complete FASTQ file. A FASTQ file has a fixed format in which every four lines form one record, so its contents can be processed separately according to that format during compression. Each separate compression operation requires one container object, and the compressed data itself is saved in instances, so one complete compressed file can nest multiple container objects. The root container object represents one complete compressed gene sequencing file. The root container object nests block container objects; referring to Fig. 4, each block container object is numbered with a serial number from 0 to N, so the root container object contains block container objects No. 0 through No. N. Taking the base sequence stream as an example (Fig. 4), the output instances for data blocks 0 to N-1 of the base sequence stream are saved in block container object No. 0, those for data blocks N to 2N-1 are saved in block container object No. 1, those for data blocks 2N to 3N-1 are saved in block container object No. 2, and so on. Within one block container object are nested a metadata sub-container object, a base sequence sub-container object, and a quality score sub-container object. The metadata sub-container object nests a repeated-data field instance, a random-data field instance, an incremental-data field instance, and so on; the base sequence sub-container object nests block instances No. 0 through No. N, and the quality score sub-container object nests block instances No. 0 through No. N. In this embodiment, the naming form of the root container object is directory/filename (directory is the directory, filename is the file name). For example, directory/filename1 and directory/filename2 denote two different compressed files stored in the same directory of the same Bucket.
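The two mechanisms in this paragraph, splitting four-line FASTQ records into sub-streams and assigning every N consecutive data blocks to one block container, might be sketched as below. The function names and the parameter `blocks_per_container` (the N of the text) are hypothetical, and dropping the '+' separator line is a simplifying assumption; the patent does not say which stream carries it.

```python
def split_fastq_records(lines):
    """Split FASTQ lines (four per record) into three sub-streams."""
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ input must contain a multiple of 4 lines")
    metadata, bases, quality = [], [], []
    for i in range(0, len(lines), 4):
        # Record layout: header, base sequence, '+' separator, quality scores.
        header, sequence, _separator, scores = lines[i:i + 4]
        metadata.append(header)
        bases.append(sequence)
        quality.append(scores)  # the separator line is dropped (assumption)
    return {"metadata": metadata, "sequence": bases, "quality": quality}


def locate_block(block_number, blocks_per_container):
    """Map a data block number to (block container number, slot inside it)."""
    return divmod(block_number, blocks_per_container)
```

With `blocks_per_container = 4`, data blocks 0 to 3 land in block container No. 0, blocks 4 to 7 in No. 1, and so on, matching the 0 to N-1, N to 2N-1 pattern described above.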
In this embodiment, step 2.2) further includes storing, in the metadata of each output instance, the line numbers that the output instance covers in the large data file to be stored; the description information of the output instance also includes the line number information of the corresponding data block. This allows the method of this embodiment to support retrieval by line number: the content at specified line numbers can be extracted directly from the compressed file, without downloading and decompressing the entire compressed file.
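Line-number retrieval can be illustrated with a small lookup over per-instance line ranges. This is a sketch under assumptions: the tuple layout `(key, start_line, line_count)`, the 0-based line numbering, and the function name `instances_for_lines` are all hypothetical, since the patent only states that line number information is kept in the instance metadata.

```python
def instances_for_lines(instances, first_line, last_line):
    """Return keys of instances whose line ranges overlap [first_line, last_line].

    `instances` is a list of (key, start_line, line_count) tuples, as might
    be read from object metadata; line numbers are 0-based and inclusive.
    """
    selected = []
    for key, start, count in instances:
        end = start + count - 1
        # Two closed intervals overlap when each starts before the other ends.
        if start <= last_line and end >= first_line:
            selected.append(key)
    return selected
```

Only the instances returned by the lookup need to be downloaded and decompressed, which is what makes partial extraction cheaper than fetching the whole compressed file.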
Embodiment two:
This embodiment is essentially the same as the method of embodiment one; the difference lies in the large data file to be stored. In this embodiment the large data file to be stored is a non-FASTQ file. Because a non-FASTQ file differs from a FASTQ file, when step 1) of this embodiment forms the read file stream into at least one data sub-stream, the file stream is in effect formed into a single kind of data sub-stream (a binary file stream). During compression, block-sorting compression is first applied as preprocessing, and then the data within each block is compressed a second time by arithmetic coding based on a bit-level dynamic prediction model.
Taking embodiment one and embodiment two together, it can be seen that the large data cloud storage method based on object storage of the present invention is not limited to a specific type of large data file. The number of data sub-streams formed is governed by the characteristics of the large data file itself; in particular, by forming a single data sub-stream (although this reduces the efficiency of compressing while transmitting), the method can be applied to all kinds of large data files. This is not elaborated further here.
Embodiment three:
This embodiment is essentially the same as the method of embodiment one; the difference is that this embodiment is a large data cloud storage method based on object storage that is oriented toward the cloud platform. The method of this embodiment only requires the client to provide universal instances of the large data file to be stored in the required format, and does not restrict how the client produces those instances.
In this embodiment, the implementation steps of the large data cloud storage method based on object storage include:
S1) The cloud platform establishes, based on object storage, a root container object containing block container objects;
S2) The cloud platform, on the one hand, receives the output instances that the client sends for the large data file to be stored, where each output instance contains a data block and its description information, and the description information includes the data sub-stream to which the data block belongs, the data block size, and the data block number; on the other hand, it saves each received output instance as an object in the corresponding container object, and the output instances of each data sub-stream are stored in one or more block container objects.
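Step S2) can be sketched as a producer and consumer pair in which the cloud side stores each instance as soon as it arrives. This is a hedged illustration only: `OutputInstance`, `make_instances`, `receive_and_store`, the key layout, and the use of `zlib` are assumptions, and the description information is kept beside the compressed payload for simplicity rather than compressed together with it.

```python
import zlib
from dataclasses import dataclass


@dataclass
class OutputInstance:
    stream: str        # data sub-stream the block belongs to
    block_size: int    # size of the uncompressed data block in bytes
    block_number: int  # sequence number of the block within its sub-stream
    payload: bytes     # compressed data block (zlib is a stand-in compressor)


def make_instances(stream, data, block_size):
    """Client side: lazily yield one output instance per data block."""
    for number, start in enumerate(range(0, len(data), block_size)):
        block = data[start:start + block_size]
        yield OutputInstance(stream, len(block), number, zlib.compress(block))


def receive_and_store(bucket, root_name, instances, blocks_per_container):
    """Cloud side: save each instance as an object while receiving continues."""
    for inst in instances:
        # Consecutive block numbers are grouped into the same block container.
        container = inst.block_number // blocks_per_container
        key = f"{root_name}block-{container}/{inst.stream}/{inst.block_number}"
        description = {
            "stream": inst.stream,
            "size": inst.block_size,
            "number": inst.block_number,
        }
        bucket[key] = (inst.payload, description)
    return bucket
```

Because `make_instances` is a generator, each block is compressed only when the receiver asks for it, which loosely mirrors the receive-while-saving behavior described in S2).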
In this embodiment, the detailed steps of step S1) include: the cloud platform receives the output instances that the client sends for the large data file to be stored and first establishes a root container object based on object storage. Nested under the root container object is at least one block container object, which supports independent decompression and random reads; nested under each block container object are sub-container objects in one-to-one correspondence with the data sub-stream types. Each root container object, block container object, and sub-container object is stored as one object in the cloud platform. The content of all three kinds of container object is empty, and their metadata is stored in the metadata objects of the cloud platform. The name of the root container object contains the file path of the compressed file; the metadata of a block container object contains information identifying the root container it belongs to, and the metadata of a sub-container object contains information identifying the block container it belongs to, so that the root container object, block container objects, and sub-container objects form a container subsystem with a tree-shaped organizational structure.
In this embodiment, step S2) further includes storing, in the metadata of each output instance, the line numbers that the output instance covers in the large data file to be stored; the description information of the output instance also includes the line number information of the corresponding data block. This allows the method of this embodiment to support retrieval by line number: the content at specified line numbers can be extracted directly from the compressed file, without downloading and decompressing the entire compressed file.
Taking embodiment one and embodiment three together, it can be seen that the large data cloud storage method based on object storage of the present invention does not depend on the specific data sub-stream division method, data block accumulation method, data compression method, or data transmission method described in embodiment one or elsewhere. Details are not repeated here.
The above are only preferred embodiments of the present invention. The scope of protection of the present invention is not limited to the above embodiments; all technical solutions under the concept of the present invention fall within its scope of protection. It should be noted that, for those of ordinary skill in the art, improvements and modifications made without departing from the principles of the present invention shall also be regarded as falling within the scope of protection of the present invention.

Claims (8)

1. A large data cloud storage method based on object storage, characterized in that the implementation steps include:
1) The client reads the large data file to be stored and forms the read file stream into at least one data sub-stream; for each data sub-stream it continuously accumulates data in memory to form data blocks of a specified size; on the one hand it compresses each data block together with its description information to form an output instance, and on the other hand it sends the output instances to the cloud platform; the description information includes information on the data sub-stream to which the data block belongs, the data block size, and the data block number;
2) The cloud platform first establishes, based on object storage, a root container object containing block container objects; then, on the one hand, it receives the output instances sent by the client, and on the other hand it saves each received output instance as an object in the corresponding container object, and the output instances of each data sub-stream are stored in one or more block container objects;
The detailed steps of step 2) include:
2.1) The cloud platform receives the output instances that the client sends for the large data file to be stored and first establishes a root container object based on object storage; nested under said root container object is at least one block container object, which supports independent decompression and random reads; nested under each block container object are sub-container objects in one-to-one correspondence with the data sub-stream types; each root container object, block container object, and sub-container object is stored as one object in the cloud platform; the content of said root container object, block container objects, and sub-container objects is empty, and their metadata is stored in the metadata objects of the cloud platform; the name of the root container object contains the file path of the compressed file, the metadata of a block container object contains information identifying the root container object it belongs to, and the metadata of a sub-container object contains information identifying the block container object it belongs to, so that the root container object, block container objects, and sub-container objects form a container subsystem with a tree-shaped organizational structure;
2.2) The cloud platform saves each received output instance as an object in the corresponding container object, and the output instances of each data sub-stream are stored in one or more block container objects.
2. The large data cloud storage method based on object storage according to claim 1, characterized in that the data blocks of a specified size formed in step 1) by continuously accumulating the data of a data sub-stream in memory are data blocks of a fixed size; the detailed steps in step 1) of compressing a data block and its description information to form an output instance include: subdividing the data block by data field, calling a specified encoder or combination of encoders to compress each different data field, and finally splitting the compression result by a fixed size to obtain at least one output instance of fixed size.
3. The large data cloud storage method based on object storage according to claim 1, characterized in that: after the read file stream is formed into at least one data sub-stream in step 1), the operations of continuously accumulating the data of each data sub-stream in memory to form data blocks of a fixed size, compressing the data blocks and description information of each data sub-stream to form output instances, and sending the output instances of each data sub-stream to the cloud platform are executed concurrently.
4. The large data cloud storage method based on object storage according to claim 1, characterized in that: when the output instances are sent to the cloud platform in step 1), the client sends them in a pipeline/filter manner, so that the input stream and the output stream of each pipeline between the client and the cloud platform remain synchronized.
5. The large data cloud storage method based on object storage according to claim 1, characterized in that the large data file to be stored in step 1) specifically refers to a FASTQ file, and forming the read file stream into at least one data sub-stream in step 1) specifically refers to forming the read file stream into three kinds of data sub-stream: a metadata stream, a base sequence stream, and a quality score stream.
6. The large data cloud storage method based on object storage according to claim 1, characterized in that: step 2.2) further includes storing, in the metadata of each output instance, the line numbers that the output instance covers in the large data file to be stored, and the description information of said output instance further includes the line number information of the corresponding data block.
7. A large data cloud storage method based on object storage, characterized in that the implementation steps include:
S1) The cloud platform establishes, based on object storage, a root container object containing block container objects;
S2) The cloud platform, on the one hand, receives the output instances that the client sends for the large data file to be stored, where said output instance contains a data block and its description information, and said description information includes the data sub-stream to which the data block belongs, the data block size, and the data block number; on the other hand, it saves each received output instance as an object in the corresponding container object, and the output instances of each data sub-stream are stored in one or more block container objects;
The detailed steps of step S1) include: the cloud platform receives the output instances that the client sends for the large data file to be stored and first establishes a root container object based on object storage; nested under said root container object is at least one block container object, which supports independent decompression and random reads; nested under each block container object are sub-container objects in one-to-one correspondence with the data sub-stream types; each root container object, block container object, and sub-container object is stored as one object in the cloud platform; the content of said root container object, block container objects, and sub-container objects is empty, and their metadata is stored in the metadata objects of the cloud platform; the name of the root container object contains the file path of the compressed file, the metadata of a block container object contains information identifying the root container it belongs to, and the metadata of a sub-container object contains information identifying the block container it belongs to, so that the root container object, block container objects, and sub-container objects form a container subsystem with a tree-shaped organizational structure.
8. The large data cloud storage method based on object storage according to claim 7, characterized in that: step S2) further includes storing, in the metadata of each output instance, the line numbers that the output instance covers in the large data file to be stored, and the description information of said output instance further includes the line number information of the corresponding data block.
CN201710146689.2A 2017-03-13 2017-03-13 A kind of large data cloud storage method based on object storage Active CN106991134B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710146689.2A CN106991134B (en) 2017-03-13 2017-03-13 A kind of large data cloud storage method based on object storage

Publications (2)

Publication Number Publication Date
CN106991134A CN106991134A (en) 2017-07-28
CN106991134B true CN106991134B (en) 2019-04-05

Family

ID=59412115

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710146689.2A Active CN106991134B (en) 2017-03-13 2017-03-13 A kind of large data cloud storage method based on object storage

Country Status (1)

Country Link
CN (1) CN106991134B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107770273A (en) * 2017-10-23 2018-03-06 上海斐讯数据通信技术有限公司 A kind of big file cloud synchronous method and system
CN108011966B (en) * 2017-12-14 2021-07-06 广东金赋科技股份有限公司 Optimization method for compressing and uploading logs of self-service terminal
CN110196836B (en) * 2019-03-29 2024-05-10 腾讯云计算(北京)有限责任公司 Data storage method and device
CN110349635B (en) * 2019-06-11 2021-06-11 华南理工大学 Parallel compression method for gene sequencing data quality fraction
CN110659252A (en) * 2019-08-12 2020-01-07 安诺优达生命科学研究院 Cloud-based biological information data delivery method and device and electronic equipment
CN110490450A (en) * 2019-08-15 2019-11-22 安诺优达生命科学研究院 Biological information management system based on mixed cloud
CN110740101A (en) * 2019-08-30 2020-01-31 贵州力创科技发展有限公司 big data cloud storage method and system based on object storage
CN111628779B (en) * 2020-05-29 2023-10-20 深圳华大生命科学研究院 Parallel compression and decompression method and system for FASTQ file
WO2022198483A1 (en) * 2021-03-24 2022-09-29 深圳市大疆创新科技有限公司 Data compression method and apparatus, movable platform, and storage medium
CN113259424A (en) * 2021-04-29 2021-08-13 西安点告网络科技有限公司 Cross-regional data transmission method and system

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6629105B1 (en) * 2000-02-19 2003-09-30 Novell, Inc. Facilitating user administration of directory trees
US8028002B2 (en) * 2004-05-27 2011-09-27 Sap Ag Naming service implementation in a clustered environment
CN103034649A (en) * 2011-09-30 2013-04-10 阿里巴巴集团控股有限公司 Method and system for realizing data storage and search
CN102882983A (en) * 2012-10-22 2013-01-16 南京云创存储科技有限公司 Rapid data memory method for improving concurrent visiting performance in cloud memory system
CN106294870A (en) * 2016-08-25 2017-01-04 苏州酷伴软件科技有限公司 Object-based distributed cloud storage method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
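Shi Xingang; "Design and Optimization of a Cloud Storage System for Massive Numbers of Users"; China Master's Theses Full-text Database, Information Science and Technology; 2013-12-15; Vol. 2013 (No. S2); pp. I137-65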

Also Published As

Publication number Publication date
CN106991134A (en) 2017-07-28

Similar Documents

Publication Publication Date Title
CN106991134B (en) A kind of large data cloud storage method based on object storage
CN107948334B (en) Data processing method based on distributed memory system
US10268398B2 (en) Storage system, recording medium for storing control program and control method for storage system
CN104090891B (en) Data processing method, Apparatus and system
CN109710614A (en) A kind of method and device of real-time data memory and inquiry
CN107438102A (en) A kind of cloud platform mirror image manufacturing system and its method
CN106294870B (en) Object-based distribution cloud storage method
CN106453572B (en) Method and system based on Cloud Server synchronous images
CN103514205A (en) Mass data processing method and system
CN110855638A (en) Remote sensing satellite data decompression processing system and method based on cloud computing
CN102902724A (en) Mass raster tile map release method
JP2023501054A (en) Partial download of compressed data
CN103944744A (en) Method and system for log acquisition
CN106161074A (en) A kind of cloud terminal log processing method, Apparatus and system
CN108011966A (en) The optimization method that a kind of self-aided terminal log compression uploads
CN105260190A (en) Operation method and device for android application based on android system distribution technology
CN105426472A (en) Distributed computing system and data processing method thereof
CN109451317A (en) A kind of image compression system and method based on FPGA
CN106027615A (en) Object storage method and system
CN102497450A (en) Two-stage-system-based distributed data compression processing method
CN102882960A (en) Method and device for transmitting resource files
CN113360473A (en) Cloud storage computing system for medical inspection image big data
CN110019347A (en) A kind of data processing method, device and the terminal device of block chain
CN109803157A (en) A kind of sequence frame picture transmission method, system and electronic equipment based on video
CN109491807A (en) Data exchange method, device and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant