CN106991134B - A large-scale data cloud storage method based on object storage - Google Patents
A large-scale data cloud storage method based on object storage
- Publication number
- CN106991134B, CN201710146689.2A
- Authority
- CN
- China
- Prior art keywords
- data
- container object
- block
- file
- output example
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/18—File system types
- G06F16/182—Distributed file systems
- G06F16/1824—Distributed file systems implemented using Network-attached Storage [NAS] architecture
- G06F16/183—Provision of network file services by network file servers, e.g. by using NFS, CIFS
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/13—File access structures, e.g. distributed indices
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B50/00—ICT programming tools or database systems specially adapted for bioinformatics
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/10—Protocols in which an application is distributed across nodes in the network
- H04L67/1097—Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/50—Network services
- H04L67/56—Provisioning of proxy services
- H04L67/561—Adding application-functional data or data for application control, e.g. adding metadata
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/50—Network services
- H04L67/56—Provisioning of proxy services
- H04L67/565—Conversion or adaptation of application format or content
- H04L67/5651—Reducing the amount or size of exchanged application data
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computer Networks & Wireless Communication (AREA)
- Databases & Information Systems (AREA)
- Signal Processing (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biophysics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Biotechnology (AREA)
- Evolutionary Biology (AREA)
- General Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioethics (AREA)
- Library & Information Science (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The invention discloses a large-scale data cloud storage method based on object storage. Its implementation steps are as follows. The client reads the large data file to be stored and forms at least one data substream; for each substream it continuously accumulates data in memory to form data blocks of a fixed size, compressing each data block together with its description information into output instances while simultaneously sending the output instances to the cloud platform. The cloud platform establishes a root container object containing block container objects; while receiving the output instances that the client sends for the file to be stored, it saves each received output instance as an object in the corresponding container object, with the output instances of each data substream stored in one or more block container objects. Based on the idea of splitting the data into substreams and blocks and compressing them concurrently, the invention lets compression and upload proceed at the same time and lets a highly targeted compression scheme be applied to the data blocks of each data substream, which can substantially reduce both the time cost of uploading data and the economic cost of storing it.
Description
Technical field
The present invention relates to cloud storage technology for large-scale data, and in particular to a large-scale data cloud storage method based on object storage.
Background
The big data era and the cloud era have arrived together, and cloud computing platforms have become an effective platform for large-scale data processing. Typical industries such as biology, finance, and communications can each generate hundreds of GB or even several TB of data locally every day. Because wide-area network bandwidth is limited, the speed at which this massive data can be transferred to a cloud platform has become the bottleneck that restricts these fields from using cloud computing resources for data processing. In addition, the high cost of storing large-scale data on a cloud platform has become one of the main factors that keep enterprises from moving to the cloud.
Compressed storage is an effective way to address data storage and transfer in the cloud. For cloud storage of large-scale data, the common practice today is to compress the source data with compression software first and then transfer the whole compressed package to a storage server for block storage or file storage. With massive data, compression and upload typically take hours or even days; and when a user needs to read data inside the compressed package, the package must be decompressed completely, which severely degrades read and write efficiency for large-scale data.
Moreover, cloud storage platforms now generally use object storage technology. Object storage treats all of the data it stores as objects, and each object consists of three parts: (1) the object name, which can be hierarchical; (2) the corresponding object data block; and (3) the metadata describing the object's attributes. Because object storage keeps the object name and object data in the system as a simple key-value mapping and retrieves and uploads data through simple Get and Put semantics, such systems can easily scale access performance out horizontally on a large scale.
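The key-value model described above can be illustrated with a minimal in-memory sketch in Python. The names `ObjectStore`, `StoredObject`, `put`, and `get` are hypothetical and chosen only for illustration; they do not correspond to any particular cloud vendor's API.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class StoredObject:
    """One object: its data block plus metadata describing its attributes."""
    data: bytes
    metadata: Dict[str, str] = field(default_factory=dict)


class ObjectStore:
    """Hypothetical flat key-value store offering only Put and Get."""

    def __init__(self) -> None:
        self._objects: Dict[str, StoredObject] = {}

    def put(self, name: str, data: bytes,
            metadata: Optional[Dict[str, str]] = None) -> None:
        # The (possibly hierarchical) object name is simply the key.
        self._objects[name] = StoredObject(data, dict(metadata or {}))

    def get(self, name: str) -> StoredObject:
        # Get returns the whole object; this model has no partial read.
        return self._objects[name]


store = ObjectStore()
store.put("genomes/sample1.fastq.gz", b"compressed bytes", {"type": "archive"})
obj = store.get("genomes/sample1.fastq.gz")
```

Because each name maps independently to its data, such a store can be partitioned by key, which is one way to read the horizontal-scaling claim in the text.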
Unlike a local file system, object storage is built on simple Get and Put operations, so it is difficult to perform efficient random read and write access to the data inside an object. Consequently, if a locally compressed file is simply stored as a single object, then extracting data from it requires a Get of the entire compressed package before the package can be unpacked and its internal file data extracted. Since the compressed package of a large-scale data set is itself not small (several GB or more), Get and Put operations on such large objects cannot exploit the horizontal-scaling advantage of object storage, and they greatly reduce the access efficiency of compressed data.
Summary of the invention
The technical problem to be solved by the present invention: in view of the above problems in the prior art, to provide a large-scale data cloud storage method based on object storage that is based on the idea of splitting data into substreams and blocks and compressing them concurrently, that lets data compression and upload to the cloud proceed at the same time (compressing while transmitting), that supports highly targeted compression schemes for the data blocks of different data substreams, and that can substantially reduce the time cost of uploading data and the economic cost of storing it.
To solve the above technical problem, the present invention adopts the following technical solution:
In one aspect, the present invention provides a large-scale data cloud storage method based on object storage, whose implementation steps include:
1) the client reads the large data file to be stored and splits the file stream it reads into at least one data substream; for each data substream it continuously accumulates data in memory to form data blocks of a specified size, compressing each data block together with its description information into output instances while sending the output instances to the cloud platform; the description information includes the data substream to which the data block belongs, the data block size, and the data block number;
2) the cloud platform first uses object storage to establish a root container object containing block container objects; then, while receiving the output instances sent by the client, it saves each received output instance as an object in the corresponding container object, with the output instances of each data substream stored in one or more block container objects.
Preferably, the data blocks of a specified size formed in step 1) by continuously accumulating the data of a data substream in memory are data blocks of a fixed size; and the detailed steps in step 1) of compressing a data block and its description information into output instances include: segmenting the data block by data field, calling a specified encoder or combination of encoders to compress each data field, and finally splitting the compression result by a fixed size to obtain at least one fixed-size output instance.
Preferably, after the file stream is split into at least one data substream in step 1), the operations of accumulating each data substream's data in memory into fixed-size data blocks, compressing each substream's data blocks and description information into output instances, and sending each substream's output instances to the cloud platform are executed concurrently across the data substreams.
Preferably, when output instances are sent to the cloud platform in step 1), the client sends them using a pipe-and-filter mechanism, so that the input stream and output stream of each pipe between the client and the cloud platform stay synchronized.
Preferably, the large data file to be stored in step 1) is specifically a FASTQ file, and forming at least one data substream from the file stream read in step 1) specifically means forming three data substreams: a metadata stream, a base sequence stream, and a quality score stream.
Preferably, the detailed steps of step 2) include:
2.1) the cloud platform receives the output instances that the client sends for the large data file to be stored, and first uses object storage to establish one root container object; nested under the root container object is at least one block container object, which supports independent decompression and random reads; nested under each block container object are sub-container objects in one-to-one correspondence with the data substream types; each root container object, block container object, and sub-container object is stored as one object in the cloud platform; the content of all three kinds of container object is empty, and their metadata is stored in the cloud platform's object metadata; the name of the root container object contains the file path of the compressed file, the metadata of a block container object contains information on the root container object it belongs to, and the metadata of a sub-container object contains information on the block container object it belongs to, so that the root container object, block container objects, and sub-container objects form a container system with a tree-shaped organizational structure;
2.2) the cloud platform saves each received output instance as an object in the corresponding container object, with the output instances of each data substream stored in one or more block container objects.
Preferably, step 2.2) further includes using the metadata of an output instance to store description information about the line numbers that the output instance covers in the large data file to be stored, and the output instance further includes the line number information of the corresponding data block.
In another aspect, the present invention also provides a large-scale data cloud storage method based on object storage, implemented as a method from the perspective of the cloud platform alone, whose implementation steps include:
S1) the cloud platform uses object storage to establish a root container object containing block container objects;
S2) the cloud platform, while receiving the output instances that the client sends for the large data file to be stored, saves each received output instance as an object in the corresponding container object, with the output instances of each data substream stored in one or more block container objects; each output instance contains a data block and its description information, and the description information includes the data substream to which the data block belongs, the data block size, and the data block number.
Preferably, the detailed steps of step S1) include: the cloud platform receives the output instances that the client sends for the large data file to be stored, and first uses object storage to establish one root container object; nested under the root container object is at least one block container object, which supports independent decompression and random reads; nested under each block container object are sub-container objects in one-to-one correspondence with the data substream types; each root container object, block container object, and sub-container object is stored as one object in the cloud platform; the content of all three kinds of container object is empty, and their metadata is stored in the cloud platform's object metadata; the name of the root container object contains the file path of the compressed file, the metadata of a block container object contains information on the root container it belongs to, and the metadata of a sub-container object contains information on the block container it belongs to, so that the root container object, block container objects, and sub-container objects form a container system with a tree-shaped organizational structure.
Preferably, step S2) further includes using the metadata of an output instance to store description information about the line numbers that the output instance covers in the large data file to be stored, and the output instance further includes the line number information of the corresponding data block.
The large-scale data cloud storage method based on object storage of the present invention has the following advantages:
1. In the method of the present invention, the cloud platform uses object storage to establish a root container object containing block container objects and, while receiving the output instances that the client sends for the large data file to be stored, saves each received output instance as an object in the corresponding container object, with the output instances of each data substream stored in one or more block container objects. Each output instance contains a data block and its description information, which includes the data substream the data block belongs to, the data block size, and the data block number. Being based on the idea of splitting data into substreams and blocks and compressing them concurrently, the method lets compression and upload proceed at the same time (compressing while transmitting) and supports highly targeted compression schemes for the data blocks of different data substreams, which can substantially reduce the time cost of uploading data and the economic cost of storing it.
2. The present invention saves the received output instances as objects in a root container object containing block container objects, with the output instances of each data substream stored in one or more block container objects. Thanks to the block container concept, each block container object can be decompressed independently, which supports random reads: part of the compressed file's content can be extracted directly without downloading and decompressing the entire compressed file, and the smaller the block container object, the faster the retrieval.
Brief description of the drawings
Fig. 1 is a schematic diagram of the basic flow of the method of the embodiment of the present invention.
Fig. 2 is a schematic diagram of the structure of the controllers and compressors in the method of the embodiment of the present invention.
Fig. 3 is a schematic diagram of the basic concepts of the container system in the method of the embodiment of the present invention.
Fig. 4 is a schematic diagram of the basic structure of the container system in the method of the embodiment of the present invention.
Specific embodiment
The large-scale data cloud storage method based on object storage of the present invention is described in further detail below, taking gene sequencing data files (FASTQ files) as an example. A gene sequencer generates massive amounts of short-read data. The following features serve as necessary conditions for identifying a gene sequencing data file (FASTQ file): the only characters that can appear in a base sequence are {'A', 'T', 'G', 'C', 'N'}; each short read consists of 4 lines; each short read begins with the character "@"; and its third line is a "+" sign.
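The necessary conditions listed above can be checked per record with a short Python sketch. The function name `looks_like_fastq_record` is hypothetical. The sketch tests only that the third line starts with "+", since in the FASTQ format that line may optionally repeat the read identifier, and it does not validate the quality line.

```python
from typing import Sequence

# The only characters the text allows in a base sequence.
BASES = set("ATGCN")


def looks_like_fastq_record(lines: Sequence[str]) -> bool:
    """Check the necessary conditions for one 4-line short read."""
    if len(lines) != 4:
        return False
    header, bases, plus, _quality = (line.rstrip("\n") for line in lines)
    return (
        header.startswith("@")      # each short read begins with "@"
        and set(bases) <= BASES     # only A, T, G, C, N in the base sequence
        and plus.startswith("+")    # third line is the "+" line
    )


record = ["@read1 sample", "GATTACAN", "+", "IIIIIIII"]
is_fastq = looks_like_fastq_record(record)
```

These are necessary rather than sufficient conditions, so a file that passes may still need further checks before being treated as FASTQ.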
As shown in Fig. 1, the implementation steps of the large-scale data cloud storage method based on object storage of this embodiment include:
1) the client reads the large data file to be stored and splits the file stream it reads into at least one data substream; for each data substream it continuously accumulates data in memory to form data blocks of a specified size, compressing each data block together with its description information into output instances while sending the output instances to the cloud platform; the description information includes the data substream to which the data block belongs, the data block size, and the data block number;
2) the cloud platform first uses object storage to establish a root container object containing block container objects; then, while receiving the output instances sent by the client, it saves each received output instance as an object in the corresponding container object, with the output instances of each data substream stored in one or more block container objects.
Taken together, steps 1) and 2) show that, in the method of this embodiment, the client accumulates and compresses data while transmitting it (compressing while transmitting), and the server stores output instances concurrently as well; in this embodiment, upload tasks to the cloud platform are processed in parallel through a thread queue synchronization mechanism. The method is based on the idea of splitting data into substreams and blocks and compressing them concurrently, lets data compression and upload to the cloud proceed at the same time, and supports highly targeted compression schemes for the data blocks of different data substreams, which can substantially reduce the time cost of uploading data and the economic cost of storing it. In this embodiment, the cloud platform is specifically the Amazon AWS platform. The method is not limited to a particular cloud platform, however: to make it more general, in the concrete implementation the client's output-instance sending module wraps the official interfaces provided by the cloud platforms of different vendors (such as Amazon AWS and Alibaba Cloud OSS), so that users can use cloud platforms from different vendors.
The method of this embodiment involves two fixed sizes. (1) The data blocks of a specified size formed in step 1) by continuously accumulating a data substream's data in memory are data blocks of a fixed size. The reason is that, when a data block and its description information are compressed, some encoders achieve different compression ratios for different data block sizes; the data block size at which each encoder performs best (high compression ratio and fast compression) can be determined by testing. By forming data blocks of a fixed size, this embodiment therefore ensures that the encoder used for compression achieves a good overall result, with both a high compression ratio and fast compression. (2) In this embodiment, after each data block has been compressed, the compression result is split by a fixed size to obtain at least one fixed-size output instance, so that the data is uploaded as a series of small output instances. This makes management on the cloud platform more convenient and makes it easy to support resumable transfer and random reads.
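The second fixed size, splitting a compression result into output instances, can be sketched in Python as follows. This is an illustration under stated assumptions: `split_fixed` is a hypothetical helper, `zlib` merely stands in for the encoders the text describes, and the final piece is allowed to be shorter than the fixed size because the text does not say whether it is padded.

```python
import zlib
from typing import List


def split_fixed(payload: bytes, instance_size: int) -> List[bytes]:
    """Split a compression result into pieces of at most instance_size bytes."""
    if instance_size <= 0:
        raise ValueError("instance_size must be positive")
    pieces = [payload[i:i + instance_size]
              for i in range(0, len(payload), instance_size)]
    return pieces or [b""]  # always return at least one instance


block = b"ACGT" * 4096              # one fixed-size data block (16 KiB)
compressed = zlib.compress(block)   # assumption: zlib replaces the real encoders
instances = split_fixed(compressed, 16)
```

Concatenating the instances in order restores the compression result, which is what lets an interrupted upload resume from the first missing instance.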
In this embodiment, the large data file to be stored in step 1) is specifically a FASTQ file, and forming at least one data substream from the file stream read in step 1) specifically means forming three data substreams: a metadata stream, a base sequence stream, and a quality score stream. In this embodiment, after each substream's data block is compressed, the output is a variable number of fixed-size files, which are numbered in increasing order to obtain the data block number.
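The three-way split described above can be sketched in Python by routing each line of a 4-line record according to its position. The function name `split_fastq` and the stream keys are hypothetical, and dropping the "+" separator line is an assumption of this sketch; the text does not say how that line is handled.

```python
from typing import Dict, Iterable, List


def split_fastq(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Route each 4-line FASTQ record into three data substreams."""
    streams: Dict[str, List[str]] = {"metadata": [], "bases": [], "quality": []}
    for index, line in enumerate(lines):
        text = line.rstrip("\n")
        position = index % 4    # position of the line inside its record
        if position == 0:
            streams["metadata"].append(text)   # "@..." header line
        elif position == 1:
            streams["bases"].append(text)      # base sequence line
        elif position == 3:
            streams["quality"].append(text)    # quality score line
        # position == 2 is the "+" separator; this sketch drops it
    return streams


fastq = ["@r1", "ACGT", "+", "IIII", "@r2", "TTGA", "+", "FFFF"]
streams = split_fastq(fastq)
```

Separating the lines this way groups data with similar statistics, which is what allows a different compression scheme per substream.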
In this embodiment, the detailed steps in step 1) of compressing a data block and its description information into output instances include: segmenting the data block by data field, calling a specified encoder or combination of encoders to compress each data field, and finally splitting the compression result by a fixed size to obtain at least one fixed-size output instance. In this embodiment, the metadata stream is subdivided according to its characteristics into a repeated data field, an incremental data field, a random data field, and so on, and each field is compressed with one of, or a combination of two or more of, UTF-8 encoding, repeat encoding, incremental encoding, BWT encoding (Burrows-Wheeler transform, block-sorting compression), and arithmetic coding. The base sequence stream and the quality score stream are first preprocessed with a BWT encoder and then compressed with an arithmetic encoder based on a context-dependent dynamic probability prediction model. For the metadata stream, base sequence stream, and quality score stream alike, compressing each data block yields a variable number of output instances, and each output instance is named by data block number plus instance number.
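As an illustration of why an incremental data field benefits from its own encoder, the following Python sketch applies plain delta encoding to a run of increasing integers. It is not the patent's encoder: `delta_encode` and `delta_decode` are hypothetical helpers, and the underscore-separated format in `instance_name` is an assumed rendering of the "data block number plus instance number" naming.

```python
from typing import List


def delta_encode(values: List[int]) -> List[int]:
    """Keep the first value, then store each difference from the previous one."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas: List[int]) -> List[int]:
    """Invert delta_encode by accumulating the differences."""
    values: List[int] = []
    total = 0
    for delta in deltas:
        total += delta
        values.append(total)
    return values


def instance_name(block_number: int, instance_number: int) -> str:
    """Assumed naming format: data block number plus instance number."""
    return f"{block_number}_{instance_number}"


read_ids = [1001, 1002, 1003, 1005]   # e.g. a counter inside FASTQ headers
encoded = delta_encode(read_ids)      # [1001, 1, 1, 2]
```

The small, repetitive differences produced here are much easier for a later entropy coder to compress than the original values.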
In this embodiment, after the file stream is split into at least one data substream in step 1), the operations of accumulating each data substream's data in memory into fixed-size data blocks, compressing each substream's data blocks and description information into output instances, and sending each substream's output instances to the cloud platform are executed concurrently. Referring to Fig. 2, this embodiment uses one controller per data substream (the metadata stream has a metadata controller, the base sequence stream a base sequence controller, and the quality score stream a quality score controller). Each controller continuously accumulates its substream's data in memory to form fixed-size data blocks and records information about each data block, such as the data substream it belongs to, the block size, and the block number; this forms the description information for the data block, which is attached to the data block and passed with it to the subsequent compressor. Each data substream has its own compression module that receives the accumulated data blocks: referring to Fig. 2, the metadata stream has a metadata compressor, the base sequence stream a base sequence compressor, and the quality score stream a quality score compressor. Depending on the characteristics of the data block, each compressor can, if necessary, subdivide a data block into different data fields and call different encoders or encoder combinations. In this embodiment, the compressors run with multiple concurrent threads and, depending on the performance configuration parameters of the host, can compress N data blocks at the same time; each compressor contains a dispatcher that selects the encoder or encoder combination according to the data characteristics.
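The controller and compressor roles described above can be sketched with Python's standard library. This is a simplified illustration, not the embodiment's implementation: `accumulate_blocks` and `compress_block` are hypothetical names, `zlib` stands in for the substream-specific encoders, and the description fields mirror the three items the text lists (substream, size, number).

```python
import zlib
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Iterator, List, Tuple

Description = Dict[str, object]


def accumulate_blocks(substream: str, chunks: List[bytes],
                      block_size: int) -> Iterator[Tuple[Description, bytes]]:
    """Controller: buffer incoming data and emit fixed-size blocks."""
    buffer = bytearray()
    number = 0
    for chunk in chunks:
        buffer.extend(chunk)
        while len(buffer) >= block_size:
            data = bytes(buffer[:block_size])
            del buffer[:block_size]
            yield {"substream": substream, "size": len(data),
                   "number": number}, data
            number += 1
    if buffer:  # flush the final, possibly shorter, block
        yield {"substream": substream, "size": len(buffer),
               "number": number}, bytes(buffer)


def compress_block(item: Tuple[Description, bytes]) -> Tuple[Description, bytes]:
    """Compressor: zlib is an assumption standing in for the real encoders."""
    description, data = item
    return description, zlib.compress(data)


blocks = list(accumulate_blocks("bases", [b"ACGT" * 3, b"TTGA" * 5], block_size=8))
with ThreadPoolExecutor(max_workers=4) as pool:   # up to 4 blocks at once
    results = list(pool.map(compress_block, blocks))
```

`pool.map` returns results in submission order, so the block numbers in the descriptions stay aligned with the compressed payloads even though compression runs concurrently.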
The encoders used by the compression modules of different data substreams may differ in compression speed. In this embodiment, when output instances are sent to the cloud platform in step 1), the client sends them using a pipe-and-filter mechanism, so that the input stream and output stream of each pipe between the client and the cloud platform stay synchronized. When all threads in a compression module's thread pool are occupied, the corresponding controller's data block pushes are blocked; for example, when a pipe's input stream is faster than its output stream, the input stream is blocked, and when a pipe has no input stream, the pipe also blocks.
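The blocking behaviour described above resembles a bounded producer-consumer queue, which the following Python sketch uses as a stand-in for one pipe. The names `pipe`, `producer`, `consumer`, and `DONE` are hypothetical, and the consumer only appends to a list where a real client would send the instance to the cloud platform.

```python
import queue
import threading
from typing import List

pipe = queue.Queue(maxsize=2)   # bounded pipe between two stages
DONE = object()                 # end-of-stream marker
uploaded: List[bytes] = []


def producer(instances: List[bytes]) -> None:
    for instance in instances:
        pipe.put(instance)      # blocks while the pipe is full
    pipe.put(DONE)


def consumer() -> None:
    while True:
        item = pipe.get()       # blocks while the pipe is empty
        if item is DONE:
            break
        uploaded.append(item)   # placeholder for the actual upload


data = [bytes([i]) * 4 for i in range(10)]
threads = [threading.Thread(target=producer, args=(data,)),
           threading.Thread(target=consumer)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With `maxsize=2`, a fast compressor can never run more than two instances ahead of a slow upload, which bounds memory use on the client.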
In this embodiment, the client reading the large data file to be stored in step 1) specifically means that the client reads a specified large data file to be stored, or that the client traverses a specified directory and reads the large data files to be stored under it. When specifying what to store, the user can therefore specify either a file or a directory, which makes the method of this embodiment more flexible to use and more widely applicable.
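Accepting either a file or a directory can be sketched in Python with `os.walk`. The function name `files_to_store` is hypothetical, and filtering by the `.fastq` and `.fq` suffixes is an assumption of this sketch; the text says only that the files to be stored under the directory are read.

```python
import os
from typing import List, Tuple


def files_to_store(path: str,
                   suffixes: Tuple[str, ...] = (".fastq", ".fq")) -> List[str]:
    """Return the files to upload for a path that is a file or a directory."""
    if os.path.isfile(path):
        return [path]
    found = []
    for root, _dirs, names in os.walk(path):   # traverse the directory tree
        for name in names:
            if name.lower().endswith(suffixes):
                found.append(os.path.join(root, name))
    return sorted(found)
```

Calling `files_to_store("reads/sample.fastq")` on an existing file returns that single path, while calling it on a directory returns every matching file beneath it in sorted order.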
In this embodiment, the detailed steps of step 2) include:
2.1) the cloud platform receives the output instances that the client sends for the large data file to be stored, and first uses object storage to establish one root container object; nested under the root container object is at least one block container object, which supports independent decompression and random reads; nested under each block container object are sub-container objects in one-to-one correspondence with the data substream types; each root container object, block container object, and sub-container object is stored as one object in the cloud platform; the content of all three kinds of container object is empty, and their metadata is stored in the cloud platform's object metadata; the name of the root container object contains the file path of the compressed file, the metadata of a block container object contains information on the root container object it belongs to, and the metadata of a sub-container object contains information on the block container object it belongs to, so that the root container object, block container objects, and sub-container objects form a container system with a tree-shaped organizational structure;
2.2) the cloud platform saves each received output instance as an object in the corresponding container object, with the output instances of each data substream stored in one or more block container objects.
In this embodiment, the root container object, block container objects, and sub-container objects form the container system. Referring to Fig. 3, the container system introduces the following three concepts to implement cloud object storage of compressed data. (1) Container: a form of organizing or linking cloud objects. Because object storage does not support nesting, containers are used to organize and preserve the complete directory structure of the compressed file. A container corresponds to an Object (object/file) in a Bucket (storage space) of the cloud object store. (2) Instance: an object that stores the compressed content of a data block. An instance likewise corresponds to an Object (object/file) in a Bucket (storage space) of the cloud platform's object store. (3) Metadata: the object that stores the description information packaged with a data block. Metadata corresponds to Object metadata (object metadata) in a Bucket (storage space) of the cloud object store. Referring to Fig. 3, containers and instances are stored as Objects (objects/files), while metadata is stored as object metadata.
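The three concepts above can be modelled in Python with a dictionary standing in for a Bucket. Everything here is a hypothetical illustration: the nested key layout (`root/block/substream/`), the metadata keys `kind` and `parent`, and the helper names are assumptions, since the text states only that container content is empty and that metadata records which container an object belongs to.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical in-memory bucket: object name -> (content, metadata)
bucket: Dict[str, Tuple[bytes, Dict[str, str]]] = {}


def put(name: str, content: bytes, metadata: Dict[str, str]) -> None:
    bucket[name] = (content, metadata)


def create_containers(file_path: str, block_count: int,
                      substreams: Iterable[str]) -> None:
    """Create empty root, block and sub-container objects linked by metadata."""
    substreams = list(substreams)
    root = file_path + "/"
    put(root, b"", {"kind": "root"})
    for block in range(block_count):
        block_name = f"{root}{block}/"
        put(block_name, b"", {"kind": "block", "parent": root})
        for substream in substreams:
            put(f"{block_name}{substream}/", b"",
                {"kind": "sub", "parent": block_name})


def put_instance(container: str, name: str, compressed: bytes,
                 description: Dict[str, str]) -> None:
    """Store an instance: compressed data as content, description as metadata."""
    put(container + name, compressed, dict(description, parent=container))


create_containers("directory/filename1", 2, ["metadata", "bases", "quality"])
put_instance("directory/filename1/0/bases/", "0_0", b"\x01\x02",
             {"substream": "bases", "block_number": "0"})
```

Although the bucket itself stays flat, following the `parent` entries from any instance leads back to the root container, which recovers the tree-shaped structure.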
In the container system of this embodiment, based on the container concept, one root container object represents one complete, independent compressed file, and container objects can be nested to multiple levels: a root container object can nest block container objects, and a block container object can nest sub-container objects. A root container object corresponds to one Object whose name can be expressed as a directory path followed by '/'. Because the root container object's name can carry directory path information, a directory structure can be formed within a single Bucket, which makes it easier for users to manage compressed files and also makes it possible to package a local directory or multiple source files as a whole and compress them into the corresponding Bucket directory. Ordinary object storage, by contrast, uses a flat data organization in which Objects are stored flat in the Bucket. Based on the block container concept, each block container object can be decompressed independently, which supports random reads: part of the compressed file's content can be extracted directly without downloading and decompressing the entire compressed file, and the smaller the block container object, the faster the retrieval.
Referring to Fig. 4, in the present embodiment a root container object represents one complete FASTQ file. A FASTQ file has a fixed format in which every 4 lines form one specific unit of content, so the content can be handled separately according to this format during compression. Each individual compression operation requires one container object, and the actual compressed data is saved in examples, so one complete compressed file can nest multiple container objects. The root container object represents one complete compressed gene sequencing file and nests block container objects. Referring to Fig. 4, each block container object is numbered with a serial number between 0 and N, so the root container object contains block container object No. 0 through block container object No. N. Taking the base sequence stream as an example (see Fig. 4), the output examples corresponding to data blocks 0 to N-1 of the base sequence stream are saved in block container object No. 0, those corresponding to data blocks N to 2N-1 are saved in block container object No. 1, those corresponding to data blocks 2N to 3N-1 are saved in block container object No. 2, and so on. Each block container object in turn nests a metadata sub-container object, a base sequence sub-container object, and a quality score sub-container object. The metadata sub-container object nests repeated-data-field examples, random-data-field examples, incremental-data-field examples, and so on; the base sequence sub-container object nests block examples No. 0 to No. N; and the quality score sub-container object nests block examples No. 0 to No. N. In the present embodiment, the naming form of a root container object is directory/filename (directory is the directory, filename is the file name). For example, directory/filename1 and directory/filename2 denote two different compressed files stored in the same directory of the same Bucket.
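The assignment of data blocks to block container objects described above amounts to an integer division. The following Python sketch assumes that each block container holds exactly N consecutive data blocks of a substream, as in the base sequence stream example; the function names and the parameter `blocks_per_container` (standing for N) are hypothetical.

```python
def block_container_number(block_number: int, blocks_per_container: int) -> int:
    """Return the number of the block container that holds a given data block.

    Blocks 0..N-1 map to container 0, blocks N..2N-1 to container 1, and so on,
    where N is blocks_per_container.
    """
    if block_number < 0:
        raise ValueError("block_number must be non-negative")
    if blocks_per_container <= 0:
        raise ValueError("blocks_per_container must be positive")
    return block_number // blocks_per_container


def blocks_in_container(container_number: int, blocks_per_container: int) -> range:
    """Return the range of data block numbers stored in one block container."""
    if container_number < 0:
        raise ValueError("container_number must be non-negative")
    if blocks_per_container <= 0:
        raise ValueError("blocks_per_container must be positive")
    start = container_number * blocks_per_container
    return range(start, start + blocks_per_container)
```

With N = 4, for instance, data block 7 belongs to block container No. 1, which covers data blocks 4 to 7.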
In the present embodiment, step 2.2) further comprises storing, through the metadata of each output example, the line number of that output example in the large data file to be stored; the description information of the output example also includes the line number information of the corresponding data block. This enables the method of the present embodiment to support retrieval by line number: the content at specified line numbers of a compressed data file can be extracted directly, without downloading and decompressing the entire compressed file.
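Retrieval by line number can be sketched as a lookup over the line ranges recorded in the metadata. The Python sketch below is illustrative only: the record type `ExampleMeta`, its fields, and the function `examples_for_lines` are hypothetical, and the patent does not specify how the line number information is encoded. The sketch assumes each output example records the first and last source line it covers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExampleMeta:
    """Hypothetical metadata record for one output example."""
    block_number: int
    first_line: int  # first source line covered (1-based, inclusive)
    last_line: int   # last source line covered (inclusive)


def examples_for_lines(index, first_line: int, last_line: int) -> list:
    """Return block numbers of the examples overlapping lines first_line..last_line."""
    if first_line > last_line:
        raise ValueError("first_line must not exceed last_line")
    return [
        meta.block_number
        for meta in index
        if meta.first_line <= last_line and meta.last_line >= first_line
    ]


# Example index: three data blocks covering 400 source lines each.
index = [
    ExampleMeta(block_number=0, first_line=1, last_line=400),
    ExampleMeta(block_number=1, first_line=401, last_line=800),
    ExampleMeta(block_number=2, first_line=801, last_line=1200),
]

needed = examples_for_lines(index, 390, 410)
```

Only the examples listed in `needed` (here those of blocks 0 and 1) would have to be downloaded and decompressed to serve lines 390 to 410.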
Embodiment two:
The method of the present embodiment is essentially the same as that of Embodiment one; the difference lies in the large data file to be stored. In the present embodiment, the large data file to be stored is a non-FASTQ file. Because a non-FASTQ file differs from a FASTQ file, when step 1) of the present embodiment forms the file stream that is read into at least one data substream, the file stream is in effect formed into a single kind of data substream (a binary file stream). During compression, block-sorting compression is first applied as preprocessing, and the data within each block is then compressed a second time using arithmetic coding based on a bit-level dynamic prediction model.
Combining Embodiment one and Embodiment two, it can be seen that the large data cloud storage method based on object storage of the present invention is not limited to a specific type of large data file. By controlling the number of data substreams formed according to the characteristics of the large data file itself, and in particular by forming a single data substream (although this reduces the efficiency of compressing while transmitting), the method can be applied to all kinds of large data files; this is not elaborated further here.
Embodiment three:
The method of the present embodiment is essentially the same as that of Embodiment one; the difference is that the present embodiment is a cloud-platform-oriented large data cloud storage method based on object storage. The method of the present embodiment only requires the client to provide generic examples of the large data file to be stored in the required format, and does not restrict the client's specific implementation of those generic examples.
In the present embodiment, the implementation steps of the large data cloud storage method based on object storage comprise:
S1) the cloud platform establishes, based on objects, a root container object containing block container objects;
S2) the cloud platform, on the one hand, receives the output examples that the client sends for the large data file to be stored, where each output example contains a data block and its description information, and the description information includes the data substream to which the data block belongs, the data block size, and the data block number; on the other hand, it saves each received output example as an object in the corresponding container object, and the output examples of each data substream are stored in one or more block container objects.
In the present embodiment, the detailed steps of step S1) comprise: upon receiving the output examples that the client sends for the large data file to be stored, the cloud platform first establishes a root container object based on object storage. At least one block container object, which supports individual decompression and random reading, is nested under the root container object, and sub-container objects in one-to-one correspondence with the data substream types are nested under each block container object. Each root container object, block container object, and sub-container object is stored as one object in the cloud platform. The content of all three kinds of container object is empty, and their metadata is stored in metadata objects of the cloud platform. The name of the root container object includes the file path of the compressed file; the metadata of a block container object contains information on the root container it belongs to, and the metadata of a sub-container object contains information on the block container it belongs to, so that the root container object, block container objects, and sub-container objects form a tree-structured container system.
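The tree-structured container system described above can be sketched in Python as follows. This is an illustrative model under stated assumptions rather than the cloud platform's actual interface: the object store and the metadata store are plain dictionaries, the naming pattern `block-<n>/` and the metadata keys `kind`, `parent`, and `substream` are hypothetical, and the three substream types are taken from the FASTQ case of Embodiment one.

```python
# Substream types of the FASTQ case; other file types may use a single substream.
SUBSTREAM_TYPES = ("metadata", "base_sequence", "quality_score")


def create_container_tree(store, metadata, root_name, block_count,
                          substreams=SUBSTREAM_TYPES):
    """Create empty root, block, and sub-container objects with parent links.

    Every container object has empty content; the tree structure lives only
    in the metadata, where each child records the container it belongs to.
    """
    root = root_name.rstrip("/") + "/"
    store[root] = b""
    metadata[root] = {"kind": "root", "path": root}
    for number in range(block_count):
        block = f"{root}block-{number}/"
        store[block] = b""
        metadata[block] = {"kind": "block", "parent": root}
        for substream in substreams:
            sub = f"{block}{substream}/"
            store[sub] = b""
            metadata[sub] = {"kind": "sub", "parent": block, "substream": substream}
    return root


def ancestors(metadata, name):
    """Follow the parent links in the metadata from a container up to the root."""
    chain = []
    while "parent" in metadata[name]:
        name = metadata[name]["parent"]
        chain.append(name)
    return chain


store, metadata = {}, {}
root = create_container_tree(store, metadata, "directory/filename", block_count=2)
path_to_root = ancestors(metadata, "directory/filename/block-1/quality_score/")
```

Because the parent links are kept in metadata rather than in object content, the hierarchy can be traversed without reading any compressed data.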
In the present embodiment, step S2) further comprises storing, through the metadata of each output example, the line number of that output example in the large data file to be stored; the description information of the output example also includes the line number information of the corresponding data block. This enables the method of the present embodiment to support retrieval by line number: the content at specified line numbers of a compressed data file can be extracted directly, without downloading and decompressing the entire compressed file.
Combining Embodiment one and Embodiment three, it can be seen that the large data cloud storage method based on object storage of the present invention does not depend on the specific data substream division method, data block accumulation method, data compression method, or data sending method described in Embodiment one or elsewhere; details are not repeated here.
The above is only a preferred embodiment of the present invention. The protection scope of the present invention is not limited to the above embodiments; all technical solutions that fall under the idea of the present invention belong to its protection scope. It should be noted that, for those of ordinary skill in the art, improvements and modifications made without departing from the principles of the present invention shall also be regarded as falling within the protection scope of the present invention.
Claims (8)
1. A large data cloud storage method based on object storage, characterized in that the implementation steps comprise:
1) a client reads the large data file to be stored and forms the file stream that is read into at least one data substream; for each data substream, the client continuously accumulates the data of the substream in memory to form data blocks of a specified size; on the one hand the client compresses each data block together with its description information to form an output example, and on the other hand it sends the output examples to the cloud platform; the description information includes information on the data substream to which the data block belongs, the data block size, and the data block number;
2) the cloud platform first establishes, based on objects, a root container object containing block container objects; then, on the one hand, it receives the output examples sent by the client and, on the other hand, it saves each received output example as an object in the corresponding container object, and the output examples of each data substream are stored in one or more block container objects;
The detailed steps of step 2) comprise:
2.1) upon receiving the output examples that the client sends for the large data file to be stored, the cloud platform first establishes a root container object based on object storage; at least one block container object, which supports individual decompression and random reading, is nested under the root container object; sub-container objects in one-to-one correspondence with the data substream types are nested under each block container object; each root container object, block container object, and sub-container object is stored as one object in the cloud platform; the content of the root container object, the block container objects, and the sub-container objects is empty in all three cases, and their metadata is stored in metadata objects of the cloud platform; the name of the root container object includes the file path of the compressed file; the metadata of a block container object contains information on the root container object it belongs to, and the metadata of a sub-container object contains information on the block container object it belongs to, so that the root container object, block container objects, and sub-container objects form a tree-structured container system;
2.2) the cloud platform saves each received output example as an object in the corresponding container object, and the output examples of each data substream are stored in one or more block container objects.
2. The large data cloud storage method based on object storage according to claim 1, characterized in that the data blocks of a specified size formed in step 1) by continuously accumulating the data of a data substream in memory are data blocks of a fixed size; and the detailed steps of compressing the data block and description information to form output examples in step 1) comprise: subdividing the data block according to data fields, calling a specified encoder or combination of encoders to compress each different data field, and finally splitting the compression result by a fixed size to obtain at least one output example of fixed size.
3. The large data cloud storage method based on object storage according to claim 1, characterized in that: after the file stream that is read is formed into at least one data substream in step 1), the operations of continuously accumulating the data of each data substream in memory to form fixed-size data blocks, compressing the data blocks and description information of each data substream to form output examples, and sending the output examples of each data substream to the cloud platform are executed concurrently for the respective data substreams.
4. The large data cloud storage method based on object storage according to claim 1, characterized in that: when the output examples are sent to the cloud platform in step 1), the client sends them using a pipe/filter pattern, so that the input stream and the output stream of each pipe between the client and the cloud platform remain synchronized.
5. The large data cloud storage method based on object storage according to claim 1, characterized in that the large data file to be stored in step 1) is specifically a FASTQ file, and forming the file stream that is read into at least one data substream in step 1) specifically means forming the file stream that is read into three kinds of data substream: a metadata stream, a base sequence stream, and a quality score stream.
6. The large data cloud storage method based on object storage according to claim 1, characterized in that: step 2.2) further comprises storing, through the metadata of each output example, the line number of that output example in the large data file to be stored, and the description information of the output example further includes the line number information of the corresponding data block.
7. A large data cloud storage method based on object storage, characterized in that the implementation steps comprise:
S1) the cloud platform establishes, based on objects, a root container object containing block container objects;
S2) the cloud platform, on the one hand, receives the output examples that the client sends for the large data file to be stored, where each output example contains a data block and its description information, and the description information includes the data substream to which the data block belongs, the data block size, and the data block number; on the other hand, it saves each received output example as an object in the corresponding container object, and the output examples of each data substream are stored in one or more block container objects;
The detailed steps of step S1) comprise: upon receiving the output examples that the client sends for the large data file to be stored, the cloud platform first establishes a root container object based on object storage; at least one block container object, which supports individual decompression and random reading, is nested under the root container object; sub-container objects in one-to-one correspondence with the data substream types are nested under each block container object; each root container object, block container object, and sub-container object is stored as one object in the cloud platform; the content of the root container object, the block container objects, and the sub-container objects is empty in all three cases, and their metadata is stored in metadata objects of the cloud platform; the name of the root container object includes the file path of the compressed file; the metadata of a block container object contains information on the root container it belongs to, and the metadata of a sub-container object contains information on the block container it belongs to, so that the root container object, block container objects, and sub-container objects form a tree-structured container system.
8. The large data cloud storage method based on object storage according to claim 7, characterized in that: step S2) further comprises storing, through the metadata of each output example, the line number of that output example in the large data file to be stored, and the description information of the output example further includes the line number information of the corresponding data block.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710146689.2A CN106991134B (en) | 2017-03-13 | 2017-03-13 | A kind of large data cloud storage method based on object storage |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106991134A (en) | 2017-07-28 |
CN106991134B (en) | 2019-04-05 |
Family
ID=59412115
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710146689.2A Active CN106991134B (en) | 2017-03-13 | 2017-03-13 | A kind of large data cloud storage method based on object storage |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106991134B (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |