CN107423321B

CN107423321B - Method and device suitable for cloud storage of large-batch small files

Info

Publication number: CN107423321B
Application number: CN201710206089.0A
Authority: CN
Inventors: 郑晟
Original assignee: Beijing Yizhiyun Technology Co Ltd
Current assignee: Beijing yizhiyun Technology Co., Ltd
Priority date: 2017-03-31
Filing date: 2017-03-31
Publication date: 2020-12-01
Anticipated expiration: 2037-03-31
Also published as: CN107423321A

Abstract

The invention relates to a method and a device suitable for cloud storage of large-batch small files. The method comprises the following steps. S1, uploading a file: the files are selectively spliced according to a file splicing mode to obtain processed files, and the processed files are uploaded. S2, establishing or searching a storage grid: a data block whose size is larger than a set threshold is divided into a plurality of storage grids of equal size, or a data block is searched for whose remaining storage grids are sufficient to store the file processed in step S1. S3: the processed file of step S1 is parsed and stored in one or more storage grids of the data block of step S2. Because the small files are merged before uploading, the number of connection establishments is reduced and the uploading speed is improved; because each file is stored in a storage space of suitable size, the files are more convenient to update and access.

Description

Method and device suitable for cloud storage of large-batch small files

Technical Field

The invention belongs to the technical field of computer application, and particularly relates to a method and a device suitable for cloud storage of large-batch small files.

Background

In cloud storage, files are uploaded through a RESTful API, which in essence uses the HTTP protocol. If the files are uploaded one by one, most of the time is consumed in establishing HTTP connections while the actual data transmission takes little time, so the number of HTTP connection establishments needs to be reduced. In addition, when stored, a large number of small files occupy a large amount of storage resources, which is uneconomical and also degrades access performance. To solve these technical problems, the following schemes have been proposed:

First, the optimization strategy for massive small files generally consists of merging the small files for storage. For example, Facebook's open-source Haystack and Taobao's TFS adopt this strategy: merging files reduces the number of inodes, that is, the amount of metadata, so that the metadata can essentially be loaded into memory in full and a file can be read with a single disk IO, which greatly improves read performance and solves the read-performance problem of massive small files. The overall idea is to use a large data block to hold many small files and to establish a secondary index, converting the uploading of small files into the uploading of medium files, and converting the addressing of a small file into the addressing of a data block plus an offset within that data block.

Second, a Chinese patent discloses a read-write solution for ten-million-level small-file data [application number: CN201410560613.0]. When storing small files, this scheme opens up a large continuous disk space to store a large number of small files, that is, logically continuous data is stored as far as possible in continuous space on the disk array. The disk space is divided into a plurality of blocks of 64KB each, and the basic idea is as follows: each small file can only be stored in a single block and cannot span two blocks; each folder has one or more blocks, which store only the data of that folder; and the data of each file is stored in continuous disk space. The scheme stores logically continuous data in continuous space on the physical disk as far as possible, uses caching technology to act as a metadata server, raises cache utilization through simplified file information nodes, and thereby improves the access performance of small files.

Of the above schemes, the former merges files, which reduces the number of files; however, because the merged files are continuous in byte order, changing any one file is costly, so the method is only suitable for occasions with few updates. The latter uses continuous disk space and also has a disadvantage: in cloud storage, storage is virtualized, so continuous disk space is difficult to realize in practice.

Disclosure of Invention

The invention aims to solve the above problems and provides a method suitable for cloud storage of large-batch small files, which can merge small files for uploading and improve the uploading speed.

Another object of the present invention is to solve the above problems and provide a device suitable for cloud storage of large-batch small files, which can reduce the number of connections.

To achieve these objects, the invention adopts the following technical scheme:

The invention discloses a method suitable for cloud storage of large-batch small files, which comprises the following steps:

S1: uploading a file: the files are selectively spliced according to a file splicing mode to obtain processed files, and the processed files are uploaded;

S2: establishing or searching a storage grid: a data block whose size is larger than a set threshold is divided into a plurality of storage grids of equal size, or a data block is searched for whose remaining storage grids are sufficient to store the file processed in step S1;

S3: the processed file of step S1 is parsed and stored in one or more storage grids of the data block of step S2.

By the technical scheme, the small files are merged and uploaded, the merged files are stored by using the storage grids of the data blocks, the specific grids of the processed files in the data blocks can be accurately and quickly known for each processed file, and updating is very convenient.

In the above method suitable for cloud storage of large-batch small files, in step S1, the file splicing method is as follows:

The size of each file is judged: when the size of the file is larger than a preset size, the file is judged to be a non-small file and is uploaded on its own; when the size of the file is smaller than or equal to the preset size, the file is regarded as a small file and needs to be spliced;

wherein the splicing mode is: the small files are combined into a medium-sized processing file whose maximum size does not exceed a set threshold.

In the above method suitable for cloud storage of large-batch small files, in step S1, the following steps are further performed before splicing:

it is pre-judged whether the size of the medium-sized processing file that would result from merging the current small file exceeds a set threshold; if so, the current small file is not spliced; otherwise, it is spliced.

In the method suitable for cloud storage of large-batch small files, if it is pre-judged that the size of the medium-sized processing file would exceed the set threshold after the current small file is merged, then, in addition to not being spliced, the current small file is marked as a file to be spliced, so that it waits to be spliced with other small files into another medium-sized processing file.

In the method suitable for cloud storage of large-batch small files, in step S2, each processing file is stored in only one data block, and when one processing file occupies multiple storage grids of one data block, the storage grid occupied by the processing file may be a continuous storage grid or a discontinuous storage grid.

In the method suitable for cloud storage of large-batch small files, when a storage grid is established, a data block specification recording table is established for corresponding data blocks according to the size of the storage grid, and when the storage grid is searched for a processed file, the data blocks with full storage space are removed.

In the above method suitable for cloud storage of large-batch small files, in step S2, when a processing file to be stored exists, a data block with a suitable storage grid is preferentially searched or established according to the current processing file, and the current processing file is stored.

In the above method suitable for cloud storage of large-batch small files, in step S2, the method further includes the following steps:

S2-1: setting a reference value Y; if the size of the processing file is an integral multiple of the reference value Y,

then the storage grid size = Y × (processing file size / Y);

otherwise, the storage grid size = Y × (int(processing file size / Y) + 1);

where the int function denotes rounding down to an integer.

S2-2: inquiring a processing file data block relation table according to a processing file needing to be stored currently, checking whether the processing file is stored or not, if the record does not exist, indicating that the processing file is a new file, and executing a step S2-4; if the record exists, the file is indicated as a file needing updating, and step S2-3 is executed;

S2-3: obtaining the data block in which the processing file was originally stored, and comparing the storage grid size calculated in step S2-1 with the storage grid size of the original file; when the processing file and the original file need the same storage grid size, directly updating the corresponding storage grid in the data block of the original file; when they need different storage grid sizes, first releasing the storage space of the corresponding storage grid in the data block of the original file, and then executing step S2-4;

S2-4: searching all data blocks whose storage grid size equals the storage grid size calculated in step S2-1, and traversing these data blocks until a data block with an empty storage grid is found; then storing the processing file in an empty storage grid of that data block and updating the related data tables; if no suitable data block exists, a data block is taken and divided into a plurality of storage grids whose size equals the storage grid size calculated in step S2-1.
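Steps S2-2 to S2-4 can be sketched in memory as follows. This is only an illustration under stated assumptions, not the patented implementation: the names `Block`, `GridStore`, `store` and `BLOCK_SIZE` are hypothetical, the data block size is assumed to be 8M, each file is assumed to occupy exactly one storage grid, and the storage grid size from step S2-1 is passed in as `grid_size`.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 8 * 1024 * 1024  # assumed data block size (8M)


@dataclass
class Block:
    grid_size: int
    grids: dict = field(default_factory=dict)  # grid number -> file key

    def free_grid(self):
        """Return the lowest empty grid number, or None if the block is full."""
        for number in range(BLOCK_SIZE // self.grid_size):
            if number not in self.grids:
                return number
        return None


class GridStore:
    def __init__(self):
        self.blocks = []      # all data blocks, indexed by position
        self.file_block = {}  # file key -> (block index, grid number)

    def store(self, key, grid_size):
        """Store or update a file whose S2-1 grid size is `grid_size`."""
        location = self.file_block.get(key)           # S2-2: is the file known?
        if location is not None:                      # S2-3: it is an update
            index, number = location
            if self.blocks[index].grid_size == grid_size:
                return location                       # same size: update in place
            del self.blocks[index].grids[number]      # release the old grid
        return self._place(key, grid_size)            # S2-4

    def _place(self, key, grid_size):
        for index, block in enumerate(self.blocks):
            number = block.free_grid() if block.grid_size == grid_size else None
            if number is not None:
                break
        else:  # no suitable block: divide a fresh one into grids of this size
            block, index, number = Block(grid_size), len(self.blocks), 0
            self.blocks.append(block)
        block.grids[number] = key
        self.file_block[key] = (index, number)
        return index, number
```

In this sketch an update that keeps the same storage grid size returns the original location, while an update that changes the size frees the old grid so that a later file of the old size can reuse it.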

In the method for cloud storage of large-batch small files, in step S2-2, a query is performed to check whether a file exists by using a relative path in combination with a file name as a query object.

A device suitable for cloud storage of large-batch small files, which adopts the above method suitable for cloud storage of large-batch small files.

Compared with the prior art, the method and the device suitable for cloud storage of the large-batch small files have the following advantages: 1. small files can be uploaded in a file merging mode, so that the connection times are reduced, and the uploading speed is improved; 2. the small files are stored by introducing the data blocks with different storage grid specifications, so that the storage space can be utilized to the maximum extent, and meanwhile, the files are more convenient to update and access.

Drawings

FIG. 1 is a flowchart illustrating a first embodiment of the present invention;

FIG. 2 is a partial flowchart of a first embodiment of the present invention;

FIG. 3 is a partial flowchart of a second embodiment of the present invention.

Detailed Description

The invention can store large batches of small files more efficiently, reduces the number of connection establishments, and makes files easier to search and update. The following are preferred embodiments of the present invention, further described with reference to the accompanying drawings, but the present invention is not limited to these embodiments.

Example one

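As shown in FIG. 1, this embodiment discloses a method suitable for cloud storage of large-batch small files, which includes the following steps:

S1: uploading a file: the files are selectively spliced according to a file splicing mode to obtain processed files, and the processed files are uploaded;

Specifically, as shown in FIG. 2, the file splicing mode is as follows: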

The size of each file is judged: when the size of the file is larger than a preset size, the file is judged to be a non-small file and is uploaded on its own; when the size of the file is smaller than or equal to the preset size, the file is regarded as a small file and needs to be spliced. Since a file larger than 512K is generally not considered a small file, the preset size in this embodiment is 512K.

The splicing method comprises the following steps: combining the small files into a medium-sized processing file with the maximum size not exceeding a set threshold;

For splicing, each small file is described by the following fields:

File path: 512 bytes;

File name: 512 bytes;

File length: 4 bytes;

File content: 0 to 512K.
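One possible byte-level reading of the field sizes listed above is sketched below. The source gives only the field widths, so the UTF-8 encoding, the NUL padding, the big-endian unsigned length, and the names `pack_record` and `unpack_records` are all assumptions made for illustration.

```python
import struct

PATH_BYTES = 512
NAME_BYTES = 512
MAX_CONTENT = 512 * 1024
# Path field, name field, 4-byte length; big-endian byte order is an assumption.
HEADER = struct.Struct(f">{PATH_BYTES}s{NAME_BYTES}sI")


def pack_record(path: str, name: str, content: bytes) -> bytes:
    """Encode one small file as fixed-width header fields followed by its content."""
    raw_path, raw_name = path.encode("utf-8"), name.encode("utf-8")
    if len(raw_path) > PATH_BYTES or len(raw_name) > NAME_BYTES:
        raise ValueError("path or name does not fit its 512-byte field")
    if len(content) > MAX_CONTENT:
        raise ValueError("content is larger than 512K, so this is not a small file")
    # struct pads the 's' fields with NUL bytes up to the field width.
    return HEADER.pack(raw_path, raw_name, len(content)) + content


def unpack_records(blob: bytes):
    """Yield (path, name, content) for every record in a spliced processing file."""
    offset = 0
    while offset < len(blob):
        raw_path, raw_name, length = HEADER.unpack_from(blob, offset)
        offset += HEADER.size
        yield (raw_path.rstrip(b"\0").decode("utf-8"),
               raw_name.rstrip(b"\0").decode("utf-8"),
               blob[offset:offset + length])
        offset += length
```

With these widths every record carries a 1028-byte header, so a spliced processing file can be split back into its small files by walking the length fields.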

Splicing is performed according to the following format:

Of course, the size needs to be pre-judged before splicing: it is judged whether the size of the medium-sized processing file that would result from merging the current small file exceeds a set threshold (for example, 1M); if so, the current small file is not spliced; otherwise, it is spliced. In this way a processing file of no more than 1M is obtained and uploaded each time, so that small files are spliced while the spliced files do not become too large.

Further, if it is pre-judged that the size of the medium-sized processing file would exceed the set threshold after the current small file is merged, then, in addition to not being spliced, the current small file is marked as a file to be spliced, so that it waits to be spliced with other small files into another medium-sized processing file.
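The size check and the pre-judgment can be sketched as a simple planning pass. This is one reading of the text, not the definitive procedure: `plan_uploads` is a hypothetical name, per-file header overhead is ignored, and a deferred small file is assumed to start the next processing file.

```python
KB = 1024
SMALL_FILE_LIMIT = 512 * KB   # preset size of the embodiment
MERGE_THRESHOLD = 1024 * KB   # set threshold of the embodiment (1M)


def plan_uploads(files):
    """files: iterable of (name, size). Returns (solo uploads, spliced batches)."""
    solo, batches = [], []
    current, current_size = [], 0
    for name, size in files:
        if size > SMALL_FILE_LIMIT:
            solo.append(name)  # non-small file: uploaded on its own
            continue
        # Pre-judgment: would merging this file exceed the threshold?
        if current and current_size + size > MERGE_THRESHOLD:
            batches.append(current)        # close the current processing file
            current, current_size = [], 0  # the deferred file starts the next one
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return solo, batches
```

Each batch corresponds to one upload request, so the number of connections drops from one per small file to one per processing file.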

S2: establishing or searching a storage grid according to the size of the processing file: to increase the processing speed, when a storage grid is established, a data block specification recording table is established for the corresponding data block according to the size of the storage grid, and when a storage grid is searched for a processing file, data blocks whose storage space is full are excluded.

When no data block with a storage grid that can hold the processing file is found, a data block whose size is larger than a set threshold is divided into a plurality of storage grids of equal size; likewise, the specification of these storage grids is chosen so that N times the specification equals the size of the processing file, where N is an integer.

Further, for convenience of query and update, each processing file is stored in only one data block. When a processing file occupies several storage grids of one data block, those storage grids may be continuous or discontinuous: since each storage grid has its own number, even if the storage grids occupied by the processing file are discontinuous, it is only necessary to know their numbers.
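Why the grid numbers alone are enough can be shown with a short sketch. It assumes, beyond what the source states, that grids are numbered from 0 and that grid k occupies bytes k × gridsize to (k + 1) × gridsize of the data block; `write_file` and `read_file` are hypothetical helper names.

```python
def write_file(block: bytearray, grid_size: int, grid_numbers, data: bytes) -> None:
    """Spread `data` over the listed grids, in the given order."""
    for i, number in enumerate(grid_numbers):
        chunk = data[i * grid_size:(i + 1) * grid_size]
        start = number * grid_size
        block[start:start + len(chunk)] = chunk


def read_file(block: bytes, grid_size: int, grid_numbers, file_size: int) -> bytes:
    """Reassemble a file from the numbered grids it occupies, in order."""
    parts = []
    for number in grid_numbers:
        start = number * grid_size
        parts.append(block[start:start + grid_size])
    return b"".join(parts)[:file_size]
```

For example, a 6-byte file stored in grids 5 and 2 of a block with 4-byte grids is recovered by reading grid 5 and then grid 2, even though the two grids are not adjacent.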

S3: file storage: the processed file in step S1 is parsed and stored in one or more storage grids of the data blocks in step S2.

This embodiment is specifically described with a data block of size 8M:

Here, it can be understood that the number of storage grids is 1024 when the storage grid size is 8K; to accommodate processing files of various sizes, the size of the storage grids into which a data block is divided is dynamically variable. The following data tables associated with storage grids and data blocks are established for the system:

Table 1: Data block information table: blockinfo

This table records the usage information of a data block. usednum is the number of storage grids currently in use: if usednum = 8M/gridsize, the data block is completely used; otherwise the data block still has empty storage grids and can continue to be used.

Table 2: Processing file data block relation table: fileblock

This table records the data block in which a processing file is located.

Table 3: Processing file storage grid information table: filegrid

This table records the storage grids occupied by a file and the order of the storage grid numbers that make up the file.

The following example is given:

The record represents the file whose file id is 1: its data block id is 1 and the number of the storage grid it occupies is 12. The size of its storage space is determined by the gridsize corresponding to block id 1; if the gridsize is 8K, the file occupies 8K of storage space.
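The three tables and the example record can be modelled with an in-memory SQLite database. Only the table names and the `gridsize` and `usednum` fields come from the text; the column names `blockid`, `fileid`, `gridno` and `seq`, and the helper functions, are hypothetical and chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE blockinfo (blockid INTEGER PRIMARY KEY, gridsize INTEGER, usednum INTEGER);
    CREATE TABLE fileblock (fileid INTEGER PRIMARY KEY, blockid INTEGER);
    CREATE TABLE filegrid  (fileid INTEGER, gridno INTEGER, seq INTEGER);
""")
# The example from the text: file 1 lives in block 1 and occupies grid 12.
conn.execute("INSERT INTO blockinfo VALUES (1, 8192, 1)")
conn.execute("INSERT INTO fileblock VALUES (1, 1)")
conn.execute("INSERT INTO filegrid VALUES (1, 12, 0)")


def locate(fileid):
    """Return (blockid, gridsize, [grid numbers in order]) for a file."""
    blockid, gridsize = conn.execute(
        "SELECT b.blockid, b.gridsize FROM fileblock f "
        "JOIN blockinfo b ON b.blockid = f.blockid WHERE f.fileid = ?",
        (fileid,),
    ).fetchone()
    grids = [row[0] for row in conn.execute(
        "SELECT gridno FROM filegrid WHERE fileid = ? ORDER BY seq", (fileid,))]
    return blockid, gridsize, grids


def block_is_full(blockid, block_size=8 * 1024 * 1024):
    """A block is full when usednum equals block size / gridsize."""
    gridsize, usednum = conn.execute(
        "SELECT gridsize, usednum FROM blockinfo WHERE blockid = ?", (blockid,)
    ).fetchone()
    return usednum == block_size // gridsize
```

Looking up file 1 yields block 1, a gridsize of 8192 bytes and the grid list [12], so the file occupies 8K of storage, in line with the example.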

In this embodiment, the storage grid size gridsize of a data block is n times 8K, where 1 ≤ n ≤ 64 and n is an integer, giving 64 specifications in total.
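The 64 specifications can be enumerated in a few lines. The sketch assumes an 8M data block and, for grid sizes that do not divide 8M evenly, assumes that a block holds only whole storage grids (floor division); the source does not say how such a remainder is handled, and `SPECS` and `grids_per_block` are hypothetical names.

```python
KB = 1024
BLOCK_SIZE = 8 * 1024 * KB  # 8M data block, as in the embodiment

# The 64 storage grid specifications: n x 8K for n = 1..64.
SPECS = [n * 8 * KB for n in range(1, 65)]


def grids_per_block(grid_size: int) -> int:
    """Whole storage grids that fit in one block (floor division is an assumption)."""
    return BLOCK_SIZE // grid_size


for grid_size in (SPECS[0], SPECS[2], SPECS[-1]):
    count = grids_per_block(grid_size)
    unused = BLOCK_SIZE - count * grid_size
    print(f"{grid_size // KB}K grids: {count} per block, {unused // KB}K left over")
```

The smallest specification (8K) gives 1024 storage grids per block and the largest (512K) gives 16.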

In the embodiment, the small files are merged and uploaded, the merged files are stored by using the storage grids of the data blocks, the specific grids of the processed files in the data blocks can be accurately and quickly known for each processed file, and the updating processing is very convenient.

Further, the embodiment also comprises a device which is suitable for the cloud storage of the large-batch small files and adopts the method suitable for the cloud storage of the large-batch small files.

Example two

As shown in FIG. 3, this embodiment is similar to Example one, except that in step S2, when there is a processing file to be stored, this embodiment preferentially searches for or establishes a data block with a suitable storage grid according to the current processing file, so as to store the current processing file.

A suitable storage grid is one whose capacity is exactly or almost exactly filled by the current processing file, i.e. the current processing file can be stored in a single storage grid; when the size of the processing file is not an integer multiple of 8K, the created or found storage grid is slightly larger than the processing file.

That is, the specification of the storage grid found or established by the present embodiment is equal to or nearly equal to the size of the processed file.

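A suitable storage grid is obtained in the manner of step S2-1: a reference value Y is set; if the size of the processing file is an integral multiple of the reference value Y,

then the storage grid size = Y × (processing file size / Y);

otherwise, the storage grid size = Y × (int(processing file size / Y) + 1);

where the int function denotes rounding down to an integer.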

The reference value Y is the minimum unit of the storage grids into which a data block is divided. In this embodiment the storage grids of different data blocks may differ in size, but every storage grid size is an integer multiple of the reference value, for example 1 time, 2 times, 3 times, and so on. Y therefore satisfies the formula Y × N = data block size, where N is an integer; for example, for a data block size of 8M, Y may be 8K. For a processing file whose size is an integer multiple of 8K, say 5 times, i.e. a processing file of 40K, a storage grid consisting of 5 minimum units can be divided, with size 5 × 8K = 40K. For a processing file whose size is not an integer multiple of 8K, for example a processing file of 42K, a storage grid consisting of 6 minimum units can be divided, with size 6 × 8K = 48K. Thus, when there are many data blocks with vacant storage grids, processing files of different sizes can always find a storage grid suitable for storage, and the storage space of each data block is fully utilized.
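The 40K and 42K cases can be checked with a direct transcription of the step S2-1 rule. The function names `storage_grid_size` and `unused_space` are hypothetical, and Y defaults to the 8K of this embodiment.

```python
KB = 1024


def storage_grid_size(file_size: int, y: int = 8 * KB) -> int:
    """Step S2-1: the smallest multiple of y that can hold file_size bytes."""
    if file_size % y == 0:
        return y * (file_size // y)
    return y * (file_size // y + 1)  # int(file_size / y) + 1 units


def unused_space(file_size: int, y: int = 8 * KB) -> int:
    """Bytes of the storage grid that the file does not fill."""
    return storage_grid_size(file_size, y) - file_size


print(storage_grid_size(40 * KB) // KB)  # 5 units of 8K
print(storage_grid_size(42 * KB) // KB)  # 6 units of 8K
print(unused_space(42 * KB) // KB)       # space left unused in the 48K grid
```

The two branches together amount to rounding the file size up to the next multiple of Y, so the unused space in a storage grid is always smaller than Y.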

S2-2: inquiring a processing file data block relation table according to a processing file needing to be stored currently, checking whether the processing file is stored or not, if the record does not exist, indicating that the processing file is a new file, and executing a step S2-4; if the record exists, the file is indicated as a file needing updating, and step S2-3 is executed; here, step S2-1 and step S2-2 may be performed simultaneously or sequentially.

S2-4: searching all data blocks whose storage grid size equals the storage grid size calculated in step S2-1, and traversing these data blocks until a data block with a spare storage grid is found; then storing the processing file in a spare storage grid of that data block and updating the related data tables, which include the processing file storage grid information table, the processing file data block relation table, the data block information table and other related tables; if no suitable data block exists, a data block is taken and divided into a plurality of storage grids whose size equals the storage grid size calculated in step S2-1.

Further, in step S2-2, a query is made with the relative path in combination with the file name as a query object to check whether the file exists.

The embodiment stores the processing file in one of the data blocks as much as possible, so that the file is more convenient to update and access.

The specific embodiments described herein are merely illustrative of the spirit of the invention. Various modifications or additions may be made to the described embodiments or alternatives may be employed by those skilled in the art without departing from the spirit or ambit of the invention as defined in the appended claims.

Although terms such as storage grid, data block, small file, processing file and medium-sized processing file are used frequently herein, the possibility of using other terms is not excluded. These terms are used merely to describe and explain the nature of the present invention more conveniently; construing them as any additional limitation would be contrary to the spirit of the present invention.

Claims

1. A method suitable for cloud storage of large-batch small files is characterized by comprising the following steps:

s3: file storage: analyzing the processing file in the step S1 and storing the processing file in one or more storage grids of the data blocks in the step S2;

when a storage grid is established, a data block specification recording table is established for the corresponding data block according to the size of the storage grid, and the data block with full storage space is removed when the storage grid is searched for a processing file;

in step S2, when there is a processing file to be stored, preferentially searching or creating a data block with a suitable storage grid according to the current processing file to store the current processing file;

in step S2, the method further includes:

then the storage grid size = Y × (processing file size / Y);

otherwise, the storage grid size = Y × (int(processing file size / Y) + 1);

wherein the int function denotes rounding down to an integer;

2. The method for cloud storage of large batches of small files according to claim 1, wherein: in step S1, the file splicing method is as follows:

judging the size of the file, and when the size of the file is larger than a preset size, judging the file to be a non-small file and independently uploading the file;

when the size of the file is smaller than or equal to the preset size, the file is regarded as a small file and needs to be spliced;

3. The method for cloud storage of large batches of small files according to claim 2, wherein: in step S1, the following steps are also performed before splicing:

4. The method for cloud storage of large batches of small files according to claim 3, wherein: if it is pre-judged that the size of the synthesized medium-sized processing file exceeds a set threshold after the current small file is merged, then, while it is determined not to splice, the current small file is marked as a file to be spliced so as to wait to be spliced with other small files to generate another medium-sized processing file.

5. The method for cloud storage of large batches of small files according to claim 1, wherein: in step S2, each processing file is stored in only one data block, and when one processing file occupies a plurality of storage grids of one data block, the storage grid occupied by the processing file is a continuous storage grid or a discontinuous storage grid.

6. The method for cloud storage of large batches of small files according to claim 1, wherein: in step S2-2, a query is made with the relative path in combination with the file name as a query object to check whether the file exists.

7. An apparatus for cloud storage of large batch of small files using the method for cloud storage of large batch of small files according to any one of claims 1 to 6.