CN111309702A

CN111309702A - Method and system for aggregation between files

Info

Publication number: CN111309702A
Application number: CN202010130050.7A
Authority: CN
Inventors: 王帅阳; 李文鹏; 张端
Original assignee: Suzhou Inspur Intelligent Technology Co Ltd
Current assignee: Suzhou Inspur Intelligent Technology Co Ltd
Priority date: 2020-02-28
Filing date: 2020-02-28
Publication date: 2020-06-19

Abstract

The application discloses a file aggregation method and system, applied to a user terminal in a distributed file system, comprising the following steps: receiving written data and storing it in a terminal cache until the writing is finished, aggregating all the written data in the terminal cache, and obtaining a target file in the terminal cache; and flushing the target file to a storage server. In this application, no I/O flush is triggered while the target file is being written: the data written to the target file is held in the terminal cache, and the target file is not flushed to the storage server until all of its data has been written. This avoids the situation in which a file is flushed before it has been completely written, so that subsequent writes become modify-writes and write efficiency drops. Aggregation within the file is thus achieved: all data of the whole file is aggregated into one file and flushed together, which improves file writing efficiency.

Description

Method and system for aggregation between files

Technical Field

The invention relates to the field of distributed file storage, in particular to a method and a system for aggregation between files.

Background

In a distributed file system, small files are currently cache-aggregated according to a fixed object size (generally 4 MB). In a concurrent scenario, however, concurrent client writes mean that many files are flushed before they have been completely written. Once such a file has been flushed, the data mapping between the small file and the large file already exists, so any data subsequently written to the small file becomes a modify-write. Modify-writes are ill-suited to the aggregation scenario: the processing logic is complex and writing is slow, which reduces overall write efficiency.

For this reason, a more efficient file aggregation method is required.

Disclosure of Invention

In view of the above, the present invention provides a method and system for intra-file and inter-file aggregation.

The specific scheme is as follows:

An intra-file aggregation method, applied to a user terminal in a distributed file system, comprises the following steps:

receiving and storing written data to a terminal cache for caching until the writing is finished, aggregating all the written data in the terminal cache, and obtaining a target file in the terminal cache;

and flushing the target file to a storage server.

Optionally, the receiving and storing the written data in a terminal cache for caching until the writing is completed, aggregating all the written data in the terminal cache, and obtaining the target file in the terminal cache includes:

and receiving and storing the written data to an AggMgr aggregation processing module for caching until a writing completion instruction is received, and aggregating all the written data in the AggMgr aggregation processing module to obtain the target file.

Optionally, the receiving and storing the written data to the AggMgr aggregation processing module for caching until a write completion instruction is received, and aggregating all the written data in the AggMgr aggregation processing module to obtain the target file includes:

receiving written data corresponding to a file identifier of an initial target file and storing it in the initial target file cached in advance in the AggMgr aggregation processing module;

and, when a writing completion instruction is received, aggregating in the AggMgr aggregation processing module according to the initial target file comprising all written data, to obtain the target file.

The invention also discloses an aggregation method among files, which is applied to a storage server in a distributed file system and comprises the following steps:

receiving a target file sent by a user terminal from a terminal cache;

judging whether the target file is a small file or not;

if so, the target file is flushed to a preset aggregation subfile in a cache layer in a large file additional writing mode;

the aggregation subfiles are flushed to the aggregation large file preset in the disk in a large file additional writing mode;

the aggregation large file is a preset file comprising 1 or more aggregation subfiles which are sequentially aggregated; the aggregation subfile is a file comprising 1 or more small files which are sequentially and closely aggregated; and the target file is a file obtained by aggregating all written data in the terminal cache.

Optionally, the process of determining whether the target file is a small file includes:

and judging whether the size of the target file is smaller than a preset threshold value.

Optionally, the process of flushing the target file to a preset aggregation subfile in a cache layer in a form of large file appending writing includes:

and brushing the target file into a preset Object in an Object cache layer in a large file appending writing mode.

Optionally, the process of flushing the aggregate subfile to a preset aggregate large file in a disk in a form of large file additional writing includes:

when the cache of the Object cache layer is full or no new writing is performed within a preset time threshold, the Object is flushed down to a preset aggregation large file in a disk.

Optionally, before the flushing the aggregate subfile to the preset aggregate large file in the disk in the form of large file additional writing, the method further includes:

storing the data mapping relation of the position of the target file in the aggregation large file in the target file;

and sending the data mapping relation to a metadata server.

The invention also discloses an intra-file aggregation system, which is applied to a user terminal in a distributed file system and comprises the following components:

the internal aggregation module is used for receiving and storing the written data to a terminal cache for caching until the writing is finished, aggregating all the written data in the terminal cache, and obtaining a target file in the terminal cache;

and the file sending module is used for flushing the target file to the storage server.

The invention also discloses an aggregation system among files, which is applied to a storage server in a distributed file system and comprises the following components:

the file receiving module is used for receiving a target file sent by the user terminal from the terminal cache;

the file screening module is used for judging whether the target file is a small file;

the first aggregation module is used for flushing the target file to an aggregation subfile preset in a cache layer in a large file additional writing mode if the file screening module judges that the target file is a small file;

the second aggregation module is used for flushing the aggregation subfiles to the aggregation large file preset in the disk in a large file additional writing mode;

In the invention, the intra-file aggregation method is applied to a user terminal in a distributed file system and comprises the following steps: receiving written data and storing it in a terminal cache until the writing is finished, aggregating all the written data in the terminal cache, and obtaining a target file in the terminal cache; and flushing the target file to a storage server.

According to the invention, no I/O flush is triggered while the target file is being written. The data written to the target file is held in the terminal cache, and the target file is not flushed to the storage server until all of its data has been written. This avoids the problem that a target file flushed before it is complete turns subsequent writes into modify-writes and reduces write efficiency. Aggregation within the file is thus achieved: all data of the whole file is aggregated into one file and flushed together, which improves file writing efficiency.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.

FIG. 1 is a schematic flow chart of an intra-file aggregation method according to an embodiment of the present invention;

FIG. 2 is a schematic flow chart of another intra-file aggregation method according to an embodiment of the present invention;

FIG. 3 is a schematic flow chart of an inter-file aggregation method according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of a large file structure according to an embodiment of the present invention;

FIG. 5 is a schematic structural diagram of an intra-file aggregation system according to an embodiment of the present invention;

FIG. 6 is a schematic structural diagram of an inter-file aggregation system according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The embodiment of the invention discloses an in-file aggregation method, which is applied to a user terminal in a distributed file system and is shown in figure 1, and comprises the following steps:

S11: Receive written data and store it in a terminal cache until the writing is finished, aggregate all the written data in the terminal cache, and obtain the target file in the terminal cache.

Specifically, when data that a user writes to a target file is received, the target file and the written data are cached in the terminal cache until the user has finished writing the whole target file. All the written data is then aggregated in the terminal cache, so that the complete target file is obtained in the terminal cache in one pass.

S12: Flush the target file to a storage server.

It is understood that after the complete target file is obtained, the user terminal may flush the target file to a storage server in the distributed storage system, thereby completing the storage of the target file.

Therefore, in the embodiment of the invention, no I/O flush is triggered while the target file is being written. Instead, the data written to the target file is held in the terminal cache, and the target file is not flushed to the storage server until all of its data has been written. This avoids the problem that a target file flushed before it is complete turns subsequent writes into modify-writes and reduces write efficiency. Aggregation within the file is thus achieved: all data of the whole file is aggregated into one file and flushed together, which improves file writing efficiency.

Specifically, intra-file aggregation is performed in the terminal cache before writing is completed, so that the I/O within a file is merged, the I/O path is shortened, and the speed at which files are persisted is increased.
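Steps S11 and S12 can be illustrated with a minimal Python sketch. The class name `TerminalCache`, the `finish` call that marks the end of writing, and the callback standing in for the storage server are all hypothetical and are not taken from the disclosure; the sketch only shows that nothing reaches the server until the whole file has been written.

```python
class TerminalCache:
    """Hypothetical client-side cache that buffers writes per target file."""

    def __init__(self, flush_to_server):
        self._pending = {}                  # file name -> list of (offset, bytes)
        self._flush_to_server = flush_to_server
        self.flush_count = 0

    def write(self, name, offset, data):
        # S11: only cache the data; no flush happens while writing is in progress.
        self._pending.setdefault(name, []).append((offset, bytes(data)))

    def finish(self, name):
        # Writing is finished: aggregate all cached pieces into one target file.
        pieces = self._pending.pop(name)
        size = max(off + len(buf) for off, buf in pieces)
        target = bytearray(size)
        for off, buf in pieces:             # later writes overwrite earlier ones
            target[off:off + len(buf)] = buf
        # S12: flush the complete target file to the storage server exactly once.
        self._flush_to_server(name, bytes(target))
        self.flush_count += 1
        return bytes(target)


# Usage: a dict stands in for the storage server (an assumption of this sketch).
server = {}
cache = TerminalCache(lambda name, data: server.__setitem__(name, data))
cache.write("a.txt", 6, b"world")
cache.write("a.txt", 0, b"hello ")
flushed_before_finish = dict(server)        # still empty: nothing was flushed yet
cache.finish("a.txt")
```

Because the server sees the file only once and in full, every write it receives is a plain write rather than a modification of an already flushed file.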

The embodiment of the invention discloses a specific intra-file aggregation method. Compared with the previous embodiment, this embodiment further explains and optimizes the technical scheme. Referring to fig. 2, specifically:

S21: Receive written data and store it in an AggMgr aggregation processing module for caching until a write completion instruction is received, and aggregate all the written data in the AggMgr aggregation processing module to obtain a target file.

Specifically, the AggMgr aggregation processing module may be used to cache the data written in the target file in the terminal cache.

Further, because data is usually written concurrently, data destined for several different target files may be received at the same time. To ensure that each piece of data is written into the correct target file, a file identifier is set, as shown in S211 and S212, wherein:

S211: Receive written data corresponding to the file identifier of an initial target file and store it in the initial target file cached in advance in the AggMgr aggregation processing module.

Specifically, each target file has a unique file identifier that distinguishes it from other files. Before the target file is obtained, an initial target file is newly created when the user begins to input data. The initial target file may contain no data; it serves as the file structure and framework of the target file, and the final target file is obtained by filling the initial target file with data.

Specifically, the file identifier of the initial target file ensures that the corresponding data is written into the correct initial target file, which avoids data writing errors caused by concurrency.

S212: When a write completion instruction is received, aggregate in the AggMgr aggregation processing module according to the initial target file containing all the written data, to obtain the target file.

Specifically, the write completion instruction may correspond to the close instruction issued when the user closes the file. When the user closes the initial target file, writing of the initial target file is complete; at that point, aggregation may be performed in the AggMgr aggregation processing module according to the initial target file containing all the written data, to obtain the target file.

S22: Flush the target file to a storage server.
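The flow of S211, S212 and S22 under concurrent writers might be sketched as follows. Only the name AggMgr comes from the text; the method names, the use of an integer as file identifier, the lock, and the dict standing in for the storage server are assumptions made for illustration.

```python
import threading


class AggMgr:
    """Hypothetical sketch of an aggregation module keyed by file identifier."""

    def __init__(self):
        self._lock = threading.Lock()
        self._initial = {}      # file identifier -> bytearray (initial target file)

    def create(self, ino):
        # An empty initial target file is cached in advance.
        with self._lock:
            self._initial[ino] = bytearray()

    def write(self, ino, data):
        # S211: route data to the initial target file with the matching identifier.
        with self._lock:
            self._initial[ino] += data

    def close(self, ino):
        # S212: close acts as the write completion instruction.
        with self._lock:
            return bytes(self._initial.pop(ino))


def flush_to_server(server, ino, target):
    server[ino] = target        # S22: stand-in for sending to the storage server


mgr = AggMgr()
server = {}
mgr.create(1)
mgr.create(2)


def writer(ino, chunks):
    for chunk in chunks:
        mgr.write(ino, chunk)
    flush_to_server(server, ino, mgr.close(ino))


threads = [
    threading.Thread(target=writer, args=(1, [b"ab", b"cd"])),
    threading.Thread(target=writer, args=(2, [b"xy", b"z"])),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Each writer thread touches only its own identifier, so the two files never mix even though their writes interleave.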

The embodiment of the invention also discloses an inter-file aggregation method, which is applied to a storage server in a distributed file system and is shown in figure 3, and the method comprises the following steps:

S31: Receive a target file sent by a user terminal from a terminal cache.

S32: Judge whether the target file is a small file.

Specifically, the storage server receives a target file subjected to file internal aggregation in a terminal cache of the user terminal, and judges whether the target file is a small file, if so, further aggregation can be performed, and if not, further aggregation is not required, and the target file can be separately stored in a disk.

Specifically, whether a file is small may be determined from its size, for example by judging whether the size of the target file is smaller than a preset threshold. The threshold may be set to, for example, 512 KB: when the target file is 512 KB or larger, it is not considered a small file and is not aggregated further; when the target file is smaller than 512 KB, it is considered a small file and is aggregated further to save storage space.

It is understood that the threshold may be determined according to the actual application requirement, and is not limited herein.
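The size test of S32 is small enough to state directly. In the sketch below, the 512 KB value is the example from the text, while the function names and the two route labels are hypothetical.

```python
SMALL_FILE_THRESHOLD = 512 * 1024   # 512 KB, the example threshold from the text


def is_small_file(size_bytes, threshold=SMALL_FILE_THRESHOLD):
    # A file is small only when it is strictly smaller than the threshold.
    return size_bytes < threshold


def route(size_bytes):
    # Small files go on to further aggregation; others are stored separately.
    return "aggregate" if is_small_file(size_bytes) else "store_separately"
```

Keeping the threshold as a parameter reflects the statement that it may be chosen according to the application.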

The user terminal is the user terminal of the foregoing embodiment, and the target file is the target file of the foregoing embodiment.

S33: if so, the target file is flushed to a preset aggregation subfile in the cache layer in a large file additional writing mode.

Specifically, the target file is aggregated into a preset aggregation subfile, as shown in fig. 4, the aggregation subfile is a file including 1 or more sequentially and tightly aggregated small files (obj), where each small file is a target file meeting the requirement.

Specifically, if, when the target file is aggregated, there is no existing aggregation subfile, or no aggregation subfile with remaining space, a new aggregation subfile is created and the target file is stored in it according to a preset storage order. If the aggregation subfile already contains previously aggregated small files, the target file is stored immediately after the last small file stored in that aggregation subfile.

It should be noted that each aggregation subfile has a preset file size, which may be the same for every aggregation subfile, for example 4 MB. Because the subfile size is fixed while target files vary in size, the space remaining in an aggregation subfile after a target file has been stored may be smaller than the next target file to be stored. In that case the new target file is stored in a new aggregation subfile that immediately follows the previous one. Although the small files stored in the previous aggregation subfile occupy less than 4 MB, that subfile still occupies 4 MB of space, so a gap may exist between the data of two adjacent aggregation subfiles.
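The placement rule described above (append to the current aggregation subfile while it has room, otherwise open a new one and leave a gap) might be sketched as follows. The class `SubfilePacker` and its bookkeeping are hypothetical; the example uses a 10-byte subfile instead of 4 MB so the gap is easy to see.

```python
SUBFILE_SIZE = 4 * 1024 * 1024      # 4 MB, the example size from the text


class SubfilePacker:
    """Hypothetical placement logic for small files in fixed-size subfiles."""

    def __init__(self, subfile_size=SUBFILE_SIZE):
        self.subfile_size = subfile_size
        self.subfiles = []          # each entry: list of (ino, offset, length)
        self._used = 0              # bytes used in the current (last) subfile

    def add(self, ino, length):
        if length > self.subfile_size:
            raise ValueError("file does not fit in one aggregation subfile")
        # Open a new subfile if none exists or the remaining space is too small.
        if not self.subfiles or self._used + length > self.subfile_size:
            self.subfiles.append([])
            self._used = 0
        offset = self._used         # store right after the previous small file
        self.subfiles[-1].append((ino, offset, length))
        self._used += length
        return len(self.subfiles) - 1, offset


# Usage with a tiny subfile size for readability.
packer = SubfilePacker(subfile_size=10)
first = packer.add(1, 4)
second = packer.add(2, 4)
third = packer.add(3, 4)            # does not fit in the 2 remaining bytes
gap_in_first = packer.subfile_size - sum(n for _, _, n in packer.subfiles[0])
```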

As shown in fig. 4, within the aggregation subfile the target file may be divided into two segments: a metadata segment (File_head) and a data segment (File). The metadata segment may include a file identifier (ino) and a data mapping relationship (data_len), 64 bits + 32 bits, for a total length of 12 bytes. The actual data of the small file is stored in the data segment. If the small file needs to be looked up later, its data mapping relationship can be found in the metadata server, and the small file can then be located within the large file on the storage server according to that record.
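A 12-byte metadata segment followed by the data segment can be encoded with Python's `struct` module. The text gives only the field names and the 64 + 32 bit total; assigning 64 bits to `ino` and 32 bits to `data_len`, and the little-endian byte order, are assumptions of this sketch.

```python
import struct

# Assumed layout: ino as unsigned 64-bit, data_len as unsigned 32-bit,
# little-endian with no padding, which gives 12 bytes in total.
FILE_HEAD = struct.Struct("<QI")


def pack_entry(ino, data):
    # Metadata segment (File_head) followed by the data segment (File).
    return FILE_HEAD.pack(ino, len(data)) + data


def unpack_entry(buf, offset=0):
    # Returns (ino, data, offset of the next entry).
    ino, data_len = FILE_HEAD.unpack_from(buf, offset)
    start = offset + FILE_HEAD.size
    return ino, buf[start:start + data_len], start + data_len


# Usage: two small files stored back to back.
blob = pack_entry(7, b"hello") + pack_entry(8, b"world!")
ino_a, data_a, next_offset = unpack_entry(blob)
ino_b, data_b, end_offset = unpack_entry(blob, next_offset)
```

Storing the length in the header lets a scanner walk the entries sequentially without consulting the metadata server.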

S34: the aggregation subfiles are flushed to the aggregation large file preset in the disk in a large file additional writing mode;

Specifically, the aggregation subfiles are flushed to the preset aggregation large file in the disk in the form of large file append writing, so that the aggregation subfiles stored in the aggregation large file are tightly packed. Because the file is organized by append-only writing to a large file, random I/O is avoided and aggregation efficiency is improved.

Here, referring to fig. 4, the aggregation large file (big_agg_file) is a preset file including 1 or more sequentially aggregated aggregation subfiles (obj).

It is understood that the size of the aggregate large file and the number of the included aggregate subfiles may be set according to the actual application requirements, and are not limited herein.

It should be noted that different aggregation large files may be aggregated using different versions of the aggregation scheme. Therefore, as shown in fig. 4, the starting position of the first aggregation subfile of each aggregation large file stores an aggregation version number of type char (gather_version), which identifies the organization format of the aggregation large file. When the aggregation large file is later scanned and parsed, the matching software can recognize the format, which avoids parsing errors.
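The append-only layout of the aggregation large file, with a one-byte version at the start of its first subfile, might look like the sketch below. An in-memory `io.BytesIO` stands in for the file on disk; the version value, the zero padding of unused subfile space, and the function names are assumptions rather than details from the disclosure.

```python
import io

GATHER_VERSION = b"\x01"            # hypothetical one-byte (char) version value


def append_subfile(stream, payload, subfile_size):
    # Always write at the current end of the large file (append-only).
    offset = stream.seek(0, io.SEEK_END)
    if offset == 0:
        # The first subfile of a large file starts with the version number.
        payload = GATHER_VERSION + payload
    if len(payload) > subfile_size:
        raise ValueError("payload exceeds the aggregation subfile size")
    # Pad to the fixed subfile size so every subfile occupies the same space.
    stream.write(payload.ljust(subfile_size, b"\x00"))
    return offset


def read_version(stream):
    stream.seek(0)
    return stream.read(1)


# Usage with an 8-byte subfile size for readability.
disk = io.BytesIO()
off_first = append_subfile(disk, b"aaa", 8)
off_second = append_subfile(disk, b"bb", 8)
version = read_version(disk)
```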

Specifically, the cache layer may be an ObjectCacher (Object cache) layer and the aggregation subfile may be an Object; the target file is flushed to a preset Object in the Object cache layer in the form of large file append writing.

Further, in the above S34, the process of flushing the aggregate sub-file to the preset aggregate large file in the disk in the form of large file additional writing may specifically be that when the Object cache layer is full of cache or no new write is made within a preset time threshold, the Object is flushed to the preset aggregate large file in the disk.

Specifically, multiple Objects can be cached in the Object cache layer at the same time. During continuous writing, data is flushed only after the Object cache layer is full, which reduces the number of I/O operations. To ensure that data is still flushed promptly once writing has finished, the Objects are also flushed to the preset aggregation large file in the disk when no new data has been written within a preset time threshold, completing the aggregation.
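The two flush triggers (cache full, or no new write within a time threshold) can be sketched with an injected clock so the behaviour is deterministic. The class `ObjectCache`, its `tick` method, and counting capacity in Objects rather than bytes are all assumptions for illustration.

```python
class ObjectCache:
    """Hypothetical cache layer that flushes when full or after an idle period."""

    def __init__(self, capacity, idle_timeout, flush, clock):
        self.capacity = capacity            # number of Objects the cache holds
        self.idle_timeout = idle_timeout    # seconds without writes before flushing
        self._flush = flush                 # callback that writes one Object to disk
        self._clock = clock
        self._objects = []
        self._last_write = clock()

    def put(self, obj):
        self._objects.append(obj)
        self._last_write = self._clock()
        if len(self._objects) >= self.capacity:
            self._flush_all()               # trigger 1: the cache is full

    def tick(self):
        # Called periodically; trigger 2: no new write within the time threshold.
        idle = self._clock() - self._last_write
        if self._objects and idle >= self.idle_timeout:
            self._flush_all()

    def _flush_all(self):
        for obj in self._objects:
            self._flush(obj)
        self._objects.clear()


# Usage with a fake clock and a list standing in for the large file on disk.
now = [0.0]
disk = []
layer = ObjectCache(2, 5.0, disk.append, lambda: now[0])
layer.put(b"o1")
after_one = list(disk)          # nothing flushed yet
layer.put(b"o2")                # cache full: both Objects are flushed
now[0] = 1.0
layer.put(b"o3")
now[0] = 3.0
layer.tick()                    # idle for 2 s only: no flush
after_short_idle = list(disk)
now[0] = 6.0
layer.tick()                    # idle for 5 s: remaining Object is flushed
```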

Therefore, the embodiment of the invention judges whether the internally aggregated target file is a small file, and if so, further aggregation is carried out, so that the file aggregation effect and the file management efficiency are improved.

It can be understood that before the aggregation subfile is flushed to the preset aggregation large file in the disk in the form of large file append writing, S35 and S36 may also be included, wherein:

S35: Store, in the target file, the data mapping relationship describing the position of the target file in the aggregation large file.

S36: Send the data mapping relationship to a metadata server.

It can be understood that, to make a newly aggregated target file easy to find later, a data mapping relationship between the small file and the aggregation large file is established and stored in the target file. The data mapping relationship locates the small file within the large file using the file identifier of the small file and its storage position in the aggregation large file. The data mapping relationship is also sent to the metadata server, so that when the small file is looked up later its position can be determined from the data mapping relationship held by the metadata server.
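The lookup path enabled by S35 and S36 might be sketched as follows. The record fields, the `MetadataServer` class, and the dict of in-memory large files are hypothetical; the text specifies only that the mapping ties a file identifier to a storage position in the aggregation large file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataMapping:
    """Hypothetical mapping record: where a small file lives in a large file."""
    ino: int            # file identifier of the small file
    large_file: str     # name of the aggregation large file
    offset: int         # start of the small file's data in the large file
    length: int         # number of data bytes


class MetadataServer:
    def __init__(self):
        self._by_ino = {}

    def register(self, mapping):
        # S36: the storage server sends the mapping to the metadata server.
        self._by_ino[mapping.ino] = mapping

    def lookup(self, ino):
        return self._by_ino[ino]


def read_small_file(mds, large_files, ino):
    # Find the mapping first, then read the byte range from the large file.
    m = mds.lookup(ino)
    blob = large_files[m.large_file]
    return blob[m.offset:m.offset + m.length]


# Usage: one large file with a version byte followed by two small files.
large_files = {"big_agg_file": b"\x01" + b"hello" + b"world!"}
mds = MetadataServer()
mds.register(DataMapping(ino=7, large_file="big_agg_file", offset=1, length=5))
mds.register(DataMapping(ino=8, large_file="big_agg_file", offset=6, length=6))
found_a = read_small_file(mds, large_files, 7)
found_b = read_small_file(mds, large_files, 8)
```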

Correspondingly, the embodiment of the present invention further discloses an intra-file aggregation system, which is shown in fig. 5 and is applied to a user terminal in a distributed file system, and the system includes:

the internal aggregation module 11 is configured to receive written data and store it in a terminal cache until the writing is completed, aggregate all the written data in the terminal cache, and obtain a target file in the terminal cache;

and the file sending module 12 is used for flushing the target file to the storage server.

Specifically, the internal aggregation module 11 is configured to receive written data and store it in the AggMgr aggregation processing module for caching until a write completion instruction is received, and to aggregate all the written data in the AggMgr aggregation processing module to obtain the target file.

Specifically, the internal aggregation module 11 may include a data aggregation unit and a file generation unit, wherein:

the data aggregation unit is used for receiving written data corresponding to the file identifier of the initial target file and storing it in the initial target file cached in advance in the AggMgr aggregation processing module;

and the file generation unit is used for, when the write completion instruction is received, aggregating in the AggMgr aggregation processing module according to the initial target file comprising all written data, to obtain the target file.

Correspondingly, the embodiment of the present invention further discloses an inter-file aggregation system, which is shown in fig. 6 and is applied to a storage server in a distributed file system, and the system includes:

the file receiving module 21 is configured to receive a target file sent by a user terminal from a terminal cache;

the file screening module 22 is used for judging whether the target file is a small file;

the first aggregation module 23 is configured to, if the file screening module 22 determines that the target file is a small file, flush the target file into an aggregation subfile preset in the cache layer in a large file appending writing manner;

the second aggregation module 24 is configured to flush the aggregate subfiles to a preset aggregate large file in the disk in a large file additional writing mode;

the aggregation large file is a preset file comprising 1 or more aggregation subfiles which are sequentially aggregated; the aggregation subfile is a file comprising 1 or more small files which are sequentially and closely aggregated; the target file is a file obtained by aggregating all written data in the terminal cache.

Specifically, the file screening module 22 is specifically configured to determine whether the size of the target file is smaller than a preset threshold.

Specifically, the first aggregation module 23 is configured to flush the target file into a preset Object in the Object cache layer in the form of large file append writing.

Specifically, the second aggregation module 24 is specifically configured to, when the Object cache layer is full or no new write is performed within a preset time threshold, flush the Object down to a preset aggregation large file in the disk.

Specifically, the system may further comprise a mapping relationship generation module and a mapping relationship sending module, wherein:

the mapping relation generating module is used for storing the data mapping relation of the position of the target file in the aggregated large file in the target file;

and the mapping relation sending module is used for sending the data mapping relation to the metadata server.

Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

The technical content provided by the present invention is described in detail above, and the principle and the implementation of the present invention are explained in this document by applying specific examples, and the above description of the examples is only used to help understanding the method of the present invention and the core idea thereof; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims

1. An intra-file aggregation method, applied to a user terminal in a distributed file system, comprising the following steps:

and flushing the target file to a storage server.

2. The intra-file aggregation method according to claim 1, wherein the process of receiving and storing the written data to a terminal cache for caching until the writing is completed, aggregating all the written data in the terminal cache, and obtaining the target file in the terminal cache comprises:

3. The in-file aggregation method according to claim 2, wherein the process of receiving and storing the written data to an AggMgr aggregation processing module for caching until a write completion instruction is received, and aggregating all the written data in the AggMgr aggregation processing module to obtain the target file includes:

4. An inter-file aggregation method is applied to a storage server in a distributed file system, and comprises the following steps:

receiving a target file sent by a user terminal from a terminal cache;

judging whether the target file is a small file or not;

5. The method according to claim 4, wherein the step of determining whether the target file is a small file comprises:

6. The inter-file aggregation method according to claim 4, wherein the process of flushing the target file into the aggregation subfile preset in the cache layer in the form of large file append writing includes:

7. The inter-file aggregation method according to claim 6, wherein the process of flushing the aggregate subfile to the aggregate large file preset in the disk in the form of large file append writing comprises:

8. The inter-file aggregation method according to claim 4, wherein before the flushing of the aggregate subfile in the form of large file append to the aggregate large file preset in the disk, the method further comprises:

and sending the data mapping relation to a metadata server.

9. An intra-file aggregation system, applied to a user terminal in a distributed file system, includes:

10. An inter-file aggregation system, applied to a storage server in a distributed file system, includes: