CN117539389A - Cloud edge end longitudinal fusion deduplication storage system, method, equipment and medium - Google Patents

Cloud edge end longitudinal fusion deduplication storage system, method, equipment and medium

Info

Publication number
CN117539389A
Authority
CN
China
Prior art keywords
cloud
edge
layer
spectrum sequence
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311496327.8A
Other languages
Chinese (zh)
Inventor
任棒棒
程葛瑶
谢兴睿
夏俊旭
罗来龙
郭得科
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN202311496327.8A priority Critical patent/CN117539389A/en
Publication of CN117539389A publication Critical patent/CN117539389A/en
Pending legal-status Critical Current

Abstract

The application relates to a cloud-edge-end longitudinally fused deduplication storage system, method, device and medium. The terminal layer chunks the backup file and uploads the generated metadata for redundancy detection. The edge layer reduces the bandwidth overhead of the backbone network by storing the hotter data blocks, and the uploaded metadata can also be pre-deduplicated at the edge layer to further reduce the amount of data transmitted. The cloud layer maintains a global fingerprint index table for global data deduplication and supports distributed parallel indexing across cloud storage servers by partitioning the global fingerprint index table and incoming metadata across different servers of the cloud data center, thereby improving data deduplication and storage performance globally. The end-edge-cloud longitudinally fused deduplication storage architecture significantly reduces the bidirectional data transmission between layers while guaranteeing optimal data deduplication performance, and thus greatly reduces the backbone-network resource overhead of the cloud-edge-end architecture.

Description

Cloud edge end longitudinal fusion deduplication storage system, method, equipment and medium
Technical Field
The invention belongs to the technical field of data processing, and relates to a cloud-edge-end longitudinally fused deduplication storage system, a corresponding storage method, computer equipment and a computer-readable storage medium.
Background
With the increasing volume and value of digital information, data protection has attracted widespread attention in the industry. Cloud backup services can provide an economical, efficient, on-demand and always-available option for data protection by saving continuous backup files of customers' important data. According to existing reports, the number of organizations protected by cloud data backup is rising rapidly, but transmitting the original files from the terminal to a remote cloud server imposes a heavy data transmission burden on the backbone network and also incurs considerable transmission cost.
Under this trend, applying data deduplication technology in backup storage systems has become a new paradigm. Because of the internal derivative relationships between consecutive backup files, there may be a large amount of non-negligible redundant data between the files. A large-scale research report shows that the data redundancy of file system content on desktop Windows machines can be as high as 87% of its original storage space. A common practice in data deduplication is to split a backup file into data blocks and compute one fingerprint for each data block; two data blocks having the same fingerprint are considered duplicates without requiring byte-by-byte comparison of the blocks. Only one copy of each duplicate data block is retained, thereby saving storage space in the storage system.
Emerging data protection strategies have adopted data deduplication techniques, such as performing deduplication operations at the cloud provider, or attempting to distribute similar files to the same storage server to increase the data deduplication rate. However, during deduplicated storage under the cloud-edge-end architecture, the foregoing conventional data deduplication techniques still suffer from the technical problem of relatively high backbone-network resource overhead.
Disclosure of Invention
To address the problems of the conventional methods, the invention provides a cloud-edge-end longitudinally fused deduplication storage system, a deduplication storage method, computer equipment and a computer-readable storage medium, which can greatly reduce the backbone-network resource overhead of the cloud-edge-end architecture.
In order to achieve the above object, the embodiment of the present invention adopts the following technical scheme:
In one aspect, a cloud-edge-end longitudinally fused deduplication storage system is provided, comprising a terminal layer, an edge layer and a cloud layer. After the terminal devices of the terminal layer divide an original backup file into data blocks, an unprocessed file spectrum sequence corresponding to the data blocks is generated and uploaded to the edge layer;
the edge server in the edge layer hashes the fingerprint information contained in the unprocessed file spectrum sequence into a compact, fixed-size sketch data structure to estimate the heat of the data blocks, allocates an edge storage location for each newly hot data block so estimated, records the storage address of the edge storage location into the unprocessed file spectrum sequence, and appends an edge upload tag;
the edge server in the edge layer matches the entries in the unprocessed file spectrum sequence against the entries in a set partial index table, copies the storage locations of the data blocks with matched fingerprints from the set partial index table into the unprocessed file spectrum sequence, and then uploads the part of the unprocessed file spectrum sequence corresponding to the data blocks without matched fingerprints to the cloud layer; the set partial index table is a user-aware, version-adjacent partial index table;
the cloud server in the cloud layer performs redundancy detection on the uploaded part of the unprocessed file spectrum sequence against the global fingerprint index table maintained by the cloud, appends a cloud upload tag to the entries in the unprocessed file spectrum sequence corresponding to the identified non-stored data blocks, allocates new cloud storage locations for the non-stored data blocks, and returns the processed file spectrum sequence to the edge layer;
the edge server of the edge layer assembles the processed file spectrum sequence and the part of the unprocessed file spectrum sequence corresponding to the data blocks with matched fingerprints into a complete processed file spectrum sequence, and returns it to the terminal layer;
and the terminal devices of the terminal layer upload the newly hot data blocks to the edge layer for storage according to the edge storage locations and edge upload tags in the processed file spectrum sequence, and upload the non-stored data blocks to the cloud layer for storage according to the new cloud storage locations and cloud upload tags in the processed file spectrum sequence.
In another aspect, a cloud-edge-end longitudinally fused deduplication storage method is also provided, comprising the following steps:
hashing the fingerprint information contained in the unprocessed file spectrum sequence uploaded by the terminal layer into a compact, fixed-size sketch data structure to estimate the heat of the data blocks, allocating an edge storage location for each newly hot data block so estimated, recording the storage address of the edge storage location into the unprocessed file spectrum sequence, and appending an edge upload tag;
matching the entries in the unprocessed file spectrum sequence against the entries in a set partial index table, copying the storage locations of the data blocks with matched fingerprints from the set partial index table into the unprocessed file spectrum sequence, and uploading the part of the unprocessed file spectrum sequence corresponding to the data blocks without matched fingerprints to the cloud layer; the set partial index table is a user-aware, version-adjacent partial index table;
assembling the processed file spectrum sequence and the part of the unprocessed file spectrum sequence corresponding to the data blocks with matched fingerprints into a complete processed file spectrum sequence, and returning it to the terminal layer; the processed file spectrum sequence is obtained and returned after the cloud server in the cloud layer performs redundancy detection on the uploaded part of the unprocessed file spectrum sequence against the global fingerprint index table maintained by the cloud, appends a cloud upload tag to the entries in the unprocessed file spectrum sequence corresponding to the identified non-stored data blocks, and allocates new cloud storage locations for the non-stored data blocks;
receiving and storing the newly hot data blocks uploaded by the terminal devices of the terminal layer according to the edge storage locations and edge upload tags in the processed file spectrum sequence; the new cloud storage locations and cloud upload tags in the processed file spectrum sequence are also used to instruct the terminal devices of the terminal layer to upload the non-stored data blocks to the cloud layer for storage.
In still another aspect, a computer device is provided, including a memory and a processor, where the memory stores a computer program, and the processor implements the steps of the deduplication storage method for cloud edge longitudinal fusion described above when executing the computer program.
In still another aspect, a computer readable storage medium is provided, on which a computer program is stored, which when executed by a processor implements the steps of the deduplication storage method for cloud-edge vertical fusion described above.
One of the above technical solutions has the following advantages and beneficial effects:
According to the cloud-edge-end longitudinally fused deduplication storage system, method, equipment and medium, the terminal layer is responsible for chunking files, and the generated metadata information is uploaded to the edge layer and the cloud layer for redundancy detection. The edge layer reduces the bandwidth resource overhead of the backbone network by storing hotter data blocks. In addition, the uploaded metadata can also be pre-deduplicated at the edge layer to further reduce the amount of data transmitted. The cloud layer maintains a global fingerprint index table for global data deduplication and supports distributed parallel indexing across cloud storage servers by partitioning the global fingerprint index table and incoming metadata across different servers of the cloud data center, thereby improving data deduplication and storage performance globally. Compared with the conventional technology, the end-edge-cloud longitudinally fused deduplication storage architecture effectively integrates the techniques and storage resources of the different levels during file transmission, storage and retrieval, guarantees optimal data deduplication performance, obviously reduces the bidirectional data transmission between layers, and greatly reduces the backbone-network resource overhead of the cloud-edge-end architecture.
Drawings
In order to more clearly illustrate the technical solutions of embodiments or conventional techniques of the present application, the drawings required for the descriptions of the embodiments or conventional techniques will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person of ordinary skill in the art.
FIG. 1 is a schematic diagram of the CoopDedup architecture of a cloud-edge-end longitudinally fused deduplication storage system in one embodiment;
FIG. 2 is a schematic diagram of data sharing dependencies between backup files in one embodiment;
FIG. 3 is a schematic diagram of a data deduplication rate of a backup file in one embodiment;
FIG. 4 is a schematic diagram of the overall structure of a deduplication storage system with cloud-edge longitudinal fusion in one embodiment;
FIG. 5 is a data flow diagram of a CoopDedup architecture in one embodiment;
FIG. 6 is a schematic diagram of evaluation performance during backup file upload in one embodiment;
FIG. 7 is a schematic diagram of a data block transfer size of a backup file in one embodiment;
FIG. 8 is a schematic diagram of data transmission performance during backup file retrieval in one embodiment;
FIG. 9 is a bandwidth saving performance schematic in one embodiment;
FIG. 10 is a flow chart of a cloud-edge-end longitudinally fused deduplication storage method in an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used in the description of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application.
It is noted that reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the invention. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments.
Those skilled in the art will appreciate that the embodiments described herein may be combined with other embodiments. The term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
Consecutive backup files in a personal computing environment require a cloud backup service for data protection. However, transmitting these original files from the terminal to a remote cloud server places a large transmission burden on the backbone network. Even though the backup service may use data deduplication techniques to delete redundant data blocks, the limited computing and memory resources of the terminal make it difficult to actually deploy and apply deduplication techniques on the end side.
In the prior art in this field, there are also schemes that perform deduplication on the end side and upload only new (not yet stored) data blocks to the cloud; their disadvantage is that the scarce resources of the end layer limit the application of data deduplication technology. The current state-of-the-art deduplication storage architecture uploads the metadata (such as fingerprints) of data blocks to a cloud server for redundancy detection and uploads only the new data blocks; however, it still does not fully utilize edge-side storage resources, resulting in frequent data exchange between the terminal and the remote cloud.
Therefore, the application provides an end-edge-cloud longitudinally fused deduplication storage architecture, CoopDedup, in which the end layer, the cloud layer and the intermediate edge layer cooperatively complete data transmission, storage and index access. An example of the CoopDedup architecture is shown in fig. 1. The terminal layer divides the file into data blocks and computes their metadata for uploading, and the cloud layer maintains a global index table and stores all deduplicated data blocks. The innovation of the architecture is mainly focused on the edge layer between the end and the cloud, which pre-deduplicates the uploaded metadata (UFR) and identifies the hottest data blocks to be stored at the edge. Based on this, these hotter data blocks can be obtained directly from the nearby edge layer when accessing a backup file, thereby optimizing the data retrieval process and reducing the data block transmission distance. In addition, the edge layer pre-deduplicates the uploaded UFR and transmits only the processed partial metadata (P-UFR) to the cloud, which greatly reduces the amount of metadata transmission and saves backbone-network bandwidth resources. The CoopDedup architecture fully integrates the storage and computing resources of the three end-edge-cloud layers, remarkably reduces the bidirectional data exchange between layers, and still guarantees optimal data deduplication performance.
Embodiments of the present invention will be described in detail below with reference to the attached drawings in the drawings of the embodiments of the present invention.
It should be noted that data deduplication is a widely used data reduction technique, which can delete redundant data blocks and avoid repeated writing of the same data. Thus, not only is the system's occupation of storage space reduced, but the network bandwidth consumed during data transmission and retrieval is also reduced. A common practice for data deduplication is to segment a file into multiple fixed-size or variable-size data blocks. Each chunk is uniquely identified by a fingerprint, which is essentially a cryptographically secure hash signature, and two data blocks having the same fingerprint are considered duplicates without requiring byte-by-byte comparison of the blocks. All fingerprints of the stored data blocks are recorded in a fingerprint index table. By looking up the fingerprint index table, duplicate data blocks are found and deleted directly, while only data blocks with unique fingerprints are stored in the storage system.
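As an aid to understanding, the following is a minimal Python sketch (not part of the patented implementation) of fingerprint-index-based deduplication as described above; the fixed-size chunking, function names and in-memory structures are simplifying assumptions for illustration only.

```python
# Illustrative sketch of fingerprint-based deduplication with a fingerprint
# index table. Chunking is fixed-size here for brevity; the described system
# uses variable-size chunks with an average size of about 4KB.
import hashlib

CHUNK_SIZE = 4 * 1024  # assumed average chunk size

def chunk_file(data: bytes):
    return [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]

def dedup_store(data: bytes, fingerprint_index: dict, storage: dict):
    """Store only chunks whose fingerprint is not yet in the index table."""
    recipe = []  # ordered fingerprint list, i.e. the (unprocessed) file spectrum sequence
    for chunk in chunk_file(data):
        fp = hashlib.sha1(chunk).digest()   # 20-byte SHA-1 fingerprint
        if fp not in fingerprint_index:     # unique block: assign a storage address
            address = len(storage)
            storage[address] = chunk
            fingerprint_index[fp] = address
        recipe.append(fp)                   # duplicates keep only a reference
    return recipe
```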
Fingerprint comparison is an efficient redundancy detection method, but storing the fingerprint index table also requires a lot of additional space. For example, when storing an 800TB file data set, assuming an average data block size of 4KB, at least 4TB of fingerprints would be generated (using SHA-1 encoding, each fingerprint being 20B). Transmitting and storing these high-volume fingerprints is a significant challenge for the backup file system. These effects are further exacerbated when stored data is retrieved frequently, because the generated fingerprints need to be transmitted back and forth between the user and the cloud layer for duplicate data block comparison, placing a significant communication burden on the backbone network. Nevertheless, it is already a relatively good choice compared with directly transmitting the data blocks.
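The 4TB figure quoted above can be reproduced with a short back-of-the-envelope calculation; the sketch below is purely illustrative.

```python
# Rough check of the metadata volume: 800TB of data, 4KB average chunks,
# 20B SHA-1 fingerprints.
data_size = 800 * 2**40                       # 800 TB in bytes
chunk_size = 4 * 2**10                        # 4 KB
fingerprint_size = 20                         # SHA-1 digest length in bytes
num_chunks = data_size // chunk_size          # about 2.1 * 10**11 chunks
index_size = num_chunks * fingerprint_size    # about 4 * 10**12 bytes
print(index_size / 2**40, "TB")               # roughly 3.9, i.e. about 4TB of fingerprints
```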
It should be noted that, assuming the cloud storage cluster is composed of a large number of storage servers, the resources reserved for the cloud backup service are relatively sufficient. The cloud layer may store a global fingerprint index table (containing the fingerprints of all data blocks stored in the cloud storage cluster) for thorough redundancy detection. These stored data blocks should also employ fault-tolerance mechanisms, such as replication and erasure coding, to ensure the reliability of the data stored in the cloud.
Reducing the storage space can significantly reduce the cost of data protection. Therefore, the data deduplication technology is widely applied in backup storage systems to reduce the occupation of storage space. In addition to storage costs, bandwidth consumption from the end side of data generation to the cloud layer is also a concern. Because the network throughput between layers is limited, a large amount of bandwidth resources may be occupied by a large amount of data transmission, which may have a delay impact for other applications. Furthermore, the cost of transmitting data once may exceed the cost of storing data monthly.
To this end, placing some data blocks at the edge of the network is an innovative and effective practical approach. Because edge resources are mostly located close to the user, the relevant data blocks can be retrieved from the edge layer when accessing a backup file, while only the data blocks not present in the edge layer are downloaded from the remote cloud data center. In addition, the edge resources can be used to pre-deduplicate the transmitted data, so that only the pre-deduplicated data is transmitted to the cloud layer. In this way, these edge-assisted backup service modes can effectively save the valuable bandwidth resources of the backbone network.
But the storage space and computing resources of the edge clusters are limited and it is difficult to cope with the huge storage and computing demands caused by explosive growth of files. For example, in the case of deduplication data storage, as the number of data blocks stored increases with the arrival of a file, the volume of the corresponding fingerprint index table increases. Therefore, how to effectively use precious resources on the edge cluster to place partial data blocks and perform data pre-deduplication is a technical problem to be solved.
Backup file characteristics observation: important digital information will typically have a series of backup versions. For example, a user may periodically take snapshots of his virtual machine, where each snapshot corresponds to a backup file. In these contiguous files, most of the data blocks remain unchanged. Also provided herein are systematic observations based on backup files that facilitate efficient data deduplication operations with limited storage and computing resources.
Observation 1: because of the diversity of data sources, the number of duplicate data blocks between backup files for different users is negligible.
In the big data age, different users may back up files of different content and formats. Data sharing dependencies between these files are observed herein; the test dataset consists of snapshots of college students' home directories, which are commonly used in deduplication systems. To find the amount of data redundancy within and between different users, these datasets are divided into variable-length data blocks with an average size of 4KB, and the respective amounts of data block repetition are recorded.
The deduplication rates within and between users are recorded, and the experimental results show that: the test involves four users, each with five consecutive backup files; the intra-user deduplication rate always remains above 41%, and for some individual users it reaches as high as 55.02%. This indicates that a large number of data blocks are retained across a user's consecutive backup files. In contrast, the inter-user deduplication rate is typically around 2%, which is negligible compared with the intra-user deduplication rate. This confirms the observation herein that, due to the diversity of data sources, data redundancy between different users is negligible.
Empirical observations indicate that the redundancy rate of data within one user's consecutive backup files is high, while the amount of duplicated data blocks between backup files from different users is relatively small, or even negligible. This phenomenon motivates the study herein of the variability between different users. The global fingerprint index table can be divided into a plurality of independent sub-index tables according to user source, and each backup file is indexed against the specific sub-index table associated with the corresponding user. The division of the index table is beneficial to independent index management for each user; it also accelerates fingerprint indexing, avoids the index lookup bottleneck, and improves index throughput through index parallelism. Most importantly, this user-aware indexing preserves the deduplication effect while consuming little additional index memory space, since the number of duplicate data blocks between different users' backup files is negligible.
Observation 2: most of the duplicate data blocks of the backup file come from its previous adjacent backup version, while the two more distant backup versions contain only a small number of duplicate data blocks.
In order to observe the internal data sharing dependencies between successive backup files and to examine their data block composition, data chunking is performed on 30 consecutive backup versions of one user. When the fingerprint of a data block is detected in a previous version of the backup file, the data block is identified as a duplicate. The data blocks in a backup file B_j are classified into the following four types. First, internal duplicate data blocks: blocks that repeat within B_j itself. Second, adjacent duplicate data blocks: blocks of B_j that are also referenced by the adjacent version B_{j-1}. Third, skip duplicate data blocks: blocks of B_j that are referenced by an earlier backup version, e.g. B_{j-2}, but not by the adjacent version B_{j-1}. Fourth, unique data blocks: blocks that are not duplicated.
Fig. 2 and 3 illustrate the data sharing dependencies between 30 consecutive backup files from one user. First, fig. 2 shows the distribution of data blocks in these backup files; it can be observed that most of the duplicate data blocks of a backup file come from its previous version (i.e., adjacent duplicate data blocks), which account for about 55% of the total backup file data size, while the skip duplicate data blocks account for only a small fraction, even less than 0.3% of all data blocks. Fig. 3 illustrates the data deduplication rates of the current (30th) backup file, each computed against the fingerprint index table of one of the previous 29 versions. The observations show that the data deduplication rate gradually increases from about 42% (based on the initial backup version 1) to over 65% (based on version 29, adjacent to the current backup).
These observations verify the data sharing dependencies and version derivation between one user's consecutive backup files. The closer two backup versions are, the more duplicate data blocks can be detected between them, while fingerprint indexing against a more distant version impairs the effectiveness of redundancy detection. This suggests giving more attention to adjacent versions of the backed-up file when there is insufficient memory space to hold the fingerprints of all previous backup versions.
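The four-way block classification described above can be expressed as the following hedged Python sketch, assuming per-version fingerprint sets are already available; the function and parameter names are illustrative assumptions.

```python
# Classify the data blocks of backup B_j as internal / adjacent / skip duplicates
# or unique, given the fingerprints of B_{j-1} and of all earlier versions.
def classify_blocks(current_recipe, adjacent_fps, earlier_fps):
    """current_recipe: ordered fingerprints of B_j;
    adjacent_fps: fingerprint set of B_{j-1};
    earlier_fps: union of fingerprint sets of B_1 .. B_{j-2}."""
    seen, labels = set(), []
    for fp in current_recipe:
        if fp in seen:
            labels.append("internal-duplicate")
        elif fp in adjacent_fps:
            labels.append("adjacent-duplicate")   # shared with the adjacent version
        elif fp in earlier_fps:
            labels.append("skip-duplicate")       # shared only with an older version
        else:
            labels.append("unique")
        seen.add(fp)
    return labels
```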
Referring to fig. 4, in one embodiment, a cloud-edge-end longitudinally fused deduplication storage system is provided, comprising a terminal layer 12, an edge layer 14 and a cloud layer 16. After dividing an original backup file into data blocks, the terminal devices of the terminal layer 12 generate an unprocessed file spectrum sequence corresponding to the data blocks and upload it to the edge layer 14. The edge server in the edge layer 14 hashes the fingerprint information contained in the unprocessed file spectrum sequence into a compact, fixed-size sketch data structure to estimate data block heat, allocates an edge storage location for each newly hot data block so estimated, records the storage address of the edge storage location into the unprocessed file spectrum sequence, and appends an edge upload tag. The edge server in the edge layer 14 matches the entries in the unprocessed file spectrum sequence against the entries in the set partial index table, copies the storage locations of the data blocks with matched fingerprints from the set partial index table into the unprocessed file spectrum sequence, and then uploads the part of the unprocessed file spectrum sequence corresponding to the data blocks without matched fingerprints to the cloud layer 16; the set partial index table is a user-aware, version-adjacent partial index table. The cloud server in the cloud layer 16 performs redundancy detection on the uploaded part of the unprocessed file spectrum sequence against the global fingerprint index table maintained by the cloud, appends a cloud upload tag to the entries in the unprocessed file spectrum sequence corresponding to the identified non-stored data blocks, allocates new cloud storage locations for the non-stored data blocks, and returns the processed file spectrum sequence to the edge layer 14. The edge server of the edge layer 14 assembles the processed file spectrum sequence and the part of the unprocessed file spectrum sequence corresponding to the data blocks with matched fingerprints into a complete processed file spectrum sequence, and returns it to the terminal layer 12. The terminal devices of the terminal layer 12 upload the newly hot data blocks to the edge layer 14 for storage according to the edge storage locations and edge upload tags in the processed file spectrum sequence, and upload the non-stored data blocks to the cloud layer 16 for storage according to the new cloud storage locations and cloud upload tags in the processed file spectrum sequence.
In the above cloud-edge-end longitudinally fused deduplication storage system 100, the terminal layer 12 is responsible for chunking files, and the generated metadata information is uploaded to the edge layer 14 and the cloud layer 16 for redundancy detection. The edge layer 14 reduces the bandwidth resource overhead of the backbone network by storing hotter data blocks. In addition, the uploaded metadata may also be pre-deduplicated at the edge layer 14 to further reduce the amount of data transmitted. The cloud layer 16 maintains a global fingerprint index table for global data deduplication and supports distributed parallel indexing across cloud storage servers by partitioning the global fingerprint index table and incoming metadata across different servers of the cloud data center, thereby improving data deduplication and storage performance globally. Compared with the conventional technology, the end-edge-cloud longitudinally fused deduplication storage architecture effectively integrates the techniques and storage resources of the different levels during file transmission, storage and retrieval, guarantees optimal data deduplication performance, obviously reduces the bidirectional data transmission between layers, and greatly reduces the backbone-network resource overhead of the cloud-edge-end architecture.
It will be appreciated that the CoopDedup architecture has several challenges to be addressed in the deduplication storage service. The first key question is how the data block access frequency should be reasonably estimated so that the most frequently accessed blocks (hot blocks) are stored at the edge layer 14. Due to the limited storage and computing resources of the edge layer 14, it is impractical to record the access frequency of every data block. The second key question is how the edge's limited memory space should be effectively utilized to detect more redundant metadata. Maintaining a global fingerprint index table at the edge for redundancy detection poses a serious challenge for memory-scarce edge servers. Even if an edge server adds a large storage disk to help maintain this large-capacity index table, such slow disk indexing still becomes a major performance bottleneck of the deduplication system.
For the first challenge, a space-friendly Count-Min Sketch (a counting algorithm suited to very large data volumes that trades a small loss of accuracy for efficiency) is used to estimate the access heat of data blocks, and this fixed-size data structure can identify the hotter data blocks with high accuracy. For the second challenge, an efficient lightweight index table is designed to pre-deduplicate the uploaded metadata using the derivative relationships between backup files, whereby most metadata redundancy can be detected and deleted. These edge-assisted methods effectively save the valuable bandwidth resources from the edge to the backbone of the remote cloud.
Some concepts need to be explained first. Metadata structures: a fingerprint is the unique identifier of a data block. By comparing fingerprints, it can be determined whether two data blocks are duplicates. During file storage or retrieval, these fingerprints are typically combined in order into a file spectrum sequence. The file spectrum sequence is essentially a sequential list of the metadata of each data block in the file, reflecting the order in which the data blocks appear in the file. Even if a data block appears multiple times in a file, its corresponding metadata will still be listed multiple times in the corresponding file spectrum sequence. There are two types of file spectrum sequences in the CoopDedup architecture herein: unprocessed file spectrum sequences (Unprocessed File Recipes, UFR) and processed file spectrum sequences (Processed File Recipes, PFR).
The unprocessed file spectrum sequence (UFR) is generated when the terminal layer 12 chunks the backup file; at this point the metadata contains only the fingerprints of the data blocks. The UFR records the data block composition of the backup file and serves as the input to the subsequent data deduplication process. Once the storage location of any data block is determined at either the edge layer 14 or the cloud layer 16, the UFR is converted into a processed file spectrum sequence (PFR), with each entry further adding the storage address of the corresponding data block.
The PFR plays two roles in backup protection. The first role is that, during the uploading of the backup file, the data blocks can be uploaded according to the storage addresses recorded in the PFR. To avoid re-transmitting data blocks that are already stored, an upload tag is additionally appended to each entry of a unique block that needs to be uploaded. Only the data blocks whose entries are tagged will be uploaded; the other data blocks are considered duplicates and need not be uploaded. In the second role, during backup file access, the data blocks contained in the backup file are downloaded from the edge layer 14 or the cloud layer 16 according to the addresses recorded in the PFR, the downloaded blocks are assembled into a complete backup file according to the order of the data blocks in the PFR, and the complete backup file is returned to the user side requesting access.
Another important metadata structure is the fingerprint index table (referred to simply as FingIdx), which records a mapping from data block fingerprints to data block storage addresses. FingIdx generally has a smaller volume than the PFR because it contains only one unique entry for any data block. A global fingerprint index table records the metadata information of all stored data blocks. During the uploading of a data backup file, a data block is identified as a duplicate block when its fingerprint is detected in FingIdx. In this case, its storage location is passed directly to the UFR without reassigning an address.
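The following is a minimal, hedged Python sketch of these metadata structures; the field names and address format are assumptions introduced only for illustration and are not specified by the embodiment.

```python
# Assumed shapes of the metadata structures: a UFR entry holds only a fingerprint,
# a PFR entry adds a storage address and an optional upload tag, and FingIdx maps
# fingerprints to storage addresses (one entry per unique block).
from dataclasses import dataclass
from typing import Optional, Dict

@dataclass
class UFREntry:
    fingerprint: bytes                  # SHA-1 digest of the data block

@dataclass
class PFREntry:
    fingerprint: bytes
    address: str                        # e.g. "edge:container/offset" or "cloud:container/offset"
    upload_tag: Optional[str] = None    # "edge" or "cloud" for blocks that must still be uploaded

FingIdx = Dict[bytes, str]              # fingerprint -> storage address
```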
Terminal layer 12 of the CoopDedup architecture: the terminal layer 12 is the first layer of the CoopDedup architecture, where the terminal devices of different users generate a large number of backup files. Uploading the original files of the terminal devices to the remote cloud layer 16 for data deduplication is not economical, because it causes a large amount of redundant data to be transmitted multiple times over the backbone network. Thus, the data deduplication process is moved forward herein to the front-end terminal layer 12 and edge layer 14. The main task of a terminal device is to divide the generated backup file into data blocks of unequal size and to compute fingerprints for these segmented data blocks, as shown for the terminal layer 12 (also called the end layer) in fig. 5. The fingerprints computed for one backup file are assembled in order into a UFR. This UFR is uploaded to the subsequent edge layer 14 for further processing. Transmitting the UFR instead of all the sliced data blocks effectively reduces the network transmission burden.
Thus, the innovation at the terminal layer 12 is that, instead of deleting duplicate data blocks on the terminal device, the UFR is uploaded for redundancy identification. First, due to the limited storage and computing resources on the terminal device, it is impractical to store a global fingerprint index table recording the metadata information of all data blocks chunked by the terminal device, and the scarce computing resources also prevent fingerprint-based redundant data detection. Secondly, performing isolated data deduplication on each terminal device would ignore data redundancy that may exist between multiple terminal devices, thereby impairing the data deduplication effect.
Edge layer 14 of the CoopDedup architecture: an edge server serves a plurality of terminal devices within an area, and all edge servers constitute the edge layer 14 in the CoopDedup architecture. As an intermediate bridge between the terminal layer 12 and the cloud layer 16, the edge layer 14 is innovatively proposed herein to undertake some of the tasks performed by the terminal layer 12 and the cloud layer 16 in traditional deduplication backup methods. Pushing tasks of the terminal layer 12 towards the edge layer 14 relieves the computation and storage pressure of the terminal devices. Pulling tasks of the cloud layer 16 down to the edge layer 14 reduces frequent data transmission between the terminal and the remote cloud, thereby reducing network transmission overhead.
It should be noted that the storage space of the edge layer 14 is not comparable to that of a cloud data center. Therefore, how to effectively utilize the limited resources of the edge layer 14, to maximize the data deduplication effect and reduce the transmission cost is an important point of the edge layer 14. This problem is addressed herein in view of the following two aspects: the first aspect is to store part of the high access frequency hot data blocks at the edge of the network. Under the same space resource, the hot data block can be stored to serve more end-side data requests than the cold data block, so that the transmission cost in the data retrieval process is reduced to the greatest extent. The second aspect is that a partial index table may be maintained at the edge layer 14 to pre-deduplicate the uploaded metadata information (UFR), and the detected redundant metadata information will not be further transmitted to the remote cloud layer 16 to reduce the data traffic of the backbone network. The two modes assist in the duplicate removal processing of the backup file by means of the storage and calculation resources of the edge layer 14, so that the bandwidth resources of a backbone network can be effectively saved, and the possible network transmission congestion is relieved.
Count-sketch-based block selection: it is not economical to record the access hotness of data blocks by directly tracking their reference counts, because each data block would need to record its fingerprint and its reference frequency, which leads to non-negligible memory overhead. Thus, the present disclosure chooses to use a compact, fixed-size sketch data structure (i.e., Count-Min Sketch) to estimate the reference count (data block heat) of each data block. A Count-Min Sketch is a two-dimensional array whose width is denoted r and whose depth is denoted w (r and w are both configurable parameters). For each arriving data block, its fingerprint is mapped by w independent hash functions to w counters, one in each of the w rows, as shown in the edge layer 14 of fig. 5. Then, the counter at each hashed position is incremented by 1. The reference count of a data block, i.e., the data block heat, is estimated by the minimum value of all the counters to which its fingerprint hashes. The estimation error has been shown to be bounded by n*e/r with probability at least 1 - e^(-w), where n is the total number of data blocks and e is Euler's number.
A simple analysis shows that this sketch-based block heat estimation saves significant memory. For example, recording the reference counts of all divided data blocks directly (n = 2^12) requires 2^12 * (20B + 4B) of memory. If a Count-Min Sketch is used instead, with the parameters set to r = 2^2 and w = 2^10, then there are 2^(10+2) counters in total, occupying only about 1/6 of the memory required by the direct reference-counting method. The space-saving effect of this sketch structure increases exponentially as more data blocks are generated, without adding any additional space overhead.
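A hedged Python sketch of this Count-Min-Sketch-based heat estimation is given below; the hash construction, seeds and the hot threshold are assumptions made only for illustration.

```python
# Count-Min Sketch with width r and depth w: each fingerprint is hashed to one
# counter per row; the heat estimate is the minimum of its w counters.
import hashlib

class CountMinSketch:
    def __init__(self, r: int, w: int):
        self.r, self.w = r, w
        self.table = [[0] * r for _ in range(w)]        # w rows, each of width r

    def _positions(self, fingerprint: bytes):
        for row in range(self.w):                       # one independent hash per row
            h = hashlib.sha1(row.to_bytes(4, "big") + fingerprint).digest()
            yield row, int.from_bytes(h[:8], "big") % self.r

    def add(self, fingerprint: bytes) -> int:
        """Record one reference to the block and return its estimated heat."""
        for row, col in self._positions(fingerprint):
            self.table[row][col] += 1
        return self.estimate(fingerprint)

    def estimate(self, fingerprint: bytes) -> int:
        return min(self.table[row][col] for row, col in self._positions(fingerprint))

HOT_THRESHOLD = 3  # illustrative: a block whose estimate first exceeds this is treated as newly hot
```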
User-aware, version-adjacent partial fingerprint index: pre-deduplicating the metadata can reduce the transmission burden on the network. For example, when data block fingerprints are computed herein using SHA-1 encoding and the average data block size is assumed to be 4KB, storing an 800TB file requires uploading at least 4TB of fingerprints. However, because of the limited edge memory space, it is impractical to maintain a global fingerprint index table (containing the fingerprints of all data blocks of the processed files) at the edge. Therefore, a key consideration is how to select a subset of the global fingerprint index table for redundancy detection, minimizing the index table volume while maximizing the redundancy detection rate. Based on the system observations, the content variability between backup files from different users is found to be large, while a large number of data blocks can be shared between adjacent backup files of the same user. Based on this, a User-aware and Version-adjacent Partial Index Table (UVPIdx) is innovatively proposed herein. The UVPIdx divides the fingerprint index table into a plurality of independent sub-index tables based on user membership information, and each sub-index table records the metadata information of the adjacent backup version of one user.
The user-aware fingerprint index accelerates the fingerprint comparison process and facilitates updating the sub-index table corresponding to each user. It should be noted that each UVPIdx sub-index table should always be updated to the latest version of the user's backup file, to avoid degrading duplicate data detection due to excessive differences between backup versions. A version-adjacent fingerprint index table can detect most duplicate entries with a small table volume and a high redundancy detection precision. As shown in fig. 3, about 66.538%/66.568% = 99.95% of duplicate entries can be detected based on the adjacent-version index table. When a UFR reaches the edge layer 14, matched entries directly obtain the storage addresses of their data blocks by means of the UVPIdx-based fingerprint index. Only the unmatched portion of the UFR is further uploaded to the remote cloud, which effectively saves bandwidth resources between the edge layer 14 and the cloud layer 16.
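The edge-side pre-deduplication against the UVPIdx can be sketched as follows; the entry layout (simple dictionaries) and function names are assumptions for illustration, not the claimed implementation.

```python
# Edge pre-deduplication against a user-aware, version-adjacent partial index:
# uvpidx maps each user to the fingerprint index of only their adjacent backup version.
def edge_pre_dedup(user_id, ufr_entries, uvpidx):
    """ufr_entries: list of {"fp": fingerprint, "addr": None, "tag": None}."""
    adjacent_index = uvpidx.get(user_id, {})      # {fingerprint: storage address}
    matched, unmatched = [], []
    for entry in ufr_entries:
        addr = adjacent_index.get(entry["fp"])
        if addr is not None:
            entry["addr"] = addr                  # duplicate: copy the stored address
            matched.append(entry)                 # kept at the edge for later assembly
        else:
            unmatched.append(entry)               # only this part is uploaded to the cloud
    return matched, unmatched

def update_uvpidx(user_id, complete_pfr, uvpidx):
    # Refresh the user's sub-index so it always reflects the latest backup version.
    uvpidx[user_id] = {e["fp"]: e["addr"] for e in complete_pfr}
```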
Cloud layer 16 of the CoopDedup architecture: the cloud layer 16 is effectively a large storage cluster comprising a large number of homogeneous storage servers. Sufficient storage and computing resources allow the cloud layer 16 to have its own data deduplication structure. The cloud layer 16 maintains a global fingerprint index table that records the fingerprints of all data blocks sliced from all processed backup files. By comparing the uploaded UFR with the global fingerprint index table, the data blocks corresponding to matched entries are deemed to be already stored in the cloud, whereas unmatched data blocks are deemed not yet stored; such data blocks should then be uploaded from the terminal to the cloud for storage. The global fingerprint index table of the cloud layer 16 compares data from all terminals and can detect all duplicate data blocks, thereby achieving complete redundancy elimination.
In one embodiment, the edge server of the edge layer 14 is further configured to upload copies of the hot data blocks stored at the edge to the cloud layer 16 for storage.
It will be appreciated that, further, all data blocks, including the hot data blocks stored at the edge, also require at least one copy to be maintained in the cloud. In this way, the availability of the data blocks can still be ensured when the edge resources are unreliable. In addition, cloud storage may also employ fault-tolerance mechanisms, such as replication or erasure coding, to further ensure the reliability of data storage.
In one embodiment, each cloud server in the cloud layer 16 performs redundancy detection on the uploaded part of the unprocessed file spectrum sequence and the global fingerprint index table maintained by the cloud in a distributed index manner.
It will be appreciated that, further, fingerprint indexing on a single storage server is time-consuming and resource-intensive. To support distributed fingerprint indexing across cloud storage servers, the present embodiment may partition the global fingerprint index table and incoming UFRs into different buckets based on the data block fingerprints, as in the distributed index shown in fig. 5. Entries mapped into the same bucket are assigned to one cloud server. Since the fingerprint is the unique identifier of a data block, the fingerprint-based bucket mapping ensures that all matching entries (of the UFR and the global fingerprint index table) are allocated to the same bucket, accelerating the fingerprint indexing process without affecting the data deduplication effect. After the distributed indexing, the new, not-yet-stored data blocks are determined and assigned new cloud storage locations.
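The fingerprint-based bucket mapping can be sketched as below; the choice of hashing the first bytes of the fingerprint is an assumption, since the embodiment does not fix a particular partitioning rule.

```python
# Partition UFR entries (and, analogously, global index entries) into buckets so
# that entries with the same fingerprint always land on the same cloud server.
def bucket_of(fingerprint: bytes, num_servers: int) -> int:
    return int.from_bytes(fingerprint[:8], "big") % num_servers

def partition_ufr(ufr_fingerprints, num_servers: int):
    """Group fingerprints so each group is indexed in parallel on one server."""
    buckets = {s: [] for s in range(num_servers)}
    for fp in ufr_fingerprints:
        buckets[bucket_of(fp, num_servers)].append(fp)
    return buckets
```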
In general, the CoopDedup architecture reduces bidirectional data exchange, saves bandwidth resources and reduces transmission costs through the cooperation of the three end-edge-cloud layers. Two types of data are transferred between the three end-edge-cloud layers: the first is the metadata used for deduplication detection, namely the UFR and PFR described above; the second is the new, not-yet-stored data blocks that need to be uploaded. When the processed file spectrum sequence is returned to the terminal, these new data blocks are uploaded from the terminal layer 12 to the edge layer 14 or the cloud layer 16 for storage based on the address information in the PFR. The data communication model across the three layers in the CoopDedup architecture can thus be summarized as follows.
At the terminal layer 12, the original backup file is first divided into data blocks, generating an unprocessed file spectrum sequence (UFR), i.e., the list of fingerprints of all data blocks in the file. The UFR is uploaded as input to the edge layer 14 for further processing.
At the edge layer 14, the fingerprint information contained in the UFR is hashed into the Count-Min Sketch. Once the estimated heat of a data block first exceeds the set threshold, the corresponding data block is identified as a newly hot data block. In this embodiment, these hot data blocks should store a copy at the edge layer 14, and an edge storage location is allocated for each of them. This storage address is recorded in the corresponding entry of the UFR, and an edge upload tag is appended. Further, the entries in the UFR are also compared with the entries in the UVPIdx. Data blocks with matched fingerprints are considered duplicates, i.e., already stored in the cloud layer 16; in this case their locations are copied directly from the UVPIdx into the UFR. These edge-processed partial file spectrum sequences (P-PFR) wait at the edge for further assembly, while the unmatched portion of the UFR is further uploaded to the resource-rich cloud layer 16.
At cloud layer 16, the uploaded partial UFR and global index table perform redundancy detection in a distributed index manner. The UFR entry corresponding to the identified non-stored data block will be attached with a cloud upload tag, and a new cloud storage location is allocated to the data block. When the locations (storage containers and offsets) of all the non-stored data blocks are determined, the cloud-processed file spectrum sequence is returned to the edge layer 14 and assembled into a complete PFR with the edge layer 14-processed file spectrum sequence. It should be noted that entries in the complete PFR will update the UVPIdx in the edge layer 14, since the UVPIdx in the edge layer 14 should always record metadata of the latest backup version of each user to avoid degradation of the data deduplication effect.
The complete PFR, with the location information of the data blocks and the upload tags attached, is finally sent back to the terminal layer 12. The terminal is responsible for uploading the corresponding data blocks according to the storage locations and upload tags in the PFR. The new data blocks are uploaded to the designated cloud layer 16 locations according to the cloud upload tags, and the hot data blocks are uploaded to the edge storage locations designated by the nearby edge layer 14 according to the edge upload tags. The backup file can be marked as ready for retrieval only after all data blocks have been successfully uploaded.
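The terminal-side upload step driven by the returned PFR can be sketched as follows; the transport functions are placeholders and the entry layout matches the illustrative dictionaries used above.

```python
# Upload only the tagged PFR entries: edge-tagged blocks go to the nearby edge,
# cloud-tagged blocks go to the cloud location recorded in the entry.
def upload_by_pfr(pfr_entries, blocks_by_fp, send_to_edge, send_to_cloud):
    for entry in pfr_entries:
        if entry.get("tag") == "edge":
            send_to_edge(entry["addr"], blocks_by_fp[entry["fp"]])
        elif entry.get("tag") == "cloud":
            send_to_cloud(entry["addr"], blocks_by_fp[entry["fp"]])
        # untagged entries are duplicates already stored; nothing is transmitted
```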
In one embodiment, further, in the file retrieval process, the terminal device of the terminal layer 12 sends a block request to the edge server and/or the cloud server according to the position information of the data blocks in the processed file spectrum sequence, and assembles the retrieved data blocks into the original backup file in sequence.
It can be understood that in the file retrieval process, the terminal sends a block request to the edge server or the cloud server according to the position information of the data blocks in the PFR, and the retrieved data blocks are assembled into the original backup file in sequence. The frequently accessed hot data blocks can be directly downloaded from a nearby edge server, thereby reducing the bandwidth occupation of the backbone network. Even if the edge server is damaged or other unreliable storage problems occur, the data block request can still find the copy of the data block in the cloud according to the global fingerprint index table of the cloud, so that the availability of the data block and the reliability of data storage are realized.
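The retrieval process described above can likewise be sketched as follows; the address prefixes and fetch functions are illustrative placeholders, and the fallback to the cloud copy reflects the availability guarantee described in the preceding paragraph.

```python
# Retrieve a backup file by following the PFR: hot blocks come from the nearby
# edge when possible, everything else (and any edge miss) from the cloud.
def retrieve_file(pfr_entries, fetch_from_edge, fetch_from_cloud):
    parts = []
    for entry in pfr_entries:
        addr = entry["addr"]
        if addr.startswith("edge:"):
            block = fetch_from_edge(addr)
            if block is None:                       # edge unreliable: use the cloud copy
                block = fetch_from_cloud(entry["fp"])
        else:
            block = fetch_from_cloud(addr)
        parts.append(block)
    return b"".join(parts)                          # assembled in PFR (file) order
```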
In summary, the cooperation between the three longitudinal interconnection layers of the end-side-cloud obviously reduces the bidirectional data exchange between the layers, and simultaneously ensures the optimal data deduplication performance.
In some embodiments, in order to demonstrate the effect of the above system more clearly and intuitively, some experimental examples are provided herein; they serve only as auxiliary explanation and are not the sole limitation of the technical solutions described above. This example uses a real-world dataset to empirically evaluate the performance of the CoopDedup architecture herein. Experiment setting: a desktop computer is used. Dataset: this example uses the real-world FSL dataset, which contains consecutive snapshots of the home directories of 13 students, to evaluate the versatility of the CoopDedup architecture herein. Each snapshot corresponds to a backup file. These snapshots cover a variety of typical workloads, such as file system snapshots and virtual machine images. The aggregate size of the data is 67.0GB and the average size of a snapshot file is approximately 154.2MB. The global deduplication rate is about 48.62%, i.e., about half of the data blocks are duplicated.
To illustrate the data deduplication performance more fully, four comparison methods are considered. CoopDedup: the end-edge-cloud longitudinally fused deduplication storage architecture set forth above. InftyDedup: the current state-of-the-art deduplication storage architecture, which sends metadata information between the cloud layer and the terminals for redundancy detection but ignores efficient utilization of edge-layer resources. PostDedup: a widely used post-processing data deduplication strategy, in which the backup file is transmitted directly to the remote cloud layer for storage and data deduplication. CoopDedup_FIFO: a variant of the CoopDedup architecture that maintains both the partial index table at the edge layer and the stored data blocks on a first-in-first-out basis; the volumes of the index table and the stored data blocks follow the CoopDedup architecture. When space overflows, the data in CoopDedup_FIFO is evicted using the first-in-first-out principle.
Comparison metrics: during the uploading of backup files, the main comparison metric of the experiment is the data transmission amount over the backbone network, which consists of the metadata transmission amount and the data block transmission amount. The smaller the uploaded data amount, the more redundancy has been eliminated before the data is transmitted to the remote cloud. This saves valuable network resources on the backbone network and is critical to improving network performance.
During backup file retrieval, the bandwidth saving ratio (Bandwidth Saving Ratio, BSR) is of primary concern; it represents the percentage reduction in the amount of data transferred over the backbone network resulting from obtaining data blocks from the nearby edge. This example also records the amount of data downloaded from the remote cloud layer. Of course, the auxiliary advantage of edge-layer deduplication comes at the cost of additional memory space consumed at the edge; therefore, this example also records the edge-layer extra space occupation (Extra Space Occupation, ESO), which comprises the volume of the data blocks stored at the edge layer and the size of the CM-Sketch used.
Parameter setting: this example first decompresses the files in the dataset and divides them into data blocks. The fingerprint of each data block is represented using SHA-1 encoding. The access hotness of the backup files is generated using the widely used Zipf distribution, with the concentration of data access set to 1. The CM-Sketch parameters used in this example are set by default to r=1000000 and w=10. In the CoopDedup architecture, the amount of data blocks stored at the edge layer is set by default to 30% of the amount of all deduplicated data blocks. The edge storage of the CoopDedup_FIFO comparison method is always kept consistent with the CoopDedup architecture.
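The Zipf-distributed access pattern assumed in this setup can be sketched as follows; the sampling routine and seed are illustrative and not part of the reported experiment code.

```python
# Draw file requests whose popularity follows a Zipf distribution with exponent s
# (the "concentration" of data access; s = 1.0 as stated above).
import random

def zipf_requests(num_files: int, num_requests: int, s: float = 1.0, seed: int = 0):
    rng = random.Random(seed)
    weights = [1.0 / (rank ** s) for rank in range(1, num_files + 1)]
    return rng.choices(range(num_files), weights=weights, k=num_requests)
```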
Numerical results: this example performs large-scale experiments to test the performance of the CoopDedup architecture and its comparison methods during backup file uploading and retrieval. The evaluated performance during backup file uploading is shown in fig. 6. In fig. 6, the metadata transfer amount of InftyDedup is the largest: uploading 450 files requires up to 371.27MB of metadata transfer. The metadata transfer amount of the CoopDedup architecture is about half that of InftyDedup, because InftyDedup performs redundancy detection by transmitting the entire metadata information between the terminal and the remote cloud, while the edge pre-deduplication function in the CoopDedup architecture effectively reduces the amount of metadata transmitted from edge to cloud. CoopDedup_FIFO maintains an index table at the edge to pre-deduplicate metadata, but because it ignores the key role of the derivative relationships between backup files, its metadata transfer volume is still large, namely about 297.27MB when uploading 450 files.
Note that the metadata transfer amount of the conventional PostDedup method remains zero. This is because it transfers the original file directly to the cloud without calculating metadata information for redundancy detection. However, as shown in fig. 7, the data block transfer amount of PostDedup is the highest among all the comparison methods. This is because the other methods only need to upload the deduplicated data blocks after redundancy detection on the metadata information, so their data block transfer amount is only about half that of PostDedup.
The data transfer performance during the backup file retrieval process is shown in fig. 8. The present example randomly generates 1000 file requests according to file hotness. In fig. 8, the CoopDedup architecture has an absolute advantage in terms of data download amount. Specifically, when 60% of the data blocks can be stored at the edge layer, the data download amount of the CoopDedup architecture is only about 31.79GB. In contrast, since PostDedup and InftyDedup do not fully utilize edge layer resources, all involved data blocks are always downloaded from the remote cloud layer, resulting in a higher data download amount (about 150GB for the 1000 file requests).
The bandwidth saving ratio performance is shown in fig. 9. The CoopDedup architecture can save 32.77% of bandwidth resources when only 10% of the data blocks can be stored at the edge layer. This benefits from the accurate selection of hotter data blocks by CM-Sketch, which allows more data blocks to be obtained at the edge without accessing the remote cloud layer during file retrieval. The trend of increasing bandwidth savings with increasing stored data volume flattens out gradually, with the last 10% of data blocks (90%-100%) contributing a bandwidth saving of only 4.33%. This is because the last selected data blocks are typically cooler blocks that serve only a small number of file access requests. Furthermore, the CoopDedup_FIFO method can only provide a linearly increasing bandwidth saving ratio because it uses the first-in-first-out rule to maintain the stored data blocks.
Finally, the edge layer stored data block ratio is fixed at 30% to examine the performance of the CoopDedup architecture under different sketch (CM-Sketch) widths. As the sketch width increases from 100 to 10^8, the extra space occupation of the edge layer also increases gradually, because the CM-Sketch requires more space to store its hash counters. However, this increasing trend of extra space occupation is not obvious, since most of the extra space occupation consists of data blocks stored at the edge. As the CM-Sketch width increases, so does the bandwidth saving ratio. This demonstrates that a large sketch structure can reduce the bias in estimating the access frequency of data blocks. When the sketch width is set above 10^7, the bandwidth saving ratio remains unchanged, showing that a sketch structure at this width setting can already estimate data block hotness accurately.
In summary, the CoopDedup architecture fully utilizes the end-edge-cloud three-layer resources, and reduces the volume of uploaded metadata by half compared with the current state-of-the-art InftyDedup. In addition, when only 10% of the data blocks are stored at the edge, the bandwidth saving ratio can reach around 33%.
In one embodiment, as shown in fig. 10, a cloud-edge-end longitudinal fusion deduplication storage method 100 is provided, which may include the following processing steps S12 to S18:
S12, hashing fingerprint information contained in the unprocessed file spectrum sequence uploaded by the terminal layer into a fixed-size compact sketch data structure to estimate data block hotness, allocating an edge storage position for the data block estimated to be newly hot, recording the storage address of the edge storage position into the unprocessed file spectrum sequence, and attaching an edge uploading tag.
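For illustration only, a count-min sketch of the kind referred to in S12 could be realized roughly as follows; the class name, the depth/width parameters, and their mapping onto the r and w defaults mentioned earlier are assumptions of this sketch rather than details given by the present application.

    import hashlib

    class CountMinSketch:
        """Fixed-size compact sketch for estimating data block access hotness."""
        def __init__(self, depth=10, width=1_000_000):
            # mapping depth/width onto the r and w defaults above is an assumption
            self.depth, self.width = depth, width
            self.counters = [[0] * width for _ in range(depth)]

        def _buckets(self, fingerprint):
            for row in range(self.depth):
                digest = hashlib.sha1(f"{row}:{fingerprint}".encode()).digest()
                yield row, int.from_bytes(digest[:8], "big") % self.width

        def update(self, fingerprint):
            for row, col in self._buckets(fingerprint):
                self.counters[row][col] += 1

        def estimate(self, fingerprint):
            return min(self.counters[row][col] for row, col in self._buckets(fingerprint))

An edge server could then call update() for every fingerprint in an uploaded file spectrum sequence and compare estimate() against a hotness threshold before allocating an edge storage position for a block.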
S14, matching entries in the unprocessed file spectrum sequence against entries in a set partial index table, copying the storage positions corresponding to data blocks with matched fingerprints from the set partial index table into the unprocessed file spectrum sequence, and uploading the part of the unprocessed file spectrum sequence corresponding to data blocks without matched fingerprints to the cloud layer; the set partial index table is the partial index table of the user-perceived adjacent version.
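For illustration only, the edge pre-deduplication in S14 could be sketched as below, assuming each file spectrum sequence entry is a small dictionary holding a fingerprint and a storage location, and the set partial index table is a fingerprint-to-location mapping; both data layouts are assumptions made for this example.

    def edge_pre_dedup(spectrum_sequence, partial_index):
        """Resolve entries whose fingerprints match the set partial index table; forward the rest."""
        resolved, to_cloud = [], []
        for entry in spectrum_sequence:  # entry: {"fingerprint": ..., "location": None}
            location = partial_index.get(entry["fingerprint"])
            if location is not None:
                entry["location"] = location   # copy the known storage position into the sequence
                resolved.append(entry)
            else:
                to_cloud.append(entry)         # unmatched part is uploaded to the cloud layer
        return resolved, to_cloud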
S16, assembling the processed file spectrum sequence and the part of the unprocessed file spectrum sequence corresponding to data blocks with matched fingerprints into a complete processed file spectrum sequence, and returning the processed file spectrum sequence to the terminal layer. The processed file spectrum sequence is obtained and returned after a cloud server in the cloud layer performs redundancy detection on the uploaded part of the unprocessed file spectrum sequence against the global fingerprint index table maintained by the cloud, attaches a cloud uploading tag to the entries in the unprocessed file spectrum sequence corresponding to un-stored data blocks, and allocates new cloud storage positions for the un-stored data blocks.
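For illustration only, the cloud-side redundancy detection described in S16 could be sketched as follows, assuming the global fingerprint index table is a fingerprint-to-location mapping and allocate_cloud_slot is a hypothetical helper that assigns a new cloud storage position; neither name comes from the present application.

    def cloud_redundancy_detection(unresolved_entries, global_index, allocate_cloud_slot):
        """Tag un-stored blocks for upload and resolve duplicates from the global fingerprint index."""
        for entry in unresolved_entries:
            location = global_index.get(entry["fingerprint"])
            if location is None:
                entry["location"] = allocate_cloud_slot(entry["fingerprint"])  # new cloud position
                entry["tag"] = "cloud_upload"                                  # terminal must upload it
                global_index[entry["fingerprint"]] = entry["location"]
            else:
                entry["location"] = location   # duplicate block: no upload needed
        return unresolved_entries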
S18, receiving and storing the new hot data blocks uploaded by terminal equipment of the terminal layer according to the edge storage positions and the edge uploading tags in the processed file spectrum sequence; the new cloud storage positions and the cloud uploading tags in the processed file spectrum sequence are also used to instruct the terminal equipment of the terminal layer to upload un-stored data blocks to the cloud layer for storage.
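For illustration only, the terminal-side behaviour implied by S18 could be sketched as below; the tag values and the upload_to_edge / upload_to_cloud callables are hypothetical names standing in for the actual upload mechanism.

    def dispatch_uploads(processed_sequence, blocks_by_fingerprint, upload_to_edge, upload_to_cloud):
        """Upload each not-yet-stored block to the layer indicated by its tag in the processed sequence."""
        for entry in processed_sequence:
            block = blocks_by_fingerprint[entry["fingerprint"]]
            if entry.get("tag") == "edge_upload":
                upload_to_edge(block, entry["location"])    # new hot block goes to the edge layer
            elif entry.get("tag") == "cloud_upload":
                upload_to_cloud(block, entry["location"])   # un-stored block goes to the cloud layer
            # untagged entries are already stored somewhere, so nothing is uploaded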
It will be appreciated that, for specific limitations of the cloud-edge-end longitudinal fusion deduplication storage method, reference may be made to the corresponding limitations of the cloud-edge-end longitudinal fusion deduplication storage system 100, which are not repeated here. The cloud-edge-end longitudinal fusion deduplication storage method is described from the perspective of the edge layer, so that the technical solution of the application can be understood more intuitively.
According to the cloud-edge-end longitudinal fusion deduplication storage method, the terminal layer is responsible for dividing files into data blocks and uploading the generated metadata information to the edge layer and the cloud layer for redundancy detection. The edge layer reduces the bandwidth resource overhead of the backbone network by storing hotter data blocks, and the uploaded metadata can also be pre-deduplicated at the edge layer to further reduce the data transmission amount. The cloud layer maintains a global fingerprint index table for global data deduplication, and supports distributed parallel indexing across cloud storage servers by partitioning the global fingerprint index table and incoming metadata to different servers of the cloud data center, thereby improving data deduplication and storage performance globally. Compared with conventional techniques, the end-edge-cloud longitudinal fusion deduplication storage architecture effectively integrates the technologies and storage resources of the different levels during file transmission, storage and retrieval, ensures optimal data deduplication performance while significantly reducing the bidirectional data transmission amount between layers, and thereby greatly reduces the backbone network resource overhead of the cloud-edge-end architecture.
In one embodiment, each cloud server in the cloud layer performs redundancy detection between the uploaded part of the unprocessed file spectrum sequence and the global fingerprint index table maintained by the cloud in a distributed indexing manner.
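For illustration only, one plausible way to realize such distributed indexing is to hash-partition fingerprints across cloud servers, as sketched below; the partitioning rule is an assumption, since the text only states that the global fingerprint index table and incoming metadata are split across different servers of the cloud data center.

    def partition_by_fingerprint(entries, num_servers):
        """Route file spectrum sequence entries to cloud servers by hashing their fingerprints."""
        shards = [[] for _ in range(num_servers)]
        for entry in entries:
            # assumes hexadecimal SHA-1 fingerprints; any stable hash of the fingerprint would do
            shard_id = int(entry["fingerprint"][:8], 16) % num_servers
            shards[shard_id].append(entry)
        return shards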
In one embodiment, the above method for deduplication storage by longitudinal fusion of cloud edge ends may further include the following processing steps:
and uploading the copy corresponding to the hot data block stored at the edge to the cloud layer by the edge server for storage.
In one embodiment, in the file retrieval process, the terminal device of the terminal layer sends a block request to the edge server and/or the cloud server according to the position information of the data blocks in the processed file spectrum sequence, and the retrieved data blocks are assembled into the original backup file in sequence.
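For illustration only, this retrieval flow could be sketched as follows, assuming each entry in the processed file spectrum sequence records which layer holds the block; fetch_from_edge and fetch_from_cloud are hypothetical callables standing in for the actual block request mechanism.

    def retrieve_file(processed_sequence, fetch_from_edge, fetch_from_cloud):
        """Fetch every block from the layer recorded in the sequence and reassemble the file in order."""
        parts = []
        for entry in processed_sequence:  # entries are assumed to be kept in original file order
            location = entry["location"]  # assumed format: {"layer": "edge" or "cloud", ...}
            if location["layer"] == "edge":
                parts.append(fetch_from_edge(location))
            else:
                parts.append(fetch_from_cloud(location))
        return b"".join(parts)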
It should be understood that, although the steps in the flowchart of fig. 10 are shown in sequence as indicated by the arrows, these steps are not necessarily performed in the order indicated by the arrows. Unless explicitly stated herein, the steps are not strictly limited to this order of execution and may be performed in other orders. Moreover, at least a portion of the steps in the flowchart of fig. 10 may include a plurality of sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times; their order of execution is not necessarily sequential, and they may be performed in turn or alternately with at least a portion of the sub-steps or stages of other steps.
In one embodiment, there is also provided a computer device including a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements the following processing steps: hashing fingerprint information contained in the unprocessed file spectrum sequence uploaded by the terminal layer into a fixed-size compact sketch data structure to estimate data block hotness, allocating an edge storage position for the data block estimated to be newly hot, recording the storage address of the edge storage position into the unprocessed file spectrum sequence, and attaching an edge uploading tag; matching entries in the unprocessed file spectrum sequence against entries in a set partial index table, copying the storage positions corresponding to data blocks with matched fingerprints from the set partial index table into the unprocessed file spectrum sequence, and uploading the part of the unprocessed file spectrum sequence corresponding to data blocks without matched fingerprints to the cloud layer, the set partial index table being the partial index table of the user-perceived adjacent version; assembling the processed file spectrum sequence and the part of the unprocessed file spectrum sequence corresponding to data blocks with matched fingerprints into a complete processed file spectrum sequence, and returning the processed file spectrum sequence to the terminal layer, the processed file spectrum sequence being obtained and returned after a cloud server in the cloud layer performs redundancy detection on the uploaded part of the unprocessed file spectrum sequence against the global fingerprint index table maintained by the cloud, attaches a cloud uploading tag to the entries in the unprocessed file spectrum sequence corresponding to un-stored data blocks, and allocates new cloud storage positions for the un-stored data blocks; and receiving and storing the new hot data blocks uploaded by terminal equipment of the terminal layer according to the edge storage positions and the edge uploading tags in the processed file spectrum sequence, the new cloud storage positions and the cloud uploading tags in the processed file spectrum sequence also being used to instruct the terminal equipment of the terminal layer to upload un-stored data blocks to the cloud layer for storage.
It will be appreciated that the above-mentioned computer device includes, in addition to the memory and processor mentioned above, other software and hardware components not listed in this specification; these may be determined according to the specific edge server device model in different application scenarios and are not enumerated in detail here.
In one embodiment, the processor may further implement the steps or sub-steps added in the embodiments of the above-described deduplication storage method for cloud-edge longitudinal fusion when executing the computer program.
In one embodiment, there is also provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the following processing steps: hashing fingerprint information contained in the unprocessed file spectrum sequence uploaded by the terminal layer into a fixed-size compact sketch data structure to estimate data block hotness, allocating an edge storage position for the data block estimated to be newly hot, recording the storage address of the edge storage position into the unprocessed file spectrum sequence, and attaching an edge uploading tag; matching entries in the unprocessed file spectrum sequence against entries in a set partial index table, copying the storage positions corresponding to data blocks with matched fingerprints from the set partial index table into the unprocessed file spectrum sequence, and uploading the part of the unprocessed file spectrum sequence corresponding to data blocks without matched fingerprints to the cloud layer, the set partial index table being the partial index table of the user-perceived adjacent version; assembling the processed file spectrum sequence and the part of the unprocessed file spectrum sequence corresponding to data blocks with matched fingerprints into a complete processed file spectrum sequence, and returning the processed file spectrum sequence to the terminal layer, the processed file spectrum sequence being obtained and returned after a cloud server in the cloud layer performs redundancy detection on the uploaded part of the unprocessed file spectrum sequence against the global fingerprint index table maintained by the cloud, attaches a cloud uploading tag to the entries in the unprocessed file spectrum sequence corresponding to un-stored data blocks, and allocates new cloud storage positions for the un-stored data blocks; and receiving and storing the new hot data blocks uploaded by terminal equipment of the terminal layer according to the edge storage positions and the edge uploading tags in the processed file spectrum sequence, the new cloud storage positions and the cloud uploading tags in the processed file spectrum sequence also being used to instruct the terminal equipment of the terminal layer to upload un-stored data blocks to the cloud layer for storage.
In one embodiment, when the computer program is executed by the processor, the steps or sub-steps added in the embodiments of the above-mentioned deduplication storage method for cloud-edge longitudinal fusion may also be implemented.
Those skilled in the art will appreciate that implementing all or part of the above-described methods may be accomplished by a computer program, which may be stored on a non-transitory computer readable storage medium and which, when executed, may include the steps of the above-described method embodiments. Any reference to memory, storage, database, or other medium used in the various embodiments provided herein may include non-volatile and/or volatile memory. The non-volatile memory can include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus dynamic random access memory (Rambus DRAM, RDRAM for short), and direct Rambus dynamic random access memory (DRDRAM), among others.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features contains no contradiction, it should be considered within the scope of this specification.
The foregoing examples represent only a few embodiments of the present application; although they are described in considerable detail, they are not to be construed as limiting the scope of the invention. It should be noted that those skilled in the art could make various modifications and improvements without departing from the spirit of the present application, and such modifications and improvements are intended to fall within the scope of the present application. The scope of protection of this patent is therefore defined by the appended claims.

Claims (10)

1. The cloud edge end longitudinal fusion deduplication storage system is characterized by comprising a terminal layer, an edge layer and a cloud layer, wherein after an original backup file is divided into data blocks by terminal equipment of the terminal layer, unprocessed file spectrum sequences corresponding to the data blocks are generated and uploaded to the edge layer;
the edge server in the edge layer hashes fingerprint information contained in the unprocessed file spectrum sequence to a compact sketch data structure with a fixed size to estimate the heat of a data block, allocates an edge storage position for an estimated new heat data block, records a storage address of the edge storage position into the unprocessed file spectrum sequence, and attaches an edge uploading tag;
The edge server in the edge layer matches the entry in the unprocessed file spectrum sequence with the entry in the set part index table, copies the storage position corresponding to the data block with the matched fingerprint from the set part index table to the unprocessed file spectrum sequence, and then uploads the part of the unprocessed file spectrum sequence corresponding to the data block without the matched fingerprint to the cloud layer; the set part index table is a part index table adjacent to the version perceived by the user;
the cloud server in the cloud layer carries out redundancy detection on the uploaded part of the unprocessed file spectrum sequence and a global fingerprint index table maintained by a cloud, attaches a cloud uploading tag to an entry in the unprocessed file spectrum sequence corresponding to an unrecognized unprocessed data block, allocates a new cloud storage position corresponding to the unrecognized data block, and returns the processed file spectrum sequence to the edge layer;
the edge server of the edge layer assembles the processed file spectrum sequence and a part of the unprocessed file spectrum sequence corresponding to the data block with the matched fingerprint into a complete processed file spectrum sequence, and returns the processed file spectrum sequence to the terminal layer;
And uploading new thermal data blocks to the edge layer for storage according to the edge storage position and the edge uploading tag in the processed file spectrum sequence by the terminal equipment of the terminal layer, and uploading non-stored data blocks to the cloud layer for storage according to the new cloud storage position and the cloud uploading tag in the processed file spectrum sequence.
2. The cloud-edge-end longitudinal fusion deduplication storage system of claim 1, wherein each cloud server in the cloud layer performs redundancy detection on the uploaded part of the unprocessed file spectrum sequence and a global fingerprint index table maintained by the cloud in a distributed index manner.
3. The cloud-edge-based longitudinal fusion deduplication storage system of claim 2, wherein the edge server of the edge layer is further configured to upload copies corresponding to hot data blocks stored at an edge to the cloud layer for storage.
4. A cloud-edge-based longitudinal fusion deduplication storage system according to any one of claims 1 to 3, wherein in a file retrieval process, terminal equipment of the terminal layer sends a block request to the edge server and/or the cloud server according to position information of data blocks in the processed file spectrum sequence, and the retrieved data blocks are assembled into an original backup file in sequence.
5. A method for removing and storing the duplicate of the longitudinal fusion of cloud edge ends is characterized by comprising the following steps:
hashing fingerprint information contained in an unprocessed file spectrum sequence uploaded by a terminal layer into a compact sketch data structure with a fixed size to perform data block heat estimation, distributing an edge storage position for a new estimated hot data block, recording a storage address of the edge storage position into the unprocessed file spectrum sequence, and attaching an edge uploading tag;
matching an entry in the unprocessed file spectrum sequence with an entry in a set part index table, copying a storage position corresponding to a data block with a matched fingerprint from the set part index table to the unprocessed file spectrum sequence, and uploading a part of the unprocessed file spectrum sequence corresponding to the data block without the matched fingerprint to the cloud layer; the set part index table is a part index table adjacent to the version perceived by the user;
assembling the processed file spectrum sequence and a part of the unprocessed file spectrum sequence corresponding to the data block with the matched fingerprint into a complete processed file spectrum sequence, and returning the processed file spectrum sequence to the terminal layer; the method comprises the steps that the processed file spectrum sequence carries out redundancy detection on part of the unprocessed file spectrum sequence which is uploaded and a global fingerprint index table maintained by a cloud end through a cloud server in the cloud layer, a cloud uploading tag is attached to an entry in the unprocessed file spectrum sequence corresponding to an unrecognized unrecorded data block, and a new cloud storage position corresponding to the unrecorded data block is allocated to obtain and return;
The terminal equipment of the terminal layer uploads a new thermal data block according to the edge storage position and the edge uploading label in the processed file spectrum sequence and stores the new thermal data block; and the new cloud storage position and the cloud uploading tag in the processed file spectrum sequence are also used for indicating terminal equipment of the terminal layer to upload non-stored data blocks to the cloud layer for storage.
6. The cloud-edge-end longitudinal fusion deduplication storage method of claim 5, wherein each cloud server in the cloud layer performs redundancy detection on the uploaded part of the unprocessed file spectrum sequence and a global fingerprint index table maintained by the cloud in a distributed index mode.
7. The method for deduplication storage by longitudinal fusion of cloud end as described in claim 6, further comprising the steps of:
and uploading the copy corresponding to the hot data block stored at the edge to the cloud layer for storage.
8. The method for deduplication storage by longitudinal fusion of cloud edge according to any one of claims 5 to 7, wherein in the process of file retrieval, the terminal device of the terminal layer sends a block request to the edge server and/or the cloud server according to the position information of the data blocks in the processed file spectrum sequence, and sequentially assembles the retrieved data blocks into an original backup file.
9. A computer device comprising a memory and a processor, said memory storing a computer program, characterized in that the processor, when executing said computer program, carries out the steps of the deduplication storage method of the cloud-edge longitudinal fusion of claim 5 or 7.
10. A computer readable storage medium having stored thereon a computer program, characterized in that the computer program when executed by a processor performs the steps of the deduplication storage method of cloud-edge vertical fusion according to claim 5 or 7.
CN202311496327.8A 2023-11-10 2023-11-10 Cloud edge end longitudinal fusion deduplication storage system, method, equipment and medium Pending CN117539389A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311496327.8A CN117539389A (en) 2023-11-10 2023-11-10 Cloud edge end longitudinal fusion deduplication storage system, method, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311496327.8A CN117539389A (en) 2023-11-10 2023-11-10 Cloud edge end longitudinal fusion deduplication storage system, method, equipment and medium

Publications (1)

Publication Number Publication Date
CN117539389A true CN117539389A (en) 2024-02-09

Family

ID=89787398

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311496327.8A Pending CN117539389A (en) 2023-11-10 2023-11-10 Cloud edge end longitudinal fusion deduplication storage system, method, equipment and medium

Country Status (1)

Country Link
CN (1) CN117539389A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination