CN107249035B - Shared repeated data storage and reading method with dynamically variable levels

Info

Publication number
CN107249035B
CN107249035B (application CN201710506611.7A)
Authority
CN
China
Prior art keywords
throughput
tenant
priority
tenants
data
Prior art date
Legal status
Active
Application number
CN201710506611.7A
Other languages
Chinese (zh)
Other versions
CN107249035A (en
Inventor
谭玉娟
赵亚军
晏志超
Current Assignee
Chongqing University
Original Assignee
Chongqing University
Priority date
Filing date
Publication date
Application filed by Chongqing University
Priority to CN201710506611.7A
Publication of CN107249035A
Application granted
Publication of CN107249035B
Active legal status (current)
Anticipated expiration

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 67/00: Network arrangements or protocols for supporting network services or applications
    • H04L 67/01: Protocols
    • H04L 67/10: Protocols in which an application is distributed across nodes in the network
    • H04L 67/1095: Replication or mirroring of data, e.g. scheduling or transport for data synchronisation between network nodes
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 43/00: Arrangements for monitoring or testing data switching networks
    • H04L 43/08: Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • H04L 43/0876: Network utilisation, e.g. volume of load or congestion level
    • H04L 43/0888: Throughput
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 67/00: Network arrangements or protocols for supporting network services or applications
    • H04L 67/50: Network services
    • H04L 67/60: Scheduling or organising the servicing of application requests, e.g. requests for application data transmissions using the analysis and optimisation of the required network resources
    • H04L 67/61: Scheduling or organising the servicing of application requests, e.g. requests for application data transmissions using the analysis and optimisation of the required network resources taking into account QoS or priority requirements

Abstract

The invention provides a shared duplicate data storage and reading mechanism with dynamically variable levels for a cloud backup system, suited to the cloud backup system's requirement for hierarchical and fair quality of service. The method cooperates closely with data deduplication, provides different quality-of-service policies for tenants of different levels, improves the system's deduplication ratio, and achieves an optimal service effect. The invention ensures that tenants enjoy fair, level-differentiated quality of service in a cloud backup environment.

Description

Shared repeated data storage and reading method with dynamically variable levels
Technical Field
The invention belongs to the technical field of computer information storage. To meet the quality-of-service requirements of a multi-tenant cloud backup system that performs data deduplication, it provides a shared duplicate data storage and reading method with dynamically variable levels, so that tenants receive fair, level-guaranteed quality of service.
Background
In a cloud backup system, different tenants purchase backup resources from the cloud backup provider according to their respective service requirements, and they expect the cloud backup system to provide fair, level-guaranteed quality of service. With the adoption of data deduplication, shared duplicate data makes quality of service harder to maintain, and traditional methods alone can no longer satisfy the tenants' requirement for differentiated service levels.
To address these problems of the cloud backup system, the invention provides a shared duplicate data storage and reading method with dynamically variable levels that gives tenants quality of service that is both fair and level-differentiated. Unlike existing methods, it is a hierarchical quality-of-service control method built on data deduplication: resource allocation and throughput monitoring use fine-grained, data-block-level control, which yields fairer level-differentiated quality of service than other methods and can also improve the system's deduplication ratio, throughput, and data recovery speed.
Disclosure of Invention
The invention provides a method for storing and reading shared duplicate data with dynamically variable levels. It fully takes into account the resource allocation and throughput of each data block in the backup and recovery processing stages and, on the premise of stable system performance, realizes hierarchical quality of service for the cloud backup system from three aspects: tenant leveling, fair resource allocation, and shared duplicate data processing.
One core idea of the invention is fair resource allocation, which ensures that tenants obtain fair service in every processing stage of data backup and data recovery. The resource allocation method assigns each tenant a fair share of memory space according to its service level. The method comprises the following steps: (1) first, apply for a memory space from the cloud backup system; (2) quantify the memory space by dividing the applied memory capacity by the metadata size of one data block, giving the total number of memory units available to the cloud backup system; (3) allocate memory space to each tenant using formula (1), according to the weight of the tenant's level and the number of tenants at each level.
$$\mathrm{Memory}_L=\frac{P_L}{\sum_{n=1}^{N}P_nA_n}\times\mathrm{Memory}_{total}\tag{1}$$
wherein Memory_L represents the memory space allocated to a tenant of level L, Memory_total is the total memory space applied for by the system, N represents the number of levels of the current system, P_n and A_n respectively represent the memory-space weight and the number of tenants of the n-th level of the current system, ∑_{n=1}^{N} P_n·A_n represents the sum of the weights of all tenants, P_L represents the memory-space weight of a level-L tenant, P_L / ∑_{n=1}^{N} P_n·A_n is the proportion of the total memory space occupied by a level-L tenant, and (P_L / ∑_{n=1}^{N} P_n·A_n) × Memory_total represents the memory space a level-L tenant obtains from the total memory space.
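As a hedged illustration of formula (1), the sketch below (Python) computes the per-tenant memory share for each level; the level weights, tenant counts, metadata size, and all names are assumed example values, not part of the patent.

```python
# Minimal sketch of formula (1): per-tenant memory allocation by level.
# Weights, tenant counts and METADATA_BYTES are assumed example values.

METADATA_BYTES = 64  # assumed size of the metadata kept for one data block

def allocate_memory(total_memory_bytes, level_weights, tenants_per_level):
    """Memory_L = P_L / sum(P_n * A_n) * Memory_total for every level L."""
    total_weight = sum(p * a for p, a in zip(level_weights, tenants_per_level))
    return [p / total_weight * total_memory_bytes for p in level_weights]

# Step (2) of the method: quantify the applied memory into block-metadata units.
total_memory = 512 * 1024 * 1024                  # memory applied for from the system
print(total_memory // METADATA_BYTES, "metadata units available in total")

per_tenant = allocate_memory(total_memory,
                             level_weights=[4, 2, 1],       # P_1, P_2, P_3
                             tenants_per_level=[2, 5, 10])  # A_1, A_2, A_3
for level, share in enumerate(per_tenant, start=1):
    print(f"level {level}: {share / METADATA_BYTES:.0f} metadata units per tenant")
```

Multiplying each level's per-tenant share by its tenant count and summing over levels returns exactly the total memory, which is the fairness property the formula is built around.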
The second core idea of the invention is throughput monitoring, whose purpose is to dynamically adjust each tenant's memory space and throughput threshold and thereby guarantee the tenants' hierarchical quality of service. The method comprises the following steps: (1) periodically monitor, in real time, the throughput of each tenant during data backup and data recovery; (2) sum the monitored tenant throughputs and, according to the weight of each tenant level and the number of tenants at each level in the system, calculate the average throughput of each level using formula (2).
$$\mathrm{Throughput}_{L,a}=\frac{P_L}{\sum_{n=1}^{N}P_nA_n}\times\mathrm{Throughput}_{total}\tag{2}$$
wherein Throughput_{L,a} represents the average throughput of a tenant of level L, Throughput_total is the total throughput of the system, N represents the number of levels of the current system, P_n and A_n respectively represent the throughput weight and the number of tenants of the n-th level of the current system, ∑_{n=1}^{N} P_n·A_n represents the sum of the weights of all tenants, P_L is the throughput weight of a level-L tenant, P_L / ∑_{n=1}^{N} P_n·A_n is the ratio of the level-L tenant's throughput to the total throughput, and (P_L / ∑_{n=1}^{N} P_n·A_n) × Throughput_total represents the average throughput of a level-L tenant.
(3) Initialize each tenant's throughput threshold with the throughput monitored in the first period. After each monitoring period, if a tenant's real-time throughput is not equal to the average throughput of its level, dynamically adjust the tenant's memory space and throughput threshold using formula (3), shown below.
$$\Delta\mathrm{Throughput}=\mathrm{Throughput}_{L,a}-\mathrm{Throughput}_i,\qquad \Delta\mathrm{Memory}=\frac{\mathrm{Throughput}_{L,a}-\mathrm{Throughput}_i}{\mathrm{Throughput}_i}\times\mathrm{Memory}_i\tag{3}$$
wherein Throughput_{L,a} represents the average throughput of level L, Throughput_i is the current real-time throughput of tenant i, ΔMemory and ΔThroughput are the adjustments to the memory space and the throughput threshold when the tenant's current throughput is not equal to the average throughput of its level, Throughput_{L,a} - Throughput_i represents the amount of throughput lost by tenant i, Memory_i represents tenant i's current memory space, and (Throughput_{L,a} - Throughput_i) / Throughput_i × Memory_i is the memory space compensated to tenant i.
(4) According to the increment ΔMemory of the memory space and the increment ΔThroughput of the throughput threshold calculated by formula (3), increase the tenant's memory space by ΔMemory and its throughput threshold by ΔThroughput.
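The monitoring loop can be sketched as below; this is an illustrative reading of formulas (2) and (3), and the Tenant structure, the example weights, and all names are assumptions.

```python
# Sketch of throughput monitoring (formula (2)) and dynamic adjustment (formula (3)).
# The Tenant structure and the monitoring loop around it are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Tenant:
    level: int               # 0-based index into the weight/count tables
    throughput: float        # real-time throughput measured this period
    memory: float            # current memory space
    throughput_threshold: float

def level_average_throughput(total_throughput, weights, counts):
    """Formula (2): Throughput_{L,a} = P_L / sum(P_n * A_n) * Throughput_total."""
    total_weight = sum(p * a for p, a in zip(weights, counts))
    return [p / total_weight * total_throughput for p in weights]

def adjust(tenant, level_avg):
    """Formula (3): compensate the memory space and threshold of a lagging tenant."""
    delta_throughput = level_avg - tenant.throughput           # throughput lost (or gained)
    delta_memory = delta_throughput / tenant.throughput * tenant.memory
    tenant.memory += delta_memory
    tenant.throughput_threshold += delta_throughput

tenants = [Tenant(level=0, throughput=80.0, memory=4096.0, throughput_threshold=80.0),
           Tenant(level=1, throughput=70.0, memory=2048.0, throughput_threshold=70.0)]
total = sum(t.throughput for t in tenants)
averages = level_average_throughput(total, weights=[2, 1], counts=[1, 1])
for t in tenants:
    if t.throughput != averages[t.level]:
        adjust(t, averages[t.level])
        print(f"level {t.level + 1}: memory -> {t.memory:.1f}, "
              f"threshold -> {t.throughput_threshold:.1f}")
```

Note that when a tenant runs above its level average the deltas are negative, which matches the claim's branch that reduces the memory space and throughput threshold.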
The third core idea of the invention is shared duplicate data processing. The shared duplicate data processing method comprises a shared duplicate data storage method and a shared data reading method, specifically as follows:
(a) Shared duplicate data storage method: when, within a unit time period, a new data block to be backed up is shared by a high-priority tenant and a low-priority tenant, check, using the tenant throughputs and the per-level average throughputs obtained from the throughput monitoring module, whether the ratio of the high-priority tenant's throughput to the low-priority tenant's throughput is greater than or equal to the ratio of the corresponding level-average throughputs. If it is, the high-priority tenant can complete the backup of the block without affecting its own performance, so the backup task for the block is assigned to the high-priority tenant and the low-priority tenant's copy is marked as a duplicate pointing to that block; otherwise, the low-priority tenant completes the backup of the block and the high-priority tenant's copy is marked as a duplicate pointing to that block.
(b) Shared data reading method: when, within a unit time period, a data block to be recovered is shared by a high-priority tenant and a low-priority tenant and is not in the cache, check, using the tenant throughputs and the per-level average throughputs obtained from the throughput monitoring module, whether the ratio of the high-priority tenant's throughput to the low-priority tenant's throughput is greater than or equal to the ratio of the corresponding level-average throughputs. If it is, the high-priority tenant can cache the block without affecting its own performance, so the high-priority tenant caches the block and completes the recovery, its memory space is increased by the metadata size of one data block, and the low-priority tenant's memory space is decreased by the metadata size of one data block; otherwise, the low-priority tenant caches the block and completes the recovery.
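Both policies reduce to the same test before the shared block is handed to one tenant or the other. A minimal sketch follows, reading the condition as a comparison of the two tenants' throughput ratio with the ratio of their level averages; the function name and arguments are assumptions.

```python
# Sketch of the shared-block decision used by both the storage policy (a) and the
# read policy (b). Names and the caller-side bookkeeping are assumptions.

def high_priority_handles_block(t_high, t_low, avg_high, avg_low):
    """True when the high-priority tenant should store/cache the shared block.

    t_high, t_low     -> measured throughput of the two tenants this period
    avg_high, avg_low -> average throughput of their respective levels (formula (2))
    """
    return t_high / t_low >= avg_high / avg_low

# Example: the high-priority tenant is running ahead of its level average,
# so it absorbs the shared block without hurting its own service level.
print(high_priority_handles_block(t_high=120.0, t_low=45.0,
                                  avg_high=100.0, avg_low=40.0))  # True
```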
The method for storing and reading shared duplicate data with dynamically variable levels provided by the invention mainly comprises two parts: data backup and data recovery.
The data backup method comprises the following specific steps:
(10) The client splits the data stream that the tenant needs to back up into data blocks, computes a fingerprint for each block with a hash algorithm, and sends the data fingerprints together with the tenant level information to the server.
(11) After receiving the data information sent by the client, the server performs the following steps:
(11.1) Establish a corresponding priority for the tenant's backup service according to the tenant's service level; the resource fair allocation module allocates memory space and a throughput threshold to the tenant using formula (1), according to the weight of that service level.
(11.2) Periodically monitor the throughput of each tenant during data backup in real time. Sum the monitored tenant throughputs and, according to the weight of each tenant level and the number of tenants at each level in the system, calculate the average throughput of each level using formula (2). At the end of each monitoring period, if a tenant's throughput is not equal to the average throughput of its level, adjust the memory space and throughput threshold of step (11.1) using formula (3).
(11.3) After the service priorities are determined in step (11.1), traverse the fingerprint sequences sent in step (10) in order of tenant backup priority from high to low and query each fingerprint in the fingerprint index table. If a fingerprint is not in the table, mark the corresponding data block as a new data block; otherwise the corresponding data block is already stored, so mark it as a duplicate data block and record its storage address.
(11.4) Store the new data blocks, specifically as follows:
(a) If the new data block is backed up by both a high-priority tenant and a low-priority tenant within the unit time period, apply the shared duplicate data storage policy and update the fingerprint index table according to the storage address of the new data block. The policy is as follows:
Using the tenant throughputs and per-level average throughputs obtained in step (11.2), check whether the ratio of the high-priority tenant's throughput to the low-priority tenant's throughput is greater than or equal to the ratio of the corresponding level-average throughputs. If it is, the high-priority tenant completes the storage of the new data block; otherwise the low-priority tenant completes it.
(b) If the new data block is not backed up by both a high-priority tenant and a low-priority tenant within the unit time period, the tenant to which the block belongs stores it, and the fingerprint index table is updated according to the storage address of the new data block.
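For concreteness, a condensed, hypothetical sketch of the backup path of steps (10) and (11.3)-(11.4) is given below; fixed-size chunking, SHA-1 fingerprints, and the in-memory index are assumptions (the patent does not fix a chunking scheme or hash), and the priority scheduling and shared-duplicate hand-off of steps (11.1)-(11.2) are sketched separately above.

```python
# Illustrative backup path for steps (10) and (11.3)-(11.4). Fixed-size chunking,
# SHA-1 fingerprints and the in-memory index are assumptions, not the patent's choices.
import hashlib

CHUNK_SIZE = 4096
fingerprint_index = {}            # fingerprint -> storage address
storage = []                      # stand-in for the on-disk block store

def chunk_and_fingerprint(data: bytes):
    """Step (10): split the tenant's stream into blocks and hash each block."""
    for off in range(0, len(data), CHUNK_SIZE):
        block = data[off:off + CHUNK_SIZE]
        yield hashlib.sha1(block).hexdigest(), block

def backup(stream: bytes):
    """Steps (11.3)-(11.4): dedup against the fingerprint index, store new blocks."""
    recipe = []
    for fp, block in chunk_and_fingerprint(stream):
        if fp in fingerprint_index:               # duplicate block: record its address only
            recipe.append(fingerprint_index[fp])
        else:                                     # new block: store it and index it
            storage.append(block)
            fingerprint_index[fp] = len(storage) - 1
            recipe.append(fingerprint_index[fp])
    return recipe                                 # addresses needed for later recovery

recipe = backup(b"example tenant data " * 500)
print(len(recipe), "block references,", len(storage), "unique blocks stored")
```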
The data recovery method comprises the following specific steps:
(20) The client reads the addresses of the data the tenant needs to recover and sends the addresses together with the tenant level information to the server.
(21) The server receives the recovery data address and the tenant grade information sent by the client, and the following steps are carried out:
and (21.1) establishing corresponding priority for the recovery service of the tenant according to the service level of the tenant.
(21.2) Look up, by the recovery data addresses, the metadata describing where the data is stored on disk.
(21.3) According to the weight of the tenant's service level, allocate memory space and a throughput threshold to the tenant using formula (1).
(21.4) Periodically monitor the throughput of each tenant during data recovery in real time. Sum the monitored tenant throughputs and, according to the weight of each tenant level and the number of tenants at each level in the system, calculate the average throughput of each level using formula (2). At the end of each monitoring period, if a tenant's throughput is not equal to the average throughput of its level, adjust the memory space and throughput threshold of step (21.3) using formula (3).
(21.5) After the service priorities are determined in step (21.1), scan the metadata of the data to be recovered found in step (21.2) in order of tenant recovery priority from high to low and look each block up in the server cache. If the data block corresponding to the metadata is in the cache, recover it directly; if it is not in the cache, perform the following steps:
(a) If the data block is recovered by both a high-priority tenant and a low-priority tenant within the unit time period, apply the shared data reading policy, which is as follows:
Using the tenant throughputs and per-level average throughputs obtained in step (21.4), check whether the ratio of the high-priority tenant's throughput to the low-priority tenant's throughput is greater than or equal to the ratio of the corresponding level-average throughputs; this determines whether caching the block would affect the high-priority tenant's performance. If the condition holds, the high-priority tenant caches the data block, its memory space is increased by the metadata size of one data block, and the low-priority tenant's memory space is decreased by the metadata size of one data block; otherwise the low-priority tenant caches the block and completes the recovery;
(b) If the data block is not recovered by both a high-priority tenant and a low-priority tenant within the unit time period, the tenant to which the block belongs caches it and completes the recovery.
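A matching, equally hypothetical sketch of the recovery path of steps (20)-(21.5) follows; the fixed-capacity FIFO cache and all names are assumptions, since the patent only requires a server-side cache bounded by each tenant's allotted memory, with the shared-read policy above deciding who caches a shared block.

```python
# Illustrative recovery path for steps (20)-(21.5). The FIFO cache and all names
# are assumptions made for the sketch.
from collections import OrderedDict

class BlockCache:
    """Fixed-capacity cache keyed by storage address."""
    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self._blocks = OrderedDict()          # storage address -> block bytes

    def get(self, address):
        return self._blocks.get(address)

    def put(self, address, block):
        if len(self._blocks) >= self.capacity:
            self._blocks.popitem(last=False)  # evict the oldest cached block
        self._blocks[address] = block

def recover(recipe, storage, cache):
    """Step (21.5): serve blocks from the cache, reading misses from the store."""
    out = []
    for address in recipe:
        block = cache.get(address)
        if block is None:                     # cache miss: read from disk, then cache
            block = storage[address]
            cache.put(address, block)
        out.append(block)
    return b"".join(out)

storage = {0: b"A" * 4096, 1: b"B" * 4096}
restored = recover([0, 1, 0], storage, BlockCache(capacity_blocks=8))
print(len(restored), "bytes restored")        # 12288
```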
The invention is characterized in that it is a hierarchical quality-of-service control method built on data deduplication: resource allocation and throughput monitoring use fine-grained, data-block-level control, the data blocks of every stage of data backup and data recovery are accounted for precisely, the quality-of-service unfairness caused by deduplication is resolved, and deduplication is fitted more closely to the cloud backup system.
Drawings
FIG. 1 is a schematic block diagram of the module structure;
FIG. 2 is a flowchart of the data deduplication method and the shared data reading method.
Detailed Description
Fig. 1 is a schematic diagram of the module structure of the invention. The invention involves a client 100 and a server 200. The client comprises a fingerprint processing module 110, which chunks the backup data set into data blocks and computes a fingerprint for each block with a hash function. The server comprises a tenant level management module 210, a resource fair allocation module 220, a shared duplicate data processing module 230 and a throughput monitoring module 240. The tenant level management module 210 establishes a priority for each tenant according to its service level and schedules tenant data from high priority to low priority for data backup or data recovery. The resource fair allocation module 220 and the throughput monitoring module 240 guarantee the tenants' hierarchical quality of service: module 220 allocates a fair memory space to each tenant using formula (1), while module 240 monitors the throughput of tenant data processing in real time and dynamically adjusts each tenant's memory space and throughput threshold using formulas (2) and (3). When a high-priority tenant and a low-priority tenant share a data block to be stored or cached within a unit time period, the shared duplicate data processing module 230 checks whether the ratio of the high-priority tenant's throughput to the low-priority tenant's throughput is greater than or equal to the ratio of the corresponding level-average throughputs; if it is, the high-priority tenant completes the storage or caching of the block, otherwise the low-priority tenant does.
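As a rough, assumed wiring of the FIG. 1 modules, the sketch below invents class and method names; only the division of responsibilities and the scheduling order follow the description.

```python
# Illustrative wiring of the FIG. 1 modules (210, 220, 230, 240). Class and method
# names are assumptions; the client-side fingerprint module 110 only supplies
# (fingerprint, level) pairs and is omitted here.

class TenantLevelManager:                     # module 210: priority scheduling
    def order(self, requests):
        # schedule tenant requests from high priority (small level number) to low
        return sorted(requests, key=lambda r: r["level"])

class FairResourceAllocator:                  # module 220: formula (1)
    def allocate(self, total_memory, weights, counts):
        total = sum(p * a for p, a in zip(weights, counts))
        return [p / total * total_memory for p in weights]

class ThroughputMonitor:                      # module 240: formulas (2) and (3)
    def level_averages(self, total_throughput, weights, counts):
        total = sum(p * a for p, a in zip(weights, counts))
        return [p / total * total_throughput for p in weights]

class SharedDuplicateProcessor:               # module 230: storage/read policies
    def high_priority_wins(self, t_high, t_low, avg_high, avg_low):
        return t_high / t_low >= avg_high / avg_low

requests = [{"tenant": "B", "level": 2}, {"tenant": "A", "level": 1}]
print([r["tenant"] for r in TenantLevelManager().order(requests)])  # ['A', 'B']
```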
FIG. 2 is a processing flow diagram of the data deduplication method and the shared data reading method of the invention, comprising two parts: data backup and data recovery.
The data backup method comprises the following specific steps:
(10) The fingerprint processing module 110 of the client 100 splits the data stream that the tenant needs to back up into data blocks, computes a fingerprint for each block with a hash algorithm, and sends the data fingerprints together with the tenant level information to the server.
(11) After receiving the data information sent by the client, the server 200 performs the following steps:
(11.1) The tenant level management module 210 establishes a corresponding priority for the tenant's backup service according to the tenant's service level; the resource fair allocation module 220 allocates memory space and a throughput threshold to the tenant using formula (1), according to the weight of that service level.
(11.2) The throughput monitoring module 240 periodically monitors the throughput of each tenant during data backup in real time, sums the monitored tenant throughputs and, according to the weight of each tenant level and the number of tenants at each level in the system, calculates the average throughput of each level using formula (2). At the end of each monitoring period, if a tenant's throughput is not equal to the average throughput of its level, the memory space and throughput threshold of step (11.1) are adjusted using formula (3).
(11.3) After the service priorities are determined in step (11.1), the tenant level management module 210 traverses the fingerprint sequences sent in step (10) in order of tenant backup priority from high to low and queries each fingerprint in the fingerprint index table. If a fingerprint is not in the table, the corresponding data block is marked as a new data block; otherwise the corresponding data block is already stored, so it is marked as a duplicate data block and its storage address is recorded.
(11.4) Store the new data blocks, specifically as follows:
(a) If the new data block is backed up by both a high-priority tenant and a low-priority tenant within the unit time period, the shared duplicate data processing module 230 applies the shared duplicate data storage policy and updates the fingerprint index table according to the storage address of the new data block. The policy is as follows: using the tenant throughputs and per-level average throughputs obtained in step (11.2), check whether the ratio of the high-priority tenant's throughput to the low-priority tenant's throughput is greater than or equal to the ratio of the corresponding level-average throughputs; if it is, the high-priority tenant completes the storage of the new data block, otherwise the low-priority tenant completes it.
(b) If the new data block is not backed up by both a high-priority tenant and a low-priority tenant within the unit time period, the tenant to which the block belongs stores it, and the fingerprint index table is updated according to the storage address of the new data block.
The data recovery method comprises the following specific steps:
(20) The client 100 reads the addresses of the data the tenant needs to recover and sends the addresses together with the tenant level information to the server.
(21) The server 200 receives the recovery data address and the tenant level information sent by the client, and performs the following steps:
(21.1) the tenant level management module 210 establishes a corresponding priority for the recovery service of the tenant according to the service level of the tenant.
(21.2) Look up, by the recovery data addresses, the metadata describing where the data is stored on disk.
(21.3) The resource fair allocation module 220 allocates memory space and a throughput threshold to the tenant using formula (1), according to the weight of the tenant's service level.
(21.4) The throughput monitoring module 240 periodically monitors the throughput of each tenant during data recovery in real time, sums the monitored tenant throughputs and, according to the weight of each tenant level and the number of tenants at each level in the system, calculates the average throughput of each level using formula (2). At the end of each monitoring period, if a tenant's throughput is not equal to the average throughput of its level, the memory space and throughput threshold of step (21.3) are adjusted using formula (3).
(21.5) After the service priorities are determined in step (21.1), the tenant level management module 210 scans the metadata of the data to be recovered found in step (21.2) in order of tenant recovery priority from high to low and looks each block up in the server cache. If the data block corresponding to the metadata is in the cache, it is recovered directly; if it is not in the cache, the following steps are performed:
(a) If the data block is recovered by both a high-priority tenant and a low-priority tenant within the unit time period, the shared duplicate data processing module 230 applies the shared data reading policy, which is as follows: using the tenant throughputs and per-level average throughputs obtained in step (21.4), check whether the ratio of the high-priority tenant's throughput to the low-priority tenant's throughput is greater than or equal to the ratio of the corresponding level-average throughputs; this determines whether caching the block would affect the high-priority tenant's performance. If the condition holds, the high-priority tenant caches the data block, its memory space is increased by the metadata size of one data block, and the low-priority tenant's memory space is decreased by the metadata size of one data block; otherwise the low-priority tenant caches the block and completes the recovery;
(b) If the data block is not recovered by both a high-priority tenant and a low-priority tenant within the unit time period, the tenant to which the block belongs caches it and completes the recovery.

Claims (1)

1. A shared repeated data storage and reading method with dynamically variable levels, mainly comprising two parts: data backup and data recovery;
the data backup method comprises the following specific steps:
(10) the client splits the data stream that a tenant needs to back up into data blocks, calculates a fingerprint for each block using a hash algorithm, and sends the data block fingerprints and the tenant level information to a server;
(11) after receiving the data information sent by the client, the server performs the following steps:
(11.1) establishing corresponding priority for backup services of the tenants according to the service levels of the tenants, and distributing memory spaces in corresponding proportion for the tenants according to the weights corresponding to the service levels:
$$\mathrm{Memory}_L=\frac{P_L}{\sum_{n=1}^{N}P_nA_n}\times\mathrm{Memory}_{total}\tag{1}$$
wherein Memory_L represents the memory space allocated to a tenant of level L, Memory_total is the total memory space applied for by the system, N represents the number of levels of the current system, P_n and A_n respectively represent the memory-space weight and the number of tenants of the n-th level of the current system, ∑_{n=1}^{N} P_n·A_n represents the sum of the weights of all tenants, P_L represents the memory-space weight of a level-L tenant, P_L / ∑_{n=1}^{N} P_n·A_n is the proportion of the total memory space occupied by a level-L tenant, and (P_L / ∑_{n=1}^{N} P_n·A_n) × Memory_total represents the memory space a level-L tenant obtains from the total memory space;
(11.2) carrying out periodic real-time monitoring on the throughput of each tenant in the data backup, summing the monitored throughputs of the tenants, and calculating the average throughputs of different levels according to the corresponding weight of the tenant level and the number of the tenants in each level in the system:
$$\mathrm{Throughput}_{L,a}=\frac{P_L}{\sum_{n=1}^{N}P_nA_n}\times\mathrm{Throughput}_{total}\tag{2}$$
wherein Throughput_{L,a} represents the average throughput of a tenant of level L, Throughput_total is the total throughput of the system, N represents the number of levels of the current system, P_n and A_n respectively represent the throughput weight and the number of tenants of the n-th level of the current system, ∑_{n=1}^{N} P_n·A_n represents the sum of the weights of all tenants, P_L is the throughput weight of a level-L tenant, P_L / ∑_{n=1}^{N} P_n·A_n is the ratio of the level-L tenant's throughput to the total throughput, and (P_L / ∑_{n=1}^{N} P_n·A_n) × Throughput_total represents the average throughput of a level-L tenant;
initializing each tenant's throughput threshold with the throughput monitored in the first period; after each monitoring period, if a tenant's throughput is lower than the average throughput of its level, increasing the tenant's memory space and throughput threshold, and if it is higher than the average throughput of its level, reducing them:
$$\Delta\mathrm{Throughput}=\mathrm{Throughput}_{L,a}-\mathrm{Throughput}_i,\qquad \Delta\mathrm{Memory}=\frac{\mathrm{Throughput}_{L,a}-\mathrm{Throughput}_i}{\mathrm{Throughput}_i}\times\mathrm{Memory}_i\tag{3}$$
wherein Throughput_{L,a} represents the average throughput of level L, Throughput_i is the current real-time throughput of tenant i, ΔMemory and ΔThroughput are the adjustments to the memory space and the throughput threshold when the tenant's current throughput is not equal to the average throughput of its level, Throughput_{L,a} - Throughput_i represents the amount of throughput lost by tenant i, Memory_i represents tenant i's current memory space, and (Throughput_{L,a} - Throughput_i) / Throughput_i × Memory_i is the memory space compensated to tenant i;
(11.3) after the service priority is determined in the step (11.1), according to the tenant backup service priority, traversing the fingerprint sequence sent in the step (10) from high priority to low priority in sequence, inquiring in a fingerprint index table, and if the fingerprint does not exist, marking the corresponding data block as a new data block; otherwise, marking the data block as a repeated data block, and recording the storage address of the data block;
(11.4) storing the new data block, and specifically comprising the following steps:
(a) if the new data block is data commonly backed up by a high-priority tenant and a low-priority tenant in a unit time period, adopting a shared repeated data storage strategy, and updating a fingerprint index table according to a storage address of the new data block, wherein the shared repeated data storage strategy specifically comprises the following steps:
checking, according to the tenant throughputs and the per-level average throughputs obtained in step (11.2), whether the ratio of the high-priority tenant's throughput to the low-priority tenant's throughput is greater than or equal to the ratio of the corresponding level-average throughputs; if it is, the high-priority tenant completes the storage of the new data block, otherwise the low-priority tenant completes the storage of the new data block;
(b) if the new data block is not the data commonly backed up by the high-priority tenant and the low-priority tenant in the unit time period, the tenant to which the data block belongs completes data block storage, and the fingerprint index table is updated according to the storage address of the new data block;
the data recovery method comprises the following specific steps:
(20) the client reads the address of the data needing to be recovered by the tenant, and sends the address of the data needing to be recovered and the tenant level information to the server;
(21) the server receives the recovery data address and the tenant grade information sent by the client, and the following steps are carried out:
(21.1) establishing corresponding priority for the recovery service of the tenant according to the service level of the tenant;
(21.2) looking up, by the recovery data addresses, the metadata describing where the data is stored on disk;
(21.3) distributing memory space for the tenant according to the weight corresponding to the service level;
$$\mathrm{Memory}_L=\frac{P_L}{\sum_{n=1}^{N}P_nA_n}\times\mathrm{Memory}_{total}\tag{1}$$
wherein Memory_L represents the memory space allocated to a tenant of level L, Memory_total is the total memory space applied for by the system, N represents the number of levels of the current system, P_n and A_n respectively represent the memory-space weight and the number of tenants of the n-th level of the current system, ∑_{n=1}^{N} P_n·A_n represents the sum of the weights of all tenants, P_L represents the memory-space weight of a level-L tenant, P_L / ∑_{n=1}^{N} P_n·A_n is the proportion of the total memory space occupied by a level-L tenant, and (P_L / ∑_{n=1}^{N} P_n·A_n) × Memory_total represents the memory space a level-L tenant obtains from the total memory space;
(21.4) carrying out periodic real-time monitoring on the throughput of each tenant in data recovery, summing the monitored throughputs of the tenants, and calculating the average throughputs of different levels according to the corresponding weight of the tenant level and the number of the tenants in each level in the system:
$$\mathrm{Throughput}_{L,a}=\frac{P_L}{\sum_{n=1}^{N}P_nA_n}\times\mathrm{Throughput}_{total}\tag{2}$$
wherein Throughput_{L,a} represents the average throughput of a tenant of level L, Throughput_total is the total throughput of the system, N represents the number of levels of the current system, P_n and A_n respectively represent the throughput weight and the number of tenants of the n-th level of the current system, ∑_{n=1}^{N} P_n·A_n represents the sum of the weights of all tenants, P_L is the throughput weight of a level-L tenant, P_L / ∑_{n=1}^{N} P_n·A_n is the ratio of the level-L tenant's throughput to the total throughput, and (P_L / ∑_{n=1}^{N} P_n·A_n) × Throughput_total represents the average throughput of a level-L tenant;
initializing each tenant's throughput threshold with the throughput monitored in the first period; after each monitoring period, if a tenant's throughput is lower than the average throughput of its level, increasing the tenant's memory space and throughput threshold, and if it is higher than the average throughput of its level, reducing them:
$$\Delta\mathrm{Throughput}=\mathrm{Throughput}_{L,a}-\mathrm{Throughput}_i,\qquad \Delta\mathrm{Memory}=\frac{\mathrm{Throughput}_{L,a}-\mathrm{Throughput}_i}{\mathrm{Throughput}_i}\times\mathrm{Memory}_i\tag{3}$$
wherein Throughput_{L,a} represents the average throughput of level L, Throughput_i is the current real-time throughput of tenant i, ΔMemory and ΔThroughput are the adjustments to the memory space and the throughput threshold when the tenant's current throughput is not equal to the average throughput of its level, Throughput_{L,a} - Throughput_i represents the amount of throughput lost by tenant i, Memory_i represents tenant i's current memory space, and (Throughput_{L,a} - Throughput_i) / Throughput_i × Memory_i is the memory space compensated to tenant i;
(21.5) after the service priorities are determined in step (21.1), scanning the metadata of the data to be recovered found in step (21.2) in order of tenant recovery priority from high to low and looking it up in the server cache; if the data block corresponding to the metadata is in the cache, recovering it directly; if it is not in the cache, performing the following steps:
(a) if the data block is recovered by both a high-priority tenant and a low-priority tenant within a unit time period, applying a shared data reading policy, the policy being as follows: checking, according to the tenant throughputs and the per-level average throughputs obtained in step (21.4), whether the ratio of the high-priority tenant's throughput to the low-priority tenant's throughput is greater than or equal to the ratio of the corresponding level-average throughputs; if it is, the high-priority tenant caches the data block, the memory space of the high-priority tenant is increased by the metadata size of one data block, and the memory space of the low-priority tenant is decreased by the metadata size of one data block; otherwise, the low-priority tenant caches the data block and completes the recovery;
(b) if the data block is not recovered by both a high-priority tenant and a low-priority tenant within the unit time period, the tenant to which the data block belongs caches it and completes the recovery.
CN201710506611.7A 2017-06-28 2017-06-28 Shared repeated data storage and reading method with dynamically variable levels Active CN107249035B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710506611.7A CN107249035B (en) 2017-06-28 2017-06-28 Shared repeated data storage and reading method with dynamically variable levels

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710506611.7A CN107249035B (en) 2017-06-28 2017-06-28 Shared repeated data storage and reading method with dynamically variable levels

Publications (2)

Publication Number Publication Date
CN107249035A CN107249035A (en) 2017-10-13
CN107249035B true CN107249035B (en) 2020-05-26

Family

ID=60013512

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710506611.7A Active CN107249035B (en) 2017-06-28 2017-06-28 Shared repeated data storage and reading method with dynamically variable levels

Country Status (1)

Country Link
CN (1) CN107249035B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108733511B (en) * 2018-03-23 2022-05-24 赵浩茗 Electronic data processing method based on big data
CN110609807B (en) * 2018-06-15 2023-06-23 伊姆西Ip控股有限责任公司 Method, apparatus and computer readable storage medium for deleting snapshot data
CN110083309B (en) * 2019-04-11 2020-05-26 重庆大学 Shared data block processing method, system and readable storage medium
CN110955522B (en) * 2019-11-12 2022-10-14 华中科技大学 Resource management method and system for coordination performance isolation and data recovery optimization
CN113407338A (en) * 2021-05-29 2021-09-17 国网辽宁省电力有限公司辽阳供电公司 A/D conversion chip resource allocation method of segmented architecture
CN114116323B (en) * 2022-01-27 2022-04-19 天津市城市规划设计研究总院有限公司 Data backup strategy management method and system based on permission level
CN116126596B (en) * 2023-02-13 2023-08-18 北京易华录信息技术股份有限公司 Information processing system and method based on block chain
CN117435144B (en) * 2023-12-20 2024-03-22 山东云天安全技术有限公司 Intelligent data hierarchical security management method and system for data center

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101741536A (en) * 2008-11-26 2010-06-16 中兴通讯股份有限公司 Data level disaster-tolerant method and system and production center node
CN102541751A (en) * 2010-11-18 2012-07-04 微软公司 Scalable chunk store for data deduplication
CN103377285A (en) * 2012-04-25 2013-10-30 国际商业机器公司 Enhanced reliability in deduplication technology over storage clouds
US9128948B1 (en) * 2010-09-15 2015-09-08 Symantec Corporation Integration of deduplicating backup server with cloud storage
CN105302669A (en) * 2015-10-23 2016-02-03 浙江工商大学 Method and system for data deduplication in cloud backup process
CN106066818A (en) * 2016-05-25 2016-11-02 重庆大学 A kind of data layout's method improving data de-duplication standby system restorability


Also Published As

Publication number Publication date
CN107249035A (en) 2017-10-13

Similar Documents

Publication Publication Date Title
CN107249035B (en) Shared repeated data storage and reading method with dynamically variable levels
EP2327024B1 (en) Techniques for resource location and migration across data centers
CN106599308B (en) distributed metadata management method and system
US10394452B2 (en) Selecting pages implementing leaf nodes and internal nodes of a data set index for reuse
CN106933868A (en) A kind of method and data server for adjusting data fragmentation distribution
CN103631894A (en) Dynamic copy management method based on HDFS
US10747665B2 (en) Cost-based garbage collection scheduling in a distributed storage environment
CN113655969B (en) Data balanced storage method based on streaming distributed storage system
CN109492429B (en) Privacy protection method for data release
CN113486026A (en) Data processing method, device, equipment and medium
US11765099B2 (en) Resource allocation using distributed segment processing credits
US20050097130A1 (en) Tracking space usage in a database
Si Salem et al. Enabling long-term fairness in dynamic resource allocation
CN102609508A (en) High-speed access method of files in network storage
US8290906B1 (en) Intelligent resource synchronization
WO2017049488A1 (en) Cache management method and apparatus
CN116820323A (en) Data storage method, device, electronic equipment and computer readable storage medium
CN113742304A (en) Data storage method of hybrid cloud
CN110083309B (en) Shared data block processing method, system and readable storage medium
CN102096723A (en) Data query method based on copy replication algorithm
Li Dynamic Load Balancing Method for Urban Surveillance Video Big Data Storage Based on HDFS
Jian et al. A HDFS dynamic load balancing strategy using improved niche PSO algorithm in cloud storage
CN110554916A (en) Distributed cluster-based risk index calculation method and device
CN113297003A (en) Method, electronic device and computer program product for managing backup data
CN106033434A (en) Virtual asset data replica processing method based on data size and popularity

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant