CN110569245A - Fingerprint index prefetching method based on reinforcement learning in data de-duplication system - Google Patents


Info

Publication number
CN110569245A
Authority
CN
China
Prior art keywords
data
data segment
fingerprint
segment
reinforcement learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910852882.7A
Other languages
Chinese (zh)
Inventor
徐光平
范浩
毛群芳
薛彦兵
张桦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University of Technology
Original Assignee
Tianjin University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University of Technology
Priority to CN201910852882.7A
Publication of CN110569245A
Legal status: Pending


Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 — Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 — Information retrieval of structured data, e.g. relational data
    • G06F 16/21 — Design, administration or maintenance of databases
    • G06F 16/215 — Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G06F 16/22 — Indexing; Data structures therefor; Storage structures

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A fingerprint index prefetching method based on reinforcement learning in a data de-duplication system extracts features of data-stream segments from the context information of the data stream, establishes a mapping between feature values and data segments through a feedback mechanism, and builds an efficient index structure. Reinforcement learning is used to train on the similarity between data segments: for each new data segment, a multi-armed bandit model weighs the currently best-rewarded data segment against unknown segments and dynamically selects one data segment for prefetching. The segment caching mechanism is optimized with an adaptive caching algorithm for the data fingerprint index, improving the deduplication efficiency of the storage system.

Description

Fingerprint index prefetching method based on reinforcement learning in data de-duplication system
Technical Field
The invention belongs to the technical field of computer storage and relates to a fingerprint index prefetching method based on reinforcement learning. The data volume of the information world is growing explosively, and effective analysis and management of this data have become a key concern for mass storage systems in the era of big data. In particular, data centers serve as the foundation for big data storage and analysis, providing efficient and reliable storage and computing services for various applications through virtualization technologies. The method provided by the invention aims to improve the efficient storage of data resources.
Background
Deduplication, also called data de-duplication, is a data reduction technology that improves storage capacity and transmission efficiency by finding and removing duplicate content in a data stream. The general process is as follows: first, a data file is divided into a group of data blocks and a fingerprint is calculated for each data block; then a fingerprint index lookup is performed. If the fingerprint matches, the data block is a duplicate and only its index number is stored; otherwise, the data block is a new unique block, so the block is stored and the related meta-information is created.
For a deduplication system with a very large amount of data, the fingerprint index of the data blocks is the key data structure for duplicate detection. Early deduplication systems stored all data-block fingerprint indexes in memory, enabling rapid and complete identification of duplicate data. However, with the explosive growth of data volume, the number of data-block fingerprints becomes very large and the fingerprint index grows sharply, so it is difficult to hold all fingerprints in memory, and the fingerprint index must instead be queried through frequent accesses to the much slower disk. For example, assuming an average data-block size of 8KB and the SHA-1 secure cryptographic hash algorithm, 100TB of data produces 250GB of data-block fingerprints, far too large to store entirely in memory. Because random disk access is far slower than memory access, fingerprint-index lookups become very slow; frequent on-disk index accesses greatly reduce system throughput and form a disk-access performance bottleneck. This bottleneck is therefore an urgent problem to be solved in large-scale deduplication systems.
To address this performance bottleneck, existing research has proposed a variety of solutions, primarily locality-based and similarity-based. These methods merge strongly associated small files and split large files to expose more similarity, use the locality of data streams to supplement similarity detection and handle data blocks whose similarity goes undetected, and improve performance by combining the two strategies appropriately. From the viewpoint of fingerprint index prefetching, such methods depend heavily on locality and can be optimized using the locality of stored stream data or of the backup stream. The fingerprint index organization of a deduplication system is therefore key to improving both its performance and its deduplication rate.
Therefore, aimed at the key problems of the disk bottleneck of fingerprint-index access in duplicate data detection and of data fragmentation in data recovery, the invention provides a novel adaptive method that improves the data caching mechanism, improves data storage efficiency and recovery performance, and provides technical support for enhancing the quality of data-storage services in data centers.
Disclosure of Invention
The invention aims to solve the lack of adaptability in prior-art fingerprint index organization and management, and provides a fingerprint index prefetching method based on reinforcement learning in a data de-duplication system. The invention combines reinforcement learning with locality so that the two complement each other, strengthens the association between fingerprints and prefetch units through feedback rewards, and dynamically adjusts cache prefetching to improve the performance of the deduplication system. The method achieves a better deduplication rate with low memory occupation, is applicable to different data streams, and is more flexible because the fingerprint index also adapts to the system.
Technical scheme of the invention
A fingerprint index prefetching method based on reinforcement learning in a data de-duplication system. As shown in fig. 1, the method specifically includes the following main contents:
1, dividing the input data stream into variable-length data segments according to the byte content of the data stream
The data stream is divided into variable-length data segments by a variable-length partitioning method. To determine the boundaries of the data segments, a content-based segmentation method analyzes the data stream with a sliding window. The fingerprint of the data in the sliding window depends only on the window's content and a chosen hash function, so the boundary of the same data block can be recaptured at different positions and duplicate data can be detected again. The Rabin fingerprinting algorithm is typically used to compute the fingerprint of the data within the window, since it is computationally far cheaper than the secure cryptographic hash used to generate the block fingerprints.
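A minimal sketch of the sliding-window chunking described above. The patent names the Rabin fingerprint; here a plain polynomial rolling hash stands in for it, and the window size, mask, and chunk-size limits are illustrative assumptions, not values from the patent.

```python
# Content-defined chunking with a sliding-window rolling hash (a sketch).
WINDOW = 48                    # sliding-window size in bytes (assumed)
MASK = (1 << 13) - 1           # 13-bit mask -> roughly 8 KB average chunks
MIN_CHUNK, MAX_CHUNK = 2 * 1024, 64 * 1024
BASE, MOD = 257, (1 << 61) - 1

def chunk_boundaries(data: bytes):
    """Return (start, end) offsets of variable-length, content-defined chunks."""
    cuts, start, h = [], 0, 0
    top = pow(BASE, WINDOW - 1, MOD)   # weight of the byte leaving the window
    for i, byte in enumerate(data):
        if i >= WINDOW:                # slide: drop the oldest byte's term
            h = (h - data[i - WINDOW] * top) % MOD
        h = (h * BASE + byte) % MOD
        size = i + 1 - start
        # cut when the window hash matches the predefined pattern,
        # subject to minimum and maximum segment sizes
        if size >= MAX_CHUNK or (size >= MIN_CHUNK and (h & MASK) == 0):
            cuts.append((start, i + 1))
            start = i + 1
    if start < len(data):
        cuts.append((start, len(data)))
    return cuts
```

Because a boundary depends only on window content, inserting bytes early in the stream shifts later boundaries but does not change them, which is what lets duplicate blocks be re-detected.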
2, extracting characteristic fingerprint of data segment by random sampling method
Assume that each data segment selects m features: the m smallest fingerprints are chosen as features by a comparison rule, and a hash function associates each feature with one entry in a context table. The more features are sampled, the more champions are selected, and the more duplicate data blocks can ultimately be identified. Because the data volume is huge and memory is limited, all fingerprint indexes cannot be kept in memory; the method therefore samples, storing only a portion of the fingerprint index in place of the complete index, which reduces memory overhead.
A suitable subset of fingerprints is selected from each data segment as its features by sampling. To measure the similarity between data segments, the sparse-indexing approach samples a small fraction of features from each data segment at a fixed ratio and stores them in the index structure. The fingerprint stream is sampled randomly: fingerprints are generated from data-block content, and a random hash selection then chooses feature fingerprints to represent a section of the data stream. In this way only a few feature fingerprints are sampled per segment, resulting in low memory overhead. Suppose a specific number m of features is selected for each data segment. One simple way is to select the m smallest fingerprints as features by a specific comparison rule, each feature being associated with an entry in the context table. The more features that are sampled, the more candidate comparison data segments (called champions) are selected, and the more duplicate data blocks can ultimately be identified.
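The min-fingerprint sampling rule above can be sketched as follows; SHA-1 is the block-fingerprint hash named in the text, while the function names are assumptions for illustration.

```python
import hashlib

def block_fingerprint(block: bytes) -> str:
    # SHA-1 is the secure hash the text names for block fingerprints.
    return hashlib.sha1(block).hexdigest()

def sample_features(fingerprints, m=2):
    """Select the m numerically smallest fingerprints of a data segment
    as its feature fingerprints (the comparison rule described above)."""
    return sorted(fingerprints)[:m]
```

Each returned feature would then be hashed to an entry of the context table, where it keys the candidate champion segments.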
3, making a selection strategy, and selecting a specific data segment as a champion
The main challenge of the reinforcement-learning-based method is how to select a suitable champion data segment from k candidate data segments. Each data segment records a score reflecting its past deduplication effectiveness; however, some data segments, especially the most recently added ones, still have a score of 0, and a score does not change until the segment is chosen as champion. Because the overhead of loading data segments is large, it is unnecessary to deduplicate against all k data segments; and because locality makes the data segments potentially very similar, loading the fingerprints of all of them is pointless and would instead reduce cache utilization.
According to the characteristics of the data stream, the method can combine three selection strategies to pick the best data segment: a latest strategy, a random strategy, and an epsilon-greedy strategy. The latest strategy always selects the most recent data segment and suits data-stream backups with strong temporal order. The random strategy chooses one of the k useful candidates with equal probability. However, neither of these two strategies records knowledge learned from past trials; they simply select without exploiting or exploring accumulated knowledge. The epsilon-greedy strategy improves on the random strategy by trading off exploitation and exploration with a probability. With probability 1-epsilon it exploits, selecting the data segment with the highest current average reward: the segment whose score, i.e. hit rate, has been highest in past trials and which therefore most probably deduplicates the following data stream well. With probability epsilon it explores, selecting a data segment uniformly at random to probe unknown segments that may work even better. By selecting the best known data segment while exploring unknown segments that may be rewarded better in the future, and finding the balance between the two in practice, the ability to discover duplicate data is continually expanded.
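The epsilon-greedy rule can be sketched directly; the function signature and score dictionary are assumptions for illustration.

```python
import random

def select_champion(candidates, scores, epsilon=0.3, rng=random):
    """Epsilon-greedy champion selection: with probability epsilon explore a
    uniformly random candidate segment; otherwise exploit the one with the
    highest recorded score (unscored segments default to 0)."""
    if rng.random() < epsilon:
        return rng.choice(list(candidates))                       # explore
    return max(candidates, key=lambda s: scores.get(s, 0.0))      # exploit
```

With epsilon = 0 this degenerates to a pure greedy choice, and with epsilon = 1 to the random strategy, so the two simpler strategies are special cases.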
4, feeding back the reward value to the data segment according to the hit condition
When a champion data segment is selected, the subsequent data segments are prefetched and then compared, by fingerprint lookup, against the data segment being deduplicated. A lookup hit means that a duplicate data block has been identified. Query hits for the champion data segment are accumulated while it is in the cache; when the segment is removed, the score feedback is written back to the context table and the segment's score is updated.
Assume a data segment s is selected as champion and the hits in n feedbacks are r_1, r_2, …, r_n, and let Q_n(s) be the result after the n feedbacks; the feedback value thus reflects how often the data segment was hit. An intuitive approach is to compute Q_n(s) as the average feedback, that is

Q_n(s) = (r_1 + r_2 + … + r_n) / n.

This raises the problem of computing the feedback value efficiently: without optimization, the memory and computation overhead grow without bound over time. The average is therefore computed incrementally:

Q_n(s) = Q_{n-1}(s) + (r_n − Q_{n-1}(s)) / n,

where Q_0(s) is initialized to 0.
This computation requires only the memory overhead of storing Q_{n-1}(s) and n, and the computational overhead of processing r_n. The process is unsupervised: during training it relies on feedback alone, which each time either strengthens or weakens a future data-segment association. Higher feedback raises the probability that a data segment becomes champion; lower feedback reduces it. Feedback is handled lazily and takes effect with a delay, whereas champion-selection decisions based on current knowledge are made in real time.
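The incremental-mean update above is one line of code; this sketch simply transcribes the formula.

```python
def update_score(q_prev: float, n: int, reward: float) -> float:
    """Incremental-mean update from the text:
    Q_n(s) = Q_{n-1}(s) + (r_n - Q_{n-1}(s)) / n, with Q_0(s) = 0.
    Only q_prev and n need to be kept between feedbacks."""
    return q_prev + (reward - q_prev) / n
```

Applying it over a sequence of rewards reproduces the running average without ever storing the full reward history.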
5 th, update context table
The context table holds all features and their corresponding data segments. When the feature of a data segment is selected, if the feature is not in the context table, the feature and its associated data segment are inserted as a new entry; otherwise the data-segment information is added to the existing feature entry. In each entry of the context table, up to k data segments sharing the same feature are stored in a queue; when the queue is full, a new data segment is always inserted at the tail of the queue and an old data segment is removed.
Adopting a replacement strategy suited to the characteristics of the data source can greatly help deduplication performance. The method considers two replacement strategies: a first-in-first-out strategy and a minimum-value strategy. The first-in-first-out strategy removes the oldest data segment from the head of the queue, regardless of score. The minimum-value strategy removes the data segment with the lowest score; if several data segments share the lowest score, they are replaced in first-in-first-out order.
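A sketch of the context-table entry with both replacement strategies; the class shape, method names, and defaults are assumptions, not the patent's concrete layout.

```python
from collections import deque

class ContextTable:
    """Maps a feature to at most k candidate data segments, each carrying a
    feedback score, with FIFO or minimum-score replacement (a sketch)."""
    def __init__(self, k=4, policy="fifo"):
        self.k, self.policy = k, policy
        self.table = {}      # feature -> deque of segment ids, oldest first
        self.scores = {}     # segment id -> feedback score

    def insert(self, feature, segment, score=0.0):
        queue = self.table.setdefault(feature, deque())
        if segment in queue:
            return
        if len(queue) == self.k:
            if self.policy == "fifo":
                queue.popleft()          # evict the oldest entry
            else:
                # "min": evict the lowest score; min() returns the first
                # (oldest) minimum, so ties break in FIFO order
                queue.remove(min(queue, key=lambda s: self.scores.get(s, 0.0)))
        queue.append(segment)
        self.scores.setdefault(segment, score)
```

New segments always enter at the tail, so under FIFO the queue order itself encodes segment age.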
The invention has the advantages and beneficial effects that:
1) By introducing the reinforcement-learning-based fingerprint index prefetching method, the performance of the deduplication system is improved in both deduplication rate and memory occupation, and system throughput is increased. 2) The invention effectively addresses the on-disk fingerprint-indexing problem. 3) The invention mines the contextual relationship between the fingerprint sequence and the data stream, adapts better to changes in the system, and is self-adaptive to the stored data stream.
Drawings
Fig. 1 is a basic framework based on a reinforcement learning method.
FIG. 2 is a flowchart of an implementation of a deduplication algorithm.
Fig. 3 shows the deduplication rates of the evaluated system under the (a) Kernel, (b) Vmdk, (c) FSLHomes, and (d) Macos workloads, for different data-segment sizes and different sampling rates.
Fig. 4 measures the deduplication rates for the two workloads (a) Kernel and (b) User014, for different settings of the epsilon parameter under the epsilon-greedy strategy.
Fig. 5 measures the deduplication rates for the (a) FSLHomes, (b) Kernel, (c) Macos, and (d) Vmdk workloads under the minimum-value and first-in-first-out replacement strategies.
Detailed Description
The present invention will be further described with reference to the process flow of FIG. 2.
The basic flow of deduplication is shown in fig. 2. First, the data stream input from the storage system is divided into a sequence of relatively small data blocks, and the hash value of each data block, which serves as its fingerprint for identification, is computed with a secure cryptographic hash (MD5 or SHA-1); data blocks with the same fingerprint are considered duplicates. The fingerprint index maps the fingerprint of a stored data block to its physical address. Each data block is compared by fingerprint: if the fingerprint already exists in the index, the data block is a duplicate and need not be stored; otherwise the data block is written into a fixed-size container, the storage unit for data blocks. The fingerprint sequence of the data stream must be saved for data recovery.
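The flow above can be sketched end to end; the in-memory dict and list stand in for the on-disk fingerprint index and fixed-size containers, which is an assumption for illustration.

```python
import hashlib

def deduplicate(chunks):
    """Minimal sketch of the Fig. 2 flow: fingerprint each data block with
    SHA-1, store only unique blocks, and keep the fingerprint sequence
    (the recipe) needed for data recovery."""
    index = {}       # fingerprint -> position of the stored block
    store = []       # stand-in for the fixed-size container store
    recipe = []      # fingerprint sequence of the stream, kept for recovery
    for chunk in chunks:
        fp = hashlib.sha1(chunk).hexdigest()
        if fp not in index:          # new unique block: store it
            index[fp] = len(store)
            store.append(chunk)
        recipe.append(fp)            # duplicates keep only the reference
    return index, store, recipe
```

Recovery walks the recipe and resolves each fingerprint through the index, which is why the recipe must be persisted alongside the containers.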
To illustrate the effectiveness of the invention, four data sets representing different workloads were used for performance evaluation. Kernel and Vmdk are real data sets, while FSLHomes and Macos are trace (log) data sets. Files are divided into data blocks of 4KB average size, and in the storage phase new data blocks are stored into containers of fixed 4MB size. The data sets are described as follows:
a) The data set Kernel contains 155 kernel source-code versions, linux-2.3.0 to linux-2.5.75, and represents source-code load in data backup; it consists mostly of small files and has a high repetition rate.
b) The data set Vmdk consists of virtual machine images containing 110 versions of different operating systems; because there is little data redundancy between versions, it represents a load with less duplicate data.
c) The data set FSLHomes is a trace data set collected by the File systems and Storage Laboratory (FSL) of Stony Brook University. Data logs of 9 users from September 16 to 30, 2011 were used. Since each user's backup data is treated as one data stream, each day's data corresponds to multiple data-stream loads.
d) The data set Macos is likewise a data log of a Mac server in the Stony Brook University computer laboratory. Because the data volume is too large, only 18 days of log data were selected.
Step 01, obtaining the incidence relation between the characteristic value and the data segment
First, a data file is divided into a group of data blocks, and the data stream of the data-block sequence is divided into several data segments by a content-based partitioning method. For each data segment s_t to be deduplicated, features f are obtained by the sampling method; s_t and f then have a corresponding association, namely f → s_t. Both, together with their relationship, are stored in a key-value index structure. Each feature f corresponds to at most k data segments in the context table, i.e. at most k data segments share the feature f.
As shown in fig. 3, the deduplication rates obtained differ across data sets, data-segment lengths, and the sampling rates used to obtain the feature values. Here 1024, 2048, and 4096 are the numbers of data blocks per segment; with an average data block of 4KB these give segments of 4MB, 8MB, and 16MB, each measured with 1, 2, 4, and 8 sampled features. The figure shows that, for the same segment size, more sampled features yield a better deduplication effect; for the same number of sampled features, smaller segments yield a higher deduplication rate. In practice the number of sampled features determines the deduplication effect, since duplicate detection relies on the sampled feature values to drive data-segment prefetching and thus achieve a good approximate deduplication result.
step 02, making a selection strategy, and selecting a specific data segment as a champion
The method combines the three strategies (latest, random, and epsilon-greedy). From all candidates mapped by a given feature, a first data segment is selected by the epsilon-greedy strategy, which tends to pick the highest-scoring segment with probability 1-epsilon or a random segment with probability epsilon; a further data segment is selected by the latest strategy. Champions are selected to share as few common features as possible, so that the selected champion data segments complement each other and the segments chosen by fingerprint features are not so similar that the duplicate-checking effect is lost.
Using the Kernel data (fig. 4-a) and half a month of consecutive User014 data from FSLHomes (fig. 4-b) as test sets, epsilon was set to 0.1, 0.3, 0.5, 0.7, and 0.9 respectively, with the average deduplication rate as the metric. As fig. 4 shows, the deduplication rate decreases as epsilon increases; epsilon = 0.1 means that a random data segment is selected with probability 0.1 and the highest-scoring segment with probability 0.9. This shows that selecting the highest-scoring data segment gives the better result.
Step 03 of feeding back reward value to data segment according to hit condition
Using the champion data segment selected in step 02, fixed-length data segments are prefetched into the cache in advance, so that subsequent data segments to be deduplicated can be found directly in the cache; this raises the cache hit rate and reduces accesses to the on-disk fingerprint index. The cache is replaced with the LRU policy.
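The LRU-replaced fingerprint cache named above can be sketched with an ordered map; the capacity, method names, and segment-id payload are illustrative assumptions.

```python
from collections import OrderedDict

class FingerprintCache:
    """LRU-replaced cache into which a champion segment's fingerprints are
    prefetched (a sketch of the cache policy described above)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()     # fingerprint -> segment id

    def prefetch(self, fingerprints, segment_id):
        for fp in fingerprints:
            self.entries[fp] = segment_id
            self.entries.move_to_end(fp)             # mark most recently used
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)     # evict least recently used

    def lookup(self, fp):
        """A hit means a duplicate block was identified without disk I/O."""
        if fp in self.entries:
            self.entries.move_to_end(fp)             # refresh recency
            return True
        return False
```

Counting hits per cached segment between prefetch and eviction yields exactly the feedback value written back to the context table.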
The champion data segment and the data segments following it are prefetched, and fingerprint lookup is performed against the data segments to be deduplicated. A lookup hit means a duplicate data block has been identified; query hits for a data segment are accumulated before it is moved out of the cache, and when the segment is removed the score feedback is written back to the context table. The results shown in fig. 4 verify the validity of the score, i.e. the effect of feeding back the reward value.
Step 04 update context table
The context table is a key-value index structure: the key is a fingerprint feature and the value is the k data segments corresponding to that feature, held in a queue. The attributes of each data segment include a score and the number of data segments allowed to be prefetched; the score aids champion selection and is also the basis for feedback. As shown in fig. 5, a metadata store (fingerprint store) holds the fingerprint sequence of the backup stream for data recovery, and a data-block store (container store) holds all data blocks.
In each entry of the context table, the same feature corresponds to at most k data segments stored in a context queue. When the queue is full, a new data segment is always inserted at the tail, and an old data segment is removed according to either the first-in-first-out or the minimum-value replacement strategy. For example, when the queue of feature X is full: under the first-in-first-out strategy, the S_x1 node is deleted and the new data segment is inserted behind the S_x3 node; under the minimum-value strategy, the node with the lowest score is deleted and the new data segment is inserted behind the S_x3 node.
As shown in fig. 5, the minimum-value and first-in-first-out replacement strategies were compared experimentally on the different data sets with epsilon set to 0.3. On the three data sets FSLHomes (fig. 5-a), Macos (fig. 5-c), and Vmdk (fig. 5-d), the minimum-value replacement strategy achieves a higher deduplication rate, which also demonstrates the validity of the score. Conversely, on Kernel (fig. 5-b) the first-in-first-out strategy works better, because the Kernel data set consists of consecutive backup versions of small files with very strong temporal order.
Selecting a suitable replacement strategy to update the context table according to the characteristics of the data stream is a great help to deduplication performance, and embodies an advantage of the proposed reinforcement-learning method.

Claims (6)

1. A fingerprint index prefetching method based on reinforcement learning in a data de-duplication system, providing a novel adaptive method aimed at the key problem of the disk bottleneck of fingerprint-index access in duplicate data detection, improving the data caching mechanism, and improving data storage efficiency and recovery performance; the method comprises the following steps:
1, dividing the data into data segments according to the byte content of the data stream: the method adopts a variable-length blocking method, analyzing the input data stream with a sliding window and rapidly computing data fingerprints with a hash algorithm; if a fingerprint matches a predefined pattern, the position of the current sliding window is taken as the boundary of a data block;
And 2, extracting the characteristic fingerprints of the data segment: assuming that each data segment selects m features, the m smallest fingerprints are selected as features by a comparison rule, and a hash function associates each feature with one entry in a context table; the more features are sampled, the more champions are selected, and the more duplicate data blocks can finally be identified;
And 3, when each new data segment is compared with the existing data segment, a strategy for selecting the existing data segment set is formulated: carrying out balance comparison on the current best feedback data segment and the unknown segment by using a multi-arm slot machine model, dynamically selecting a data segment for prefetching, and selecting a corresponding data segment as a comparison object, namely a champion data segment;
And 4, feeding back a reward value to the data segment according to the hit condition: before the champion data segment is moved out of the cache, query hits of the data segment are accumulated, when the data segment is removed, the integral feedback is updated to the context table, and the corresponding score of the data segment is updated;
And 5, updating a context table: when the characteristic of a data segment is selected, if the characteristic is not in the context table, the characteristic and the associated data segment are inserted into the context table as a new item, otherwise, the data segment information is added into the corresponding characteristic item.
2. The reinforcement-learning-based fingerprint index prefetching method of claim 1, wherein step 1 divides the data stream of the data-block sequence into several data segments by a content-based segmentation method, and then selects a subset of fingerprints from each data segment as features by sampling.
3. The fingerprint index prefetching method based on reinforcement learning as claimed in claim 1, wherein step 2 samples a small fraction of features from each data segment at a set ratio and stores them in the index structure, selecting the m smallest fingerprints as features by the comparison rule; here a random sampling method is used: out of every 2^n fingerprints, the fingerprint whose value mod 2^n equals 0 is taken as a feature; in practical implementation, m is usually not higher than 3 to achieve a good deduplication effect.
4. The reinforcement learning-based fingerprint index prefetching method of claim 1, wherein the method for selecting data segments in step 3 is: selecting a champion data segment from the context table by a greedy strategy, and if the selected champion data segment is not in the set S of similar data segments, putting the set S into the context table; the set S refers to a data segment set with the same characteristics as the currently detected data segment; if the selected champion data segment is in the data segment similarity set S, selecting a new data segment which is not in the set S by using a latest strategy, and putting the new data segment into the set S; and finally, putting other candidates of the feature mapping into the set S to finally obtain all champion data segments.
5. The fingerprint index prefetching method based on reinforcement learning as claimed in claim 1, wherein the method of feeding back the reward value to the data segment in step 4 is as follows: let s denote a data segment, let the results in the n feedbacks after s is selected as champion be r_1, r_2, …, r_n, and let Q_n(s) be the result after the n feedbacks; the feedback reward value is Q_n(s) = Q_{n-1}(s) + (r_n − Q_{n-1}(s)) / n, where Q_0(s) is initialized to 0, and this calculation only requires the memory overhead of storing Q_{n-1}(s) and n and the computational overhead of the feedback reward value r_n.
6. The reinforcement-learning-based fingerprint index prefetching method of claim 1, wherein the context table is updated in step 5 as follows: each entry of the context table stores, for one feature, a queue of at most k corresponding data segments; when the queue is full, a new data segment is always inserted at the tail of the queue and an old data segment is evicted from the queue, using a first-in-first-out or minimum-value replacement policy for the evicted data segment.
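The bounded per-feature queue of claim 6 maps naturally onto a `deque` with `maxlen`, which implements exactly the FIFO variant (the minimum-value replacement variant would need an explicit scan). A minimal sketch with illustrative names:

```python
from collections import deque

class ContextTable:
    """Maps each feature to a bounded queue of at most k data segments."""

    def __init__(self, k=4):
        self.k = k
        self.table = {}

    def add(self, feature, segment):
        # deque(maxlen=k) evicts the oldest entry when full (FIFO);
        # new segments are always appended at the tail, as claimed.
        queue = self.table.setdefault(feature, deque(maxlen=self.k))
        queue.append(segment)

    def lookup(self, feature):
        return list(self.table.get(feature, ()))
```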
CN201910852882.7A 2019-09-10 2019-09-10 Fingerprint index prefetching method based on reinforcement learning in data de-duplication system Pending CN110569245A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910852882.7A CN110569245A (en) 2019-09-10 2019-09-10 Fingerprint index prefetching method based on reinforcement learning in data de-duplication system


Publications (1)

Publication Number Publication Date
CN110569245A true CN110569245A (en) 2019-12-13

Family

ID=68778629

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910852882.7A Pending CN110569245A (en) 2019-09-10 2019-09-10 Fingerprint index prefetching method based on reinforcement learning in data de-duplication system

Country Status (1)

Country Link
CN (1) CN110569245A (en)


Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102222085A (en) * 2011-05-17 2011-10-19 华中科技大学 Data de-duplication method based on combination of similarity and locality
CN102663086A (en) * 2012-04-09 2012-09-12 华中科技大学 Method for retrieving data block indexes
CN103514250A (en) * 2013-06-20 2014-01-15 易乐天 Method and system for deleting global repeating data and storage device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Mao Qunfang: "Design and Analysis of an Intelligent Prefetching Algorithm for Data Deduplication", China Master's Theses Full-text Database (Information Science) *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11868214B1 (en) 2020-02-02 2024-01-09 Veritas Technologies Llc Methods and systems for affinity aware container prefetching
CN112597345A (en) * 2020-10-30 2021-04-02 深圳市检验检疫科学研究院 Laboratory data automatic acquisition and matching method
CN112597345B (en) * 2020-10-30 2023-05-12 深圳市检验检疫科学研究院 Automatic acquisition and matching method for laboratory data
CN112637153A (en) * 2020-12-14 2021-04-09 南京壹进制信息科技有限公司 Method and system for removing duplicate in storage encryption
CN112637153B (en) * 2020-12-14 2024-02-20 航天壹进制(江苏)信息科技有限公司 Method and system for storing encryption and deduplication
CN112799590A (en) * 2021-01-21 2021-05-14 中国人民解放军国防科技大学 Differential caching method for online main storage deduplication
CN112799590B (en) * 2021-01-21 2022-07-19 中国人民解放军国防科技大学 Differentiated caching method for online main storage deduplication
WO2022228693A1 (en) * 2021-04-30 2022-11-03 Huawei Technologies Co., Ltd. System and method for indexing a data item in a data storage system
CN116451660A (en) * 2023-04-11 2023-07-18 浙江法之道信息技术有限公司 Legal text professional examination and intelligent annotation system
CN116451660B (en) * 2023-04-11 2023-09-19 浙江法之道信息技术有限公司 Legal text professional examination and intelligent annotation system
CN116775666A (en) * 2023-08-24 2023-09-19 北京遥感设备研究所 Method for automatically adjusting data index on line
CN116775666B (en) * 2023-08-24 2023-11-14 北京遥感设备研究所 Method for automatically adjusting data index on line

Similar Documents

Publication Publication Date Title
CN110569245A (en) Fingerprint index prefetching method based on reinforcement learning in data de-duplication system
US9858303B2 (en) In-memory latch-free index structure
US9798754B1 (en) Method to efficiently track I/O access history using efficient memory data structures
Srinivasan et al. iDedup: latency-aware, inline data deduplication for primary storage.
US8914338B1 (en) Out-of-core similarity matching
US8280860B2 (en) Method for increasing deduplication speed on data streams fragmented by shuffling
US20170293450A1 (en) Integrated Flash Management and Deduplication with Marker Based Reference Set Handling
US20040205044A1 (en) Method for storing inverted index, method for on-line updating the same and inverted index mechanism
KR20170054299A (en) Reference block aggregating into a reference set for deduplication in memory management
US8442954B2 (en) Creating and managing links to deduplication information
CN109445702B (en) block-level data deduplication storage system
AU2010200866B1 (en) Data reduction indexing
US8225060B2 (en) Data de-duplication by predicting the locations of sub-blocks within the repository
US10509769B1 (en) Method to efficiently track I/O access history
US11262929B2 (en) Thining databases for garbage collection
Xu et al. Lipa: A learning-based indexing and prefetching approach for data deduplication
US20170123689A1 (en) Pipelined Reference Set Construction and Use in Memory Management
CN113535670B (en) Virtual resource mirror image storage system and implementation method thereof
Zhang et al. Improving the performance of deduplication-based backup systems via container utilization based hot fingerprint entry distilling
CN111274212A (en) Cold and hot index identification and classification management method in data deduplication system
WO2022205544A1 (en) Cuckoo hashing-based file system directory management method and system
Yu et al. Pdfs: Partially dedupped file system for primary workloads
US8156126B2 (en) Method for the allocation of data on physical media by a file system that eliminates duplicate data
US20200019539A1 (en) Efficient and light-weight indexing for massive blob/objects
Zhang et al. Improved deduplication through parallel binning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20191213