CN110569245A - Fingerprint index prefetching method based on reinforcement learning in data de-duplication system - Google Patents


Info

Publication number
CN110569245A
Authority
CN
China
Prior art keywords
data
data segment
fingerprint
segment
reinforcement learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910852882.7A
Other languages
Chinese (zh)
Inventor
徐光平
范浩
毛群芳
薛彦兵
张桦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University of Technology
Original Assignee
Tianjin University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University of Technology
Priority to CN201910852882.7A
Publication of CN110569245A
Legal status: Pending


Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 — Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 — Information retrieval of structured data, e.g. relational data
    • G06F 16/21 — Design, administration or maintenance of databases
    • G06F 16/215 — Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G06F 16/22 — Indexing; Data structures therefor; Storage structures

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A fingerprint index prefetching method based on reinforcement learning in a data de-duplication system extracts features of data-stream segments from the context information of the data stream, establishes a mapping between feature values and data segments through a feedback mechanism, and builds an efficient index structure. Reinforcement learning is used to train on the similarity between data segments: for each new data segment, a multi-armed bandit model weighs the currently best-rewarded data segment against unknown segments and dynamically selects one data segment for prefetching. The segment caching mechanism is optimized with an adaptive caching algorithm for the data fingerprint index, improving the deduplication efficiency of the storage system.

Description

Fingerprint index prefetching method based on reinforcement learning in data de-duplication system
Technical Field
The invention belongs to the technical field of computer storage and relates to a fingerprint index prefetching method based on reinforcement learning. The data volume of the information world is growing explosively, and effective analysis and management of this data have become a key concern for mass storage systems in the era of big data. In particular, data centers serve as the foundation for big data storage and analysis, providing efficient and reliable storage and computing services for various applications through virtualization technologies. The method provided by the invention aims to improve the efficient storage of data resources.
Background
Deduplication, also called data de-duplication, is a data reduction technology that improves storage capacity and transmission efficiency by finding and removing duplicate content in a data stream. The general process is as follows: first, a data file is divided into a group of data blocks and a fingerprint is calculated for each data block; then a fingerprint index lookup is performed. If the fingerprint matches, the data block is a duplicate and only its index number is stored; otherwise, the data block is a new unique block, so the block is stored and the related meta-information is created.
For a deduplication system with a very large amount of data, the fingerprint index of the data blocks is the key data structure for duplicate detection. Early deduplication systems stored all data-block fingerprint indexes in memory, enabling rapid and complete identification of duplicate data. However, with the explosive growth of data volume, the number of data-block fingerprints becomes very large and the fingerprint index grows sharply, so it is difficult to hold all fingerprints in memory, and the fingerprint index must instead be queried through frequent accesses to the much slower disk. For example, assuming an average data-block size of 8KB and the SHA-1 secure cryptographic hash algorithm, 100TB of data produces 250GB of data-block fingerprints, far too large to store entirely in memory. Because random disk access is far slower than memory access, fingerprint-index lookups become very slow; frequent on-disk index accesses greatly reduce system throughput and form a disk-access performance bottleneck. This bottleneck is therefore an urgent problem to be solved in large-scale deduplication systems.
To address this performance bottleneck, existing research has proposed a variety of solutions, primarily locality-based and similarity-based. These methods merge strongly associated small files and split large files to expose more similarity, use the locality of data streams to supplement similarity detection and handle data blocks whose similarity goes undetected, and improve performance by combining the two strategies appropriately. From the viewpoint of fingerprint index prefetching, such methods depend heavily on locality and can be optimized using the locality of stored stream data or of the backup stream. The fingerprint index organization of a deduplication system is therefore key to improving both its performance and its deduplication rate.
Therefore, aimed at the key problems of the disk bottleneck of fingerprint-index access in duplicate data detection and of data fragmentation in data recovery, the invention provides a novel adaptive method that improves the data caching mechanism, improves data storage efficiency and recovery performance, and provides technical support for enhancing the quality of data-storage services in data centers.
Disclosure of Invention
The invention aims to solve the lack of adaptability in prior-art fingerprint index organization and management, and provides a fingerprint index prefetching method based on reinforcement learning in a data de-duplication system. The invention combines reinforcement learning with locality so that the two complement each other, strengthens the association between fingerprints and prefetch units through feedback rewards, and dynamically adjusts cache prefetching to improve the performance of the deduplication system. The method achieves a better deduplication rate with low memory occupation, is applicable to different data streams, and is more flexible because the fingerprint index also adapts to the system.
Technical scheme of the invention
A fingerprint index prefetching method based on reinforcement learning in a data de-duplication system. As shown in fig. 1, the method specifically includes the following main contents:
1, dividing the input data stream into variable-length data segments according to the byte content of the data stream
The data stream is divided into variable-length data segments by a variable-length partitioning method. To determine the boundaries of the data segments, a content-based segmentation method analyzes the data stream with a sliding window. The fingerprint of the data in the sliding window depends only on the window's content and a chosen hash function, so the boundary of the same data block can be recaptured at different positions and duplicate data can be detected again. The Rabin fingerprinting algorithm is typically used to compute the fingerprint of the data within the window, since it is computationally far cheaper than the secure cryptographic hash used to generate the block fingerprints.
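A minimal sketch of the sliding-window chunking described above. The patent names the Rabin fingerprint; here a plain polynomial rolling hash stands in for it, and the window size, mask, and chunk-size limits are illustrative assumptions, not values from the patent.

```python
# Content-defined chunking with a sliding-window rolling hash (a sketch).
WINDOW = 48                    # sliding-window size in bytes (assumed)
MASK = (1 << 13) - 1           # 13-bit mask -> roughly 8 KB average chunks
MIN_CHUNK, MAX_CHUNK = 2 * 1024, 64 * 1024
BASE, MOD = 257, (1 << 61) - 1

def chunk_boundaries(data: bytes):
    """Return (start, end) offsets of variable-length, content-defined chunks."""
    cuts, start, h = [], 0, 0
    top = pow(BASE, WINDOW - 1, MOD)   # weight of the byte leaving the window
    for i, byte in enumerate(data):
        if i >= WINDOW:                # slide: drop the oldest byte's term
            h = (h - data[i - WINDOW] * top) % MOD
        h = (h * BASE + byte) % MOD
        size = i + 1 - start
        # cut when the window hash matches the predefined pattern,
        # subject to minimum and maximum segment sizes
        if size >= MAX_CHUNK or (size >= MIN_CHUNK and (h & MASK) == 0):
            cuts.append((start, i + 1))
            start = i + 1
    if start < len(data):
        cuts.append((start, len(data)))
    return cuts
```

Because a boundary depends only on window content, inserting bytes early in the stream shifts later boundaries but does not change them, which is what lets duplicate blocks be re-detected.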
2, extracting characteristic fingerprint of data segment by random sampling method
Assume that each data segment selects m features: the m smallest fingerprints are chosen as features by a comparison rule, and a hash function associates each feature with one entry in a context table. The more features are sampled, the more champions are selected, and the more duplicate data blocks can ultimately be identified. Because the data volume is huge and memory is limited, all fingerprint indexes cannot be kept in memory; the method therefore samples, storing only a portion of the fingerprint index in place of the complete index, which reduces memory overhead.
A suitable subset of fingerprints is selected from each data segment as its features by sampling. To measure the similarity between data segments, the sparse-indexing approach samples a small fraction of features from each data segment at a fixed ratio and stores them in the index structure. The fingerprint stream is sampled randomly: fingerprints are generated from data-block content, and a random hash selection then chooses feature fingerprints to represent a section of the data stream. In this way only a few feature fingerprints are sampled per segment, resulting in low memory overhead. Suppose a specific number m of features is selected for each data segment. One simple way is to select the m smallest fingerprints as features by a specific comparison rule, each feature being associated with an entry in the context table. The more features that are sampled, the more candidate comparison data segments (called champions) are selected, and the more duplicate data blocks can ultimately be identified.
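The min-fingerprint sampling rule above can be sketched as follows; SHA-1 is the block-fingerprint hash named in the text, while the function names are assumptions for illustration.

```python
import hashlib

def block_fingerprint(block: bytes) -> str:
    # SHA-1 is the secure hash the text names for block fingerprints.
    return hashlib.sha1(block).hexdigest()

def sample_features(fingerprints, m=2):
    """Select the m numerically smallest fingerprints of a data segment
    as its feature fingerprints (the comparison rule described above)."""
    return sorted(fingerprints)[:m]
```

Each returned feature would then be hashed to an entry of the context table, where it keys the candidate champion segments.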
3, making a selection strategy, and selecting a specific data segment as a champion
The main challenge of the reinforcement-learning-based method is how to select a suitable champion data segment from k candidate data segments. Each data segment records a score reflecting its past deduplication effectiveness; however, some data segments, especially the most recently added ones, still have a score of 0, and a score does not change until the segment is chosen as champion. Because the overhead of loading data segments is large, it is unnecessary to deduplicate against all k data segments; and because locality makes the data segments potentially very similar, loading the fingerprints of all of them is pointless and would instead reduce cache utilization.
According to the characteristics of the data stream, the method can combine three selection strategies to pick the best data segment: a latest strategy, a random strategy, and an epsilon-greedy strategy. The latest strategy always selects the most recent data segment and suits data-stream backups with strong temporal order. The random strategy chooses one of the k useful candidates with equal probability. However, neither of these two strategies records knowledge learned from past trials; they simply select without exploiting or exploring accumulated knowledge. The epsilon-greedy strategy improves on the random strategy by trading off exploitation and exploration with a probability. With probability 1-epsilon it exploits, selecting the data segment with the highest current average reward: the segment whose score, i.e. hit rate, has been highest in past trials and which therefore most probably deduplicates the following data stream well. With probability epsilon it explores, selecting a data segment uniformly at random to probe unknown segments that may work even better. By selecting the best known data segment while exploring unknown segments that may be rewarded better in the future, and finding the balance between the two in practice, the ability to discover duplicate data is continually expanded.
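The epsilon-greedy rule can be sketched directly; the function signature and score dictionary are assumptions for illustration.

```python
import random

def select_champion(candidates, scores, epsilon=0.3, rng=random):
    """Epsilon-greedy champion selection: with probability epsilon explore a
    uniformly random candidate segment; otherwise exploit the one with the
    highest recorded score (unscored segments default to 0)."""
    if rng.random() < epsilon:
        return rng.choice(list(candidates))                       # explore
    return max(candidates, key=lambda s: scores.get(s, 0.0))      # exploit
```

With epsilon = 0 this degenerates to a pure greedy choice, and with epsilon = 1 to the random strategy, so the two simpler strategies are special cases.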
4, feeding back the reward value to the data segment according to the hit condition
When a champion data segment is selected, the subsequent data segments are prefetched and then compared, by fingerprint lookup, against the data segment being deduplicated. A lookup hit means that a duplicate data block has been identified. Query hits for the champion data segment are accumulated while it is in the cache; when the segment is removed, the score feedback is written back to the context table and the segment's score is updated.
Assume a data segment s is selected as champion and the hits in n feedbacks are r_1, r_2, …, r_n, and let Q_n(s) be the result after the n feedbacks; the feedback value thus reflects how often the data segment was hit. An intuitive approach is to compute Q_n(s) as the average feedback, that is

Q_n(s) = (r_1 + r_2 + … + r_n) / n.

This raises the problem of computing the feedback value efficiently: without optimization, the memory and computation overhead grow without bound over time. The average is therefore computed incrementally:

Q_n(s) = Q_{n-1}(s) + (r_n − Q_{n-1}(s)) / n,

where Q_0(s) is initialized to 0.
This computation requires only the memory overhead of storing Q_{n-1}(s) and n, and the computational overhead of processing r_n. The process is unsupervised: during training it relies on feedback alone, which each time either strengthens or weakens a future data-segment association. Higher feedback raises the probability that a data segment becomes champion; lower feedback reduces it. Feedback is handled lazily and takes effect with a delay, whereas champion-selection decisions based on current knowledge are made in real time.
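The incremental-mean update above is one line of code; this sketch simply transcribes the formula.

```python
def update_score(q_prev: float, n: int, reward: float) -> float:
    """Incremental-mean update from the text:
    Q_n(s) = Q_{n-1}(s) + (r_n - Q_{n-1}(s)) / n, with Q_0(s) = 0.
    Only q_prev and n need to be kept between feedbacks."""
    return q_prev + (reward - q_prev) / n
```

Applying it over a sequence of rewards reproduces the running average without ever storing the full reward history.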
5 th, update context table
The context table holds all features and their corresponding data segments. When the feature of a data segment is selected, if the feature is not in the context table, the feature and its associated data segment are inserted as a new entry; otherwise the data-segment information is added to the existing feature entry. In each entry of the context table, up to k data segments sharing the same feature are stored in a queue; when the queue is full, a new data segment is always inserted at the tail of the queue and an old data segment is removed.
Adopting a replacement strategy suited to the characteristics of the data source can greatly help deduplication performance. The method considers two replacement strategies: a first-in-first-out strategy and a minimum-value strategy. The first-in-first-out strategy removes the oldest data segment from the head of the queue, regardless of score. The minimum-value strategy removes the data segment with the lowest score; if several data segments share the lowest score, they are replaced in first-in-first-out order.
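A sketch of the context-table entry with both replacement strategies; the class shape, method names, and defaults are assumptions, not the patent's concrete layout.

```python
from collections import deque

class ContextTable:
    """Maps a feature to at most k candidate data segments, each carrying a
    feedback score, with FIFO or minimum-score replacement (a sketch)."""
    def __init__(self, k=4, policy="fifo"):
        self.k, self.policy = k, policy
        self.table = {}      # feature -> deque of segment ids, oldest first
        self.scores = {}     # segment id -> feedback score

    def insert(self, feature, segment, score=0.0):
        queue = self.table.setdefault(feature, deque())
        if segment in queue:
            return
        if len(queue) == self.k:
            if self.policy == "fifo":
                queue.popleft()          # evict the oldest entry
            else:
                # "min": evict the lowest score; min() returns the first
                # (oldest) minimum, so ties break in FIFO order
                queue.remove(min(queue, key=lambda s: self.scores.get(s, 0.0)))
        queue.append(segment)
        self.scores.setdefault(segment, score)
```

New segments always enter at the tail, so under FIFO the queue order itself encodes segment age.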
The invention has the advantages and beneficial effects that:
1) By introducing the reinforcement-learning-based fingerprint index prefetching method, the performance of the deduplication system is improved in both deduplication rate and memory occupation, and system throughput is increased. 2) The invention effectively addresses the on-disk fingerprint-indexing problem. 3) The invention mines the contextual relationship between the fingerprint sequence and the data stream, adapts better to changes in the system, and is self-adaptive to the stored data stream.
Drawings
Fig. 1 is a basic framework based on a reinforcement learning method.
FIG. 2 is a flowchart of an implementation of a deduplication algorithm.
Fig. 3 shows the deduplication rates of the evaluated system under the (a) Kernel, (b) Vmdk, (c) FSLHomes, and (d) Macos workloads, for different data-segment sizes and different sampling rates.
Fig. 4 measures the deduplication rates for the two workloads (a) Kernel and (b) User014, for different settings of the epsilon parameter under the epsilon-greedy strategy.
Fig. 5 measures the deduplication rates for the (a) FSLHomes, (b) Kernel, (c) Macos, and (d) Vmdk workloads under the minimum-value and first-in-first-out replacement strategies.
Detailed Description
The present invention will be further described with reference to the process flow of FIG. 2.
The basic flow of deduplication is shown in fig. 2. First, the data stream input from the storage system is divided into a sequence of relatively small data blocks, and the hash value of each data block, which serves as its fingerprint for identification, is computed with a secure cryptographic hash (MD5 or SHA-1); data blocks with the same fingerprint are considered duplicates. The fingerprint index maps the fingerprint of a stored data block to its physical address. Each data block is compared by fingerprint: if the fingerprint already exists in the index, the data block is a duplicate and need not be stored; otherwise the data block is written into a fixed-size container, the storage unit for data blocks. The fingerprint sequence of the data stream must be saved for data recovery.
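The flow above can be sketched end to end; the in-memory dict and list stand in for the on-disk fingerprint index and fixed-size containers, which is an assumption for illustration.

```python
import hashlib

def deduplicate(chunks):
    """Minimal sketch of the Fig. 2 flow: fingerprint each data block with
    SHA-1, store only unique blocks, and keep the fingerprint sequence
    (the recipe) needed for data recovery."""
    index = {}       # fingerprint -> position of the stored block
    store = []       # stand-in for the fixed-size container store
    recipe = []      # fingerprint sequence of the stream, kept for recovery
    for chunk in chunks:
        fp = hashlib.sha1(chunk).hexdigest()
        if fp not in index:          # new unique block: store it
            index[fp] = len(store)
            store.append(chunk)
        recipe.append(fp)            # duplicates keep only the reference
    return index, store, recipe
```

Recovery walks the recipe and resolves each fingerprint through the index, which is why the recipe must be persisted alongside the containers.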
To illustrate the effectiveness of the invention, four data sets representing different workloads were used for performance evaluation. Kernel and Vmdk are real data sets, while FSLHomes and Macos are trace (log) data sets. Files are divided into data blocks of 4KB average size, and in the storage phase new data blocks are stored into containers of fixed 4MB size. The data sets are described as follows:
a) The data set Kernel contains 155 kernel source-code versions, linux-2.3.0 to linux-2.5.75, and represents source-code load in data backup; it consists mostly of small files and has a high repetition rate.
b) The data set Vmdk consists of virtual machine images containing 110 versions of different operating systems; because there is little data redundancy between versions, it represents a load with less duplicate data.
c) The data set FSLHomes is a trace data set collected by the File systems and Storage Laboratory (FSL) of Stony Brook University. Data logs of 9 users from September 16 to 30, 2011 were used. Since each user's backup data is treated as one data stream, each day's data corresponds to multiple data-stream loads.
d) The data set Macos is likewise a data log of a Mac server in the Stony Brook University computer laboratory. Because the data volume is too large, only 18 days of log data were selected.
Step 01, obtaining the incidence relation between the characteristic value and the data segment
First, a data file is divided into a group of data blocks, and the data stream of the data-block sequence is divided into several data segments by a content-based partitioning method. For each data segment s_t to be deduplicated, features f are obtained by the sampling method; s_t and f then have a corresponding association, namely f → s_t. Both, together with their relationship, are stored in a key-value index structure. Each feature f corresponds to at most k data segments in the context table, i.e. at most k data segments share the feature f.
As shown in fig. 3, the deduplication rates obtained differ across data sets, data-segment lengths, and the sampling rates used to obtain the feature values. Here 1024, 2048, and 4096 are the numbers of data blocks per segment; with an average data block of 4KB these give segments of 4MB, 8MB, and 16MB, each measured with 1, 2, 4, and 8 sampled features. The figure shows that, for the same segment size, more sampled features yield a better deduplication effect; for the same number of sampled features, smaller segments yield a higher deduplication rate. In practice the number of sampled features determines the deduplication effect, since duplicate detection relies on the sampled feature values to drive data-segment prefetching and thus achieve a good approximate deduplication result.
step 02, making a selection strategy, and selecting a specific data segment as a champion
The method combines the three strategies (latest, random, and epsilon-greedy). From all candidates mapped by a given feature, a first data segment is selected by the epsilon-greedy strategy, which tends to pick the highest-scoring segment with probability 1-epsilon or a random segment with probability epsilon; a further data segment is selected by the latest strategy. Champions are selected to share as few common features as possible, so that the selected champion data segments complement each other and the segments chosen by fingerprint features are not so similar that the duplicate-checking effect is lost.
Using the Kernel data (fig. 4-a) and half a month of consecutive User014 data from FSLHomes (fig. 4-b) as test sets, epsilon was set to 0.1, 0.3, 0.5, 0.7, and 0.9 respectively, with the average deduplication rate as the metric. As fig. 4 shows, the deduplication rate decreases as epsilon increases; epsilon = 0.1 means that a random data segment is selected with probability 0.1 and the highest-scoring segment with probability 0.9. This shows that selecting the highest-scoring data segment gives the better result.
Step 03 of feeding back reward value to data segment according to hit condition
Using the champion data segment selected in step 02, fixed-length data segments are prefetched into the cache in advance, so that subsequent data segments to be deduplicated can be found directly in the cache; this raises the cache hit rate and reduces accesses to the on-disk fingerprint index. The cache is replaced with the LRU policy.
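The LRU-replaced fingerprint cache named above can be sketched with an ordered map; the capacity, method names, and segment-id payload are illustrative assumptions.

```python
from collections import OrderedDict

class FingerprintCache:
    """LRU-replaced cache into which a champion segment's fingerprints are
    prefetched (a sketch of the cache policy described above)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()     # fingerprint -> segment id

    def prefetch(self, fingerprints, segment_id):
        for fp in fingerprints:
            self.entries[fp] = segment_id
            self.entries.move_to_end(fp)             # mark most recently used
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)     # evict least recently used

    def lookup(self, fp):
        """A hit means a duplicate block was identified without disk I/O."""
        if fp in self.entries:
            self.entries.move_to_end(fp)             # refresh recency
            return True
        return False
```

Counting hits per cached segment between prefetch and eviction yields exactly the feedback value written back to the context table.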
The champion data segment and the data segments following it are prefetched, and fingerprint lookup is performed against the data segments to be deduplicated. A lookup hit means a duplicate data block has been identified; query hits for a data segment are accumulated before it is moved out of the cache, and when the segment is removed the score feedback is written back to the context table. The results shown in fig. 4 verify the validity of the score, i.e. the effect of feeding back the reward value.
Step 04 update context table
The context table is a key-value index structure: the key is a fingerprint feature and the value is the k data segments corresponding to that feature, held in a queue. The attributes of each data segment include a score and the number of data segments allowed to be prefetched; the score aids champion selection and is also the basis for feedback. As shown in fig. 5, a metadata store (fingerprint store) holds the fingerprint sequence of the backup stream for data recovery, and a data-block store (container store) holds all data blocks.
In each entry of the context table, the same feature corresponds to at most k data segments stored in a context queue. When the queue is full, a new data segment is always inserted at the tail, and an old data segment is removed according to either the first-in-first-out or the minimum-value replacement strategy. For example, when the queue of feature X is full: under the first-in-first-out strategy, the S_x1 node is deleted and the new data segment is inserted behind the S_x3 node; under the minimum-value strategy, the node with the lowest score is deleted and the new data segment is inserted behind the S_x3 node.
As shown in fig. 5, the minimum-value and first-in-first-out replacement strategies were compared experimentally on the different data sets with epsilon set to 0.3. On the three data sets FSLHomes (fig. 5-a), Macos (fig. 5-c), and Vmdk (fig. 5-d), the minimum-value replacement strategy achieves a higher deduplication rate, which also demonstrates the validity of the score. Conversely, on Kernel (fig. 5-b) the first-in-first-out strategy works better, because the Kernel data set consists of consecutive backup versions of small files with very strong temporal order.
Selecting a suitable replacement strategy to update the context table according to the characteristics of the data stream is a great help to deduplication performance, and embodies an advantage of the proposed reinforcement-learning method.

Claims (6)

1. A fingerprint index prefetching method based on reinforcement learning in a data de-duplication system, providing a novel adaptive method aimed at the key problem of the disk bottleneck of fingerprint-index access in duplicate data detection, improving the data caching mechanism, and improving data storage efficiency and recovery performance; the method comprises the following steps:
1, dividing the data into data segments according to the byte content of the data stream: the method adopts a variable-length blocking method, analyzing the input data stream with a sliding window and rapidly computing data fingerprints with a hash algorithm; if a fingerprint matches a predefined pattern, the position of the current sliding window is taken as the boundary of a data block;
And 2, extracting the characteristic fingerprints of the data segment: assuming that each data segment selects m features, the m smallest fingerprints are selected as features by a comparison rule, and a hash function associates each feature with one entry in a context table; the more features are sampled, the more champions are selected, and the more duplicate data blocks can finally be identified;
And 3, when each new data segment is compared with the existing data segment, a strategy for selecting the existing data segment set is formulated: carrying out balance comparison on the current best feedback data segment and the unknown segment by using a multi-arm slot machine model, dynamically selecting a data segment for prefetching, and selecting a corresponding data segment as a comparison object, namely a champion data segment;
And 4, feeding back a reward value to the data segment according to the hit condition: before the champion data segment is moved out of the cache, query hits of the data segment are accumulated, when the data segment is removed, the integral feedback is updated to the context table, and the corresponding score of the data segment is updated;
And 5, updating a context table: when the characteristic of a data segment is selected, if the characteristic is not in the context table, the characteristic and the associated data segment are inserted into the context table as a new item, otherwise, the data segment information is added into the corresponding characteristic item.
2. The reinforcement-learning-based fingerprint index prefetching method of claim 1, wherein step 1 divides the data stream of the data-block sequence into several data segments by a content-based segmentation method, and then selects a subset of fingerprints from each data segment as features by sampling.
3. The fingerprint index prefetching method based on reinforcement learning as claimed in claim 1, wherein step 2 samples a small fraction of features from each data segment at a set ratio and stores them in the index structure, selecting the m smallest fingerprints as features by the comparison rule; here a random sampling method is used: out of every 2^n fingerprints, the fingerprint whose value mod 2^n equals 0 is taken as a feature; in practical implementation, m is usually not higher than 3 to achieve a good deduplication effect.
4. The reinforcement learning-based fingerprint index prefetching method of claim 1, wherein the method for selecting data segments in step 3 is: selecting a champion data segment from the context table by a greedy strategy, and if the selected champion data segment is not in the set S of similar data segments, putting the set S into the context table; the set S refers to a data segment set with the same characteristics as the currently detected data segment; if the selected champion data segment is in the data segment similarity set S, selecting a new data segment which is not in the set S by using a latest strategy, and putting the new data segment into the set S; and finally, putting other candidates of the feature mapping into the set S to finally obtain all champion data segments.
5. The fingerprint index prefetching method based on reinforcement learning as claimed in claim 1, wherein the method of feeding back the reward value to the data segment in step 4 is as follows: let s denote a data segment, let the results in the n feedbacks after s is selected as champion be r_1, r_2, …, r_n, and let Q_n(s) be the result after the n feedbacks; the feedback reward value is Q_n(s) = Q_{n-1}(s) + (r_n − Q_{n-1}(s)) / n, where Q_0(s) is initialized to 0, and this calculation only requires the memory overhead of storing Q_{n-1}(s) and n and the computational overhead of the feedback reward value r_n.
6. The reinforcement-learning-based fingerprint index prefetching method of claim 1, wherein the context table is updated in step 5 as follows: each entry of the context table stores, for one feature, a queue of at most k corresponding data segments; when the queue is full, a new data segment is always inserted at the tail of the queue and an old data segment is evicted from the queue, using a first-in-first-out or minimum-value replacement policy for the evicted data segment.
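The bounded per-feature queue of claim 6 maps naturally onto a `deque` with `maxlen`, which implements exactly the FIFO variant (the minimum-value replacement variant would need an explicit scan). A minimal sketch with illustrative names:

```python
from collections import deque

class ContextTable:
    """Maps each feature to a bounded queue of at most k data segments."""

    def __init__(self, k=4):
        self.k = k
        self.table = {}

    def add(self, feature, segment):
        # deque(maxlen=k) evicts the oldest entry when full (FIFO);
        # new segments are always appended at the tail, as claimed.
        queue = self.table.setdefault(feature, deque(maxlen=self.k))
        queue.append(segment)

    def lookup(self, feature):
        return list(self.table.get(feature, ()))
```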
CN201910852882.7A 2019-09-10 2019-09-10 Fingerprint index prefetching method based on reinforcement learning in data de-duplication system Pending CN110569245A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910852882.7A CN110569245A (en) 2019-09-10 2019-09-10 Fingerprint index prefetching method based on reinforcement learning in data de-duplication system


Publications (1)

Publication Number Publication Date
CN110569245A true CN110569245A (en) 2019-12-13

Family

ID=68778629

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910852882.7A Pending CN110569245A (en) 2019-09-10 2019-09-10 Fingerprint index prefetching method based on reinforcement learning in data de-duplication system

Country Status (1)

Country Link
CN (1) CN110569245A (en)


Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102222085A (en) * 2011-05-17 2011-10-19 华中科技大学 Data de-duplication method based on combination of similarity and locality
CN102663086A (en) * 2012-04-09 2012-09-12 华中科技大学 Method for retrieving data block indexes
CN103514250A (en) * 2013-06-20 2014-01-15 易乐天 Method and system for deleting global repeating data and storage device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Mao Qunfang: "Design and Analysis of an Intelligent Prefetching Algorithm for Data Deduplication", China Master's Theses Full-text Database (Information Science) *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11868214B1 (en) 2020-02-02 2024-01-09 Veritas Technologies Llc Methods and systems for affinity aware container prefetching
CN112597345A (en) * 2020-10-30 2021-04-02 深圳市检验检疫科学研究院 Laboratory data automatic acquisition and matching method
CN112597345B (en) * 2020-10-30 2023-05-12 深圳市检验检疫科学研究院 Automatic acquisition and matching method for laboratory data
CN112637153A (en) * 2020-12-14 2021-04-09 南京壹进制信息科技有限公司 Method and system for removing duplicate in storage encryption
CN112637153B (en) * 2020-12-14 2024-02-20 航天壹进制(江苏)信息科技有限公司 Method and system for storing encryption and deduplication
CN112799590A (en) * 2021-01-21 2021-05-14 中国人民解放军国防科技大学 Differential caching method for online main storage deduplication
CN112799590B (en) * 2021-01-21 2022-07-19 中国人民解放军国防科技大学 Differentiated caching method for online main storage deduplication
WO2022228693A1 (en) * 2021-04-30 2022-11-03 Huawei Technologies Co., Ltd. System and method for indexing a data item in a data storage system
CN116451660A (en) * 2023-04-11 2023-07-18 浙江法之道信息技术有限公司 Legal text professional examination and intelligent annotation system
CN116451660B (en) * 2023-04-11 2023-09-19 浙江法之道信息技术有限公司 Legal text professional examination and intelligent annotation system
CN116775666A (en) * 2023-08-24 2023-09-19 北京遥感设备研究所 Method for automatically adjusting data index on line
CN116775666B (en) * 2023-08-24 2023-11-14 北京遥感设备研究所 Method for automatically adjusting data index on line

Similar Documents

Publication Publication Date Title
CN110569245A (en) Fingerprint index prefetching method based on reinforcement learning in data de-duplication system
US9858303B2 (en) In-memory latch-free index structure
US9798754B1 (en) Method to efficiently track I/O access history using efficient memory data structures
Srinivasan et al. iDedup: latency-aware, inline data deduplication for primary storage.
US8914338B1 (en) Out-of-core similarity matching
US8280860B2 (en) Method for increasing deduplication speed on data streams fragmented by shuffling
US20170293450A1 (en) Integrated Flash Management and Deduplication with Marker Based Reference Set Handling
US20040205044A1 (en) Method for storing inverted index, method for on-line updating the same and inverted index mechanism
KR20170054299A (en) Reference block aggregating into a reference set for deduplication in memory management
US8442954B2 (en) Creating and managing links to deduplication information
CN109445702B (en) block-level data deduplication storage system
AU2010200866B1 (en) Data reduction indexing
US8225060B2 (en) Data de-duplication by predicting the locations of sub-blocks within the repository
US10509769B1 (en) Method to efficiently track I/O access history
US11262929B2 (en) Thining databases for garbage collection
Xu et al. Lipa: A learning-based indexing and prefetching approach for data deduplication
US20170123689A1 (en) Pipelined Reference Set Construction and Use in Memory Management
CN113535670B (en) Virtual resource mirror image storage system and implementation method thereof
Zhang et al. Improving the performance of deduplication-based backup systems via container utilization based hot fingerprint entry distilling
CN111274212A (en) Cold and hot index identification and classification management method in data deduplication system
WO2022205544A1 (en) Cuckoo hashing-based file system directory management method and system
Yu et al. Pdfs: Partially dedupped file system for primary workloads
US8156126B2 (en) Method for the allocation of data on physical media by a file system that eliminates duplicate data
US20200019539A1 (en) Efficient and light-weight indexing for massive blob/objects
Zhang et al. Improved deduplication through parallel binning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20191213