CN104932841B - Economizing type data de-duplication method in a kind of cloud storage system - Google Patents


Info

Publication number
CN104932841B
Authority
CN
China
Prior art keywords
data
data block
mrow
node
file
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201510339033.3A
Other languages
Chinese (zh)
Other versions
CN104932841A (en)
Inventor
徐小龙
涂群
李涛
徐佳
朱洁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ciic Yunfu Hangzhou Medical Technology Co Ltd
Original Assignee
Nanjing Post and Telecommunication University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Post and Telecommunication University filed Critical Nanjing Post and Telecommunication University
Priority to CN201510339033.3A priority Critical patent/CN104932841B/en
Publication of CN104932841A publication Critical patent/CN104932841A/en
Application granted granted Critical
Publication of CN104932841B publication Critical patent/CN104932841B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The invention discloses a resource-saving data deduplication method in a cloud storage system. The cloud storage system consists of clients that perform file operations, a metadata server that stores the file system metadata, a secondary metadata server that synchronously backs up the metadata image file and the operation log, and storage nodes that store data blocks. The five-step method addresses the dynamic nature of data in a cloud storage system: taking the characteristics of the data itself into account, it divides data into hot data and non-hot data and deduplicates each at a different time, which preserves system performance and reduces system response time more effectively.

Description

A resource-saving data deduplication method in a cloud storage system
Technical field
The present invention relates to the field of computer data storage, and in particular to a resource-saving data deduplication method in a cloud storage system.
Background art
In recent years, the growing popularity of technologies such as cloud computing, mobile computing and the Internet of Things has caused data to grow explosively, and cloud storage technology has emerged in response. According to statistics from International Data Corporation (IDC), the global data volume reached 1.8 ZB in 2011 (1 ZB = 10^9 TB), and the amount of information produced worldwide is expected to reach 35 ZB by 2020. The storage pressure on systems is therefore growing day by day. IDC surveys have also found that nearly 75% of the data in information systems is duplicated and redundant. This redundant data wastes a large amount of storage resources, and data deduplication technology can effectively reduce the data volume.
Data deduplication technology compares fingerprint values to retain only unique data, and replaces the other duplicate copies with pointers to the unique data. It has been widely used in backup and archival systems. The more mature deduplication strategies include the semantic-aware multi-tiered source deduplication method (Semantic-Aware Multi-Tiered Deduplication, SAM-Dedupe), the causality-based deduplication method (Causality-Based Deduplication, CABdedupe) and the application-aware deduplication method (Application-Aware Deduplication, AA-Dedupe). Each has advantages and disadvantages. SAM-Dedupe continually narrows the fingerprint comparison scope by using knowledge of file size, file location, file type and file timestamp. CABdedupe captures and records the causal relationships between backup data sets at multiple points in time, and identifies unmodified data for deduplication. AA-Dedupe applies different chunking algorithms and fingerprint extraction techniques to different types of application files to obtain the best deduplication effect; for example, static application data and virtual machine images are chunked with the FSC (Fixed-Sized Chunking) algorithm and fingerprinted with the MD5 algorithm. These strategies all take a backup system as their environment, so the data they process is relatively static: after data has been uploaded to the storage end, users do not modify it there directly. Simply transplanting these methods is therefore not suitable for cloud storage systems. Some research results also exist for cloud storage systems, but they focus on system security: deduplication mechanisms based on proxy encryption, deduplication mechanisms based on interactive PoW (Proof of Ownership), or secure deduplication mechanisms based on data popularity. Data deduplication causes the same data block to be shared by multiple users, and users modify data in diverse ways, so the availability and security of the data must be ensured.
The prior art generally targets backup and archival systems whose data is relatively static. It avoids uploading duplicate data at the source, but does not consider whether data in the storage system may be modified after upload. In a cloud storage system, data is shared by multiple users, and modification by multiple users makes the data more dynamic, so the prior art is not suitable for cloud storage systems.
Summary of the invention
In order to solve the above technical problems, the technical solution adopted by the present invention is as follows:
A resource-saving data deduplication method in a cloud storage system, wherein the cloud storage system consists of clients that perform file operations, a metadata server that stores the file system metadata, a secondary metadata server that synchronously backs up the metadata image file and the operation log, and storage nodes that store data blocks. The method comprises the following steps:
Step 1: Each client preprocesses the local files to be uploaded and performs local data deduplication at the file level and the block level to prevent duplicate data from being uploaded again, then uploads the metadata of the files to be uploaded to the metadata server;
Step 2: The metadata server receives the metadata from different clients, reads the file fingerprints and data block fingerprints in turn, compares them against the fingerprint index information in memory, on hard disk and in the write buffer, and finally returns to each client the fingerprint values of the data that has not yet been uploaded;
Step 3: The client uploads the new data that has not yet been uploaded to the storage end; the storage end stores the new data and updates its metadata information table;
Step 4: The client sends a request to modify data, obtains from the metadata server the number of the storage node holding the data to be modified, then connects to that storage node and modifies the data at the storage end directly;
Step 5: The storage end checks the modified data block. If fingerprint comparison shows that the modified data block already exists on this node, it is deduplicated immediately. If the modified data block is not on this node, it is first saved on this node; if comparison by the metadata server then finds it on another node, delayed deduplication is applied to the block. If comparison against the fingerprint indexes on this node and on the metadata server shows that the modified data block is neither on this node nor on any other node, the data block is saved on this node and the metadata server additionally creates a replica of it.
The cloud storage system is further characterized in that the metadata server also contains a filtering module and an update module. The filtering module filters duplicate data information from different clients. The update module updates the global metadata of the storage end: it updates the metadata of duplicate data blocks directly, and updates the metadata of non-duplicate data blocks only after receiving feedback from the storage node.
The client has a file preprocessing module, a local deduplication module, a metadata management module and a data transmission module. The file preprocessing module classifies files by type and hands them to the local deduplication module for file-level deduplication. The non-duplicate files that remain are returned to the file preprocessing module, which filters out those smaller than 64 MB; the local deduplication module then performs block-level deduplication. The metadata management module records the fingerprint values of the data blocks the client has already uploaded, to avoid uploading local duplicate data. The data transmission module is the client's interface to the metadata server and the storage nodes: it uploads the metadata of the files to be uploaded to the metadata server, and uploads non-duplicate data blocks to the storage nodes.
The storage node includes a storage module, a metadata management module, a detection-and-reporting module and a delayed deduplication module. The storage module is responsible for storing data blocks and allocating their physical addresses. The metadata management module records the metadata of the data blocks on this node. The detection-and-reporting module detects duplicate data produced by modifications to data blocks and hands it to the delayed deduplication module, which determines whether each block is a hot duplicate data block, handles it accordingly, and feeds the modified metadata back to the detection-and-reporting module, which then reports it to the metadata server.
The file-level data deduplication in step 1 is as follows: the MD5 algorithm is used to calculate file fingerprint values; the fingerprints of files with equal size and type are compared, and then compared again with the local metadata information table, to determine duplicate files and non-duplicate files;
The block-level data deduplication in step 1 is as follows: the non-duplicate files that remain after those smaller than 64 MB have been filtered out are chunked with a fixed-length chunking algorithm, with the block length set to 64 MB; the fingerprint value of each data block is calculated with the MD5 algorithm, and data blocks of equal length are compared to determine duplicate data blocks.
When file fingerprints are compared in step 2, if a fingerprint value is found to exist already, the fingerprints of that file's data blocks are not compared; otherwise the data block fingerprints of the file are compared as well.
In step 3, each storage node keeps the mapping between the fingerprints of the data blocks it holds and their storage addresses, so the physical address of a stored data block can be determined from its fingerprint.
In step 4, modifications to data blocks by multiple users can introduce new duplicate data blocks, which existing storage systems do not consider. In a backup system, the user modifies data locally and then backs it up again, and the unmodified parts are filtered out during backup. Cloud storage, by contrast, gives the user the experience of cloud data being local: the user obtains the address of the data to be modified and modifies the data directly. This is the difference between cloud storage and backup systems.
The delayed deduplication in step 5 comprises operations on both hot duplicate data blocks and non-hot duplicate data blocks. Whether a duplicate block is hot is determined with formula (1).
In the formula, a data block is modified on node i, and it is determined that the block has no duplicate on node i but has a duplicate on node j. The averaged term represents the average number of accesses to the data block at the storage end, excluding node i, during the period $t_{p+1} - t_p$; α is a threshold, the minimum number of accesses per unit time for a block to become a hot data block; $A_j(t_p)$ and $A_j(t_{p+1})$ are the access counts of the data block on node j at times $t_p$ and $t_{p+1}$ respectively; Z is the set of numbers of the nodes holding data block B.
Deduplication of a hot duplicate data block is delayed, to reduce the access response time of the system. For a non-hot duplicate data block, the copy on the node with relatively less remaining capacity, among the nodes holding the block, is deleted to achieve load balancing.
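Formula (1) itself does not survive in this text. One form that is consistent with the symbol definitions above, offered only as an assumed reconstruction and not as the patent's exact expression, averages the increase in access count over the other nodes holding the block and over the length of the interval:

```latex
\bar{A}(t_{p+1} - t_p)
  = \frac{\sum_{j \in Z,\; j \neq i} \bigl( A_j(t_{p+1}) - A_j(t_p) \bigr)}
         {\lvert Z \setminus \{i\} \rvert \,(t_{p+1} - t_p)}
  \;\ge\; \alpha
```

Here $\bar{A}$ is a stand-in name for the averaged term, and both normalizations (by the number of other nodes and by the interval length) are assumptions. They are chosen so that the left-hand side is a per-node access rate that can be compared with α, which the text defines as a minimum number of accesses per unit time.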
Beneficial effects
1. Existing data deduplication is mainly oriented to backup and archival systems whose data is relatively static, and is not suitable for cloud storage systems, where data is shared by multiple users and modification by multiple users makes the data more dynamic. The present invention addresses the dynamic nature of data in a cloud storage system: taking the characteristics of the data itself into account, it divides data into hot data and non-hot data and deduplicates each at a different time, to preserve system performance.
2. Compared with existing deduplication strategies in cloud storage, the present invention incorporates a replica management mechanism. While ensuring data availability, it delays the deletion of duplicate hot data blocks (temporarily treating them as replicas), which relieves user access pressure on hot data blocks for a period of time and therefore reduces system response time more effectively.
3. The present invention also treats a duplicate non-hot data block as a replica: it compares the storage load of the nodes holding all the replicas and deletes the replica on the node with the greater load, so that the storage load is more balanced.
Brief description of the drawings
Fig. 1 is an architecture diagram of the cloud storage data deduplication system.
Fig. 2 is a flow chart of delayed data deduplication.
Fig. 3 is a schematic diagram of how the storage end processes a modified data block.
Detailed description
For ease of description, the present invention provides an architecture diagram of the cloud storage data deduplication system, shown in Fig. 1. The system consists of m clients (Client), one metadata server (Metadata Server, MS), one secondary metadata server (Secondary Metadata Server, SMS) and n storage nodes (Storage Node, Snode). The client is the party that initiates operations such as file upload, access, modification and deletion. The metadata server stores all the metadata of the file system and provides the basis for access control and global deduplication; it is the hub of the whole architecture. The secondary metadata server synchronously backs up the metadata image file and the operation log. The storage nodes store the actual data blocks. The components are closely connected and cooperate with one another. Only metadata is exchanged between a client and the metadata server, which lightens the load on the metadata transmission bandwidth. When a client wants to upload data, it uses the metadata server to determine which data is not duplicated; when a client wants to access (including modify) data, it uses the metadata server to determine the node holding the data. Data is transferred between clients and storage nodes. Storage nodes also interact with the metadata server: for example, the metadata of data modified on a storage node is exchanged with the metadata server to determine whether the data is a duplicate. The metadata server can also create replicas of data, according to how the data on the storage nodes is being accessed, to reduce access load. In an architecture with only one metadata server, the whole system is paralyzed once that server fails; the metadata server and the secondary metadata server therefore operate in an active-standby relationship.
The client mainly has a file preprocessing module, a local deduplication module, a metadata management module and a data transmission module. The file preprocessing module classifies files by type and later, before block-level deduplication, filters out non-duplicate files smaller than 64 MB. The local deduplication module performs deduplication at both the file level and the block level. The metadata management module mainly records the fingerprint values of the data blocks the client has uploaded, to avoid uploading local duplicate data. The data transmission module uploads the metadata of the files to be uploaded to the metadata server and uploads non-duplicate data blocks to the storage nodes. The modules are interrelated: files processed by the file preprocessing module are handed to the local deduplication module for file-level deduplication; the non-duplicate files that remain are returned to the file preprocessing module for filtering; finally the local deduplication module performs block-level deduplication. Any part of this process that involves metadata interacts with the metadata management module, and the data transmission module is the client's interface to the metadata server and the storage nodes.
The metadata server has a filtering module and an update module. The filtering module uses the metadata in the index table on the metadata server (distributed across memory and disk) and in the write buffer to filter out duplicate data information from different clients. For duplicate data blocks, the update module updates the metadata of the corresponding block directly. For non-duplicate data blocks, the update module writes their metadata into the on-disk index table only after receiving feedback from the storage node. When data on a storage node is modified, the node also interacts with the metadata server, which triggers the update module to update the index table on the metadata server.
The storage node mainly includes a storage module, a metadata management module, a detection-and-reporting module and a delayed deduplication module. The storage module is mainly responsible for storing data blocks and recording their physical addresses. The metadata management module records the metadata of the data blocks on this node. The detection-and-reporting module mainly detects duplicate data produced by modifications to data blocks, hands it to the delayed deduplication module, and reports the modified metadata to the metadata server. The delayed deduplication module takes the detected duplicate data blocks and determines whether each is a hot duplicate data block: deduplication of hot duplicate data blocks is delayed, while for non-hot duplicate data blocks the identical block on a suitable node is selected for deletion. The parts of this module that involve metadata interact with the metadata management module and the detection-and-reporting module.
The present invention carries out data de-duplication according to following steps:
Step 1: Each client preprocesses the local files to be uploaded and performs local data deduplication at the file level and the block level to prevent duplicate data from being uploaded again, then uploads the metadata of the files to be uploaded (including the fingerprint value of each file and the fingerprint values of all its data blocks) to the metadata server. The fingerprint values of duplicate data blocks are uploaded so that the metadata server can update the reference counts of those blocks. The local data deduplication operations are as follows:
1. File-level data deduplication: the MD5 algorithm is used to calculate file fingerprint values; the fingerprints of files with equal size and type are compared, and then compared again with the local metadata information table, to determine duplicate files and non-duplicate files;
2. Block-level data deduplication: the non-duplicate files (after those smaller than 64 MB have been filtered out) are chunked with a fixed-length chunking algorithm, with the block length set to 64 MB; the fingerprint value of each data block is calculated with the MD5 algorithm, and data blocks of equal length are compared to determine duplicate data blocks.
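The two local deduplication passes of step 1 can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the function names, the `(name, file_type, content)` tuple layout and the dictionary used as the local metadata table are all assumptions made for the example.

```python
import hashlib

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB block length, as stated in the text


def md5_hex(data: bytes) -> str:
    """Return the MD5 fingerprint of a byte string as a hex digest."""
    return hashlib.md5(data).hexdigest()


def file_level_dedup(files, local_table):
    """Split files into non-duplicates and duplicates.

    files: iterable of (name, file_type, content) tuples (hypothetical layout).
    local_table: dict mapping (file_type, size) -> set of file fingerprints,
    standing in for the client's local metadata information table.
    """
    unique, duplicates = [], []
    for name, file_type, content in files:
        fingerprint = md5_hex(content)
        # Only files of equal type and size are compared with each other.
        bucket = local_table.setdefault((file_type, len(content)), set())
        if fingerprint in bucket:
            duplicates.append(name)
        else:
            bucket.add(fingerprint)
            unique.append((name, fingerprint))
    return unique, duplicates


def block_level_dedup(content, known_block_fps, chunk_size=CHUNK_SIZE):
    """Chunk content into fixed-length blocks.

    Returns (all_fps, new_blocks): the fingerprint of every block in order,
    and a dict fingerprint -> block for blocks not seen before.
    """
    all_fps, new_blocks = [], {}
    for offset in range(0, len(content), chunk_size):
        block = content[offset:offset + chunk_size]
        fingerprint = md5_hex(block)
        all_fps.append(fingerprint)
        if fingerprint not in known_block_fps and fingerprint not in new_blocks:
            new_blocks[fingerprint] = block
    return all_fps, new_blocks
```

In this sketch the caller is expected to skip files smaller than `CHUNK_SIZE` before calling `block_level_dedup`, mirroring the 64 MB filter described above; the list `all_fps` corresponds to the block fingerprints that are sent to the metadata server as part of the file's metadata.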
Step 2: The metadata server receives the metadata from different clients, reads the file fingerprints and data block fingerprints in turn, compares them against the fingerprint index information in memory, on hard disk and in the write buffer, and finally returns to each client the fingerprint values of the data that has not yet been uploaded.
When file fingerprints are compared, if a fingerprint value is found to exist already, the fingerprints of that file's data blocks are not compared; otherwise the data block fingerprints of the file are compared as well. The fingerprint index table is distributed across memory and hard disk, mainly because memory space is extremely limited, so most of the index table is stored on hard disk. In addition, some data block fingerprints are held in the write buffer: the storage end has not yet finished storing the new data blocks sent by clients, so the fingerprints of that new data cannot yet be written to hard disk.
During fingerprint comparison, the present invention spends time on classifying files and sorting them by size, and exploits two observations to keep narrowing the comparison scope: "files of the same type and size are very likely to be similar files" and "the identical blocks shared by files of different types are almost negligible".
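The lookup in step 2 can be pictured with a small sketch. The class and method names below are hypothetical, and plain Python sets stand in for the in-memory index, the on-disk index and the write buffer; a real server would use persistent structures.

```python
class FingerprintIndex:
    """Toy model of the metadata server's three fingerprint locations."""

    def __init__(self, memory=(), disk=(), write_buffer=()):
        self.memory = set(memory)              # hot part of the index table
        self.disk = set(disk)                  # bulk of the index table
        self.write_buffer = set(write_buffer)  # blocks not yet confirmed stored
        self.extra_refs = {}                   # references added by duplicates

    def contains(self, fp):
        # Check the cheap locations first, the on-disk index last.
        return fp in self.memory or fp in self.write_buffer or fp in self.disk

    def _add_ref(self, fp):
        self.extra_refs[fp] = self.extra_refs.get(fp, 0) + 1

    def filter_upload(self, file_fp, block_fps):
        """Return the block fingerprints the client still has to upload."""
        if self.contains(file_fp):
            # Whole file already known: block fingerprints are not compared.
            self._add_ref(file_fp)
            return []
        missing = []
        for fp in block_fps:
            if self.contains(fp):
                self._add_ref(fp)              # duplicate block: update at once
            else:
                missing.append(fp)
                self.write_buffer.add(fp)      # pending storage-node feedback
        if missing:
            self.write_buffer.add(file_fp)
        else:
            self.disk.add(file_fp)
        return missing

    def confirm_stored(self, fp):
        """Called after the storage node reports that the data is stored."""
        self.write_buffer.discard(fp)
        self.disk.add(fp)
```

Keeping pending fingerprints in the write buffer means that a second client uploading the same new block during the storage window is already treated as a duplicate, which is the reason the text gives for consulting the write buffer at all.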
Step 3: The client uploads the new data that has not yet been uploaded to the storage end; the storage end stores the new data and updates its metadata information table.
For duplicate data, the client has already updated the corresponding information on the metadata server through steps 1 and 2; non-duplicate data is uploaded by the client directly to the storage end. Each storage node keeps the mapping between the fingerprints of the data blocks it holds and their storage addresses, so the physical address of a stored data block can be determined from its fingerprint.
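The fingerprint-to-address mapping kept by each storage node might look like the sketch below. The class name `BlockStore`, the list used as simulated physical storage and the reference counter are assumptions for illustration only.

```python
import hashlib


class BlockStore:
    """Toy storage node: maps block fingerprints to physical addresses."""

    def __init__(self):
        self.blocks = []      # simulated physical storage; list index = address
        self.index = {}       # fingerprint -> physical address
        self.ref_count = {}   # fingerprint -> number of references

    def store(self, data: bytes) -> int:
        """Store a block and return its address, reusing an identical block."""
        fp = hashlib.md5(data).hexdigest()
        if fp in self.index:
            self.ref_count[fp] += 1
            return self.index[fp]
        address = len(self.blocks)
        self.blocks.append(data)
        self.index[fp] = address
        self.ref_count[fp] = 1
        return address

    def read(self, fp: str) -> bytes:
        """Resolve a fingerprint to its physical address and return the block."""
        return self.blocks[self.index[fp]]
```

Because the address is looked up by fingerprint, a client that holds only the metadata of a file (its ordered list of block fingerprints) has enough information to retrieve every block.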
Step 4: The client sends a request to modify data, obtains from the metadata server the number of the storage node holding the data to be modified, then connects to that storage node and modifies the data at the storage end directly.
How clients modify data differs from user to user: users sharing the same data block modify it in different ways, and different data may also be modified into identical data. This is the dynamic nature of cloud storage data, and it is what distinguishes cloud storage from backup systems. In a backup system, the user modifies data locally and then backs it up again, and the unmodified parts are filtered out during backup. Cloud storage gives the user the experience of cloud data being local: the user obtains the address of the data to be modified and modifies the data directly.
Step 5: The storage end checks the modified data block, determines which of the cases in Table 1 it belongs to, and takes the corresponding action. The principle of the method is shown in Fig. 2.
Table 1: The three cases of a modified data block and the corresponding operations
The fingerprint value of a modified data block must be recalculated and compared against the metadata on this node. If the data block is found on this node, it is deduplicated immediately. If the modified data block is not on this node, it is first saved on this node; if comparison by the metadata server then finds it on another node, delayed deduplication is performed. If comparison against the fingerprint indexes on this node and on the metadata server shows that the modified data block is neither on this node nor on any other node, the metadata server additionally creates a replica of the block. Delayed deduplication comprises operations on both hot duplicate data blocks and non-hot duplicate data blocks, which are distinguished with formula (1). Deduplication of a hot duplicate data block is delayed, to reduce the access response time of the system. For a non-hot duplicate data block, the copy on the node with relatively less remaining capacity, among the nodes holding the block, is deleted to achieve load balancing.
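The three cases just described amount to a small decision function. The sketch below assumes the metadata server's index is available as a dictionary from fingerprint to the set of node ids holding the block; that representation and all names are hypothetical.

```python
from enum import Enum


class Action(Enum):
    DEDUP_NOW = "deduplicate immediately on this node"
    DELAYED_DEDUP = "store on this node, then apply delayed deduplication"
    STORE_AND_REPLICATE = "store on this node and create a replica"


def classify_modified_block(fp, local_fps, global_index, node_id):
    """Decide how the storage end handles a modified block.

    fp: fingerprint of the modified block.
    local_fps: set of fingerprints already stored on this node.
    global_index: dict fingerprint -> set of node ids (metadata server view).
    node_id: id of the node on which the block was modified.
    """
    if fp in local_fps:
        return Action.DEDUP_NOW
    other_nodes = global_index.get(fp, set()) - {node_id}
    if other_nodes:
        return Action.DELAYED_DEDUP
    return Action.STORE_AND_REPLICATE
```

The local check comes first because it needs no communication with the metadata server; only blocks that survive it require the global comparison.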
For ease of understanding, some concepts are defined as follows:
Hot data block: a data block whose average access frequency over a period of time reaches a certain threshold, that is, one that satisfies formula (1). A data block that does not satisfy this condition is called a non-hot data block.
Hot duplicate data block: a modified data block A' is not found on this node, but an identical data block A is found on another node, and A is a hot data block; A' is then called a hot duplicate data block.
Non-hot duplicate data block: a modified data block B' is not found on this node, but an identical data block B is found on another node, and B is a non-hot data block; B' is then called a non-hot duplicate data block.
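The hot/non-hot test can be sketched in code. Formula (1) is not reproduced in this text, so the normalization used here (dividing the increase in access count by the number of other nodes and by the interval length) is an assumption chosen to yield an average access frequency comparable with the threshold α; the function names are hypothetical.

```python
def average_access_rate(counts_tp, counts_tp1, t_p, t_p1, exclude_node):
    """Average per-node access rate of a block over [t_p, t_p1].

    counts_tp, counts_tp1: dict node_id -> cumulative access count of the
    block at times t_p and t_p1. exclude_node is the node on which the
    block was just modified (node i in the text).
    """
    nodes = [j for j in counts_tp1 if j != exclude_node]
    if not nodes or t_p1 <= t_p:
        raise ValueError("need at least one other node and a positive interval")
    total = sum(counts_tp1[j] - counts_tp.get(j, 0) for j in nodes)
    # Assumed normalization: per node and per unit time.
    return total / (len(nodes) * (t_p1 - t_p))


def is_hot(counts_tp, counts_tp1, t_p, t_p1, exclude_node, alpha):
    """A block is hot when its average access rate reaches the threshold."""
    rate = average_access_rate(counts_tp, counts_tp1, t_p, t_p1, exclude_node)
    return rate >= alpha
```

For instance, with two other nodes whose counts rise by 20 and 30 over an interval of length 10, the rate is 50 / (2 × 10) = 2.5, so the block counts as hot for a threshold of 2 but not for a threshold of 3.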
For step 5, the present invention also gives, with reference to Fig. 3, the specific steps the storage end performs when a user modifies a data block on storage node i (i = 1, 2, 3, ..., n):
1. Request (modification request): After node i receives a client's request to modify a data block (denoted A), it reads data block A and copies it into memory;
2. Modify: Node i modifies data block A in memory (the modified block is denoted B), then decrements the reference count of A by 1 and calculates the fingerprint value of B with the MD5 algorithm;
3. Check (duplicate detection): Node i quickly checks locally whether the fingerprint value of B already exists, to avoid storing duplicate data. If it does not, jump to step 5; otherwise denote the block on node i that is identical to B as B' and continue with the next step;
4. Deduplicate: Delete data block B and store a pointer to data block B' in its place;
5. Store: Store the modified new data block B on node i and update the local metadata information table of node i;
6. Check (duplicate detection): Node i periodically sends the updated metadata to the metadata server, which determines whether an identical block exists on another node j (j ≠ i). If one is found, jump to step 8; otherwise continue with the next step;
7. Replica (create replica): The metadata server creates a replica of the new data block B;
8. Classification: The metadata server uses formula (1) to determine whether the duplicate data block B is a hot duplicate data block. If so, jump to step 10; otherwise continue with the next step;
In the formula, a data block is modified on node i at time $t_{p+1}$, and it is determined that the block has no duplicate on node i but has a duplicate on node j. The averaged term represents the average number of accesses to the data block at the storage end (excluding node i) during the period $t_{p+1} - t_p$; α is a threshold, the minimum number of accesses per unit time for a block to become a hot data block; $A_j(t_p)$ and $A_j(t_{p+1})$ are the access counts of the data block on node j at times $t_p$ and $t_{p+1}$ respectively; Z is the set of numbers of the nodes holding data block B.
9. Greedy deletion: At time $t_{p+1}$, compare the remaining capacity $S_k(t_{p+1})$ of each node k (k ∈ Z) holding the non-hot duplicate data block B with the average remaining capacity of the storage end, $\bar{S}(t_{p+1})$, and always delete the copy of data block B on the node with relatively less remaining capacity. Then update the metadata server. The average remaining capacity of the storage end at time $t_{p+1}$ is obtained with formula (2):
$$\bar{S}(t_{p+1}) = \frac{1}{n} \sum_{m=1}^{n} S_m(t_{p+1}) \qquad (2)$$
In the formula, $S_m(t_{p+1})$ is the remaining storage capacity of node m at time $t_{p+1}$, and n is the total number of nodes at the storage end.
10. Delayed deletion: At time $t_{p+1}$ the hot data block B is not deleted; the metadata of data block B is synchronized to node j, and step 8 is repeated at the next time $t_{p+2}$.
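The choice made in steps 9 and 10 can be illustrated with a short sketch. The function names and the dictionary of remaining capacities are hypothetical, and the requirement of at least two holders before a copy is deleted is an added safety assumption, not something the text states.

```python
def mean_residual_capacity(residual):
    """Average remaining capacity over all n storage nodes."""
    return sum(residual.values()) / len(residual)


def choose_copy_to_delete(holders, residual):
    """Greedy choice: the holder with the least remaining capacity.

    holders: set of node ids holding the duplicate block (the set Z).
    residual: dict node_id -> remaining capacity at the current time.
    """
    if len(holders) < 2:
        # Assumption: never delete the last remaining copy.
        raise ValueError("at least two copies are needed before one is deleted")
    # sorted() makes the result deterministic when capacities tie.
    return min(sorted(holders), key=lambda node: residual[node])


def handle_duplicate(block_is_hot, holders, residual):
    """Return ('delay', None) for hot blocks, ('delete', node) otherwise."""
    if block_is_hot:
        return ("delay", None)   # re-evaluate at the next time step
    return ("delete", choose_copy_to_delete(holders, residual))
```

Picking the holder with the minimum remaining capacity always selects a node at or below the average among the holders' capacities, so freeing space there moves the nodes toward a more even storage load.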

Claims (10)

1. A resource-saving data deduplication method in a cloud storage system, the cloud storage system consisting of clients that perform file operations, a metadata server that stores the file system metadata, a secondary metadata server that synchronously backs up the metadata image file and the operation log, and storage nodes that store data blocks, the method comprising the following steps:
Step 1: Each client preprocesses the local files to be uploaded and performs local data deduplication at the file level and the block level to prevent duplicate data from being uploaded again, then uploads the metadata of the files to be uploaded to the metadata server;
Step 2: The metadata server receives the metadata from different clients, reads the file fingerprints and data block fingerprints in turn, compares them against the fingerprint index information in memory, on hard disk and in the write buffer, and finally returns to each client the fingerprint values of the data that has not yet been uploaded;
Step 3: The client uploads the new data that has not yet been uploaded to the storage end; the storage end stores the new data and updates its metadata information table;
Step 4: The client sends a request to modify data, obtains from the metadata server the number of the storage node holding the data to be modified, then connects to that storage node and modifies the data at the storage end directly;
Step 5: The storage end checks the modified data block. If fingerprint comparison shows that the modified data block already exists on this node, it is deduplicated immediately. If the modified data block is not on this node, it is first saved on this node; if comparison by the metadata server then finds it on another node, delayed deduplication is applied to the block. If comparison against the fingerprint indexes on this node and on the metadata server shows that the modified data block is neither on this node nor on any other node, the data block is saved on this node and the metadata server additionally creates a replica of it.
2. The resource-saving data deduplication method in a cloud storage system according to claim 1, characterized in that the metadata server also contains a filtering module and an update module; the filtering module filters duplicate data information from different clients; the update module updates the global metadata of the storage end, that is, it updates the metadata of duplicate data blocks directly and updates the metadata of non-duplicate data blocks only after receiving feedback from the storage node.
3. The economizing type data de-duplication method in a cloud storage system according to claim 1, characterized in that the client has a file preprocessing module, a local deduplication module, a metadata management module, and a data transmission module; the file preprocessing module classifies files by file type and hands them to the local deduplication module for file-level deduplication; the non-duplicate files remaining after file-level deduplication are returned to the file preprocessing module, which filters them again to select the non-duplicate files smaller than 64MB; finally the local deduplication module performs block-level deduplication; the metadata management module records the fingerprint values of the data blocks the client has already uploaded, to avoid uploading local duplicate data; the data transmission module is the interface through which the client connects to the metadata server and the storage nodes, that is, it uploads the metadata information of the files to be uploaded to the metadata server and uploads the non-duplicate data blocks to the storage nodes.
4. The economizing type data de-duplication method in a cloud storage system according to claim 1, characterized in that the storage node includes a storage module, a metadata management module, a self-check and report module, and a delayed deduplication module; the storage module is responsible for storing data blocks and allocating their physical addresses; the metadata management module records the metadata information of the data blocks on this node; the self-check and report module detects duplicate data caused by data block modification and hands it to the delayed deduplication module, which judges whether the block is a hot duplicate data block, processes it accordingly, and feeds the modified metadata information back to the self-check and report module, which then reports it to the metadata server.
5. The economizing type data de-duplication method in a cloud storage system according to claim 1, characterized in that the file-level deduplication in step 1 is as follows: file fingerprint values are computed with the MD5 algorithm, the fingerprint values of files of equal size and type are compared, and the results are then compared with the local metadata information table to determine the duplicate files and the non-duplicate files.
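A minimal sketch of the file-level comparison in claim 5 follows, assuming file contents are held in memory and the file type is taken from the file extension. The names `file_fingerprint`, `file_level_dedup`, and `known_fingerprints` (standing in for the local metadata information table) are illustrative assumptions, not part of the patent.

```python
import hashlib
import os
from collections import defaultdict


def file_fingerprint(content: bytes) -> str:
    """MD5 fingerprint of a file's content, as named in the claim."""
    return hashlib.md5(content).hexdigest()


def file_level_dedup(files, known_fingerprints):
    """Split files into non-duplicates and duplicates.

    files:              dict mapping file name -> bytes (assumed input shape).
    known_fingerprints: set of fingerprints already recorded locally
                        (hypothetical stand-in for the local metadata table).
    Returns (unique, duplicates): unique maps name -> fingerprint.
    """
    # Only files of equal size and type are compared with each other.
    groups = defaultdict(list)
    for name, content in files.items():
        file_type = os.path.splitext(name)[1].lower()
        groups[(len(content), file_type)].append(name)

    unique, duplicates = {}, []
    for names in groups.values():
        seen = set()
        for name in names:
            fingerprint = file_fingerprint(files[name])
            if fingerprint in seen or fingerprint in known_fingerprints:
                duplicates.append(name)
            else:
                seen.add(fingerprint)
                unique[name] = fingerprint
    return unique, duplicates
```

Note that under this reading two files with identical content but different types are not compared with each other at the file level; they would only be caught later by block-level comparison.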
6. The economizing type data de-duplication method in a cloud storage system according to claim 1, characterized in that the block-level deduplication in step 1 is as follows: the non-duplicate files smaller than 64MB that have been filtered out are split into blocks with a fixed-length chunking algorithm, the block length being set to 64MB; the fingerprint value of each data block is computed with the MD5 algorithm; and data blocks of equal length are compared to determine the duplicate data blocks.
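The fixed-length chunking and fingerprint comparison of claim 6 might look like the sketch below. The function names and the `known_fingerprints` set are assumptions for illustration; only the MD5 algorithm and the 64MB block length come from the claim. Equal-length comparison is implicit here, since every block except possibly the last has the same length.

```python
import hashlib

BLOCK_SIZE = 64 * 1024 * 1024  # 64MB, the block length stated in the claim


def split_fixed(data: bytes, block_size: int = BLOCK_SIZE):
    """Cut data into consecutive blocks of block_size bytes (last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def block_level_dedup(data, known_fingerprints, block_size=BLOCK_SIZE):
    """Return (fingerprint, block) pairs for blocks not seen before.

    known_fingerprints is a hypothetical set of MD5 values of blocks that
    were already uploaded; it is not modified by this function.
    """
    seen = set(known_fingerprints)
    new_blocks = []
    for block in split_fixed(data, block_size):
        fingerprint = hashlib.md5(block).hexdigest()
        if fingerprint not in seen:
            seen.add(fingerprint)
            new_blocks.append((fingerprint, block))
    return new_blocks
```

Passing a small `block_size` (for instance 4 bytes) makes the behavior easy to inspect on short inputs without allocating 64MB buffers.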
7. The economizing type data de-duplication method in a cloud storage system according to claim 1, characterized in that, when file fingerprints are compared in step 2, if the fingerprint value is found to exist already, the data block fingerprints are not compared; otherwise the data block fingerprints of the file are also compared.
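The short-circuit described in claim 7, combined with the three index locations named in step 2, can be sketched as follows. The function names and the representation of each index as a Python set are assumptions made for this example only.

```python
def is_known(fingerprint, memory_index, disk_index, write_buffer):
    """Look a fingerprint up in the three places named in step 2."""
    return (
        fingerprint in memory_index
        or fingerprint in disk_index
        or fingerprint in write_buffer
    )


def blocks_to_upload(file_fp, block_fps, file_index, block_indexes):
    """Return the block fingerprints the client still has to upload.

    file_index:    set of known file fingerprints (assumed structure).
    block_indexes: (memory_index, disk_index, write_buffer), each a set of
                   known block fingerprints (assumed structure).
    """
    if file_fp in file_index:
        # The whole file is already stored, so block fingerprints are skipped.
        return []
    memory_index, disk_index, write_buffer = block_indexes
    return [
        fp for fp in block_fps
        if not is_known(fp, memory_index, disk_index, write_buffer)
    ]
```

The returned list corresponds to the fingerprint values that step 2 sends back to the client.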
8. The economizing type data de-duplication method in a cloud storage system according to claim 1, characterized in that each storage end in step 3 keeps the mapping between the fingerprints of the data blocks stored on it and their storage addresses, so that the physical address at which a data block is stored can be determined from its fingerprint.
9. The economizing type data de-duplication method in a cloud storage system according to claim 1, characterized in that modifications made to data blocks by multiple users of the clients in step 4 can introduce new duplicate data blocks, which existing storage systems do not yet take into account; in a backup system, the user modifies the data locally and then backs it up again, and the unmodified parts are filtered out during the backup; cloud storage, by contrast, gives the user the experience that data in the cloud is as if it were local: the user obtains the address of the data to be modified and modifies the data directly, which is the difference between cloud storage and a backup system.
10. The economizing type data de-duplication method in a cloud storage system according to claim 1, characterized in that the delayed deduplication in step 5 comprises operations on both hot duplicate data blocks and non-hot duplicate data blocks, which are distinguished using the following formula:
$$
\left\{
\begin{array}{l}
\bar{f}_{access} > \alpha \\[4pt]
\bar{f}_{access} = \displaystyle\sum_{j \in Z} \frac{A_j(t_{p+1}) - A_j(t_p)}{t_{p+1} - t_p}, \quad j \neq i,\ j \in Z,\ t_{p+1} > t_p
\end{array}
\right. \qquad (1)
$$
In the formula, a data block has been modified on node i, it has been determined that the data block has no duplicate on node i, and a duplicate data block exists on node j; $\bar{f}_{access}$ denotes the average number of accesses to the data block during the period $t_{p+1} - t_p$ on the storage nodes other than node i; α is a threshold, the minimum number of accesses per unit time for a block to count as a hot data block; $A_j(t_p)$ and $A_j(t_{p+1})$ denote the access counts of the data block on node j at times $t_p$ and $t_{p+1}$ respectively; Z is the set of numbers of the nodes on which data block B resides;
Hot duplicate data blocks are then deduplicated with a delay, to reduce the access response time of the system; for non-hot duplicate data blocks, the copy on the node with relatively less remaining capacity, among the storage nodes holding the block, is selected for deletion, to achieve load balancing.
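Formula (1) and the node selection rule for non-hot blocks can be illustrated with the sketch below. The function names, the dictionary inputs, and the capacity units are assumptions introduced here; the patent only specifies the formula, the threshold α, and the preference for the node with less remaining capacity.

```python
def average_access_rate(counts_start, counts_end, t_start, t_end, modified_node):
    """Evaluate the second line of formula (1).

    counts_start / counts_end: dicts mapping node id j in Z to A_j(t_p) and
    A_j(t_{p+1}) (assumed input shape). modified_node is node i, which is
    excluded from the sum.
    """
    if t_end <= t_start:
        raise ValueError("t_end must be later than t_start")
    interval = t_end - t_start
    return sum(
        (counts_end[j] - counts_start[j]) / interval
        for j in counts_start
        if j != modified_node
    )


def is_hot(rate, alpha):
    """First line of formula (1): the block is hot when the rate exceeds alpha."""
    return rate > alpha


def node_to_delete_from(remaining_capacity):
    """For a non-hot duplicate, pick the holder with the least remaining capacity.

    remaining_capacity: dict mapping node id -> free space (units are arbitrary).
    """
    return min(remaining_capacity, key=remaining_capacity.get)
```

As a worked case, with access counts growing from 10 to 30 on node 2 and from 5 to 15 on node 3 over an interval of 10 time units, the rate is (20 + 10) / 10 = 3.0, so the block counts as hot for any threshold α below 3.0.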

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510339033.3A CN104932841B (en) 2015-06-17 2015-06-17 Economizing type data de-duplication method in a kind of cloud storage system

Publications (2)

Publication Number Publication Date
CN104932841A CN104932841A (en) 2015-09-23
CN104932841B true CN104932841B (en) 2018-05-08

Family

ID=54120022

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510339033.3A Expired - Fee Related CN104932841B (en) 2015-06-17 2015-06-17 Economizing type data de-duplication method in a kind of cloud storage system

Country Status (1)

Country Link
CN (1) CN104932841B (en)

Also Published As

Publication number Publication date
CN104932841A (en) 2015-09-23

Legal Events

Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract
    Application publication date: 2015-09-23
    Assignee: Nanjing Nanyou Information Industry Technology Research Institute Co. Ltd.
    Assignor: Nanjing Post & Telecommunication Univ.
    Contract record no.: 2018320000285
    Denomination of invention: Saving type duplicated data deleting method in cloud storage system
    Granted publication date: 2018-05-08
    License type: Common License
    Record date: 2018-11-01
TR01 Transfer of patent right
    Effective date of registration: 2020-05-15
    Address after: 310000 Room 215, gate 1, building 3, beishangxincheng, Xiacheng District, Hangzhou City, Zhejiang Province
    Patentee after: CIIC Yunfu (Hangzhou) Medical Technology Co., Ltd
    Address before: The city of Nanjing city of Jiangsu Province, 210003 Yuen Road Xianlin University No. 9
    Patentee before: NANJING UNIVERSITY OF POSTS AND TELECOMMUNICATIONS
CF01 Termination of patent right due to non-payment of annual fee
    Granted publication date: 2018-05-08
    Termination date: 2021-06-17