CN104391915A

CN104391915A - Data deduplication method

Info

Publication number: CN104391915A
Application number: CN201410661035.XA
Authority: CN
Inventors: 吕辉; 姜黎; 马翼
Original assignee: Hunan Goke Microelectronics Co Ltd
Current assignee: Hunan Goke Microelectronics Co Ltd
Priority date: 2014-11-19
Filing date: 2014-11-19
Publication date: 2015-03-04
Anticipated expiration: 2034-11-19
Also published as: CN104391915B

Abstract

The invention discloses a data deduplication method. The scale of data comparison is confined to each storage device, which reduces the comparison scale while allowing the storage devices to compare concurrently, improving comparison efficiency and reducing dependence on host resources. Each storage device triggers the corresponding data comparison operation according to business load or host request, so the method can be used in normal business flows and is not limited to special scenarios such as backup.

Description

A data deduplication method

Technical field

The present invention relates to a data deduplication method.

Background art

In the prior art, deduplication is performed while data is being issued. Hash fingerprints are first extracted from the data in the file system or database storage system and placed in host memory or in a dedicated system for comparison. Based on the comparison results, duplicate data is deleted and marked in an index, while the fingerprints of non-duplicate data are added to the hash fingerprint library. The processed data is then issued to the data storage device, which removes the duplicate data.

This approach deduplicates data inefficiently. To guarantee that regular business operations can run, it cannot be used in the normal business processing flow unless a dedicated deduplication processor is used to relieve the load on the host CPU. It also places very high demands on host memory. The prior art is therefore applied mainly in non-routine flows such as backup.

Summary of the invention

The technical problem to be solved by the present invention is to provide a data deduplication method that addresses the above deficiencies of the prior art.

To solve the above technical problem, the technical solution adopted by the present invention is a data deduplication method comprising the following steps:

1) Randomly obtain a segment of data from host memory and calculate signatures of the obtained data. Traverse all signatures, calculating the Hamming distance between each pair of signatures in turn. Treat signatures whose Hamming distance is within 3 as high-similarity data, and accumulate a similarity count for each item of high-similarity data. Save the data corresponding to the N signatures with the highest similarity counts into an initial sample database, where 100<N<5000;

2) Obtain another segment of data from host memory and, using the method of step 1), extract the data corresponding to the top N signatures in this segment; this is the candidate data. Compare the candidate data with the data in the initial sample database and delete any candidate data identical to data already in the initial sample database. Compare the similarity counts of the remaining candidate data with the similarity counts of the data in the initial sample database, and delete the remaining candidate data whose similarity count is lower than the similarity counts in the initial sample database; what remains is the data to be stored. Save the data to be stored into the initial sample database in descending order of similarity count;

3) Repeat step 2) until the number of samples in the initial sample database equals disk capacity/(1G ~ 1M), which yields the sample database; send the sample database to the storage device;

4) After the storage device receives the host request, it compares its stored data with the sample data in the sample database. If data in the storage device duplicates sample data in the sample database, the storage device marks the address mapping table that maps host logical addresses to physical addresses of the storage device, changes the address in the mapping table entry to the address of the first duplicate data block, and returns the mapping result to the host.
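The merge described in step 2) can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `merge_candidates` and the representation of database entries as `(signature, similarity_count)` pairs are hypothetical, and the phrase "lower than the similarity counts in the initial sample database" is interpreted here, as an assumption, to mean "lower than the lowest count currently in the database".

```python
def merge_candidates(sample_db, candidates):
    """Merge candidate entries into the sample database (step 2 sketch).

    Both arguments are lists of (signature, similarity_count) pairs.
    Returns a new list sorted by similarity count, highest first.
    """
    # Drop candidates that are already present in the sample database.
    known = {signature for signature, _ in sample_db}
    fresh = [(s, c) for s, c in candidates if s not in known]

    # Assumption: a candidate is kept only if its count is at least the
    # lowest count already stored in the database.
    if sample_db:
        floor = min(count for _, count in sample_db)
        fresh = [(s, c) for s, c in fresh if c >= floor]

    # Store the survivors in descending order of similarity count.
    merged = list(sample_db) + fresh
    merged.sort(key=lambda entry: entry[1], reverse=True)
    return merged
```

Step 3) would then call this function repeatedly on newly sampled segments until the database reaches the target number of samples.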

Compared with the prior art, the present invention has the following beneficial effects: it confines the scale of data comparison to within each storage device, which reduces the comparison scale while allowing the storage devices to compare concurrently, improving comparison efficiency and reducing dependence on host resources. Each storage device triggers the corresponding data comparison operation according to business load or host request, so the method can be used in the normal business flow and is not limited to special scenarios such as backup.

Brief description of the drawings

Fig. 1 is a schematic diagram of the method according to one embodiment of the present invention.

Detailed description of the embodiments

The following describes a specific implementation of the technical solution, taking a storage array as the storage device and a server as the host:

1) While issuing business data, the server first builds the sample database on the server according to steps 1), 2) and 3) of the present invention;

2) The server issues the data in the sample database to each storage array using custom vendor commands of a standard protocol (such as SCSI/SATA/SAS/FC) or other data commands;

3) After each storage array receives the data comparison request from the server, it compares the received data with the data stored in the array and performs the related processing, that is, it proceeds according to step 4) of the present invention.

Once these three steps are implemented, the duplicate data of every storage array can be processed and deleted simultaneously without affecting regular business data processing between the server and the arrays. The scheme applies equally when the storage device is a hard disk and the host is an array or another device that initiates business operations.

As shown in Fig. 1, the method of the present invention is as follows:

First, the host issues data to system memory, uses the SimHash algorithm to calculate sample feature values of the data in memory, and saves them to the database. The sample feature values are calculated as follows:

Sample feature value calculation: randomly obtain a segment of data from memory, then use the SimHash algorithm to calculate signatures for all the data in the cache. Traverse all signatures, calculating the Hamming distance between each pair of signatures in turn (that is, the number of 1 bits after XORing the binary representations of the two signatures). Treat signatures whose Hamming distance is within 3 as high-similarity data (the misjudgment rate at this distance is low; the smaller the Hamming distance, the higher the data similarity), and accumulate a similarity count for each item of high-similarity data. Save the data corresponding to the N signatures with the highest similarity counts into the initial sample database. Taking database capacity and comparison efficiency into account, 100<N<5000, with N increasing as disk capacity increases. For each signature, the initial sample repetition rate count is 0, and an index is built on sample repetition rate and similarity count; this forms the sample database. Before a new sample is stored, its Hamming distance to the signatures already in the database should first be calculated, and it is then stored according to its similarity-count rank.
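The pairwise comparison and top-N selection described above can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, signatures are assumed to be Python integers, and "within 3" is assumed to mean a Hamming distance less than or equal to 3.

```python
def hamming_distance(a, b):
    """Number of 1 bits in the XOR of two integer signatures."""
    return bin(a ^ b).count("1")


def top_n_by_similarity(signatures, n, max_distance=3):
    """Return the n (signature, similarity_count) pairs with the highest counts.

    A signature's similarity count is the number of other signatures whose
    Hamming distance to it is at most max_distance (assumed inclusive).
    """
    counts = [0] * len(signatures)
    # Compare every pair of signatures exactly once.
    for i in range(len(signatures)):
        for j in range(i + 1, len(signatures)):
            if hamming_distance(signatures[i], signatures[j]) <= max_distance:
                counts[i] += 1
                counts[j] += 1
    # Rank by count, highest first; ties keep their original order.
    ranked = sorted(range(len(signatures)), key=lambda k: counts[k], reverse=True)
    return [(signatures[k], counts[k]) for k in ranked[:n]]
```

The pairwise loop is quadratic in the number of signatures, which is one reason the text bounds the sample size N.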

The SimHash algorithm is described as follows:

References for this algorithm:

Moses S. Charikar, "Similarity Estimation Techniques from Rounding Algorithms".

Aristides Gionis, Piotr Indyk, Rajeev Motwani, "Similarity Search in High Dimensions via Hashing".

The input is an N-dimensional vector V, for example the feature vector of a text, in which each feature has a weight. The output is a C-bit binary signature S.

1) Initialize a C-dimensional vector Q to 0 and the C-bit binary signature S to 0.

2) For each feature in vector V, use a conventional hash algorithm to calculate a C-bit hash value H. For 1<=i<=C:

If the i-th bit of H is 1, add the weight of the feature to the i-th element of Q;

Otherwise, subtract the weight of the feature from the i-th element of Q.

3) If the i-th element of Q is greater than 0, the i-th bit of S is 1; otherwise it is 0;

4) Return the signature S.
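The four steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the patent does not name the "conventional hash algorithm", so SHA-256 truncated to C bits is used here purely for illustration, features are assumed to be strings paired with numeric weights, and bit positions are counted from the least significant bit.

```python
import hashlib


def simhash(features, bits=64):
    """Compute a SimHash signature from (feature, weight) pairs.

    bits is the signature width C; it must not exceed 256 because the
    illustrative per-feature hash is a truncated SHA-256 digest.
    """
    if not 0 < bits <= 256:
        raise ValueError("bits must be between 1 and 256")
    # Step 1: the accumulator Q starts at zero.
    q = [0.0] * bits
    mask = (1 << bits) - 1
    for feature, weight in features:
        # Step 2: hash the feature to a C-bit value H.
        digest = hashlib.sha256(feature.encode("utf-8")).digest()
        h = int.from_bytes(digest, "big") & mask
        for i in range(bits):
            if (h >> i) & 1:
                q[i] += weight
            else:
                q[i] -= weight
    # Step 3: bit i of S is 1 only where Q[i] is positive.
    s = 0
    for i in range(bits):
        if q[i] > 0:
            s |= 1 << i
    # Step 4: return the signature.
    return s
```

Inputs that share most of their weighted features tend to produce signatures with a small Hamming distance, which is the property the sample-selection step relies on.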

Then, when the business load on the host is low, the N data samples with the highest repetition rate are extracted from the sample database (number of samples = disk capacity/(1G ~ 1M)) and issued to the storage device by a custom command.
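As a worked example of the sample-count formula, the sketch below assumes that "1G ~ 1M" denotes a divisor chosen between 1 MiB and 1 GiB and that binary units are intended; both are interpretations, not statements from the source. The function name `target_sample_count` is hypothetical.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def target_sample_count(disk_capacity_bytes, granularity_bytes):
    """Number of samples = disk capacity / granularity (assumed 1 MiB to 1 GiB)."""
    if not MIB <= granularity_bytes <= GIB:
        raise ValueError("granularity must be between 1 MiB and 1 GiB")
    return disk_capacity_bytes // granularity_bytes
```

Under these assumptions a 1 TiB disk yields 1024 samples at a 1 GiB granularity and 1048576 samples at a 1 MiB granularity.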

After the storage device receives the host request, it starts the internal data comparison. For duplicate data, the address mapping table entry must be modified: the entry is set to the address value of the first duplicate in the mapping table, thereby releasing the corresponding physical space. The mapping and comparison results are returned to the host, which receives and saves the mapping results from the storage device.
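The remapping of duplicate blocks can be sketched as follows. This is a simplified model, not the device firmware: the mapping table is assumed to be a dictionary from logical to physical addresses that starts out one-to-one, stored blocks are compared against the samples by exact content, and the names `remap_duplicates`, `mapping`, `blocks` and `samples` are hypothetical.

```python
def remap_duplicates(mapping, blocks, samples):
    """Point duplicate blocks at the first matching block and free the rest.

    mapping: dict of logical address -> physical address (modified in place)
    blocks:  dict of physical address -> block content (bytes)
    samples: set of block contents taken from the sample database
    Returns the sorted list of physical addresses that can be released.
    """
    first_seen = {}  # block content -> physical address of its first copy
    freed = set()
    for logical in sorted(mapping):
        physical = mapping[logical]
        content = blocks[physical]
        if content not in samples:
            continue  # only blocks matching a sample are deduplicated
        if content not in first_seen:
            first_seen[content] = physical
        elif physical != first_seen[content]:
            # Redirect this entry to the first duplicate block.
            mapping[logical] = first_seen[content]
            freed.add(physical)
    return sorted(freed)
```

In this model the updated `mapping` corresponds to the mapping result returned to the host, and the returned addresses correspond to the physical space that is released.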

Claims

1. A data deduplication method, characterized in that it comprises the following steps: