CN104391915A - Duplicated data delete method - Google Patents

Info

Publication number: CN104391915A
Authority: CN (China)
Prior art keywords: data, database, similarity, sample database, memory storage
Legal status: Granted
Application number: CN201410661035.XA
Other languages: Chinese (zh)
Other versions: CN104391915B (en)
Inventors: 吕辉, 姜黎, 马翼
Current Assignee: Hunan Goke Microelectronics Co Ltd
Original Assignee: Hunan Goke Microelectronics Co Ltd

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21: Design, administration or maintenance of databases
    • G06F16/215: Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors

Abstract

The invention discloses a data deduplication method. Data comparison is confined to within each storage device, which reduces the comparison scale and lets the storage devices compare data concurrently, improving comparison efficiency and reducing dependence on host resources. Each storage device triggers the data comparison operation according to business load or host request, so the method can be used during normal business processing and is not limited to special scenarios such as backup.

Description

A data deduplication method
Technical field
The present invention relates to a data deduplication method.
Background
In the prior art, deduplication is performed while data is being written. Hash fingerprints of the data are first extracted from the file system or database storage system and placed in host memory or in a dedicated system for comparison. Based on the comparison result, duplicate data is deleted and marked in the index, while the fingerprints of non-duplicate data are added to the hash fingerprint library. The processed data is then delivered to the data storage device, thereby removing duplicate data.
This approach deduplicates data inefficiently. To keep regular business running, it cannot be used in the normal business processing flow unless a dedicated deduplication processor is used to relieve the host CPU, and it also places very high demands on host memory. The prior art is therefore mainly applied in non-routine flows such as backup.
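The prior-art flow described above can be sketched as follows. This is only an illustration under stated assumptions, not the implementation of any particular product: the fixed-size chunking, the choice of SHA-256 as the fingerprint, and the names `inline_dedup`, `fingerprint_index`, `store`, and `layout` are all hypothetical.

```python
import hashlib

def inline_dedup(data: bytes, chunk_size: int = 4096):
    """Host-side dedup: fingerprint every chunk before it reaches storage.

    Hypothetical helper; chunk size and SHA-256 are illustrative choices.
    """
    fingerprint_index = {}  # fingerprint -> position of the stored chunk
    store = []              # unique chunks that would be sent to storage
    layout = []             # for each input chunk, its index into `store`
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        fp = hashlib.sha256(chunk).digest()
        if fp not in fingerprint_index:
            # Non-duplicate data: add its fingerprint to the library.
            fingerprint_index[fp] = len(store)
            store.append(chunk)
        # Duplicate data: keep only a reference to the stored chunk.
        layout.append(fingerprint_index[fp])
    return store, layout

store, layout = inline_dedup(b"A" * 8192 + b"B" * 4096 + b"A" * 4096)
```

In this sketch the fingerprint index lives entirely in host memory and grows with the number of unique chunks, which is the host-side cost the background section criticizes.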
Summary of the invention
The technical problem to be solved by the present invention is to provide a data deduplication method that addresses the above deficiencies of the prior art.
To solve the above technical problem, the technical solution adopted by the present invention is a data deduplication method comprising the following steps:
1) Randomly obtain a segment of data from host memory and compute signatures for the obtained data. Traverse all signatures, computing the Hamming distance between pairs of signatures in turn. Treat signatures whose Hamming distance is within 3 as high-similarity data, and tally the similarity count of each high-similarity item. Save the data corresponding to the N signatures with the highest similarity counts into the initial sample database, where 100 < N < 5000;
2) Obtain another segment of data from host memory and, using the method of step 1), extract the data corresponding to the top N signatures in this segment; this is the candidate data to be entered into the database. Compare the candidate data with the data in the initial sample database and delete the candidate data that is identical to data in the initial sample database. Then compare the similarity counts of the remaining candidate data with the similarity counts of the data in the initial sample database, and delete the remaining candidate data whose similarity count is lower than the similarity counts in the initial sample database, obtaining the data to be stored. Save the data to be stored into the initial sample database in descending order of similarity count;
3) Repeat step 2) until the number of samples in the initial sample database equals disk capacity / (1G ~ 1M); the result is the sample database, which is sent to the storage device;
4) After the storage device receives the host request, compare the data in the storage device with the sample data in the sample database. If data in the storage device duplicates sample data in the sample database, mark the address mapping table that maps host logical addresses to physical addresses of the storage device, change the address in the mapping table entry to the address of the first duplicate data block, and return the mapping result to the host.
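The merge in step 2) can be sketched as below. The source does not say exactly which database count a candidate is compared against, so this sketch assumes a candidate is kept only if its similarity count is at least the lowest count already in the database. The function name `merge_candidates` and the representation of both collections as `signature -> similarity count` dictionaries are likewise assumptions for illustration.

```python
def merge_candidates(sample_db: dict, candidates: dict) -> dict:
    """Merge candidate samples into the sample database (hypothetical helper).

    Both arguments map a signature (int) to its similarity count.
    """
    # Delete candidates that are already present in the sample database.
    remaining = {s: c for s, c in candidates.items() if s not in sample_db}
    # Assumption: compare against the lowest count currently in the database.
    threshold = min(sample_db.values()) if sample_db else 0
    accepted = {s: c for s, c in remaining.items() if c >= threshold}
    merged = {**sample_db, **accepted}
    # Store entries in descending order of similarity count.
    return dict(sorted(merged.items(), key=lambda kv: kv[1], reverse=True))

sample_db = {0b1010: 5, 0b1111: 3}
candidates = {0b1010: 9, 0b0001: 4, 0b0110: 1}
merged = merge_candidates(sample_db, candidates)
```

Here the candidate `0b1010` is dropped because it is already in the database, `0b0110` is dropped because its count is below the assumed threshold, and `0b0001` is inserted between the two existing entries.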
Compared with the prior art, the invention has the following beneficial effects: data comparison is confined to within each storage device, which reduces the comparison scale and allows the storage devices to compare data concurrently, improving comparison efficiency and reducing dependence on host resources. Each storage device triggers the data comparison operation according to business load or host request, so the method can be used in the normal business flow and is not limited to special scenarios such as backup.
Brief description of the drawings
Fig. 1 is a schematic diagram of the method according to one embodiment of the invention.
Detailed description
The following illustrates a specific implementation of the technical solution, taking a storage array as the storage device and a server as the host:
1) While issuing business data, the server first builds the sample database on the server according to steps 1), 2) and 3) of the invention;
2) The server delivers the data in the sample database to each storage array through vendor-specific commands of a standard protocol (such as SCSI/SATA/SAS/FC) or through other data commands;
3) After each storage array receives the server's data comparison request, it compares the received data with the data stored in the array and performs the related processing, that is, it proceeds according to step 4) of the invention.
Once these three steps are carried out, the duplicate data of each storage array can be processed and deleted simultaneously without affecting the regular business data processing between the server and the arrays. The scheme applies equally when the storage device is a hard disk and the host is an array or another device that initiates business.
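The idea that each array compares its own data against the same sample database, independently of the others, can be illustrated with a small simulation. This is only a sketch: threads stand in for physically separate storage arrays, exact byte comparison stands in for whatever comparison the arrays actually perform, and `find_duplicates`, `arrays`, and the array names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def find_duplicates(array_blocks, sample_db):
    """Return indices of blocks in one array whose content is in the sample database."""
    return [i for i, block in enumerate(array_blocks) if block in sample_db]

# The same sample database is delivered to every array.
sample_db = {b"alpha", b"beta"}
arrays = {
    "array0": [b"alpha", b"x", b"beta"],
    "array1": [b"y", b"alpha"],
}

# Each array runs its comparison concurrently; the host only collects results.
with ThreadPoolExecutor() as pool:
    futures = {
        name: pool.submit(find_duplicates, blocks, sample_db)
        for name, blocks in arrays.items()
    }
    results = {name: future.result() for name, future in futures.items()}
```

The per-array work touches only that array's blocks, which is why the comparison scale stays bounded by the size of a single device.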
As shown in Fig. 1, the method of the invention is as follows:
First, the host issues data to system memory, uses the SimHash algorithm to compute sample features of the in-memory data, and saves them to the database. The data sample feature values are computed as follows:
Data sample feature value computation: randomly obtain a segment of data from memory, then use the SimHash algorithm to compute signatures for all data in the cache. Traverse all signatures, computing the Hamming distance between pairs of signatures in turn (that is, the number of 1 bits after XORing the two binary signatures). Treat signatures whose Hamming distance is within 3 as high-similarity data (at this Hamming distance the misjudgment rate is low; the smaller the Hamming distance, the higher the data similarity), and tally the similarity count of each high-similarity item. Save the data corresponding to the N signatures with the highest similarity counts into the initial sample database. Considering database capacity and comparison efficiency, 100 < N < 5000, with N increasing as disk capacity increases. For each signature, the initial sample repetition count is set to 0, and an index is built on the sample repetition count and the similarity count; this forms the sample database. Before a new sample is stored, its Hamming distance to the signatures already in the database should first be computed, and the sample is then stored according to its similarity-count ranking.
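The pairwise comparison and top-N selection described above can be sketched as follows. The sketch assumes "within 3" means a Hamming distance of at most 3 and treats signatures as plain integers; the names `hamming` and `top_n_by_similarity` are hypothetical.

```python
def hamming(a: int, b: int) -> int:
    """Number of 1 bits after XORing two binary signatures."""
    return bin(a ^ b).count("1")

def top_n_by_similarity(signatures, n, max_distance=3):
    """Return the n (signature, similarity count) pairs with the highest counts."""
    counts = [0] * len(signatures)
    # Compare every pair of signatures once.
    for i in range(len(signatures)):
        for j in range(i + 1, len(signatures)):
            if hamming(signatures[i], signatures[j]) <= max_distance:
                counts[i] += 1
                counts[j] += 1
    # Rank by similarity count, highest first; ties keep their input order.
    ranked = sorted(range(len(signatures)), key=lambda k: counts[k], reverse=True)
    return [(signatures[k], counts[k]) for k in ranked[:n]]

sigs = [0b00000000, 0b00000001, 0b00000011, 0b11111111]
top = top_n_by_similarity(sigs, 2)
```

The pairwise loop is quadratic in the number of signatures, which is one reason the source bounds the amount of data sampled per round.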
The SimHash algorithm is described as follows:
References for this algorithm:
Moses S. Charikar, "Similarity Estimation Techniques from Rounding Algorithms".
Aristides Gionis, Piotr Indyk, Rajeev Motwani, "Similarity Search in High Dimensions via Hashing".
The input is an N-dimensional vector V, for example the feature vector of a text, in which each feature has a certain weight. The output is a C-bit binary signature S.
1) Initialize a C-dimensional vector Q to 0 and the C-bit binary signature S to 0.
2) For each feature in vector V, use a conventional hash algorithm to compute a C-bit hash value H. For 1 <= i <= C:
If the i-th bit of H is 1, add the weight of this feature to the i-th element of Q;
otherwise, subtract the weight of this feature from the i-th element of Q.
3) If the i-th element of Q is greater than 0, the i-th bit of S is 1; otherwise it is 0;
4) Return the signature S.
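The four steps above translate fairly directly into code. The following is a minimal sketch, assuming features are strings with numeric weights and using MD5 (truncated to C bits) as the "conventional hash algorithm"; the source does not name a specific hash, so that choice and the function name `simhash` are assumptions.

```python
import hashlib

def simhash(features: dict, bits: int = 64) -> int:
    """Compute a SimHash signature.

    features maps a feature (str) to its weight; bits is C and must be <= 128
    because MD5 yields 128 bits.
    """
    q = [0] * bits  # step 1: C-dimensional vector Q initialised to 0
    mask = (1 << bits) - 1
    for feature, weight in features.items():
        # Step 2: C-bit hash H of the feature (MD5 is an illustrative choice).
        h = int.from_bytes(hashlib.md5(feature.encode("utf-8")).digest(), "big") & mask
        for i in range(bits):
            if (h >> i) & 1:
                q[i] += weight  # bit i of H is 1: add the weight
            else:
                q[i] -= weight  # bit i of H is 0: subtract the weight
    s = 0
    for i in range(bits):
        if q[i] > 0:            # step 3: positive element gives a 1 bit
            s |= 1 << i
    return s                    # step 4: return the signature S

signature = simhash({"storage": 2, "dedup": 3, "host": 1})
```

With a single feature of positive weight, the signature is simply that feature's truncated hash, which is a convenient sanity check for the bit handling.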
Then, when host business load is low, the N data samples with the highest repetition counts are extracted from the sample database (number of samples = disk capacity / (1G ~ 1M)) and delivered to the storage device by a custom command.
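As a worked example of the sample-count formula, the sketch below divides a disk capacity by a granularity chosen between 1M and 1G. The source does not state the units, so reading "1M" and "1G" as 2^20 and 2^30 bytes is an assumption, as are the function name `sample_count` and the 1 TiB example disk.

```python
MIB = 2**20  # assumed meaning of "1M"
GIB = 2**30  # assumed meaning of "1G"

def sample_count(disk_capacity_bytes: int, granularity_bytes: int) -> int:
    """Number of samples = disk capacity / granularity, with 1M <= granularity <= 1G."""
    if not MIB <= granularity_bytes <= GIB:
        raise ValueError("granularity must lie between 1M and 1G")
    return disk_capacity_bytes // granularity_bytes

capacity = 2**40  # a hypothetical 1 TiB disk
coarse = sample_count(capacity, GIB)  # one sample per 1G of capacity
fine = sample_count(capacity, MIB)    # one sample per 1M of capacity
```

Under these assumptions the sample count for a 1 TiB disk ranges from 1024 to 1048576, depending on the granularity chosen.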
Then, after the storage device receives the host request, it starts the internal data comparison. For duplicate data, the address mapping value must be modified: the table entry is set to the address value of the first duplicate in the mapping table, thereby releasing the corresponding physical space. The mapping and comparison results are returned to the host, which receives and saves the storage device's mapping results.
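The remapping described above can be sketched as follows. This is an illustration under assumptions, not the device firmware: blocks are compared by exact content, the mapping table is a plain `logical address -> physical address` dictionary, and `dedup_mapping`, `physical_blocks`, and the sample addresses are hypothetical.

```python
def dedup_mapping(mapping: dict, physical_blocks: dict, sample_db: set):
    """Point duplicate blocks at the first copy and report freed physical addresses.

    mapping: logical address -> physical address
    physical_blocks: physical address -> block content (bytes)
    sample_db: set of sample block contents delivered by the host
    """
    first_seen = {}   # content -> physical address of the first duplicate block
    new_mapping = {}
    freed = set()
    for lba in sorted(mapping):
        pba = mapping[lba]
        content = physical_blocks[pba]
        if content in sample_db and content in first_seen:
            target = first_seen[content]
            if target != pba:
                freed.add(pba)          # physical space that can be released
            new_mapping[lba] = target   # entry now holds the first copy's address
            continue
        if content in sample_db:
            first_seen[content] = pba   # remember the first copy
        new_mapping[lba] = pba
    return new_mapping, sorted(freed)

mapping = {0: 100, 1: 101, 2: 102, 3: 103}
physical_blocks = {100: b"dup", 101: b"uniq", 102: b"dup", 103: b"dup"}
new_mapping, freed = dedup_mapping(mapping, physical_blocks, {b"dup"})
```

In this example logical addresses 2 and 3 end up pointing at physical block 100, and blocks 102 and 103 become reclaimable; `new_mapping` corresponds to the mapping result returned to the host.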

Claims (1)

1. A data deduplication method, characterized in that it comprises the following steps:
1) Randomly obtain a segment of data from host memory and compute signatures for the obtained data. Traverse all signatures, computing the Hamming distance between pairs of signatures in turn. Treat signatures whose Hamming distance is within 3 as high-similarity data, and tally the similarity count of each high-similarity item. Save the data corresponding to the N signatures with the highest similarity counts into the initial sample database, where 100 < N < 5000;
2) Obtain another segment of data from host memory and, using the method of step 1), extract the data corresponding to the top N signatures in this segment; this is the candidate data to be entered into the database. Compare the candidate data with the data in the initial sample database and delete the candidate data that is identical to data in the initial sample database. Then compare the similarity counts of the remaining candidate data with the similarity counts of the data in the initial sample database, and delete the remaining candidate data whose similarity count is lower than the similarity counts in the initial sample database, obtaining the data to be stored. Save the data to be stored into the initial sample database in descending order of similarity count;
3) Repeat step 2) until the number of samples in the initial sample database equals disk capacity / (1G ~ 1M); the result is the sample database, which is sent to the storage device;
4) After the storage device receives the host request, compare the data in the storage device with the sample data in the sample database. If data in the storage device duplicates sample data in the sample database, mark the address mapping table that maps host logical addresses to physical addresses of the storage device, change the address in the mapping table entry to the address of the first duplicate data block, and return the mapping result to the host.
CN201410661035.XA 2014-11-19 2014-11-19 Data deduplication method Active CN104391915B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410661035.XA CN104391915B (en) 2014-11-19 2014-11-19 Data deduplication method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410661035.XA CN104391915B (en) 2014-11-19 2014-11-19 Data deduplication method

Publications (2)

Publication Number Publication Date
CN104391915A (en) 2015-03-04
CN104391915B (en) 2016-02-24

Family

ID=52609819

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410661035.XA Active CN104391915B (en) 2014-11-19 2014-11-19 Data deduplication method

Country Status (1)

Country Link
CN (1) CN104391915B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108614827A (en) * 2016-12-12 2018-10-02 阿里巴巴集团控股有限公司 Data segmentation method, judging method and electronic equipment
CN114445207A (en) * 2022-04-11 2022-05-06 广东企数标普科技有限公司 Tax administration system based on digital RMB

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7523098B2 (en) * 2004-09-15 2009-04-21 International Business Machines Corporation Systems and methods for efficient data searching, storage and reduction
CN103617177A (en) * 2013-11-05 2014-03-05 浪潮(北京)电子信息产业有限公司 Stackable repeating data deletion file system
CN104040516A (en) * 2011-11-17 2014-09-10 英特尔公司 Method, apparatus and system for data deduplication
CN104049911A (en) * 2013-03-14 2014-09-17 Lsi公司 Storage Device Assisted Data De-duplication

Also Published As

Publication number Publication date
CN104391915B (en) 2016-02-24

Similar Documents

Publication Publication Date Title
US8898120B1 (en) Systems and methods for distributed data deduplication
US9851917B2 (en) Method for de-duplicating data and apparatus therefor
AU2011256912B2 (en) Systems and methods for providing increased scalability in deduplication storage systems
ES2700431T3 (en) Method and data processing device in a cluster system
US9244623B1 (en) Parallel de-duplication of data chunks of a shared data object using a log-structured file system
US9569357B1 (en) Managing compressed data in a storage system
Meister et al. Block locality caching for data deduplication
Xu et al. A lightweight virtual machine image deduplication backup approach in cloud environment
CN108027713A (en) Data de-duplication for solid state drive controller
CN106611035A (en) Retrieval algorithm for deleting repetitive data in cloud storage
CN111522502B (en) Data deduplication method and device, electronic equipment and computer-readable storage medium
CN110569245A (en) Fingerprint index prefetching method based on reinforcement learning in data de-duplication system
CN104462388B (en) A kind of redundant data method for cleaning based on tandem type storage medium
US20170322878A1 (en) Determine unreferenced page in deduplication store for garbage collection
US9361028B2 (en) Systems and methods for increasing restore speeds of backups stored in deduplicated storage systems
CN104407982B (en) A kind of SSD discs rubbish recovering method
CN110352410B (en) Tracking access patterns of index nodes and pre-fetching index nodes
US9747051B2 (en) Cluster-wide memory management using similarity-preserving signatures
CN104391915B (en) Data deduplication method
US8818970B2 (en) Partitioning a directory while accessing the directory
US10936233B2 (en) System and method for optimal order migration into a cache based deduplicated storage array
US10664442B1 (en) Method and system for data consistency verification in a storage system
US20230409222A1 (en) System and method for indexing a data item in a data storage system
US11847334B2 (en) Method or apparatus to integrate physical file verification and garbage collection (GC) by tracking special segments
US11681436B2 (en) Systems and methods for asynchronous input/output scanning and aggregation for solid state drive

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 410125 Hunan, Changsha economic and Technological Development Zone, the east side of the south section of the No. ten road, Tong Tong Street, No.

Applicant after: GOKE MICROELECTRONICS CO., LTD.

Address before: 410125 No. 9, East ten, South Road, Changsha economic and Technological Development Zone, Hunan

Applicant before: Hunan Guoke Microelectronics Co., Ltd.

COR Change of bibliographic data
C14 Grant of patent or utility model
GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20150304

Assignee: Jiangsu Xinsheng Intelligent Technology Co., Ltd.

Assignor: GOKE MICROELECTRONICS CO., LTD.

Contract record no.: 2018430000021

Denomination of invention: Duplicated data delete method

Granted publication date: 20160224

License type: Common License

Record date: 20181203
