CN104933010A - Duplicated data deleting method and apparatus - Google Patents


Publication number
CN104933010A
Authority
CN
China
Prior art keywords
data
data stream
data block
identification information
stream
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410101736.8A
Other languages
Chinese (zh)
Other versions
CN104933010B (en)
Inventor
张亮
陆承涛
刘屹
葛雄资
吴俊�
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changzhou Hengtang Technology Industry Co ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Priority to CN201410101736.8A
Publication of CN104933010A
Application granted
Publication of CN104933010B
Legal status: Active
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention discloses a duplicated data deleting method and apparatus, which are applied to the technical field of data processing. According to the present invention, programs stored in a storage are called by a processor; data to be processed is partitioned into a plurality of data streams, so that a hardware accelerator can simultaneously and respectively calculate identification information of data blocks which the data streams comprise; and then the processor, according to the identification information, carries out duplicated data deleting processing. The duplicated data deleting apparatus disclosed by the present invention can partition the data to be processed into a plurality of data streams so as to carry out processing of deleting duplicated data on the data streams in parallel, and thus, efficiency of deleting the duplicated data can be improved. In addition, according to the present invention, partial functions in the duplicated data deleting process are realized by adopting a hardware structure instead of calling software of the programs, so that efficiency of the duplicated data deleting process can be effectively improved.

Description

Data deduplication method and apparatus
Technical field
The present invention relates to the technical field of data processing, and in particular to a data deduplication method and apparatus.
Background
At present, the total amount of data in networks is growing exponentially. This not only consumes ever more network bandwidth to transmit the data, but also occupies ever more storage space. To reduce the total cost of ownership (TCO) of computer data storage systems and computer networks, enterprises have begun to adopt data deduplication technology.
Data deduplication has become a mainstream and very important technology in the computer field. Its working principle is to identify duplicate data in a data stream, retain only one copy of the duplicate data, delete the other redundant copies, and place a pointer reference at the position of each deleted copy, thereby saving a large amount of storage space or network bandwidth. How to perform data deduplication quickly is an important problem.
Summary of the invention
The present invention provides a data deduplication method and apparatus that improve the efficiency of data deduplication.
A first aspect of the present invention provides a data deduplication apparatus, comprising a processor, a memory and a hardware accelerator connected by a bus, wherein:
the memory is configured to store a first program for data classification and a second program for data deduplication;
the processor is configured to call the first program stored in the memory, divide the data to be processed into N data streams, and transmit the N data streams to the hardware accelerator through N threads respectively, where N is a positive integer greater than 1;
the hardware accelerator is configured to calculate, for each of the N data streams, the identification information of the data blocks included in that data stream;
the processor is further configured to call the second program stored in the memory and perform data deduplication processing according to the identification information of the data blocks calculated by the hardware accelerator.
In a first possible implementation of the first aspect of the present invention:
after calling the first program stored in the memory, the processor divides the data to be processed into N data streams according to at least one of the following information: application port number, file type and application type.
With reference to the first aspect of the present invention or its first possible implementation, in a second possible implementation of the first aspect, the hardware accelerator comprises N raw data buffers, a data fragmentation module and N result data buffers:
the N raw data buffers are configured to respectively buffer the N data streams transmitted by the processor;
the data fragmentation module is configured to fragment each of the N data streams to obtain the data blocks included in that data stream, and to calculate the identification information of each data block;
the N result data buffers are configured to respectively buffer the identification information, calculated by the data fragmentation module, of the data blocks included in the N data streams.
With reference to the second possible implementation of the first aspect, in a third possible implementation of the first aspect, the data fragmentation module specifically comprises a determining submodule, a first data fragmentation submodule and a second data fragmentation submodule, wherein:
the determining submodule is configured to determine, for each data stream, the size of data block required when data deduplication processing is performed on that data stream, and, according to that size, to send each of the N data streams to either the first data fragmentation submodule or the second data fragmentation submodule;
the first data fragmentation submodule is configured to fragment a received data stream into the data blocks it includes, such that the size of each data block is a power of 2 (2^n), and to calculate the identification information of each data block of the data stream;
the second data fragmentation submodule is configured to fragment a received data stream into the data blocks it includes, such that the size of each data block is not a power of 2, and to calculate the identification information of each data block of the data stream.
With reference to the third possible implementation of the first aspect, in a fourth possible implementation of the first aspect, the first data fragmentation submodule and the second data fragmentation submodule each comprise:
a sliding window calculation module, configured to calculate, when a data stream is fragmented, the fingerprint of the data within a fixed-size sliding window that slides along the data stream;
a fragmentation point calculation module, configured to take the position of the sliding window as a fragmentation point of the data blocks included in the data stream when the fingerprint calculated by the sliding window calculation module satisfies a preset condition;
a fingerprint calculation module, configured to calculate the data fingerprint of each data block of the data stream as the identification information of that data block, and to send the data fingerprints to the result data buffer corresponding to the data stream.
A second aspect of the present invention provides a data deduplication method, applied in the data deduplication apparatus according to any one of claims 1 to 6, the method comprising:
the processor calls the first program stored in the memory, divides the data to be processed into N data streams, and transmits the N data streams to the hardware accelerator through N threads respectively, where N is a positive integer greater than 1;
the hardware accelerator calculates, for each of the N data streams, the identification information of the data blocks included in that data stream;
the processor calls the second program stored in the memory and performs data deduplication processing according to the identification information of the data blocks calculated by the hardware accelerator.
In a first possible implementation of the second aspect of the present invention, dividing the data to be processed into N data streams specifically comprises:
dividing the data to be processed into N data streams according to at least one of the following information: application port number, file type and application type.
With reference to the second aspect of the present invention or its first possible implementation, in a second possible implementation of the second aspect, the hardware accelerator calculating the identification information of the data blocks included in the N data streams specifically comprises:
the hardware accelerator fragments each of the N data streams to obtain the data blocks included in that data stream, and calculates the identification information of each data block.
With reference to the second possible implementation of the second aspect, in a third possible implementation of the second aspect, before the hardware accelerator fragments the N data streams, the method further comprises:
determining, for each data stream, the size of data block required when data deduplication processing is performed on that data stream, and sending data streams whose data block size is a power of 2 and data streams whose data block size is not a power of 2 to different data fragmentation submodules.
With reference to the third possible implementation of the second aspect, in a fourth possible implementation of the second aspect, the hardware accelerator fragmenting each of the N data streams to obtain the data blocks included in that data stream specifically comprises:
each of the different data fragmentation submodules calculates, for the data stream it receives, the fingerprint of the data within a fixed-size sliding window that slides along the data stream;
when the calculated fingerprint satisfies a preset condition, the position of the sliding window is taken as a fragmentation point of the data blocks included in the data stream.
It can be seen that, in the data deduplication apparatus of the present invention, the processor calls a program stored in the memory to divide the data to be processed into multiple data streams, so that the hardware accelerator can calculate the identification information of the data blocks included in these data streams simultaneously, after which the processor performs data deduplication processing according to the identification information. Because the apparatus divides the data to be processed into multiple data streams and deduplicates them in parallel, the efficiency of data deduplication is improved. In addition, the function of calculating the identification information of the data blocks included in a data stream is implemented in hardware rather than by calling a program, which effectively improves the efficiency of the data deduplication flow.
The terms "first", "second", "third", "fourth" and so on (if present) in the description, the claims and the accompanying drawings are used to distinguish similar objects and do not necessarily describe a specific order or sequence. It should be understood that data so termed are interchangeable where appropriate, so that the embodiments of the invention described herein can, for example, be implemented in orders other than those illustrated or described. In addition, the terms "comprise" and "have" and any variants of them are intended to cover non-exclusive inclusion: for example, a process, method, system, product or device that comprises a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product or device.
An embodiment of the present invention provides a data deduplication apparatus which, as shown in Figure 1, comprises a processor 11, a memory 10 and a hardware accelerator 12, each connected by a bus, wherein:
The memory 10 can store software programs and modules; the processor 11 executes various functional applications and data processing by calling the software programs and modules stored in the memory 10. The memory 10 may mainly comprise a program storage area and a data storage area: the program storage area can store an operating system and the programs required by at least one function (such as a sound playback function or an image playback function), and the data storage area can store data. In addition, the memory 10 may include high-speed random access memory and may also include non-volatile memory, for example at least one magnetic disk storage device, a flash memory device, or other volatile solid-state storage devices. Correspondingly, the memory 10 may also include a memory controller to provide the processor 11 and peripheral interfaces with access to the memory 10.
The processor 11 is the control center. It connects the various parts of the entire data deduplication apparatus through various interfaces and lines, and performs the various functions of the apparatus and processes data by running or executing the software programs and/or modules stored in the memory 10 and calling the programs stored in the memory 10. Optionally, the processor 11 may include one or more processing cores. Preferably, the processor 11 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, user interface, application programs and the like, and the modem processor mainly handles wireless communication. It can be understood that the modem processor may also not be integrated into the processor 11.
In this embodiment, the memory 10 specifically stores the first program for data classification and the second program for data deduplication. The first program classifies data into data streams according to a given policy. The second program identifies duplicate data, retains only one copy of the duplicate data, deletes the other redundant copies, and places at the position of each deleted copy a pointer that references the retained copy.
The processor 11 is specifically configured to call the first program stored in the memory 10, divide the data to be processed into N data streams, and transmit the N data streams to the hardware accelerator 12 through N threads respectively, where N is a positive integer greater than 1. Specifically, after calling the first program stored in the memory 10, the processor 11 divides the data to be processed into N data streams according to at least one of the following information: application port number, file type, application type and the like. The processor 11 can assign each resulting data stream to a corresponding thread; in particular, it can assign a data stream to a thread with a lighter load.
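The classification and thread assignment described above can be sketched in software as follows. This is a minimal illustration only, not the first program of the patent: the record field names (`port`, `file_type`, `app_type`, `payload`), the use of stream length as the load measure, and the greedy least-loaded policy are all assumptions made for the example.

```python
from collections import defaultdict
import heapq

def partition_into_streams(records, key_fields=("port", "file_type", "app_type")):
    """Group records into data streams keyed by the chosen classification fields.

    Each record is a dict with hypothetical keys; 'payload' holds the raw bytes.
    """
    streams = defaultdict(list)
    for rec in records:
        key = tuple(rec.get(field) for field in key_fields)
        streams[key].append(rec["payload"])
    return {key: b"".join(parts) for key, parts in streams.items()}

def assign_to_threads(streams, num_threads):
    """Assign each stream to the thread that currently has the least data queued."""
    heap = [(0, tid) for tid in range(num_threads)]  # (bytes assigned, thread id)
    heapq.heapify(heap)
    assignment = {}
    # Placing the largest streams first tends to balance the load better.
    for key, data in sorted(streams.items(), key=lambda kv: -len(kv[1])):
        load, tid = heapq.heappop(heap)
        assignment[key] = tid
        heapq.heappush(heap, (load + len(data), tid))
    return assignment
```

With two threads and two streams of different sizes, each stream ends up on its own thread; with more streams than threads, later streams are added to whichever thread has received the fewest bytes so far.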
In addition, the processor 11 may actively transmit the resulting data streams to the hardware accelerator 12 through the threads, or the hardware accelerator 12 may actively read the data streams from the threads. Data can be transferred between each thread and the hardware accelerator 12 using a burst data transmission mode such as a direct memory access (DMA) protocol.
The hardware accelerator 12 is configured to calculate, for each of the N data streams, the identification information of the data blocks included in that data stream. Here the identification information is information that uniquely identifies a data block, such as a data fingerprint obtained by a hash algorithm, wherein:
The hardware accelerator 12 can calculate the identification information of the data blocks of all N data streams simultaneously, using the same or different methods for different streams. The calculation comprises fragmenting each data stream into data blocks and calculating the identification information of the resulting data blocks. The fragmentation method may be a sliding-block method such as content-defined chunking (CDC), a fixed-length method, or a variable-length method; the identification information may be calculated as the data fingerprint of each data block using the Rabin fingerprint algorithm, the Murmur hash algorithm or the like. After the hardware accelerator 12 has finished calculating the identification information of the data blocks of a given data stream, it can notify the processor 11 by interrupt to read the result, or the processor 11 can actively query the hardware accelerator 12 for the completed identification information. The present invention may comprise one or more hardware accelerators 12.
The hardware accelerator 12 can be implemented in hardware such as a field-programmable gate array (FPGA) or an application-specific integrated circuit (ASIC).
The processor 11 is further configured to call the second program stored in the memory 10 and perform data deduplication processing according to the identification information of the data blocks calculated by the hardware accelerator 12. The data deduplication processing here specifically comprises the following. If the identification information is already stored in the memory 10, the data corresponding to that identification information is already stored in the memory 10, so the processor 11 can replace the actual data at the position of the duplicate data with a pointer that points to the position of the stored data. If the memory 10 does not store the identification information, or the data that the identification information identifies, the data or the identification information is written to the memory 10.
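The lookup-then-store logic of the deduplication processing can be illustrated with a small in-memory index. This is a sketch under stated assumptions: the class name `DedupStore` is hypothetical, a list position stands in for the pointer, and SHA-256 from the Python standard library stands in for the fingerprint that the hardware accelerator would supply.

```python
import hashlib

class DedupStore:
    """Toy block store: identification information -> position of the stored block."""

    def __init__(self):
        self.index = {}   # fingerprint -> position in self.blocks
        self.blocks = []  # unique data blocks, stored once each

    def write(self, block, fingerprint=None):
        """Store a block if it is new and return a pointer (position) to it."""
        # Assumption: SHA-256 is used when no accelerator fingerprint is passed in.
        fp = fingerprint if fingerprint is not None else hashlib.sha256(block).digest()
        position = self.index.get(fp)
        if position is None:
            # Identification information not yet stored: write data and index entry.
            position = len(self.blocks)
            self.blocks.append(block)
            self.index[fp] = position
        # For a duplicate, only the pointer to the existing copy is returned.
        return position

    def read(self, pointers):
        """Rebuild the original data from a sequence of pointers."""
        return b"".join(self.blocks[p] for p in pointers)
```

Writing the blocks `abc`, `xyz`, `abc` in turn stores two unique blocks, and the third write returns the same pointer as the first.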
It should be noted that, after the processor 11 has divided the data to be processed into N data streams and assigned them to N threads, the N threads store the data stream of each thread in the memory 10 through the bus, and the processor 11 can control any thread to transmit its data stream to the hardware accelerator 12.
It can be seen that, in the data deduplication apparatus of this embodiment, the processor 11 calls the first program stored in the memory 10 to divide the data to be processed into multiple data streams, so that the hardware accelerator 12 can calculate the identification information of the data blocks included in each data stream simultaneously; the processor 11 then calls the second program stored in the memory 10 to perform data deduplication processing according to the identification information. Because the apparatus divides the data to be processed into multiple data streams and deduplicates them in parallel, the efficiency of data deduplication is improved. In addition, part of the data deduplication processing, namely the calculation of the identification information of the data blocks included in a data stream, is implemented in hardware rather than in software by calling a program, which effectively improves the efficiency of the data deduplication flow.
Referring to Figure 2, in a specific embodiment the hardware accelerator 12 can be implemented with the following structure, comprising N raw data buffers 120, a data fragmentation module 121 and N result data buffers 122, wherein:
The N raw data buffers 120 respectively buffer the N data streams transmitted by the processor 11; the N data streams are received from the processor 11 through a bus interface module.
The data fragmentation module 121 fragments each of the N data streams buffered in the N raw data buffers 120 to obtain the data blocks included in that data stream, and calculates the identification information of each data block. The data fragmentation module 121 may comprise two data fragmentation submodules: a first data fragmentation submodule, which fragments data streams into blocks whose size is a power of 2, and a second data fragmentation submodule, which fragments data streams into blocks whose size is not a power of 2. The function and structure of the two submodules are similar, in that both fragment a data stream and then calculate identification information; they differ in the fragment lengths they support and in that they operate on different data streams.
The N result data buffers 122 respectively buffer the identification information, calculated by the data fragmentation module 121, of the data blocks included in the N data streams. A result data buffer 122 may also hold other data, such as each data block itself, the fragmentation points of the data blocks included in the data stream, the number of fragments, and the identifier of the thread corresponding to the data stream.
It can be understood that, to improve interoperability between the components of the data deduplication apparatus, the hardware accelerator 12 in this embodiment can use a standard high-performance on-chip bus, such as the PCI Express bus, with a bus protocol bridge performing protocol conversion to the bus of the data deduplication apparatus. The raw data buffers 120 and result data buffers 122 can be implemented with any storage medium.
Besides the N raw data buffers 120, the data fragmentation module 121 and the N result data buffers 122, the hardware accelerator 12 may include other necessary components, such as a bus interface module and a controller. The bus interface module interconnects the components of the hardware accelerator 12 with the bus of the data deduplication apparatus. Registers such as a data register, a control register and a status register can be provided in the bus interface module so that the processor 11 can control the hardware accelerator 12 and set its operating parameters, for example starting and stopping the hardware accelerator 12 and setting the average, minimum and maximum fragment sizes. The controller controls the components of the hardware accelerator 12 and coordinates the interaction between them, thereby realizing the function of the hardware accelerator.
It should also be noted that, to improve the efficiency of data deduplication, multiple hardware accelerators 12 can be provided in the data deduplication apparatus. The structure of each hardware accelerator 12 and its connection to the other modules are similar and are not repeated here.
It can be seen that, in this embodiment, each result data buffer 122 in the hardware accelerator 12 accumulates the result data of processing one data stream, that is, of fragmenting the stream and calculating the identification information of each of its data blocks. The results for the same data stream are thus stored close together, namely in one dedicated result data buffer 122, so that they can conveniently be sent to the processor 11 together, which improves processing efficiency.
Referring to Figure 3, in another specific embodiment the data fragmentation module 121 in the hardware accelerator 12 can be implemented by one or more data fragmentation submodules, where each submodule can use different parameters and processes different data streams. Specifically, the data fragmentation module 121 may comprise a determining submodule 131, a first data fragmentation submodule 132 and a second data fragmentation submodule 133, wherein:
The determining submodule 131 determines, for each data stream, the size of data block required when data deduplication processing is performed on that data stream, and, according to that size, sends each of the N data streams to either the first data fragmentation submodule 132 or the second data fragmentation submodule 133.
The first data fragmentation submodule 132 fragments a received data stream into the data blocks it includes, such that the size of each data block is a power of 2 (2^n), and calculates the identification information of each data block of the data stream. The size of the data stream received by the first data fragmentation submodule 132 may be a power of 2.
The second data fragmentation submodule 133 fragments a received data stream into the data blocks it includes, such that the size of each data block is not a power of 2, and calculates the identification information of each data block of the data stream. The size of the data stream received by the second data fragmentation submodule 133 may be other than a power of 2.
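The routing decision made by the determining submodule amounts to a power-of-two test on the block size required for each stream. A minimal sketch follows; the function names and the dictionary of per-stream block sizes are hypothetical, and the bit trick `n & (n - 1)` is a common software idiom rather than something the source specifies.

```python
def is_power_of_two(n: int) -> bool:
    """True when n is 2**k for some k >= 0 (exactly one bit set)."""
    return n > 0 and (n & (n - 1)) == 0

def route_streams(block_sizes):
    """Split stream ids between the two fragmentation submodules.

    block_sizes maps a stream id to the data block size required for that stream.
    Returns (ids for the first submodule, ids for the second submodule).
    """
    first, second = [], []
    for stream_id, size in block_sizes.items():
        (first if is_power_of_two(size) else second).append(stream_id)
    return first, second
```

For example, streams that require 4096-byte or 8192-byte blocks would go to the first submodule, while a stream that requires 6000-byte blocks would go to the second.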
If each data fragmentation submodule fragments data streams using the CDC method, any of the data fragmentation submodules can be implemented with a sliding window calculation module, a fragmentation point calculation module and a fingerprint calculation module; the hardware structure of each of these modules is determined by the specific calculation method it uses. Specifically, when a data stream is fragmented, the sliding window calculation module starts from the initial position of the data stream and calculates the fingerprint of the data within a fixed-size sliding window that slides along the data stream. When the fingerprint calculated by the sliding window calculation module satisfies a certain preset condition, the fragmentation point calculation module takes the position of the sliding window as a fragmentation point of the data blocks included in the data stream; that is, the end of the sliding window becomes the end of the current data block and the start of the next data block. The fingerprint calculation module then calculates the data fingerprint of each data block of the data stream as the identification information of that data block, and sends the data fingerprints to the result data buffer 122 corresponding to the data stream for buffering.
If the fingerprint calculated by the sliding window calculation module does not satisfy the preset condition, the fragmentation point calculation module notifies the sliding window calculation module to slide the window forward along the data stream by one byte; the fingerprint of the data in the window after sliding is then calculated, and the fragmentation point calculation module again decides whether it is a fragmentation point. The sliding window calculation module and the fragmentation point calculation module repeat this cycle until the fragmentation points of all data blocks of a data stream have been found. The preset condition on the fingerprint can include, but is not limited to, the following: the lowest 13 bits of the fingerprint of the data in the sliding window are combined with a data block mask (chunk mask) by a bitwise AND, and the condition is satisfied if the result equals a magic value. For example, if the size of the sliding window is 48 bytes, the magic value can be set to 0x12 and the data block mask to 0x1fff.
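The sliding (moving) window loop with the example parameters above (48-byte window, mask 0x1fff, magic value 0x12) can be sketched as follows. Two simplifications are assumptions of this example and not part of the source: a plain polynomial rolling hash modulo 2^32 stands in for the Rabin fingerprint, and the window keeps sliding across block boundaries instead of restarting after each cut.

```python
WINDOW = 48      # sliding window size in bytes (example value from the text)
MASK = 0x1FFF    # data block mask: keeps the lowest 13 bits of the fingerprint
MAGIC = 0x12     # magic value that marks a fragmentation point
BASE = 257       # assumption: base of the stand-in rolling hash
MOD = 1 << 32    # assumption: rolling hash is kept to 32 bits

def cdc_cut_points(data: bytes, window=WINDOW, mask=MASK, magic=MAGIC):
    """Return the end offsets (exclusive) of the data blocks found in data."""
    if len(data) < window:
        return [len(data)] if data else []
    top = pow(BASE, window - 1, MOD)  # weight of the byte that leaves the window
    h = 0
    for byte in data[:window]:
        h = (h * BASE + byte) % MOD
    cuts = []
    pos = window  # the window currently covers data[pos - window:pos]
    while True:
        if (h & mask) == magic:
            cuts.append(pos)  # end of the window is the end of the current block
        if pos == len(data):
            break
        # Slide one byte: drop the oldest byte, shift, and append the next byte.
        h = ((h - data[pos - window] * top) * BASE + data[pos]) % MOD
        pos += 1
    if not cuts or cuts[-1] != len(data):
        cuts.append(len(data))  # the tail of the stream forms the last block
    return cuts

def split_blocks(data: bytes):
    """Split data into blocks at the cut points computed above."""
    blocks, start = [], 0
    for end in cdc_cut_points(data):
        blocks.append(data[start:end])
        start = end
    return blocks
```

Because each decision depends only on the bytes inside the window, inserting bytes at the front of a stream shifts the later cut points by the same amount instead of changing them, which is what lets content-defined chunking rediscover duplicate blocks after an insertion. With a 13-bit mask and a well-mixed fingerprint, a cut is expected on average about once every 8192 window positions.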
The sliding window calculation module generally uses the Rabin fingerprint algorithm. This algorithm treats the string to be fingerprinted as a polynomial over a Galois field, or as a large number (see Formula 1 below), and then takes its remainder modulo an irreducible polynomial; the remainder is the calculated fingerprint.
$f(x) = m_0 + m_1 x + \dots + m_{n-1} x^{n-1}$ (Formula 1), where the terms of the polynomial represent the information at each position of the string to be fingerprinted, the coefficients of x can be read from a constant table by the sliding window calculation module, and the hardware logic can be implemented with a state machine.
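To make the "remainder modulo an irreducible polynomial" step concrete, the sketch below computes a Rabin-style fingerprint over GF(2) using Python integers as bit vectors. The bit ordering (first byte holds the highest-order coefficients) and the small example polynomial x^8 + x^4 + x^3 + x + 1 (0x11B, known to be irreducible over GF(2)) are assumptions for illustration; the source does not specify the polynomial, and a practical fingerprint would use a much higher degree.

```python
def gf2_mod(value: int, poly: int) -> int:
    """Remainder of value divided by poly, both read as polynomials over GF(2).

    Bit i of each integer is the coefficient of x**i; subtraction is XOR.
    """
    degree = poly.bit_length() - 1
    while value.bit_length() - 1 >= degree:
        shift = value.bit_length() - 1 - degree
        value ^= poly << shift  # cancel the current leading term
    return value

def rabin_fingerprint(data: bytes, poly: int) -> int:
    """Fingerprint of data: the message polynomial reduced modulo poly.

    Processes one byte at a time (Horner's rule), so intermediate values stay small.
    """
    fp = 0
    for byte in data:
        fp = gf2_mod((fp << 8) | byte, poly)
    return fp
```

Processing byte by byte gives the same result as reducing the whole message at once, which is why a hardware implementation can keep only a small register plus precomputed constant tables.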
In addition, the fingerprint calculation module generally uses the Murmur hash algorithm to calculate the fingerprint of each data block. The hardware logic of the Murmur hash algorithm can consist of three state machines, one main state machine and two sub-state machines: the main state machine implements the main flow of the hash algorithm, one sub-state machine implements the loop in the hash algorithm, and the other sub-state machine implements the branch handling in the hash algorithm. What each state machine actually executes depends on the specific hash algorithm.
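The source does not say which Murmur variant is used. As an illustration only, the sketch below implements the publicly documented 32-bit MurmurHash3 (x86_32) in software, with comments marking the parts that loosely correspond to a main flow, a loop and a branch; that mapping to the three state machines is an assumption of this example.

```python
def murmur3_32(data: bytes, seed: int = 0) -> int:
    """MurmurHash3 x86_32 of data; returns an unsigned 32-bit integer."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    mask = 0xFFFFFFFF

    def rotl(x, r):
        return ((x << r) | (x >> (32 - r))) & mask

    def mix_k(k):
        k = (k * c1) & mask
        k = rotl(k, 15)
        return (k * c2) & mask

    h = seed & mask
    n = len(data)
    body_len = n - n % 4

    # Loop part: consume the input four bytes (little-endian) at a time.
    for i in range(0, body_len, 4):
        h ^= mix_k(int.from_bytes(data[i:i + 4], "little"))
        h = rotl(h, 13)
        h = (h * 5 + 0xE6546B64) & mask

    # Branch part: handle the 1 to 3 leftover bytes, if there are any.
    tail = data[body_len:]
    if tail:
        h ^= mix_k(int.from_bytes(tail, "little"))

    # Main flow, final step: mix in the length and avalanche the bits.
    h ^= n
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & mask
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & mask
    h ^= h >> 16
    return h
```

A 32-bit value is short for identifying blocks in a large store, so a real deployment would need a wider fingerprint or a byte-by-byte comparison on a match; the source does not discuss this point.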
An embodiment of the present invention also provides a data deduplication method, performed mainly by the data deduplication apparatus described above. The structure of the apparatus is as described in the preceding embodiment, comprising a memory 10, a processor 11 and a hardware accelerator 12, and is not repeated here. The flow chart of the method of this embodiment is shown in Figure 4 and comprises:
Step 101: the processor 11 calls the first program stored in the memory 10, divides the data to be processed into N data streams, and transmits the N data streams to the hardware accelerator through N threads respectively, where N is a positive integer greater than 1. Specifically, the processor 11 can divide the data to be processed into N data streams according to any one or more of the following policies: application port number, file type, application type and the like.
Step 102: the hardware accelerator 12 calculates the identification information of the data blocks included in each of the N data streams. Specifically, the hardware accelerator 12 may fragment each of the N data streams to obtain the data blocks included in that data stream, and calculate the identification information of each data block. Before fragmenting the N data streams, the hardware accelerator 12 may also use a determining submodule to determine the data block size required for each data stream when data deduplication is performed. Data streams whose data block size is a power of 2 and data streams whose data block size is not a power of 2 are then sent to different data fragmentation submodules.
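The routing decision made by the determining submodule amounts to a power-of-two test on each stream's block size. A minimal sketch, assuming the block size for each stream is already known and using the hypothetical names `is_power_of_two` and `route_streams`:

```python
def is_power_of_two(n: int) -> bool:
    """True when n is 2**k for some non-negative integer k."""
    return n > 0 and (n & (n - 1)) == 0


def route_streams(block_sizes: dict) -> tuple:
    """Split stream ids into (power-of-two streams, other streams)."""
    pow2, other = [], []
    for stream_id, size in block_sizes.items():
        (pow2 if is_power_of_two(size) else other).append(stream_id)
    return pow2, other


if __name__ == "__main__":
    # Hypothetical stream ids and block sizes in bytes.
    print(route_streams({"a": 4096, "b": 6000, "c": 8192}))
```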
When a data stream is fragmented, a sliding-block algorithm such as CDC (content-defined chunking) may be used, or a fixed-length or variable-length fragmentation algorithm may be used. For the CDC algorithm in particular, the hardware accelerator 12 uses the different data fragmentation submodules to calculate, for each received data stream, the fingerprint of the data inside a fixed-size window that slides along the data stream. When the calculated fingerprint meets a preset condition, the position of the sliding window is taken as a fragmentation point of a data block in the data stream. When the calculated fingerprint does not meet the preset condition, the sliding window is moved one byte along the data stream, and the fingerprint of the data in the window after the move is used to decide whether that position is a fragmentation point. Repeating this operation yields all the fragmentation points for the whole data stream.
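The sliding-window procedure can be sketched as below. To keep the example short, it swaps in a simple polynomial rolling hash modulo a prime for the Rabin fingerprint, and it uses "the low 6 bits of the fingerprint are zero" as the preset condition; the window size, the constants, and the function names `cdc_cut_points` and `split_chunks` are all assumptions. The sketch also keeps sliding one byte at a time after a cut and enforces no minimum block size, which a practical system would likely add.

```python
WINDOW = 16            # assumed window size in bytes
BASE = 257             # assumed rolling-hash base
MOD = (1 << 31) - 1    # assumed prime modulus
MASK = (1 << 6) - 1    # assumed condition: low 6 bits of the fingerprint are zero


def cdc_cut_points(data: bytes, window: int = WINDOW, mask: int = MASK) -> list:
    """Return the end offsets of the blocks found by content-defined chunking."""
    if len(data) < window:
        return [len(data)] if data else []
    top = pow(BASE, window - 1, MOD)   # weight of the byte that leaves the window
    fp = 0
    for b in data[:window]:
        fp = (fp * BASE + b) % MOD
    cuts = []
    pos = window                        # the window covers data[pos - window:pos]
    while True:
        if fp & mask == 0:              # preset condition met: cut here
            cuts.append(pos)
        if pos == len(data):
            break
        # Slide one byte: drop data[pos - window], take in data[pos].
        fp = ((fp - data[pos - window] * top) * BASE + data[pos]) % MOD
        pos += 1
    if not cuts or cuts[-1] != len(data):
        cuts.append(len(data))          # the last block ends with the stream
    return cuts


def split_chunks(data: bytes, cuts: list) -> list:
    """Cut `data` at the given end offsets."""
    chunks, start = [], 0
    for end in cuts:
        chunks.append(data[start:end])
        start = end
    return chunks
```

Because a cut depends only on the bytes inside the window, inserting a byte at the front of the stream shifts later cut points by one position instead of changing them, which is what lets unchanged content produce the same blocks.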
Step 103: the processor 11 calls the second program stored in the memory 10 and performs data deduplication according to the identification information of the data blocks calculated by the hardware accelerator 12.
The data deduplication here specifically comprises the following. If the identification information is already stored in the memory 10, the data corresponding to that identification information has already been stored in the memory 10, and the processor 11 may replace the duplicate data with a pointer to the location of the stored data. If the memory 10 does not store the identification information, or the data identified by it, the data or the identification information is written to the memory 10.
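The lookup-or-store logic described above can be sketched with a dictionary acting as the fingerprint index. This is an illustrative stand-in for the second program, not its actual implementation: SHA-256 from the standard library is used here in place of the accelerator's fingerprint, and the names `deduplicate` and `restore` are hypothetical.

```python
import hashlib


def deduplicate(blocks, store=None):
    """Store each distinct block once; return the list of references and the store."""
    store = {} if store is None else store
    recipe = []
    for block in blocks:
        # Stand-in fingerprint; the patent computes this in the hardware accelerator.
        fp = hashlib.sha256(block).hexdigest()
        if fp not in store:
            store[fp] = block      # new data: write it
        recipe.append(fp)          # duplicate or not, keep only a reference
    return recipe, store


def restore(recipe, store):
    """Rebuild the original byte sequence from the references."""
    return b"".join(store[fp] for fp in recipe)


if __name__ == "__main__":
    recipe, store = deduplicate([b"aaa", b"bbb", b"aaa"])
    print(len(recipe), "blocks referenced,", len(store), "blocks stored")
```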
As can be seen, in the data deduplication method of this embodiment, the processor 11 in the data deduplication apparatus calls the programs stored in the memory 10 and divides the data to be processed into multiple data streams, so that the hardware accelerator 12 can calculate the identification information of the data blocks included in these data streams simultaneously, after which the processor 11 performs data deduplication according to this identification information. Because the apparatus in this embodiment divides the data to be processed into multiple data streams and deduplicates these data streams in parallel, the efficiency of data deduplication can be improved. In addition, in this embodiment part of the deduplication process, namely calculating the identification information of the data blocks included in the data streams, is implemented in hardware rather than by software that calls a program, which can effectively improve the efficiency of the deduplication flow.
A person of ordinary skill in the art will understand that all or some of the steps of the methods in the above embodiments may be completed by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, and the storage medium may include read-only memory (ROM), random-access memory (RAM), a magnetic disk, an optical disc, and the like.
The data deduplication method and apparatus provided by the embodiments of the present invention have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present invention, and the description of the above embodiments is only intended to help in understanding the method of the present invention and its core idea. A person of ordinary skill in the art may, following the idea of the present invention, make changes to the specific implementations and the scope of application. In summary, the content of this description should not be construed as limiting the present invention.
Brief description of the drawings
To explain the technical solutions of the embodiments of the present invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present invention, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic structural diagram of a data deduplication apparatus provided by an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of the hardware accelerator in the data deduplication apparatus provided by an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of the data fragmentation module included in the hardware accelerator of the data deduplication apparatus provided by an embodiment of the present invention;
Fig. 4 is a flowchart of a data deduplication method provided by an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments of the present invention. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.

Claims (10)

1. A data deduplication apparatus, characterized by comprising a processor, a memory, and a hardware accelerator connected by a bus, wherein:
the memory is configured to store a first program for data classification and a second program for data deduplication;
the processor is configured to call the first program stored in the memory, divide the data to be processed into N data streams, and transmit the N data streams to the hardware accelerator through N threads, respectively, where N is a positive integer greater than 1;
the hardware accelerator is configured to calculate the identification information of the data blocks included in each of the N data streams;
the processor is further configured to call the second program stored in the memory and perform data deduplication according to the identification information of the data blocks calculated by the hardware accelerator.
2. The apparatus according to claim 1, characterized in that,
after calling the first program stored in the memory, the processor divides the data to be processed into N data streams according to at least one of the following: application port number, file type, and application type.
3. The apparatus according to claim 1 or 2, characterized in that the hardware accelerator comprises N raw data buffers, a data fragmentation module, and N result data buffers;
the N raw data buffers are configured to buffer, respectively, the N data streams transmitted by the processor;
the data fragmentation module is configured to fragment each of the N data streams to obtain the data blocks included in each data stream, and to calculate the identification information of each data block;
the N result data buffers are configured to buffer, respectively, the identification information of the data blocks included in the N data streams as calculated by the data fragmentation module.
4. The apparatus according to claim 3, characterized in that the data fragmentation module specifically comprises a determining submodule, a first data fragmentation submodule, and a second data fragmentation submodule, wherein:
the determining submodule is configured to determine the data block size required for each data stream when data deduplication is performed and, according to that size, to send each of the N data streams to the first data fragmentation submodule or the second data fragmentation submodule;
the first data fragmentation submodule is configured to fragment a received data stream to obtain the data blocks included in the data stream, such that the size of each data block is a power of 2, and to calculate the identification information of each data block of the data stream;
the second data fragmentation submodule is configured to fragment a received data stream to obtain the data blocks included in the data stream, such that the size of each data block is not a power of 2, and to calculate the identification information of each data block of the data stream.
5. The apparatus according to claim 4, characterized in that the first data fragmentation submodule and the second data fragmentation submodule each comprise:
a sliding window computing module, configured to calculate, when a data stream is fragmented, the fingerprint of the data in a fixed-size window that slides along the data stream;
a fragmentation point computing module, configured to take the position of the sliding window as a fragmentation point of a data block included in the data stream when the fingerprint calculated by the sliding window computing module meets a preset condition;
a fingerprint computing module, configured to calculate the data fingerprint of each data block of the data stream as the identification information of that data block, and to send the data fingerprint of each data block of the data stream to the result data buffer corresponding to the data stream.
6. A data deduplication method, characterized in that it is applied in the data deduplication apparatus according to any one of claims 1 to 5, the method comprising:
a processor calling a first program stored in a memory, dividing the data to be processed into N data streams, and transmitting the N data streams to a hardware accelerator through N threads, respectively, where N is a positive integer greater than 1;
the hardware accelerator calculating the identification information of the data blocks included in each of the N data streams;
the processor calling a second program stored in the memory and performing data deduplication according to the identification information of the data blocks calculated by the hardware accelerator.
7. The method according to claim 6, characterized in that dividing the data to be processed into N data streams specifically comprises:
dividing the data to be processed into N data streams according to at least one of the following: application port number, file type, and application type.
8. The method according to claim 6 or 7, characterized in that the hardware accelerator calculating the identification information of the data blocks included in each of the N data streams specifically comprises:
the hardware accelerator fragmenting each of the N data streams to obtain the data blocks included in each data stream, and calculating the identification information of each data block.
9. The method according to claim 8, characterized in that, before the hardware accelerator fragments the N data streams, the method further comprises:
determining the data block size required for each data stream when data deduplication is performed, and sending data streams whose data block size is a power of 2 and data streams whose data block size is not a power of 2 to different data fragmentation submodules.
10. The method according to claim 9, characterized in that the hardware accelerator fragmenting each of the N data streams to obtain the data blocks included in each data stream specifically comprises:
the different data fragmentation submodules each calculating, for the data stream received, the fingerprint of the data in a fixed-size window that slides along the data stream;
when the calculated fingerprint meets a preset condition, taking the position of the sliding window as a fragmentation point of a data block included in the data stream.
CN201410101736.8A 2014-03-18 2014-03-18 A kind of data de-duplication method and device Active CN104933010B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410101736.8A CN104933010B (en) 2014-03-18 2014-03-18 A kind of data de-duplication method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410101736.8A CN104933010B (en) 2014-03-18 2014-03-18 A kind of data de-duplication method and device

Publications (2)

Publication Number Publication Date
CN104933010A true CN104933010A (en) 2015-09-23
CN104933010B CN104933010B (en) 2019-02-19

Family

ID=54120179

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410101736.8A Active CN104933010B (en) 2014-03-18 2014-03-18 A kind of data de-duplication method and device

Country Status (1)

Country Link
CN (1) CN104933010B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107122130A (en) * 2017-04-13 2017-09-01 杭州宏杉科技股份有限公司 A kind of data delete method and device again
CN113064869A (en) * 2021-03-23 2021-07-02 网易(杭州)网络有限公司 Log processing method and device, sending end, receiving end equipment and storage medium
WO2022193447A1 (en) * 2021-03-17 2022-09-22 网宿科技股份有限公司 Data packet deduplication and transmission method, electronic device, and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102722583A (en) * 2012-06-07 2012-10-10 无锡众志和达存储技术有限公司 Hardware accelerating device for data de-duplication and method
CN103309975A (en) * 2013-06-09 2013-09-18 华为技术有限公司 Duplicated data deleting method and apparatus
CN103514250A (en) * 2013-06-20 2014-01-15 易乐天 Method and system for deleting global repeating data and storage device

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102722583A (en) * 2012-06-07 2012-10-10 无锡众志和达存储技术有限公司 Hardware accelerating device for data de-duplication and method
CN103309975A (en) * 2013-06-09 2013-09-18 华为技术有限公司 Duplicated data deleting method and apparatus
CN103514250A (en) * 2013-06-20 2014-01-15 易乐天 Method and system for deleting global repeating data and storage device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Ding Debing: "Optimization of Storage Space in Virtual Machine Backup Systems", Wanfang Dissertation Database *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107122130A (en) * 2017-04-13 2017-09-01 杭州宏杉科技股份有限公司 A kind of data delete method and device again
CN107122130B (en) * 2017-04-13 2020-04-21 杭州宏杉科技股份有限公司 Data deduplication method and device
WO2022193447A1 (en) * 2021-03-17 2022-09-22 网宿科技股份有限公司 Data packet deduplication and transmission method, electronic device, and storage medium
CN113064869A (en) * 2021-03-23 2021-07-02 网易(杭州)网络有限公司 Log processing method and device, sending end, receiving end equipment and storage medium

Also Published As

Publication number Publication date
CN104933010B (en) 2019-02-19

Similar Documents

Publication Publication Date Title
CN108089814B (en) Data storage method and device
CN102629258B (en) Repeating data deleting method and device
US10713202B2 (en) Quality of service (QOS)-aware input/output (IO) management for peripheral component interconnect express (PCIE) storage system with reconfigurable multi-ports
CN110659151A (en) Data verification method and device and storage medium
US11102322B2 (en) Data processing method and apparatus, server, and controller
CN104102458B (en) Load-balancing method, multi-core CPU and the solid state hard disc of multi-core CPU
WO2020119029A1 (en) Distributed task scheduling method and system, and storage medium
CN103838860A (en) File storing system based on dynamic transcript strategy and storage method of file storing system
EP2808778A1 (en) Capacity expansion method and device
CN110737401B (en) Method, apparatus and computer program product for managing redundant array of independent disks
US11379127B2 (en) Method and system for enhancing a distributed storage system by decoupling computation and network tasks
CN105808169A (en) Data deduplication method, apparatus and system
US11561707B2 (en) Allocating data storage based on aggregate duplicate performance
CN104933010A (en) Duplicated data deleting method and apparatus
CN108062235A (en) Data processing method and device
CN104486442A (en) Method and device for transmitting data of distributed storage system
CN110874284A (en) Data processing method and device
WO2019174206A1 (en) Data reading method and apparatus of storage device, terminal device, and storage medium
CN113126879B (en) Data storage method and device and electronic equipment
WO2015081742A1 (en) Data writing method and device
CN106383670B (en) Data processing method and storage device
WO2020253575A1 (en) Method and device for determining number of decoder iterations, and storage medium and electronic device
CN108984112B (en) Method and device for realizing storage QoS control strategy
WO2021237513A1 (en) Data compression storage system and method, processor, and computer storage medium
CN111865741B (en) Data transmission method and data transmission system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20201102

Address after: 625, room 269, Connaught platinum Plaza, No. 518101, Qianjin Road, Xin'an street, Shenzhen, Guangdong, Baoan District

Patentee after: SHENZHEN SHANGGE INTELLECTUAL PROPERTY SERVICE Co.,Ltd.

Address before: 518129 Bantian HUAWEI headquarters office building, Longgang District, Guangdong, Shenzhen

Patentee before: HUAWEI TECHNOLOGIES Co.,Ltd.

TR01 Transfer of patent right

Effective date of registration: 20201201

Address after: 213000 No. 11 Qingyang North Road, Tianning District, Changzhou, Jiangsu

Patentee after: Changzhou Hong quantity Electronic Technology Co.,Ltd.

Address before: 625, room 269, Connaught platinum Plaza, No. 518101, Qianjin Road, Xin'an street, Shenzhen, Guangdong, Baoan District

Patentee before: SHENZHEN SHANGGE INTELLECTUAL PROPERTY SERVICE Co.,Ltd.

TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20220829

Address after: Tianning District Hehai road 213000 Jiangsu city of Changzhou province No. 9

Patentee after: Changzhou Tianning Communication Technology Industrial Park Co.,Ltd.

Address before: 213000 No. 11 Qingyang North Road, Tianning District, Changzhou City, Jiangsu Province

Patentee before: Changzhou Hong quantity Electronic Technology Co.,Ltd.

TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20230105

Address after: Tianning District Hehai road 213000 Jiangsu city of Changzhou province No. 9

Patentee after: Changzhou Hengtang Technology Industry Co.,Ltd.

Address before: Tianning District Hehai road 213000 Jiangsu city of Changzhou province No. 9

Patentee before: Changzhou Tianning Communication Technology Industrial Park Co.,Ltd.

TR01 Transfer of patent right