CN109800218A - Distributed storage system, storage node device and data deduplication method - Google Patents
Distributed storage system, storage node device and data deduplication method
- Publication number
- CN109800218A (application CN201910007367.9A)
- Authority
- CN
- China
- Prior art keywords
- fingerprint
- data
- data slice
- written
- node equipment
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/27—Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Quality & Reliability (AREA)
- Software Systems (AREA)
- Computing Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention relates to distributed storage technology and discloses a distributed storage system, a storage node device, and a data deduplication method. When a storage node device of the invention performs data deduplication and does not find the fingerprint of a data slice to be written in its local fingerprint library, it can query the shared fingerprint library directly to determine whether the fingerprint is a duplicate, without communicating with the other storage node devices one by one. This improves the data deduplication efficiency of the distributed storage system.
Description
Technical field
The present invention relates to the technical field of distributed storage, and in particular to a distributed storage system, a storage node device, a data deduplication method, and a computer-readable storage medium.
Background
Data deduplication (also called duplicate data elimination) is a technique applied in storage systems to identify and eliminate redundant data globally, and it has become a focus of storage system research in recent years. Deduplication uniquely identifies each data block by computing a secure hash digest of the block (for example, a SHA-1 fingerprint), which avoids comparing the data character by character. The storage system only needs to maintain an index table of secure hash digests to identify duplicate data quickly and easily, so the approach scales well. For duplicate content, only the corresponding data pointer information needs to be recorded, which saves storage space. Data deduplication can therefore save a large amount of storage space and improve the resource utilization of storage devices.
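The index-table idea described above can be illustrated with a minimal Python sketch. It assumes an in-memory dictionary as the index; the class `FingerprintIndex` and its method `put` are hypothetical names chosen for illustration and do not come from the patent.

```python
import hashlib


def fingerprint(block: bytes) -> str:
    """Return the SHA-1 digest of a data block as a hex string."""
    return hashlib.sha1(block).hexdigest()


class FingerprintIndex:
    """Hypothetical in-memory index: fingerprint -> position of the stored block."""

    def __init__(self):
        self.locations = {}  # fingerprint -> index into self.blocks
        self.blocks = []     # blocks that were actually stored

    def put(self, block: bytes) -> int:
        """Store a block once and return a pointer (its position) to it."""
        fp = fingerprint(block)
        if fp in self.locations:
            # Duplicate content: record only the pointer, store nothing new.
            return self.locations[fp]
        self.blocks.append(block)
        self.locations[fp] = len(self.blocks) - 1
        return self.locations[fp]
```

Writing the same content twice returns the same pointer and leaves the number of stored blocks unchanged, which is the space saving the paragraph describes.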
At present, in a distributed storage system, a storage node typically deduplicates a data slice as follows: it computes the fingerprint of the data slice and checks whether the fingerprint exists in its own fingerprint library; if not, it checks whether the fingerprint exists in the fingerprint libraries of the other storage nodes in the distributed storage system, and thereby determines whether the data slice already exists in the system. The drawback of this method is that a distributed storage system usually contains many storage nodes. When a storage node has to look up a fingerprint in the fingerprint libraries of multiple other storage nodes, it must communicate with those nodes one by one, which is slow and inefficient.
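The cost of the node-by-node lookup can be made concrete with a small sketch. It is an illustration only: each fingerprint library is modeled as a Python set, and `lookup_peer_by_peer` is a hypothetical function that counts one remote query per peer contacted.

```python
def lookup_peer_by_peer(fp, local, peers):
    """Prior-art style lookup: return (found, number_of_remote_queries).

    `local` is this node's fingerprint set; `peers` is a list of the other
    nodes' fingerprint sets. Both are stand-ins for real fingerprint libraries.
    """
    if fp in local:
        return True, 0
    queries = 0
    for peer in peers:
        queries += 1  # one network round trip per peer contacted
        if fp in peer:
            return True, queries
    return False, queries
```

For a fingerprint that exists nowhere, the number of remote queries equals the number of peers, which is the inefficiency the invention sets out to remove.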
Therefore, how to improve the deduplication efficiency of a distributed storage system has become an urgent problem to be solved.
Summary of the invention
The main object of the present invention is to provide a distributed storage system, a storage node device, a data deduplication method, and a computer-readable storage medium, with the aim of improving the deduplication efficiency of distributed storage systems.
To achieve the above object, the present invention proposes a distributed storage system. The distributed storage system includes a plurality of storage node devices and several shared fingerprint libraries, and the storage node devices are communicatively connected to the shared fingerprint libraries. A local fingerprint library is provided in each storage node device, or each storage node device is communicatively connected to a corresponding local fingerprint library. The storage node device is configured to:
Receive a data slice write request, where the data slice write request includes several data slices to be written and the fingerprint of each data slice to be written;
Determine the fingerprints to be deduplicated among the fingerprints of the data slices to be written, and look up in the local fingerprint library whether each fingerprint to be deduplicated exists, where the local fingerprint library includes the fingerprints of the data slices already stored on the storage node device;
When one or more fingerprints to be deduplicated exist in the local fingerprint library, delete the data slices to be written that correspond to those fingerprints;
When one or more fingerprints to be deduplicated do not exist in the local fingerprint library, take those fingerprints as fingerprints to be processed and look up each fingerprint to be processed in the shared fingerprint library, where the shared fingerprint library includes the fingerprints of the data slices already stored on all storage node devices; when one or more fingerprints to be processed are found in the shared fingerprint library, delete the data slices to be written that correspond to the fingerprints found.
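The two-stage lookup described above can be sketched as follows. This is a minimal illustration, not the claimed implementation: the libraries are modeled as Python sets, the request as a dictionary from fingerprint to slice data, and the function name `deduplicate` is hypothetical.

```python
def deduplicate(request, local_fps, shared_fps):
    """Return the slices that still have to be stored after both lookups.

    `request` maps each fingerprint to be deduplicated to its slice data.
    `local_fps` and `shared_fps` stand in for the local and shared
    fingerprint libraries.
    """
    # Stage 1: fingerprints found locally mark duplicate slices, which are dropped.
    to_process = [fp for fp in request if fp not in local_fps]

    # Stage 2: the rest are checked against the shared library in one place,
    # instead of asking every other storage node in turn.
    remaining = {}
    for fp in to_process:
        if fp in shared_fps:
            continue  # duplicate of a slice stored on some other node
        remaining[fp] = request[fp]
    return remaining
```

In this sketch a slice survives only if its fingerprint is absent from both libraries; everything else is treated as a duplicate and discarded.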
Preferably, the data slices to be written are obtained by splitting the data to be written, and the data slice write request further includes a data slice fingerprint sequence, which contains the fingerprints of the data slices to be written arranged in order;
Determining the fingerprints to be deduplicated among the fingerprints of the data slices to be written includes: judging whether redundant fingerprints exist among the fingerprints of the data slices to be written; if so, deleting the redundant fingerprints and taking the remaining fingerprints as the fingerprints to be deduplicated; if not, taking the fingerprints of all data slices to be written as the fingerprints to be deduplicated;
The storage node device is further configured to:
Save the data slice fingerprint sequence to the local fingerprint library and the shared fingerprint library;
The storage node device is further configured to:
Save all remaining data slices to be written, and save the storage location information of the remaining data slices to be written to the local fingerprint library and the shared fingerprint library.
Preferably, the distributed storage system further includes a control node device communicatively connected to each storage node device and to the shared fingerprint library, and the storage node device is further configured to:
Determine the reference count change value of the fingerprint of each data slice to be written, and send the reference count change values of the fingerprints of the data slices to be written to the control node device;
The control node device is configured to:
Update the cumulative reference count of the fingerprint of each data slice to be written according to the reference count change value of that fingerprint.
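A possible shape for this exchange is sketched below. The patent does not spell out how the change value is computed, so the sketch assumes one reference per occurrence of a fingerprint in the write request; `reference_count_changes` and `ControlNode` are hypothetical names.

```python
from collections import Counter


def reference_count_changes(fingerprint_sequence):
    """Storage node side: assumed rule of +1 per occurrence of a fingerprint."""
    return Counter(fingerprint_sequence)


class ControlNode:
    """Hypothetical control node that keeps the cumulative reference counts."""

    def __init__(self):
        self.cumulative = Counter()

    def apply(self, changes):
        """Add each reported change value to the cumulative count."""
        for fp, delta in changes.items():
            self.cumulative[fp] += delta
```

A deletion could be reported with negative change values through the same `apply` call, so the control node needs only one update path.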
Preferably, the storage node device is further configured to:
Receive a deletion request for data to be deleted;
Obtain the data slice fingerprint sequence of the data to be deleted, determine the reference count change value of each fingerprint in the obtained data slice fingerprint sequence, and send the reference count change values of the fingerprints in the data slice fingerprint sequence to the control node device;
The control node device is further configured to:
Update the cumulative reference count of each fingerprint in the data slice fingerprint sequence according to the reference count change value of that fingerprint, delete the data slice fingerprint sequence of the data to be deleted from the shared fingerprint library, and notify the storage node device to delete the data slice fingerprint sequence of the data to be deleted from the local fingerprint library;
When detecting that the cumulative reference count of a fingerprint in the shared fingerprint library is zero, record how long the fingerprint remains in the state in which its cumulative reference count is zero; when that duration exceeds a preset duration, delete the fingerprint and notify the corresponding storage node device to delete the fingerprint and the data slice corresponding to the fingerprint.
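The delayed deletion of zero-reference fingerprints can be sketched as a small tracker. This is an illustration under assumptions: time is passed in explicitly as a number of seconds, and the class `ZeroCountTracker` with its methods `observe` and `expired` is hypothetical.

```python
class ZeroCountTracker:
    """Tracks how long each fingerprint has had a cumulative count of zero."""

    def __init__(self, preset_duration):
        self.preset_duration = preset_duration
        self.zero_since = {}  # fingerprint -> time its count was first seen at zero

    def observe(self, fp, count, now):
        """Record the current cumulative reference count of a fingerprint."""
        if count == 0:
            # Keep the earliest time so the duration is not reset on each check.
            self.zero_since.setdefault(fp, now)
        else:
            # The fingerprint is referenced again, so it leaves the zero state.
            self.zero_since.pop(fp, None)

    def expired(self, now):
        """Fingerprints whose zero state has lasted longer than the preset duration."""
        return sorted(
            fp for fp, since in self.zero_since.items()
            if now - since > self.preset_duration
        )
```

Waiting for the preset duration gives a fingerprint whose count has just dropped to zero a chance to be referenced again by a new write before its data slice is physically removed.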
In addition, to achieve the above object, the present invention also proposes a data deduplication method suitable for a distributed storage system. The distributed storage system includes a plurality of storage node devices and several shared fingerprint libraries, and the storage node devices are communicatively connected to the shared fingerprint libraries. A local fingerprint library is provided in each storage node device, or each storage node device is communicatively connected to a corresponding local fingerprint library. The method includes the steps of:
Receiving step: the storage node device receives a data slice write request, where the data slice write request includes several data slices to be written and the fingerprint of each data slice to be written;
Query step: the storage node device determines the fingerprints to be deduplicated among the fingerprints of the data slices to be written, and looks up in the local fingerprint library whether each fingerprint to be deduplicated exists, where the local fingerprint library includes the fingerprints of the data slices already stored on the storage node device;
First deduplication step: when one or more fingerprints to be deduplicated exist in the local fingerprint library, the storage node device deletes the data slices to be written that correspond to those fingerprints;
Second deduplication step: when one or more fingerprints to be deduplicated do not exist in the local fingerprint library, the storage node device takes those fingerprints as fingerprints to be processed and looks up each fingerprint to be processed in the shared fingerprint library, where the shared fingerprint library includes the fingerprints of the data slices already stored on all storage node devices; when one or more fingerprints to be processed are found in the shared fingerprint library, the data slices to be written that correspond to the fingerprints found are deleted.
Preferably, the data slices to be written are obtained by splitting the data to be written, and the data slice write request further includes a data slice fingerprint sequence, which contains the fingerprints of the data slices to be written arranged in order;
The step of determining the fingerprints to be deduplicated among the fingerprints of the data slices to be written includes: judging whether redundant fingerprints exist among the fingerprints of the data slices to be written; if so, deleting the redundant fingerprints and taking the remaining fingerprints as the fingerprints to be deduplicated; if not, taking the fingerprints of all data slices to be written as the fingerprints to be deduplicated;
After the receiving step, the method further includes:
The storage node device saves the data slice fingerprint sequence to the local fingerprint library and the shared fingerprint library;
After the second deduplication step, the method further includes:
The storage node device saves all remaining data slices to be written, and saves the storage location information of the remaining data slices to be written to the local fingerprint library and the shared fingerprint library.
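The final save step can be sketched as follows. The layout of the storage location information is not specified in the text, so this sketch assumes a `(node_id, offset)` pair and models both libraries as dictionaries; `store_remaining` and its parameters are hypothetical.

```python
def store_remaining(remaining, node_id, disk, local_lib, shared_lib):
    """Store the surviving slices and record where each one was placed.

    `remaining` maps fingerprint -> slice data, `disk` is a list standing in
    for the node's storage, and the two libraries map fingerprint -> location.
    """
    for fp, data in remaining.items():
        offset = len(disk)           # assumed location format: position on this node
        disk.append(data)
        location = (node_id, offset)
        local_lib[fp] = location     # lets this node find the slice later
        shared_lib[fp] = location    # lets other nodes detect the duplicate
```

Recording the location in the shared library is what allows a later write on a different node to recognize the slice as a duplicate in a single lookup.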
Preferably, the distributed storage system further includes a control node device communicatively connected to each storage node device and to the shared fingerprint library, and after the receiving step, the method further includes:
The storage node device determines the reference count change value of the fingerprint of each data slice to be written, and sends the reference count change values of the fingerprints of the data slices to be written to the control node device;
The control node device updates the cumulative reference count of the fingerprint of each data slice to be written according to the reference count change value of that fingerprint.
Preferably, the method further includes:
The storage node device receives a deletion request for data to be deleted;
The storage node device obtains the data slice fingerprint sequence of the data to be deleted, determines the reference count change value of each fingerprint in the obtained data slice fingerprint sequence, and sends the reference count change values of the fingerprints in the data slice fingerprint sequence to the control node device;
The control node device updates the cumulative reference count of each fingerprint in the data slice fingerprint sequence according to the reference count change value of that fingerprint, deletes the data slice fingerprint sequence of the data to be deleted from the shared fingerprint library, and notifies the storage node device to delete the data slice fingerprint sequence of the data to be deleted from the local fingerprint library;
When detecting that the cumulative reference count of a fingerprint in the shared fingerprint library is zero, the control node device records how long the fingerprint remains in the state in which its cumulative reference count is zero; when that duration exceeds a preset duration, it deletes the fingerprint and notifies the corresponding storage node device to delete the fingerprint and the data slice corresponding to the fingerprint.
In addition, to achieve the above object, the present invention also proposes a storage node device. The storage node device is communicatively connected to a shared fingerprint library. A local fingerprint library is provided in the storage node device, or the storage node device is communicatively connected to a corresponding local fingerprint library. The storage node device includes a memory and a processor; a data deduplication program is stored on the memory, and the data deduplication program implements the following steps when executed by the processor:
Receiving step: receiving a data slice write request, where the data slice write request includes several data slices to be written and the fingerprint of each data slice to be written;
Query step: determining the fingerprints to be deduplicated among the fingerprints of the data slices to be written, and looking up in the local fingerprint library whether each fingerprint to be deduplicated exists, where the local fingerprint library includes the fingerprints of the data slices already stored on the storage node device;
First deduplication step: when one or more fingerprints to be deduplicated exist in the local fingerprint library, deleting the data slices to be written that correspond to those fingerprints;
Second deduplication step: when one or more fingerprints to be deduplicated do not exist in the local fingerprint library, taking those fingerprints as fingerprints to be processed and looking up each fingerprint to be processed in the shared fingerprint library, where the shared fingerprint library includes the fingerprints of the data slices already stored on all storage node devices; when one or more fingerprints to be processed are found in the shared fingerprint library, deleting the data slices to be written that correspond to the fingerprints found.
Preferably, the data slices to be written are obtained by splitting the data to be written, and the data slice write request further includes a data slice fingerprint sequence, which contains the fingerprints of the data slices to be written arranged in order;
The step of determining the fingerprints to be deduplicated among the fingerprints of the data slices to be written includes: judging whether redundant fingerprints exist among the fingerprints of the data slices to be written; if so, deleting the redundant fingerprints and taking the remaining fingerprints as the fingerprints to be deduplicated; if not, taking the fingerprints of all data slices to be written as the fingerprints to be deduplicated;
When the processor executes the data deduplication program, the following step is also implemented after the receiving step:
Saving the data slice fingerprint sequence to the local fingerprint library and the shared fingerprint library;
When the processor executes the data deduplication program, the following step is also implemented after the second deduplication step:
Saving all remaining data slices to be written, and saving the storage location information of the remaining data slices to be written to the local fingerprint library and the shared fingerprint library.
Preferably, the distributed storage system further includes a control node device communicatively connected to each storage node device and to the shared fingerprint library. When the processor executes the data deduplication program, the following step is also implemented after the receiving step:
Determining the reference count change value of the fingerprint of each data slice to be written, and sending the reference count change values of the fingerprints of the data slices to be written to the control node device, so that the control node device updates the cumulative reference count of the fingerprint of each data slice to be written according to the reference count change value of that fingerprint.
In addition, to achieve the above object, the present invention also proposes a computer-readable storage medium suitable for a storage node device. The storage node device is communicatively connected to a shared fingerprint library. A local fingerprint library is provided in the storage node device, or the storage node device is communicatively connected to a corresponding local fingerprint library. The computer-readable storage medium stores a data deduplication program that can be executed by at least one processor, so that the at least one processor performs the following steps:
Receiving step: receiving a data slice write request, where the data slice write request includes several data slices to be written and the fingerprint of each data slice to be written;
Query step: determining the fingerprints to be deduplicated among the fingerprints of the data slices to be written, and looking up in the local fingerprint library whether each fingerprint to be deduplicated exists, where the local fingerprint library includes the fingerprints of the data slices already stored on the storage node device;
First deduplication step: when one or more fingerprints to be deduplicated exist in the local fingerprint library, deleting the data slices to be written that correspond to those fingerprints;
Second deduplication step: when one or more fingerprints to be deduplicated do not exist in the local fingerprint library, taking those fingerprints as fingerprints to be processed and looking up each fingerprint to be processed in the shared fingerprint library, where the shared fingerprint library includes the fingerprints of the data slices already stored on all storage node devices; when one or more fingerprints to be processed are found in the shared fingerprint library, deleting the data slices to be written that correspond to the fingerprints found.
Preferably, the data slices to be written are obtained by splitting the data to be written, and the data slice write request further includes a data slice fingerprint sequence, which contains the fingerprints of the data slices to be written arranged in order;
The step of determining the fingerprints to be deduplicated among the fingerprints of the data slices to be written includes: judging whether redundant fingerprints exist among the fingerprints of the data slices to be written; if so, deleting the redundant fingerprints and taking the remaining fingerprints as the fingerprints to be deduplicated; if not, taking the fingerprints of all data slices to be written as the fingerprints to be deduplicated;
When the processor executes the data deduplication program, the following step is also implemented after the receiving step:
Saving the data slice fingerprint sequence to the local fingerprint library and the shared fingerprint library;
When the processor executes the data deduplication program, the following step is also implemented after the second deduplication step:
Saving all remaining data slices to be written, and saving the storage location information of the remaining data slices to be written to the local fingerprint library and the shared fingerprint library.
Preferably, the distributed storage system further includes a control node device communicatively connected to each storage node device and to the shared fingerprint library. When the processor executes the data deduplication program, the following step is also implemented after the receiving step:
Determining the reference count change value of the fingerprint of each data slice to be written, and sending the reference count change values of the fingerprints of the data slices to be written to the control node device, so that the control node device updates the cumulative reference count of the fingerprint of each data slice to be written according to the reference count change value of that fingerprint.
Compared with the prior art, when a storage node device of the present embodiment performs data deduplication and does not find the fingerprint of a data slice to be written in the local fingerprint library, it can query the shared fingerprint library directly to determine whether the fingerprint is a duplicate, without communicating with the other storage node devices one by one. This improves the data deduplication efficiency of the distributed storage system.
Brief description of the drawings
To explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from the structures shown in these drawings without creative effort.
Fig. 1 is a schematic diagram of the system architecture of an embodiment of the distributed storage system of the present invention;
Fig. 2 is a schematic diagram of the running environment of an embodiment of the data deduplication program of the present invention;
Fig. 3 is a program module diagram of an embodiment of the data deduplication program of the present invention;
Fig. 4 is a flowchart of an embodiment of the data deduplication method of the present invention.
The realization of the object, the functional features, and the advantages of the present invention are further described below with reference to the embodiments and the accompanying drawings.
Detailed description
The principles and features of the present invention are described below with reference to the accompanying drawings. The examples given serve only to explain the present invention and are not intended to limit its scope.
Fig. 1 shows the system architecture of an embodiment of the distributed storage system of the present invention.
In this embodiment, the distributed storage system includes a plurality of storage node devices 1 and several shared fingerprint libraries 2. The storage node devices 1 are communicatively connected to the shared fingerprint libraries 2 (for example, through a network 4). A local fingerprint library 3 is provided in each storage node device 1, or each storage node device 1 is communicatively connected to a corresponding local fingerprint library 3. The local fingerprint library 3 includes the fingerprints of the data slices already stored on the corresponding storage node device 1, and the shared fingerprint library 2 includes the fingerprints of the data slices already stored on all storage node devices 1.
The storage node device 1 is configured to:
Receive a data slice write request, where the data slice write request includes several data slices to be written and the fingerprint of each data slice to be written;
Determine the fingerprints to be deduplicated among the fingerprints of the data slices to be written, and look up in the local fingerprint library 3 whether each fingerprint to be deduplicated exists;
When one or more fingerprints to be deduplicated exist in the local fingerprint library 3, delete the data slices to be written that correspond to those fingerprints;
When one or more fingerprints to be deduplicated do not exist in the local fingerprint library 3, take those fingerprints as fingerprints to be processed and look up each fingerprint to be processed in the shared fingerprint library 2; when one or more fingerprints to be processed are found in the shared fingerprint library 2, delete the data slices to be written that correspond to the fingerprints found.
In this embodiment, the storage node device 1 receives a data slice write request, which includes several data slices to be written and the fingerprint of each data slice to be written. The data slices to be written are obtained by splitting the data to be written (the data types of the data to be written include block-level data and file-level data). The splitting may be performed by the storage node device 1 or by any other applicable device (for example, a client). The splitting methods include:
Splitting the data file to be written into a preset number of data slices of identical size. Alternatively, denoting the preset number as M, where M is a natural number greater than 1: determining the data slice size to use for splitting the data file to be written, splitting off M-1 data blocks of that size one by one, and taking what remains after splitting as the M-th data block. The size of a data slice to be written may be 4 KB, 8 KB, 12 KB, 16 KB, or another granularity.
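The second splitting method (equal-sized slices followed by a remainder) can be sketched in a few lines of Python. The function name `split_fixed` is hypothetical, and the sketch takes the slice size as given rather than deriving it from a preset number M.

```python
def split_fixed(data: bytes, slice_size: int):
    """Split data into slices of `slice_size` bytes; the last slice holds the remainder."""
    if slice_size <= 0:
        raise ValueError("slice_size must be positive")
    return [data[i:i + slice_size] for i in range(0, len(data), slice_size)]
```

For example, `split_fixed(b"abcdefghij", 4)` yields two 4-byte slices and a final 2-byte slice, and joining the slices in order reproduces the original data.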
After the data to be written has been split into several data slices to be written, the fingerprint of each data slice to be written is calculated, for example with Message-Digest Algorithm 5 (MD5) or a Secure Hash Algorithm (such as SHA-1). At the same time, the order of the data slices to be written is recorded (i.e., the data slice fingerprint sequence), so that when the data is read later, the data slices can be assembled back into the data according to the data slice fingerprint sequence. In addition, the storage node device 1 may also save the data slice fingerprint sequence to the local fingerprint library 3 and the shared fingerprint library 2.
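The role of the data slice fingerprint sequence in reading data back can be sketched as follows. The slice store is modeled as a dictionary keyed by fingerprint, SHA-1 is used as one of the algorithms the text mentions, and the function names `write_slices` and `read_data` are hypothetical.

```python
import hashlib


def write_slices(data_slices, store):
    """Store each distinct slice once and return the ordered fingerprint sequence."""
    sequence = []
    for data_slice in data_slices:
        fp = hashlib.sha1(data_slice).hexdigest()
        store.setdefault(fp, data_slice)  # a repeated slice is not stored again
        sequence.append(fp)               # but its position is still recorded
    return sequence


def read_data(sequence, store):
    """Reassemble the original data by following the fingerprint sequence."""
    return b"".join(store[fp] for fp in sequence)
```

Because the sequence keeps one entry per slice position, data containing repeated slices can still be rebuilt exactly even though each distinct slice is stored only once.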
Then, the storage node device 1 determines the fingerprints to be deduplicated among the fingerprints of the data slices to be written. The method of determining the fingerprints to be deduplicated includes: judging whether redundant fingerprints exist among the fingerprints of the data slices to be written; if so, deleting the redundant fingerprints and taking the remaining fingerprints as the fingerprints to be deduplicated; if not, taking the fingerprints of all data slices to be written as the fingerprints to be deduplicated.
For example, memory node equipment 1 judges in the fingerprint of all data slices to be written with the presence or absence of identical fingerprint.If depositing
In identical fingerprint, then using identical fingerprint as a fingerprint group, after finding out all fingerprint groups, in each fingerprint group
Middle one fingerprint of selection retains, and deletes non-selected fingerprint as the fingerprint of redundancy, and judge whether there is ungrouped
Fingerprint, if so, using each ungrouped fingerprint as to duplicate removal fingerprint, if it is not, then terminating process.Identical finger if it does not exist
Line, then using the fingerprint of all data slices to be written as ungrouped fingerprint, then using each ungrouped fingerprint as to duplicate removal
Fingerprint.
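The removal of redundant fingerprints within a single write request can be sketched as below. The sketch follows the general rule stated above (delete redundant fingerprints, keep the rest); the function name is hypothetical, and retaining the first occurrence in each group is an assumption, since the description only says that one fingerprint per group is retained.

```python
def fingerprints_to_deduplicate(fingerprints: list[str]) -> list[str]:
    """Keep one fingerprint per group of identical fingerprints, in order."""
    # dict.fromkeys preserves insertion order, so the first occurrence of
    # each fingerprint is retained and later repeats are dropped as redundant.
    return list(dict.fromkeys(fingerprints))
```

For the input `["a", "b", "a", "c", "b"]` the result is `["a", "b", "c"]`.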
After the fingerprints to be deduplicated are identified, memory node equipment 1 searches the local fingerprint base 3 to determine whether each fingerprint to be deduplicated exists there. When one or more fingerprints to be deduplicated exist in the local fingerprint base 3, memory node equipment 1 deletes the data slices to be written that correspond to those fingerprints. Their presence in the local fingerprint base 3 indicates that the corresponding data slices to be written are duplicate data slices, and these are deleted to save storage space.
Finally, when one or more fingerprints to be deduplicated do not exist in the local fingerprint base 3, memory node equipment 1 takes those fingerprints as fingerprints to be processed and searches for each fingerprint to be processed in the shared fingerprint base 2. When one or more fingerprints to be processed are found in the shared fingerprint base 2, the data slices to be written that correspond to the found fingerprints are deleted.
Because the shared fingerprint base 2 holds the full set of fingerprint data, if memory node equipment 1 does not find a fingerprint to be deduplicated in the local fingerprint base 3, it goes on to query the shared fingerprint base 2 for that fingerprint. If the fingerprint exists there, the corresponding data slice to be written is determined to already exist on another memory node and to be a duplicate data slice, so it no longer needs to be stored.
Compared with the prior art, when memory node equipment 1 of the present embodiment performs data deduplication and does not find the fingerprint of a data slice to be written in the local fingerprint base 3, it can query the shared fingerprint base 2 directly to determine whether the fingerprint is a duplicate, without communicating with the other memory node equipment 1 one by one. This improves the data deduplication efficiency of the distributed memory system.
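The two-level lookup described above can be sketched as follows. Modelling both fingerprint bases as in-memory Python sets is an assumption for illustration (the shared fingerprint base 2 would in practice sit on shared storage), and the function name is hypothetical.

```python
def classify_fingerprints(candidates, local_base, shared_base):
    """Split candidate fingerprints into local hits, shared hits and new ones."""
    local_hits, shared_hits, new = [], [], []
    for fp in candidates:
        if fp in local_base:
            # Duplicate already known to this node: the slice is dropped.
            local_hits.append(fp)
        elif fp in shared_base:
            # Only consulted after a local miss: duplicate held by another node.
            shared_hits.append(fp)
        else:
            # Not a duplicate anywhere: the slice has to be stored.
            new.append(fp)
    return local_hits, shared_hits, new
```

The shared base is queried only for fingerprints that miss locally, which mirrors the order of the lookups in the description.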
Further, in the present embodiment, the memory node equipment 1 is also used to:
Save all remaining data slices to be written (i.e. the non-duplicate data slices), and save the storage location information corresponding to the remaining data slices to be written to the local fingerprint base 3 and the shared fingerprint base 2.
Further, in the present embodiment, the distributed memory system further includes control node equipment 5, which is communicatively connected to the memory node equipment 1 and to the shared fingerprint base 2 respectively (for example, through network 4). The shared fingerprint base 2 may be placed on a shared disk (such as an NVMe disk mounted via NVMe-oF); the shared disk may be placed in control node equipment 5 or set up independently of control node equipment 5.
The memory node equipment 1 is also used to:
Determine the reference count change value of the fingerprint of each data slice to be written (for example, determine that the reference count change value of each fingerprint to be deduplicated is +1), and send the reference count change value of the fingerprint of each data slice to be written to control node equipment 5.
The control node equipment 5 is used to:
Update the accumulated reference count of the fingerprint of each data slice to be written according to the reference count change value of that fingerprint (the accumulated reference count of a fingerprint represents the total number of times the data slice corresponding to the fingerprint is referenced by stored data).
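A sketch of this reference count bookkeeping is given below. The class `ControlNode`, the helper `write_changes` and the use of a plain dictionary as the message format are all hypothetical; the description does not define how change values are transmitted.

```python
from collections import Counter


def write_changes(fingerprints_to_deduplicate: list[str]) -> dict[str, int]:
    """Memory node side: a change value of +1 per fingerprint to be deduplicated."""
    return {fp: +1 for fp in fingerprints_to_deduplicate}


class ControlNode:
    """Control node side: keeps the accumulated reference count per fingerprint."""

    def __init__(self) -> None:
        self.accumulated: Counter = Counter()

    def apply(self, changes: dict[str, int]) -> None:
        # Add every reported change value to the accumulated count.
        for fp, delta in changes.items():
            self.accumulated[fp] += delta
```

Two writes that both contain fingerprint `"b"` leave its accumulated reference count at 2, whichever memory node reported them.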
Further, in the present embodiment, the memory node equipment 1 is also used to:
Receive a deletion request for data to be deleted, obtain the data slice fingerprint sequence of the data to be deleted, determine the reference count change value of each fingerprint in the obtained data slice fingerprint sequence (for example, determine that the reference count change value of each fingerprint in the data slice fingerprint sequence is -1), and send the reference count change value of each fingerprint in the data slice fingerprint sequence to control node equipment 5.
The control node equipment 5 is also used to:
Update the accumulated reference count of each fingerprint in the data slice fingerprint sequence according to the reference count change value of each fingerprint in the sequence, delete the data slice fingerprint sequence of the data to be deleted from the shared fingerprint base 2, and notify the memory node equipment 1 to delete the data slice fingerprint sequence of the data to be deleted from the local fingerprint base 3.
In the present embodiment, when data is to be deleted, memory node equipment 1 cannot delete the data slices of the data to be deleted directly, because it cannot determine whether those data slices are also referenced by other data; deleting them directly could cause data loss. Therefore, it is only necessary to update the accumulated reference count of each fingerprint in the data slice fingerprint sequence of the data to be deleted and to delete that data slice fingerprint sequence.
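The deletion path can be sketched as follows. The names `delete_data`, `sequences` and `accumulated` are hypothetical, and applying -1 once per distinct fingerprint (to mirror the +1 applied per fingerprint to be deduplicated on write) is an assumption.

```python
def delete_data(data_id: str, sequences: dict, accumulated: dict) -> list:
    """Drop the fingerprint sequence of data_id and decrement reference counts.

    The data slices themselves are left untouched, because other data may
    still reference them.
    """
    # Remove the data slice fingerprint sequence of the data to be deleted.
    sequence = sequences.pop(data_id)
    # Assumption: a change value of -1 per distinct fingerprint in the sequence.
    for fp in set(sequence):
        accumulated[fp] -= 1
    return sequence
```

After the call, a slice shared with other data keeps a positive accumulated reference count and therefore remains readable.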
Further, in the present embodiment, the control node equipment 5 is also used to:
When detecting that the accumulated reference count of a fingerprint in the shared fingerprint base 2 is zero (i.e. the data slice corresponding to the fingerprint is not referenced by any data), record the duration for which the fingerprint remains in the state of having an accumulated reference count of zero.
When the duration is greater than a preset duration, delete the fingerprint and notify the corresponding memory node equipment 1 to delete the fingerprint and the data slice corresponding to the fingerprint.
When the duration is less than or equal to the preset duration, perform no deletion.
In the present embodiment, when control node equipment 5 detects that the accumulated reference count of a fingerprint is zero, it waits for a preset duration before the data slice corresponding to the fingerprint is deleted, and during the preset duration it receives in real time the reference count change values of the fingerprint reported by each memory node equipment 1. This avoids accidental data deletion caused by memory node equipment 1 failing to report reference count change values in time.
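The delayed deletion can be sketched with a small tracker. The class name `ZeroRefTracker`, the explicit `now` timestamps and the polling style of `observe` are assumptions made for illustration; the description only fixes the rule that deletion happens once the zero state has lasted longer than the preset duration.

```python
class ZeroRefTracker:
    """Tracks how long each fingerprint has had an accumulated count of zero."""

    def __init__(self, preset_duration: float) -> None:
        self.preset_duration = preset_duration
        self.zero_since: dict[str, float] = {}

    def observe(self, fp: str, accumulated: int, now: float) -> bool:
        """Return True when the fingerprint and its data slice may be deleted."""
        if accumulated > 0:
            # A late change value arrived: leave the zero state, reset the timer.
            self.zero_since.pop(fp, None)
            return False
        # Remember the moment the zero state was first seen.
        start = self.zero_since.setdefault(fp, now)
        # Delete only when the duration is strictly greater than the preset one.
        return now - start > self.preset_duration
```

A change value that raises the count during the waiting period resets the timer, which is how a late report prevents an accidental deletion.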
Further, in the present embodiment, the memory node equipment 1 is also used to:
When receiving a read request for data to be read, obtain the data slice fingerprint sequence of the data to be read, obtain the storage location information of the data slice corresponding to each fingerprint in the data slice fingerprint sequence, obtain the data slice corresponding to each fingerprint according to the obtained storage location information, and then assemble the obtained data slices into the data to be read according to the data slice fingerprint sequence.
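The read path can be sketched as below. The mapping `locations` (fingerprint to storage location) and the mapping `storage` (storage location to slice bytes) are hypothetical stand-ins for the fingerprint bases and the memory nodes.

```python
def read_data(sequence: list, locations: dict, storage: dict) -> bytes:
    """Reassemble data from its data slice fingerprint sequence."""
    # Look up each fingerprint's storage location, fetch the slice stored
    # there, and concatenate the slices in sequence order.
    return b"".join(storage[locations[fp]] for fp in sequence)
```

A fingerprint that occurs several times in the sequence is fetched from the same location each time, so a deduplicated slice is stored once but can appear repeatedly in the reassembled data.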
The present invention also proposes a data deduplication program.
Referring to Fig. 2, it is a schematic diagram of the running environment of an embodiment of the data deduplication program 10 of the present invention.
In the present embodiment, data deduplication program 10 is installed and runs in memory node equipment 1. Memory node equipment 1 may be a computing device such as a desktop computer, a notebook, a palmtop computer or a server. The memory node equipment 1 may include, but is not limited to, memory 11, processor 12 and display 13. Fig. 2 shows only the memory node equipment 1 with components 11-13; it should be understood that not all of the components shown are required, and more or fewer components may be implemented instead.
In some embodiments, memory 11 may be an internal storage unit of memory node equipment 1, such as a hard disk or internal memory of the memory node equipment 1. In other embodiments, memory 11 may be an external storage device of memory node equipment 1, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a flash card (Flash Card) fitted to memory node equipment 1. Further, memory 11 may include both an internal storage unit of memory node equipment 1 and an external storage device. Memory 11 is used to store the application software installed on memory node equipment 1 and various kinds of data, such as the program code of data deduplication program 10. Memory 11 may also be used to temporarily store data that has been output or is to be output.
In some embodiments, processor 12 may be a central processing unit (Central Processing Unit, CPU), a microprocessor or another data processing chip, used to run the program code stored in memory 11 or to process data, for example to execute data deduplication program 10.
In some embodiments, display 13 may be an LED display, a liquid crystal display, a touch-control liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device or the like. Display 13 is used to display the information processed in memory node equipment 1 and to display a visual user interface. The components 11-13 of memory node equipment 1 communicate with each other through a system bus.
Referring to Fig. 3, it is a program module diagram of an embodiment of the data deduplication program 10 of the present invention. In the present embodiment, data deduplication program 10 may be divided into one or more modules, which are stored in memory 11 and executed by one or more processors (processor 12 in the present embodiment) to complete the present invention. For example, in Fig. 3, data deduplication program 10 may be divided into receiving module 101, preprocessing module 102, query module 103, first deduplication module 104 and second deduplication module 105. A module, as referred to in the present invention, is a series of computer program instruction segments capable of completing a specific function, and is more suitable than a program for describing the execution process of data deduplication program 10 in memory node equipment 1, in which:
Receiving module 101 is used to receive a data slice write request, the data slice write request including several data slices to be written and the fingerprint of each data slice to be written.
The data slices to be written are obtained by splitting the data to be written (the data type of the data to be written includes block-level data and file-level data). The splitting operation may be performed by receiving module 101 or by any other suitable device (for example, a client). The splitting method includes:
Splitting the data file to be written into a preset number of data slices of identical size. Alternatively, denoting the preset number as M, where M is a natural number greater than 1, determining the data slice size to use for splitting the data file to be written, splitting off M-1 data blocks of that size one by one, and taking the remainder after splitting as the M-th data block. The size of a data slice to be written may be 4KB, 8KB, 12KB, 16KB or another granularity.
After the data to be written is split into several data slices to be written, the fingerprint of each data slice to be written is calculated, for example by Message-Digest Algorithm 5 (MD5) or Secure Hash Algorithm 1 (SHA1). At the same time, the order of the data slices to be written is recorded (i.e. the data slice fingerprint sequence), so that when the data is subsequently read, the data slices can be assembled into the data according to the data slice fingerprint sequence. In addition, the data slice fingerprint sequence may also be saved to the local fingerprint base 3 and the shared fingerprint base 2.
Preprocessing module 102 is used to determine the fingerprints to be deduplicated among the fingerprints of the data slices to be written.
The method by which preprocessing module 102 determines the fingerprints to be deduplicated includes:
Judging whether redundant fingerprints exist among the fingerprints of the data slices to be written; if so, deleting the redundant fingerprints and taking the remaining fingerprints as the fingerprints to be deduplicated; if not, taking the fingerprints of all the data slices to be written as the fingerprints to be deduplicated.
For example, it is judged whether identical fingerprints exist among the fingerprints of all the data slices to be written. If identical fingerprints exist, the identical fingerprints are taken as one fingerprint group; after all fingerprint groups are found, one fingerprint in each fingerprint group is selected and retained, and the unselected fingerprints are deleted as redundant fingerprints. It is then judged whether ungrouped fingerprints exist; if so, each ungrouped fingerprint is taken as a fingerprint to be deduplicated; if not, the process ends. If no identical fingerprints exist, the fingerprints of all the data slices to be written are taken as ungrouped fingerprints, and each ungrouped fingerprint is taken as a fingerprint to be deduplicated.
Query module 103 is used to search the local fingerprint base 3 to determine whether each fingerprint to be deduplicated exists there.
First deduplication module 104 is used to, when one or more fingerprints to be deduplicated exist in the local fingerprint base 3, delete the data slices to be written that correspond to those fingerprints.
When one or more fingerprints to be deduplicated exist in the local fingerprint base 3, this indicates that the corresponding data slices to be written are duplicate data slices, and these duplicate data slices are deleted to save storage space.
Second deduplication module 105 is used to, when one or more fingerprints to be deduplicated do not exist in the local fingerprint base 3, take those fingerprints as fingerprints to be processed and search for each fingerprint to be processed in the shared fingerprint base 2; when one or more fingerprints to be processed are found in the shared fingerprint base 2, the data slices to be written that correspond to the found fingerprints are deleted.
Because the shared fingerprint base 2 holds the full set of fingerprint data, if first deduplication module 104 does not find a fingerprint to be deduplicated in the local fingerprint base 3, second deduplication module 105 goes on to query the shared fingerprint base 2 for that fingerprint. If the fingerprint exists there, the corresponding data slice to be written is determined to already exist on another memory node and to be a duplicate data slice, so it no longer needs to be stored.
Compared with the prior art, when memory node equipment 1 of the present embodiment performs data deduplication and does not find the fingerprint of a data slice to be written in the local fingerprint base 3, it can query the shared fingerprint base 2 directly to determine whether the fingerprint is a duplicate, without communicating with the other memory node equipment 1 one by one. This improves the data deduplication efficiency of the distributed memory system.
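One possible arrangement of modules 101-105 is sketched below as methods of a single class. This is an illustration only: the class and method names, the dictionary-shaped write request and the use of sets for the two fingerprint bases are all assumptions and are not taken from Fig. 3.

```python
class DataDeduplicationProgram:
    """Hypothetical arrangement of modules 101-105 as methods."""

    def __init__(self, local_base: set, shared_base: set) -> None:
        self.local_base = local_base
        self.shared_base = shared_base

    def receive(self, request: dict) -> dict:
        # Receiving module 101: pair each fingerprint with its data slice.
        return dict(zip(request["fingerprints"], request["slices"]))

    def preprocess(self, fingerprints: list) -> list:
        # Preprocessing module 102: drop redundant fingerprints, keep order.
        return list(dict.fromkeys(fingerprints))

    def query_local(self, candidates: list) -> tuple:
        # Query module 103: look each candidate up in the local fingerprint base.
        hits = [fp for fp in candidates if fp in self.local_base]
        misses = [fp for fp in candidates if fp not in self.local_base]
        return hits, misses

    def first_dedup(self, pending: dict, local_hits: list) -> None:
        # First deduplication module 104: delete slices found locally.
        for fp in local_hits:
            pending.pop(fp, None)

    def second_dedup(self, pending: dict, misses: list) -> None:
        # Second deduplication module 105: delete slices found in the shared base.
        for fp in misses:
            if fp in self.shared_base:
                pending.pop(fp, None)

    def handle(self, request: dict) -> dict:
        pending = self.receive(request)
        candidates = self.preprocess(request["fingerprints"])
        hits, misses = self.query_local(candidates)
        self.first_dedup(pending, hits)
        self.second_dedup(pending, misses)
        return pending  # slices that still have to be stored
```

Whatever `handle` returns corresponds to the remaining, non-duplicate data slices that a storage module would then save.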
Further, in the present embodiment, the data deduplication program 10 further includes a storage module (not shown), used to:
Save all remaining data slices to be written (i.e. the non-duplicate data slices), and save the storage location information corresponding to the remaining data slices to be written to the local fingerprint base 3 and the shared fingerprint base 2.
Further, in the present embodiment, the data deduplication program 10 further includes a reference update module (not shown), used to:
Determine the reference count change value of the fingerprint of each data slice to be written (for example, determine that the reference count change value of each fingerprint to be deduplicated is +1), and send the reference count change value of the fingerprint of each data slice to be written to control node equipment 5, so that control node equipment 5 updates the accumulated reference count of the fingerprint of each data slice to be written according to the reference count change value of that fingerprint (the accumulated reference count of a fingerprint represents the total number of times the data slice corresponding to the fingerprint is referenced by stored data).
Further, in the present embodiment, the data deduplication program 10 further includes a deletion module (not shown), used to:
Receive a deletion request for data to be deleted, obtain the data slice fingerprint sequence of the data to be deleted, determine the reference count change value of each fingerprint in the obtained data slice fingerprint sequence (for example, determine that the reference count change value of each fingerprint in the data slice fingerprint sequence is -1), and send the reference count change value of each fingerprint in the data slice fingerprint sequence to control node equipment 5. Control node equipment 5 then updates the accumulated reference count of each fingerprint in the data slice fingerprint sequence according to the reference count change value of each fingerprint in the sequence, deletes the data slice fingerprint sequence of the data to be deleted from the shared fingerprint base 2, and notifies the memory node equipment 1 to delete the data slice fingerprint sequence of the data to be deleted from the local fingerprint base 3.
In the present embodiment, when data is to be deleted, memory node equipment 1 cannot delete the data slices of the data to be deleted directly, because it cannot determine whether those data slices are also referenced by other data; deleting them directly could cause data loss. Therefore, it is only necessary to update the accumulated reference count of each fingerprint in the data slice fingerprint sequence of the data to be deleted and to delete that data slice fingerprint sequence.
When detecting that the accumulated reference count of a fingerprint in the shared fingerprint base 2 is zero (i.e. the data slice corresponding to the fingerprint is not referenced by any data), control node equipment 5 records the duration for which the fingerprint remains in the state of having an accumulated reference count of zero. When the duration is greater than a preset duration, the fingerprint is deleted and the corresponding memory node equipment 1 is notified to delete the fingerprint and the data slice corresponding to the fingerprint. When the duration is less than or equal to the preset duration, no deletion is performed.
In the present embodiment, when control node equipment 5 detects that the accumulated reference count of a fingerprint is zero, it waits for a preset duration before the data slice corresponding to the fingerprint is deleted, and during the preset duration it receives in real time the reference count change values of the fingerprint reported by each memory node equipment 1. This avoids accidental data deletion caused by memory node equipment 1 failing to report reference count change values in time.
Further, in the present embodiment, the data deduplication program 10 further includes a read module (not shown), used to:
When receiving a read request for data to be read, obtain the data slice fingerprint sequence of the data to be read, obtain the storage location information of the data slice corresponding to each fingerprint in the data slice fingerprint sequence, obtain the data slice corresponding to each fingerprint according to the obtained storage location information, and then assemble the obtained data slices into the data to be read according to the data slice fingerprint sequence.
In addition, the present invention also proposes a data deduplication method. The method is applicable to the above distributed memory system.
As shown in Fig. 4, Fig. 4 is a flow diagram of an embodiment of the data deduplication method of the present invention.
In the present embodiment, the method includes:
Step S10, memory node equipment 1 receives a data slice write request, the data slice write request including several data slices to be written and the fingerprint of each data slice to be written.
The data slices to be written are obtained by splitting the data to be written (the data type of the data to be written includes block-level data and file-level data). The splitting operation may be performed by memory node equipment 1 or by any other suitable device (for example, a client). The splitting method includes:
Splitting the data file to be written into a preset number of data slices of identical size. Alternatively, denoting the preset number as M, where M is a natural number greater than 1, determining the data slice size to use for splitting the data file to be written, splitting off M-1 data blocks of that size one by one, and taking the remainder after splitting as the M-th data block. The size of a data slice to be written may be 4KB, 8KB, 12KB, 16KB or another granularity.
After the data to be written is split into several data slices to be written, the fingerprint of each data slice to be written is calculated, for example by Message-Digest Algorithm 5 (MD5) or Secure Hash Algorithm 1 (SHA1). At the same time, the order of the data slices to be written is recorded (i.e. the data slice fingerprint sequence), so that when the data is subsequently read, the data slices can be assembled into the data according to the data slice fingerprint sequence. In addition, the data slice fingerprint sequence may also be saved to the local fingerprint base 3 and the shared fingerprint base 2.
Step S20, memory node equipment 1 determines the fingerprints to be deduplicated among the fingerprints of the data slices to be written.
The method for determining the fingerprints to be deduplicated includes: judging whether redundant fingerprints exist among the fingerprints of the data slices to be written; if so, deleting the redundant fingerprints and taking the remaining fingerprints as the fingerprints to be deduplicated; if not, taking the fingerprints of all the data slices to be written as the fingerprints to be deduplicated.
For example, the step S20 includes steps S21~S26 (not shown), in which:
Step S21, judging whether identical fingerprints exist among the fingerprints of all the data slices to be written.
Step S22, if identical fingerprints exist, taking the identical fingerprints as one fingerprint group; after all fingerprint groups are found, selecting and retaining one fingerprint in each fingerprint group, deleting the unselected fingerprints as redundant fingerprints, and judging whether ungrouped fingerprints exist; if so, taking each ungrouped fingerprint as a fingerprint to be deduplicated; if not, ending the process.
Step S23, if no identical fingerprints exist, taking the fingerprints of all the data slices to be written as ungrouped fingerprints, and taking each ungrouped fingerprint as a fingerprint to be deduplicated.
Step S30, memory node equipment 1 searches the local fingerprint base 3 to determine whether each fingerprint to be deduplicated exists there.
Step S40, when one or more fingerprints to be deduplicated exist in the local fingerprint base 3, memory node equipment 1 deletes the data slices to be written that correspond to those fingerprints.
When one or more fingerprints to be deduplicated exist in the local fingerprint base 3, this indicates that the corresponding data slices to be written are duplicate data slices, and these duplicate data slices are deleted to save storage space.
Step S50, when one or more fingerprints to be deduplicated do not exist in the local fingerprint base 3, memory node equipment 1 takes those fingerprints as fingerprints to be processed and searches for each fingerprint to be processed in the shared fingerprint base 2; when one or more fingerprints to be processed are found in the shared fingerprint base 2, the data slices to be written that correspond to the found fingerprints are deleted.
Because the shared fingerprint base 2 holds the full set of fingerprint data, if memory node equipment 1 does not find a fingerprint to be deduplicated in the local fingerprint base 3, it goes on to query the shared fingerprint base 2 for that fingerprint. If the fingerprint exists there, the corresponding data slice to be written is determined to already exist on another memory node and to be a duplicate data slice, so it no longer needs to be stored.
Compared with the prior art, when memory node equipment 1 of the present embodiment performs data deduplication and does not find the fingerprint of a data slice to be written in the local fingerprint base 3, it can query the shared fingerprint base 2 directly to determine whether the fingerprint is a duplicate, without communicating with the other memory node equipment 1 one by one. This improves the data deduplication efficiency of the distributed memory system.
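The efficiency argument can be made concrete with a simple worst-case count of remote lookups. The function below is a hypothetical cost model written for this illustration, not a measurement: it assumes that without a shared fingerprint base every local miss may require asking each of the other memory nodes in turn.

```python
def worst_case_remote_lookups(local_misses: int, node_count: int, shared: bool) -> int:
    """Worst-case number of remote lookups for a batch of local misses."""
    if shared:
        # One query to the shared fingerprint base per local miss.
        return local_misses
    # Otherwise each miss may be checked against every other memory node.
    return local_misses * (node_count - 1)
```

Under this model, 100 local misses in an 8-node system cost at most 100 lookups with a shared fingerprint base and at most 700 when the other nodes are queried one by one.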
Further, in the present embodiment, after step S50, the method further includes:
Memory node equipment 1 saves all remaining data slices to be written (i.e. the non-duplicate data slices), and saves the storage location information corresponding to the remaining data slices to be written to the local fingerprint base 3 and the shared fingerprint base 2.
Further, in the present embodiment, the distributed memory system further includes control node equipment 5 communicatively connected to each memory node equipment 1 and to the shared fingerprint base 2. After the step S20, the method further includes:
Memory node equipment 1 determines the reference count change value of the fingerprint of each data slice to be written (for example, determines that the reference count change value of each fingerprint to be deduplicated is +1), and sends the reference count change value of the fingerprint of each data slice to be written to control node equipment 5.
Then, control node equipment 5 updates the accumulated reference count of the fingerprint of each data slice to be written according to the reference count change value of that fingerprint (the accumulated reference count of a fingerprint represents the total number of times the data slice corresponding to the fingerprint is referenced by stored data).
Further, in the present embodiment, the method further includes steps S60~S80 (not shown), in which:
Step S60, memory node equipment 1 receives a deletion request for data to be deleted.
Step S70, memory node equipment 1 obtains the data slice fingerprint sequence of the data to be deleted, determines the reference count change value of each fingerprint in the obtained data slice fingerprint sequence (for example, determines that the reference count change value of each fingerprint in the data slice fingerprint sequence is -1), and sends the reference count change value of each fingerprint in the data slice fingerprint sequence to control node equipment 5.
Step S80, control node equipment 5 updates the accumulated reference count of each fingerprint in the data slice fingerprint sequence according to the reference count change value of each fingerprint in the sequence, deletes the data slice fingerprint sequence of the data to be deleted from the shared fingerprint base 2, and notifies memory node equipment 1 to delete the data slice fingerprint sequence of the data to be deleted from the local fingerprint base 3.
In the present embodiment, to delete data, memory node equipment 1 cannot be directly straight by the data slice of the data to be deleted
Connect deletion because memory node equipment 1 can not determine the data to be deleted data slice whether simultaneously by other data referencings, such as
Fruit directly deletes the data slice of the data to be deleted, then is likely to result in loss of data.Thus, it is only required to update the data to be deleted
Data slice sequence in each fingerprint accumulative reference count, and by the data slice fingerprint sequence of the data to be deleted delete.
Further, in the present embodiment, this method further include:
When the accumulative reference count for detecting a fingerprint in the shared fingerprint base 2 is zero (the i.e. corresponding number of the fingerprint
According to piece not by any data referencing) when, control node equipment 5 records the fingerprint and keeps the state that accumulative reference count is zero
Duration.
When the duration is greater than preset duration, the fingerprint is deleted, and corresponding memory node equipment 1 is notified to delete
Except the fingerprint and the corresponding data slice of the fingerprint.
When the duration is less than or equal to preset duration, do not make delete processing.
In the present embodiment, when the control node equipment 5 detects that the accumulative reference count of a fingerprint is zero, it waits for a preset duration before deleting the data slice corresponding to the fingerprint. During the preset duration, it receives in real time the reference count changing values of the fingerprint reported by each memory node equipment 1, so as to avoid accidental deletion of data caused by a memory node equipment 1 reporting a reference count changing value late.
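The delayed deletion described above can be sketched in Python as follows. The class `ZeroCountTracker` and its method names are hypothetical, and time is passed in explicitly as a number so that the example stays self-contained. The description does not say how the control node equipment 5 measures the duration.

```python
class ZeroCountTracker:
    """Hypothetical helper that tracks how long a count has stayed at zero."""

    def __init__(self, preset_duration):
        self.preset_duration = preset_duration
        # fingerprint -> time at which its accumulative count became zero
        self.zero_since = {}

    def observe(self, fingerprint, count, now):
        """Record the latest accumulative reference count of a fingerprint."""
        if count == 0:
            # Keep the earliest time at which the zero state was entered.
            self.zero_since.setdefault(fingerprint, now)
        else:
            # A late report raised the count again: cancel the pending deletion.
            self.zero_since.pop(fingerprint, None)

    def expired(self, now):
        """Return fingerprints whose zero state lasted longer than the preset."""
        due = [fp for fp, since in self.zero_since.items()
               if now - since > self.preset_duration]
        for fp in due:
            del self.zero_since[fp]
        return due
```

A fingerprint returned by `expired` is one for which the control node would delete the fingerprint and notify the corresponding memory node equipment. A fingerprint whose count rises above zero within the preset duration is never returned.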
Further, in the present embodiment, the method also includes step S90 (not shown).
Step S90: when receiving a read request for data to be read, the memory node equipment 1 obtains the data slice fingerprint sequence of the data to be read, obtains the storage location information of the data slice corresponding to each fingerprint in the data slice fingerprint sequence, obtains the data slice corresponding to each fingerprint in the data slice fingerprint sequence according to the obtained storage location information, and then assembles the obtained data slices into the data to be read according to the data slice fingerprint sequence.
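As an illustration of step S90, the following Python sketch reassembles data from an ordered data slice fingerprint sequence. The function `read_data`, the `locations` mapping, and the `fetch_slice` callback are hypothetical names. The form of the storage location information is an assumption, and the small cache for repeated fingerprints is a design choice of the sketch rather than something stated in the description.

```python
def read_data(fingerprint_sequence, locations, fetch_slice):
    """Reassemble data from its ordered data slice fingerprint sequence.

    fingerprint_sequence: fingerprints in the order of the original data.
    locations: fingerprint -> storage location information (format assumed).
    fetch_slice: callable that returns the bytes stored at a location.
    """
    cache = {}
    parts = []
    for fingerprint in fingerprint_sequence:
        if fingerprint not in cache:
            # A deduplicated data slice is fetched once, however often it recurs.
            cache[fingerprint] = fetch_slice(locations[fingerprint])
        parts.append(cache[fingerprint])
    return b"".join(parts)
```

Because the fingerprint sequence preserves order and repetition, the data can be rebuilt even though each distinct data slice is stored only once.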
Further, the present invention also proposes a computer readable storage medium on which the data deduplication program 10 is stored. The embodiments of the data deduplication program 10 have been described in detail above and are not repeated here.
The above is only a preferred embodiment of the present invention and is not intended to limit the patent scope of the invention. Any equivalent structural transformation made using the contents of the description and drawings under the inventive concept of the present invention, or any direct or indirect application in other related technical fields, is included in the scope of patent protection of the present invention.
Claims (10)
1. A distributed memory system, characterized in that the distributed memory system includes multiple memory node equipment and several shared fingerprint bases, the memory node equipment is communicatively connected to the shared fingerprint base, and a local fingerprint base is provided in the memory node equipment, or the memory node equipment is communicatively connected to a corresponding local fingerprint base, the memory node equipment being used for:
receiving a data slice write request, the data slice write request including several data slices to be written and the fingerprint of each data slice to be written;
determining the fingerprints to be deduplicated among the fingerprints of the data slices to be written, and looking up in the local fingerprint base whether each fingerprint to be deduplicated exists, the local fingerprint base including the fingerprints of the data slices stored in the memory node equipment;
when one or more fingerprints to be deduplicated exist in the local fingerprint base, deleting the data slices to be written corresponding to the one or more fingerprints to be deduplicated;
when one or more fingerprints to be deduplicated do not exist in the local fingerprint base, taking the one or more fingerprints to be deduplicated as fingerprints to be processed, and looking up each fingerprint to be processed in the shared fingerprint base, the shared fingerprint base including the fingerprints of the data slices stored in all memory node equipment; and when one or more fingerprints to be processed are found in the shared fingerprint base, deleting the data slices to be written corresponding to the one or more fingerprints to be processed that were found.
2. The distributed memory system according to claim 1, characterized in that the data slices to be written are obtained by cutting data to be written, the data slice write request further includes a data slice fingerprint sequence, and the data slice fingerprint sequence includes the fingerprints of the data slices to be written arranged in order;
determining the fingerprints to be deduplicated among the fingerprints of the data slices to be written includes: judging whether redundant fingerprints exist among the fingerprints of the data slices to be written; if so, deleting the redundant fingerprints and taking the remaining fingerprints as the fingerprints to be deduplicated; if not, taking the fingerprints of all the data slices to be written as the fingerprints to be deduplicated;
the memory node equipment is also used for:
saving the data slice fingerprint sequence into the local fingerprint base and the shared fingerprint base;
the memory node equipment is also used for:
saving all remaining data slices to be written, and saving the storage location information corresponding to the remaining data slices to be written into the local fingerprint base and the shared fingerprint base.
3. The distributed memory system according to claim 1 or 2, characterized in that the distributed memory system further includes control node equipment communicatively connected to each memory node equipment and to the shared fingerprint base, and the memory node equipment is also used for:
determining the reference count changing value of the fingerprint of each data slice to be written, and sending the reference count changing value of the fingerprint of each data slice to be written to the control node equipment;
the control node equipment is used for:
updating the accumulative reference count of the fingerprint of each data slice to be written according to the reference count changing value of the fingerprint of each data slice to be written.
4. The distributed memory system according to claim 2, characterized in that the memory node equipment is also used for:
receiving a deletion request for data to be deleted;
obtaining the data slice fingerprint sequence of the data to be deleted, determining the reference count changing value of each fingerprint in the obtained data slice fingerprint sequence, and sending the reference count changing value of each fingerprint in the data slice fingerprint sequence to the control node equipment;
the control node equipment is also used for:
updating the accumulative reference count of each fingerprint in the data slice fingerprint sequence according to the reference count changing value of each fingerprint in the data slice fingerprint sequence, deleting the data slice fingerprint sequence of the data to be deleted from the shared fingerprint base, and notifying the memory node equipment to delete the data slice fingerprint sequence of the data to be deleted from the local fingerprint base;
when it is detected that the accumulative reference count of a fingerprint in the shared fingerprint base is zero, recording the duration for which the fingerprint remains in the state where the accumulative reference count is zero, and when the duration is greater than a preset duration, deleting the fingerprint and notifying the corresponding memory node equipment to delete the fingerprint and the data slice corresponding to the fingerprint.
5. A data duplicate removal method, applicable to a distributed memory system, characterized in that the distributed memory system includes multiple memory node equipment and several shared fingerprint bases, the memory node equipment is communicatively connected to the shared fingerprint base, and a local fingerprint base is provided in the memory node equipment, or the memory node equipment is communicatively connected to a corresponding local fingerprint base, the method including the steps of:
receiving step: the memory node equipment receives a data slice write request, the data slice write request including several data slices to be written and the fingerprint of each data slice to be written;
query step: the memory node equipment determines the fingerprints to be deduplicated among the fingerprints of the data slices to be written, and looks up in the local fingerprint base whether each fingerprint to be deduplicated exists, the local fingerprint base including the fingerprints of the data slices stored in the memory node equipment;
first duplicate removal step: when one or more fingerprints to be deduplicated exist in the local fingerprint base, the memory node equipment deletes the data slices to be written corresponding to the one or more fingerprints to be deduplicated;
second duplicate removal step: when one or more fingerprints to be deduplicated do not exist in the local fingerprint base, the memory node equipment takes the one or more fingerprints to be deduplicated as fingerprints to be processed, and looks up each fingerprint to be processed in the shared fingerprint base, the shared fingerprint base including the fingerprints of the data slices stored in all memory node equipment; and when one or more fingerprints to be processed are found in the shared fingerprint base, deletes the data slices to be written corresponding to the one or more fingerprints to be processed that were found.
6. The data duplicate removal method according to claim 5, characterized in that the data slices to be written are obtained by cutting data to be written, the data slice write request further includes a data slice fingerprint sequence, and the data slice fingerprint sequence includes the fingerprints of the data slices to be written arranged in order;
the step of determining the fingerprints to be deduplicated among the fingerprints of the data slices to be written includes: judging whether redundant fingerprints exist among the fingerprints of the data slices to be written; if so, deleting the redundant fingerprints and taking the remaining fingerprints as the fingerprints to be deduplicated; if not, taking the fingerprints of all the data slices to be written as the fingerprints to be deduplicated;
after the receiving step, the method further includes:
the memory node equipment saves the data slice fingerprint sequence into the local fingerprint base and the shared fingerprint base;
after the second duplicate removal step, the method further includes:
the memory node equipment saves all remaining data slices to be written, and saves the storage location information corresponding to the remaining data slices to be written into the local fingerprint base and the shared fingerprint base.
7. The data duplicate removal method according to claim 5 or 6, characterized in that the distributed memory system further includes control node equipment communicatively connected to each memory node equipment and to the shared fingerprint base, and after the receiving step, the method further includes:
the memory node equipment determines the reference count changing value of the fingerprint of each data slice to be written, and sends the reference count changing value of the fingerprint of each data slice to be written to the control node equipment;
the control node equipment updates the accumulative reference count of the fingerprint of each data slice to be written according to the reference count changing value of the fingerprint of each data slice to be written.
8. The data duplicate removal method according to claim 6, characterized in that the method further includes:
the memory node equipment receives a deletion request for data to be deleted;
the memory node equipment obtains the data slice fingerprint sequence of the data to be deleted, determines the reference count changing value of each fingerprint in the obtained data slice fingerprint sequence, and sends the reference count changing value of each fingerprint in the data slice fingerprint sequence to the control node equipment;
the control node equipment updates the accumulative reference count of each fingerprint in the data slice fingerprint sequence according to the reference count changing value of each fingerprint in the data slice fingerprint sequence, deletes the data slice fingerprint sequence of the data to be deleted from the shared fingerprint base, and notifies the memory node equipment to delete the data slice fingerprint sequence of the data to be deleted from the local fingerprint base;
when it is detected that the accumulative reference count of a fingerprint in the shared fingerprint base is zero, the control node equipment records the duration for which the fingerprint remains in the state where the accumulative reference count is zero, and when the duration is greater than a preset duration, deletes the fingerprint and notifies the corresponding memory node equipment to delete the fingerprint and the data slice corresponding to the fingerprint.
9. A memory node equipment, characterized in that the memory node equipment is communicatively connected to a shared fingerprint base, and a local fingerprint base is provided in the memory node equipment, or the memory node equipment is communicatively connected to a corresponding local fingerprint base; the memory node equipment includes a memory and a processor, a data deduplication program is stored on the memory, and the data deduplication program, when executed by the processor, implements the following steps:
receiving step: receiving a data slice write request, the data slice write request including several data slices to be written and the fingerprint of each data slice to be written;
query step: determining the fingerprints to be deduplicated among the fingerprints of the data slices to be written, and looking up in the local fingerprint base whether each fingerprint to be deduplicated exists, the local fingerprint base including the fingerprints of the data slices stored in the memory node equipment;
first duplicate removal step: when one or more fingerprints to be deduplicated exist in the local fingerprint base, deleting the data slices to be written corresponding to the one or more fingerprints to be deduplicated;
second duplicate removal step: when one or more fingerprints to be deduplicated do not exist in the local fingerprint base, taking the one or more fingerprints to be deduplicated as fingerprints to be processed, and looking up each fingerprint to be processed in the shared fingerprint base, the shared fingerprint base including the fingerprints of the data slices stored in all memory node equipment; and when one or more fingerprints to be processed are found in the shared fingerprint base, deleting the data slices to be written corresponding to the one or more fingerprints to be processed that were found.
10. A computer readable storage medium, applicable to a memory node equipment, characterized in that the memory node equipment is communicatively connected to a shared fingerprint base, and a local fingerprint base is provided in the memory node equipment, or the memory node equipment is communicatively connected to a corresponding local fingerprint base; the computer readable storage medium stores a data deduplication program, and the data deduplication program can be executed by at least one processor so that the at least one processor performs the following steps:
receiving step: receiving a data slice write request, the data slice write request including several data slices to be written and the fingerprint of each data slice to be written;
query step: determining the fingerprints to be deduplicated among the fingerprints of the data slices to be written, and looking up in the local fingerprint base whether each fingerprint to be deduplicated exists, the local fingerprint base including the fingerprints of the data slices stored in the memory node equipment;
first duplicate removal step: when one or more fingerprints to be deduplicated exist in the local fingerprint base, deleting the data slices to be written corresponding to the one or more fingerprints to be deduplicated;
second duplicate removal step: when one or more fingerprints to be deduplicated do not exist in the local fingerprint base, taking the one or more fingerprints to be deduplicated as fingerprints to be processed, and looking up each fingerprint to be processed in the shared fingerprint base, the shared fingerprint base including the fingerprints of the data slices stored in all memory node equipment; and when one or more fingerprints to be processed are found in the shared fingerprint base, deleting the data slices to be written corresponding to the one or more fingerprints to be processed that were found.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910007367.9A CN109800218B (en) | 2019-01-04 | 2019-01-04 | Distributed storage system, storage node device and data deduplication method |
PCT/CN2019/118009 WO2020140622A1 (en) | 2019-01-04 | 2019-11-13 | Distributed storage system, storage node device and data duplicate deletion method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910007367.9A CN109800218B (en) | 2019-01-04 | 2019-01-04 | Distributed storage system, storage node device and data deduplication method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109800218A true CN109800218A (en) | 2019-05-24 |
CN109800218B CN109800218B (en) | 2024-04-09 |
Family
ID=66558525
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910007367.9A Active CN109800218B (en) | 2019-01-04 | 2019-01-04 | Distributed storage system, storage node device and data deduplication method |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN109800218B (en) |
WO (1) | WO2020140622A1 (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8495392B1 (en) * | 2010-09-02 | 2013-07-23 | Symantec Corporation | Systems and methods for securely deduplicating data owned by multiple entities |
CN103942292A (en) * | 2014-04-11 | 2014-07-23 | 华为技术有限公司 | Virtual machine mirror image document processing method, device and system |
CN103944988A (en) * | 2014-04-22 | 2014-07-23 | 南京邮电大学 | Repeating data deleting system and method applicable to cloud storage |
WO2015176249A1 (en) * | 2014-05-21 | 2015-11-26 | 华为技术有限公司 | Transmission method for wireless ethernet interface hard disk, related device, and system |
CN107391761A (en) * | 2017-08-28 | 2017-11-24 | 郑州云海信息技术有限公司 | A kind of data managing method and device based on data de-duplication technology |
US20180052846A1 (en) * | 2016-08-22 | 2018-02-22 | Kabushiki Kaisha Toshiba | Data processing method, data processing device, storage system, and method for controlling storage system |
CN108008918A (en) * | 2017-11-30 | 2018-05-08 | 联想(北京)有限公司 | Data processing method, memory node and distributed memory system |
CN108415669A (en) * | 2018-03-15 | 2018-08-17 | 深信服科技股份有限公司 | The data duplicate removal method and device of storage system, computer installation and storage medium |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107229420B (en) * | 2017-05-27 | 2020-05-26 | 苏州浪潮智能科技有限公司 | Data storage method, reading method, deleting method and data operating system |
CN109800218B (en) * | 2019-01-04 | 2024-04-09 | 平安科技(深圳)有限公司 | Distributed storage system, storage node device and data deduplication method |
-
2019
- 2019-01-04 CN CN201910007367.9A patent/CN109800218B/en active Active
- 2019-11-13 WO PCT/CN2019/118009 patent/WO2020140622A1/en active Application Filing
Non-Patent Citations (1)
Title |
---|
胡渝苹: "文件秒传系统在云存储环境下的设计与实现", 计算机应用与软件, no. 04, 15 April 2016 (2016-04-15), pages 335 - 339 * |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020140622A1 (en) * | 2019-01-04 | 2020-07-09 | 平安科技(深圳)有限公司 | Distributed storage system, storage node device and data duplicate deletion method |
CN110457305A (en) * | 2019-08-13 | 2019-11-15 | 腾讯科技(深圳)有限公司 | Data duplicate removal method, device, equipment and medium |
CN110457305B (en) * | 2019-08-13 | 2021-11-26 | 腾讯科技(深圳)有限公司 | Data deduplication method, device, equipment and medium |
CN111399768A (en) * | 2020-02-21 | 2020-07-10 | 苏州浪潮智能科技有限公司 | Data storage method, system, equipment and computer readable storage medium |
CN111459928A (en) * | 2020-03-27 | 2020-07-28 | 上海爱数信息技术股份有限公司 | Data deduplication method applied to data backup scene in cluster range and application |
CN111459928B (en) * | 2020-03-27 | 2023-07-07 | 上海爱数信息技术股份有限公司 | Data deduplication method applied to data backup scene in cluster range and application |
CN111580755A (en) * | 2020-05-09 | 2020-08-25 | 杭州海康威视系统技术有限公司 | Distributed data processing system and distributed data processing method |
CN111580755B (en) * | 2020-05-09 | 2022-07-05 | 杭州海康威视系统技术有限公司 | Distributed data processing system and distributed data processing method |
WO2022048475A1 (en) * | 2020-09-03 | 2022-03-10 | 中兴通讯股份有限公司 | Data deduplication method, node, and computer readable storage medium |
CN114442931A (en) * | 2021-12-23 | 2022-05-06 | 天翼云科技有限公司 | Data deduplication method and system, electronic device and storage medium |
CN117369731A (en) * | 2023-12-07 | 2024-01-09 | 苏州元脑智能科技有限公司 | Data reduction processing method, device, equipment and medium |
CN117369731B (en) * | 2023-12-07 | 2024-02-27 | 苏州元脑智能科技有限公司 | Data reduction processing method, device, equipment and medium |
Also Published As
Publication number | Publication date |
---|---|
CN109800218B (en) | 2024-04-09 |
WO2020140622A1 (en) | 2020-07-09 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||