CN109800218A - Distributed memory system, memory node equipment and data duplicate removal method - Google Patents

Publication number
CN109800218A
Authority
CN
China
Prior art keywords
fingerprint
data
data slice
written
node equipment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910007367.9A
Other languages
Chinese (zh)
Other versions
CN109800218B (en)
Inventor
宋小兵
姜文峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN201910007367.9A (patent CN109800218B)
Publication of CN109800218A
Priority to PCT/CN2019/118009 (WO2020140622A1)
Application granted
Publication of CN109800218B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to distributed storage technology and discloses a distributed memory system, memory node equipment and a data deduplication method. When a memory node equipment of the invention performs data deduplication and does not find the fingerprint of a data slice to be written in its local fingerprint base, it can query the shared fingerprint base directly to determine whether the fingerprint is a duplicate, without communicating with the other memory node equipment one by one. This improves the data deduplication efficiency of the distributed memory system.

Description

Distributed memory system, memory node equipment and data duplicate removal method
Technical field
The present invention relates to the technical field of distributed storage, and in particular to a distributed memory system, memory node equipment, a data deduplication method and a computer readable storage medium.
Background
Data deduplication, also known as duplicate data deletion, is a technology applied in storage systems to globally identify and eliminate redundant data, and it has become a hot topic in storage system research in recent years. Data deduplication uniquely identifies a data block by computing a secure hash digest of the block (such as a SHA1 fingerprint), which avoids matching the data character by character. The storage system only needs to maintain an index table of secure hash digests to identify duplicate data quickly and easily, so the approach scales well. For duplicate data content, only the corresponding data pointer information needs to be recorded, which saves storage space. Data deduplication can therefore save a great deal of storage space and improve the resource utilization of storage equipment.
At present, in a distributed memory system, the process by which a memory node deduplicates a data slice generally includes the following steps: calculate the fingerprint of the data slice, then query whether the fingerprint exists in the fingerprint base of the memory node; if it does not, query whether the fingerprint exists in the fingerprint bases of the other memory nodes in the distributed memory system, and thereby confirm whether the data slice already exists in the distributed memory system. The drawback of this method is that a distributed memory system usually contains many memory nodes, so a memory node that needs to query a fingerprint in the fingerprint bases of multiple other memory nodes has to communicate with those nodes one by one, which is slow and inefficient.
Therefore, how to improve the deduplication efficiency of distributed memory systems has become an urgent problem to be solved.
Summary of the invention
The main object of the present invention is to provide a distributed memory system, memory node equipment, a data deduplication method and a computer readable storage medium, with the aim of improving the deduplication efficiency of distributed memory systems.
To achieve the above object, the present invention proposes a distributed memory system. The distributed memory system includes multiple memory node equipment and several shared fingerprint bases, and the memory node equipment is communicatively connected to the shared fingerprint bases. A local fingerprint base is provided in the memory node equipment, or the memory node equipment is communicatively connected to a corresponding local fingerprint base. The memory node equipment is used for:
Receiving a data slice write request, where the data slice write request includes several data slices to be written and the fingerprint of each data slice to be written;
Determining the fingerprints to be deduplicated among the fingerprints of the data slices to be written, and searching the local fingerprint base to check whether each fingerprint to be deduplicated exists, where the local fingerprint base includes the fingerprints of the data slices stored in the memory node equipment;
When one or more fingerprints to be deduplicated exist in the local fingerprint base, deleting the data slices to be written that correspond to those fingerprints;
When one or more fingerprints to be deduplicated do not exist in the local fingerprint base, taking those fingerprints as fingerprints to be processed and searching for each fingerprint to be processed in the shared fingerprint base, where the shared fingerprint base includes the fingerprints of the data slices stored in all memory node equipment; and when one or more fingerprints to be processed are found in the shared fingerprint base, deleting the data slices to be written that correspond to the fingerprints to be processed that were found.
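The two-stage lookup described above can be sketched roughly as follows. This is a minimal illustration rather than the patented implementation: the function name `deduplicate` is hypothetical, and both fingerprint bases are modelled as in-memory Python sets, whereas a real shared fingerprint base would be reached over the network.

```python
def deduplicate(slices: dict[str, bytes],
                local_base: set[str],
                shared_base: set[str]) -> dict[str, bytes]:
    """Return the data slices that still have to be stored.

    `slices` maps each fingerprint to be deduplicated to its data slice.
    Both bases are plain sets here, which is an assumption of this sketch.
    """
    remaining = dict(slices)
    to_process = []
    # Stage 1: fingerprints already known to this memory node.
    for fp in slices:
        if fp in local_base:
            del remaining[fp]          # duplicate of a locally stored slice
        else:
            to_process.append(fp)      # becomes a fingerprint to be processed
    # Stage 2: a lookup in the shared base replaces node-by-node queries.
    for fp in to_process:
        if fp in shared_base:
            del remaining[fp]          # duplicate of a slice on another node
    return remaining


remaining = deduplicate(
    {"fp-a": b"1", "fp-b": b"2", "fp-c": b"3"},
    local_base={"fp-a"},
    shared_base={"fp-a", "fp-b"},
)
```

In this usage, only the slice for `fp-c` is left to store, because `fp-a` is found locally and `fp-b` is found in the shared base.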
Preferably, the data slices to be written are obtained by cutting the data to be written, and the data slice write request further includes a data slice fingerprint sequence, where the data slice fingerprint sequence includes the fingerprints of the data slices to be written, arranged in order;
Determining the fingerprints to be deduplicated among the fingerprints of the data slices to be written includes: judging whether there are redundant fingerprints among the fingerprints of the data slices to be written; if so, deleting the redundant fingerprints and taking the remaining fingerprints as the fingerprints to be deduplicated; if not, taking the fingerprints of all the data slices to be written as the fingerprints to be deduplicated;
The memory node equipment is also used for:
Saving the data slice fingerprint sequence into the local fingerprint base and the shared fingerprint base;
The memory node equipment is also used for:
Saving all remaining data slices to be written, and saving the storage location information corresponding to the remaining data slices to be written into the local fingerprint base and the shared fingerprint base.
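As a rough sketch of the bookkeeping above, the snippet below saves the fingerprint sequence and the storage locations of the remaining slices into both bases. The layout of each base (a dict with `sequences` and `locations` entries), the `data_id` key and the `(node_id, offset)` location format are all assumptions made for illustration; the text does not specify them.

```python
def commit_write(data_id: str,
                 fingerprint_sequence: list[str],
                 remaining: dict[str, bytes],
                 node_id: str,
                 disk: list[bytes],
                 local_base: dict,
                 shared_base: dict) -> None:
    """Persist the sequence and the remaining slices (hypothetical layout)."""
    for base in (local_base, shared_base):
        # The ordered sequence is what later allows the data to be reassembled.
        base["sequences"][data_id] = list(fingerprint_sequence)
    for fp, payload in remaining.items():
        disk.append(payload)                     # stands in for real storage
        location = (node_id, len(disk) - 1)      # assumed location format
        local_base["locations"][fp] = location
        shared_base["locations"][fp] = location


local_base = {"sequences": {}, "locations": {}}
shared_base = {"sequences": {}, "locations": {}}
disk: list[bytes] = []
commit_write("file-1", ["fp1", "fp2", "fp1"], {"fp2": b"new"},
             "node-1", disk, local_base, shared_base)
```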
Preferably, the distributed memory system further includes a control node equipment communicatively connected to each memory node equipment and to the shared fingerprint base, and the memory node equipment is also used for:
Determining the reference count change value of the fingerprint of each data slice to be written, and sending these reference count change values to the control node equipment;
The control node equipment is used for:
Updating the accumulated reference count of the fingerprint of each data slice to be written according to the reference count change value of that fingerprint.
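One way to picture the reference counting is shown below. The text does not say how a change value is computed, so this sketch assumes that every occurrence of a fingerprint in a write request adds one reference; the names `reference_count_changes` and `ControlNode` are hypothetical.

```python
from collections import Counter


def reference_count_changes(fingerprint_sequence: list[str]) -> Counter:
    """Change value per fingerprint: +1 per occurrence (an assumption)."""
    return Counter(fingerprint_sequence)


class ControlNode:
    """Keeps the accumulated reference count of every fingerprint."""

    def __init__(self) -> None:
        self.accumulated: Counter = Counter()

    def apply(self, changes: dict[str, int]) -> None:
        # Add each change value to the running total for that fingerprint.
        for fp, delta in changes.items():
            self.accumulated[fp] += delta


control = ControlNode()
control.apply(reference_count_changes(["fp1", "fp2", "fp1"]))
control.apply(reference_count_changes(["fp2", "fp3"]))
```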
Preferably, the memory node equipment is also used to:
Receive the removal request of a data to be deleted;
The data slice fingerprint sequence of the data to be deleted is obtained, determines and respectively refers in the data slice fingerprint sequence obtained The reference count changing value of line, and the reference count changing value of each fingerprint in the data slice fingerprint sequence is sent to control node Equipment;
The control node equipment is also used to:
According to the reference count changing value of each fingerprint in the data slice fingerprint sequence, the data slice fingerprint sequence is updated In each fingerprint accumulative reference count, and the data slice fingerprint sequence of the data to be deleted is deleted from the shared fingerprint base It removes, and the memory node equipment is notified to delete the data slice fingerprint sequence of the data to be deleted from local fingerprint base;
When detecting the accumulative reference count of a fingerprint in the shared fingerprint base is zero, records the fingerprint and keep The duration for the state that accumulative reference count is zero deletes the fingerprint when the duration is greater than preset duration, and Corresponding memory node equipment is notified to delete the fingerprint and the corresponding data slice of the fingerprint.
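The delayed deletion of zero-count fingerprints might look like the sketch below. It is illustrative only: the class name `GracePeriodControlNode` is hypothetical, timestamps are passed in explicitly as plain numbers so the behaviour is easy to follow, and the notification to the memory node equipment is represented simply by the list that `sweep` returns.

```python
class GracePeriodControlNode:
    """Deletes a fingerprint only after its count has stayed at zero
    for longer than a preset duration (names here are hypothetical)."""

    def __init__(self, preset_duration: float) -> None:
        self.preset_duration = preset_duration
        self.accumulated: dict[str, int] = {}
        self.zero_since: dict[str, float] = {}   # fingerprint -> time it hit zero

    def apply(self, changes: dict[str, int], now: float) -> None:
        for fp, delta in changes.items():
            count = self.accumulated.get(fp, 0) + delta
            self.accumulated[fp] = count
            if count == 0:
                self.zero_since.setdefault(fp, now)   # start timing
            else:
                self.zero_since.pop(fp, None)         # referenced again

    def sweep(self, now: float) -> list[str]:
        """Return the fingerprints whose zero state has outlasted the limit."""
        expired = [fp for fp, since in self.zero_since.items()
                   if now - since > self.preset_duration]
        for fp in expired:
            del self.zero_since[fp]
            del self.accumulated[fp]
        return expired   # the caller would notify the memory nodes


node = GracePeriodControlNode(preset_duration=10)
node.apply({"fp1": 1, "fp2": 1}, now=0)
node.apply({"fp1": -1}, now=5)        # fp1 drops to zero at t=5
early = node.sweep(now=10)            # only 5 time units at zero
late = node.sweep(now=16)             # 11 time units at zero
```

Waiting before deleting means a fingerprint that is referenced again shortly after reaching zero is kept, because `apply` clears its timer.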
In addition, to achieve the above object, the present invention also proposes a data deduplication method suitable for a distributed memory system. The distributed memory system includes multiple memory node equipment and several shared fingerprint bases, and the memory node equipment is communicatively connected to the shared fingerprint bases. A local fingerprint base is provided in the memory node equipment, or the memory node equipment is communicatively connected to a corresponding local fingerprint base. The method includes the steps of:
Receiving step: the memory node equipment receives a data slice write request, where the data slice write request includes several data slices to be written and the fingerprint of each data slice to be written;
Query step: the memory node equipment determines the fingerprints to be deduplicated among the fingerprints of the data slices to be written, and searches the local fingerprint base to check whether each fingerprint to be deduplicated exists, where the local fingerprint base includes the fingerprints of the data slices stored in the memory node equipment;
First deduplication step: when one or more fingerprints to be deduplicated exist in the local fingerprint base, the memory node equipment deletes the data slices to be written that correspond to those fingerprints;
Second deduplication step: when one or more fingerprints to be deduplicated do not exist in the local fingerprint base, the memory node equipment takes those fingerprints as fingerprints to be processed and searches for each fingerprint to be processed in the shared fingerprint base, where the shared fingerprint base includes the fingerprints of the data slices stored in all memory node equipment; when one or more fingerprints to be processed are found in the shared fingerprint base, the data slices to be written that correspond to the fingerprints to be processed that were found are deleted.
Preferably, the data slices to be written are obtained by cutting the data to be written, and the data slice write request further includes a data slice fingerprint sequence, where the data slice fingerprint sequence includes the fingerprints of the data slices to be written, arranged in order;
The step of determining the fingerprints to be deduplicated among the fingerprints of the data slices to be written includes: judging whether there are redundant fingerprints among the fingerprints of the data slices to be written; if so, deleting the redundant fingerprints and taking the remaining fingerprints as the fingerprints to be deduplicated; if not, taking the fingerprints of all the data slices to be written as the fingerprints to be deduplicated;
After the receiving step, the method also includes:
The memory node equipment saves the data slice fingerprint sequence into the local fingerprint base and the shared fingerprint base;
After the second deduplication step, the method also includes:
The memory node equipment saves all remaining data slices to be written, and saves the storage location information corresponding to the remaining data slices to be written into the local fingerprint base and the shared fingerprint base.
Preferably, the distributed memory system further includes a control node equipment communicatively connected to each memory node equipment and to the shared fingerprint base, and after the receiving step, the method also includes:
The memory node equipment determines the reference count change value of the fingerprint of each data slice to be written, and sends these reference count change values to the control node equipment;
The control node equipment updates the accumulated reference count of the fingerprint of each data slice to be written according to the reference count change value of that fingerprint.
Preferably, the method also includes:
The memory node equipment receives a deletion request for data to be deleted;
The memory node equipment obtains the data slice fingerprint sequence of the data to be deleted, determines the reference count change value of each fingerprint in the obtained data slice fingerprint sequence, and sends the reference count change value of each fingerprint in the data slice fingerprint sequence to the control node equipment;
The control node equipment updates the accumulated reference count of each fingerprint in the data slice fingerprint sequence according to the reference count change value of each fingerprint in the sequence, deletes the data slice fingerprint sequence of the data to be deleted from the shared fingerprint base, and notifies the memory node equipment to delete the data slice fingerprint sequence of the data to be deleted from the local fingerprint base;
When detecting that the accumulated reference count of a fingerprint in the shared fingerprint base is zero, the control node equipment records how long the fingerprint remains in the state in which its accumulated reference count is zero; when that duration is greater than a preset duration, it deletes the fingerprint and notifies the corresponding memory node equipment to delete the fingerprint and the data slice corresponding to the fingerprint.
In addition, to achieve the above object, the present invention also proposes a memory node equipment. The memory node equipment is communicatively connected to a shared fingerprint base. A local fingerprint base is provided in the memory node equipment, or the memory node equipment is communicatively connected to a corresponding local fingerprint base. The memory node equipment includes a memory and a processor, a data deduplication program is stored on the memory, and the data deduplication program implements the following steps when executed by the processor:
Receiving step: receiving a data slice write request, where the data slice write request includes several data slices to be written and the fingerprint of each data slice to be written;
Query step: determining the fingerprints to be deduplicated among the fingerprints of the data slices to be written, and searching the local fingerprint base to check whether each fingerprint to be deduplicated exists, where the local fingerprint base includes the fingerprints of the data slices stored in the memory node equipment;
First deduplication step: when one or more fingerprints to be deduplicated exist in the local fingerprint base, deleting the data slices to be written that correspond to those fingerprints;
Second deduplication step: when one or more fingerprints to be deduplicated do not exist in the local fingerprint base, taking those fingerprints as fingerprints to be processed and searching for each fingerprint to be processed in the shared fingerprint base, where the shared fingerprint base includes the fingerprints of the data slices stored in all memory node equipment; when one or more fingerprints to be processed are found in the shared fingerprint base, deleting the data slices to be written that correspond to the fingerprints to be processed that were found.
Preferably, the data slices to be written are obtained by cutting the data to be written, and the data slice write request further includes a data slice fingerprint sequence, where the data slice fingerprint sequence includes the fingerprints of the data slices to be written, arranged in order;
The step of determining the fingerprints to be deduplicated among the fingerprints of the data slices to be written includes: judging whether there are redundant fingerprints among the fingerprints of the data slices to be written; if so, deleting the redundant fingerprints and taking the remaining fingerprints as the fingerprints to be deduplicated; if not, taking the fingerprints of all the data slices to be written as the fingerprints to be deduplicated;
When the processor executes the data deduplication program, the following step is also implemented after the receiving step:
Saving the data slice fingerprint sequence into the local fingerprint base and the shared fingerprint base;
When the processor executes the data deduplication program, the following step is also implemented after the second deduplication step:
Saving all remaining data slices to be written, and saving the storage location information corresponding to the remaining data slices to be written into the local fingerprint base and the shared fingerprint base.
Preferably, the distributed memory system further includes a control node equipment communicatively connected to each memory node equipment and to the shared fingerprint base, and when the processor executes the data deduplication program, the following step is also implemented after the receiving step:
Determining the reference count change value of the fingerprint of each data slice to be written, and sending these reference count change values to the control node equipment, so that the control node equipment updates the accumulated reference count of the fingerprint of each data slice to be written according to the reference count change value of that fingerprint.
In addition, to achieve the above object, the present invention also proposes a computer readable storage medium suitable for a memory node equipment. The memory node equipment is communicatively connected to a shared fingerprint base. A local fingerprint base is provided in the memory node equipment, or the memory node equipment is communicatively connected to a corresponding local fingerprint base. The computer readable storage medium stores a data deduplication program, and the data deduplication program can be executed by at least one processor so that the at least one processor performs the following steps:
Receiving step: receiving a data slice write request, where the data slice write request includes several data slices to be written and the fingerprint of each data slice to be written;
Query step: determining the fingerprints to be deduplicated among the fingerprints of the data slices to be written, and searching the local fingerprint base to check whether each fingerprint to be deduplicated exists, where the local fingerprint base includes the fingerprints of the data slices stored in the memory node equipment;
First deduplication step: when one or more fingerprints to be deduplicated exist in the local fingerprint base, deleting the data slices to be written that correspond to those fingerprints;
Second deduplication step: when one or more fingerprints to be deduplicated do not exist in the local fingerprint base, taking those fingerprints as fingerprints to be processed and searching for each fingerprint to be processed in the shared fingerprint base, where the shared fingerprint base includes the fingerprints of the data slices stored in all memory node equipment; when one or more fingerprints to be processed are found in the shared fingerprint base, deleting the data slices to be written that correspond to the fingerprints to be processed that were found.
Preferably, the data slices to be written are obtained by cutting the data to be written, and the data slice write request further includes a data slice fingerprint sequence, where the data slice fingerprint sequence includes the fingerprints of the data slices to be written, arranged in order;
The step of determining the fingerprints to be deduplicated among the fingerprints of the data slices to be written includes: judging whether there are redundant fingerprints among the fingerprints of the data slices to be written; if so, deleting the redundant fingerprints and taking the remaining fingerprints as the fingerprints to be deduplicated; if not, taking the fingerprints of all the data slices to be written as the fingerprints to be deduplicated;
When the processor executes the data deduplication program, the following step is also implemented after the receiving step:
Saving the data slice fingerprint sequence into the local fingerprint base and the shared fingerprint base;
When the processor executes the data deduplication program, the following step is also implemented after the second deduplication step:
Saving all remaining data slices to be written, and saving the storage location information corresponding to the remaining data slices to be written into the local fingerprint base and the shared fingerprint base.
Preferably, the distributed memory system further includes a control node equipment communicatively connected to each memory node equipment and to the shared fingerprint base, and when the processor executes the data deduplication program, the following step is also implemented after the receiving step:
Determining the reference count change value of the fingerprint of each data slice to be written, and sending these reference count change values to the control node equipment, so that the control node equipment updates the accumulated reference count of the fingerprint of each data slice to be written according to the reference count change value of that fingerprint.
Compared with the prior art, when a memory node equipment of the present embodiment performs data deduplication and does not find the fingerprint of a data slice to be written in its local fingerprint base, it can query the shared fingerprint base directly to determine whether the fingerprint is a duplicate, without communicating with the other memory node equipment one by one. This improves the data deduplication efficiency of the distributed memory system.
Brief description of the drawings
In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below are obviously only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from the structures shown in these drawings without creative effort.
Fig. 1 is a schematic diagram of the system architecture of an embodiment of the distributed memory system of the present invention;
Fig. 2 is a schematic diagram of the running environment of an embodiment of the data deduplication program of the present invention;
Fig. 3 is a program module diagram of an embodiment of the data deduplication program of the present invention;
Fig. 4 is a flow diagram of an embodiment of the data deduplication method of the present invention.
The realization of the object, the functions and the advantages of the present invention will be further described in connection with the embodiments and with reference to the drawings.
Detailed description of the embodiments
The principles and features of the present invention are described below with reference to the drawings. The examples given serve only to explain the present invention and are not intended to limit its scope.
Fig. 1 shows the system architecture of an embodiment of the distributed memory system of the present invention.
In the present embodiment, the distributed memory system includes multiple memory node equipment 1 and several shared fingerprint bases 2. The memory node equipment 1 is communicatively connected to the shared fingerprint base 2 (for example, through a network 4). A local fingerprint base 3 is provided in the memory node equipment 1, or the memory node equipment 1 is communicatively connected to a corresponding local fingerprint base 3. The local fingerprint base 3 includes the fingerprints of the data slices stored in the corresponding memory node equipment 1, and the shared fingerprint base 2 includes the fingerprints of the data slices stored in all memory node equipment 1.
The memory node equipment 1 is used for:
Receiving a data slice write request, where the data slice write request includes several data slices to be written and the fingerprint of each data slice to be written;
Determining the fingerprints to be deduplicated among the fingerprints of the data slices to be written, and searching the local fingerprint base 3 to check whether each fingerprint to be deduplicated exists;
When one or more fingerprints to be deduplicated exist in the local fingerprint base 3, deleting the data slices to be written that correspond to those fingerprints;
When one or more fingerprints to be deduplicated do not exist in the local fingerprint base 3, taking those fingerprints as fingerprints to be processed and searching for each fingerprint to be processed in the shared fingerprint base 2; when one or more fingerprints to be processed are found in the shared fingerprint base 2, deleting the data slices to be written that correspond to the fingerprints to be processed that were found.
In the present embodiment, the memory node equipment 1 receives a data slice write request, and the data slice write request includes several data slices to be written and the fingerprint of each data slice to be written. The data slices to be written are obtained by cutting the data to be written (the data types of the data to be written include block-level data and file-level data). The cutting can be performed by the memory node equipment 1 or by any other suitable equipment (for example, a client), and the cutting methods include:
The data file to be written is cut into a preset number of data slices of identical size. Alternatively, denoting the preset number as M, where M is a natural number greater than 1, the data slice size for cutting the data file to be written is determined, M-1 data blocks of identical size are cut off one by one according to the determined data slice size, and what remains after cutting is the M-th data block. The size of a data slice to be written can be 4KB, 8KB, 12KB, 16KB or another granularity.
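The second cutting method can be sketched as below: equal-sized slices are cut off one by one and whatever is left becomes the last slice. The function name `cut_into_slices` is hypothetical, and the sketch takes the slice size as a parameter (4 KB by default) because the text does not say how that size is determined.

```python
def cut_into_slices(data: bytes, slice_size: int = 4096) -> list[bytes]:
    """Cut data into slices of slice_size bytes; the last slice holds the rest."""
    if slice_size <= 0:
        raise ValueError("slice_size must be positive")
    # Stepping through the data by slice_size yields equal-sized slices,
    # followed by one shorter slice when the length is not an exact multiple.
    return [data[start:start + slice_size]
            for start in range(0, len(data), slice_size)]


payload = bytes(10000)
slices = cut_into_slices(payload, slice_size=4096)
sizes = [len(s) for s in slices]   # two full slices and one remainder
```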
After the data to be written has been cut into several data slices to be written, the fingerprint of each data slice to be written is calculated, for example by Message-Digest Algorithm 5 (MD5) or a Secure Hash Algorithm (SHA1). At the same time, the order of the data slices to be written is recorded (that is, the data slice fingerprint sequence), so that when the data is subsequently read, the data slices can be assembled into the data according to the data slice fingerprint sequence. In addition, the memory node equipment 1 can also save the data slice fingerprint sequence into the local fingerprint base 3 and the shared fingerprint base 2.
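As an illustration of how a fingerprint sequence supports later reassembly, the sketch below computes one fingerprint per slice with Python's standard `hashlib` module and rebuilds the data from a fingerprint-to-slice mapping. The function names and the use of a plain dict as the slice store are assumptions of this sketch.

```python
import hashlib


def fingerprint_sequence(slices: list[bytes], algorithm: str = "sha1") -> list[str]:
    """Return the fingerprints of the slices in their original order."""
    return [hashlib.new(algorithm, s).hexdigest() for s in slices]


def reassemble(sequence: list[str], store: dict[str, bytes]) -> bytes:
    """Rebuild the data by looking up each fingerprint in order."""
    return b"".join(store[fp] for fp in sequence)


slices = [b"ab", b"cd", b"ab"]
sequence = fingerprint_sequence(slices)
store = dict(zip(sequence, slices))    # duplicates collapse to one stored copy
restored = reassemble(sequence, store)
```

Although only two distinct slices are stored, the three-entry sequence is enough to restore the original six bytes, which is why the order must be recorded alongside the fingerprints.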
Then, the memory node equipment 1 determines the fingerprints to be deduplicated among the fingerprints of the data slices to be written. The method for determining the fingerprints to be deduplicated includes: judging whether redundant fingerprints exist among the fingerprints of the data slices to be written; if so, deleting the redundant fingerprints and taking the remaining fingerprints as the fingerprints to be deduplicated; if not, taking the fingerprints of all the data slices to be written as the fingerprints to be deduplicated.
For example, the memory node equipment 1 judges whether identical fingerprints exist among the fingerprints of all the data slices to be written. If identical fingerprints exist, identical fingerprints are placed in one fingerprint group; after all fingerprint groups have been found, one fingerprint in each group is selected and retained, and the unselected fingerprints are deleted as redundant fingerprints. The memory node equipment 1 then judges whether ungrouped fingerprints exist; if so, each ungrouped fingerprint is taken as a fingerprint to be deduplicated; if not, the process ends. If no identical fingerprints exist, the fingerprints of all the data slices to be written are treated as ungrouped fingerprints, and each ungrouped fingerprint is taken as a fingerprint to be deduplicated.
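A minimal sketch of removing redundant fingerprints within one write request follows. It assumes, in line with the general rule that the remaining fingerprints become the fingerprints to be deduplicated, that the one fingerprint retained from each group is kept alongside the ungrouped ones. The function name is hypothetical.

```python
def fingerprints_to_deduplicate(fingerprints: list[str]) -> list[str]:
    """Keep the first occurrence of each fingerprint and drop the redundant repeats."""
    # dict preserves insertion order, so the result keeps first-seen order.
    return list(dict.fromkeys(fingerprints))


print(fingerprints_to_deduplicate(["f1", "f2", "f1", "f3", "f2"]))  # ['f1', 'f2', 'f3']
```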
After the fingerprints to be deduplicated have been identified, the memory node equipment 1 searches the local fingerprint base 3 to determine whether each fingerprint to be deduplicated exists there. When one or more fingerprints to be deduplicated exist in the local fingerprint base 3, the memory node equipment 1 deletes the data slices to be written that correspond to those fingerprints. The presence of a fingerprint to be deduplicated in the local fingerprint base 3 indicates that the corresponding data slice to be written is a duplicate data slice, so these duplicate data slices are deleted to save storage space.
Finally, when one or more fingerprints to be deduplicated do not exist in the local fingerprint base 3, the memory node equipment 1 takes those fingerprints as fingerprints to be processed and searches for each fingerprint to be processed in the shared fingerprint base 2. When one or more fingerprints to be processed are found in the shared fingerprint base 2, the data slices to be written that correspond to the found fingerprints are deleted.
Because the shared fingerprint base 2 holds the full set of fingerprint data, if the memory node equipment 1 does not find a fingerprint to be deduplicated in the local fingerprint base 3, it goes on to query the shared fingerprint base 2 for that fingerprint. If the fingerprint exists there, the data slice to be written corresponding to it is determined to already exist on another memory node and is therefore a duplicate data slice, which no longer needs to be stored.
Compared with the prior art, when a memory node equipment 1 of this embodiment performs data deduplication and does not find the fingerprint of a data slice to be written in the local fingerprint base 3, it can query the shared fingerprint base 2 directly to determine whether the fingerprint is a duplicate, without communicating with the other memory node equipment 1 one by one. This improves the data deduplication efficiency of the distributed memory system.
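The two-level lookup described above can be sketched as follows. Plain Python sets stand in for the local fingerprint base 3 and the shared fingerprint base 2; a real deployment would query a database or a shared disk, and the function name is an assumption.

```python
def select_slices_to_store(candidates: list[str],
                           local_base: set[str],
                           shared_base: set[str]) -> list[str]:
    """Return fingerprints that are in neither fingerprint base, in input order."""
    to_store = []
    for fp in candidates:
        if fp in local_base:
            continue  # duplicate already known to this node
        if fp in shared_base:
            continue  # duplicate stored by some other node
        to_store.append(fp)
    return to_store


local_base = {"f1"}
shared_base = {"f1", "f2"}  # the shared base holds the full set of fingerprints
print(select_slices_to_store(["f1", "f2", "f3"], local_base, shared_base))  # ['f3']
```

The shared base is consulted only for fingerprints that miss locally, so one extra lookup replaces a round of queries to every other node.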
Further, in this embodiment, the memory node equipment 1 is also configured to:
Save all remaining data slices to be written (i.e., the non-duplicate data slices), and save the storage location information corresponding to the remaining data slices to be written into the local fingerprint base 3 and the shared fingerprint base 2.
Further, in this embodiment, the distributed memory system further includes a control node equipment 5, which is communicatively connected to the memory node equipment 1 and to the shared fingerprint base 2 (for example, via the network 4). The shared fingerprint base 2 may be provided in a shared disk (such as an NVMe disk mounted via NVMe-oF), and the shared disk may be provided in the control node equipment 5 or independently of the control node equipment 5.
The memory node equipment 1 is also configured to:
Determine the reference count change value of the fingerprint of each data slice to be written (for example, determine that the reference count change value of each fingerprint to be deduplicated is +1), and send the reference count change value of the fingerprint of each data slice to be written to the control node equipment 5.
The control node equipment 5 is configured to:
Update the accumulated reference count of the fingerprint of each data slice to be written according to the reference count change value of that fingerprint (the accumulated reference count of a fingerprint represents the total number of times the data slice corresponding to the fingerprint is referenced by stored data).
Further, in this embodiment, the memory node equipment 1 is also configured to:
Receive a deletion request for data to be deleted, obtain the data slice fingerprint sequence of the data to be deleted, determine the reference count change value of each fingerprint in the obtained data slice fingerprint sequence (for example, determine that the reference count change value of each fingerprint in the data slice fingerprint sequence is -1), and send the reference count change value of each fingerprint in the data slice fingerprint sequence to the control node equipment 5.
The control node equipment 5 is also configured to:
Update the accumulated reference count of each fingerprint in the data slice fingerprint sequence according to the reference count change value of that fingerprint, delete the data slice fingerprint sequence of the data to be deleted from the shared fingerprint base 2, and notify the memory node equipment 1 to delete the data slice fingerprint sequence of the data to be deleted from the local fingerprint base 3.
In this embodiment, when deleting data, the memory node equipment 1 cannot directly delete the data slices of the data to be deleted, because it cannot determine whether those data slices are also referenced by other data; deleting them directly could cause data loss. Therefore, it is only necessary to update the accumulated reference count of each fingerprint in the data slice fingerprint sequence of the data to be deleted and to delete that data slice fingerprint sequence.
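The reference counting on write and delete can be illustrated with a small sketch. The class `ReferenceCounter` is a hypothetical stand-in for the bookkeeping done by the control node equipment 5, and the change values are passed as a plain dictionary purely for illustration.

```python
from collections import Counter


class ReferenceCounter:
    """Toy stand-in for the control node's accumulated reference counts."""

    def __init__(self) -> None:
        self.counts = Counter()

    def apply_changes(self, changes: dict[str, int]) -> None:
        """Add each reported change value (+1 on write, -1 on delete) to its count."""
        for fp, delta in changes.items():
            self.counts[fp] += delta


control = ReferenceCounter()
control.apply_changes({"f1": +1, "f2": +1})  # data A references slices f1 and f2
control.apply_changes({"f2": +1, "f3": +1})  # data B references slices f2 and f3
control.apply_changes({"f1": -1, "f2": -1})  # data A is deleted
print(dict(control.counts))  # {'f1': 0, 'f2': 1, 'f3': 1}
```

After data A is deleted, slice f2 still has a count of 1 because data B references it, which is why the slice itself must not be removed at deletion time.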
Further, in this embodiment, the control node equipment 5 is also configured to:
When the accumulated reference count of a fingerprint in the shared fingerprint base 2 is detected to be zero (i.e., the data slice corresponding to the fingerprint is not referenced by any data), record the duration for which the fingerprint remains in the state of having an accumulated reference count of zero.
When the duration is greater than a preset duration, delete the fingerprint and notify the corresponding memory node equipment 1 to delete the fingerprint and the data slice corresponding to the fingerprint.
When the duration is less than or equal to the preset duration, perform no deletion.
In this embodiment, when the control node equipment 5 detects that the accumulated reference count of a fingerprint is zero, it waits for a preset duration before deleting the data slice corresponding to the fingerprint, and during that preset duration it receives in real time the reference count change values of the fingerprint reported by each memory node equipment 1. This avoids accidental deletion of data caused by a memory node equipment 1 reporting a reference count change value late.
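The delayed deletion can be sketched as a small state machine that remembers when each fingerprint's count reached zero. The class name `DelayedCollector`, the explicit `now` argument, and the 60-second grace period are all assumptions made so the example is deterministic; the specification only requires some preset duration.

```python
class DelayedCollector:
    """Delete a fingerprint only after its count has stayed at zero past a grace period."""

    def __init__(self, grace_seconds: float) -> None:
        self.grace_seconds = grace_seconds
        self.counts: dict[str, int] = {}
        self.zero_since: dict[str, float] = {}

    def apply_change(self, fp: str, delta: int, now: float) -> None:
        self.counts[fp] = self.counts.get(fp, 0) + delta
        if self.counts[fp] == 0:
            self.zero_since.setdefault(fp, now)  # start the timer once
        else:
            self.zero_since.pop(fp, None)        # referenced again: cancel the timer

    def sweep(self, now: float) -> list[str]:
        """Return and forget fingerprints whose zero state exceeded the grace period."""
        expired = [fp for fp, t in self.zero_since.items()
                   if now - t > self.grace_seconds]
        for fp in expired:
            del self.zero_since[fp]
            del self.counts[fp]
        return expired


collector = DelayedCollector(grace_seconds=60)
collector.apply_change("f1", +1, now=0)
collector.apply_change("f1", -1, now=10)  # count reaches zero at t=10
print(collector.sweep(now=70))            # [] (60 s elapsed, not greater than 60)
print(collector.sweep(now=71))            # ['f1']
```

A late +1 report that arrives before the sweep cancels the timer, which is the accidental-deletion case the grace period is meant to cover.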
Further, in this embodiment, the memory node equipment 1 is also configured to:
When a read request for data to be read is received, obtain the data slice fingerprint sequence of the data to be read, obtain the storage location information of the data slice corresponding to each fingerprint in the data slice fingerprint sequence, obtain the data slice corresponding to each fingerprint according to the obtained storage location information, and then assemble the obtained data slices into the data to be read according to the data slice fingerprint sequence.
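The read path can be sketched as a lookup followed by concatenation. The dictionaries `locations` and `storage`, and the string form of a storage location, are hypothetical; the specification does not fix how location information is encoded.

```python
def read_data(fingerprint_sequence: list[str],
              locations: dict[str, str],
              storage: dict[str, bytes]) -> bytes:
    """Reassemble data by fetching each slice in fingerprint-sequence order."""
    parts = []
    for fp in fingerprint_sequence:
        location = locations[fp]         # storage location information for this fingerprint
        parts.append(storage[location])  # fetch the slice stored at that location
    return b"".join(parts)


storage = {"node1:0": b"hello ", "node2:0": b"world"}
locations = {"f1": "node1:0", "f2": "node2:0"}
print(read_data(["f1", "f2", "f1"], locations, storage))  # b'hello worldhello '
```

The fingerprint f1 appears twice in the sequence but is stored once, so the same slice is fetched for both positions.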
The present invention proposes a data deduplication program.
Referring to Fig. 2, it is a schematic diagram of the running environment of an embodiment of the data deduplication program 10 of the present invention.
In this embodiment, the data deduplication program 10 is installed and runs in the memory node equipment 1. The memory node equipment 1 may be a computing device such as a desktop computer, a notebook computer, a palmtop computer, or a server. The memory node equipment 1 may include, but is not limited to, a memory 11, a processor 12, and a display 13. Fig. 2 shows only the memory node equipment 1 with components 11-13; it should be understood that not all of the illustrated components are required, and more or fewer components may be implemented instead.
In some embodiments, the memory 11 may be an internal storage unit of the memory node equipment 1, such as a hard disk or internal memory of the memory node equipment 1. In other embodiments, the memory 11 may be an external storage device of the memory node equipment 1, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card equipped on the memory node equipment 1. Further, the memory 11 may include both an internal storage unit of the memory node equipment 1 and an external storage device. The memory 11 is used to store the application software installed on the memory node equipment 1 and various types of data, such as the program code of the data deduplication program 10. The memory 11 may also be used to temporarily store data that has been output or is to be output.
In some embodiments, the processor 12 may be a central processing unit (CPU), a microprocessor, or another data processing chip, and is used to run the program code or process the data stored in the memory 11, for example to execute the data deduplication program 10.
In some embodiments, the display 13 may be an LED display, a liquid crystal display, a touch-control liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display 13 is used to display the information processed in the memory node equipment 1 and to display a visual user interface. The components 11-13 of the memory node equipment 1 communicate with one another via a program bus.
Referring to Fig. 3, it is a program module diagram of an embodiment of the data deduplication program 10 of the present invention. In this embodiment, the data deduplication program 10 may be divided into one or more modules, which are stored in the memory 11 and executed by one or more processors (the processor 12 in this embodiment) to carry out the present invention. For example, in Fig. 3, the data deduplication program 10 may be divided into a receiving module 101, a preprocessing module 102, an enquiry module 103, a first deduplication module 104, and a second deduplication module 105. A module as referred to in the present invention is a series of computer program instruction segments capable of completing a specific function, and is better suited than a program for describing the execution of the data deduplication program 10 in the memory node equipment 1, wherein:
The receiving module 101 is used to receive a data slice write request, which includes several data slices to be written and the fingerprint of each data slice to be written.
The data slices to be written are obtained by cutting the data to be written (whose data types include block-level data and file-level data). The cutting may be performed by the receiving module 101 or by any other applicable device (for example, a client), and the cutting method includes:
The data file to be written is cut into a preset number of data slices of identical size. Alternatively, denoting the preset number as M, where M is a natural number greater than 1, the data slice size for cutting the data file to be written is determined, M-1 data blocks of identical size are cut off one by one according to the determined data slice size, and the remainder after cutting is the M-th data block. The size of a data slice to be written may be 4 KB, 8 KB, 12 KB, 16 KB, or another granularity.
After the data to be written is cut into several data slices to be written, the fingerprint of each data slice to be written is calculated, for example with Message-Digest Algorithm 5 (MD5) or Secure Hash Algorithm 1 (SHA-1). At the same time, the order of the data slices to be written is recorded (i.e., the data slice fingerprint sequence), so that when the data is subsequently read, the data slices can be assembled into the data according to the data slice fingerprint sequence. In addition, the data slice fingerprint sequence may also be saved to the local fingerprint base 3 and the shared fingerprint base 2.
The preprocessing module 102 is used to determine the fingerprints to be deduplicated among the fingerprints of the data slices to be written.
The method by which the preprocessing module 102 determines the fingerprints to be deduplicated includes:
Judging whether redundant fingerprints exist among the fingerprints of the data slices to be written; if so, deleting the redundant fingerprints and taking the remaining fingerprints as the fingerprints to be deduplicated; if not, taking the fingerprints of all the data slices to be written as the fingerprints to be deduplicated.
For example, it is judged whether identical fingerprints exist among the fingerprints of all the data slices to be written. If identical fingerprints exist, identical fingerprints are placed in one fingerprint group; after all fingerprint groups have been found, one fingerprint in each group is selected and retained, and the unselected fingerprints are deleted as redundant fingerprints. It is then judged whether ungrouped fingerprints exist; if so, each ungrouped fingerprint is taken as a fingerprint to be deduplicated; if not, the process ends. If no identical fingerprints exist, the fingerprints of all the data slices to be written are treated as ungrouped fingerprints, and each ungrouped fingerprint is taken as a fingerprint to be deduplicated.
The enquiry module 103 is used to search the local fingerprint base 3 to determine whether each fingerprint to be deduplicated exists there.
The first deduplication module 104 is used to, when one or more fingerprints to be deduplicated exist in the local fingerprint base 3, delete the data slices to be written that correspond to those fingerprints.
The presence of one or more fingerprints to be deduplicated in the local fingerprint base 3 indicates that the corresponding data slices to be written are duplicate data slices, so these duplicate data slices are deleted to save storage space.
The second deduplication module 105 is used to, when one or more fingerprints to be deduplicated do not exist in the local fingerprint base 3, take those fingerprints as fingerprints to be processed and search for each fingerprint to be processed in the shared fingerprint base 2; when one or more fingerprints to be processed are found in the shared fingerprint base 2, the data slices to be written that correspond to the found fingerprints are deleted.
Because the shared fingerprint base 2 holds the full set of fingerprint data, if the first deduplication module 104 does not find a fingerprint to be deduplicated in the local fingerprint base 3, the second deduplication module 105 goes on to query the shared fingerprint base 2 for that fingerprint. If the fingerprint exists there, the data slice to be written corresponding to it is determined to already exist on another memory node and is therefore a duplicate data slice, which no longer needs to be stored.
Compared with the prior art, when a memory node equipment 1 of this embodiment performs data deduplication and does not find the fingerprint of a data slice to be written in the local fingerprint base 3, it can query the shared fingerprint base 2 directly to determine whether the fingerprint is a duplicate, without communicating with the other memory node equipment 1 one by one. This improves the data deduplication efficiency of the distributed memory system.
Further, in this embodiment, the data deduplication program 10 further includes a storage module (not shown), used to:
Save all remaining data slices to be written (i.e., the non-duplicate data slices), and save the storage location information corresponding to the remaining data slices to be written into the local fingerprint base 3 and the shared fingerprint base 2.
Further, in this embodiment, the data deduplication program 10 further includes a reference update module (not shown), used to:
Determine the reference count change value of the fingerprint of each data slice to be written (for example, determine that the reference count change value of each fingerprint to be deduplicated is +1), and send the reference count change value of the fingerprint of each data slice to be written to the control node equipment 5, so that the control node equipment 5 updates the accumulated reference count of the fingerprint of each data slice to be written according to the reference count change value of that fingerprint (the accumulated reference count of a fingerprint represents the total number of times the data slice corresponding to the fingerprint is referenced by stored data).
Further, in this embodiment, the data deduplication program 10 further includes a deletion module (not shown), used to:
Receive a deletion request for data to be deleted, obtain the data slice fingerprint sequence of the data to be deleted, determine the reference count change value of each fingerprint in the obtained data slice fingerprint sequence (for example, determine that the reference count change value of each fingerprint in the data slice fingerprint sequence is -1), and send the reference count change value of each fingerprint in the data slice fingerprint sequence to the control node equipment 5, so that the control node equipment 5 updates the accumulated reference count of each fingerprint in the data slice fingerprint sequence according to the reference count change value of that fingerprint, deletes the data slice fingerprint sequence of the data to be deleted from the shared fingerprint base 2, and notifies the memory node equipment 1 to delete the data slice fingerprint sequence of the data to be deleted from the local fingerprint base 3.
In this embodiment, when deleting data, the memory node equipment 1 cannot directly delete the data slices of the data to be deleted, because it cannot determine whether those data slices are also referenced by other data; deleting them directly could cause data loss. Therefore, it is only necessary to update the accumulated reference count of each fingerprint in the data slice fingerprint sequence of the data to be deleted and to delete that data slice fingerprint sequence.
When the accumulated reference count of a fingerprint in the shared fingerprint base 2 is detected to be zero (i.e., the data slice corresponding to the fingerprint is not referenced by any data), the control node equipment 5 records the duration for which the fingerprint remains in the state of having an accumulated reference count of zero. When the duration is greater than a preset duration, the fingerprint is deleted and the corresponding memory node equipment 1 is notified to delete the fingerprint and the data slice corresponding to the fingerprint. When the duration is less than or equal to the preset duration, no deletion is performed.
In this embodiment, when the control node equipment 5 detects that the accumulated reference count of a fingerprint is zero, it waits for a preset duration before deleting the data slice corresponding to the fingerprint, and during that preset duration it receives in real time the reference count change values of the fingerprint reported by each memory node equipment 1. This avoids accidental deletion of data caused by a memory node equipment 1 reporting a reference count change value late.
Further, in this embodiment, the data deduplication program 10 further includes a read module (not shown), used to:
When a read request for data to be read is received, obtain the data slice fingerprint sequence of the data to be read, obtain the storage location information of the data slice corresponding to each fingerprint in the data slice fingerprint sequence, obtain the data slice corresponding to each fingerprint according to the obtained storage location information, and then assemble the obtained data slices into the data to be read according to the data slice fingerprint sequence.
Further, the present invention proposes a data deduplication method. The method is applicable to the distributed memory system described above.
As shown in Fig. 4, Fig. 4 is a flow diagram of an embodiment of the data deduplication method of the present invention.
In this embodiment, the method includes:
Step S10, the memory node equipment 1 receives a data slice write request, which includes several data slices to be written and the fingerprint of each data slice to be written.
The data slices to be written are obtained by cutting the data to be written (whose data types include block-level data and file-level data). The cutting may be performed by the memory node equipment 1 or by any other applicable device (for example, a client), and the cutting method includes:
The data file to be written is cut into a preset number of data slices of identical size. Alternatively, denoting the preset number as M, where M is a natural number greater than 1, the data slice size for cutting the data file to be written is determined, M-1 data blocks of identical size are cut off one by one according to the determined data slice size, and the remainder after cutting is the M-th data block. The size of a data slice to be written may be 4 KB, 8 KB, 12 KB, 16 KB, or another granularity.
After the data to be written is cut into several data slices to be written, the fingerprint of each data slice to be written is calculated, for example with Message-Digest Algorithm 5 (MD5) or Secure Hash Algorithm 1 (SHA-1). At the same time, the order of the data slices to be written is recorded (i.e., the data slice fingerprint sequence), so that when the data is subsequently read, the data slices can be assembled into the data according to the data slice fingerprint sequence. In addition, the data slice fingerprint sequence may also be saved to the local fingerprint base 3 and the shared fingerprint base 2.
Step S20, the memory node equipment 1 determines the fingerprints to be deduplicated among the fingerprints of the data slices to be written.
The method for determining the fingerprints to be deduplicated includes: judging whether redundant fingerprints exist among the fingerprints of the data slices to be written; if so, deleting the redundant fingerprints and taking the remaining fingerprints as the fingerprints to be deduplicated; if not, taking the fingerprints of all the data slices to be written as the fingerprints to be deduplicated.
For example, step S20 includes steps S21 to S26 (not shown), wherein:
Step S21, judge whether identical fingerprints exist among the fingerprints of all the data slices to be written.
Step S22, if identical fingerprints exist, place identical fingerprints in one fingerprint group; after all fingerprint groups have been found, select and retain one fingerprint in each group and delete the unselected fingerprints as redundant fingerprints; then judge whether ungrouped fingerprints exist; if so, take each ungrouped fingerprint as a fingerprint to be deduplicated; if not, end the process.
Step S23, if no identical fingerprints exist, treat the fingerprints of all the data slices to be written as ungrouped fingerprints, and take each ungrouped fingerprint as a fingerprint to be deduplicated.
Step S30, the memory node equipment 1 searches the local fingerprint base 3 to determine whether each fingerprint to be deduplicated exists there.
Step S40, when one or more fingerprints to be deduplicated exist in the local fingerprint base 3, the memory node equipment 1 deletes the data slices to be written that correspond to those fingerprints.
The presence of one or more fingerprints to be deduplicated in the local fingerprint base 3 indicates that the corresponding data slices to be written are duplicate data slices, so these duplicate data slices are deleted to save storage space.
Step S50, when one or more fingerprints to be deduplicated do not exist in the local fingerprint base 3, the memory node equipment 1 takes those fingerprints as fingerprints to be processed and searches for each fingerprint to be processed in the shared fingerprint base 2; when one or more fingerprints to be processed are found in the shared fingerprint base 2, the data slices to be written that correspond to the found fingerprints are deleted.
Because the shared fingerprint base 2 holds the full set of fingerprint data, if the memory node equipment 1 does not find a fingerprint to be deduplicated in the local fingerprint base 3, it goes on to query the shared fingerprint base 2 for that fingerprint. If the fingerprint exists there, the data slice to be written corresponding to it is determined to already exist on another memory node and is therefore a duplicate data slice, which no longer needs to be stored.
Compared with the prior art, when a memory node equipment 1 of this embodiment performs data deduplication and does not find the fingerprint of a data slice to be written in the local fingerprint base 3, it can query the shared fingerprint base 2 directly to determine whether the fingerprint is a duplicate, without communicating with the other memory node equipment 1 one by one. This improves the data deduplication efficiency of the distributed memory system.
Further, in this embodiment, after step S50, the method further includes:
The memory node equipment 1 saves all remaining data slices to be written (i.e., the non-duplicate data slices), and saves the storage location information corresponding to the remaining data slices to be written into the local fingerprint base 3 and the shared fingerprint base 2.
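Saving the non-duplicate slices and publishing their locations might look like the sketch below. The list `disk`, the tuple form of a location, and the function name are assumptions; the point is only that the same location record is written to both fingerprint bases.

```python
def store_remaining_slices(remaining: dict[str, bytes],
                           node_id: str,
                           disk: list[bytes],
                           local_base: dict[str, tuple],
                           shared_base: dict[str, tuple]) -> None:
    """Append each non-duplicate slice to this node's store and record where it went."""
    for fp, data_slice in remaining.items():
        location = (node_id, len(disk))  # hypothetical location: (node, slot index)
        disk.append(data_slice)
        local_base[fp] = location        # this node can now find the slice locally
        shared_base[fp] = location       # other nodes can find it via the shared base


disk, local_base, shared_base = [], {}, {}
store_remaining_slices({"f3": b"new data"}, "node1", disk, local_base, shared_base)
print(shared_base)  # {'f3': ('node1', 0)}
```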
Further, in this embodiment, the distributed memory system further includes a control node equipment 5 communicatively connected to each memory node equipment 1 and to the shared fingerprint base 2, and after step S20, the method further includes:
The memory node equipment 1 determines the reference count change value of the fingerprint of each data slice to be written (for example, determines that the reference count change value of each fingerprint to be deduplicated is +1), and sends the reference count change value of the fingerprint of each data slice to be written to the control node equipment 5.
Then, the control node equipment 5 updates the accumulated reference count of the fingerprint of each data slice to be written according to the reference count change value of that fingerprint (the accumulated reference count of a fingerprint represents the total number of times the data slice corresponding to the fingerprint is referenced by stored data).
Further, in this embodiment, the method further includes steps S60 to S80 (not shown).
Wherein:
Step S60, the memory node equipment 1 receives a deletion request for data to be deleted.
Step S70, the memory node equipment 1 obtains the data slice fingerprint sequence of the data to be deleted, determines the reference count change value of each fingerprint in the obtained data slice fingerprint sequence (for example, determines that the reference count change value of each fingerprint in the data slice fingerprint sequence is -1), and sends the reference count change value of each fingerprint in the data slice fingerprint sequence to the control node equipment 5.
Step S80, the control node equipment 5 updates the accumulated reference count of each fingerprint in the data slice fingerprint sequence according to the reference count change value of that fingerprint, deletes the data slice fingerprint sequence of the data to be deleted from the shared fingerprint base 2, and notifies the memory node equipment 1 to delete the data slice fingerprint sequence of the data to be deleted from the local fingerprint base 3.
In this embodiment, when deleting data, the memory node equipment 1 cannot directly delete the data slices of the data to be deleted, because it cannot determine whether those data slices are also referenced by other data; deleting them directly could cause data loss. Therefore, it is only necessary to update the accumulated reference count of each fingerprint in the data slice fingerprint sequence of the data to be deleted and to delete that data slice fingerprint sequence.
Further, in this embodiment, the method further includes:
When the accumulated reference count of a fingerprint in the shared fingerprint base 2 is detected to be zero (i.e., the data slice corresponding to the fingerprint is not referenced by any data), the control node equipment 5 records the duration for which the fingerprint remains in the state of having an accumulated reference count of zero.
When the duration is greater than a preset duration, the fingerprint is deleted and the corresponding memory node equipment 1 is notified to delete the fingerprint and the data slice corresponding to the fingerprint.
When the duration is less than or equal to the preset duration, no deletion is performed.
In this embodiment, when the control node equipment 5 detects that the accumulated reference count of a fingerprint is zero, it waits for a preset duration before deleting the data slice corresponding to the fingerprint, and during that preset duration it receives in real time the reference count change values of the fingerprint reported by each memory node equipment 1. This avoids accidental deletion of data caused by a memory node equipment 1 reporting a reference count change value late.
Further, in this embodiment, the method also includes step S90 (not shown in the figures).
Step S90: upon receiving a read request for data to be read, storage node device 1 obtains the data slice fingerprint sequence of the data to be read, obtains the storage location information of the data slice corresponding to each fingerprint in the data slice fingerprint sequence, retrieves those data slices according to the obtained storage location information, and then assembles the retrieved data slices into the data to be read in the order given by the data slice fingerprint sequence.
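As a rough illustration of step S90, the read path can be expressed as a lookup followed by an ordered join. The function name `read_data`, the `locations` mapping, and the `fetch` callback are assumptions made for this sketch; the source does not specify how storage location information is represented or how a slice is retrieved from it.

```python
def read_data(fingerprint_sequence, locations, fetch):
    """Reassemble data from its ordered data slice fingerprint sequence.

    fingerprint_sequence: ordered list of fingerprints for the data to be read.
    locations: mapping of fingerprint -> storage location information (hypothetical format).
    fetch: callable that returns the data slice (bytes) stored at a location.
    """
    slices = {}
    for fingerprint in fingerprint_sequence:
        # A deduplicated slice may appear several times; fetch it only once.
        if fingerprint not in slices:
            slices[fingerprint] = fetch(locations[fingerprint])
    # Concatenate the slices in the order recorded by the fingerprint sequence.
    return b"".join(slices[fingerprint] for fingerprint in fingerprint_sequence)
```

Caching fetched slices by fingerprint is a design choice of the sketch, not something the source requires; it simply avoids retrieving the same slice twice when the sequence repeats a fingerprint.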
Further, the present invention also provides a computer-readable storage medium storing a data deduplication program 10. Embodiments of the data deduplication program 10 have been described in detail above and are not repeated here.
The above is only a preferred embodiment of the present invention and does not limit the scope of the invention. Any equivalent structural transformation made under the inventive concept of the present invention using the contents of the description and drawings, or any direct or indirect application in other related technical fields, falls within the scope of patent protection of the present invention.

Claims (10)

1. A distributed storage system, characterized in that the distributed storage system comprises a plurality of storage node devices and several shared fingerprint libraries, the storage node devices being communicatively connected to the shared fingerprint libraries; a local fingerprint library is provided in the storage node device, or the storage node device is communicatively connected to a corresponding local fingerprint library; the storage node device is configured to:
Receive a data slice write request, the data slice write request comprising several data slices to be written and the fingerprint of each data slice to be written;
Determine the fingerprints to be deduplicated among the fingerprints of the data slices to be written, and look up in the local fingerprint library whether each fingerprint to be deduplicated exists, the local fingerprint library comprising the fingerprints of the data slices already stored in the storage node device;
When one or more fingerprints to be deduplicated exist in the local fingerprint library, delete the data slices to be written corresponding to the one or more fingerprints to be deduplicated;
When one or more fingerprints to be deduplicated do not exist in the local fingerprint library, take the one or more fingerprints to be deduplicated as fingerprints to be processed and look up each fingerprint to be processed in the shared fingerprint library, the shared fingerprint library comprising the fingerprints of the data slices already stored in all storage node devices; and when one or more fingerprints to be processed are found in the shared fingerprint library, delete the data slices to be written corresponding to the one or more fingerprints to be processed that were found.
2. The distributed storage system according to claim 1, characterized in that the data slices to be written are obtained by splitting data to be written, the data slice write request further comprises a data slice fingerprint sequence, and the data slice fingerprint sequence comprises the fingerprints of the data slices to be written arranged in order;
Determining the fingerprints to be deduplicated among the fingerprints of the data slices to be written comprises: judging whether redundant fingerprints exist among the fingerprints of the data slices to be written; if so, deleting the redundant fingerprints and taking the remaining fingerprints as the fingerprints to be deduplicated; if not, taking the fingerprints of all data slices to be written as the fingerprints to be deduplicated;
The storage node device is further configured to:
Save the data slice fingerprint sequence into the local fingerprint library and the shared fingerprint library;
The storage node device is further configured to:
Save all remaining data slices to be written, and save the storage location information corresponding to the remaining data slices to be written into the local fingerprint library and the shared fingerprint library.
3. The distributed storage system according to claim 1 or 2, characterized in that the distributed storage system further comprises a control node device communicatively connected to each storage node device and to the shared fingerprint library, and the storage node device is further configured to:
Determine the reference count change value of the fingerprint of each data slice to be written, and send the reference count change value of the fingerprint of each data slice to be written to the control node device;
The control node device is configured to:
Update the cumulative reference count of the fingerprint of each data slice to be written according to the reference count change value of the fingerprint of each data slice to be written.
4. The distributed storage system according to claim 2, characterized in that the storage node device is further configured to:
Receive a deletion request for data to be deleted;
Obtain the data slice fingerprint sequence of the data to be deleted, determine the reference count change value of each fingerprint in the obtained data slice fingerprint sequence, and send the reference count change value of each fingerprint in the data slice fingerprint sequence to the control node device;
The control node device is further configured to:
Update the cumulative reference count of each fingerprint in the data slice fingerprint sequence according to the reference count change value of each fingerprint in the data slice fingerprint sequence, delete the data slice fingerprint sequence of the data to be deleted from the shared fingerprint library, and notify the storage node device to delete the data slice fingerprint sequence of the data to be deleted from the local fingerprint library;
When the cumulative reference count of a fingerprint in the shared fingerprint library is detected to be zero, record how long the fingerprint has remained in the state in which its cumulative reference count is zero; when that duration is greater than a preset duration, delete the fingerprint and notify the corresponding storage node device to delete the fingerprint and the data slice corresponding to the fingerprint.
5. A data deduplication method applicable to a distributed storage system, characterized in that the distributed storage system comprises a plurality of storage node devices and several shared fingerprint libraries, the storage node devices being communicatively connected to the shared fingerprint libraries; a local fingerprint library is provided in the storage node device, or the storage node device is communicatively connected to a corresponding local fingerprint library; the method comprises the steps of:
Receiving step: the storage node device receives a data slice write request, the data slice write request comprising several data slices to be written and the fingerprint of each data slice to be written;
Query step: the storage node device determines the fingerprints to be deduplicated among the fingerprints of the data slices to be written, and looks up in the local fingerprint library whether each fingerprint to be deduplicated exists, the local fingerprint library comprising the fingerprints of the data slices already stored in the storage node device;
First deduplication step: when one or more fingerprints to be deduplicated exist in the local fingerprint library, the storage node device deletes the data slices to be written corresponding to the one or more fingerprints to be deduplicated;
Second deduplication step: when one or more fingerprints to be deduplicated do not exist in the local fingerprint library, the storage node device takes the one or more fingerprints to be deduplicated as fingerprints to be processed and looks up each fingerprint to be processed in the shared fingerprint library, the shared fingerprint library comprising the fingerprints of the data slices already stored in all storage node devices; and when one or more fingerprints to be processed are found in the shared fingerprint library, deletes the data slices to be written corresponding to the one or more fingerprints to be processed that were found.
6. The data deduplication method according to claim 5, characterized in that the data slices to be written are obtained by splitting data to be written, the data slice write request further comprises a data slice fingerprint sequence, and the data slice fingerprint sequence comprises the fingerprints of the data slices to be written arranged in order;
The step of determining the fingerprints to be deduplicated among the fingerprints of the data slices to be written comprises: judging whether redundant fingerprints exist among the fingerprints of the data slices to be written; if so, deleting the redundant fingerprints and taking the remaining fingerprints as the fingerprints to be deduplicated; if not, taking the fingerprints of all data slices to be written as the fingerprints to be deduplicated;
After the receiving step, the method further comprises:
The storage node device saves the data slice fingerprint sequence into the local fingerprint library and the shared fingerprint library;
After the second deduplication step, the method further comprises:
The storage node device saves all remaining data slices to be written, and saves the storage location information corresponding to the remaining data slices to be written into the local fingerprint library and the shared fingerprint library.
7. The data deduplication method according to claim 5 or 6, characterized in that the distributed storage system further comprises a control node device communicatively connected to each storage node device and to the shared fingerprint library, and after the receiving step, the method further comprises:
The storage node device determines the reference count change value of the fingerprint of each data slice to be written, and sends the reference count change value of the fingerprint of each data slice to be written to the control node device;
The control node device updates the cumulative reference count of the fingerprint of each data slice to be written according to the reference count change value of the fingerprint of each data slice to be written.
8. The data deduplication method according to claim 6, characterized in that the method further comprises:
The storage node device receives a deletion request for data to be deleted;
The storage node device obtains the data slice fingerprint sequence of the data to be deleted, determines the reference count change value of each fingerprint in the obtained data slice fingerprint sequence, and sends the reference count change value of each fingerprint in the data slice fingerprint sequence to the control node device;
The control node device updates the cumulative reference count of each fingerprint in the data slice fingerprint sequence according to the reference count change value of each fingerprint in the data slice fingerprint sequence, deletes the data slice fingerprint sequence of the data to be deleted from the shared fingerprint library, and notifies the storage node device to delete the data slice fingerprint sequence of the data to be deleted from the local fingerprint library;
When the cumulative reference count of a fingerprint in the shared fingerprint library is detected to be zero, the control node device records how long the fingerprint has remained in the state in which its cumulative reference count is zero; when that duration is greater than a preset duration, it deletes the fingerprint and notifies the corresponding storage node device to delete the fingerprint and the data slice corresponding to the fingerprint.
9. A storage node device, characterized in that the storage node device is communicatively connected to a shared fingerprint library; a local fingerprint library is provided in the storage node device, or the storage node device is communicatively connected to a corresponding local fingerprint library; the storage node device comprises a memory and a processor, a data deduplication program is stored on the memory, and the data deduplication program, when executed by the processor, implements the following steps:
Receiving step: receiving a data slice write request, the data slice write request comprising several data slices to be written and the fingerprint of each data slice to be written;
Query step: determining the fingerprints to be deduplicated among the fingerprints of the data slices to be written, and looking up in the local fingerprint library whether each fingerprint to be deduplicated exists, the local fingerprint library comprising the fingerprints of the data slices already stored in the storage node device;
First deduplication step: when one or more fingerprints to be deduplicated exist in the local fingerprint library, deleting the data slices to be written corresponding to the one or more fingerprints to be deduplicated;
Second deduplication step: when one or more fingerprints to be deduplicated do not exist in the local fingerprint library, taking the one or more fingerprints to be deduplicated as fingerprints to be processed and looking up each fingerprint to be processed in the shared fingerprint library, the shared fingerprint library comprising the fingerprints of the data slices already stored in all storage node devices; and when one or more fingerprints to be processed are found in the shared fingerprint library, deleting the data slices to be written corresponding to the one or more fingerprints to be processed that were found.
10. A computer-readable storage medium applicable to a storage node device, characterized in that the storage node device is communicatively connected to a shared fingerprint library; a local fingerprint library is provided in the storage node device, or the storage node device is communicatively connected to a corresponding local fingerprint library; the computer-readable storage medium stores a data deduplication program, and the data deduplication program is executable by at least one processor to cause the at least one processor to perform the following steps:
Receiving step: receiving a data slice write request, the data slice write request comprising several data slices to be written and the fingerprint of each data slice to be written;
Query step: determining the fingerprints to be deduplicated among the fingerprints of the data slices to be written, and looking up in the local fingerprint library whether each fingerprint to be deduplicated exists, the local fingerprint library comprising the fingerprints of the data slices already stored in the storage node device;
First deduplication step: when one or more fingerprints to be deduplicated exist in the local fingerprint library, deleting the data slices to be written corresponding to the one or more fingerprints to be deduplicated;
Second deduplication step: when one or more fingerprints to be deduplicated do not exist in the local fingerprint library, taking the one or more fingerprints to be deduplicated as fingerprints to be processed and looking up each fingerprint to be processed in the shared fingerprint library, the shared fingerprint library comprising the fingerprints of the data slices already stored in all storage node devices; and when one or more fingerprints to be processed are found in the shared fingerprint library, deleting the data slices to be written corresponding to the one or more fingerprints to be processed that were found.
CN201910007367.9A 2019-01-04 2019-01-04 Distributed storage system, storage node device and data deduplication method Active CN109800218B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910007367.9A CN109800218B (en) 2019-01-04 2019-01-04 Distributed storage system, storage node device and data deduplication method
PCT/CN2019/118009 WO2020140622A1 (en) 2019-01-04 2019-11-13 Distributed storage system, storage node device and data duplicate deletion method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910007367.9A CN109800218B (en) 2019-01-04 2019-01-04 Distributed storage system, storage node device and data deduplication method

Publications (2)

Publication Number Publication Date
CN109800218A true CN109800218A (en) 2019-05-24
CN109800218B CN109800218B (en) 2024-04-09

Family

ID=66558525

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910007367.9A Active CN109800218B (en) 2019-01-04 2019-01-04 Distributed storage system, storage node device and data deduplication method

Country Status (2)

Country Link
CN (1) CN109800218B (en)
WO (1) WO2020140622A1 (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110457305A (en) * 2019-08-13 2019-11-15 腾讯科技(深圳)有限公司 Data duplicate removal method, device, equipment and medium
WO2020140622A1 (en) * 2019-01-04 2020-07-09 平安科技(深圳)有限公司 Distributed storage system, storage node device and data duplicate deletion method
CN111399768A (en) * 2020-02-21 2020-07-10 苏州浪潮智能科技有限公司 Data storage method, system, equipment and computer readable storage medium
CN111459928A (en) * 2020-03-27 2020-07-28 上海爱数信息技术股份有限公司 Data deduplication method applied to data backup scene in cluster range and application
CN111580755A (en) * 2020-05-09 2020-08-25 杭州海康威视系统技术有限公司 Distributed data processing system and distributed data processing method
WO2022048475A1 (en) * 2020-09-03 2022-03-10 中兴通讯股份有限公司 Data deduplication method, node, and computer readable storage medium
CN114442931A (en) * 2021-12-23 2022-05-06 天翼云科技有限公司 Data deduplication method and system, electronic device and storage medium
CN117369731A (en) * 2023-12-07 2024-01-09 苏州元脑智能科技有限公司 Data reduction processing method, device, equipment and medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8495392B1 (en) * 2010-09-02 2013-07-23 Symantec Corporation Systems and methods for securely deduplicating data owned by multiple entities
CN103942292A (en) * 2014-04-11 2014-07-23 华为技术有限公司 Virtual machine mirror image document processing method, device and system
CN103944988A (en) * 2014-04-22 2014-07-23 南京邮电大学 Repeating data deleting system and method applicable to cloud storage
WO2015176249A1 (en) * 2014-05-21 2015-11-26 华为技术有限公司 Transmission method for wireless ethernet interface hard disk, related device, and system
CN107391761A (en) * 2017-08-28 2017-11-24 郑州云海信息技术有限公司 A kind of data managing method and device based on data de-duplication technology
US20180052846A1 (en) * 2016-08-22 2018-02-22 Kabushiki Kaisha Toshiba Data processing method, data processing device, storage system, and method for controlling storage system
CN108008918A (en) * 2017-11-30 2018-05-08 联想(北京)有限公司 Data processing method, memory node and distributed memory system
CN108415669A (en) * 2018-03-15 2018-08-17 深信服科技股份有限公司 The data duplicate removal method and device of storage system, computer installation and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107229420B (en) * 2017-05-27 2020-05-26 苏州浪潮智能科技有限公司 Data storage method, reading method, deleting method and data operating system
CN109800218B (en) * 2019-01-04 2024-04-09 平安科技(深圳)有限公司 Distributed storage system, storage node device and data deduplication method

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8495392B1 (en) * 2010-09-02 2013-07-23 Symantec Corporation Systems and methods for securely deduplicating data owned by multiple entities
CN103942292A (en) * 2014-04-11 2014-07-23 华为技术有限公司 Virtual machine mirror image document processing method, device and system
CN103944988A (en) * 2014-04-22 2014-07-23 南京邮电大学 Repeating data deleting system and method applicable to cloud storage
WO2015176249A1 (en) * 2014-05-21 2015-11-26 华为技术有限公司 Transmission method for wireless ethernet interface hard disk, related device, and system
US20180052846A1 (en) * 2016-08-22 2018-02-22 Kabushiki Kaisha Toshiba Data processing method, data processing device, storage system, and method for controlling storage system
CN107391761A (en) * 2017-08-28 2017-11-24 郑州云海信息技术有限公司 A kind of data managing method and device based on data de-duplication technology
CN108008918A (en) * 2017-11-30 2018-05-08 联想(北京)有限公司 Data processing method, memory node and distributed memory system
CN108415669A (en) * 2018-03-15 2018-08-17 深信服科技股份有限公司 The data duplicate removal method and device of storage system, computer installation and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
胡渝苹: "文件秒传系统在云存储环境下的设计与实现", 计算机应用与软件, no. 04, 15 April 2016 (2016-04-15), pages 335 - 339 *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020140622A1 (en) * 2019-01-04 2020-07-09 平安科技(深圳)有限公司 Distributed storage system, storage node device and data duplicate deletion method
CN110457305A (en) * 2019-08-13 2019-11-15 腾讯科技(深圳)有限公司 Data duplicate removal method, device, equipment and medium
CN110457305B (en) * 2019-08-13 2021-11-26 腾讯科技(深圳)有限公司 Data deduplication method, device, equipment and medium
CN111399768A (en) * 2020-02-21 2020-07-10 苏州浪潮智能科技有限公司 Data storage method, system, equipment and computer readable storage medium
CN111459928A (en) * 2020-03-27 2020-07-28 上海爱数信息技术股份有限公司 Data deduplication method applied to data backup scene in cluster range and application
CN111459928B (en) * 2020-03-27 2023-07-07 上海爱数信息技术股份有限公司 Data deduplication method applied to data backup scene in cluster range and application
CN111580755A (en) * 2020-05-09 2020-08-25 杭州海康威视系统技术有限公司 Distributed data processing system and distributed data processing method
CN111580755B (en) * 2020-05-09 2022-07-05 杭州海康威视系统技术有限公司 Distributed data processing system and distributed data processing method
WO2022048475A1 (en) * 2020-09-03 2022-03-10 中兴通讯股份有限公司 Data deduplication method, node, and computer readable storage medium
CN114442931A (en) * 2021-12-23 2022-05-06 天翼云科技有限公司 Data deduplication method and system, electronic device and storage medium
CN117369731A (en) * 2023-12-07 2024-01-09 苏州元脑智能科技有限公司 Data reduction processing method, device, equipment and medium
CN117369731B (en) * 2023-12-07 2024-02-27 苏州元脑智能科技有限公司 Data reduction processing method, device, equipment and medium

Also Published As

Publication number Publication date
CN109800218B (en) 2024-04-09
WO2020140622A1 (en) 2020-07-09

Similar Documents

Publication Publication Date Title
CN109800218A (en) Distributed memory system, memory node equipment and data duplicate removal method
Huang et al. X-Engine: An optimized storage engine for large-scale E-commerce transaction processing
CN105487818B (en) For the efficient De-weight method of repeated and redundant data in cloud storage system
US11016955B2 (en) Deduplication index enabling scalability
JP4116413B2 (en) Prefetch appliance server
US9342574B2 (en) Distributed storage system and distributed storage method
Liao et al. Multi-dimensional index on hadoop distributed file system
CN103020315B (en) A kind of mass small documents storage means based on master-salve distributed file system
US8799601B1 (en) Techniques for managing deduplication based on recently written extents
CN103890738B (en) The system and method for the weight that disappears in storage object after retaining clone and separate operation
CN102460439B (en) Data distribution through capacity leveling in a striped file system
CN104850572A (en) HBase non-primary key index building and inquiring method and system
US20150169655A1 (en) Efficient query processing in columnar databases using bloom filters
US20160350302A1 (en) Dynamically splitting a range of a node in a distributed hash table
CN103150394A (en) Distributed file system metadata management method facing to high-performance calculation
US9569477B1 (en) Managing scanning of databases in data storage systems
US20180113804A1 (en) Distributed data parallel method for reclaiming space
CN103218404A (en) Multi-dimensional metadata management method and system based on association characteristics
WO2020016649A2 (en) Pushing a point in time to a backend object storage for a distributed storage system
Merceedi et al. A comprehensive survey for hadoop distributed file system
US10891266B2 (en) File handling in a hierarchical storage system
CN110352410A (en) Track the access module and preextraction index node of index node
CN110427347A (en) Method, apparatus, memory node and the storage medium of data de-duplication
US20140258264A1 (en) Management of searches in a database system
CN107133334B (en) Data synchronization method based on high-bandwidth storage system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant