CN109800218B - Distributed storage system, storage node device and data deduplication method - Google Patents

Distributed storage system, storage node device and data deduplication method

Info

Publication number
CN109800218B
CN109800218B CN201910007367.9A CN201910007367A CN109800218B CN 109800218 B CN109800218 B CN 109800218B CN 201910007367 A CN201910007367 A CN 201910007367A CN 109800218 B CN109800218 B CN 109800218B
Authority
CN
China
Prior art keywords
data
fingerprint
written
fingerprints
storage node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910007367.9A
Other languages
Chinese (zh)
Other versions
CN109800218A (en)
Inventor
宋小兵
姜文峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN201910007367.9A
Publication of CN109800218A
Priority to PCT/CN2019/118009 (WO2020140622A1)
Application granted
Publication of CN109800218B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • G06F16/215Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Quality & Reliability (AREA)
  • Computing Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to distributed storage technology and discloses a distributed storage system, a storage node device and a data deduplication method. When a storage node device performs data deduplication and does not find the fingerprint of a data slice to be written in its local fingerprint library, it can query the shared fingerprint library directly to determine whether the fingerprint is a duplicate, without having to communicate with the other storage node devices one by one. This improves the data deduplication efficiency of the distributed storage system.

Description

A distributed storage system, storage node device and data deduplication method
Technical Field
The present invention relates to the field of distributed storage technologies, and in particular, to a distributed storage system, a storage node device, a data deduplication method, and a computer readable storage medium.
Background
Data deduplication is a technique applied in storage systems that globally identifies and eliminates redundant data, and it has been a research hotspot in storage systems in recent years. Data deduplication uniquely identifies a data block by computing a secure hash digest of the block (such as an SHA-1 fingerprint), which avoids character-by-character comparison of the data. The storage system can identify duplicate data quickly and conveniently simply by maintaining an index table of secure hash digests, so the approach scales well. For duplicate content, only the corresponding data pointer information needs to be recorded, which saves storage space. Data deduplication can therefore greatly reduce storage consumption and improve the resource utilization of storage devices.
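The index-table idea described above can be illustrated with a minimal sketch. The class name `FingerprintIndex`, its in-memory dictionaries and the returned integer pointers are assumptions made for illustration only; they are not taken from the patent.

```python
import hashlib


def fingerprint(block: bytes) -> str:
    """Return the SHA-1 digest of a block as a hex string."""
    return hashlib.sha1(block).hexdigest()


class FingerprintIndex:
    """Hypothetical in-memory index: fingerprint -> position of the stored block."""

    def __init__(self):
        self.locations = {}  # fingerprint -> pointer into self.blocks
        self.blocks = []     # stand-in for the physical block store

    def write(self, block: bytes) -> int:
        fp = fingerprint(block)
        if fp in self.locations:
            # Duplicate content: record only a pointer, store nothing new.
            return self.locations[fp]
        self.blocks.append(block)
        self.locations[fp] = len(self.blocks) - 1
        return self.locations[fp]


index = FingerprintIndex()
pointers = [index.write(b) for b in (b"alpha", b"beta", b"alpha")]
```

In this sketch the third write returns the pointer of the first block, so only two blocks are physically stored.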
Currently, the deduplication process that a storage node in a distributed storage system applies to a data slice generally includes the following steps: calculating the fingerprint of the data slice; querying whether the fingerprint exists in the fingerprint library of the storage node; and, if not, querying whether the fingerprint exists in the fingerprint libraries of the other storage nodes in the distributed storage system, thereby confirming whether the data slice already exists in the distributed storage system. The drawback of this method is that the number of storage nodes in a distributed storage system is usually large, so a storage node that needs to query the fingerprint libraries of the other storage nodes must communicate with them one by one, which is slow and inefficient.
Therefore, how to improve the deduplication efficiency of a distributed storage system is an urgent problem to be solved.
Disclosure of Invention
The main object of the present invention is to provide a distributed storage system, a storage node device, a data deduplication method and a computer readable storage medium, so as to improve the deduplication efficiency of the distributed storage system.
In order to achieve the above object, the present invention provides a distributed storage system. The distributed storage system includes a plurality of storage node devices and a plurality of shared fingerprint libraries. Each storage node device is communicatively connected to the shared fingerprint libraries, and a local fingerprint library is either provided in the storage node device or communicatively connected to it. The storage node device is configured to:
receive a data slice write request, where the data slice write request includes a plurality of data slices to be written and the fingerprints of the data slices to be written;
determine the fingerprints to be deduplicated among the fingerprints of the data slices to be written, and search the local fingerprint library for each fingerprint to be deduplicated, where the local fingerprint library includes the fingerprints of the data slices stored in the storage node device;
when one or more fingerprints to be deduplicated exist in the local fingerprint library, delete the data slices to be written that correspond to those fingerprints;
when one or more fingerprints to be deduplicated do not exist in the local fingerprint library, take those fingerprints as fingerprints to be processed and search the shared fingerprint library for each fingerprint to be processed, where the shared fingerprint library includes the fingerprints of the data slices stored in all storage node devices; and when one or more fingerprints to be processed are found in the shared fingerprint library, delete the data slices to be written that correspond to the fingerprints found.
Preferably, the data slices to be written are obtained by splitting the data to be written, and the data slice write request further includes a data slice fingerprint sequence, which contains the fingerprints of all the data slices to be written, arranged in order;
determining the fingerprints to be deduplicated among the fingerprints of the data slices to be written includes: judging whether redundant fingerprints exist among the fingerprints of the data slices to be written; if so, deleting the redundant fingerprints and taking the remaining fingerprints as the fingerprints to be deduplicated; if not, taking the fingerprints of all the data slices to be written as the fingerprints to be deduplicated;
the storage node device is further configured to:
store the data slice fingerprint sequence in the local fingerprint library and the shared fingerprint library;
the storage node device is further configured to:
store all the remaining data slices to be written, and store the storage location information of the remaining data slices to be written in the local fingerprint library and the shared fingerprint library.
Preferably, the distributed storage system further includes a control node device communicatively connected to each storage node device and to the shared fingerprint library, and the storage node device is further configured to:
determine a reference count change value for the fingerprint of each data slice to be written, and send the reference count change value of the fingerprint of each data slice to be written to the control node device;
the control node device is configured to:
update the cumulative reference count of the fingerprint of each data slice to be written according to the reference count change value of that fingerprint.
Preferably, the storage node device is further configured to:
receive a deletion request for data to be deleted;
acquire the data slice fingerprint sequence of the data to be deleted, determine a reference count change value for each fingerprint in the acquired data slice fingerprint sequence, and send the reference count change value of each fingerprint in the data slice fingerprint sequence to the control node device;
the control node device is further configured to:
update the cumulative reference count of each fingerprint in the data slice fingerprint sequence according to the reference count change value of that fingerprint, delete the data slice fingerprint sequence of the data to be deleted from the shared fingerprint library, and notify the storage node device to delete the data slice fingerprint sequence of the data to be deleted from the local fingerprint library;
when the cumulative reference count of a fingerprint in the shared fingerprint library is detected to be zero, record how long the cumulative reference count of the fingerprint remains zero, delete the fingerprint when that duration exceeds a preset duration, and notify the corresponding storage node device to delete the fingerprint and the data slice corresponding to the fingerprint.
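The reference counting and delayed deletion described above can be sketched as follows. This is only an illustration under stated assumptions: the names `ControlNode`, `change_values`, `apply_changes` and `collect` are hypothetical, time is passed in explicitly as a number of seconds, and the change value is assumed to be +1 per occurrence of a fingerprint on a write and -1 per occurrence on a delete, which the text does not specify.

```python
from collections import Counter


def change_values(sequence, deleting=False):
    """Assumed rule: +1 per occurrence on write, -1 per occurrence on delete."""
    sign = -1 if deleting else 1
    return {fp: sign * n for fp, n in Counter(sequence).items()}


class ControlNode:
    """Hypothetical control node holding cumulative reference counts."""

    def __init__(self, grace_seconds: float):
        self.grace_seconds = grace_seconds  # the "preset duration"
        self.ref_counts = {}   # fingerprint -> cumulative reference count
        self.zero_since = {}   # fingerprint -> time the count became zero

    def apply_changes(self, changes: dict, now: float) -> None:
        for fp, delta in changes.items():
            count = self.ref_counts.get(fp, 0) + delta
            self.ref_counts[fp] = count
            if count == 0:
                self.zero_since.setdefault(fp, now)
            else:
                self.zero_since.pop(fp, None)  # referenced again: cancel timer

    def collect(self, now: float) -> list:
        """Delete fingerprints whose count stayed zero longer than the grace period."""
        expired = [fp for fp, t in self.zero_since.items()
                   if now - t > self.grace_seconds]
        for fp in expired:
            del self.zero_since[fp]
            del self.ref_counts[fp]
        return sorted(expired)  # storage nodes would be notified about these


node = ControlNode(grace_seconds=60)
node.apply_changes(change_values(["a", "b", "a"]), now=0)
node.apply_changes(change_values(["a", "b", "a"], deleting=True), now=10)
node.apply_changes(change_values(["b"]), now=20)  # "b" is written again
early = node.collect(now=50)
late = node.collect(now=100)
```

Waiting for the preset duration before deleting gives a fingerprint whose count has just dropped to zero a chance to be referenced again, as happens to "b" in this example.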
In addition, to achieve the above object, the present invention further provides a data deduplication method applicable to a distributed storage system. The distributed storage system includes a plurality of storage node devices and a plurality of shared fingerprint libraries; each storage node device is communicatively connected to the shared fingerprint libraries, and a local fingerprint library is either provided in the storage node device or communicatively connected to it. The method includes the following steps:
a receiving step: a storage node device receives a data slice write request, where the data slice write request includes a plurality of data slices to be written and the fingerprints of the data slices to be written;
a querying step: the storage node device determines the fingerprints to be deduplicated among the fingerprints of the data slices to be written, and searches the local fingerprint library for each fingerprint to be deduplicated, where the local fingerprint library includes the fingerprints of the data slices stored in the storage node device;
a first deduplication step: when one or more fingerprints to be deduplicated exist in the local fingerprint library, the storage node device deletes the data slices to be written that correspond to those fingerprints;
a second deduplication step: when one or more fingerprints to be deduplicated do not exist in the local fingerprint library, the storage node device takes those fingerprints as fingerprints to be processed and searches the shared fingerprint library for each fingerprint to be processed, where the shared fingerprint library includes the fingerprints of the data slices stored in all storage node devices; and when one or more fingerprints to be processed are found in the shared fingerprint library, the storage node device deletes the data slices to be written that correspond to the fingerprints found.
Preferably, the data slices to be written are obtained by splitting the data to be written, and the data slice write request further includes a data slice fingerprint sequence, which contains the fingerprints of all the data slices to be written, arranged in order;
the step of determining the fingerprints to be deduplicated among the fingerprints of the data slices to be written includes: judging whether redundant fingerprints exist among the fingerprints of the data slices to be written; if so, deleting the redundant fingerprints and taking the remaining fingerprints as the fingerprints to be deduplicated; if not, taking the fingerprints of all the data slices to be written as the fingerprints to be deduplicated;
after the receiving step, the method further includes:
the storage node device stores the data slice fingerprint sequence in the local fingerprint library and the shared fingerprint library;
after the second deduplication step, the method further includes:
the storage node device stores all the remaining data slices to be written, and stores the storage location information of the remaining data slices to be written in the local fingerprint library and the shared fingerprint library.
Preferably, the distributed storage system further includes a control node device communicatively connected to each storage node device and to the shared fingerprint library, and after the receiving step, the method further includes:
the storage node device determines a reference count change value for the fingerprint of each data slice to be written, and sends the reference count change value of the fingerprint of each data slice to be written to the control node device;
the control node device updates the cumulative reference count of the fingerprint of each data slice to be written according to the reference count change value of that fingerprint.
Preferably, the method further includes:
the storage node device receives a deletion request for data to be deleted;
the storage node device acquires the data slice fingerprint sequence of the data to be deleted, determines a reference count change value for each fingerprint in the acquired data slice fingerprint sequence, and sends the reference count change value of each fingerprint in the data slice fingerprint sequence to the control node device;
the control node device updates the cumulative reference count of each fingerprint in the data slice fingerprint sequence according to the reference count change value of that fingerprint, deletes the data slice fingerprint sequence of the data to be deleted from the shared fingerprint library, and notifies the storage node device to delete the data slice fingerprint sequence of the data to be deleted from the local fingerprint library;
when the cumulative reference count of a fingerprint in the shared fingerprint library is detected to be zero, the control node device records how long the cumulative reference count of the fingerprint remains zero, deletes the fingerprint when that duration exceeds a preset duration, and notifies the corresponding storage node device to delete the fingerprint and the data slice corresponding to the fingerprint.
In addition, to achieve the above object, the present invention further provides a storage node device. The storage node device is communicatively connected to a shared fingerprint library, and a local fingerprint library is either provided in the storage node device or communicatively connected to it. The storage node device includes a memory and a processor; the memory stores a data deduplication program which, when executed by the processor, implements the following steps:
a receiving step: receiving a data slice write request, where the data slice write request includes a plurality of data slices to be written and the fingerprints of the data slices to be written;
a querying step: determining the fingerprints to be deduplicated among the fingerprints of the data slices to be written, and searching the local fingerprint library for each fingerprint to be deduplicated, where the local fingerprint library includes the fingerprints of the data slices stored in the storage node device;
a first deduplication step: when one or more fingerprints to be deduplicated exist in the local fingerprint library, deleting the data slices to be written that correspond to those fingerprints;
a second deduplication step: when one or more fingerprints to be deduplicated do not exist in the local fingerprint library, taking those fingerprints as fingerprints to be processed and searching the shared fingerprint library for each fingerprint to be processed, where the shared fingerprint library includes the fingerprints of the data slices stored in all storage node devices; and when one or more fingerprints to be processed are found in the shared fingerprint library, deleting the data slices to be written that correspond to the fingerprints found.
Preferably, the data slices to be written are obtained by splitting the data to be written, and the data slice write request further includes a data slice fingerprint sequence, which contains the fingerprints of all the data slices to be written, arranged in order;
the step of determining the fingerprints to be deduplicated among the fingerprints of the data slices to be written includes: judging whether redundant fingerprints exist among the fingerprints of the data slices to be written; if so, deleting the redundant fingerprints and taking the remaining fingerprints as the fingerprints to be deduplicated; if not, taking the fingerprints of all the data slices to be written as the fingerprints to be deduplicated;
when the processor executes the data deduplication program, the following step is further implemented after the receiving step:
storing the data slice fingerprint sequence in the local fingerprint library and the shared fingerprint library;
when the processor executes the data deduplication program, the following step is further implemented after the second deduplication step:
storing all the remaining data slices to be written, and storing the storage location information of the remaining data slices to be written in the local fingerprint library and the shared fingerprint library.
Preferably, the distributed storage system further includes a control node device communicatively connected to each storage node device and to the shared fingerprint library, and when the processor executes the data deduplication program, the following step is further implemented after the receiving step:
determining a reference count change value for the fingerprint of each data slice to be written, and sending the reference count change value of the fingerprint of each data slice to be written to the control node device, so that the control node device updates the cumulative reference count of the fingerprint of each data slice to be written according to the reference count change value of that fingerprint.
In addition, to achieve the above object, the present invention further provides a computer readable storage medium suitable for a storage node device. The storage node device is communicatively connected to a shared fingerprint library, and a local fingerprint library is either provided in the storage node device or communicatively connected to it. The computer readable storage medium stores a data deduplication program executable by at least one processor to cause the at least one processor to perform the following steps:
a receiving step: receiving a data slice write request, where the data slice write request includes a plurality of data slices to be written and the fingerprints of the data slices to be written;
a querying step: determining the fingerprints to be deduplicated among the fingerprints of the data slices to be written, and searching the local fingerprint library for each fingerprint to be deduplicated, where the local fingerprint library includes the fingerprints of the data slices stored in the storage node device;
a first deduplication step: when one or more fingerprints to be deduplicated exist in the local fingerprint library, deleting the data slices to be written that correspond to those fingerprints;
a second deduplication step: when one or more fingerprints to be deduplicated do not exist in the local fingerprint library, taking those fingerprints as fingerprints to be processed and searching the shared fingerprint library for each fingerprint to be processed, where the shared fingerprint library includes the fingerprints of the data slices stored in all storage node devices; and when one or more fingerprints to be processed are found in the shared fingerprint library, deleting the data slices to be written that correspond to the fingerprints found.
Preferably, the data slices to be written are obtained by splitting the data to be written, and the data slice write request further includes a data slice fingerprint sequence, which contains the fingerprints of all the data slices to be written, arranged in order;
the step of determining the fingerprints to be deduplicated among the fingerprints of the data slices to be written includes: judging whether redundant fingerprints exist among the fingerprints of the data slices to be written; if so, deleting the redundant fingerprints and taking the remaining fingerprints as the fingerprints to be deduplicated; if not, taking the fingerprints of all the data slices to be written as the fingerprints to be deduplicated;
when the processor executes the data deduplication program, the following step is further implemented after the receiving step:
storing the data slice fingerprint sequence in the local fingerprint library and the shared fingerprint library;
when the processor executes the data deduplication program, the following step is further implemented after the second deduplication step:
storing all the remaining data slices to be written, and storing the storage location information of the remaining data slices to be written in the local fingerprint library and the shared fingerprint library.
Preferably, the distributed storage system further includes a control node device communicatively connected to each storage node device and to the shared fingerprint library, and when the processor executes the data deduplication program, the following step is further implemented after the receiving step:
determining a reference count change value for the fingerprint of each data slice to be written, and sending the reference count change value of the fingerprint of each data slice to be written to the control node device, so that the control node device updates the cumulative reference count of the fingerprint of each data slice to be written according to the reference count change value of that fingerprint.
Compared with the prior art, when a storage node device performs data deduplication and does not find the fingerprint of a data slice to be written in the local fingerprint library, it can query the shared fingerprint library directly to determine whether the fingerprint is a duplicate, without having to communicate with the other storage node devices one by one. This improves the data deduplication efficiency of the distributed storage system.
Drawings
To illustrate the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present invention; a person skilled in the art can derive other drawings from the structures shown in these drawings without inventive effort.
FIG. 1 is a schematic diagram of a system architecture of an embodiment of a distributed storage system according to the present invention;
FIG. 2 is a schematic diagram illustrating an operating environment of an embodiment of a data deduplication process according to the present invention;
FIG. 3 is a block diagram illustrating an embodiment of a data deduplication process according to the present invention;
FIG. 4 is a flowchart illustrating an embodiment of a data deduplication method according to the present invention.
The achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
The principles and features of the present invention are described below with reference to the drawings. The examples are given to illustrate the invention and are not to be construed as limiting its scope.
Referring to FIG. 1, a system architecture diagram of one embodiment of a distributed storage system according to the present invention is shown.
In this embodiment, the distributed storage system includes a plurality of storage node devices 1 and a plurality of shared fingerprint libraries 2. The storage node devices 1 and the shared fingerprint libraries 2 are communicatively connected (for example, through a network 4), and a local fingerprint library 3 is either provided in each storage node device 1 or communicatively connected to it. The local fingerprint library 3 includes the fingerprints of the data slices stored in the corresponding storage node device 1, and the shared fingerprint library 2 includes the fingerprints of the data slices stored in all storage node devices 1.
The storage node device 1 is configured to:
receive a data slice write request, where the data slice write request includes a plurality of data slices to be written and the fingerprints of the data slices to be written;
determine the fingerprints to be deduplicated among the fingerprints of the data slices to be written, and search the local fingerprint library 3 for each fingerprint to be deduplicated;
when one or more fingerprints to be deduplicated exist in the local fingerprint library 3, delete the data slices to be written that correspond to those fingerprints;
when one or more fingerprints to be deduplicated do not exist in the local fingerprint library 3, take those fingerprints as fingerprints to be processed and search the shared fingerprint library 2 for each fingerprint to be processed; and when one or more fingerprints to be processed are found in the shared fingerprint library 2, delete the data slices to be written that correspond to the fingerprints found.
In this embodiment, the storage node device 1 receives a data slice write request, where the data slice write request includes a plurality of data slices to be written and the fingerprints of the data slices to be written. The data slices to be written are obtained by splitting the data to be written (the data types of the data to be written include block-level data and file-level data). The splitting may be performed by the storage node device 1 or by any other suitable device (for example, a client). The splitting method includes:
The data file to be written is cut into a preset number of data slices of the same size. Alternatively, with the preset number denoted M, where M is a natural number greater than 1, the data slice size corresponding to the data file to be written is determined, M-1 data slices of that size are cut off one by one, and the data remaining after cutting forms the last slice. The size of a data slice to be written may be 4 KB, 8 KB, 12 KB, 16 KB, or another granularity.
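A minimal sketch of fixed-size splitting is given below. It assumes the slice size is supplied directly and that whatever is left after the full-size slices becomes the final, possibly shorter slice; the function name `split_fixed` is hypothetical.

```python
def split_fixed(data: bytes, slice_size: int) -> list:
    """Cut data into slices of slice_size bytes; the last slice holds the remainder."""
    if slice_size <= 0:
        raise ValueError("slice_size must be positive")
    return [data[i:i + slice_size] for i in range(0, len(data), slice_size)]


# Tiny slice size for readability; the text mentions sizes such as 4 KB or 8 KB.
slices = split_fixed(b"abcdefghij", 4)
```

Joining the slices in order reproduces the original data, which is what later reassembly relies on.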
After the data to be written is split into a plurality of data slices to be written, the fingerprint of each data slice to be written is calculated, for example with Message-Digest Algorithm 5 (MD5), Secure Hash Algorithm 1 (SHA-1), or a similar algorithm. At the same time, the order of the data slices to be written is recorded (that is, the data slice fingerprint sequence), so that when the data is read later, the data slices can be reassembled into the original data according to the data slice fingerprint sequence. In addition, the storage node device 1 may save the data slice fingerprint sequence to the local fingerprint library 3 and the shared fingerprint library 2.
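The role of the data slice fingerprint sequence can be sketched as follows. The sketch uses SHA-1 from Python's standard `hashlib`; the helper names and the dictionary that stands in for the slice store are assumptions for illustration.

```python
import hashlib


def fingerprint(data_slice: bytes) -> str:
    """SHA-1 fingerprint of one data slice, as a hex string."""
    return hashlib.sha1(data_slice).hexdigest()


def build_sequence(slices: list) -> list:
    """Fingerprints of the slices in their original order."""
    return [fingerprint(s) for s in slices]


def reassemble(sequence: list, store: dict) -> bytes:
    """Rebuild the data by looking up each fingerprint in order."""
    return b"".join(store[fp] for fp in sequence)


slices = [b"AAAA", b"BBBB", b"AAAA", b"CC"]
sequence = build_sequence(slices)
store = {fingerprint(s): s for s in slices}  # each unique slice kept once
restored = reassemble(sequence, store)
```

The sequence keeps four entries even though only three distinct slices are stored, so a repeated slice can still be placed at each of its original positions.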
Next, the storage node device 1 determines the fingerprints to be deduplicated among the fingerprints of the data slices to be written. The method for determining the fingerprints to be deduplicated includes: judging whether redundant fingerprints exist among the fingerprints of the data slices to be written; if so, deleting the redundant fingerprints and taking the remaining fingerprints as the fingerprints to be deduplicated; if not, taking the fingerprints of all the data slices to be written as the fingerprints to be deduplicated.
For example, the storage node device 1 determines whether identical fingerprints exist among the fingerprints of all the data slices to be written. If identical fingerprints exist, each set of identical fingerprints is treated as a fingerprint group; after all fingerprint groups have been found, one fingerprint in each group is selected and retained, and the unselected fingerprints are deleted as redundant fingerprints. The storage node device 1 then judges whether ungrouped fingerprints exist; if so, each ungrouped fingerprint is taken as a fingerprint to be deduplicated, and if not, the flow ends. If no identical fingerprints exist, the fingerprints of all the data slices to be written are treated as ungrouped fingerprints, and all ungrouped fingerprints are taken as the fingerprints to be deduplicated.
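Removing redundant fingerprints within a single write request can be sketched as below. The sketch follows the general rule stated above (delete the redundant fingerprints and treat the remaining ones as fingerprints to be deduplicated) and assumes that the first occurrence in each group of identical fingerprints is the one retained; the function name is hypothetical.

```python
def fingerprints_to_deduplicate(fingerprints: list):
    """Split a request's fingerprints into retained and redundant ones."""
    seen = set()
    kept = []       # one representative per distinct fingerprint, in order
    redundant = []  # later repeats of an already-seen fingerprint
    for fp in fingerprints:
        if fp in seen:
            redundant.append(fp)
        else:
            seen.add(fp)
            kept.append(fp)
    return kept, redundant


kept, redundant = fingerprints_to_deduplicate(["a", "b", "a", "c", "b"])
```

Only the retained fingerprints then need to be looked up in the fingerprint libraries, so repeated slices inside one request cost no extra queries.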
After identifying the fingerprints to be deduplicated, the storage node device 1 searches the local fingerprint library 3 for each fingerprint to be deduplicated. When one or more fingerprints to be deduplicated exist in the local fingerprint library 3, the data slices to be written corresponding to those fingerprints are duplicate data slices, and the storage node device 1 deletes them in order to save storage space.
Finally, when one or more to-be-deduplicated fingerprints do not exist in the local fingerprint library 3, the storage node device 1 takes the one or more to-be-deduplicated fingerprints as to-be-processed fingerprints, searches each to-be-processed fingerprint in the shared fingerprint library 2, and deletes the to-be-written data piece corresponding to the one or more to-be-processed fingerprints when the one or more to-be-processed fingerprints are found in the shared fingerprint library 2.
Since the shared fingerprint library 2 holds the full set of fingerprint data, if a fingerprint to be deduplicated is not found in the local fingerprint library 3, the storage node device 1 continues to search for that fingerprint in the shared fingerprint library 2. If it is found there, the data slice to be written corresponding to the fingerprint already exists on another storage node, is a duplicate data slice, and does not need to be stored.
Compared with the prior art, when the storage node device 1 performs data deduplication and does not find the fingerprint of a data slice to be written in the local fingerprint library 3, it can query the shared fingerprint library 2 directly to determine whether the fingerprint is a duplicate, without communicating with the other storage node devices 1 one by one, which improves the data deduplication efficiency of the distributed storage system.
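The two-level lookup described above can be sketched in Python. The two libraries are modeled here as plain in-memory sets, which is an assumption for illustration; the specification does not prescribe their data structure, and the function name `classify_fingerprints` is hypothetical.

```python
def classify_fingerprints(to_dedup, local_library, shared_library):
    """Classify fingerprints by where (if anywhere) they are already known.

    Returns (local_duplicates, shared_duplicates, unique). Only fingerprints
    missing from the local library are looked up in the shared library.
    """
    local_duplicates = [fp for fp in to_dedup if fp in local_library]
    pending = [fp for fp in to_dedup if fp not in local_library]
    shared_duplicates = [fp for fp in pending if fp in shared_library]
    unique = [fp for fp in pending if fp not in shared_library]
    return local_duplicates, shared_duplicates, unique

local_library = {"a"}
shared_library = {"a", "b"}   # full fingerprint set across all nodes
outcome = classify_fingerprints(["a", "b", "c"], local_library, shared_library)
```

Slices in the first two groups are dropped as duplicates; only the slices in `unique` need to be stored.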
Further, in the present embodiment, the storage node device 1 is further configured to:
the storage node device 1 stores all the remaining data slices to be written (i.e. the data slices which are not repeated), and stores the storage position information corresponding to the remaining data slices to be written into the local fingerprint library 3 and the shared fingerprint library 2.
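A minimal sketch of storing the non-duplicate data slices and recording their storage position information is given below. The layout of a location record as `(node_id, offset, length)` and the use of a `bytearray` as the storage medium are assumptions made for the example; the specification does not define the format of the storage position information.

```python
def store_unique_slices(unique, node_id, disk, local_library, shared_library):
    """Append each unique slice to `disk` and record where it landed.

    `unique` maps fingerprint -> slice bytes. The same location record is
    written to both the local and the shared fingerprint library.
    """
    for fp, payload in unique.items():
        location = (node_id, len(disk), len(payload))  # hypothetical record layout
        disk.extend(payload)
        local_library[fp] = location
        shared_library[fp] = location

disk = bytearray()
local_library, shared_library = {}, {}
store_unique_slices({"f1": b"abcd", "f2": b"ef"}, "node-1",
                    disk, local_library, shared_library)
```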
Further, in this embodiment, the distributed storage system further includes a control node device 5, where the control node device 5 is respectively communicatively connected to the storage node device 1 and the shared fingerprint library 2 (e.g., communicatively connected via the network 4). The shared fingerprint library 2 may be disposed in a shared disk (e.g., NVME disk mounted by NVMEOF), which may be disposed in the control node device 5, or may be disposed independently of the control node device 5.
The storage node device 1 is further configured to:
the reference count change value of the fingerprint of each data slice to be written is determined (for example, the reference count change value of each fingerprint to be deduplicated is determined to be +1), and the reference count change value of the fingerprint of each data slice to be written is transmitted to the control node device 5.
The control node device 5 is configured to:
and updating the accumulated reference count of the fingerprints of each data piece to be written according to the reference count change value of the fingerprints of each data piece to be written (the accumulated reference count of a fingerprint represents the total number of times the data piece corresponding to the fingerprint is referenced by stored data).
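The division of work between the storage node (which reports change values) and the control node (which keeps the accumulated counts) can be sketched as follows. The class name `ControlNode` and the dictionary-shaped report are hypothetical; the specification does not define a message format.

```python
from collections import Counter

class ControlNode:
    """Keeps the accumulated reference count of every fingerprint."""

    def __init__(self):
        self.accumulated = Counter()

    def apply_changes(self, changes):
        """Apply a report of {fingerprint: change value} from a storage node."""
        for fp, delta in changes.items():
            self.accumulated[fp] += delta

control = ControlNode()
control.apply_changes({"a": +1, "b": +1})  # first write references a and b
control.apply_changes({"a": +1})           # second write references a again
control.apply_changes({"b": -1})           # a deletion releases b
```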
Further, in the present embodiment, the storage node device 1 is further configured to:
receiving a deletion request for data to be deleted, acquiring the data slice fingerprint sequence of the data to be deleted, determining a reference count change value of each fingerprint in the acquired data slice fingerprint sequence (for example, determining that the reference count change value of each fingerprint in the data slice fingerprint sequence is -1), and transmitting the reference count change value of each fingerprint in the data slice fingerprint sequence to the control node device 5.
The control node device 5 is further configured to:
and updating the accumulated reference count of each fingerprint in the data slice fingerprint sequence according to the reference count change value of each fingerprint in the data slice fingerprint sequence, deleting the data slice fingerprint sequence of the data to be deleted from the shared fingerprint library 2, and informing the storage node device 1 to delete the data slice fingerprint sequence of the data to be deleted from the local fingerprint library 3.
In this embodiment, when data is to be deleted, the storage node device 1 cannot directly delete the data slices of the data to be deleted, because it cannot determine whether those data slices are simultaneously referenced by other data; deleting them directly could cause data loss. Therefore, only the accumulated reference count of each fingerprint in the data slice fingerprint sequence of the data to be deleted is updated, and the data slice fingerprint sequence of the data to be deleted is deleted.
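The deletion flow can be sketched as below. For brevity the sketch folds the storage node's report and the control node's update into one hypothetical function, and it assumes that a fingerprint repeated within one sequence is decremented once; the specification does not state how repeats inside a sequence are counted.

```python
def delete_data(data_id, sequences, accumulated):
    """Drop the fingerprint sequence of `data_id` and decrement its references.

    No data slice is removed here. The function only returns the fingerprints
    whose accumulated count has reached zero, as candidates for later cleanup.
    """
    sequence = sequences.pop(data_id)
    distinct = set(sequence)
    for fp in distinct:
        accumulated[fp] -= 1  # change value of -1 per fingerprint
    return sorted(fp for fp in distinct if accumulated[fp] == 0)

sequences = {"file-a": ["x", "y"], "file-b": ["y", "z"]}
accumulated = {"x": 1, "y": 2, "z": 1}
zero_candidates = delete_data("file-a", sequences, accumulated)
```

In this example slice `y` survives the deletion of `file-a` because `file-b` still references it.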
Further, in the present embodiment, the control node device 5 is further configured to:
when the accumulated reference count of a fingerprint is detected as zero in the shared fingerprint library 2 (i.e., the piece of data corresponding to the fingerprint is not referenced by any data), the duration of the state in which the accumulated reference count of the fingerprint is kept as zero is recorded.
And deleting the fingerprint when the duration time is longer than the preset duration time, and notifying the corresponding storage node equipment 1 to delete the fingerprint and the data sheet corresponding to the fingerprint.
And when the duration is less than or equal to the preset duration, not performing deletion processing.
In this embodiment, when the control node device 5 detects that the accumulated reference count of a fingerprint is zero, it waits for the preset duration before deleting the data slice corresponding to the fingerprint, and during that period it continues to receive, in real time, the reference count change values of the fingerprint reported by each storage node device 1. This avoids erroneous deletion of data caused by a storage node device 1 failing to report a reference count change value in time.
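The delayed deletion rule can be sketched with a small tracker. The class name `ZeroCountTracker`, the explicit `now` argument (used instead of a real clock so the behavior is easy to check), and the unit of seconds are all assumptions for illustration.

```python
class ZeroCountTracker:
    """Tracks how long each fingerprint has stayed at a zero reference count."""

    def __init__(self, preset_duration):
        self.preset_duration = preset_duration
        self.zero_since = {}  # fingerprint -> time the count first hit zero

    def observe(self, fingerprint, count, now):
        """Record the latest accumulated count of a fingerprint at time `now`."""
        if count == 0:
            self.zero_since.setdefault(fingerprint, now)
        else:
            self.zero_since.pop(fingerprint, None)  # referenced again: reset

    def expired(self, now):
        """Fingerprints whose zero state has lasted longer than the preset duration."""
        return sorted(fp for fp, since in self.zero_since.items()
                      if now - since > self.preset_duration)

tracker = ZeroCountTracker(preset_duration=10)
tracker.observe("a", 0, now=0)
tracker.observe("b", 0, now=0)
tracker.observe("b", 1, now=5)  # a late report revives b
```

A duration exactly equal to the preset duration does not trigger deletion, matching the "less than or equal to" branch described above.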
Further, in the present embodiment, the storage node device 1 is further configured to:
when a reading request of data to be read is received, acquiring a data sheet fingerprint sequence of the data to be read, acquiring storage position information of data sheets corresponding to all fingerprints in the data sheet fingerprint sequence, acquiring the data sheets corresponding to all fingerprints in the data sheet fingerprint sequence according to the acquired storage position information, and then assembling the acquired data sheets into the data to be read according to the data sheet fingerprint sequence.
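Reading can be sketched as a lookup followed by ordered concatenation. The `read_slice` callback and the tuple-shaped locations are hypothetical stand-ins for whatever mechanism fetches a data slice from its storage position.

```python
def assemble(fingerprint_sequence, locations, read_slice):
    """Rebuild data by fetching each slice in fingerprint-sequence order.

    `locations` maps fingerprint -> storage position information and
    `read_slice` fetches the slice bytes stored at one location.
    """
    return b"".join(read_slice(locations[fp]) for fp in fingerprint_sequence)

# Illustrative in-memory "storage" keyed by (node, offset).
store = {("node-1", 0): b"hello ", ("node-2", 0): b"world "}
locations = {"fp1": ("node-1", 0), "fp2": ("node-2", 0)}
data = assemble(["fp1", "fp2", "fp1"], locations, store.__getitem__)
```

A fingerprint that appears twice in the sequence is read from the same single stored slice, which is how deduplicated data is restored to its original length.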
The invention provides a data deduplication program.
Referring to FIG. 2, a schematic diagram of the operating environment of the data deduplication program 10 according to an embodiment of the present invention is shown.
In the present embodiment, the data deduplication program 10 is installed and run in the storage node apparatus 1. Storage node device 1 may be a computing device such as a desktop computer, a notebook, a palm top computer, a server, or the like. The storage node apparatus 1 may include, but is not limited to, a memory 11, a processor 12, and a display 13. Fig. 2 shows only the storage node device 1 with components 11-13, but it should be understood that not all of the illustrated components are required to be implemented, and that more or fewer components may alternatively be implemented.
The memory 11 may in some embodiments be an internal storage unit of the storage node device 1, such as a hard disk or a memory of the storage node device 1. The memory 11 may in other embodiments also be an external storage device of the storage node device 1, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card) or the like, which are provided on the storage node device 1. Further, the memory 11 may also include both an internal storage unit of the storage node device 1 and an external storage device. The memory 11 is used for storing application software installed in the storage node device 1 and various types of data, such as program codes of the data deduplication program 10. The memory 11 may also be used to temporarily store data that has been output or is to be output.
The processor 12 may in some embodiments be a central processing unit (Central Processing Unit, CPU), microprocessor or other data processing chip for running program code or processing data stored in the memory 11, e.g. executing the data deduplication program 10, etc.
The display 13 may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like in some embodiments. The display 13 is used for displaying information processed in the storage node device 1 and for displaying a visualized user interface. The components 11-13 of the storage node device 1 communicate with each other via a program bus.
Referring to FIG. 3, a block diagram of the data deduplication program 10 according to an embodiment of the present invention is shown. In this embodiment, the data deduplication program 10 may be divided into one or more modules, and the one or more modules are stored in the memory 11 and executed by one or more processors (the processor 12 in this embodiment) to complete the present invention. For example, in FIG. 3, the data deduplication program 10 may be partitioned into a receiving module 101, a preprocessing module 102, a querying module 103, a first deduplication module 104, and a second deduplication module 105. A module, as referred to in the present invention, is a series of computer program instruction segments capable of performing a specific function, and is better suited than the program as a whole for describing the execution of the data deduplication program 10 in the storage node device 1, wherein:
The receiving module 101 is configured to receive a data slice writing request, where the data slice writing request includes a plurality of data slices to be written and fingerprints of the data slices to be written.
The data slice to be written is obtained by slicing data to be written (the data type of the data to be written includes block-level data and file-level data), and the slicing operation may be performed by the receiving module 101 or any other suitable device (e.g., a client), where a slicing method includes:
cutting the data file to be written into a preset number of data slices with the same data size; or, denoting the preset number as M, where M is a natural number greater than 1, determining the data slice size corresponding to the data file to be written, cutting off M-1 data slices of that size one by one, and taking the data remaining after cutting as the M-th data slice. The size of the data slices to be written may be 4KB, 8KB, 12KB, 16KB, or another granularity.
After the data to be written is split into a plurality of data slices to be written, the fingerprint of each data slice to be written is calculated, for example through Message-Digest Algorithm 5 (MD5), Secure Hash Algorithm 1 (SHA-1), or the like. At the same time, the arrangement order of the data slices to be written (namely, the data slice fingerprint sequence) is recorded, so that when the data to be written is read later, the data slices to be written can be assembled into the data to be written according to the data slice fingerprint sequence. In addition, the data slice fingerprint sequence can be stored in the local fingerprint library 3 and the shared fingerprint library 2.
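The second slicing variant (equal-size slices cut one by one, with the remainder forming the last slice) can be sketched as follows. The function name `split_by_size` is hypothetical, and the 4-byte slice size in the example is chosen only to keep the output readable; the text above mentions sizes such as 4KB or 8KB.

```python
def split_by_size(data, slice_size):
    """Cut `data` into slices of `slice_size` bytes; the last slice is the remainder."""
    if slice_size <= 0:
        raise ValueError("slice_size must be positive")
    return [data[start:start + slice_size]
            for start in range(0, len(data), slice_size)]

pieces = split_by_size(b"abcdefghij", 4)
```

When the data length is an exact multiple of the slice size, every slice has the same size, which corresponds to the first variant.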
And the preprocessing module 102 is used for determining a fingerprint to be deduplicated in the fingerprints of the data sheet to be written.
The method for determining the fingerprint to be de-duplicated by the preprocessing module 102 includes:
judging whether redundant fingerprints exist among the fingerprints of the data slices to be written; if so, deleting the redundant fingerprints and taking the remaining fingerprints as the fingerprints to be deduplicated; if not, taking all the fingerprints of the data slices to be written as the fingerprints to be deduplicated.
For example, it is determined whether identical fingerprints exist among the fingerprints of all the data slices to be written. If identical fingerprints exist, the identical fingerprints are taken as a fingerprint group; after all fingerprint groups are found, one fingerprint in each fingerprint group is selected and retained, and the unselected fingerprints are deleted as redundant fingerprints. It is then judged whether ungrouped fingerprints exist; if so, each ungrouped fingerprint is taken as a fingerprint to be deduplicated, and if not, the flow ends. If no identical fingerprints exist, the fingerprints of all the data slices to be written are taken as ungrouped fingerprints, and each ungrouped fingerprint is taken as a fingerprint to be deduplicated.
A query module 103, configured to find whether each fingerprint to be deduplicated exists in the local fingerprint database 3.
The first deduplication module 104 is configured to delete a piece of data to be written corresponding to one or more fingerprints to be deduplicated when the one or more fingerprints to be deduplicated exist in the local fingerprint library 3.
When one or more fingerprints to be deduplicated exist in the local fingerprint library 3, the data pieces to be written corresponding to the fingerprints to be deduplicated are duplicate data pieces, and in order to save storage space, the duplicate data pieces are deleted.
The second deduplication module 105 is configured to, when one or more fingerprints to be deduplicated do not exist in the local fingerprint library 3, take those fingerprints as fingerprints to be processed, search for each fingerprint to be processed in the shared fingerprint library 2, and, when one or more fingerprints to be processed are found in the shared fingerprint library 2, delete the data slices to be written corresponding to the fingerprints found.
Since the shared fingerprint library 2 holds the full set of fingerprint data, if the first deduplication module 104 does not find a fingerprint to be deduplicated in the local fingerprint library 3, the second deduplication module 105 continues to search for that fingerprint in the shared fingerprint library 2. If it is found there, the data slice to be written corresponding to the fingerprint already exists on another storage node, is a duplicate data slice, and does not need to be stored.
Compared with the prior art, when the storage node device 1 performs data deduplication and does not find the fingerprint of a data slice to be written in the local fingerprint library 3, it can query the shared fingerprint library 2 directly to determine whether the fingerprint is a duplicate, without communicating with the other storage node devices 1 one by one, which improves the data deduplication efficiency of the distributed storage system.
Further, in this embodiment, the data deduplication program 10 further includes a storage module (not shown in the figure) for:
and storing all the rest data sheets to be written (namely, unrepeated data sheets), and storing the storage position information corresponding to the rest data sheets to be written into the local fingerprint library 3 and the shared fingerprint library 2.
Further, in this embodiment, the data deduplication program 10 further includes a reference update module (not shown in the figure) for:
determining a reference count change value of the fingerprint of each data slice to be written (for example, determining that the reference count change value of each fingerprint to be deduplicated is +1), and sending the reference count change value of the fingerprint of each data slice to be written to the control node device 5, so that the control node device 5 updates the accumulated reference count of the fingerprint of each data slice to be written according to that reference count change value (the accumulated reference count of a fingerprint represents the total number of times the data slice corresponding to the fingerprint is referenced by stored data).
Further, in this embodiment, the data deduplication program 10 further includes a deletion module (not shown in the figure) for:
receiving a deletion request for data to be deleted, acquiring the data slice fingerprint sequence of the data to be deleted, determining a reference count change value of each fingerprint in the acquired data slice fingerprint sequence (for example, determining that the reference count change value of each fingerprint in the data slice fingerprint sequence is -1), and transmitting the reference count change value of each fingerprint in the data slice fingerprint sequence to the control node device 5. The control node device 5 updates the accumulated reference count of each fingerprint in the data slice fingerprint sequence according to the reference count change value of each fingerprint in the data slice fingerprint sequence, deletes the data slice fingerprint sequence of the data to be deleted from the shared fingerprint library 2, and notifies the storage node device 1 to delete the data slice fingerprint sequence of the data to be deleted from the local fingerprint library 3.
In this embodiment, when data is to be deleted, the storage node device 1 cannot directly delete the data slices of the data to be deleted, because it cannot determine whether those data slices are simultaneously referenced by other data; deleting them directly could cause data loss. Therefore, only the accumulated reference count of each fingerprint in the data slice fingerprint sequence of the data to be deleted is updated, and the data slice fingerprint sequence of the data to be deleted is deleted.
When a fingerprint is detected in the shared fingerprint library 2 with a zero accumulated reference count (i.e. the piece of data corresponding to the fingerprint is not referenced by any data), the control node device 5 records the duration of the state in which the fingerprint maintains the accumulated reference count zero. And deleting the fingerprint when the duration time is longer than the preset duration time, and notifying the corresponding storage node equipment 1 to delete the fingerprint and the data sheet corresponding to the fingerprint. And when the duration is less than or equal to the preset duration, not performing deletion processing.
In this embodiment, when the control node device 5 detects that the accumulated reference count of a fingerprint is zero, it waits for the preset duration before deleting the data slice corresponding to the fingerprint, and during that period it continues to receive, in real time, the reference count change values of the fingerprint reported by each storage node device 1. This avoids erroneous deletion of data caused by a storage node device 1 failing to report a reference count change value in time.
Further, in this embodiment, the data deduplication program 10 further includes a reading module (not shown in the figure) for:
when a reading request of data to be read is received, acquiring a data sheet fingerprint sequence of the data to be read, acquiring storage position information of data sheets corresponding to all fingerprints in the data sheet fingerprint sequence, acquiring the data sheets corresponding to all fingerprints in the data sheet fingerprint sequence according to the acquired storage position information, and then assembling the acquired data sheets into the data to be read according to the data sheet fingerprint sequence.
In addition, the invention provides a data deduplication method. The method is suitable for the distributed storage system.
Fig. 4 is a flowchart of a data deduplication method according to an embodiment of the present invention.
In this embodiment, the method includes:
in step S10, the storage node device 1 receives a data slice write request, where the data slice write request includes a plurality of data slices to be written and fingerprints of the data slices to be written.
The data slice to be written is sliced from the data to be written (the data type of the data to be written includes block level data and file level data), and the slicing operation may be performed by the storage node device 1 or any other applicable device (e.g., client), and the slicing method includes:
cutting the data file to be written into a preset number of data slices with the same data size; or, denoting the preset number as M, where M is a natural number greater than 1, determining the data slice size corresponding to the data file to be written, cutting off M-1 data slices of that size one by one, and taking the data remaining after cutting as the M-th data slice. The size of the data slices to be written may be 4KB, 8KB, 12KB, 16KB, or another granularity.
After the data to be written is split into a plurality of data slices to be written, the fingerprint of each data slice to be written is calculated, for example through Message-Digest Algorithm 5 (MD5), Secure Hash Algorithm 1 (SHA-1), or the like. At the same time, the arrangement order of the data slices to be written (namely, the data slice fingerprint sequence) is recorded, so that when the data to be written is read later, the data slices to be written can be assembled into the data to be written according to the data slice fingerprint sequence. In addition, the data slice fingerprint sequence can be stored in the local fingerprint library 3 and the shared fingerprint library 2.
In step S20, the storage node device 1 determines a fingerprint to be deduplicated among the fingerprints of the data pieces to be written.
The method for determining the fingerprint to be de-duplicated comprises the following steps: judging whether redundant fingerprints exist in the fingerprints of the data sheet to be written, if so, deleting the redundant fingerprints, taking the rest fingerprints as fingerprints to be deduplicated, and if not, taking all the fingerprints to be written as fingerprints to be deduplicated.
For example, the step S20 includes steps S21 to S26 (not shown in the figure). Wherein:
Step S21, judging whether the same fingerprints exist in all the fingerprints to be written in the data sheet.
Step S22, if the same fingerprints exist, the same fingerprints are used as a fingerprint group, after all fingerprint groups are found, one fingerprint is selected to be reserved in each fingerprint group, the unselected fingerprints are used as redundant fingerprints to be deleted, whether the ungrouped fingerprints exist or not is judged, if yes, each ungrouped fingerprint is used as a fingerprint to be de-duplicated, and if not, the flow is ended.
In step S23, if the same fingerprint does not exist, all the fingerprints to be written into the data slice are taken as ungrouped fingerprints, and each ungrouped fingerprint is taken as a fingerprint to be de-duplicated.
In step S30, the storage node device 1 searches the local fingerprint library 3 for the presence of each fingerprint to be deduplicated.
In step S40, when one or more fingerprints to be deduplicated exist in the local fingerprint library 3, the storage node device 1 deletes the data piece to be written corresponding to the one or more fingerprints to be deduplicated.
When one or more fingerprints to be deduplicated exist in the local fingerprint library 3, the data pieces to be written corresponding to the fingerprints to be deduplicated are duplicate data pieces, and in order to save storage space, the duplicate data pieces are deleted.
In step S50, when one or more to-be-deduplicated fingerprints do not exist in the local fingerprint library 3, the storage node device 1 takes the one or more to-be-deduplicated fingerprints as to-be-processed fingerprints, searches each to-be-processed fingerprint in the shared fingerprint library 2, and deletes the to-be-written data piece corresponding to the one or more to-be-processed fingerprints when one or more to-be-processed fingerprints are found in the shared fingerprint library 2.
Since the shared fingerprint library 2 holds the full set of fingerprint data, if a fingerprint to be deduplicated is not found in the local fingerprint library 3, the storage node device 1 continues to search for that fingerprint in the shared fingerprint library 2. If it is found there, the data slice to be written corresponding to the fingerprint already exists on another storage node, is a duplicate data slice, and does not need to be stored.
Compared with the prior art, when the storage node device 1 performs data deduplication and does not find the fingerprint of a data slice to be written in the local fingerprint library 3, it can query the shared fingerprint library 2 directly to determine whether the fingerprint is a duplicate, without communicating with the other storage node devices 1 one by one, which improves the data deduplication efficiency of the distributed storage system.
Further, in the present embodiment, after step S50, the method further includes:
the storage node device 1 stores all the remaining data slices to be written (i.e. the data slices which are not repeated), and stores the storage position information corresponding to the remaining data slices to be written into the local fingerprint library 3 and the shared fingerprint library 2.
Further, in this embodiment, the distributed storage system further includes a control node device 5 communicatively connected to each storage node device 1 and the shared fingerprint library 2, and after the step S20, the method further includes:
the storage node device 1 determines a reference count change value of the fingerprint of each data slice to be written (for example, determines that the reference count change value of each fingerprint to be deduplicated is +1), and transmits the reference count change value of the fingerprint of each data slice to be written to the control node device 5.
Next, the control node device 5 updates the accumulated reference count of the fingerprints of each piece of data to be written according to the reference count change value of the fingerprint of each piece of data to be written (the accumulated reference count of a fingerprint represents the total number of times the piece of data corresponding to the fingerprint is referenced by the stored data).
Further, in the present embodiment, the method further includes steps S60 to S80 (not shown in the drawings).
Wherein:
In step S60, the storage node device 1 receives a deletion request for data to be deleted.
In step S70, the storage node device 1 acquires the data slice fingerprint sequence of the data to be deleted, determines a reference count change value of each fingerprint in the acquired data slice fingerprint sequence (for example, determines that the reference count change value of each fingerprint in the data slice fingerprint sequence is -1), and sends the reference count change value of each fingerprint in the data slice fingerprint sequence to the control node device 5.
In step S80, the control node device 5 updates the accumulated reference count of each fingerprint in the data slice fingerprint sequence according to the reference count change value of each fingerprint in the data slice fingerprint sequence, deletes the data slice fingerprint sequence of the data to be deleted from the shared fingerprint library 2, and notifies the storage node device 1 to delete the data slice fingerprint sequence of the data to be deleted from the local fingerprint library 3.
In this embodiment, when data is to be deleted, the storage node device 1 cannot directly delete the data slices of the data to be deleted, because it cannot determine whether those data slices are simultaneously referenced by other data; deleting them directly could cause data loss. Therefore, only the accumulated reference count of each fingerprint in the data slice fingerprint sequence of the data to be deleted is updated, and the data slice fingerprint sequence of the data to be deleted is deleted.
Further, in this embodiment, the method further includes:
when a fingerprint is detected in the shared fingerprint library 2 with a zero accumulated reference count (i.e. the piece of data corresponding to the fingerprint is not referenced by any data), the control node device 5 records the duration of the state in which the fingerprint maintains the accumulated reference count zero.
And deleting the fingerprint when the duration time is longer than the preset duration time, and notifying the corresponding storage node equipment 1 to delete the fingerprint and the data sheet corresponding to the fingerprint.
And when the duration is less than or equal to the preset duration, not performing deletion processing.
In this embodiment, when the control node device 5 detects that the accumulated reference count of a fingerprint is zero, it waits for the preset duration before deleting the data slice corresponding to the fingerprint, and during that period it continues to receive, in real time, the reference count change values of the fingerprint reported by each storage node device 1. This avoids erroneous deletion of data caused by a storage node device 1 failing to report a reference count change value in time.
Further, in the present embodiment, the method further includes step S90 (not shown in the figure).
In step S90, when receiving a read request of data to be read, the storage node device 1 obtains a data slice fingerprint sequence of the data to be read, obtains storage position information of data slices corresponding to each fingerprint in the data slice fingerprint sequence, obtains the data slices corresponding to each fingerprint in the data slice fingerprint sequence according to the obtained storage position information, and assembles the obtained data slices into the data to be read according to the data slice fingerprint sequence.
Further, the present invention also proposes a computer-readable storage medium storing the data deduplication program 10. Embodiments of the data deduplication program 10 are described in detail above and are not repeated here.
The foregoing is only a preferred embodiment of the present invention and does not thereby limit the scope of the invention. Any equivalent structural change made under the inventive concept of the present invention using the contents of the specification and the accompanying drawings, or any direct or indirect application in other related technical fields, is likewise included in the scope of protection of the present invention.

Claims (10)

1. A distributed storage system, characterized by comprising a plurality of storage node devices and a plurality of shared fingerprint libraries, wherein the storage node devices are communicatively connected to the shared fingerprint libraries, and local fingerprint libraries are arranged in the storage node devices or the storage node devices are communicatively connected to corresponding local fingerprint libraries, the storage node devices being configured for:
receiving a data slice write request, wherein the data slice write request comprises a plurality of data slices to be written, the fingerprint of each data slice to be written, and a data slice fingerprint sequence; the data slice fingerprint sequence comprises the fingerprints of the data slices to be written arranged in order; the data slices to be written are obtained by dividing the data to be written into a preset number of data slices of the same size; after the data to be written is divided into the plurality of data slices to be written, the fingerprint of each data slice to be written is calculated and the order of the data slices to be written is recorded to obtain the data slice fingerprint sequence, so that, on reading, the data slices to be written are assembled into the data to be written according to the data slice fingerprint sequence; and the data slice fingerprint sequence is stored in the local fingerprint library and the shared fingerprint library;
determining the fingerprints to be deduplicated among the fingerprints of the data slices to be written, and searching the local fingerprint library for each fingerprint to be deduplicated, wherein the local fingerprint library comprises the fingerprints of the data slices stored in the storage node device;
when one or more fingerprints to be deduplicated exist in the local fingerprint library, deleting the data slices to be written corresponding to the one or more fingerprints to be deduplicated;
when one or more fingerprints to be deduplicated do not exist in the local fingerprint library, taking the one or more fingerprints to be deduplicated as fingerprints to be processed and searching the shared fingerprint library for each fingerprint to be processed, wherein the shared fingerprint library comprises the fingerprints of the data slices stored in all storage node devices; and when one or more fingerprints to be processed are found in the shared fingerprint library, deleting the data slices to be written corresponding to the found fingerprints to be processed, storing all remaining data slices to be written, and storing the storage position information corresponding to the remaining data slices to be written in the local fingerprint library and the shared fingerprint library.
2. The distributed storage system of claim 1, wherein the data slices to be written are obtained by slicing the data to be written;
the determining of the fingerprints to be deduplicated among the fingerprints of the data slices to be written comprises: judging whether redundant fingerprints exist among the fingerprints of the data slices to be written; if so, deleting the redundant fingerprints and taking the remaining fingerprints as the fingerprints to be deduplicated; and if not, taking all the fingerprints of the data slices to be written as the fingerprints to be deduplicated.
3. The distributed storage system of claim 1 or 2, further comprising a control node device communicatively connected to each storage node device and to the shared fingerprint library, the storage node device being further configured for:
determining a reference count change value of the fingerprint of each data slice to be written, and sending the reference count change value of the fingerprint of each data slice to be written to the control node device;
the control node device being configured for:
updating the accumulated reference count of the fingerprint of each data slice to be written according to the reference count change value of the fingerprint of each data slice to be written.
4. The distributed storage system of claim 2, wherein the storage node device is further configured for:
receiving a deletion request for data to be deleted;
acquiring the data slice fingerprint sequence of the data to be deleted, determining a reference count change value of each fingerprint in the acquired data slice fingerprint sequence, and sending the reference count change value of each fingerprint in the data slice fingerprint sequence to the control node device;
the control node device being further configured for:
updating the accumulated reference count of each fingerprint in the data slice fingerprint sequence according to the reference count change value of each fingerprint in the data slice fingerprint sequence, deleting the data slice fingerprint sequence of the data to be deleted from the shared fingerprint library, and notifying the storage node device to delete the data slice fingerprint sequence of the data to be deleted from the local fingerprint library;
when the accumulated reference count of a fingerprint is detected to be zero in the shared fingerprint library, recording the duration for which the accumulated reference count of the fingerprint remains zero, deleting the fingerprint when the duration is longer than a preset duration, and notifying the corresponding storage node device to delete the fingerprint and the data slice corresponding to the fingerprint.
5. A data deduplication method suitable for a distributed storage system, characterized in that the distributed storage system comprises a plurality of storage node devices and a plurality of shared fingerprint libraries, the storage node devices are communicatively connected to the shared fingerprint libraries, and local fingerprint libraries are arranged in the storage node devices or the storage node devices are communicatively connected to corresponding local fingerprint libraries, the method comprising the following steps:
a receiving step: a storage node device receives a data slice write request, wherein the data slice write request comprises a plurality of data slices to be written, the fingerprint of each data slice to be written, and a data slice fingerprint sequence; the data slice fingerprint sequence comprises the fingerprints of the data slices to be written arranged in order; the data slices to be written are obtained by dividing the data to be written into a preset number of data slices of the same size; after the data to be written is divided into the plurality of data slices to be written, the fingerprint of each data slice to be written is calculated and the order of the data slices to be written is recorded to obtain the data slice fingerprint sequence, so that, on reading, the data slices to be written are assembled into the data to be written according to the data slice fingerprint sequence; and the data slice fingerprint sequence is stored in the local fingerprint library and the shared fingerprint library;
a querying step: the storage node device determines the fingerprints to be deduplicated among the fingerprints of the data slices to be written, and searches the local fingerprint library for each fingerprint to be deduplicated, wherein the local fingerprint library comprises the fingerprints of the data slices stored in the storage node device;
a first deduplication step: when one or more fingerprints to be deduplicated exist in the local fingerprint library, the storage node device deletes the data slices to be written corresponding to the one or more fingerprints to be deduplicated;
a second deduplication step: when one or more fingerprints to be deduplicated do not exist in the local fingerprint library, the storage node device takes the one or more fingerprints to be deduplicated as fingerprints to be processed and searches the shared fingerprint library for each fingerprint to be processed, wherein the shared fingerprint library comprises the fingerprints of the data slices stored in all storage node devices; and when one or more fingerprints to be processed are found in the shared fingerprint library, the storage node device deletes the data slices to be written corresponding to the found fingerprints to be processed, stores all remaining data slices to be written, and stores the storage position information corresponding to the remaining data slices to be written in the local fingerprint library and the shared fingerprint library.
6. The data deduplication method of claim 5, wherein the data slices to be written are obtained by slicing the data to be written;
the step of determining the fingerprints to be deduplicated among the fingerprints of the data slices to be written comprises: judging whether redundant fingerprints exist among the fingerprints of the data slices to be written; if so, deleting the redundant fingerprints and taking the remaining fingerprints as the fingerprints to be deduplicated; and if not, taking all the fingerprints of the data slices to be written as the fingerprints to be deduplicated.
7. The data deduplication method of claim 5 or 6, wherein the distributed storage system further comprises a control node device communicatively connected to each storage node device and to the shared fingerprint library, the method further comprising, after the receiving step:
the storage node device determines a reference count change value of the fingerprint of each data slice to be written, and sends the reference count change value of the fingerprint of each data slice to be written to the control node device;
the control node device updates the accumulated reference count of the fingerprint of each data slice to be written according to the reference count change value of the fingerprint of each data slice to be written.
8. The data deduplication method of claim 6, wherein the method further comprises:
the storage node device receives a deletion request for data to be deleted;
the storage node device acquires the data slice fingerprint sequence of the data to be deleted, determines a reference count change value of each fingerprint in the acquired data slice fingerprint sequence, and sends the reference count change value of each fingerprint in the data slice fingerprint sequence to the control node device;
the control node device updates the accumulated reference count of each fingerprint in the data slice fingerprint sequence according to the reference count change value of each fingerprint in the data slice fingerprint sequence, deletes the data slice fingerprint sequence of the data to be deleted from the shared fingerprint library, and notifies the storage node device to delete the data slice fingerprint sequence of the data to be deleted from the local fingerprint library;
when the accumulated reference count of a fingerprint is detected to be zero in the shared fingerprint library, the control node device records the duration for which the accumulated reference count of the fingerprint remains zero, deletes the fingerprint when the duration is longer than a preset duration, and notifies the corresponding storage node device to delete the fingerprint and the data slice corresponding to the fingerprint.
9. A storage node device, characterized in that the storage node device is communicatively connected to a shared fingerprint library, a local fingerprint library is arranged in the storage node device or the storage node device is communicatively connected to a corresponding local fingerprint library, and the storage node device comprises a memory and a processor, wherein a data deduplication program is stored in the memory and, when executed by the processor, implements the following steps:
a receiving step: receiving a data slice write request, wherein the data slice write request comprises a plurality of data slices to be written, the fingerprint of each data slice to be written, and a data slice fingerprint sequence; the data slice fingerprint sequence comprises the fingerprints of the data slices to be written arranged in order; the data slices to be written are obtained by dividing the data to be written into a preset number of data slices of the same size; after the data to be written is divided into the plurality of data slices to be written, the fingerprint of each data slice to be written is calculated and the order of the data slices to be written is recorded to obtain the data slice fingerprint sequence, so that, on reading, the data slices to be written are assembled into the data to be written according to the data slice fingerprint sequence; and the data slice fingerprint sequence is stored in the local fingerprint library and the shared fingerprint library;
a querying step: determining the fingerprints to be deduplicated among the fingerprints of the data slices to be written, and searching the local fingerprint library for each fingerprint to be deduplicated, wherein the local fingerprint library comprises the fingerprints of the data slices stored in the storage node device;
a first deduplication step: when one or more fingerprints to be deduplicated exist in the local fingerprint library, deleting the data slices to be written corresponding to the one or more fingerprints to be deduplicated;
a second deduplication step: when one or more fingerprints to be deduplicated do not exist in the local fingerprint library, taking the one or more fingerprints to be deduplicated as fingerprints to be processed and searching the shared fingerprint library for each fingerprint to be processed, wherein the shared fingerprint library comprises the fingerprints of the data slices stored in all storage node devices; and when one or more fingerprints to be processed are found in the shared fingerprint library, deleting the data slices to be written corresponding to the found fingerprints to be processed, storing all remaining data slices to be written, and storing the storage position information corresponding to the remaining data slices to be written in the local fingerprint library and the shared fingerprint library.
10. A computer-readable storage medium suitable for a storage node device, wherein the storage node device is communicatively connected to a shared fingerprint library, a local fingerprint library is arranged in the storage node device or the storage node device is communicatively connected to a corresponding local fingerprint library, and the computer-readable storage medium stores a data deduplication program executable by at least one processor to cause the at least one processor to implement the following steps:
a receiving step: receiving a data slice write request, wherein the data slice write request comprises a plurality of data slices to be written, the fingerprint of each data slice to be written, and a data slice fingerprint sequence; the data slice fingerprint sequence comprises the fingerprints of the data slices to be written arranged in order; the data slices to be written are obtained by dividing the data to be written into a preset number of data slices of the same size; after the data to be written is divided into the plurality of data slices to be written, the fingerprint of each data slice to be written is calculated and the order of the data slices to be written is recorded to obtain the data slice fingerprint sequence, so that, on reading, the data slices to be written are assembled into the data to be written according to the data slice fingerprint sequence; and the data slice fingerprint sequence is stored in the local fingerprint library and the shared fingerprint library;
a querying step: determining the fingerprints to be deduplicated among the fingerprints of the data slices to be written, and searching the local fingerprint library for each fingerprint to be deduplicated, wherein the local fingerprint library comprises the fingerprints of the data slices stored in the storage node device;
a first deduplication step: when one or more fingerprints to be deduplicated exist in the local fingerprint library, deleting the data slices to be written corresponding to the one or more fingerprints to be deduplicated;
a second deduplication step: when one or more fingerprints to be deduplicated do not exist in the local fingerprint library, taking the one or more fingerprints to be deduplicated as fingerprints to be processed and searching the shared fingerprint library for each fingerprint to be processed, wherein the shared fingerprint library comprises the fingerprints of the data slices stored in all storage node devices; and when one or more fingerprints to be processed are found in the shared fingerprint library, deleting the data slices to be written corresponding to the found fingerprints to be processed, storing all remaining data slices to be written, and storing the storage position information corresponding to the remaining data slices to be written in the local fingerprint library and the shared fingerprint library.
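
As a non-normative illustration of the write path recited in the claims above (slicing, fingerprinting, removal of redundant fingerprints within a request, then a local lookup followed by a shared lookup), the following Python sketch may help. Several points are assumptions made only for the example: SHA-256 is used as the fingerprint function, slicing uses a fixed slice size so the last slice may be shorter, both fingerprint libraries are modelled as in-memory sets, and the function names `slice_data`, `fingerprint` and `deduplicate_write` are hypothetical.

```python
import hashlib
from typing import Dict, List, Set, Tuple


def slice_data(data: bytes, slice_size: int) -> List[bytes]:
    """Split data into consecutive slices of slice_size bytes (assumed scheme)."""
    return [data[i:i + slice_size] for i in range(0, len(data), slice_size)]


def fingerprint(piece: bytes) -> str:
    """Fingerprint of a data slice; SHA-256 is an assumption, not from the text."""
    return hashlib.sha256(piece).hexdigest()


def deduplicate_write(slices: List[bytes],
                      local_library: Set[str],
                      shared_library: Set[str]
                      ) -> Tuple[List[str], Dict[str, bytes]]:
    """Return the fingerprint sequence and the slices that must be stored."""
    sequence = [fingerprint(s) for s in slices]

    # Remove redundant fingerprints inside the request, keeping the first.
    candidates: Dict[str, bytes] = {}
    for fp, piece in zip(sequence, slices):
        candidates.setdefault(fp, piece)

    to_store: Dict[str, bytes] = {}
    for fp, piece in candidates.items():
        if fp in local_library:
            continue  # first deduplication: already stored on this node
        if fp in shared_library:
            continue  # second deduplication: already stored on another node
        to_store[fp] = piece
        # Record the newly stored slice in both libraries.
        local_library.add(fp)
        shared_library.add(fp)
    return sequence, to_store
```

The shared library is consulted only for fingerprints that miss in the local library, so slices already held by the receiving node never cause a lookup against the shared library. The full fingerprint sequence is returned unchanged, duplicates included, because it is what later drives reassembly on read.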
CN201910007367.9A 2019-01-04 2019-01-04 Distributed storage system, storage node device and data deduplication method Active CN109800218B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910007367.9A CN109800218B (en) 2019-01-04 2019-01-04 Distributed storage system, storage node device and data deduplication method
PCT/CN2019/118009 WO2020140622A1 (en) 2019-01-04 2019-11-13 Distributed storage system, storage node device and data duplicate deletion method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910007367.9A CN109800218B (en) 2019-01-04 2019-01-04 Distributed storage system, storage node device and data deduplication method

Publications (2)

Publication Number Publication Date
CN109800218A (en) 2019-05-24
CN109800218B (en) 2024-04-09

Family

ID=66558525

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910007367.9A Active CN109800218B (en) 2019-01-04 2019-01-04 Distributed storage system, storage node device and data deduplication method

Country Status (2)

Country Link
CN (1) CN109800218B (en)
WO (1) WO2020140622A1 (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109800218B (en) * 2019-01-04 2024-04-09 平安科技(深圳)有限公司 Distributed storage system, storage node device and data deduplication method
CN110457305B (en) * 2019-08-13 2021-11-26 腾讯科技(深圳)有限公司 Data deduplication method, device, equipment and medium
CN111399768A (en) * 2020-02-21 2020-07-10 苏州浪潮智能科技有限公司 Data storage method, system, equipment and computer readable storage medium
CN111459928B (en) * 2020-03-27 2023-07-07 上海爱数信息技术股份有限公司 Data deduplication method applied to data backup scene in cluster range and application
CN111580755B (en) * 2020-05-09 2022-07-05 杭州海康威视系统技术有限公司 Distributed data processing system and distributed data processing method
CN114138756B (en) * 2020-09-03 2023-03-24 金篆信科有限责任公司 Data deduplication method, node and computer-readable storage medium
CN114442931A (en) * 2021-12-23 2022-05-06 天翼云科技有限公司 Data deduplication method and system, electronic device and storage medium
CN117369731B (en) * 2023-12-07 2024-02-27 苏州元脑智能科技有限公司 Data reduction processing method, device, equipment and medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8495392B1 (en) * 2010-09-02 2013-07-23 Symantec Corporation Systems and methods for securely deduplicating data owned by multiple entities
CN103942292A (en) * 2014-04-11 2014-07-23 华为技术有限公司 Virtual machine mirror image document processing method, device and system
CN103944988A (en) * 2014-04-22 2014-07-23 南京邮电大学 Repeating data deleting system and method applicable to cloud storage
WO2015176249A1 (en) * 2014-05-21 2015-11-26 华为技术有限公司 Transmission method for wireless ethernet interface hard disk, related device, and system
CN107391761A (en) * 2017-08-28 2017-11-24 郑州云海信息技术有限公司 A kind of data managing method and device based on data de-duplication technology
CN108008918A (en) * 2017-11-30 2018-05-08 联想(北京)有限公司 Data processing method, memory node and distributed memory system
CN108415669A (en) * 2018-03-15 2018-08-17 深信服科技股份有限公司 The data duplicate removal method and device of storage system, computer installation and storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6677605B2 (en) * 2016-08-22 2020-04-08 株式会社東芝 Program, storage system, and storage system control method
CN107229420B (en) * 2017-05-27 2020-05-26 苏州浪潮智能科技有限公司 Data storage method, reading method, deleting method and data operating system
CN109800218B (en) * 2019-01-04 2024-04-09 平安科技(深圳)有限公司 Distributed storage system, storage node device and data deduplication method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
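Design and implementation of a file instant-transfer system in a cloud storage environment (文件秒传系统在云存储环境下的设计与实现); 胡渝苹; 计算机应用与软件; 2016-04-15 (No. 04); pp. 335-339 *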

Also Published As

Publication number Publication date
WO2020140622A1 (en) 2020-07-09
CN109800218A (en) 2019-05-24

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant