Summary of the Invention
The main object of the present invention is to provide a multi-layer data deduplication method and device based on a distributed file system, with the aim of improving the deduplication rate for duplicate data.
To achieve the above object, the present invention provides a multi-layer data deduplication method based on a distributed file system, the method comprising the steps of:
Obtaining the digital fingerprint of a file to be written;
Judging whether the digital fingerprint of the file to be written exists in a global file digital fingerprint list;
If so, recording the metadata information of the file to be written;
If not, slicing the file to be written in a preset manner and obtaining the digital fingerprint of each slice;
Judging whether the digital fingerprint of the slice exists in a global file slice digital fingerprint list;
If so, recording the metadata information of the slice in a storage node;
If not, sending the slice and the digital fingerprint of the slice to the corresponding storage node.
Preferably, after the slice and the digital fingerprint of the slice are sent to the corresponding storage node, the method further comprises the steps of:
Judging whether the digital fingerprint of the slice exists in the slice digital fingerprint list of the current storage node;
If so, confirming that the slice has been written successfully;
If not, writing the slice and recording the digital fingerprint of the slice in the slice digital fingerprint list of this storage node.
Preferably, after the slice is written to disk and the digital fingerprint of the slice is recorded in the slice digital fingerprint list of this storage node, the method further comprises the steps of:
Periodically obtaining the system load;
When the system load is below a preset value, uploading the information in the slice digital fingerprint list of each storage node to the global slice digital fingerprint list.
Preferably, slicing the file to be written in a preset manner and obtaining the digital fingerprint of each slice specifically comprises the steps of:
Judging whether the size of the file to be written exceeds a preset value;
If so, slicing the file to be written into slices of a preset size;
If not, treating the whole file to be written as a single slice.
Preferably, obtaining the digital fingerprint of the file to be written specifically comprises the steps of:
Obtaining the MD5 checksum and the SHA value of the file to be written;
Concatenating the character strings of the MD5 checksum and the SHA value to form the digital fingerprint of the file to be written.
In addition, to achieve the above object, the present invention also provides a multi-layer data deduplication device based on a distributed file system, comprising:
A first acquisition module, configured to obtain the digital fingerprint of a file to be written;
A first judging module, configured to judge whether the digital fingerprint of the file to be written exists in a global file digital fingerprint list;
A first recording module, configured to record the metadata information of the file to be written when the judgment result of the first judging module is "Yes";
A slicing module, configured to slice the file to be written in a preset manner and obtain the digital fingerprint of each slice when the judgment result of the first judging module is "No";
A second judging module, configured to judge whether the digital fingerprint of the slice exists in a global file slice digital fingerprint list;
A second recording module, configured to record the metadata information of the slice in a storage node when the judgment result of the second judging module is "Yes";
A sending module, configured to send the slice and the digital fingerprint of the slice to the corresponding storage node when the judgment result of the second judging module is "No".
Preferably, the device further comprises:
A third judging module, configured to judge whether the digital fingerprint of the slice exists in the slice digital fingerprint list of the current storage node;
A confirmation module, configured to confirm that the slice has been written successfully when the third judging module judges "Yes";
A writing module, configured to write the slice and record the digital fingerprint of the slice in the slice digital fingerprint list of this storage node when the third judging module judges "No".
Preferably, the device further comprises:
A second acquisition module, configured to periodically obtain the system load;
An upload module, configured to upload the information in the slice digital fingerprint list of each storage node to the global slice digital fingerprint list when the system load is below a preset value.
Preferably, the slicing module specifically comprises:
A judging unit, configured to judge whether the size of the file to be written exceeds a preset value;
A slicing unit, configured to slice the file to be written into slices of a preset size when the judging unit judges "Yes";
A determining unit, configured to treat the whole file to be written as a single slice when the judging unit judges "No".
Preferably, the first acquisition module specifically comprises:
An acquiring unit, configured to obtain the MD5 checksum and the SHA value of the file to be written;
A concatenation unit, configured to concatenate the character strings of the MD5 checksum and the SHA value to form the digital fingerprint of the file to be written.
Embodiments of the invention comprise the following steps: obtaining the digital fingerprint of a file to be written; judging whether the digital fingerprint of the file to be written exists in the global file digital fingerprint list; if so, recording the metadata information of the file to be written; if not, slicing the file to be written in a preset manner and obtaining the digital fingerprint of each slice; judging whether the digital fingerprint of the slice exists in the global file slice digital fingerprint list; if so, recording the metadata information of the slice in a storage node; if not, sending the slice and the digital fingerprint of the slice to the corresponding storage node. By storing the digital fingerprints of files and slices, the technical solution of the present invention improves the deduplication rate for duplicate data and saves storage space.
Detailed Description
It should be understood that the specific embodiments described herein are intended only to explain the present invention and not to limit it.
The present invention provides a multi-layer data deduplication method based on a distributed file system.
A distributed system is a software system built on a network and characterized by high cohesion and transparency. Cohesion means that each database node is highly autonomous and has its own local database management system. Transparency means that each database node is transparent to the user's application, which cannot tell whether a node is local or remote. In a distributed database system, users do not perceive that the data is distributed: they need not know whether a relation is partitioned, whether replicas exist, at which site the data is stored, or at which site a transaction executes. The independent computers present themselves to the user as a unified whole, as if they were a single system. The system has many general-purpose physical and logical resources and can allocate tasks dynamically; the dispersed physical and logical resources exchange information over a computer network. The most typical distributed system is the World Wide Web.
A first embodiment of the method of the present application is now presented. As shown in Fig. 1, the method comprises the steps of:
S100: Obtain the digital fingerprint of the file to be written.
A digital fingerprint is a unique code generated from the content of a file. Common digital fingerprints include MD5 (Message Digest Algorithm 5) and SHA-1 (Secure Hash Algorithm 1). Each file yields a unique digital fingerprint through a preset function or algorithm. Owing to the uniqueness property of the function or algorithm, even a slight difference between two files produces widely differing fingerprints, so comparing the digital fingerprints of files is a reliable basis for judging whether the files are identical.
In this embodiment, when a file write request is received from a client, the digital fingerprint of the file to be written is obtained first.
S200: Judge whether the digital fingerprint of the file to be written exists in the global file digital fingerprint list; if so, perform step S210; if not, perform step S220.
Further, after the digital fingerprint of the file to be written has been obtained, it is judged whether a corresponding digital fingerprint exists in the global file digital fingerprint list. Here, the global file digital fingerprint list is the list that stores the digital fingerprints of all complete files in the distributed file system. If the digital fingerprint of the file to be written exists in the global file digital fingerprint list, the original file system contains a file whose fingerprint equals that of the file to be written; by the uniqueness of digital fingerprints, it can be determined that a file identical to the file to be written already exists in the original file system, and step S210 is performed. Conversely, if the digital fingerprint of the file to be written does not exist in the global file digital fingerprint list, no file in the original file system has the same fingerprint as the file to be written, which shows that no identical file exists in the original file system, and step S220 is performed.
S210: Record the metadata information of the file to be written.
In this embodiment, the metadata information of the file to be written is recorded in a metadata server. When the original file system already contains a file whose digital fingerprint equals that of the file to be written, uploading the file again would clearly occupy space redundantly. In this embodiment the file itself is therefore not uploaded; instead, the metadata of the file to be written is recorded directly in the metadata server, which deduplicates the repeatedly uploaded file. As is generally known, metadata, also called intermediary data or relay data, is information that mainly describes the properties of data and supports functions such as indicating storage location, history, resource lookup and file records. Metadata can be regarded as an electronic catalogue: to serve the purpose of cataloguing, it must describe and record the content or characteristics of the data, and thereby assist data retrieval. In short, metadata is data about data. In this embodiment, the metadata mainly comprises the path and resource information of the existing file in the system that is identical to the file to be written. After the metadata of the file to be written has been recorded in the metadata server, a later request for the file to be written uses the recorded information to retrieve the existing identical file in the system, which yields a file identical to the file to be written. Recording the metadata information of the file to be written, instead of uploading the file directly as in the prior art, effectively saves storage space and data exchange time.
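As a non-limiting illustration, the file-level check of steps S200 and S210 might be sketched in Python as follows. The dictionaries `global_file_fingerprints` and `metadata_server`, the function names, and the use of SHA-1 alone as the fingerprint are assumptions made for the sketch and are not specified by this embodiment.

```python
import hashlib

# Hypothetical in-memory stand-ins for the global file fingerprint list
# and the metadata server; the embodiment does not prescribe these structures.
global_file_fingerprints = {}   # fingerprint -> path of the stored original
metadata_server = {}            # logical path -> metadata record


def file_fingerprint(data: bytes) -> str:
    # SHA-1 alone is used here for brevity (assumption).
    return hashlib.sha1(data).hexdigest()


def write_file(path: str, data: bytes) -> str:
    fp = file_fingerprint(data)
    original = global_file_fingerprints.get(fp)
    if original is not None:
        # S210: an identical file already exists, so only metadata is recorded.
        metadata_server[path] = {"fingerprint": fp, "original_path": original}
        return "deduplicated"
    # Otherwise the file would proceed to slicing (S220); this sketch
    # simply registers it as a new original.
    global_file_fingerprints[fp] = path
    metadata_server[path] = {"fingerprint": fp, "original_path": path}
    return "stored"
```

In this sketch, a second write of the same bytes under a different path adds one metadata record pointing at the first path and stores no further file content.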
S220: Slice the file to be written in a preset manner, obtain the digital fingerprint of each slice, and perform step S300.
When the digital fingerprint of the file to be written does not exist in the global file digital fingerprint list, no file in the original file system has the same fingerprint as the file to be written, which shows that no identical file exists in the original file system. In this case, this embodiment further slices the file to be written in a preset manner. It should be understood that, as technology and information develop, files tend to diversify, and many different versions can be derived from the same file. Taking operating systems as an example, the well-known Windows 10 operating system is divided into 32-bit and 64-bit systems, and further into multiple editions such as Home, Enterprise and Professional. Part of the file content of these different versions is identical, and each image file is about 4 GB in size. If only the file-level digital fingerprint is checked, the six versions above count as different files, and storing their image files takes about 24 GB in total, most of which is occupied by duplicate data.
In this embodiment, each file is further sliced in a preset manner, dividing one complete file into several small slice files. After the file has been sliced, the digital fingerprint of each slice is obtained with a preset algorithm or function, and step S300 is then performed.
S300: Judge whether the digital fingerprint of the slice exists in the global file slice digital fingerprint list; if so, perform step S310; if not, perform step S320.
After the digital fingerprint of each slice has been obtained, it is judged whether the global file slice digital fingerprint list contains a fingerprint identical to that of the slice. If it does, an identical slice exists; otherwise, no identical slice exists. When an identical slice exists, step S310 is performed; when no identical slice exists, step S320 is performed.
S310: Record the metadata information of the slice in a storage node.
When an identical slice exists in the distributed file system, the metadata information is recorded in the corresponding storage node. When the file to be written is later requested, the metadata of the file is retrieved to obtain the information about its slices, and the metadata information of each slice is then retrieved to obtain the storage information of the slice. The original file is restored from these slices, which completes the retrieval of the original file.
S320: Send the slice and the digital fingerprint of the slice to the corresponding storage node.
When no identical slice exists in the distributed file system, the slice and the digital fingerprint of the slice are sent to the corresponding storage node and stored there.
In this embodiment, if the image files of the six operating system versions above need to be stored, then after slicing, the slices made up of the content shared by most of the files need to be stored only once, and only the slices that differ need to be stored separately. Less than 5 GB of storage space is then enough to store files that would originally occupy 24 GB.
In this embodiment, the digital fingerprint of the file is checked first; files with differing fingerprints are then sliced, and the digital fingerprints of the slices are checked in turn. This effectively avoids storing duplicate data more than once and saves storage space. Data transmission time and data storage cost are reduced accordingly.
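As a non-limiting illustration, the two-level flow of steps S100 to S320 might be sketched in Python as follows. The in-memory structures, the tiny `SLICE_SIZE`, the use of SHA-1, the choice of storage node by fingerprint modulo the node count, and the immediate update of the global slice list are all assumptions made to keep the sketch short; the embodiment does not prescribe them.

```python
import hashlib

SLICE_SIZE = 4          # bytes; deliberately tiny for demonstration (assumption)

global_file_fps = {}    # file fingerprint -> ordered slice fingerprints
global_slice_fps = {}   # slice fingerprint -> id of the node holding it
nodes = [{}, {}]        # each storage node: slice fingerprint -> slice bytes
file_meta = {}          # path -> ordered slice fingerprints (metadata)


def fp(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()


def write(path: str, data: bytes) -> int:
    """Write a file; return how many slices were actually sent to nodes."""
    file_fp = fp(data)                                  # S100
    if file_fp in global_file_fps:                      # S200
        file_meta[path] = global_file_fps[file_fp]      # S210: metadata only
        return 0
    sent, slice_fps = 0, []
    for i in range(0, len(data), SLICE_SIZE):           # S220: slice the file
        chunk = data[i:i + SLICE_SIZE]
        s_fp = fp(chunk)
        slice_fps.append(s_fp)
        if s_fp not in global_slice_fps:                # S300
            node_id = int(s_fp, 16) % len(nodes)        # placement rule (assumption)
            nodes[node_id][s_fp] = chunk                # S320: send slice to node
            global_slice_fps[s_fp] = node_id
            sent += 1
        # else S310: the slice is already stored; only its fingerprint is kept
    global_file_fps[file_fp] = slice_fps
    file_meta[path] = slice_fps
    return sent


def read(path: str) -> bytes:
    """Restore a file from its slice metadata."""
    return b"".join(nodes[global_slice_fps[s]][s] for s in file_meta[path])
```

With these assumptions, writing `b"AAAABBBBCCCC"` and then `b"AAAABBBBDDDD"` sends three slices for the first file and only one for the second, because the first two slices of the second file are already known.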
Further, referring to Fig. 2, a second embodiment of the method of the present invention is proposed on the basis of the above embodiment. After step S320, the method further comprises the steps of:
S400: Judge whether the digital fingerprint of the slice exists in the slice digital fingerprint list of the current storage node; if so, perform step S410; if not, perform step S420.
S410: Confirm that the slice has been written successfully.
S420: Write the slice, and record the digital fingerprint of the slice in the slice digital fingerprint list of this storage node.
It should be understood that, in this embodiment, each storage node holds its own slice digital fingerprint list. When a slice is sent to the corresponding storage node, it is judged whether the digital fingerprint of the slice exists in the slice digital fingerprint list of that node. If it does, an identical slice is already stored on the current storage node, and it is only necessary to report to the metadata server that the slice was written successfully. If it does not, the slice is written to disk.
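As a non-limiting illustration, the node-local check of steps S400 to S420 might be sketched in Python as follows. The class `StorageNode`, its attribute names, and the dictionary standing in for the disk are hypothetical.

```python
import hashlib


class StorageNode:
    """Hypothetical storage node holding its own slice fingerprint list."""

    def __init__(self):
        self.slice_fingerprints = set()   # local slice digital fingerprint list
        self.disk = {}                    # stand-in for on-disk slice storage
        self.disk_writes = 0              # counts real writes, for illustration

    def receive(self, fingerprint: str, data: bytes) -> str:
        if fingerprint in self.slice_fingerprints:   # S400
            return "ok"                              # S410: confirm, no write
        self.disk[fingerprint] = data                # S420: write the slice
        self.disk_writes += 1
        self.slice_fingerprints.add(fingerprint)     # record its fingerprint
        return "ok"
```

The caller receives the same confirmation in both branches; only the second branch touches the disk. This check is useful when the global slice list has not yet been synchronized with the node's local list.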
As shown in Fig. 3, in a third embodiment of the method of the present invention, based on the second embodiment above, the method further comprises, after step S420, the steps of:
S500: Periodically obtain the system load.
S600: When the system load is below a preset value, upload the information in the slice digital fingerprint list of each storage node to the global slice digital fingerprint list.
While the system is running, an excessive system load affects the storage and transmission speed of data and files. In this embodiment, therefore, the system load is obtained, and further operations are carried out only when the load is below a certain preset value. In this embodiment, when the system load is below the preset value, the information in the slice digital fingerprint list of each storage node is uploaded to the global slice digital fingerprint list.
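As a non-limiting illustration, the load-gated upload of steps S500 and S600 might be sketched in Python as follows. The function name `sync_if_idle` and its parameters are hypothetical, and the load value is passed in by the caller so that the sketch does not depend on any platform; on Unix-like systems it could, for example, be taken from `os.getloadavg()`.

```python
def sync_if_idle(current_load: float, threshold: float,
                 node_lists: list, global_list: set) -> bool:
    """Merge node-local slice fingerprint lists into the global list,
    but only when the sampled load is below the preset threshold."""
    if current_load >= threshold:
        return False                      # system busy: skip this round
    for local in node_lists:
        global_list.update(local)         # S600: upload local fingerprints
    return True
```

A periodic scheduler (not shown) would call this function at a fixed interval, which corresponds to obtaining the load on a timer in step S500.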
In a fourth embodiment of the method of the present invention, based on the above embodiments, step S220 specifically comprises:
S221: Judge whether the size of the file to be written exceeds a preset value; if so, perform step S222; if not, perform step S223.
S222: Slice the file to be written into slices of a preset size.
S223: Treat the whole file to be written as a single slice.
It should be understood that, for a given file, smaller slices make it easier to find identical slices, but the file is then divided into more pieces, so slicing and computing the slice fingerprints take longer. When the slices are relatively large, there are fewer of them, so the slicing time and the time to compute the slice fingerprints are correspondingly shorter, but a slice is less likely to be identical to an existing slice. The value should be set according to need in actual use. Specific values may be 4 MB, 8 MB, 16 MB, 32 MB and so on, with 64 MB preferred; the value is generally set below 4 TB.
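As a non-limiting illustration, the size check and slicing of steps S221 to S223 might be sketched in Python as follows. Treating the size threshold and the slice size as two separate parameters is an assumption of the sketch, as are the names `slice_file`, `threshold` and `slice_size`.

```python
def slice_file(data: bytes, threshold: int, slice_size: int) -> list:
    """Return the slices of a file according to steps S221 to S223."""
    if len(data) > threshold:                          # S221
        return [data[i:i + slice_size]                 # S222: fixed-size slices
                for i in range(0, len(data), slice_size)]
    return [data]                                      # S223: whole file is one slice
```

The last slice may be shorter than `slice_size` when the file length is not a multiple of it. In practice `slice_size` would be one of the values discussed above, such as 64 MB, rather than the few bytes convenient for testing.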
On the basis of the above embodiments, a fifth embodiment of the method of the present invention is proposed, in which step S100 specifically comprises:
S110: Obtain the MD5 checksum and the SHA value of the file to be written.
S120: Concatenate the character strings of the MD5 checksum and the SHA value to form the digital fingerprint of the file to be written.
It should be understood that many kinds of digital fingerprint can be used in practice. This embodiment provides a preferred digital fingerprint: the MD5 checksum of the file to be written is obtained and denoted x, the SHA value of the file to be written, more preferably the SHA-1 value, is obtained and denoted y, and the two character strings are concatenated as xy to form the digital fingerprint of the file. Taking the original image file of the official 64-bit Simplified Chinese release of Win10 as an example, the MD5 value of the file is 2F8691F7FE2F569A70418A8633AC63F6, denoted x, and the SHA-1 value is C71D49A6144772F352806201EF564951BE55EDD5, denoted y. Concatenating x and y gives 2F8691F7FE2F569A70418A8633AC63F6C71D49A6144772F352806201EF564951BE55EDD5, which serves as the digital fingerprint used for verification.
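As a non-limiting illustration, the concatenated fingerprint of steps S110 and S120 might be computed in Python with the standard `hashlib` module as follows. The function name `combined_fingerprint` and the upper-case hexadecimal form are assumptions; hashing the whole content in memory is a simplification, and a large file would normally be read and hashed in blocks.

```python
import hashlib


def combined_fingerprint(data: bytes) -> str:
    """Concatenate the MD5 and SHA-1 hex digests (x followed by y)."""
    x = hashlib.md5(data).hexdigest().upper()    # S110: 32 hex characters
    y = hashlib.sha1(data).hexdigest().upper()   # S110: 40 hex characters
    return x + y                                 # S120: 72 hex characters
```

The result always has 72 hexadecimal characters, matching the length of the concatenated example value given above.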
In addition, to achieve the above object, the present invention also provides a multi-layer data deduplication device based on a distributed file system. Referring to Fig. 4, the device comprises:
A first acquisition module 10, configured to obtain the digital fingerprint of the file to be written.
A digital fingerprint is a unique code generated from the content of a file. Common digital fingerprints include MD5 (Message Digest Algorithm 5) and SHA-1 (Secure Hash Algorithm 1). Each file yields a unique digital fingerprint through a preset function or algorithm. Owing to the uniqueness property of the function or algorithm, even a slight difference between two files produces widely differing fingerprints, so comparing the digital fingerprints of files is a reliable basis for judging whether the files are identical.
In this embodiment, when a file write request is received from a client, the digital fingerprint of the file to be written is obtained first.
A first judging module 20, configured to judge whether the digital fingerprint of the file to be written exists in the global file digital fingerprint list.
Further, after the digital fingerprint of the file to be written has been obtained, it is judged whether a corresponding digital fingerprint exists in the global file digital fingerprint list. Here, the global file digital fingerprint list is the list that stores the digital fingerprints of all complete files in the distributed file system. If the digital fingerprint of the file to be written exists in the global file digital fingerprint list, the original file system contains a file whose fingerprint equals that of the file to be written; by the uniqueness of digital fingerprints, it can be determined that a file identical to the file to be written already exists in the original file system, and processing continues as in step S210. Conversely, if the digital fingerprint of the file to be written does not exist in the global file digital fingerprint list, no file in the original file system has the same fingerprint as the file to be written, which shows that no identical file exists in the original file system, and processing continues as in step S220.
A first recording module 30, configured to record the metadata information of the file to be written when the judgment result of the first judging module 20 is "Yes".
In this embodiment, the metadata information of the file to be written is recorded in a metadata server. When the original file system already contains a file whose digital fingerprint equals that of the file to be written, uploading the file again would clearly occupy space redundantly. In this embodiment the file itself is therefore not uploaded; instead, the metadata of the file to be written is recorded directly in the metadata server, which deduplicates the repeatedly uploaded file. As is generally known, metadata, also called intermediary data or relay data, is information that mainly describes the properties of data and supports functions such as indicating storage location, history, resource lookup and file records. Metadata can be regarded as an electronic catalogue: to serve the purpose of cataloguing, it must describe and record the content or characteristics of the data, and thereby assist data retrieval. In short, metadata is data about data. In this embodiment, the metadata mainly comprises the path and resource information of the existing file in the system that is identical to the file to be written. After the metadata of the file to be written has been recorded in the metadata server, a later request for the file to be written uses the recorded information to retrieve the existing identical file in the system, which yields a file identical to the file to be written. Recording the metadata information of the file to be written, instead of uploading the file directly as in the prior art, effectively saves storage space and data exchange time.
A slicing module 40, configured to slice the file to be written in a preset manner and obtain the digital fingerprint of each slice when the judgment result of the first judging module 20 is "No".
When the digital fingerprint of the file to be written does not exist in the global file digital fingerprint list, no file in the original file system has the same fingerprint as the file to be written, which shows that no identical file exists in the original file system. In this case, this embodiment further slices the file to be written in a preset manner. It should be understood that, as technology and information develop, files tend to diversify, and many different versions can be derived from the same file. Taking operating systems as an example, the well-known Windows 10 operating system is divided into 32-bit and 64-bit systems, and further into multiple editions such as Home, Enterprise and Professional. Part of the file content of these different versions is identical, and each image file is about 4 GB in size. If only the file-level digital fingerprint is checked, the six versions above count as different files, and storing their image files takes about 24 GB in total, most of which is occupied by duplicate data.
In this embodiment, each file is further sliced in a preset manner, dividing one complete file into several small slice files. After the file has been sliced, the digital fingerprint of each slice is obtained with a preset algorithm or function.
A second judging module 50, configured to judge, after the file to be written has been sliced in a preset manner and the digital fingerprint of each slice has been obtained, whether the digital fingerprint of the slice exists in the global file slice digital fingerprint list.
After the digital fingerprint of each slice has been obtained, it is judged whether the global file slice digital fingerprint list contains a fingerprint identical to that of the slice. If it does, an identical slice exists; otherwise, no identical slice exists.
A second recording module 60, configured to record the metadata information of the slice in a storage node when the judgment result of the second judging module 50 is "Yes".
When an identical slice exists in the distributed file system, the metadata information is recorded in the corresponding storage node. When the file to be written is later requested, the metadata of the file is retrieved to obtain the information about its slices, and the metadata information of each slice is then retrieved to obtain the storage information of the slice. The original file is restored from these slices, which completes the retrieval of the original file.
A sending module 70, configured to send the slice and the digital fingerprint of the slice to the corresponding storage node when the judgment result of the second judging module 50 is "No".
When no identical slice exists in the distributed file system, the slice and the digital fingerprint of the slice are sent to the corresponding storage node and stored there.
In this embodiment, if the image files of the six operating system versions above need to be stored, then after slicing, the slices made up of the content shared by most of the files need to be stored only once, and only the slices that differ need to be stored separately. Less than 5 GB of storage space is then enough to store files that would originally occupy 24 GB.
In this embodiment, the digital fingerprint of the file is checked first; files with differing fingerprints are then sliced, and the digital fingerprints of the slices are checked in turn. This effectively avoids storing duplicate data more than once and saves storage space. Data transmission time and data storage cost are reduced accordingly.
Further, referring to Fig. 5, a second embodiment of the device of the present invention is proposed on the basis of the above embodiment. The device further comprises:
A third judging module 80, configured to judge whether the digital fingerprint of the slice exists in the slice digital fingerprint list of the current storage node.
A confirmation module 90, configured to confirm that the slice has been written successfully when the judgment result of the third judging module 80 is "Yes".
A writing module 100, configured to write the slice and record the digital fingerprint of the slice in the slice digital fingerprint list of this storage node when the judgment result of the third judging module 80 is "No".
It should be understood that, in this embodiment, each storage node holds its own slice digital fingerprint list. When a slice is sent to the corresponding storage node, it is judged whether the digital fingerprint of the slice exists in the slice digital fingerprint list of that node. If it does, an identical slice is already stored on the current storage node, and it is only necessary to report to the metadata server that the slice was written successfully. If it does not, the slice is written to disk.
In a third embodiment of the device of the present invention, based on the second embodiment above, the device further comprises:
A second acquisition module, configured to periodically obtain the system load;
An upload module, configured to upload the information in the slice digital fingerprint list of each storage node to the global slice digital fingerprint list when the system load is below a preset value.
While the system is running, an excessive system load affects the storage and transmission speed of data and files. In this embodiment, therefore, the system load is obtained, and further operations are carried out only when the load is below a certain preset value. In this embodiment, when the system load is below the preset value, the information in the slice digital fingerprint list of each storage node is uploaded to the global slice digital fingerprint list.
Further, referring to Fig. 6, in a fourth embodiment of the device of the present invention, based on the above embodiments, the slicing module 40 specifically comprises:
A judging unit 41, configured to judge whether the size of the file to be written exceeds a preset value;
A slicing unit 42, configured to slice the file to be written into slices of a preset size when the judgment result of the judging unit 41 is "Yes";
A determining unit 43, configured to treat the whole file to be written as a single slice when the judgment result of the judging unit 41 is "No".
It should be understood that, for a given file, smaller slices make it easier to find identical slices, but the file is then divided into more pieces, so slicing and computing the slice fingerprints take longer. When the slices are relatively large, there are fewer of them, so the slicing time and the time to compute the slice fingerprints are correspondingly shorter, but a slice is less likely to be identical to an existing slice. The value should be set according to need in actual use. Specific values may be 4 MB, 8 MB, 16 MB, 32 MB and so on, with 64 MB preferred; the value is generally set below 4 TB.
On the basis of the above embodiments, a fifth embodiment of the device of the present invention is proposed, in which the first acquisition module 10 specifically comprises:
An acquiring unit, configured to obtain the MD5 checksum and the SHA value of the file to be written;
A concatenation unit, configured to concatenate the character strings of the MD5 checksum and the SHA value to form the digital fingerprint of the file to be written.
It should be understood that many kinds of digital fingerprint can be used in practice. This embodiment provides a preferred digital fingerprint: the MD5 checksum of the file to be written is obtained and denoted x, the SHA value of the file to be written, more preferably the SHA-1 value, is obtained and denoted y, and the two character strings are concatenated as xy to form the digital fingerprint of the file. Taking the original image file of the official 64-bit Simplified Chinese release of Win10 as an example, the MD5 value of the file is 2F8691F7FE2F569A70418A8633AC63F6, denoted x, and the SHA-1 value is C71D49A6144772F352806201EF564951BE55EDD5, denoted y. Concatenating x and y gives 2F8691F7FE2F569A70418A8633AC63F6C71D49A6144772F352806201EF564951BE55EDD5, which serves as the digital fingerprint used for verification.
The above are only preferred embodiments of the present invention and do not thereby limit the patent scope of the present invention. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present invention, or any direct or indirect application in other related technical fields, is likewise included within the patent protection scope of the present invention.