CN106649556A - Method and device for deleting multiple layered repetitive data based on distributed file system

Info

Publication number
CN106649556A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610984188.7A
Other languages
Chinese (zh)
Inventor
李发明
张勤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Toyou Feiji Electronics Co., Ltd.
Original Assignee
Shenzhen City Rui Bo Storage Technology Co Ltd
Application filed by Shenzhen City Rui Bo Storage Technology Co Ltd
Priority to CN201610984188.7A
Publication of CN106649556A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems
    • G06F16/1824 Distributed file systems implemented using Network-attached Storage [NAS] architecture
    • G06F16/1827 Management specifically adapted to NAS
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/16 File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • G06F16/164 File meta data generation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/174 Redundancy elimination performed by the file system
    • G06F16/1748 De-duplication implemented within the file system, e.g. based on file segments

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a multi-layer data deduplication method based on a distributed file system. The method comprises the steps of: obtaining the digital fingerprint of a file to be written; judging whether the digital fingerprint of the file to be written exists in the global file fingerprint list; if it exists, recording the metadata information of the file to be written; if it does not exist, slicing the file to be written in a preset manner and obtaining the digital fingerprint of each slice; judging whether the digital fingerprint of the slice exists in the global slice fingerprint list; if it exists, recording the metadata information of the slice; if it does not exist, sending the slice and its digital fingerprint to the corresponding storage node. The invention also discloses a multi-layer data deduplication device based on a distributed file system. By storing the digital fingerprints of files and slices, the technical scheme improves the deduplication rate and saves storage space.

Description

Multilayer data de-duplication method and device based on distributed file system
Technical field
The present invention relates to the field of information storage, and more particularly to a multi-layer data deduplication method and device based on a distributed file system.
Background
Existing distributed file systems can apply data deduplication to stored data in order to improve disk utilization and reduce cost. However, as technology and information develop, files become increasingly diverse, and the probability that two files have entirely identical content keeps falling. For example, a developer may make targeted changes to a piece of software to suit particular needs. The modified software then differs only slightly from the original, yet existing deduplication methods achieve a low deduplication rate on such data.
Summary of the invention
The main object of the present invention is to provide a multi-layer data deduplication method and device based on a distributed file system, with the aim of improving the deduplication rate.
To achieve the above object, the present invention provides a multi-layer data deduplication method based on a distributed file system, the method comprising the steps of:
Obtaining the digital fingerprint of a file to be written;
Judging whether the digital fingerprint of the file to be written exists in the global file fingerprint list;
If so, recording the metadata information of the file to be written;
If not, slicing the file to be written in a preset manner and obtaining the digital fingerprint of each slice;
Judging whether the digital fingerprint of the slice exists in the global slice fingerprint list;
If so, recording the metadata information of the slice in a storage node;
If not, sending the slice and its digital fingerprint to the corresponding storage node.
Preferably, after sending the slice and its digital fingerprint to the corresponding storage node, the method further comprises the steps of:
Judging whether the digital fingerprint of the slice exists in the slice fingerprint list of the current storage node;
If so, confirming that the slice was written successfully;
If not, writing the slice and recording its digital fingerprint in the slice fingerprint list of this storage node.
Preferably, after writing the slice to disk and recording its digital fingerprint in the slice fingerprint list of this storage node, the method further comprises the steps of:
Obtaining the system load at regular intervals;
When the system load is below a preset value, uploading the information in the slice fingerprint list of each storage node to the global slice fingerprint list.
Preferably, slicing the file to be written in a preset manner and obtaining the digital fingerprint of each slice specifically comprises the steps of:
Judging whether the size of the file to be written exceeds a preset value;
If so, slicing the file to be written by a preset size;
If not, treating the entire file to be written as a single slice.
Preferably, obtaining the digital fingerprint of the file to be written specifically comprises the steps of:
Obtaining the MD5 check value and the SHA value of the file to be written;
Concatenating the strings of the MD5 check value and the SHA value to form the digital fingerprint of the file to be written.
In addition, to achieve the above object, the present invention also provides a multi-layer data deduplication device based on a distributed file system, comprising:
A first acquisition module, for obtaining the digital fingerprint of the file to be written;
A first judging module, for judging whether the digital fingerprint of the file to be written exists in the global file fingerprint list;
A first recording module, for recording the metadata information of the file to be written when the judgment result of the first judging module is "yes";
A slicing module, for slicing the file to be written in a preset manner and obtaining the digital fingerprint of each slice when the judgment result of the first judging module is "no";
A second judging module, for judging whether the digital fingerprint of the slice exists in the global slice fingerprint list;
A second recording module, for recording the metadata information of the slice in a storage node when the judgment result of the second judging module is "yes";
A sending module, for sending the slice and its digital fingerprint to the corresponding storage node when the judgment result of the second judging module is "no".
Preferably, the device further comprises:
A third judging module, for judging whether the digital fingerprint of the slice exists in the slice fingerprint list of the current storage node;
A confirmation module, for confirming that the slice was written successfully when the third judging module judges "yes";
A writing module, for writing the slice and recording its digital fingerprint in the slice fingerprint list of this storage node when the third judging module judges "no".
Preferably, the device further comprises:
A second acquisition module, for obtaining the system load at regular intervals;
An upload module, for uploading the information in the slice fingerprint list of each storage node to the global slice fingerprint list when the system load is below a preset value.
Preferably, the slicing module specifically comprises:
A judging unit, for judging whether the size of the file to be written exceeds a preset value;
A slicing unit, for slicing the file to be written by a preset size when the judging unit judges "yes";
A determining unit, for treating the entire file to be written as a single slice when the judging unit judges "no".
Preferably, the first acquisition module specifically comprises:
An acquiring unit, for obtaining the MD5 check value and the SHA value of the file to be written;
A concatenation unit, for concatenating the strings of the MD5 check value and the SHA value to form the digital fingerprint of the file to be written.
Embodiments of the invention comprise the following steps: obtaining the digital fingerprint of a file to be written; judging whether the digital fingerprint of the file to be written exists in the global file fingerprint list; if so, recording the metadata information of the file to be written; if not, slicing the file to be written in a preset manner and obtaining the digital fingerprint of each slice; judging whether the digital fingerprint of the slice exists in the global slice fingerprint list; if so, recording the metadata information of the slice in a storage node; if not, sending the slice and its digital fingerprint to the corresponding storage node. By storing the digital fingerprints of files and slices, the technical scheme of the present invention improves the deduplication rate and saves storage space.
Brief description of the drawings
Fig. 1 is a schematic flowchart of the first embodiment of the method of the present invention;
Fig. 2 is a schematic flowchart of the second embodiment of the method of the present invention;
Fig. 3 is a schematic flowchart of the third embodiment of the method of the present invention;
Fig. 4 is a functional block diagram of the first embodiment of the device of the present invention;
Fig. 5 is a functional block diagram of the second embodiment of the device of the present invention;
Fig. 6 is a detailed functional block diagram of the slicing module in the fourth embodiment of the device of the present invention.
The realization of the object, the functional characteristics, and the advantages of the present invention will be further described with reference to the drawings and in conjunction with the embodiments.
Detailed description
It should be understood that the specific embodiments described here serve only to explain the present invention and are not intended to limit it.
The present invention provides a multi-layer data deduplication method based on a distributed file system.
A distributed system is a software system built on a network, characterized by high cohesion and transparency. Cohesion means that each database node is highly autonomous and has its own local database management system. Transparency means that each database node is transparent to user applications, which cannot tell whether a node is local or remote. In a distributed database system the user is unaware that data is distributed: the user does not need to know whether a relation is partitioned, whether replicas exist, at which site the data is stored, or at which site a transaction executes. A set of independent computers is presented to the user as a unified whole, as if it were a single system. The system holds many general-purpose physical and logical resources and can assign tasks dynamically; the scattered physical and logical resources exchange information over a computer network. The most typical distributed system is the World Wide Web.
The first embodiment of the method of the present application is now proposed. As shown in Fig. 1, the method comprises the steps of:
S100: obtain the digital fingerprint of the file to be written.
A digital fingerprint is a digital code generated from the content of a file that is, for practical purposes, unique to that content. Common digital fingerprints include MD5 (Message Digest Algorithm 5) and SHA-1 (Secure Hash Algorithm 1). Each file yields its digital fingerprint through a preset function or algorithm. Because of the properties of these functions and algorithms, even a slight difference between two files produces very different fingerprints, so comparing digital fingerprints is a reliable basis for judging whether two files are identical.
In this embodiment, when a file write request is received from a client, the digital fingerprint of the file to be written is obtained first.
S200: judge whether the digital fingerprint of the file to be written exists in the global file fingerprint list; if so, perform step S210; if not, perform step S220.
Further, after the digital fingerprint of the file to be written is obtained, it is judged whether a corresponding digital fingerprint exists in the global file fingerprint list. The global file fingerprint list here is the list that stores the digital fingerprints of all complete files in the distributed file system. If the fingerprint of the file to be written exists in the global file fingerprint list, the existing file system already holds a file whose fingerprint equals that of the file to be written; by the uniqueness of digital fingerprints, a file identical to the file to be written already exists, and step S210 is performed. Conversely, if the fingerprint of the file to be written does not exist in the list, the existing file system holds no file with the same fingerprint, which shows that no identical file exists, and step S220 is performed.
S210: record the metadata information of the file to be written.
In this embodiment, the metadata information of the file to be written is recorded in the metadata server. When the existing file system already holds a file whose fingerprint equals that of the file to be written, uploading the file again would plainly occupy space twice. This embodiment therefore does not upload the file; it records the metadata of the file to be written directly in the metadata server, which deduplicates the repeatedly uploaded file. As is commonly understood, metadata, also called intermediary data or relay data, is information that mainly describes the properties of data and supports functions such as indicating storage location, historical data, resource lookup, and file records. Metadata can be regarded as a kind of electronic catalog: to serve as a catalog it must describe and collect the content or characteristics of the data, and thereby assist data retrieval. In short, metadata is data about data. In this embodiment the metadata mainly comprises the path and resource information of the existing file in the system that is identical to the file to be written. Once the metadata of the file to be written has been recorded in the metadata server, a later request for that file is served by using the recorded information to fetch the identical existing file. Recording metadata in place of the direct file upload used in the prior art effectively saves storage space and data exchange time.
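A minimal sketch of step S210 follows, assuming the metadata server and the global file fingerprint list can be modeled as in-memory dictionaries. The names `FileMetadata`, `record_duplicate`, `metadata_server`, and `file_index` are hypothetical and do not come from the patent; the sketch only illustrates recording a reference to the existing file in place of uploading the data again.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class FileMetadata:
    path: str           # path requested by the client (hypothetical field)
    fingerprint: str    # digital fingerprint of the file to be written
    existing_path: str  # path of the identical file already in the system


def record_duplicate(
    metadata_server: Dict[str, FileMetadata],
    file_index: Dict[str, str],
    path: str,
    file_fp: str,
) -> bool:
    """Record metadata only when the fingerprint is already known.

    file_index stands in for the global file fingerprint list and maps a
    fingerprint to the path of the file that is already stored.
    Returns True if the write was served by a metadata record (S210),
    False if the caller has to fall through to slicing (S220).
    """
    existing_path = file_index.get(file_fp)
    if existing_path is None:
        return False
    metadata_server[path] = FileMetadata(path, file_fp, existing_path)
    return True
```

No file content is passed to `record_duplicate` at all, which reflects the point made above: on a file-level hit only metadata is exchanged.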
S220: slice the file to be written in a preset manner, obtain the digital fingerprint of each slice, and perform step S300.
When the fingerprint of the file to be written does not exist in the global file fingerprint list, the existing file system holds no file with the same fingerprint, which shows that no identical file exists. In that case this embodiment goes on to slice the file to be written in a preset manner. It should be understood that, as technology and information develop, files become increasingly diverse, and many different versions can be derived from the same file. Take operating systems as an example: the well-known Windows 10 operating system is divided into 32-bit and 64-bit systems, and further into editions such as Home, Enterprise, and Professional. Part of the file content of these different versions is identical, and each image file is about 4 GB. If only the file-level digital fingerprint is checked, the 6 versions above count as different files, and storing their image files takes about 24 GB in total, most of which is occupied by duplicate data.
In this embodiment, each file is further sliced in a preset manner, dividing one complete file into several small slice files. After slicing, the digital fingerprint of each slice is obtained by a preset algorithm or function, and step S300 is then performed.
S300: judge whether the digital fingerprint of the slice exists in the global slice fingerprint list; if so, perform step S310; if not, perform step S320.
After the digital fingerprint of each slice is obtained, it is judged whether a fingerprint identical to that of the slice exists in the global slice fingerprint list. If it does, an identical slice exists; otherwise no identical slice exists. When an identical slice exists, step S310 is performed; when no identical slice exists, step S320 is performed.
S310: record the metadata information of the slice in a storage node.
When an identical slice exists in the distributed file system, the metadata information is recorded in the corresponding storage node. When the file to be written is later requested, the metadata of the file is read to obtain information about its slices, the metadata of each slice is then read to obtain where the slice is stored, and the original file is reassembled from the slices, which serves the request.
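The read path described above can be sketched as follows. This is an illustration under stated assumptions, not the patent's implementation: `file_metadata` (file path to ordered list of slice fingerprints) and `slice_store` (slice fingerprint to slice bytes, standing in for the storage nodes) are hypothetical in-memory structures, and `restore_file` is a hypothetical name.

```python
from typing import Dict, List


def restore_file(
    path: str,
    file_metadata: Dict[str, List[str]],
    slice_store: Dict[str, bytes],
) -> bytes:
    """Rebuild a file from its slice metadata.

    1. Look up the file's metadata to get its slice fingerprints in order.
    2. Resolve each fingerprint to the stored slice.
    3. Concatenate the slices to recover the original content.
    """
    slice_fps = file_metadata[path]
    return b"".join(slice_store[fp] for fp in slice_fps)


# Two files share the slice "fp-a"; it is stored once and used twice.
slice_store = {"fp-a": b"COMMON--", "fp-b": b"home", "fp-c": b"pro"}
file_metadata = {
    "home.img": ["fp-a", "fp-b"],
    "pro.img": ["fp-a", "fp-c"],
}
print(restore_file("pro.img", file_metadata, slice_store))  # b'COMMON--pro'
```

The order of the fingerprints in the file metadata matters, since it is the only record of how the slices fit together.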
S320: send the slice and its digital fingerprint to the corresponding storage node.
When no identical slice exists in the distributed file system, the slice and its digital fingerprint are sent to the corresponding storage node and stored.
In this embodiment, if the image files of the 6 operating system versions above need to be stored, then after slicing, the slices that make up the largely identical content need to be stored only once, while the slices that make up the differences are stored separately. Less than 5 GB of storage then suffices for files that would otherwise occupy 24 GB.
In this embodiment, the digital fingerprint of the file is checked first; files with differing fingerprints are then sliced and the fingerprints of the slices are checked again. This effectively avoids storing duplicate data multiple times and saves storage space. Data transmission time and storage cost are reduced correspondingly.
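The two-level check of steps S100 to S320 can be sketched in a few lines. The sketch assumes a single process with in-memory dictionaries standing in for the metadata server, the two global fingerprint lists, and the storage nodes; the class `DedupStore`, its attributes, and the tiny slice size are hypothetical choices made for illustration.

```python
import hashlib
from typing import Dict, List


def fingerprint(data: bytes) -> str:
    # MD5 and SHA-1 hex digests concatenated, as in the preferred fingerprint.
    return hashlib.md5(data).hexdigest() + hashlib.sha1(data).hexdigest()


class DedupStore:
    def __init__(self, slice_size: int) -> None:
        self.slice_size = slice_size
        self.file_index: Dict[str, List[str]] = {}  # global file fingerprint list
        self.slice_index: Dict[str, bytes] = {}     # global slice list plus slice data
        self.metadata: Dict[str, str] = {}          # path -> file fingerprint
        self.slices_stored = 0

    def write(self, path: str, data: bytes) -> None:
        file_fp = fingerprint(data)                     # S100
        if file_fp not in self.file_index:              # S200: no identical file
            n = self.slice_size
            slices = [data[i:i + n] for i in range(0, len(data), n)] or [data]
            slice_fps = []
            for piece in slices:                        # S220
                fp = fingerprint(piece)
                if fp not in self.slice_index:          # S300: new slice
                    self.slice_index[fp] = piece        # S320: store it
                    self.slices_stored += 1
                slice_fps.append(fp)                    # S310: reference only
            self.file_index[file_fp] = slice_fps
        self.metadata[path] = file_fp                   # S210: metadata record


store = DedupStore(slice_size=4)
store.write("a.img", b"AAAABBBBCCCC")  # stores 3 slices
store.write("b.img", b"AAAABBBBDDDD")  # shares 2 slices, stores 1 new one
store.write("c.img", b"AAAABBBBCCCC")  # identical file, stores nothing
print(store.slices_stored)  # 4
```

Three files totalling nine slices are held in four stored slices here, which mirrors on a small scale the effect described for the operating system images.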
Further, referring to Fig. 2, a second embodiment of the method of the present invention is proposed on the basis of the above embodiment. After step S320, the method further comprises the steps of:
S400: judge whether the digital fingerprint of the slice exists in the slice fingerprint list of the current storage node; if so, perform step S410; if not, perform step S420.
S410: confirm that the slice was written successfully.
S420: write the slice, and record its digital fingerprint in the slice fingerprint list of this storage node.
It should be understood that in this embodiment each storage node holds its own slice fingerprint list. When a slice is sent to the corresponding storage node, the node judges whether the fingerprint of the slice exists in its slice fingerprint list. If it does, an identical slice is already stored on the node, and the node only needs to report to the metadata server that the slice was written successfully. If it does not, the slice is written to disk.
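The node-side check of steps S400 to S420 might look like the sketch below. The class `StorageNode` and its attributes are hypothetical, and a dictionary stands in for the disk. The local check plausibly matters when the global slice list lags behind the per-node lists, which is what the deferred upload of the third embodiment suggests; that reading is an inference, not something the source states.

```python
from typing import Dict, Set


class StorageNode:
    def __init__(self) -> None:
        self.local_fingerprints: Set[str] = set()  # this node's slice fingerprint list
        self.disk: Dict[str, bytes] = {}           # stand-in for on-disk slice storage
        self.disk_writes = 0

    def receive(self, slice_fp: str, slice_data: bytes) -> bool:
        """Handle a slice sent in step S320; returns True as the success report."""
        if slice_fp not in self.local_fingerprints:  # S400
            self.disk[slice_fp] = slice_data         # S420: write the slice
            self.local_fingerprints.add(slice_fp)    # S420: record the fingerprint
            self.disk_writes += 1
        return True                                  # S410: confirm success


node = StorageNode()
node.receive("fp-1", b"slice bytes")
node.receive("fp-1", b"slice bytes")  # second arrival is acknowledged without a write
print(node.disk_writes)  # 1
```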
As shown in Fig. 3, in a third embodiment of the method of the present invention, based on the second embodiment above, the method further comprises the following steps after step S420:
S500: obtain the system load at regular intervals.
S600: when the system load is below a preset value, upload the information in the slice fingerprint list of each storage node to the global slice fingerprint list.
Clearly, if a running system carries an excessive load, the storage and transmission speed of data and files suffers. This embodiment therefore obtains the system load and proceeds only when the load is below a certain preset value: at that point the information in the slice fingerprint list of each storage node is uploaded to the global slice fingerprint list.
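Steps S500 and S600 can be sketched as a single function that a timer or scheduler would call at regular intervals. How the load is measured is not specified in the source, so the sketch takes a caller-supplied `get_load` function; that parameter, `load_threshold`, and the function name `sync_if_idle` are all hypothetical.

```python
from typing import Callable, Iterable, Set


def sync_if_idle(
    get_load: Callable[[], float],
    load_threshold: float,
    node_lists: Iterable[Set[str]],
    global_slice_list: Set[str],
) -> bool:
    """Merge node-local slice fingerprints into the global list when load is low.

    Returns True if the upload ran, False if it was skipped because the
    system load was at or above the preset value.
    """
    if get_load() >= load_threshold:       # S500: sample the load
        return False
    for local_list in node_lists:          # S600: upload each node's list
        global_slice_list.update(local_list)
    return True


global_list: Set[str] = {"fp-1"}
nodes = [{"fp-1", "fp-2"}, {"fp-3"}]
sync_if_idle(lambda: 0.9, 0.5, nodes, global_list)  # busy: nothing uploaded
sync_if_idle(lambda: 0.1, 0.5, nodes, global_list)  # idle: lists merged
print(sorted(global_list))  # ['fp-1', 'fp-2', 'fp-3']
```

Passing the load source in as a function keeps the sketch independent of any particular operating system interface.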
In a fourth embodiment of the method of the present invention, based on the above embodiments, step S220 specifically comprises:
S221: judge whether the size of the file to be written exceeds a preset value; if so, perform step S222; if not, perform step S223.
S222: slice the file to be written by a preset size.
S223: treat the entire file to be written as a single slice.
It should be understood that, for a given file, a smaller slice size makes it easier to find identical slices, but the file is divided into more pieces, so slicing and computing the slice fingerprints takes longer. A larger slice size means fewer slices, so slicing and fingerprinting take less time, but a slice is less likely to match an existing slice. The slice size should be set according to need in actual use. Specific values can be 4 MB, 8 MB, 16 MB, 32 MB, and so on, with 64 MB preferred; the value is generally set below 4 TB.
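Steps S221 to S223 amount to fixed-size slicing with a whole-file fallback. The sketch below assumes that the size threshold of S221 and the slice size of S222 are the same preset value, which the source does not state; `slice_file` and `slice_size` are hypothetical names.

```python
from typing import List


def slice_file(data: bytes, slice_size: int) -> List[bytes]:
    """Split data into fixed-size slices; small files become one slice."""
    if slice_size <= 0:
        raise ValueError("slice_size must be positive")
    if len(data) <= slice_size:           # S221 answered "no"
        return [data]                     # S223: the whole file is one slice
    return [                              # S222: slice by the preset size
        data[offset:offset + slice_size]
        for offset in range(0, len(data), slice_size)
    ]


print(slice_file(b"abcdefgh", 3))  # [b'abc', b'def', b'gh']
print(slice_file(b"ab", 3))        # [b'ab']
```

The last slice may be shorter than `slice_size`; joining the slices in order always reproduces the original bytes.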
Based on the above embodiments, a fifth embodiment of the method of the present invention is proposed, in which step S100 specifically comprises:
S110: obtain the MD5 check value and the SHA value of the file to be written.
S120: concatenate the strings of the MD5 check value and the SHA value to form the digital fingerprint of the file to be written.
It should be understood that many kinds of digital fingerprint can be used in practice. This embodiment provides a preferred one: the MD5 check value of the file to be written is obtained and denoted x, the SHA value of the file, more preferably the SHA-1 value, is obtained and denoted y, and the two strings are concatenated into xy as the digital fingerprint of the file. Take the 64-bit Simplified Chinese original image file of the official release of Win10 as an example. The MD5 value of this file, denoted x, is 2F8691F7FE2F569A70418A8633AC63F6, and the SHA-1 value, denoted y, is C71D49A6144772F352806201EF564951BE55EDD5. Concatenating x and y gives 2F8691F7FE2F569A70418A8633AC63F6C71D49A6144772F352806201EF564951BE55EDD5 as the digital fingerprint used for verification.
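The concatenated fingerprint of steps S110 and S120 can be computed with Python's standard `hashlib` module, as in the sketch below. The function name `file_fingerprint` is hypothetical, and the uppercase hex form is chosen only to match the example values quoted above; a production version would likely hash a large file in chunks instead of holding it in memory.

```python
import hashlib


def file_fingerprint(data: bytes) -> str:
    """Return x + y, where x is the MD5 hex digest and y the SHA-1 hex digest."""
    x = hashlib.md5(data).hexdigest().upper()   # S110: MD5 check value
    y = hashlib.sha1(data).hexdigest().upper()  # S110: SHA-1 value
    return x + y                                # S120: string concatenation


fp = file_fingerprint(b"example file content")
print(len(fp))  # 72 = 32 MD5 hex characters + 40 SHA-1 hex characters
```

Combining two digests means two different files would have to collide under both algorithms at once to share a fingerprint.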
In addition, to achieve the above object, the present invention also provides a multi-layer data deduplication device based on a distributed file system. Referring to Fig. 4, the device comprises:
A first acquisition module 10, for obtaining the digital fingerprint of the file to be written.
A digital fingerprint is a digital code generated from the content of a file that is, for practical purposes, unique to that content. Common digital fingerprints include MD5 (Message Digest Algorithm 5) and SHA-1 (Secure Hash Algorithm 1). Each file yields its digital fingerprint through a preset function or algorithm. Because of the properties of these functions and algorithms, even a slight difference between two files produces very different fingerprints, so comparing digital fingerprints is a reliable basis for judging whether two files are identical.
In this embodiment, when a file write request is received from a client, the digital fingerprint of the file to be written is obtained first.
A first judging module 20, for judging whether the digital fingerprint of the file to be written exists in the global file fingerprint list.
Further, after the digital fingerprint of the file to be written is obtained, it is judged whether a corresponding digital fingerprint exists in the global file fingerprint list. The global file fingerprint list here is the list that stores the digital fingerprints of all complete files in the distributed file system. If the fingerprint of the file to be written exists in the global file fingerprint list, the existing file system already holds a file whose fingerprint equals that of the file to be written; by the uniqueness of digital fingerprints, a file identical to the file to be written already exists, and step S210 is performed. Conversely, if the fingerprint of the file to be written does not exist in the list, the existing file system holds no file with the same fingerprint, which shows that no identical file exists, and step S220 is performed.
A first recording module 30, for recording the metadata information of the file to be written when the judgment result of the first judging module 20 is "yes".
In this embodiment, the metadata information of the file to be written is recorded in the metadata server. When the existing file system already holds a file whose fingerprint equals that of the file to be written, uploading the file again would plainly occupy space twice. This embodiment therefore does not upload the file; it records the metadata of the file to be written directly in the metadata server, which deduplicates the repeatedly uploaded file. As is commonly understood, metadata, also called intermediary data or relay data, is information that mainly describes the properties of data and supports functions such as indicating storage location, historical data, resource lookup, and file records. Metadata can be regarded as a kind of electronic catalog: to serve as a catalog it must describe and collect the content or characteristics of the data, and thereby assist data retrieval. In short, metadata is data about data. In this embodiment the metadata mainly comprises the path and resource information of the existing file in the system that is identical to the file to be written. Once the metadata of the file to be written has been recorded in the metadata server, a later request for that file is served by using the recorded information to fetch the identical existing file. Recording metadata in place of the direct file upload used in the prior art effectively saves storage space and data exchange time.
Slicing module 40, configured to slice the file to be written in a preset manner and obtain the digital fingerprint of each slice when the judgment result of the first judging module 20 is "No".
When the digital fingerprint of the file to be written is not present in the global file digital fingerprint list, the original file system contains no file whose digital fingerprint matches that of the file to be written, which shows that no identical file exists in the original file system. In this case, this embodiment further slices the file to be written in a preset manner. It should be understood that, as technology and information develop, files are becoming increasingly diverse, and many different versions can be derived from the same file. Take operating systems as an example: the well-known Windows 10 operating system is divided into 32-bit and 64-bit systems, each of which is further divided into editions such as Home, Enterprise, and Professional. Part of the file content of these editions is identical, and each image file is about 4 GB. If only the file-level digital fingerprint were checked, the 6 different versions above would count as different files; storing these image files would take about 24 GB in total, most of which would be occupied by duplicate data.
In this embodiment, each file is further sliced in a preset manner, dividing a complete file into several small slice files. After slicing is complete, the digital fingerprint of each slice is obtained according to a preset algorithm or function.
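As a rough illustration of slicing followed by per-slice fingerprinting, the sketch below reads a binary stream in fixed-size pieces and hashes each piece. The function name `fingerprint_slices` is hypothetical, and both the fixed-size slicing and the use of SHA-1 are assumptions; the text only requires "a preset manner" and "a preset algorithm or function".

```python
import hashlib
import io


def fingerprint_slices(stream, slice_size):
    """Yield (index, fingerprint, length) for each fixed-size slice of a binary stream."""
    if slice_size <= 0:
        raise ValueError("slice_size must be positive")
    index = 0
    while True:
        chunk = stream.read(slice_size)
        if not chunk:
            break  # end of stream
        yield index, hashlib.sha1(chunk).hexdigest(), len(chunk)
        index += 1


# Example: a 10-byte stream cut into slices of at most 4 bytes.
for index, fingerprint, length in fingerprint_slices(io.BytesIO(b"a" * 10), 4):
    print(index, length, fingerprint)
```

Slices with identical content produce identical fingerprints, which is what the later slice-level comparison relies on.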
Second judging module 50, configured to judge, after the file to be written has been sliced in the preset manner and the digital fingerprint of each slice has been obtained, whether the digital fingerprint of the slice exists in the global file slice digital fingerprint list.
After the digital fingerprint of each slice is obtained, it is judged whether the global file slice digital fingerprint list contains a digital fingerprint identical to that of the slice. If it does, an identical slice already exists; otherwise no identical slice exists.
Second recording module 60, configured to record the metadata information of the slice in the storage node when the judgment result of the second judging module 50 is "Yes".
When an identical slice exists in the distributed file system, the metadata information is recorded in the corresponding storage node. When the file to be written needs to be called again, the metadata of the file to be written is called to obtain the information about its slices, and the metadata information of each slice is then called to obtain the storage information of the slice, so that the original file to be written is restored and the call is fulfilled.
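The restore path described above might look like the following sketch. The record layout (`slice_fingerprints`), the `slice_locations` mapping, and the `nodes` dictionary are hypothetical structures chosen for illustration; the source does not specify how slice metadata is laid out.

```python
def restore_file(file_record, slice_locations, nodes):
    """Rebuild a file from its ordered slice fingerprints.

    file_record:     {"slice_fingerprints": [fp, ...]} in original order
    slice_locations: fingerprint -> id of the storage node holding the slice
    nodes:           node id -> {fingerprint: slice bytes}
    """
    parts = []
    for fingerprint in file_record["slice_fingerprints"]:
        node_id = slice_locations[fingerprint]      # slice metadata lookup
        parts.append(nodes[node_id][fingerprint])   # fetch the stored slice
    return b"".join(parts)
```

A slice that occurs several times in a file is stored once and simply referenced several times in the ordered fingerprint list.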
Sending module 70, configured to send the slice and the digital fingerprint of the slice to the corresponding storage node when the judgment result of the second judging module 50 is "No".
When no identical slice exists in the distributed file system, the slice and its digital fingerprint are sent to the corresponding storage node and stored.
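The two slice-level outcomes (record metadata, or send the slice) can be sketched as a planning step. Everything named here is hypothetical: `plan_slice_writes`, the action labels, and in particular the rule for choosing the "corresponding" storage node, which the source does not define. The sketch assumes the node is picked from the fingerprint value modulo the node count.

```python
def plan_slice_writes(slice_fingerprints, global_slice_index, node_count):
    """Decide, per slice, whether to record metadata only or to send the slice.

    global_slice_index: fingerprint (hex string) -> id of the node already holding it
    """
    plan = []
    for fingerprint in slice_fingerprints:
        if fingerprint in global_slice_index:
            # Identical slice already stored: only metadata is recorded.
            plan.append(("record_metadata", fingerprint, global_slice_index[fingerprint]))
        else:
            # Assumed placement rule: fingerprint value modulo the number of nodes.
            node_id = int(fingerprint, 16) % node_count
            plan.append(("send_slice", fingerprint, node_id))
    return plan
```

Deriving the node from the fingerprint has the side effect that identical slices always land on the same node, which is what makes a per-node fingerprint list useful as a second check.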
In this embodiment, if the image files of the 6 operating system versions above need to be stored, then after slicing, the slices made up of the largely identical content need to be stored only once, and only the slices that contain differing content need to be stored separately. Less than 5 GB of storage space is then enough to store files that would otherwise occupy 24 GB.
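The arithmetic behind this estimate can be made explicit. The source gives only the totals (about 4 GB per image, 24 GB without deduplication, under 5 GB with it); the split between shared and version-specific content used below is an assumed value chosen purely for illustration, as is the simplification that the shared content is common to all 6 images.

```python
def stored_size_mb(image_mb, shared_mb, image_count):
    """Return (size without deduplication, size with slice-level deduplication) in MB."""
    unique_mb = image_mb - shared_mb                   # content specific to one image
    without_dedup = image_mb * image_count             # every image stored in full
    with_dedup = shared_mb + unique_mb * image_count   # shared slices stored once
    return without_dedup, with_dedup


# Assumed split: 3946 MB of each 4096 MB image is common to all 6 images.
raw_mb, deduped_mb = stored_size_mb(image_mb=4096, shared_mb=3946, image_count=6)
print(raw_mb / 1024, deduped_mb / 1024)  # sizes in GB
```

Under these assumed numbers the deduplicated total stays below 5 GB; a smaller shared portion would raise it accordingly.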
In this embodiment, the digital fingerprint of the file is checked first; a file with a different digital fingerprint is then sliced, and the digital fingerprint of each slice is checked in turn. This effectively avoids storing duplicate data more than once and saves storage space. Data transmission time and data storage cost are reduced accordingly.
Further, referring to Fig. 5, a second embodiment of the device of the present invention is proposed on the basis of the above embodiment. The device further includes:
Third judging module 80, configured to judge whether the digital fingerprint of the slice exists in the slice digital fingerprint list of the current storage node.
Confirming module 90, configured to confirm that the slice was written successfully when the judgment result of the third judging module 80 is "Yes".
Writing module 100, configured to write the slice and record the digital fingerprint of the slice in the slice digital fingerprint list of this storage node when the judgment result of the third judging module 80 is "No".
It should be understood that, in this embodiment, each storage node is provided with its own slice digital fingerprint list. When a slice is sent to the corresponding storage node, the node judges whether the digital fingerprint of the slice exists in its slice digital fingerprint list. If it does, an identical slice is already stored on the current storage node, and the node only needs to report to the metadata server that the slice was written successfully. If it does not, the slice is written to disk.
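A minimal sketch of this node-local check is shown below. The class `StorageNode`, its attributes, and the boolean return value are hypothetical; a dictionary stands in for the disk. In both branches the write is acknowledged as successful, and the return value only reports whether anything was physically written.

```python
class StorageNode:
    """Hypothetical storage node with its own slice fingerprint list."""

    def __init__(self):
        self.slice_fingerprints = set()  # this node's slice digital fingerprint list
        self.disk = {}                   # stand-in for on-disk slice storage
        self.disk_writes = 0             # number of physical writes performed

    def receive_slice(self, fingerprint, data):
        """Store the slice unless this node already holds it.

        Returns True if the slice was physically written, False if it was a duplicate.
        """
        if fingerprint in self.slice_fingerprints:
            return False  # duplicate: acknowledge success without touching the disk
        self.disk[fingerprint] = data
        self.disk_writes += 1
        self.slice_fingerprints.add(fingerprint)  # record the new fingerprint locally
        return True
```

This second check catches duplicates that the global slice list does not yet know about, for example slices written since the last upload of the node's list.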
In a third embodiment of the device of the present invention, based on the above second embodiment, the device further includes:
Second acquiring module, configured to periodically acquire the system load;
Uploading module, configured to upload the information in the slice digital fingerprint list of each storage node to the global slice digital fingerprint list when the system load is below a preset value.
While the system is running, operations that consume excessive system load would slow the storage and transmission of data and files. This embodiment therefore further acquires the system load and performs the further operation only when the load is below a certain preset value: when the system load is below the preset value, the information in the slice digital fingerprint list of each storage node is uploaded to the global slice digital fingerprint list.
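The load-gated upload can be sketched as a function that a periodic scheduler would call. The name `sync_if_idle` and its parameters are hypothetical, and the load value is passed in rather than measured, since the source does not say how system load is obtained or what the preset value is.

```python
def sync_if_idle(current_load, load_threshold, node_lists, global_slice_index):
    """Merge per-node slice fingerprint lists into the global list when load is low.

    node_lists:         node id -> iterable of fingerprints stored on that node
    global_slice_index: fingerprint -> node id (updated in place)
    Returns True if the upload ran, False if it was skipped because of load.
    """
    if current_load >= load_threshold:
        return False  # system busy: postpone the upload
    for node_id, fingerprints in node_lists.items():
        for fingerprint in fingerprints:
            # Keep the first known location of a fingerprint.
            global_slice_index.setdefault(fingerprint, node_id)
    return True
```

Deferring the merge keeps fingerprint bookkeeping off the write path; the cost is that the global list can lag behind the nodes until the next idle period.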
Further, referring to Fig. 6, in a fourth embodiment of the device of the present invention, based on the above embodiments, the slicing module 40 specifically includes:
Judging unit 41, configured to judge whether the size of the file to be written is greater than a preset value;
Slicing unit 42, configured to slice the file to be written by a preset size when the judgment result of the judging unit 41 is "Yes";
Determining unit 43, configured to treat the whole file to be written as one slice when the judgment result of the judging unit 41 is "No".
It should be understood that, for a given file, smaller slices make it easier to find identical slices, but the file is then divided into more pieces, so slicing and fingerprinting take longer. With larger slices there are fewer of them, so slicing and fingerprinting take less time, but the likelihood that a slice matches an existing slice falls. The slice size should be set according to actual needs; specific values may be 4 MB, 8 MB, 16 MB, 32 MB, and so on, with 64 MB preferred, and the setting is generally kept below 4 TB.
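The size test performed by the judging, slicing, and determining units can be sketched as one small function. The name `slice_file` is hypothetical, and treating the size threshold and the slice size as two separate parameters is an assumption, since the source does not say whether the two preset values coincide.

```python
def slice_file(data, size_threshold, slice_size):
    """Cut data into fixed-size slices only if it is larger than size_threshold."""
    if len(data) > size_threshold:
        # Large file: cut into pieces of at most slice_size bytes.
        return [data[i:i + slice_size] for i in range(0, len(data), slice_size)]
    # Small file: the whole file is treated as a single slice.
    return [data]


# Example with the preferred 64 MB value used for both presets.
MB = 1024 * 1024
print(len(slice_file(b"x" * 1000, 64 * MB, 64 * MB)))
```

Skipping the cut for small files avoids slicing overhead where a file-sized slice is already cheap to fingerprint.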
Based on the above embodiments, a fifth embodiment of the device of the present invention is proposed, in which the first acquiring module 10 specifically includes:
Acquiring unit, configured to obtain the MD5 checksum and the SHA value of the file to be written;
Concatenating unit, configured to concatenate the strings of the MD5 checksum and the SHA value as the digital fingerprint of the file to be written.
It should be understood that, in practical use, many kinds of digital fingerprint are possible. This embodiment provides one preferred digital fingerprint: the MD5 checksum of the file to be written is obtained and denoted x, the SHA value of the file to be written, more preferably the SHA-1 value, is obtained and denoted y, and the two strings are concatenated as xy to form the digital fingerprint of the file. Take the original image file of the 64-bit Simplified Chinese edition of the official Win10 release as an example. The MD5 value of this file is 2F8691F7FE2F569A70418A8633AC63F6, denoted x, and the SHA-1 value is C71D49A6144772F352806201EF564951BE55EDD5, denoted y. Concatenating x and y gives 2F8691F7FE2F569A70418A8633AC63F6C71D49A6144772F352806201EF564951BE55EDD5, which serves as the digital fingerprint used for verification.
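The concatenated fingerprint can be computed with Python's standard `hashlib` module, as in the sketch below. The function name `file_fingerprint` and the 1 MiB read size are choices made here for illustration; the uppercase hexadecimal form follows the example values in the text.

```python
import hashlib
import io


def file_fingerprint(stream, block_size=1 << 20):
    """Return the MD5 hex digest followed by the SHA-1 hex digest, in uppercase."""
    md5 = hashlib.md5()
    sha1 = hashlib.sha1()
    # Read in blocks so that multi-gigabyte images need not fit in memory.
    for block in iter(lambda: stream.read(block_size), b""):
        md5.update(block)
        sha1.update(block)
    x = md5.hexdigest()   # 32 hexadecimal characters
    y = sha1.hexdigest()  # 40 hexadecimal characters
    return (x + y).upper()


print(file_fingerprint(io.BytesIO(b"example")))
```

The result is always 72 hexadecimal characters long. Using two independent digests means that two different files would have to collide under both MD5 and SHA-1 at once to be mistaken for each other.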
The above are only preferred embodiments of the present invention and do not thereby limit the patent scope of the present invention. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present invention, or any direct or indirect application in other related technical fields, is likewise included within the patent protection scope of the present invention.

Claims (10)

1. A multi-layer data deduplication method based on a distributed file system, characterized in that the method comprises the following steps:
Obtaining the digital fingerprint of a file to be written;
Judging whether the digital fingerprint of the file to be written exists in a global file digital fingerprint list;
If so, recording the metadata information of the file to be written;
If not, slicing the file to be written in a preset manner and obtaining the digital fingerprint of each slice;
Judging whether the digital fingerprint of the slice exists in a global file slice digital fingerprint list;
If so, recording the metadata information of the slice in a storage node;
If not, sending the slice and the digital fingerprint of the slice to the corresponding storage node.
2. The multi-layer data deduplication method based on a distributed file system according to claim 1, characterized in that, after the slice and the digital fingerprint of the slice are sent to the corresponding storage node, the method further comprises the steps of:
Judging whether the digital fingerprint of the slice exists in the slice digital fingerprint list of the current storage node;
If so, confirming that the slice was written successfully;
If not, writing the slice and recording the digital fingerprint of the slice in the slice digital fingerprint list of this storage node.
3. The multi-layer data deduplication method based on a distributed file system according to claim 2, characterized in that, after the slice is written to disk and the digital fingerprint of the slice is recorded in the slice digital fingerprint list of this storage node, the method further comprises the steps of:
Periodically acquiring the system load;
When the system load is below a preset value, uploading the information in the slice digital fingerprint list of each storage node to the global slice digital fingerprint list.
4. The multi-layer data deduplication method based on a distributed file system according to any one of claims 1-3, characterized in that slicing the file to be written in a preset manner and obtaining the digital fingerprint of each slice specifically comprises the steps of:
Judging whether the size of the file to be written is greater than a preset value;
If so, slicing the file to be written by a preset size;
If not, treating the whole file to be written as one slice.
5. The multi-layer data deduplication method based on a distributed file system according to any one of claims 1-3, characterized in that obtaining the digital fingerprint of the file to be written specifically comprises the steps of:
Obtaining the MD5 checksum and the SHA value of the file to be written;
Concatenating the strings of the MD5 checksum and the SHA value as the digital fingerprint of the file to be written.
6. A multi-layer data deduplication device based on a distributed file system, characterized by comprising:
A first acquiring module, configured to obtain the digital fingerprint of a file to be written;
A first judging module, configured to judge whether the digital fingerprint of the file to be written exists in a global file digital fingerprint list;
A first recording module, configured to record the metadata information of the file to be written when the judgment result of the first judging module is "Yes";
A slicing module, configured to slice the file to be written in a preset manner and obtain the digital fingerprint of each slice when the judgment result of the first judging module is "No";
A second judging module, configured to judge whether the digital fingerprint of the slice exists in a global file slice digital fingerprint list;
A second recording module, configured to record the metadata information of the slice in a storage node when the judgment result of the second judging module is "Yes";
A sending module, configured to send the slice and the digital fingerprint of the slice to the corresponding storage node when the judgment result of the second judging module is "No".
7. The multi-layer data deduplication device based on a distributed file system according to claim 6, characterized by further comprising:
A third judging module, configured to judge whether the digital fingerprint of the slice exists in the slice digital fingerprint list of the current storage node;
A confirming module, configured to confirm that the slice was written successfully when the judgment result of the third judging module is "Yes";
A writing module, configured to write the slice and record the digital fingerprint of the slice in the slice digital fingerprint list of this storage node when the judgment result of the third judging module is "No".
8. The multi-layer data deduplication device based on a distributed file system according to claim 7, characterized by further comprising:
A second acquiring module, configured to periodically acquire the system load;
An uploading module, configured to upload the information in the slice digital fingerprint list of each storage node to the global slice digital fingerprint list when the system load is below a preset value.
9. The multi-layer data deduplication device based on a distributed file system according to any one of claims 6-8, characterized in that the slicing module specifically comprises:
A judging unit, configured to judge whether the size of the file to be written is greater than a preset value;
A slicing unit, configured to slice the file to be written by a preset size when the judgment result of the judging unit is "Yes";
A determining unit, configured to treat the whole file to be written as one slice when the judgment result of the judging unit is "No".
10. The multi-layer data deduplication device based on a distributed file system according to any one of claims 6-8, characterized in that the first acquiring module specifically comprises:
An acquiring unit, configured to obtain the MD5 checksum and the SHA value of the file to be written;
A concatenating unit, configured to concatenate the strings of the MD5 checksum and the SHA value as the digital fingerprint of the file to be written.
CN201610984188.7A 2016-11-08 2016-11-08 Method and device for deleting multiple layered repetitive data based on distributed file system Pending CN106649556A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610984188.7A CN106649556A (en) 2016-11-08 2016-11-08 Method and device for deleting multiple layered repetitive data based on distributed file system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610984188.7A CN106649556A (en) 2016-11-08 2016-11-08 Method and device for deleting multiple layered repetitive data based on distributed file system

Publications (1)

Publication Number Publication Date
CN106649556A true CN106649556A (en) 2017-05-10

Family

ID=58805866

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610984188.7A Pending CN106649556A (en) 2016-11-08 2016-11-08 Method and device for deleting multiple layered repetitive data based on distributed file system

Country Status (1)

Country Link
CN (1) CN106649556A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109947731A (en) * 2017-07-31 2019-06-28 星辰天合(北京)数据科技有限公司 The delet method and device of repeated data
WO2020215580A1 (en) * 2019-04-23 2020-10-29 平安科技(深圳)有限公司 Distributed global data deduplication method and device
WO2023093091A1 (en) * 2021-11-25 2023-06-01 华为技术有限公司 Data storage system, smart network card, and computing node

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101216791A (en) * 2008-01-04 2008-07-09 华中科技大学 File backup method based on fingerprint
CN102833298A (en) * 2011-06-17 2012-12-19 英业达集团(天津)电子技术有限公司 Distributed repeated data deleting system and processing method thereof
CN102915278A (en) * 2012-09-19 2013-02-06 浪潮(北京)电子信息产业有限公司 Data deduplication method
CN103514250A (en) * 2013-06-20 2014-01-15 易乐天 Method and system for deleting global repeating data and storage device
US8677132B1 (en) * 2012-01-06 2014-03-18 Narus, Inc. Document security
CN104408154A (en) * 2014-12-04 2015-03-11 华为技术有限公司 Repeated data deletion method and device
CN104932841A (en) * 2015-06-17 2015-09-23 南京邮电大学 Saving type duplicated data deleting method in cloud storage system
CN105320773A (en) * 2015-11-03 2016-02-10 中国人民解放军理工大学 Distributed duplicated data deleting system and method based on Hadoop platform
CN105912268A (en) * 2016-04-12 2016-08-31 韶关学院 Distributed data deduplocation method and apparatus based on self-matching characteristics

Similar Documents

Publication Publication Date Title
US10158483B1 (en) Systems and methods for efficiently and securely storing data in a distributed data storage system
US9547560B1 (en) Amortized snapshots
US7464247B2 (en) System and method for updating data in a distributed column chunk data store
CN105190573B (en) The reduction redundancy of storing data
US20180150640A1 (en) Policy aware unified file system
JP4648723B2 (en) Method and apparatus for hierarchical storage management based on data value
US5819272A (en) Record tracking in database replication
US7457935B2 (en) Method for a distributed column chunk data store
US20070061542A1 (en) System for a distributed column chunk data store
US11093387B1 (en) Garbage collection based on transmission object models
US20070143359A1 (en) System and method for recovery from failure of a storage server in a distributed column chunk data store
US20120060049A1 (en) System and method for removing a storage server in a distributed column chunk data store
US8768980B2 (en) Process for optimizing file storage systems
CN105556520A (en) Mirroring, in memory, data from disk to improve query performance
US20070162523A1 (en) System and method for storing a data file backup
CN102938784A (en) Method and system used for data storage and used in distributed storage system
CN104281533A (en) Data storage method and device
US7519636B2 (en) Key sequenced clustered I/O in a database management system
CN106649556A (en) Method and device for deleting multiple layered repetitive data based on distributed file system
CN109767274B (en) Method and system for carrying out associated storage on massive invoice data
JP5241298B2 (en) System and method for supporting file search and file operations by indexing historical file names and locations
CN113568582A (en) Data management method and device and storage equipment
CN110008197A (en) A kind of data processing method, system and electronic equipment and storage medium
CN113282540A (en) Cloud object storage synchronization method and device, computer equipment and storage medium
CN114417413A (en) File processing method, device, equipment and medium of block chain file system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20171023

Address after: 518100 Guangdong city of Shenzhen province Nanshan District South Road Fiyta Technology Building Room 1402

Applicant after: Shenzhen Zhongbo Kechuang Information Technology Co., Ltd.

Address before: 518000 Guangdong city of Shenzhen province Qianhai Shenzhen Hong Kong cooperation zone before Bay Road No. 1 building 201 room A (located in Shenzhen Qianhai business secretary Co. Ltd.)

Applicant before: Shenzhen City Rui Bo Storage Technology Co. Ltd.

TA01 Transfer of patent application right

Effective date of registration: 20190902

Address after: 100089 Floor 1-4, No. 2 Building, No. 9 Courtyard, Dijin Road, Haidian District, Beijing

Applicant after: Beijing Toyou Feiji Electronics Co., Ltd.

Address before: 518100 Room 1402, Feiyada Science and Technology Building, Nanshan District, Shenzhen City, Guangdong Province

Applicant before: Shenzhen Zhongbo Kechuang Information Technology Co., Ltd.

RJ01 Rejection of invention patent application after publication

Application publication date: 20170510
