CN105468686A - Method and device for reducing redundant data - Google Patents

Method and device for reducing redundant data

Info

Publication number
CN105468686A
Authority
CN
China
Prior art keywords
file
files
hard disk
hard link
hard
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510789116.2A
Other languages
Chinese (zh)
Inventor
徐鹏捷
陈雄
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qihoo Technology Co Ltd
Qizhi Software Beijing Co Ltd
Original Assignee
Beijing Qihoo Technology Co Ltd
Qizhi Software Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qihoo Technology Co Ltd and Qizhi Software Beijing Co Ltd
Priority to CN201510789116.2A
Publication of CN105468686A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/174 Redundancy elimination performed by the file system
    • G06F16/1748 De-duplication implemented within the file system, e.g. based on file segments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/174 Redundancy elimination performed by the file system
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/1847 File system types specifically adapted to static storage, e.g. adapted to flash memory or SSD
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0608 Saving storage space on storage systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638 Organizing or formatting or addressing of data
    • G06F3/0643 Management of files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0646 Horizontal data movement in storage systems, i.e. moving data in between storage devices or systems
    • G06F3/0652 Erasing, e.g. deleting, data cleaning, moving of data to a wastebasket

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and a device for reducing redundant data. The method comprises the following steps: traversing all files in each partition of a hard disk; judging whether the hard disk contains two or more identical files, wherein, once two or more files are found to have the same size but different file identifiers, whether they are identical is determined from their check values; and creating a hard link to merge the two or more identical files into one file. Because duplicate files are merged into one file through hard links, redundant data on the hard disk is reduced, hard disk space is saved, and the operating efficiency of the hard disk is improved.

Description

Method and device for reducing redundant data
Technical field
The present invention relates to the field of computer technology, and in particular to a method and a device for reducing redundant data.
Background
Under normal circumstances, most users only know how to keep saving files to storage devices such as hard disks, but do not know how to maintain those devices well enough to monitor hard disk capacity effectively. In some cases, a system with a storage device can use software applications to perform file maintenance operations such as searching for, retrieving, editing, or deleting files. Traditionally, files are saved to the hard disk one by one according to the directory structure, which has several drawbacks. First, if a hard disk holds a large number of duplicate files, this way of saving occupies considerable storage space. Second, when the files on the hard disk are backed up, the system spends extra time backing up the duplicates, which lengthens the backup. Finally, when the hard disk holds a large volume of files, a user who wants to remove the duplicates has to check the files one by one, which is very tedious.
Summary of the invention
In view of the above problems, the present invention is proposed to provide a method for reducing redundant data that overcomes, or at least partially solves, those problems, thereby improving the operating efficiency of the hard disk.
According to one aspect of the present invention, a method for reducing redundant data is provided, comprising: traversing all files in each partition of a hard disk; judging whether the hard disk contains two or more identical files, wherein, on the premise that two or more files have been determined to have the same size and different file identifiers, whether the two or more files are identical is determined according to their check values; and merging the two or more identical files into one file by creating a hard link.
Preferably, before judging whether the hard disk contains two or more identical files, the method further comprises: excluding user-level files from the traversed files. Judging whether the hard disk contains two or more identical files then comprises: judging, among the non-user-level files, whether there are two or more identical files.
Preferably, excluding user-level files from the traversed files and judging whether the hard disk contains two or more identical files comprises: selecting files of particular types in advance, and judging whether there are two or more identical files only among files of those types; and/or excluding files newly generated within a preset time period according to their modification and/or creation dates, and judging whether there are two or more identical files among the remaining files.
Preferably, the files of the particular types comprise exe files and/or dll files.
Preferably, the check values of the two or more files are their hash values. After determining from the hash values whether the two or more files are identical, the method further comprises: for two or more files with the same hash value, further performing a binary comparison of partial data of the files; and finally determining, according to the result of the partial data comparison, whether the two or more files are identical.
Preferably, merging the two or more identical files into one file by creating a hard link comprises: in the NTFS file system, merging the two or more identical files into one file as hard links by means of the CreateHardLink instruction.
A device for reducing redundant data, comprising: a traversal unit, configured to traverse all files in each partition of a hard disk; a judging unit, configured to judge whether the hard disk contains two or more identical files, wherein, on the premise that two or more files have been determined to have the same size and different file identifiers, whether the two or more files are identical is determined according to their check values; and a hard link merging unit, configured to merge the two or more identical files into one file by creating a hard link.
Preferably, the device further comprises: a user-level file exclusion unit, configured to exclude user-level files from the traversed files. The judging unit is specifically configured to judge, among the non-user-level files, whether there are two or more identical files.
Preferably, the user-level file exclusion unit is specifically configured to select files of particular types in advance and exclude files of other types, and/or to exclude files newly generated within a preset time period according to the modification and/or creation dates of the files.
Preferably, the files of the particular types comprise exe files and/or dll files.
Preferably, the judging unit is specifically configured to determine, according to the hash values of the two or more files, whether the files are identical. The judging unit is further configured to: for two or more files with the same hash value, further perform a binary comparison of partial data of the files; and finally determine, according to the result of the partial data comparison, whether the two or more files are identical.
Preferably, the hard link merging unit is specifically configured to merge, in the NTFS file system, the two or more identical files into one file as hard links by means of the CreateHardLink instruction.
A file management method based on reducing redundant data, comprising: traversing all files in each partition of a hard disk; judging whether the hard disk contains two or more identical files, wherein, on the premise that two or more files have been determined to have the same size and different file identifiers, whether the two or more files are identical is determined according to their check values; merging the two or more identical files into one file by creating a hard link; and maintaining a hard link correspondence between the hard link file and one or more file names, so that when the hard link file is updated, deleted, and/or modified through one of the file names, the update, deletion, and/or modification is applied to the hard link file in a unified manner by maintaining the hard link correspondence.
Preferably, the hard link correspondence comprises a file link count, which corresponds to the number of the one or more file names. When one of the file names is deleted, the file link count is decremented by one; the hard link file itself is deleted only when the file link count reaches zero.
Preferably, when the hard link file is updated, the hard link relationship of the file to be updated is changed to the hard link relationship of the updated file.
Preferably, when a file is modified through one of the one or more file names, the modification is applied uniformly to the hard link file.
As can be seen, the above scheme merges duplicate files into one file through hard links, which reduces redundant data on the hard disk, saves hard disk space, and improves the operating efficiency of the hard disk. In a preferred embodiment of the invention, excluding user-level files speeds up the file judgment, and skipping these user-level files avoids compatibility problems. In another preferred embodiment of the invention, whether files are identical is determined by a partial data comparison, which is more accurate and reduces the probability of misjudgment.
In addition, the present invention can bring at least the following advantages: (1) Saving hard disk space. For identical files, only the hard link relationship needs to be maintained and multiple copies are not needed, which saves hard disk space. (2) Renaming files. Renaming a file does not require opening it; only the content of a directory entry needs to change. (3) Deleting files. Deleting a file only requires deleting the corresponding directory entry and decrementing the file's link count by 1; if the link count is zero after the directory entry is deleted, the system then deletes the actual file from the disk. (4) Updating files. When a file needs to be updated, for example under Windows, it is only necessary to first download the new version into the WinSxS directory and then modify the hard link relationship of the file of the same name under Windows\System32 so that it points to the new version instead of the old version. The update is completed quickly without copying the file, so speed improves significantly. (5) Uninstalling patches. When a patch needs to be uninstalled, it is only necessary to point the hard link back to the old version, and no file has to be replaced. Moreover, modifications are synchronized between files that share a hard link relationship, so as long as one is modified, the other is modified as well.
The above description is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented according to the content of the specification, and so that the above and other objects, features, and advantages of the invention become more apparent, specific embodiments of the invention are set out below.
Brief description of the drawings
By reading the following detailed description of the preferred embodiments, various other advantages and benefits will become clear to those of ordinary skill in the art. The drawings are only for the purpose of illustrating the preferred embodiments and are not to be considered limiting of the invention. Throughout the drawings, the same reference symbols denote the same parts. In the drawings:
Fig. 1 shows a flow chart of a method for reducing redundant data according to an embodiment of the invention;
Fig. 2 shows a flow chart of a method for reducing redundant data according to another embodiment of the invention;
Fig. 3a-3b show schematic diagrams of file hard links in the method for reducing redundant data according to an embodiment of the invention;
Fig. 4 shows a schematic structural diagram of a device for reducing redundant data according to an embodiment of the invention.
Detailed description of the embodiments
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure can be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure can be understood more thoroughly and so that its scope can be fully conveyed to those skilled in the art.
A hard disk often contains two files whose contents are exactly the same but which are stored in different locations. For example, 360 Cloud Drive and 360 Safeguard actually share many files, so the hard disk ends up storing a large number of duplicate files, which lowers the usable capacity of the hard disk and causes considerable waste.
Those skilled in the art will understand that NTFS (New Technology File System) is the file system of the Windows NT environment. NTFS is a restricted, dedicated file system for the Windows NT family (for example, Windows 2000, Windows XP, Windows Vista, Windows 7, and Windows 8.1); the drive on which the operating system resides must be formatted as NTFS, in a 4096-byte cluster environment. NTFS replaced the older FAT file system. NTFS makes several improvements over FAT and HPFS, for example supporting metadata and using advanced data structures to improve performance, reliability, and disk space utilization, and it provides some additional extended functions.
A hard link is a service of the NTFS file system. A so-called hard link means that one file has one or more file names. Creating a hard link is like creating a shadow of a file: the shadow and the body stay synchronized. One application scenario for hard links is when a partition is short of capacity: for the sake of organization, the capacity of another partition is used while the file is still displayed on the target partition.
Therefore, the present application proposes, as a novel approach, merging duplicate files on the hard disk into one file by means of hard links. To hard-link duplicate files, it is first necessary to determine which files are duplicates, that is, which files on the hard disk are identical. Once the duplicate files have been determined, they can be merged into one file by hard linking, which reduces the hard disk space occupied and improves the operating efficiency of the hard disk.
Referring to Fig. 1, which shows the flow of a method for reducing redundant data according to an embodiment of the invention, the method comprises the following steps:
S101: traverse all files in each partition of the hard disk.
For file traversal, a currently popular file traversal algorithm may be used, and of course a file traversal algorithm proposed in the future may also be used. Current file traversal algorithms mainly include depth-first traversal and breadth-first traversal. In a specific implementation, each algorithm can be implemented in several ways, generally either recursively or non-recursively.
For a specific file system, the functions provided by the system itself can be used for file traversal. For example, for the NTFS file system of Windows, Windows API functions can be used. Specifically, Windows API functions are used to traverse all files in a specified partition or directory. The idea is to invoke the folder-browsing dialog to specify the starting path of the search, and then to use the file-search API functions to traverse all files under that directory and under the subdirectories it contains.
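The non-recursive, depth-first variant mentioned above can be sketched as follows. This is a minimal illustration in Python rather than the Windows API calls the text refers to; the function name `iter_files` is hypothetical, and the choice to skip unreadable directories silently is an assumption of the sketch.

```python
import os
from typing import Iterator


def iter_files(root: str) -> Iterator[str]:
    """Yield the path of every regular file under root (non-recursive DFS)."""
    stack = [root]  # explicit stack replaces recursion
    while stack:
        current = stack.pop()
        try:
            with os.scandir(current) as entries:
                for entry in entries:
                    if entry.is_dir(follow_symlinks=False):
                        stack.append(entry.path)  # visit this subdirectory later
                    elif entry.is_file(follow_symlinks=False):
                        yield entry.path
        except OSError:
            # Assumption: directories that cannot be read are skipped.
            continue
```

Replacing the stack with a queue (for example `collections.deque` with `popleft`) would turn the same loop into a breadth-first traversal.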
S102: judge whether the hard disk contains two or more identical files: on the premise that two or more files have been determined to have the same size and different file identifiers (IDs), determine whether the files are identical according to their check values.
Judging whether the hard disk contains two or more identical files means finding the duplicate files, which is the precondition for performing the hard link merge. This judgment can be carried out in the following steps:
First step: determine whether the two or more files have the same size. If the sizes differ, the files are not identical; if the sizes are the same, continue with the second step. To compare sizes, for example, the file size can be determined by reading the file attribute information while the Windows API functions traverse the files.
Second step: judge whether the IDs of the two or more files are the same. In the file system of a hard disk every file has a unique ID, so two mutually independent files with identical content still have different IDs. The second step therefore selects the files whose IDs differ and continues with the third step. To compare IDs, for example, the GetFileInformationByHandle API can be called to obtain the file attribute information, and the unique IDs of the files can then be compared.
Third step: calculate the check values of the two or more files (for example, a hash value, an MD5 value, and so on). For example, if the hash values of two or more files are the same, these files can be determined to be identical. To obtain the hash values, the Crypt family of APIs can be used: Windows provides CryptoAPI, and with these APIs hashing can be implemented fairly easily.
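The three steps above can be sketched as a single grouping function. This is an illustration only, under several assumptions: Python's `os.stat` stands in for GetFileInformationByHandle, the pair `(st_dev, st_ino)` stands in for the file ID, SHA-256 from `hashlib` stands in for the CryptoAPI hash, and the names `file_hash` and `find_duplicates` are hypothetical.

```python
import hashlib
import os
from collections import defaultdict
from typing import Iterable, List


def file_hash(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(paths: Iterable[str]) -> List[List[str]]:
    """Group paths whose files are separate on disk but have equal content."""
    # Step 1: only files of the same size can be identical.
    by_size = defaultdict(list)
    for path in paths:
        info = os.stat(path)
        by_size[info.st_size].append((path, (info.st_dev, info.st_ino)))

    groups: List[List[str]] = []
    for candidates in by_size.values():
        # Step 2: names that share a file ID are already one file; keep one.
        seen_ids = set()
        distinct = []
        for path, file_id in candidates:
            if file_id not in seen_ids:
                seen_ids.add(file_id)
                distinct.append(path)
        if len(distinct) < 2:
            continue
        # Step 3: equal check values mark the files as duplicates.
        by_hash = defaultdict(list)
        for path in distinct:
            by_hash[file_hash(path)].append(path)
        groups.extend(group for group in by_hash.values() if len(group) > 1)
    return groups
```

Checking size and ID first keeps hashing, the expensive step, limited to files that could actually be merged.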
S103: merge the two or more identical files into one file by creating a hard link.
The NTFS file system supports a hard link mechanism, which allows only one physical copy of a file to exist while several paths refer to it, somewhat like shortcuts. The file is actually deleted only after all references to it have been deleted. In a specific implementation, CreateHardLink can be used to create a hard link that replaces the earlier file, thereby saving storage space. For the details of creating a hard link, reference may be made to the prior art; they are not repeated here.
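Replacing a duplicate with a hard link might look like the sketch below. It uses Python's `os.link` instead of calling CreateHardLink directly; the function name `merge_into_hard_link` and the temporary suffix `.hl-tmp` are hypothetical, and both paths are assumed to be on the same volume, since a hard link cannot cross volumes.

```python
import os


def merge_into_hard_link(keep: str, duplicate: str) -> None:
    """Replace `duplicate` with a hard link to `keep`.

    Assumes the two files were already verified to be identical and that
    they are separate files on the same volume.
    """
    temp = duplicate + ".hl-tmp"  # hypothetical temporary name
    os.link(keep, temp)           # second name for the data of `keep`
    try:
        os.replace(temp, duplicate)  # swap the new name over the old copy
    except OSError:
        os.unlink(temp)              # leave the duplicate untouched on failure
        raise
```

Linking under a temporary name first means the duplicate's path never disappears: it refers either to the old copy or to the merged file.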
As can be seen, the above scheme merges duplicate files into one file through hard links, which reduces redundant data on the hard disk, saves hard disk space, and improves the operating efficiency of the hard disk.
Referring to Fig. 2, which shows a flow chart of a method for reducing redundant data according to another embodiment of the invention. Compared with the embodiment shown in Fig. 1, the main differences of this embodiment are: (1) when determining duplicate files, user-level files are excluded first, so that the judgment of whether files are identical is made only among non-user-level files; (2) after files have been determined to be identical according to their hash values, a partial data comparison of the files is also performed, and whether the files are identical is finally determined according to the result of that comparison.
The method of Fig. 2 comprises the following steps:
S201: traverse all files in each partition of the hard disk.
For file traversal, a currently popular file traversal algorithm may be used, and of course a file traversal algorithm proposed in the future may also be used. Current file traversal algorithms mainly include depth-first traversal and breadth-first traversal. In a specific implementation, each algorithm can be implemented in several ways, generally either recursively or non-recursively.
For a specific file system, the functions provided by the system itself can be used for file traversal. For example, for the NTFS file system of Windows, Windows API functions can be used. Specifically, Windows API functions are used to traverse all files in a specified partition or directory. The idea is to invoke the folder-browsing dialog to specify the starting path of the search, and then to use the file-search API functions to traverse all files under that directory and under the subdirectories it contains.
S202: exclude user-level files from the traversed files.
To avoid compatibility problems, specific user-level files need to be skipped, and the subsequent duplicate judgment and hard link merge are performed only on non-user-level files. A user-level file is a file that the user may operate on, for example a file that the user may create, modify, or delete, such as a txt file or a Word file. A non-user-level file is a file that the user essentially does not operate on, for example a system file such as an exe file or a dll file.
In a specific implementation, user-level files can be excluded in one of the following two ways, or by a combination of both:
A) Select files of particular types in advance and perform the file traversal only for those types; for example, scan only files with specified suffixes, such as files of type .exe or .dll.
B) Exclude files newly generated within a preset time period according to their modification and/or creation dates, and judge whether there are two or more identical files among the remaining files. For example, file attribute information is obtained while the Windows API traversal functions traverse the files, and the newly generated files are determined from it; for instance, files that were operated on within the most recent week are treated as new files.
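Modes A) and B) can be combined in one predicate, sketched below. The names `ALLOWED_SUFFIXES`, `RECENT_WINDOW_SECONDS`, and `is_candidate` are hypothetical; the one-week window follows the example in the text, and as a simplifying assumption the sketch looks only at the modification time, not the creation time.

```python
import os
import time
from typing import Optional

ALLOWED_SUFFIXES = (".exe", ".dll")       # mode A: preselected file types
RECENT_WINDOW_SECONDS = 7 * 24 * 60 * 60  # mode B: one week, as in the example


def is_candidate(path: str, now: Optional[float] = None) -> bool:
    """Return True if the file should take part in duplicate detection."""
    # Mode A: only files with a preselected suffix are considered.
    if not path.lower().endswith(ALLOWED_SUFFIXES):
        return False
    # Mode B: files modified within the recent window are excluded.
    now = time.time() if now is None else now
    age = now - os.stat(path).st_mtime
    return age > RECENT_WINDOW_SECONDS
```

The `now` parameter exists only so that the cut-off can be fixed by the caller, for example in tests.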
S203: judge, among the non-user-level files, whether the hard disk contains two or more identical files: on the premise that two or more files have been determined to have the same size and different IDs, preliminarily determine whether the files are identical according to their check values (for example, hash values), and then finally determine whether they are identical by comparing partial data of the files.
Judging whether the hard disk contains two or more identical files means finding the duplicate files, which is the precondition for performing the hard link merge. This judgment can be carried out in the following four steps:
First step: determine whether the two or more files have the same size. If the sizes differ, the files are not identical; if the sizes are the same, continue with the second step. To compare sizes, for example, the file size can be determined by reading the file attribute information while the Windows API functions traverse the files.
Second step: judge whether the IDs of the two or more files are the same. In the file system of a hard disk every file has a unique ID, so two mutually independent files with identical content still have different IDs. The second step therefore selects the files whose IDs differ and continues with the third step. To compare IDs, for example, the GetFileInformationByHandle API can be called to obtain the file attribute information, and the unique IDs of the files can then be compared.
Third step: calculate the hash values of the two or more files. If the hash values are the same, these files can be preliminarily determined to be identical. To obtain the hash values, the Crypt family of APIs can be used: Windows provides CryptoAPI, and with these APIs hashing can be implemented fairly easily.
Fourth step: for two or more files with the same hash value, further perform a binary comparison of partial data of the files, and finally determine, according to the result of the partial data comparison, whether the two or more files are identical.
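The binary comparison in the fourth step can be sketched as a chunked byte-by-byte check. The text does not say which portion of the data is compared, so this sketch assumes the strictest reading and compares the entire content; the function name `same_bytes` is hypothetical.

```python
import os


def same_bytes(path_a: str, path_b: str, chunk_size: int = 1 << 16) -> bool:
    """Return True if both files contain exactly the same bytes."""
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    with open(path_a, "rb") as file_a, open(path_b, "rb") as file_b:
        while True:
            chunk_a = file_a.read(chunk_size)
            chunk_b = file_b.read(chunk_size)
            if chunk_a != chunk_b:
                return False  # first differing chunk ends the comparison
            if not chunk_a:
                return True   # both files ended without a difference
```

Running this check only on files whose hash values already match guards against hash collisions while keeping the number of full reads small.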
S204: merge the two or more identical files into one file by creating a hard link.
The NTFS file system supports a hard link mechanism, which allows only one physical copy of a file to exist while several paths refer to it, somewhat like shortcuts. The file is actually deleted only after all references to it have been deleted. In a specific implementation, CreateHardLink can be used to create a hard link that replaces the earlier file, thereby saving storage space.
As can be seen, the embodiment of Fig. 2 improves on the embodiment of Fig. 1. Excluding user-level files speeds up the file judgment, and skipping these user-level files avoids compatibility problems; determining whether files are identical by a partial data comparison is more accurate and reduces the probability of misjudgment. Of course, the two improvements can be applied together, or only one of them can be applied; the embodiments of the invention place no restriction on this.
As mentioned above, creating a hard link is like creating a shadow of a file: the shadow and the body stay synchronized. In short, a link is simply a connection between a file name and the node number used by the file system. Several file names can therefore be linked to the same file, and these file names can be in the same directory or in different directories. If a file has several file names (for example, created with the ln command), the link count of that file equals the number of names. By definition, the link count can be 1, which indicates that the file has only one file name. In short, a hard link allows several file names, in the same directory or in different directories, to modify the same file simultaneously; after the file is modified through one of them, all the files that have a hard link to it are modified together.
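The link count and the synchronized modification described above can be observed with a few lines of Python. This is a sketch only: the function name `demo_link_count` and the file names are hypothetical, and it assumes a file system that supports hard links.

```python
import os


def demo_link_count(directory: str):
    """Create one file with two names, edit it through the second name,
    and return (link count, content as read through the first name)."""
    first = os.path.join(directory, "first.txt")
    second = os.path.join(directory, "second.txt")
    with open(first, "w") as handle:
        handle.write("v1")
    os.link(first, second)           # the file now has two names
    with open(second, "a") as handle:
        handle.write("+edit")        # in-place change through the second name
    with open(first) as handle:
        content = handle.read()      # the same change is seen through the first
    return os.stat(first).st_nlink, content
```

Note that the synchronization holds for in-place writes; a program that saves by writing a new file and renaming it over the old name would give that name a separate file.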
Referring to Fig. 3a, which is a schematic diagram of how duplicate files are stored before the present invention is applied: file 1 and file 2 contain essentially the same data; file 1 is stored at C:\windows\11.dat and file 2 at C:\user\245.dat. The hard disk therefore stores two files with identical data in two places, which wastes storage space. Referring to Fig. 3b, which is a schematic diagram of file hard links in the method for reducing redundant data according to an embodiment of the invention: file 1 and file 2 are merged into one file whose data is mapped to both C:\windows\11.dat and C:\user\245.dat, while only one copy of the actual data is kept, thereby saving hard disk storage space.
The method of the minimizing redundant data that Fig. 1 or Fig. 2 of the present invention provides, can be specifically applied in file management.Therefore, the present invention also provides a kind of file management method based on reducing redundant data, and the method comprises the following steps:
Step 1: traverse all files in each partition of the hard disk;
Step 2: determine whether the hard disk contains two or more identical files: on the premise that the two or more files have the same size and different file identifiers, determine from the check values of the two or more files whether they are identical;
Step 3: by creating hard links, merge the two or more identical files into one file;
Step 4: maintain the hard link correspondence between the hard-linked file and its one or more file names; when the hard-linked file is updated, deleted and/or modified through one or more of the file names, perform the update, deletion and/or modification on the hard-linked file uniformly by maintaining the hard link correspondence.
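Steps 1 and 2 above could be sketched as follows. This is only an illustration under stated assumptions, not the patented implementation: SHA-256 stands in for the unspecified check value, the pair (st_dev, st_ino) stands in for the file identifier, candidates are grouped per volume because a hard link cannot span volumes, and the function names are hypothetical.

```python
import hashlib
import os
from collections import defaultdict


def file_checksum(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicate_groups(root):
    """Return lists of paths with equal size and checksum but different identities."""
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):      # step 1: traverse
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            st = os.stat(path)
            # Assumption: only files on the same volume are compared.
            by_size[(st.st_dev, st.st_size)].append((path, st.st_ino))

    groups = []
    for candidates in by_size.values():                      # step 2: compare
        if len(candidates) < 2:
            continue                                         # unique size: skip hashing
        by_checksum = defaultdict(dict)
        for path, inode in candidates:
            # One path per identity: names that are already hard-linked
            # together share an identity and are not duplicates.
            by_checksum[file_checksum(path)].setdefault(inode, path)
        for distinct in by_checksum.values():
            if len(distinct) >= 2:
                groups.append(sorted(distinct.values()))
    return groups
```

Grouping by size first means the checksum is computed only for files that share a size with at least one other file, which keeps the traversal cheap.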
As described above, the hard link correspondence includes a file link count, which equals the number of the one or more file names. When one of the file names is deleted, the link count is decremented by one, and the hard-linked file itself is deleted only when the link count reaches zero. When the hard-linked file is updated, the hard link relationship of the file to be updated is changed to the hard link relationship of the updated file. When the file is modified through one of the file names, the modification is applied uniformly to the hard-linked file.
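The link-count rule can be observed directly from the file system. The sketch below assumes Python's os.link, os.remove and os.stat; the function name and the file names are hypothetical.

```python
import os


def link_counts_during_deletion(directory):
    """Delete the names of a hard-linked file one by one and record the link count."""
    names = [os.path.join(directory, n) for n in ("a.dat", "b.dat", "c.dat")]
    with open(names[0], "wb") as f:
        f.write(b"payload")
    for extra in names[1:]:
        os.link(names[0], extra)                  # three names, one file

    counts = [os.stat(names[0]).st_nlink]
    for victim in reversed(names[1:]):            # delete c.dat, then b.dat
        os.remove(victim)                         # removes one directory entry only
        counts.append(os.stat(names[0]).st_nlink)

    with open(names[0], "rb") as f:
        still_readable = f.read() == b"payload"   # data survives while a name remains
    os.remove(names[0])                           # last name: the file itself goes away
    return counts, still_readable, os.path.exists(names[0])
```

The recorded counts fall from 3 to 1 while the data stays readable, and the file disappears only after the last name is removed.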
It can be seen that the present invention also brings at least the following advantages:
(1) Saving hard disk space. For identical files, only the hard link relationship needs to be maintained and no multiple copies are needed, which saves hard disk space.
(2) Renaming files. Renaming a file does not require opening it; only the content of a directory entry needs to be changed.
(3) Deleting files. Deleting a file only requires deleting the corresponding directory entry and decrementing the file's link count by 1; if the link count is zero after the directory entry is deleted, the system then deletes the actual file from the disk.
(4) Updating files. For a file update, for example under Windows, the new version only needs to be downloaded into the WinSxS directory first, and then the hard link relationship of the file of the same name under Windows\System32 is changed so that it points to the new version instead of the old version. The update completes quickly without copying the file, so the speed improves significantly.
(5) Uninstalling patches. When a patch needs to be uninstalled, the hard link only needs to be pointed back to the old version, and there is no file replacement problem. Moreover, modifications are synchronized between files that share a hard link relationship, so as long as one side is modified, the other side is modified as well.
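Items (4) and (5) both amount to re-pointing a file name at a different version without copying data. The sketch below is one possible way to do that, assuming Python's os.link and os.replace and that both versions live on the same volume; the function name and the ".relink" temporary suffix are hypothetical and not taken from the patent.

```python
import os


def repoint_hard_link(target_name, version_path):
    """Make target_name refer to the same file as version_path without copying data."""
    if os.path.exists(target_name) and os.path.samefile(target_name, version_path):
        return                                   # already points at this version
    temp_name = target_name + ".relink"          # hypothetical temporary name
    if os.path.exists(temp_name):
        os.remove(temp_name)                     # clear a leftover from an earlier run
    os.link(version_path, temp_name)             # new directory entry for that version
    os.replace(temp_name, target_name)           # swap it in place of the old entry
```

Calling the function with the new version performs the update; calling it again with the old version corresponds to uninstalling the patch. The old version must still have at least one other name (for example in a version store), otherwise its data is released once the target name is re-pointed.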
Corresponding to the above method embodiments, the present invention also provides a device for reducing redundant data. The device may be implemented in software, in hardware, or in a combination of the two, and its function is to carry out the above method embodiments.
Refer to Fig. 4, a schematic structural diagram of the device for reducing redundant data according to an embodiment of the present invention. The device comprises the following units:
A traversal unit 401, configured to traverse all files in each partition of the hard disk;
A judging unit 402, configured to determine whether the hard disk contains two or more identical files: on the premise that the two or more files have the same size and different IDs, determine from the check values of the two or more files whether they are identical;
A hard link merging unit 403, configured to merge the two or more identical files into one file by creating hard links.
Preferably, the device further comprises a user-level file exclusion unit 404, configured to exclude user-level files from the traversed files; the judging unit 402 is specifically configured to determine whether the non-user-level files include two or more identical files.
Preferably, the user-level file exclusion unit 404 is specifically configured to preselect files of specific types and exclude files that are not of those types, and/or to exclude files newly generated within a preset time period according to the modification and/or creation dates of the files.
Preferably, the files of specific types include exe files and/or dll files.
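The exclusion of user-level files could be approximated by a simple filter such as the one below. It is a sketch under assumptions: the text allows the type criterion and the age criterion to be used together or separately, whereas this version applies both; the one-week period, the constant names and the function name are hypothetical.

```python
import os
import time

SYSTEM_FILE_TYPES = (".exe", ".dll")     # the specific types named in the text
MIN_AGE_SECONDS = 7 * 24 * 3600          # assumed "preset time period": one week


def is_dedup_candidate(path, now=None, min_age=MIN_AGE_SECONDS, types=SYSTEM_FILE_TYPES):
    """Return True if the file has a chosen type and was not recently created or modified."""
    if not path.lower().endswith(types):
        return False                     # not one of the preselected types
    st = os.stat(path)
    now = time.time() if now is None else now
    # st_ctime is the creation time on Windows and the metadata change time on Unix.
    newest = max(st.st_mtime, st.st_ctime)
    return now - newest >= min_age       # skip files generated within the period
```

Files rejected by the filter are simply never handed to the duplicate check, so they are neither hashed nor merged.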
Preferably, the judging unit 402 is specifically configured to determine from the hash values of the two or more files whether they are identical; the judging unit 402 is further configured to, for two or more files with the same hash value, perform a binary comparison of partial data of the two or more files, and to finally determine from the result of the partial data comparison whether the two or more files are identical.
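The two-stage check, hash first and binary comparison second, might look like the following sketch. The assumptions are labeled: SHA-256 is used as the hash, and the second stage compares the complete contents with filecmp.cmp(shallow=False), which is a stricter stand-in for the partial data comparison mentioned in the text; the function names are hypothetical.

```python
import filecmp
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_identical(path_a, path_b):
    """Stage 1: compare hashes. Stage 2: confirm with a byte-by-byte comparison."""
    if sha256_of(path_a) != sha256_of(path_b):
        return False                                      # different hash: not identical
    # Assumption: full-content comparison; guards against a hash collision.
    return filecmp.cmp(path_a, path_b, shallow=False)
```

The second stage only runs for pairs that already agree on the hash, so its cost is paid only for likely duplicates.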
Preferably, the hard link merging unit 403 is specifically configured to, in an NTFS file system, merge the two or more identical files into one file in hard link form by means of the CreateHardLink instruction.
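The merge step can be sketched in a portable way with Python's os.link, which, as far as I know, CPython implements on Windows through the Win32 CreateHardLinkW function. The function name and the ".dedup-tmp" suffix are hypothetical, and the sketch assumes the two files have already been verified as identical and reside on the same volume.

```python
import os


def merge_into_hard_link(keep, duplicate):
    """Replace duplicate with a hard link to keep so both names share one copy of the data."""
    if os.path.samefile(keep, duplicate):
        return False                             # already one file, nothing to merge
    temp_name = duplicate + ".dedup-tmp"         # hypothetical temporary name
    os.link(keep, temp_name)                     # extra name for the copy that is kept
    try:
        os.replace(temp_name, duplicate)         # the duplicate's name now refers to keep
    except OSError:
        os.remove(temp_name)                     # leave the directory as it was
        raise
    return True
```

Creating the link under a temporary name and then renaming it over the duplicate avoids a window in which the duplicate's name does not exist; the data of the old duplicate is released once its last name is replaced.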
The algorithms and displays provided here are not inherently related to any particular computer, virtual system or other device. Various general-purpose systems may also be used with the teachings herein. From the description above, the structure required to construct such a system is apparent. Furthermore, the present invention is not directed to any particular programming language. It should be understood that the content of the present invention described here can be implemented in various programming languages, and that the description of a specific language above is given to disclose the best mode of the invention.
Numerous specific details are set forth in the description provided here. However, it is understood that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques are not shown in detail so as not to obscure the understanding of this description.
Similarly, it should be understood that, in order to streamline the disclosure and aid understanding of one or more of the inventive aspects, features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof in the above description of exemplary embodiments. However, this method of disclosure should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all features of a single disclosed embodiment. The claims following the detailed description are therefore expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art will understand that the modules in the devices of an embodiment can be adaptively changed and arranged in one or more devices different from that embodiment. The modules or units or components in an embodiment may be combined into one module or unit or component, and may furthermore be divided into multiple sub-modules or sub-units or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract and drawings) may be replaced by an alternative feature serving the same, equivalent or similar purpose.
Furthermore, those skilled in the art will understand that although some embodiments described here include certain features included in other embodiments and not others, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments can be used in any combination.
The various component embodiments of the present invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the device for reducing redundant data according to embodiments of the invention. The invention may also be implemented as device or apparatus programs (for example, computer programs and computer program products) for performing part or all of the methods described here. Such programs implementing the invention may be stored on a computer-readable medium, or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention can be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several devices, several of these devices may be embodied by one and the same item of hardware. The use of the words first, second, third and so on does not indicate any order; these words may be interpreted as names.
As described above, the present invention provides at least the following solutions:
A1. A method for reducing redundant data, comprising:
traversing all files in each partition of a hard disk;
determining whether the hard disk contains two or more identical files: on the premise that the two or more files have the same size and different file identifiers, determining from the check values of the two or more files whether the two or more files are identical;
merging the two or more identical files into one file by creating hard links.
A2. The method of A1, further comprising, before determining whether the hard disk contains two or more identical files: excluding user-level files from the traversed files;
wherein determining whether the hard disk contains two or more identical files comprises: determining whether the non-user-level files include two or more identical files.
A3. The method of A2, wherein excluding user-level files from the traversed files and determining whether the hard disk contains two or more identical files comprises:
preselecting files of specific types, and determining only among the files of the specific types whether there are two or more identical files; and/or
excluding files newly generated within a preset time period according to the modification and/or creation dates of the files, and determining among the files other than the newly generated files whether there are two or more identical files.
A4. The method of A3, wherein the files of the specific types include exe files and/or dll files.
A5. The method of A1, wherein the check values of the two or more files are hash values of the two or more files, and after determining from the hash values whether the two or more files are identical, the method further comprises:
for two or more files with the same hash value, further performing a binary comparison of partial data of the two or more files;
finally determining, from the result of the partial data comparison, whether the two or more files are identical.
A6. The method of A1, wherein merging the two or more identical files into one file by creating hard links comprises:
in an NTFS file system, merging the two or more identical files into one file in hard link form by means of the CreateHardLink instruction.
B7. A device for reducing redundant data, comprising:
a traversal unit, configured to traverse all files in each partition of a hard disk;
a judging unit, configured to determine whether the hard disk contains two or more identical files: on the premise that the two or more files have the same size and different file identifiers, determine from the check values of the two or more files whether the two or more files are identical;
a hard link merging unit, configured to merge the two or more identical files into one file by creating hard links.
B8. The device of B7, further comprising: a user-level file exclusion unit, configured to exclude user-level files from the traversed files;
wherein the judging unit is specifically configured to determine whether the non-user-level files include two or more identical files.
B9. The device of B8, wherein the user-level file exclusion unit is specifically configured to preselect files of specific types and exclude files that are not of those types, and/or to exclude files newly generated within a preset time period according to the modification and/or creation dates of the files.
B10. The device of B9, wherein the files of the specific types include exe files and/or dll files.
B11. The device of B7, wherein the judging unit is specifically configured to determine from the hash values of the two or more files whether the two or more files are identical; the judging unit is further configured to, for two or more files with the same hash value, perform a binary comparison of partial data of the two or more files, and to finally determine from the result of the partial data comparison whether the two or more files are identical.
B12. The device of B7, wherein the hard link merging unit is specifically configured to, in an NTFS file system, merge the two or more identical files into one file in hard link form by means of the CreateHardLink instruction.
C13. A file management method based on reducing redundant data, comprising:
traversing all files in each partition of a hard disk;
determining whether the hard disk contains two or more identical files: on the premise that the two or more files have the same size and different file identifiers, determining from the check values of the two or more files whether the two or more files are identical;
merging the two or more identical files into one file by creating hard links;
maintaining the hard link correspondence between the hard-linked file and one or more file names; when the hard-linked file is updated, deleted and/or modified through the one or more file names, performing the update, deletion and/or modification on the hard-linked file uniformly by maintaining the hard link correspondence.
C14. The method of C13, wherein the hard link correspondence includes a file link count, and the file link count equals the number of the one or more file names; when one of the one or more file names is deleted, the file link count is decremented by one, and the hard-linked file is deleted only when the file link count reaches zero.
C15. The method of C13, wherein, when the hard-linked file is updated, the hard link relationship of the file to be updated is changed to the hard link relationship of the updated file.
C16. The method of C13, wherein, when the file is modified through one of the one or more file names, the modification is applied uniformly to the hard-linked file.

Claims (10)

1. A method for reducing redundant data, characterized by comprising:
traversing all files in each partition of a hard disk;
determining whether the hard disk contains two or more identical files: on the premise that the two or more files have the same size and different file identifiers, determining from the check values of the two or more files whether the two or more files are identical;
merging the two or more identical files into one file by creating hard links.
2. The method of claim 1, characterized by further comprising, before determining whether the hard disk contains two or more identical files: excluding user-level files from the traversed files;
wherein determining whether the hard disk contains two or more identical files comprises: determining whether the non-user-level files include two or more identical files.
3. The method of claim 2, characterized in that excluding user-level files from the traversed files and determining whether the hard disk contains two or more identical files comprises:
preselecting files of specific types, and determining only among the files of the specific types whether there are two or more identical files; and/or
excluding files newly generated within a preset time period according to the modification and/or creation dates of the files, and determining among the files other than the newly generated files whether there are two or more identical files.
4. The method of claim 3, characterized in that the files of the specific types include exe files and/or dll files.
5. The method of claim 1, characterized in that the check values of the two or more files are hash values of the two or more files, and after determining from the hash values whether the two or more files are identical, the method further comprises:
for two or more files with the same hash value, further performing a binary comparison of partial data of the two or more files;
finally determining, from the result of the partial data comparison, whether the two or more files are identical.
6. The method of claim 1, characterized in that merging the two or more identical files into one file by creating hard links comprises:
in an NTFS file system, merging the two or more identical files into one file in hard link form by means of the CreateHardLink instruction.
7. A device for reducing redundant data, characterized by comprising:
a traversal unit, configured to traverse all files in each partition of a hard disk;
a judging unit, configured to determine whether the hard disk contains two or more identical files: on the premise that the two or more files have the same size and different file identifiers, determine from the check values of the two or more files whether the two or more files are identical;
a hard link merging unit, configured to merge the two or more identical files into one file by creating hard links.
8. The device of claim 7, characterized by further comprising: a user-level file exclusion unit, configured to exclude user-level files from the traversed files;
wherein the judging unit is specifically configured to determine whether the non-user-level files include two or more identical files.
9. The device of claim 8, characterized in that the user-level file exclusion unit is specifically configured to preselect files of specific types and exclude files that are not of those types, and/or to exclude files newly generated within a preset time period according to the modification and/or creation dates of the files.
10. A file management method based on reducing redundant data, characterized by comprising:
traversing all files in each partition of a hard disk;
determining whether the hard disk contains two or more identical files: on the premise that the two or more files have the same size and different file identifiers, determining from the check values of the two or more files whether the two or more files are identical;
merging the two or more identical files into one file by creating hard links;
maintaining the hard link correspondence between the hard-linked file and one or more file names; when the hard-linked file is updated, deleted and/or modified through the one or more file names, performing the update, deletion and/or modification on the hard-linked file uniformly by maintaining the hard link correspondence.
CN201510789116.2A 2015-11-17 2015-11-17 Method and device for reducing redundant data Pending CN105468686A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510789116.2A CN105468686A (en) 2015-11-17 2015-11-17 Method and device for reducing redundant data

Publications (1)

Publication Number Publication Date
CN105468686A true CN105468686A (en) 2016-04-06

Family

ID=55606387

Country Status (1)

Country Link
CN (1) CN105468686A (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20160406