CN105468686A - Method and device for reducing redundant data - Google Patents
Method and device for reducing redundant data
- Publication number
- CN105468686A (CN201510789116.2A)
- Authority
- CN
- China
- Prior art keywords
- file
- files
- hard disk
- hard link
- hard
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/17—Details of further file system functions
- G06F16/174—Redundancy elimination performed by the file system
- G06F16/1748—De-duplication implemented within the file system, e.g. based on file segments
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/17—Details of further file system functions
- G06F16/174—Redundancy elimination performed by the file system
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/18—File system types
- G06F16/1847—File system types specifically adapted to static storage, e.g. adapted to flash memory or SSD
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0602—Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
- G06F3/0608—Saving storage space on storage systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0628—Interfaces specially adapted for storage systems making use of a particular technique
- G06F3/0638—Organizing or formatting or addressing of data
- G06F3/0643—Management of files
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0628—Interfaces specially adapted for storage systems making use of a particular technique
- G06F3/0646—Horizontal data movement in storage systems, i.e. moving data in between storage devices or systems
- G06F3/0652—Erasing, e.g. deleting, data cleaning, moving of data to a wastebasket
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Human Computer Interaction (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a method and a device for reducing redundant data. The method comprises the following steps: traversing all files in each partition of a hard disk; judging whether the hard disk contains two or more identical files, where, on the premise that two or more files have the same size and different file identifiers, whether they are identical is determined according to their check values; and creating a hard link to merge the two or more identical files into one file. With this method, duplicate files are merged into one file through hard links, which reduces the redundant data on the hard disk, saves hard disk space, and improves the operating efficiency of the hard disk.
Description
Technical field
The present invention relates to the field of computer technology, and in particular to a method and a device for reducing redundant data.
Background art
In general, most users only know how to keep saving files to storage devices such as hard disks; they do not know how to maintain the storage device well or how to monitor the capacity of the hard disk effectively. In some cases, a system with a storage device can use software applications to perform maintenance operations such as searching for, retrieving, editing, or deleting files. Traditionally, files on a hard disk are saved one by one according to their directories. This approach has several drawbacks. First, if a hard disk holds a large number of duplicate files, this way of saving them takes up considerable storage space. Second, when the files on a hard disk are backed up, a large number of duplicate files makes the system spend more time backing them up, which increases the backup time. Finally, when the hard disk holds a large volume of files, a user who wants to remove the duplicates has to inspect the files one by one, which is very tedious.
Summary of the invention
In view of the above problems, the present invention is proposed to provide a method for reducing redundant data that overcomes, or at least partially solves, the above problems, thereby improving the operating efficiency of the hard disk.
According to one aspect of the present invention, a method for reducing redundant data is provided, comprising: traversing all files in each partition of a hard disk; judging whether the hard disk contains two or more identical files, where, on the premise that two or more files have the same size and different file identifiers, whether the two or more files are identical is determined according to their check values; and merging the two or more identical files into one file by creating a hard link.
Preferably, before judging whether the hard disk contains two or more identical files, the method further comprises: excluding user-level files from the traversed files. Judging whether the hard disk contains two or more identical files then comprises: judging whether the non-user-level files include two or more identical files.
Preferably, excluding user-level files from the traversed files and judging whether the hard disk contains two or more identical files comprises: preselecting files of a specific type, and judging only among files of the specific type whether there are two or more identical files; and/or excluding files newly generated within a preset time period according to the modification and/or creation dates of the files, and judging among the files other than the newly generated ones whether there are two or more identical files.
Preferably, the files of the specific type include exe files and/or dll files.
Preferably, the check values of the two or more files are their hash values. After determining from the hash values whether the two or more files are identical, the method further comprises: for two or more files with the same hash value, further performing a binary comparison of partial data of the files; and finally determining, according to the result of the partial-data comparison, whether the two or more files are identical.
Preferably, merging the two or more identical files into one file by creating a hard link comprises: in the NTFS file system, merging the two or more identical files into one file by hard link using the CreateHardLink instruction.
A device for reducing redundant data, comprising: a traversal unit, configured to traverse all files in each partition of a hard disk; a judging unit, configured to judge whether the hard disk contains two or more identical files, where, on the premise that two or more files have the same size and different file identifiers, whether the two or more files are identical is determined according to their check values; and a hard link merging unit, configured to merge the two or more identical files into one file by creating a hard link.
Preferably, the device further comprises: a user-level file exclusion unit, configured to exclude user-level files from the traversed files; the judging unit is specifically configured to judge whether the non-user-level files include two or more identical files.
Preferably, the user-level file exclusion unit is specifically configured to preselect files of a specific type and exclude files not of the specified type, and/or to exclude files newly generated within a preset time period according to the modification and/or creation dates of the files.
Preferably, the files of the specific type include exe files and/or dll files.
Preferably, the judging unit is specifically configured to determine whether the two or more files are identical according to their hash values; the judging unit is further configured to perform, for two or more files with the same hash value, a binary comparison of partial data of the files, and to finally determine, according to the result of the partial-data comparison, whether the two or more files are identical.
Preferably, the hard link merging unit is specifically configured to merge, in the NTFS file system, the two or more identical files into one file by hard link using the CreateHardLink instruction.
A file management method based on reducing redundant data, comprising: traversing all files in each partition of a hard disk; judging whether the hard disk contains two or more identical files, where, on the premise that two or more files have the same size and different file identifiers, whether the two or more files are identical is determined according to their check values; merging the two or more identical files into one file by creating a hard link; and maintaining a hard link correspondence between the hard-linked file and one or more file names, so that when the hard-linked file is updated, deleted, and/or modified through the one or more file names, the update, deletion, and/or modification is applied uniformly to the hard-linked file by maintaining the hard link correspondence.
Preferably, the hard link correspondence includes a file link count, which corresponds to the number of the one or more file names; when one of the one or more file names is deleted, the file link count is decremented by one, and the hard-linked file is deleted only when the file link count reaches zero.
Preferably, when a hard-linked file is updated, the hard link relationship of the file to be updated is changed to the hard link relationship of the updated file.
Preferably, when a file is modified through one of the one or more file names, the modification is applied uniformly to the hard-linked file.
As can be seen, the above scheme merges duplicate files into one file through hard links, which reduces the redundant data on the hard disk, saves hard disk space, and improves the operating efficiency of the hard disk. In a preferred embodiment of the invention, excluding user-level files speeds up the file judgment, and skipping these user-level files avoids compatibility problems. In another preferred embodiment of the invention, whether files are identical is determined by a partial-data comparison, which is more accurate and reduces the probability of misjudgment.
In addition, the present invention brings at least the following advantages. (1) Saving hard disk space. For identical files, only the hard link relationship needs to be maintained; multiple copies are not needed, which saves hard disk space. (2) Renaming files. Renaming a file does not require opening it; only the content of a directory entry needs to change. (3) Deleting files. Deleting a file only requires deleting the corresponding directory entry and decrementing the file's link count by 1; if the link count is zero after the directory entry is deleted, the system then deletes the actual file from the disk. (4) Updating files. When a file needs updating, for example under Windows, the new version only needs to be downloaded into the WinSxS directory first; the hard link relationship of the file of the same name under Windows\System32 is then changed so that it points to the new version instead of the old one. The update completes quickly, with no file copying, so speed improves significantly. (5) Uninstalling patches. When a patch must be uninstalled, the hard link only needs to be pointed back at the old version, so no file replacement is involved. Moreover, modifications to files that share a hard link relationship are synchronized: once one is modified, the other is modified as well.
The above description is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented according to the contents of the specification, and so that the above and other objects, features, and advantages of the invention become more apparent, specific embodiments of the invention are set out below.
Brief description of the drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for the purpose of illustrating the preferred embodiments and are not to be considered limiting of the invention. Throughout the drawings, the same reference symbols denote the same parts. In the drawings:
Fig. 1 shows a flowchart of a method for reducing redundant data according to an embodiment of the present invention;
Fig. 2 shows a flowchart of a method for reducing redundant data according to another embodiment of the present invention;
Figs. 3a-3b show schematic diagrams of file hard links in a method for reducing redundant data according to an embodiment of the present invention;
Fig. 4 shows a schematic structural diagram of a device for reducing redundant data according to an embodiment of the present invention.
Detailed description of the embodiments
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure can be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure will be understood more thoroughly and so that its scope can be fully conveyed to those skilled in the art.
The following situation often occurs on a hard disk: two files have exactly the same content but are scattered in different locations. For example, 360 Cloud Drive and 360 Safeguard in fact share many files, so the hard disk stores a large number of duplicate files, which lowers the usable capacity of the hard disk and causes considerable waste.
As those skilled in the art understand, NTFS (New Technology File System) is the file system of the Windows NT environment. NTFS is the dedicated file system of the Windows NT family (for example, Windows 2000, Windows XP, Windows Vista, Windows 7, and Windows 8.1); the drive on which the operating system resides must be formatted as NTFS, in an environment with 4096-byte clusters. NTFS replaced the older FAT file system. NTFS improves on FAT and HPFS in several ways: for example, it supports metadata and uses advanced data structures to improve performance, reliability, and disk space utilization, and it provides a number of additional extensions.
The hard link is a service of the NTFS file system. A hard link means that one file has one or more file names. A hard link is like a shadow created for a file: the shadow and the body stay synchronized. One application scenario for hard links is when a disk partition is short of capacity: for classification purposes, the capacity of another partition is used while the files still appear in the target partition.
Therefore, this application innovatively proposes merging the duplicate files on a hard disk into one file by means of hard links. To hard-link duplicate files, it is first necessary to determine which files are duplicates, that is, which files on the hard disk are identical. Once the duplicate files have been determined, they can be merged into one file by hard link, which reduces the hard disk space occupied and improves the operating efficiency of the hard disk.
Referring to Fig. 1, which shows the flow of a method for reducing redundant data according to an embodiment of the present invention, the method comprises the following steps:
S101: Traverse all files in each partition of the hard disk.
File traversal can be implemented with any currently popular file traversal algorithm, or with a file traversal algorithm proposed in the future. Current file traversal algorithms mainly comprise depth-first traversal and breadth-first traversal. In a specific implementation, each algorithm can be realized in several ways; in general, there are recursive and non-recursive variants.
For a specific file system, the functions built into the system can be used for file traversal. For example, for the Windows NTFS file system, Windows API functions can be used. Specifically, Windows API functions are used to traverse all files in a specified partition or directory. The idea is as follows: bring up the folder-browsing window to specify the starting path of the search, then use the file-finding API functions to traverse all files in that directory and in the subdirectories it contains.
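The non-recursive, depth-first variant mentioned above can be sketched in a few lines. This is only an illustration under stated assumptions: it uses Python's standard `os.scandir` rather than the Windows API functions named in the text, and the function name `iter_files` is hypothetical.

```python
import os
from typing import Iterator


def iter_files(root: str) -> Iterator[str]:
    """Yield the path of every regular file under root.

    Non-recursive depth-first traversal: directories still to be visited
    are kept on an explicit stack instead of the call stack.
    """
    stack = [root]
    while stack:
        current = stack.pop()
        try:
            with os.scandir(current) as entries:
                for entry in entries:
                    if entry.is_dir(follow_symlinks=False):
                        stack.append(entry.path)   # visit later
                    elif entry.is_file(follow_symlinks=False):
                        yield entry.path
        except (PermissionError, FileNotFoundError):
            # Skip directories that cannot be read or that vanished mid-scan.
            continue
```

Replacing the stack with a queue (`collections.deque` and `popleft`) would turn the same loop into a breadth-first traversal.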
S102: Judge whether the hard disk contains two or more identical files: on the premise that two or more files have the same size and different file identifiers (IDs), determine whether the two or more files are identical according to their check values.
Judging whether the hard disk contains two or more identical files means finding the duplicate files; this is the precondition for the hard link merge. The judgment can be carried out in the following steps:
Step one: Determine whether the two or more files have the same size. If the sizes differ, the files are not identical; if the sizes are the same, continue with step two. To determine whether the sizes are the same, the file size can, for example, be read from the file attribute information while the Windows API functions traverse the files, and the sizes can then be compared.
Step two: Judge whether the two or more files have the same ID. In the file system of a hard disk, each file has a unique ID, so two mutually independent files with identical content still have different IDs. Files that share an ID are already the same file and need no merging; step two therefore selects the files whose IDs differ, and these proceed to step three. To determine whether the file IDs are the same, the GetFileInformationByHandle API can, for example, be called to obtain the file attribute information, and the unique IDs can then be compared.
Step three: Calculate the check values of the two or more files (for example, a hash value or an MD5 value). If the hash values of the two or more files are the same, the files can be determined to be identical. The hash values can be calculated with the Crypt-series APIs: Windows provides CryptoAPI, and these APIs make hashing straightforward.
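The three steps above can be sketched as a single filtering pipeline. The sketch rests on several assumptions that are not in the original text: it is written in Python rather than against the Windows APIs, it uses the pair `(st_dev, st_ino)` from `os.stat` as a stand-in for the file ID obtained via GetFileInformationByHandle, it uses SHA-256 from `hashlib` as the check value, and the function names are hypothetical.

```python
import hashlib
import os
from collections import defaultdict
from typing import Iterable, List, Tuple


def file_id(path: str) -> Tuple[int, int]:
    """Stand-in for the file ID: (device, file index) from os.stat."""
    st = os.stat(path)
    return (st.st_dev, st.st_ino)


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Check value of a file, computed chunk by chunk (SHA-256 is an assumption)."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(paths: Iterable[str]) -> List[List[str]]:
    """Return groups of paths that hold identical data in separate files."""
    # Step one: only files of the same size can be identical.
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)

    groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue
        # Step two: names that share a file ID are already one file,
        # so keep a single representative per ID.
        one_per_id = {}
        for path in same_size:
            one_per_id.setdefault(file_id(path), path)
        if len(one_per_id) < 2:
            continue
        # Step three: equal check values mark the files as duplicates.
        by_hash = defaultdict(list)
        for path in one_per_id.values():
            by_hash[sha256_of(path)].append(path)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return groups
```

Ordering the checks from cheapest (size) to most expensive (hash) means file contents are read only for files that have already passed the cheaper checks.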
S103: Merge the two or more identical files into one file by creating a hard link.
The NTFS file system supports a hard link mechanism, which allows a physical file to exist as a single copy while being reachable through several reference paths, somewhat like a shortcut. The file is truly deleted only after all of its references have been deleted. In a specific implementation, CreateHardLink can be used to create the hard link that replaces the previous file, thereby saving storage space. The details of creating a hard link can be found in the prior art and are not repeated here.
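A minimal sketch of replacing a duplicate with a hard link is given below. It is an assumption-laden illustration, not the patent's implementation: it uses Python's `os.link` (available on both Windows and Unix) in place of a direct CreateHardLink call, the function name and the temporary suffix `.dedup-tmp` are hypothetical, and both paths must be on the same volume because a hard link cannot span volumes.

```python
import os


def merge_into_hard_link(keep: str, duplicate: str) -> None:
    """Replace `duplicate` with a hard link to `keep`.

    The caller is assumed to have verified that both files hold the same data.
    """
    if os.path.samefile(keep, duplicate):
        return  # already the same file, nothing to merge

    tmp = duplicate + ".dedup-tmp"  # hypothetical temporary name
    os.link(keep, tmp)              # second name for keep's data
    try:
        os.replace(tmp, duplicate)  # swap the new link over the duplicate
    except OSError:
        os.unlink(tmp)              # leave the duplicate untouched on failure
        raise
```

Creating the link under a temporary name first means the duplicate's name never disappears: if linking fails, the original duplicate is still in place.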
As can be seen, the above scheme merges duplicate files into one file through hard links, which reduces the redundant data on the hard disk, saves hard disk space, and improves the operating efficiency of the hard disk.
Referring to Fig. 2, which shows a flowchart of a method for reducing redundant data according to another embodiment of the present invention. Compared with the embodiment shown in Fig. 1, this embodiment differs mainly in two respects: (1) when determining duplicate files, user-level files are excluded first, so that the judgment of whether files are identical is made only among non-user-level files; (2) after files have been determined to be identical according to their hash values, a partial-data comparison of the files is performed, and whether the files are identical is finally determined according to the result of that comparison.
The method shown in Fig. 2 comprises the following steps:
S201: Traverse all files in each partition of the hard disk.
File traversal can be implemented with any currently popular file traversal algorithm, or with a file traversal algorithm proposed in the future. Current file traversal algorithms mainly comprise depth-first traversal and breadth-first traversal. In a specific implementation, each algorithm can be realized in several ways; in general, there are recursive and non-recursive variants.
For a specific file system, the functions built into the system can be used for file traversal. For example, for the Windows NTFS file system, Windows API functions can be used. Specifically, Windows API functions are used to traverse all files in a specified partition or directory. The idea is as follows: bring up the folder-browsing window to specify the starting path of the search, then use the file-finding API functions to traverse all files in that directory and in the subdirectories it contains.
S202: Exclude user-level files from the traversed files.
To avoid compatibility problems, specific user-level files need to be skipped, and the subsequent duplicate judgment and hard link merge are performed only on non-user-level files. User-level files are files that the user may operate on, for example files that the user may create, modify, or delete, such as txt files or Word files. Non-user-level files are files that the user essentially does not operate on, for example system files such as exe files or dll files.
In a specific implementation, user-level files can be excluded in one of the following two ways, or in both combined:
A) Preselect files of a specific type and traverse only files of that type, for example by scanning only files with specified suffixes, such as files of type .exe or .dll.
B) Exclude files newly generated within a preset time period according to the modification and/or creation dates of the files, and judge among the remaining files whether there are two or more identical files. For example, the file attribute information is obtained while the Windows API file traversal functions traverse the files, and the newly generated files are determined from it; for instance, files operated on within the most recent week are treated as new files.
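Ways A) and B) can be combined into one predicate, as sketched below. The assumptions are loud ones: the sketch is in Python, it looks only at the modification time (`st_mtime`) even though the text also allows the creation date, and the names `SYSTEM_SUFFIXES`, `RECENT_WINDOW_SECONDS`, and `is_candidate` are hypothetical. The one-week window and the .exe/.dll suffixes are the examples given in the text.

```python
import os
import time
from typing import Optional, Tuple

SYSTEM_SUFFIXES: Tuple[str, ...] = (".exe", ".dll")  # way A: scanned types
RECENT_WINDOW_SECONDS = 7 * 24 * 3600                # way B: one week


def is_candidate(path: str,
                 now: Optional[float] = None,
                 suffixes: Tuple[str, ...] = SYSTEM_SUFFIXES,
                 window: float = RECENT_WINDOW_SECONDS) -> bool:
    """Return True if the file should take part in duplicate detection."""
    # Way A: keep only files of the preselected types.
    if not path.lower().endswith(suffixes):
        return False
    # Way B: skip files modified within the preset time period.
    now = time.time() if now is None else now
    return now - os.stat(path).st_mtime >= window
```

Such a predicate would typically be applied during traversal, so that excluded files are never sized or hashed at all.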
S203: Judge whether the non-user-level files on the hard disk include two or more identical files: on the premise that two or more files have the same size and different IDs, preliminarily determine whether the two or more files are identical according to their check values (for example, hash values), and finally determine whether they are identical by a further comparison of partial file data.
Judging whether the hard disk contains two or more identical files means finding the duplicate files; this is the precondition for the hard link merge. The judgment can be carried out in the following four steps:
Step one: Determine whether the two or more files have the same size. If the sizes differ, the files are not identical; if the sizes are the same, continue with step two. To determine whether the sizes are the same, the file size can, for example, be read from the file attribute information while the Windows API functions traverse the files, and the sizes can then be compared.
Step two: Judge whether the two or more files have the same ID. In the file system of a hard disk, each file has a unique ID, so two mutually independent files with identical content still have different IDs. Files that share an ID are already the same file and need no merging; step two therefore selects the files whose IDs differ, and these proceed to step three. To determine whether the file IDs are the same, the GetFileInformationByHandle API can, for example, be called to obtain the file attribute information, and the unique IDs can then be compared.
Step three: Calculate the hash values of the two or more files. If the hash values are the same, the files can be preliminarily determined to be identical. The hash values can be calculated with the Crypt-series APIs: Windows provides CryptoAPI, and these APIs make hashing straightforward.
Step four: For two or more files with the same hash value, further perform a binary comparison of partial data of the files, and finally determine, according to the result of the partial-data comparison, whether the two or more files are identical.
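Step four can be sketched as a chunked byte comparison. The text does not say which part of the data is compared, so the sketch makes an explicit assumption: an optional `limit` parameter (hypothetical, like the function name `same_bytes`) restricts the comparison to the first `limit` bytes, and leaving it unset compares the files in full.

```python
from typing import Optional


def same_bytes(path_a: str, path_b: str,
               limit: Optional[int] = None,
               chunk_size: int = 1 << 16) -> bool:
    """Compare two files byte for byte.

    With limit=None the whole content is compared; otherwise only the
    first `limit` bytes are compared.
    """
    remaining = limit
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while remaining is None or remaining > 0:
            n = chunk_size if remaining is None else min(chunk_size, remaining)
            a = fa.read(n)
            b = fb.read(n)
            if a != b:
                return False   # first differing chunk settles the question
            if not a:
                return True    # both files ended at the same point
            if remaining is not None:
                remaining -= len(a)
    return True
```

Because the comparison stops at the first differing chunk, it costs a full read only for files that really are identical, which is exactly the case where the extra certainty matters before merging.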
S204: Merge the two or more identical files into one file by creating a hard link.
The NTFS file system supports a hard link mechanism, which allows a physical file to exist as a single copy while being reachable through several reference paths, somewhat like a shortcut. The file is truly deleted only after all of its references have been deleted. In a specific implementation, CreateHardLink can be used to create the hard link that replaces the previous file, thereby saving storage space.
As can be seen, the embodiment corresponding to Fig. 2 improves on the embodiment corresponding to Fig. 1. Excluding user-level files speeds up the file judgment, and skipping these user-level files avoids compatibility problems; determining whether files are identical by a partial-data comparison is more accurate and reduces the probability of misjudgment. The two improvements can be applied together, or only one of them can be applied; the embodiments of the present invention place no limitation on this.
As mentioned above, a hard link is like a shadow created for a file: the shadow and the body stay synchronized. In short, a link is simply an association between a file name and the node number used by the computer's file system. Several file names can therefore be linked to the same file, and these file names can be in the same directory or in different directories. If a file has several file names (for example, created with the ln command), the link count of the file equals the number of those names. By definition, the link count can be 1, which indicates that the file has only one file name. In a word, a hard link allows several file names, in the same directory or in different directories, to modify the same file: after a modification is made through one of them, all files that share the hard link are modified together.
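The "shadow and body" behaviour described above can be observed directly. The short sketch below is only an illustration, assuming Python's `os.link` and the `st_nlink` field of `os.stat` as a view of the link count; the function name and file names are hypothetical.

```python
import os
import tempfile
from typing import Tuple


def demo_shared_content() -> Tuple[int, str]:
    """Create two names for one file, edit through one, read through the other."""
    with tempfile.TemporaryDirectory() as d:
        first = os.path.join(d, "first.dat")
        second = os.path.join(d, "second.dat")
        with open(first, "w") as f:
            f.write("v1")
        os.link(first, second)                 # second name, same data
        link_count = os.stat(first).st_nlink   # number of names for the file
        with open(second, "a") as f:           # modify through the second name
            f.write("+edit")
        with open(first) as f:                 # read back through the first name
            return link_count, f.read()
```

Note that this holds only for in-place writes; a program that saves by writing a new file and renaming it over the old name would break the link for that name.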
Fig. 3a is a schematic diagram of how duplicate files are stored before the present invention is applied. File 1 and file 2 hold substantially identical data; file 1 is stored at C: windows 11.dat, and file 2 is stored at C: user 245.dat. The hard disk thus stores two files with the same data in two places, which wastes storage space. Fig. 3b is a schematic diagram of file hard links in the method for reducing redundant data according to an embodiment of the present invention. In Fig. 3b, file 1 and file 2 are merged into one file, whose data is mapped to the two locations C: windows 11.dat and C: user 245.dat, while only one copy of the actual data is kept, which saves hard disk storage space.
The method for reducing redundant data provided in Fig. 1 or Fig. 2 of the present invention can be applied specifically to file management. The present invention therefore also provides a file management method based on reducing redundant data, which comprises the following steps:
Step 1: Traverse all files in each partition of the hard disk;
Step 2: Judge whether the hard disk contains two or more identical files: on the premise that two or more files have the same size and different file identifiers, determine whether the two or more files are identical according to their check values;
Step 3: Merge the two or more identical files into one file by creating a hard link;
Step 4: Maintain the hard link correspondence between the hard-linked file and one or more file names; when the hard-linked file is updated, deleted, and/or modified through the one or more file names, apply the update, deletion, and/or modification uniformly to the hard-linked file by maintaining the hard link correspondence.
As described above, the hard link correspondence includes a file link count, which corresponds to the number of the one or more file names. When one of the one or more file names is deleted, the file link count is decremented by one, and the hard-linked file is deleted only when the file link count reaches zero. When a hard-linked file is updated, the hard link relationship of the file to be updated is changed to the hard link relationship of the updated file. When a file is modified through one of the one or more file names, the modification is applied uniformly to the hard-linked file.
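The link-count rule for deletion can be illustrated as follows. This is a sketch under assumptions: the file system itself maintains the count, which the code merely reads through `st_nlink` from Python's `os.stat`, and the function names are hypothetical.

```python
import os
import tempfile
from typing import List


def remove_name(path: str) -> int:
    """Delete one file name and return how many names still refer to the data."""
    remaining = os.stat(path).st_nlink - 1
    os.unlink(path)   # removes this name; the data goes away only at zero
    return remaining


def demo_delete_by_link_count() -> List[int]:
    """Give one file three names, then delete them one by one."""
    with tempfile.TemporaryDirectory() as d:
        names = [os.path.join(d, n) for n in ("a.dat", "b.dat", "c.dat")]
        with open(names[0], "w") as f:
            f.write("payload")
        os.link(names[0], names[1])
        os.link(names[0], names[2])
        return [remove_name(name) for name in names]
```

Each deletion lowers the count by one, and the data remains readable through any surviving name until the last name is removed.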
It can be seen that the present invention brings at least the following advantages:
(1) Saving hard disk space. For identical files, only the hard link relation needs to be maintained; multiple copies are not needed, which saves hard disk space.
(2) Renaming files. Renaming a file does not require opening it; only the content of the corresponding directory entry needs to be changed.
(3) Deleting files. Deleting a file only requires deleting the corresponding directory entry and decrementing the file's link count by 1; only if the link count is zero after the directory entry is removed does the system delete the actual file from the disk.
(4) Updating files. When a file needs to be updated, for example under Windows, the new version only needs to be downloaded into the WinSxS directory first; the hard link relation of the file of the same name under Windows\System32 is then modified so that it points to the new version instead of the old version. The update is thus completed quickly without copying the file, and speed improves significantly.
(5) Uninstalling patches. When a patch needs to be uninstalled, the hard link only needs to be pointed back to the old version, so there is no file replacement problem. Moreover, modifications to files that share a hard link relation are synchronized: as soon as one of them is modified, the other is modified as well.
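Advantages (4) and (5) can be sketched as re-pointing a name from one stored version to another. The sketch below is an assumption-laden illustration rather than how Windows servicing actually works: the directories `store` and `system` merely stand in for a version store and a load directory, and `repoint`, `lib_v1.dll`, `lib_v2.dll` and the `.staging` suffix are hypothetical names.

```python
import os
import tempfile

def repoint(name, target):
    """Make `name` a hard link to `target`, replacing whatever `name` was before."""
    if os.path.exists(name) and os.path.samefile(name, target):
        return                           # already points at the target
    staging = name + ".staging"          # hypothetical temporary name
    if os.path.exists(staging):
        os.remove(staging)
    os.link(target, staging)             # new link to the target's data, no copy
    os.replace(staging, name)            # swap the directory entry

def read_bytes(path):
    with open(path, "rb") as f:
        return f.read()

with tempfile.TemporaryDirectory() as tmp:
    store = os.path.join(tmp, "store")     # stands in for a version store
    system = os.path.join(tmp, "system")   # stands in for the load directory
    os.mkdir(store)
    os.mkdir(system)
    old = os.path.join(store, "lib_v1.dll")
    new = os.path.join(store, "lib_v2.dll")
    for path, data in ((old, b"v1"), (new, b"v2")):
        with open(path, "wb") as f:
            f.write(data)
    live = os.path.join(system, "lib.dll")
    history = []
    repoint(live, old)                     # initial state
    history.append(read_bytes(live))
    repoint(live, new)                     # update: point at the new version
    history.append(read_bytes(live))
    repoint(live, old)                     # uninstall: point back at the old version
    history.append(read_bytes(live))
    versions_intact = read_bytes(old) == b"v1" and read_bytes(new) == b"v2"
```

Neither update nor rollback copies file contents; only the directory entry for `lib.dll` changes, and both stored versions remain untouched.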
Corresponding to the above method embodiments, the present invention also provides a device for reducing redundant data. The device may be implemented in software, hardware, or a combination of both, and its functions implement the above method embodiments.
Fig. 4 is a schematic structural diagram of the device for reducing redundant data according to an embodiment of the invention. The device comprises the following units:
a traversal unit 401, configured to traverse all files in each partition of the hard disk;
a judging unit 402, configured to determine whether the hard disk contains two or more identical files: on the premise that two or more files have the same size but different IDs, determining from the check values of those files whether they are identical;
a hard link merging unit 403, configured to merge the two or more identical files into one file by creating hard links.
Preferably, the device further comprises a user-level file exclusion unit 404, configured to exclude user-level files from the traversed files; the judging unit 402 is specifically configured to determine whether the non-user-level files include two or more identical files.
Preferably, the user-level file exclusion unit 404 is specifically configured to preselect files of specific types and exclude files not of the specified types, and/or to exclude files newly generated within a preset time period according to the modification and/or creation date of the files.
Preferably, the files of the specific types include exe files and/or dll files.
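The two exclusion rules can be sketched as a single predicate. The text leaves the length of the preset time period open and allows the rules to be used separately or together; the sketch below assumes a seven-day window and applies both rules at once. The names `is_candidate`, `ALLOWED_EXTENSIONS` and `RECENT_WINDOW_SECONDS` are hypothetical.

```python
import os
import time

ALLOWED_EXTENSIONS = {".exe", ".dll"}      # file types named in the text
RECENT_WINDOW_SECONDS = 7 * 24 * 3600      # assumed value for the preset time period

def is_candidate(path, now=None, allowed=ALLOWED_EXTENSIONS,
                 window=RECENT_WINDOW_SECONDS):
    """Return True if the file has an allowed type and was not generated recently."""
    now = time.time() if now is None else now
    extension = os.path.splitext(path)[1].lower()
    if extension not in allowed:
        return False                       # not one of the preselected types
    st = os.stat(path)
    # st_mtime is the modification time. st_ctime is the creation time on
    # Windows but the metadata change time on most Unix systems.
    newest = max(st.st_mtime, st.st_ctime)
    return now - newest >= window          # exclude files inside the recent window
```

Passing `now` explicitly keeps the predicate easy to exercise without waiting for files to age.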
Preferably, the judging unit 402 is specifically configured to determine from the hash values of the two or more files whether they are identical; the judging unit 402 is further configured to, for two or more files with identical hash values, perform a binary comparison of partial data of those files, and to finally determine from the result of the partial data comparison whether they are identical.
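The two-stage check can be sketched as follows. The text does not say which hash is used or which part of the data is compared, so SHA-256 and a comparison of the first and last 4096 bytes are assumptions here, as are the function names.

```python
import hashlib
import os

def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def same_sample(path_a, path_b, sample_size=4096):
    """Compare the first and last sample_size bytes of two files byte for byte."""
    size = os.path.getsize(path_a)
    if size != os.path.getsize(path_b):
        return False
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        if a.read(sample_size) != b.read(sample_size):
            return False
        if size > sample_size:
            a.seek(size - sample_size)     # jump to the tail of each file
            b.seek(size - sample_size)
            if a.read(sample_size) != b.read(sample_size):
                return False
    return True

def are_identical(path_a, path_b):
    """Hash comparison first, then a partial binary comparison as confirmation."""
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    if sha256_of(path_a) != sha256_of(path_b):
        return False
    return same_sample(path_a, path_b)
```

The binary comparison guards against a hash collision at the cost of two short reads per file.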
Preferably, the hard link merging unit 403 is specifically configured to merge, in the NTFS file system, two or more identical files into one file by hard link using the CreateHardLink instruction.
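The merge step can be sketched with Python's `os.link`, which, to the best of our knowledge, CPython implements on Windows through the Win32 `CreateHardLinkW` function. The function name `merge_into_one` and the `.dedup-tmp` suffix are hypothetical, and the sketch assumes the caller has already verified that the files are identical and reside on the same volume, since hard links cannot span volumes.

```python
import os

def merge_into_one(paths):
    """Keep paths[0] as the data holder and turn every other path into a hard link to it."""
    keeper, *others = paths
    for path in others:
        if os.path.samefile(keeper, path):
            continue                       # already a link to the same file
        staging = path + ".dedup-tmp"      # hypothetical temporary name
        os.link(keeper, staging)           # create the new link next to the duplicate
        os.replace(staging, path)          # replace the duplicate with the link
    return os.stat(keeper).st_nlink        # number of names now sharing the data
```

Creating the link under a temporary name and then replacing the duplicate avoids a window in which the original path does not exist.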
The algorithms and displays provided herein are not inherently related to any particular computer, virtual system or other apparatus. Various general-purpose systems may also be used with the teachings herein. From the description above, the structure required to construct such systems is apparent. Moreover, the present invention is not directed to any particular programming language. It should be understood that various programming languages may be used to implement the invention described herein, and that the above description of a specific language is intended to disclose the best mode of the invention.
Numerous specific details are set forth in the description provided herein. It will be understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail so as not to obscure the understanding of this description.
Similarly, it should be understood that, in order to streamline the disclosure and aid understanding of one or more of the various inventive aspects, in the above description of exemplary embodiments the features of the invention are sometimes grouped together into a single embodiment, figure, or description thereof. This method of disclosure, however, should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all features of a single foregoing disclosed embodiment. The claims following the detailed description are therefore expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art will appreciate that the modules in the device of an embodiment may be adaptively changed and arranged in one or more devices different from that embodiment. The modules, units or components of an embodiment may be combined into one module, unit or component, and may in addition be divided into multiple sub-modules, sub-units or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract and drawings) may be replaced by an alternative feature serving the same, equivalent or similar purpose.
In addition, those skilled in the art will understand that, although some embodiments described herein include certain features included in other embodiments and not others, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
The component embodiments of the invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the device for reducing redundant data according to embodiments of the invention. The invention may also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for performing part or all of the method described herein. Such a program implementing the invention may be stored on a computer-readable medium, or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third and so on does not indicate any order; these words may be interpreted as names.
As described above, the present invention provides at least the following schemes:
A1. A method for reducing redundant data, comprising:
traversing all files in each partition of a hard disk;
determining whether the hard disk contains two or more identical files: on the premise that two or more files have the same size but different file identifiers, determining from the check values of the two or more files whether they are identical;
merging the two or more identical files into one file by creating hard links.
A2. The method of A1, further comprising, before determining whether the hard disk contains two or more identical files: excluding user-level files from the traversed files;
wherein determining whether the hard disk contains two or more identical files comprises: determining whether the non-user-level files include two or more identical files.
A3. The method of A2, wherein excluding user-level files from the traversed files and determining whether the hard disk contains two or more identical files comprises:
preselecting files of specific types, and determining only among files of the specific types whether there are two or more identical files; and/or
excluding files newly generated within a preset time period according to the modification and/or creation date of the files, and determining among the files other than the newly generated files whether there are two or more identical files.
A4. The method of A3, wherein the files of the specific types include exe files and/or dll files.
A5. The method of A1, wherein the check values of the two or more files are hash values of the two or more files, and, after determining from the hash values whether the two or more files are identical, the method further comprises:
for two or more files with identical hash values, further performing a binary comparison of partial data of the two or more files;
finally determining from the result of the partial data comparison whether the two or more files are identical.
A6. The method of A1, wherein merging the two or more identical files into one file by creating hard links comprises:
in the NTFS file system, merging the two or more identical files into one file by hard link using the CreateHardLink instruction.
B7. A device for reducing redundant data, comprising:
a traversal unit, configured to traverse all files in each partition of a hard disk;
a judging unit, configured to determine whether the hard disk contains two or more identical files: on the premise that two or more files have the same size but different file identifiers, determining from the check values of the two or more files whether they are identical;
a hard link merging unit, configured to merge the two or more identical files into one file by creating hard links.
B8. The device of B7, further comprising: a user-level file exclusion unit, configured to exclude user-level files from the traversed files;
wherein the judging unit is specifically configured to determine whether the non-user-level files include two or more identical files.
B9. The device of B8, wherein the user-level file exclusion unit is specifically configured to preselect files of specific types and exclude files not of the specified types, and/or to exclude files newly generated within a preset time period according to the modification and/or creation date of the files.
B10. The device of B9, wherein the files of the specific types include exe files and/or dll files.
B11. The device of B7, wherein the judging unit is specifically configured to determine from the hash values of the two or more files whether they are identical; the judging unit is further configured to, for two or more files with identical hash values, perform a binary comparison of partial data of the two or more files, and to finally determine from the result of the partial data comparison whether the two or more files are identical.
B12. The device of B7, wherein the hard link merging unit is specifically configured to merge, in the NTFS file system, the two or more identical files into one file by hard link using the CreateHardLink instruction.
C13. A file management method based on reducing redundant data, comprising:
traversing all files in each partition of a hard disk;
determining whether the hard disk contains two or more identical files: on the premise that two or more files have the same size but different file identifiers, determining from the check values of the two or more files whether they are identical;
merging the two or more identical files into one file by creating hard links;
maintaining the hard link correspondence between the hard-linked file and its one or more file names; when the hard-linked file is updated, deleted and/or modified through the one or more file names, performing the update, deletion and/or modification uniformly on the hard-linked file by maintaining the hard link correspondence.
C14. The method of C13, wherein the hard link correspondence includes a file link count equal to the number of the one or more file names; when one of the one or more file names is deleted, the file link count is decremented by one, and the hard-linked file is deleted only when the file link count reaches zero.
C15. The method of C13, wherein, when the hard-linked file is updated, the hard link relation of the file to be updated is changed to that of the updated file.
C16. The method of C13, wherein, when the file is modified through one of the one or more file names, the modification is applied uniformly to the hard-linked file.
Claims (10)
1. A method for reducing redundant data, characterized by comprising:
traversing all files in each partition of a hard disk;
determining whether the hard disk contains two or more identical files: on the premise that two or more files have the same size but different file identifiers, determining from the check values of the two or more files whether they are identical;
merging the two or more identical files into one file by creating hard links.
2. The method of claim 1, characterized in that, before determining whether the hard disk contains two or more identical files, the method further comprises: excluding user-level files from the traversed files;
determining whether the hard disk contains two or more identical files comprises: determining whether the non-user-level files include two or more identical files.
3. The method of claim 2, characterized in that excluding user-level files from the traversed files and determining whether the hard disk contains two or more identical files comprises:
preselecting files of specific types, and determining only among files of the specific types whether there are two or more identical files; and/or
excluding files newly generated within a preset time period according to the modification and/or creation date of the files, and determining among the files other than the newly generated files whether there are two or more identical files.
4. The method of claim 3, characterized in that the files of the specific types include exe files and/or dll files.
5. The method of claim 1, characterized in that the check values of the two or more files are hash values of the two or more files, and, after determining from the hash values whether the two or more files are identical, the method further comprises:
for two or more files with identical hash values, further performing a binary comparison of partial data of the two or more files;
finally determining from the result of the partial data comparison whether the two or more files are identical.
6. The method of claim 1, characterized in that merging the two or more identical files into one file by creating hard links comprises:
in the NTFS file system, merging the two or more identical files into one file by hard link using the CreateHardLink instruction.
7. A device for reducing redundant data, characterized by comprising:
a traversal unit, configured to traverse all files in each partition of a hard disk;
a judging unit, configured to determine whether the hard disk contains two or more identical files: on the premise that two or more files have the same size but different file identifiers, determining from the check values of the two or more files whether they are identical;
a hard link merging unit, configured to merge the two or more identical files into one file by creating hard links.
8. The device of claim 7, characterized by further comprising: a user-level file exclusion unit, configured to exclude user-level files from the traversed files;
the judging unit being specifically configured to determine whether the non-user-level files include two or more identical files.
9. The device of claim 8, characterized in that the user-level file exclusion unit is specifically configured to preselect files of specific types and exclude files not of the specified types, and/or to exclude files newly generated within a preset time period according to the modification and/or creation date of the files.
10. A file management method based on reducing redundant data, characterized by comprising:
traversing all files in each partition of a hard disk;
determining whether the hard disk contains two or more identical files: on the premise that two or more files have the same size but different file identifiers, determining from the check values of the two or more files whether they are identical;
merging the two or more identical files into one file by creating hard links;
maintaining the hard link correspondence between the hard-linked file and its one or more file names; when the hard-linked file is updated, deleted and/or modified through the one or more file names, performing the update, deletion and/or modification uniformly on the hard-linked file by maintaining the hard link correspondence.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510789116.2A CN105468686A (en) | 2015-11-17 | 2015-11-17 | Method and device for reducing redundant data |
Publications (1)
Publication Number | Publication Date |
---|---|
CN105468686A true CN105468686A (en) | 2016-04-06 |
Family
ID=55606387
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510789116.2A Pending CN105468686A (en) | 2015-11-17 | 2015-11-17 | Method and device for reducing redundant data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105468686A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101079034A (en) * | 2006-07-10 | 2007-11-28 | 腾讯科技(深圳)有限公司 | System and method for eliminating redundancy file of file storage system |
CN103324699A (en) * | 2013-06-08 | 2013-09-25 | 西安交通大学 | Rapid data de-duplication method adapted to big data application |
CN104462591A (en) * | 2014-12-31 | 2015-03-25 | 上海斐讯数据通信技术有限公司 | Method for avoiding repeated download and mobile terminal |
Non-Patent Citations (2)
Title |
---|
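Wang Yu: "Research on Data Redundancy and Maintenance Techniques in Distributed Storage Systems", China Doctoral Dissertations Full-text Database, Information Science and Technology * |
Wang Jun et al.: "Operating System (Linux)", 31 July 2015 * |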
Cited By (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106844431A (en) * | 2016-12-12 | 2017-06-13 | 北京猎豹移动科技有限公司 | File memory method, device and its electronic equipment |
WO2018113210A1 (en) * | 2016-12-21 | 2018-06-28 | 深圳市易特科信息技术有限公司 | Repeated medical documentation deletion system and method in medical informationization |
CN106649721A (en) * | 2016-12-22 | 2017-05-10 | 创新科存储技术有限公司 | Method and device for duplication removal of file |
CN107391593A (en) * | 2017-06-29 | 2017-11-24 | 努比亚技术有限公司 | A kind of document handling method, mobile terminal and computer-readable recording medium |
CN107861686B (en) * | 2017-09-26 | 2021-01-05 | 深圳前海微众银行股份有限公司 | File storage method, server and computer readable storage medium |
CN107861686A (en) * | 2017-09-26 | 2018-03-30 | 深圳前海微众银行股份有限公司 | File memory method, service end and computer-readable recording medium |
CN108037895A (en) * | 2017-12-06 | 2018-05-15 | Tcl移动通信科技(宁波)有限公司 | A kind of mobile terminal and data information memory control method and storage medium |
CN108037895B (en) * | 2017-12-06 | 2021-06-22 | Tcl移动通信科技(宁波)有限公司 | Mobile terminal, data information storage control method and storage medium |
CN108052291A (en) * | 2017-12-14 | 2018-05-18 | 郑州云海信息技术有限公司 | A kind of storage method of Cloud Server, system, device and readable storage medium storing program for executing |
CN109240985A (en) * | 2018-08-13 | 2019-01-18 | 上海擎感智能科技有限公司 | More storage dish duplicate file processing methods, system, storage medium and vehicle device |
CN112527740A (en) * | 2019-09-17 | 2021-03-19 | 北京国双科技有限公司 | File resource processing method and device, storage medium and electronic equipment |
CN110991361A (en) * | 2019-12-06 | 2020-04-10 | 衢州学院 | Multi-channel multi-modal background modeling method for high-definition high-speed video |
CN110991361B (en) * | 2019-12-06 | 2021-01-15 | 衢州学院 | Multi-channel multi-modal background modeling method for high-definition high-speed video |
CN111913916A (en) * | 2020-07-07 | 2020-11-10 | 泰康保险集团股份有限公司 | File recombination method and equipment |
CN112131194A (en) * | 2020-09-24 | 2020-12-25 | 上海摩勤智能技术有限公司 | File storage control method and device of read-only file system and storage medium |
CN112612413A (en) * | 2020-12-04 | 2021-04-06 | 海光信息技术股份有限公司 | Version management file caching method, device and system and related equipment |
CN113254397A (en) * | 2021-06-15 | 2021-08-13 | 成都统信软件技术有限公司 | Data checking method and computing device |
CN113836093A (en) * | 2021-09-29 | 2021-12-24 | 深圳万兴软件有限公司 | Network disk file cleaning method, system, computer equipment and storage medium thereof |
CN113836093B (en) * | 2021-09-29 | 2024-04-12 | 深圳万兴软件有限公司 | Network disk file cleaning method, system, computer equipment and storage medium thereof |
CN117648288A (en) * | 2022-09-05 | 2024-03-05 | 华为技术有限公司 | File processing method and electronic equipment |
CN117725028A (en) * | 2023-06-26 | 2024-03-19 | 荣耀终端有限公司 | File processing method, terminal equipment and computer readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20160406 |