CN106649676A - Duplication eliminating method and device based on HDFS storage file - Google Patents


Info

Publication number
CN106649676A
Authority
CN
China
Prior art keywords
file
identification
storage
memory node
deduplicated
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201611159251.XA
Other languages
Chinese (zh)
Other versions
CN106649676B (en)
Inventor
张为锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Ruian Technology Co Ltd
Original Assignee
Beijing Ruian Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Ruian Technology Co Ltd filed Critical Beijing Ruian Technology Co Ltd
Priority to CN201611159251.XA priority Critical patent/CN106649676B/en
Publication of CN106649676A publication Critical patent/CN106649676A/en
Application granted granted Critical
Publication of CN106649676B publication Critical patent/CN106649676B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10: File systems; File servers
    • G06F16/17: Details of further file system functions
    • G06F16/174: Redundancy elimination performed by the file system
    • G06F16/1748: De-duplication implemented within the file system, e.g. based on file segments

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An embodiment of the invention discloses a deduplication method and device for files stored in HDFS. The method comprises: comparing the file fingerprint of a file to be deduplicated with the file fingerprint of a stored file; if the fingerprints are identical, calculating a link identifier from the file identifier of the file to be deduplicated; and replacing the file content of the file to be deduplicated with the link identifier and the storage address, in the storage node, of the identical stored file, the result being stored in the storage node as the key value of the file identifier of the file to be deduplicated. The technical scheme effectively removes files with duplicate content, reduces the number of files, saves storage space, and improves system performance.

Description

A deduplication method and device for files stored in HDFS
Technical field
Embodiments of the present invention relate to unstructured data storage technology, and in particular to a deduplication method and device for files stored in HDFS.
Background
The Hadoop Distributed File System (HDFS) is a system that provides reliable storage for very large data sets. It is designed for tasks that write once and read many times, and it provides high-bandwidth input/output data streams to user applications. HDFS is highly fault tolerant and can run on clusters of inexpensive hardware. It uses a master/slave architecture: an HDFS cluster consists of one NameNode (management node) and multiple DataNodes (storage nodes). The management node is a central server responsible for managing the file system metadata and client access to files. Because the management node stores the metadata of files, its memory size limits the number of files. By default HDFS splits a file into blocks (storage blocks), for example one storage block per 64 MB. Each storage block is then stored in an HDFS storage node in the form of a key-value pair, and the mapping of the key-value pairs is held in memory. Each file, storage block, and index list is held in memory as an object, and each object occupies 150 bytes. For example, with 1,000,000 small files that each occupy one storage block, the management node needs at least 300 MB of memory; storing 100,000,000 or more files requires 20 GB of memory or more. One solution is to build an in-memory database that supports clustering, but this increases system cost. If there are too many small files, they occupy excessive memory and degrade system performance, so small files need to be merged to reduce the number of files.
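The memory figures above can be checked with a short calculation. The following is a minimal sketch, assuming that each file contributes one file object plus one object per storage block; the function name `namenode_memory_bytes` is hypothetical and only illustrates the arithmetic.

```python
# Figure taken from the text: each in-memory object occupies 150 bytes.
BYTES_PER_OBJECT = 150


def namenode_memory_bytes(num_files: int, blocks_per_file: int = 1) -> int:
    """Estimate management-node memory, assuming one object per file
    plus one object per storage block (an assumption of this sketch)."""
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT


if __name__ == "__main__":
    # 1,000,000 single-block files: 2,000,000 objects * 150 bytes
    print(namenode_memory_bytes(1_000_000))  # 300000000 bytes, i.e. 300 MB
```

Under the same assumption, 100,000,000 single-block files come to 30,000,000,000 bytes, which is in line with the "20 GB or more" stated above.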
However, real Internet applications involve massive numbers of small files. In particular, the rise of blogs, microblogs, and social networking sites such as Facebook has changed how content is stored on the Internet. Users have in effect become the creators of Internet content. Their data is massive, diverse, and constantly changing, and it produces massive numbers of small files such as status files, user profiles, and avatars. By storage format, such data can be divided into structured data and unstructured data. Structured data has a consistent hierarchy and network structure and can be described with numbers or text. Other information cannot be represented numerically or in a uniform way, for example scanned images, faxes, photos, computer-generated reports, word-processing documents, spreadsheets, presentations, audio, and video; these are unstructured data. After structured information has been extracted from unstructured data, the original file still needs to be preserved for later use.
In many fields, the proportion of unstructured data is far higher than that of structured data. Unstructured data carries a very large amount of information; storing it directly in a database greatly increases the size of the database and also reduces the efficiency of maintenance and use. Unstructured data obtained from the Internet in particular is often repetitive: a trending event attracts the attention of large numbers of users within a short time, so a small amount of unstructured data is reused many times over a short period and takes up system storage space. In the prior art, compression is used to compress data at a certain ratio, but unstructured data lacks a strict structure, is harder to standardize than structured information, and is harder to manage. Given these characteristics, HDFS currently uses MapFile technology to merge massive numbers of unstructured small files into large files without compressing them, which occupies considerable storage space. How to remove duplicate content from massive unstructured data and save storage space is therefore an urgent problem.
Summary of the invention
Embodiments of the present invention provide a deduplication method and device for files stored in HDFS, so that HDFS can deduplicate effectively and save storage space when storing massive numbers of unstructured small files.
In a first aspect, an embodiment of the present invention provides a deduplication method for files stored in HDFS, including:
comparing the file fingerprint of a file to be deduplicated with the file fingerprint of a stored file;
if the comparison result is identical, calculating a link identifier from the file identifier of the file to be deduplicated;
replacing the file content of the file to be deduplicated with the link identifier and the storage address, in the storage node, of the identical stored file, and storing the result in the storage node as the key value of the file identifier of the file to be deduplicated.
Preferably, before comparing the file fingerprint of the file to be deduplicated with the file fingerprint of a stored file, the method further includes:
storing the received files into a set region of the storage node, and marking the region as a not-yet-deduplicated region;
obtaining files one by one from the not-yet-deduplicated region as files to be deduplicated.
Preferably, storing the received files into the set region of the storage node includes:
generating a primary key for the received file, as the file identifier;
converting the file content of the file into binary data, and storing it, in correspondence with the file identifier, into the set region of the storage node.
Preferably, storing the received files into the set region of the storage node includes:
storing the received files into different set regions of the storage node according to the date on which each file was received.
Preferably, calculating the link identifier from the file identifier of the file to be deduplicated includes:
calculating a 32-character MD5 value from the file identifier of the file to be deduplicated, as the link identifier.
Preferably, after replacing the file content of the file to be deduplicated with the link identifier and the storage address, in the storage node, of the identical stored file, and storing the result in the storage node as the key value of the file identifier of the file to be deduplicated, the method further includes:
rewriting the index file of the storage node according to each file identifier in the storage node and the storage location of its key value.
Preferably, the method further includes:
obtaining the file identifier of the file to be read according to a received file read request;
calculating the corresponding link identifier from the file identifier;
reading, from the storage node according to the file identifier, the data at the set position of the corresponding key value;
if the link identifier matches the data at the set position, reading the storage address from the key value;
locating and reading the corresponding file in the storage node according to the storage address, and then responding to the file read request.
In a second aspect, an embodiment of the present invention further provides a deduplication device for files stored in HDFS, including:
a fingerprint comparison module, configured to compare the file fingerprint of a file to be deduplicated with the file fingerprint of a stored file;
a link identifier calculation module, configured to calculate, if the comparison result is identical, a link identifier from the file identifier of the file to be deduplicated;
a content replacement module, configured to replace the file content of the file to be deduplicated with the link identifier and the storage address, in the storage node, of the identical stored file, and to store the result in the storage node as the key value of the file identifier of the file to be deduplicated.
Preferably, the device further includes:
a file storage module, configured to store the received files into a set region of the storage node and mark the region as a not-yet-deduplicated region, before the file fingerprint of the file to be deduplicated is compared with the file fingerprint of a stored file;
a file acquisition module, configured to obtain files one by one from the not-yet-deduplicated region as files to be deduplicated.
Preferably, the file storage module includes:
a primary key generation unit, configured to generate a primary key for the received file, as the file identifier;
a content conversion unit, configured to convert the file content of the file into binary data and store it, in correspondence with the file identifier, into the set region of the storage node.
Preferably, the file storage module is specifically configured to:
store the received files into different set regions of the storage node according to the date on which each file was received.
Preferably, the link identifier calculation module is specifically configured to:
calculate a 32-character MD5 value from the file identifier of the file to be deduplicated, as the link identifier.
Preferably, the device further includes:
an index rewriting module, configured to rewrite the index file of the storage node according to each file identifier in the storage node and the storage location of its key value, after the file content of the file to be deduplicated has been replaced with the link identifier and the storage address, in the storage node, of the identical stored file and the result has been stored in the storage node as the key value of the file identifier of the file to be deduplicated.
Preferably, the device further includes:
a file identifier reading module, configured to obtain the file identifier of the file to be read according to a received file read request;
a corresponding identifier calculation module, configured to calculate the corresponding link identifier from the file identifier;
a position data reading module, configured to read, from the storage node according to the file identifier, the data at the set position of the corresponding key value;
a matching module, configured to read the storage address from the key value if the link identifier matches the data at the set position;
a file lookup module, configured to locate and read the corresponding file in the storage node according to the storage address, and then respond to the file read request.
Embodiments of the present invention target massive numbers of unstructured files with identical content in HDFS. Only one copy of files with identical content is kept; the content of any file whose fingerprint matches that of a stored file is deleted and replaced with a link identifier and a link address. This effectively removes files with duplicate content, reduces the number of files, saves a large amount of storage space, frees memory resources, and improves system performance, while still meeting the requirements of fast storage and correct reading.
Brief description of the drawings
Fig. 1A is a flow chart of a deduplication method for files stored in HDFS in Embodiment 1 of the present invention;
Fig. 1B is a schematic diagram of a deduplication method for files stored in HDFS in Embodiment 1 of the present invention;
Fig. 2 is a flow chart of a deduplication method for files stored in HDFS in Embodiment 2 of the present invention;
Fig. 3 is a flow chart of a deduplication method for files stored in HDFS in Embodiment 3 of the present invention;
Fig. 4A is a structural diagram of a deduplication device for files stored in HDFS in Embodiment 4 of the present invention;
Fig. 4B is a structural diagram of a deduplication device for files stored in HDFS in Embodiment 4 of the present invention.
Detailed description
The present invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and do not limit it. It should also be noted that, for ease of description, the drawings show only the parts related to the invention rather than the entire structure.
Embodiment 1
Fig. 1A is a flow chart of a deduplication method for files stored in HDFS provided by Embodiment 1 of the present invention. This embodiment is applicable to the Hadoop Distributed File System, which typically includes a management node and multiple storage nodes. The method can be performed by a deduplication device for files stored in HDFS; the device can be implemented in software and/or hardware and is typically integrated in the management node of the Hadoop Distributed File System.
The method of Embodiment 1 specifically includes:
S101: Compare the file fingerprint of the file to be deduplicated with the file fingerprint of a stored file.
The file to be deduplicated is a received file. The file can first be stored in a storage node and then deduplicated offline as described in this embodiment, or it can be deduplicated online when it is received. Because online deduplication consumes more resources, runs slowly, and has a long response time, offline deduplication is preferred: files that have not yet been deduplicated are extracted from the storage node as files to be deduplicated.
Specifically, a file fingerprint is calculated from the content of each file. However the file name changes, as long as the file content is unchanged, the calculated fingerprint is the same. If the file to be deduplicated has the same content as a stored file, their calculated fingerprints are identical. The fingerprint can be computed as the MD5 value (Message-Digest Algorithm 5), the SHA1 value (Secure Hash Algorithm 1), or the CRC32 value (Cyclic Redundancy Check) of the file. Of these, MD5 values are highly dispersed: a small change in the original content causes a large change in the MD5 value, so reliability is high. In this embodiment, the MD5 value is preferably calculated over the first 1 KB and the last 1 KB of the binary data of the file, and the result is used as the file fingerprint.
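The preferred fingerprint described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `file_fingerprint` is hypothetical, the file is assumed to be available as an in-memory byte string, and the handling of files shorter than 1 KB (head and tail overlap) is an assumption of this sketch.

```python
import hashlib

CHUNK = 1024  # 1 KB, the size given in the text


def file_fingerprint(data: bytes) -> str:
    """MD5 over the first 1 KB and the last 1 KB of the file content.

    For content shorter than 1 KB the two slices overlap; how the patent
    treats that case is not stated, so this is an assumption.
    """
    head = data[:CHUNK]
    tail = data[-CHUNK:]
    return hashlib.md5(head + tail).hexdigest()


if __name__ == "__main__":
    original = b"report body " * 500
    renamed_copy = bytes(original)  # same content under a different name
    print(file_fingerprint(original) == file_fingerprint(renamed_copy))
```

One consequence of sampling only the two ends is that, in this sketch, files that differ only in the middle receive the same fingerprint, which trades accuracy for speed.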
In offline mode, the file fingerprint of the file to be deduplicated is compared with the file fingerprints of stored files at regular intervals. After 00:00 each day, the MapReduce computing component of the Hadoop Distributed File System is used to compare, offline, the file fingerprint of the file to be deduplicated with the file fingerprints of stored files, to filter out the files to be deduplicated whose content is identical to that of a stored file, and to obtain the corresponding stored file and its storage address in the data storage node.
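The comparison step can be illustrated without MapReduce. The sketch below assumes the fingerprints of stored files are already available as a dictionary from fingerprint to storage address; the names `find_duplicates`, `stored`, and `pending` are hypothetical, and a plain loop stands in for the MapReduce job mentioned in the text.

```python
def find_duplicates(stored: dict, pending: list) -> dict:
    """Return {file_id: storage_address} for pending files whose
    fingerprint matches an already stored file.

    stored:  fingerprint -> storage address of the stored file
    pending: list of (file_id, fingerprint) awaiting deduplication
    """
    matches = {}
    for file_id, fingerprint in pending:
        if fingerprint in stored:
            matches[file_id] = stored[fingerprint]
    return matches


if __name__ == "__main__":
    stored = {"fp-aaa": "node1/20161201/Key0"}
    pending = [("Key1", "fp-bbb"), ("Key2", "fp-aaa"), ("Key3", "fp-ccc")]
    print(find_duplicates(stored, pending))
```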
S102: If the comparison result is identical, calculate a link identifier from the file identifier of the file to be deduplicated.
Specifically, when a file is written to the Hadoop Distributed File System, it is stored in a mapping file (MapFile) as a Key-Value pair. The primary key Key is the file identifier, a character string assigned to the file when it is stored that uniquely identifies it. The key value Value is the binary value corresponding to Key, that is, the entire binary data of the file content. If the fingerprint comparison result is identical, a link identifier is calculated from the file identifier Key of the file to be deduplicated; the link identifier serves as a special marker for files that have been deduplicated. At file read time, if what is read from the key value of a file is a link identifier rather than actual binary data, the file has been deduplicated. If the fingerprint comparison result is not identical, the content of the file differs from that of the stored files, so its content is retained and no deduplication is performed.
Preferably, step S102 includes:
calculating a 32-character MD5 value from the file identifier of the file to be deduplicated, as the link identifier.
In this embodiment, a 32-character MD5 value is calculated from the file identifier Key of the file to be deduplicated and used as the link identifier. Similar to an encryption process, a deduplicated file is marked in encrypted form by calculating the 32-character MD5 value, and the mark is decoded when responding to a file read request. In addition, at file read time the link identifier can be recalculated from the file identifier to determine whether the file has been deduplicated.
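A link identifier of this kind can be computed in one line. The sketch below assumes the file identifier is a text string encoded as UTF-8 before hashing, and reads the "32" in the text as the 32 hexadecimal characters of an MD5 digest; both points are assumptions, and `link_identifier` is a hypothetical name.

```python
import hashlib


def link_identifier(file_id: str) -> str:
    """32-character hexadecimal MD5 of the file identifier (Key)."""
    return hashlib.md5(file_id.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    marker = link_identifier("Key2")
    print(len(marker))  # 32
```

Because the identifier depends only on Key, the reader can recompute it later without any extra lookup table.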
S103: Replace the file content of the file to be deduplicated with the link identifier and the storage address, in the storage node, of the identical stored file, and store the result in the storage node as the key value of the file identifier of the file to be deduplicated.
In this embodiment, for a file that has been deduplicated, the key value no longer stores the binary data of the file content. It is replaced with the link identifier and a storage address, and the content stored at that address is identical to the content of the deduplicated file. As shown in Fig. 1B, assume that the file content corresponding to Key2 is identical to the content of a stored file. The binary data corresponding to the Key2 file content is read, and the link identifier and the storage address, that is, the 32-character MD5 value and the actual storage address of the identical file, are written in its place, which completes the replacement of the content of the file to be deduplicated. The file contents corresponding to Key1 and Key3 differ from the content of the stored files, so the binary data corresponding to the Key1 and Key3 file contents is retained.
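The Key1/Key2/Key3 example can be sketched with a dictionary standing in for the mapping file. The layout of the replaced value (link identifier followed directly by the address) and the address string are assumptions of this sketch; `replace_with_link` is a hypothetical name.

```python
import hashlib


def link_identifier(file_id: str) -> str:
    """32-character hexadecimal MD5 of the file identifier."""
    return hashlib.md5(file_id.encode("utf-8")).hexdigest()


def replace_with_link(store: dict, file_id: str, stored_address: str) -> None:
    """Overwrite the value of file_id with link identifier + storage address."""
    link = link_identifier(file_id).encode("ascii")
    store[file_id] = link + stored_address.encode("utf-8")


# Dictionary standing in for a mapping file; the address is a made-up example.
store = {"Key1": b"content A", "Key2": b"content B", "Key3": b"content C"}
replace_with_link(store, "Key2", "node1/20161201/Key0")
```

After the call, only the value under Key2 has changed; Key1 and Key3 keep their original binary data.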
Preferably, step S103 further includes:
rewriting the index file of the storage node according to each file identifier in the storage node and the storage location of its key value.
Specifically, files can be located quickly through the index file. Once the key value data corresponding to file identifiers in the storage node has been replaced, the original index file no longer reflects the new mapping correctly, so the index file of the storage node must be rewritten according to each file identifier and the storage location of its key value after the replacement.
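Why the index must be rewritten can be seen in a small model. The sketch below assumes values are laid out back to back and that the index maps each key to an (offset, length) pair; this layout and the name `rebuild_index` are assumptions for illustration, not the actual MapFile index format.

```python
def rebuild_index(records: list) -> dict:
    """Recompute {key: (offset, length)} for values stored back to back."""
    index = {}
    offset = 0
    for key, value in records:
        index[key] = (offset, len(value))
        offset += len(value)
    return index


if __name__ == "__main__":
    before = [("Key1", b"a" * 100), ("Key2", b"b" * 100), ("Key3", b"c" * 100)]
    # Key2 is replaced by a shorter link record, so Key3 moves forward.
    after = [("Key1", b"a" * 100), ("Key2", b"l" * 51), ("Key3", b"c" * 100)]
    print(rebuild_index(before)["Key3"], rebuild_index(after)["Key3"])
```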
The deduplication method for files stored in HDFS provided by Embodiment 1 compares file fingerprints and deduplicates data offline. The processing time can be extended appropriately, which increases system reliability, saves memory resources, lowers the hardware requirements, and thus saves substantial equipment cost. The method also effectively removes files with duplicate content, reduces the number of files, and saves storage space.
Embodiment 2
Fig. 2 is a flow chart of a deduplication method for files stored in HDFS provided by Embodiment 2 of the present invention. Embodiment 2 optimizes and improves on Embodiment 1 and further describes how the offline deduplication operation is performed. As shown in Fig. 2, Embodiment 2 specifically includes:
S201: Store the received files into a set region of the storage node, and mark the region as a not-yet-deduplicated region.
In this embodiment, the Hadoop Distributed File System contains multiple mapping files, which are used to archive massive numbers of unstructured small files and to generate the corresponding mapping relations for the archived files. The system continuously receives files and caches them. When the space occupied by the cache reaches a capacity threshold, or the receiving time reaches a preset time limit, the system writes the unstructured files, in the order received, into the mapping files of the storage nodes and marks them as a not-yet-deduplicated region. The capacity threshold can be set between 128 MB and 2 GB, and the preset time limit can be set between 5 and 20 minutes. Writing can use multiple concurrent threads to ensure write speed.
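The flush condition described above amounts to a two-part test. The sketch below uses the lower bounds of the ranges in the text (128 MB and 5 minutes) as default values; the defaults and the name `should_flush` are assumptions, and the multi-threaded write itself is not shown.

```python
MB = 1024 * 1024


def should_flush(cached_bytes: int, elapsed_seconds: float,
                 capacity_threshold: int = 128 * MB,
                 time_limit_seconds: float = 5 * 60) -> bool:
    """True when the cache is full enough or has been held long enough."""
    return (cached_bytes >= capacity_threshold
            or elapsed_seconds >= time_limit_seconds)


if __name__ == "__main__":
    print(should_flush(200 * MB, 10))   # capacity threshold reached
    print(should_flush(1 * MB, 10))     # neither condition met
```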
Preferably, step S201 includes:
generating a primary key for the received file, as the file identifier;
converting the file content of the file into binary data, and storing it, in correspondence with the file identifier, into the set region of the storage node.
Specifically, the system generates a primary key Key for each received file as its file identifier, and storage is indexed by the primary key Key. The file content is converted into binary data and used as the key value Value corresponding to the primary key Key, and the primary key Key and its key value Value are stored as a key-value pair in the mapping file of the storage node.
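The storage step can be sketched with a dictionary in place of the mapping file. The patent does not say how the primary key is generated, so the use of a random UUID here is purely an assumption, as are the names `generate_key` and `store_file`.

```python
import uuid


def generate_key() -> str:
    """Hypothetical primary-key generator; a random UUID is an assumption."""
    return uuid.uuid4().hex


def store_file(region: dict, content) -> str:
    """Store content as binary data under a fresh primary key and return the key."""
    if isinstance(content, str):
        content = content.encode("utf-8")  # convert text to binary data
    key = generate_key()
    region[key] = bytes(content)
    return key


if __name__ == "__main__":
    region = {}
    key = store_file(region, "status update")
    print(key in region, region[key])
```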
Preferably, step S201 also includes:
storing the received files into different set regions of the storage node according to the date on which each file was received.
Specifically, when files are stored, received files are written into different mapping files of the storage node according to their received date. A new directory is created in the Hadoop Distributed File System for the files written each day, so storage is partitioned by day.
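Partitioning by day can be as simple as deriving a directory name from the received date. The base path `/archive` and the `YYYYMMDD` naming below are assumptions chosen for illustration; `partition_dir` is a hypothetical name.

```python
from datetime import date


def partition_dir(base: str, received: date) -> str:
    """Directory for files received on the given day, e.g. <base>/20161215."""
    return f"{base}/{received:%Y%m%d}"


if __name__ == "__main__":
    print(partition_dir("/archive", date(2016, 12, 15)))  # /archive/20161215
```

With one directory per day, the offline job can skip the directory for the current day and process only earlier ones.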
S202: Obtain files one by one from the not-yet-deduplicated region as files to be deduplicated.
S203: Compare the file fingerprint of the file to be deduplicated with the file fingerprint of a stored file.
S204: If the comparison result is identical, calculate a link identifier from the file identifier of the file to be deduplicated.
S205: Replace the file content of the file to be deduplicated with the link identifier and the storage address, in the storage node, of the identical stored file, and store the result in the storage node as the key value of the file identifier of the file to be deduplicated.
The deduplication method for files stored in HDFS provided by Embodiment 2 stores files in partitions according to their received date, which facilitates offline processing. Files stored on the current day are not deduplicated for the time being, which preserves storage efficiency, meets the need to store data quickly, and improves the timeliness of data storage.
Embodiment 3
Fig. 3 is a flow chart of a deduplication method for files stored in HDFS provided by Embodiment 3 of the present invention. Embodiment 3 optimizes and improves on Embodiment 2 and further describes the process of obtaining file content after deduplication. As shown in Fig. 3, Embodiment 3 specifically includes:
S301: Obtain the file identifier of the file to be read according to the received file read request.
S302: Calculate the corresponding link identifier from the file identifier.
S303: Read, from the storage node according to the file identifier, the data at the set position of the corresponding key value.
S304: If the link identifier matches the data at the set position, read the storage address from the key value.
S305: Locate and read the corresponding file in the storage node according to the storage address, and then respond to the file read request.
In this embodiment, the process of obtaining file content hides the internal processing from the user of the file. According to the received file read request, the system obtains the file identifier, that is, the primary key Key, of the file to be read and calculates from it the corresponding link identifier, which can be obtained by calculating an MD5 value. The first 32 characters of the corresponding key value are read from the storage node according to the primary key Key and compared with the MD5 value of the link identifier. If they are identical, the file to be read has been deduplicated, and what is stored as its content is the storage address, in the storage node, of a stored file with identical content, rather than the actual content of the file itself. The first 32 characters are removed from the file content, the storage address is read from the key value, the corresponding file is located and read in the storage node according to that address, and the file read request is answered. If the MD5 value of the link identifier and the first 32 characters read are not identical, the file to be read has not been deduplicated; the file content is read from the key value and the file read request is answered.
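Steps S301 to S305 can be sketched as a single read function. In this illustration a dictionary `store` stands in for the mapping file, a second dictionary `by_address` stands in for locating a file by its storage address, and the value layout (32-character link identifier followed by the address) is the same assumption used for the write side; all names are hypothetical.

```python
import hashlib

LINK_LEN = 32  # length of the hexadecimal MD5 link identifier


def link_identifier(file_id: str) -> str:
    return hashlib.md5(file_id.encode("utf-8")).hexdigest()


def read_file(store: dict, by_address: dict, file_id: str) -> bytes:
    """Return the file content, following the link if the file was deduplicated."""
    value = store[file_id]                              # S303: fetch key value
    expected = link_identifier(file_id).encode("ascii")  # S302
    if value[:LINK_LEN] == expected:                    # S304: prefix matches
        address = value[LINK_LEN:].decode("utf-8")      # strip first 32 chars
        return by_address[address]                      # S305: follow address
    return value                                        # not deduplicated


if __name__ == "__main__":
    by_address = {"node1/20161201/Key0": b"shared content"}
    store = {
        "Key1": b"content A",
        "Key2": link_identifier("Key2").encode("ascii") + b"node1/20161201/Key0",
    }
    print(read_file(store, by_address, "Key1"))
    print(read_file(store, by_address, "Key2"))
```

The caller sees file content in both cases and cannot tell from the result whether the file was deduplicated.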
The deduplication method for files stored in HDFS provided by Embodiment 3 stores only the corresponding storage address for duplicate unstructured files and hides the internal processing from the requester when a file is read. It meets the requirement of correct reading, saves storage space, and improves system performance.
Embodiment 4
Fig. 4A is a structural diagram of a deduplication device for files stored in HDFS in Embodiment 4 of the present invention. The device is applied to the Hadoop Distributed File System. As shown in Fig. 4A, the device includes:
a fingerprint comparison module 401, configured to compare the file fingerprint of a file to be deduplicated with the file fingerprint of a stored file;
a link identifier calculation module 402, configured to calculate, if the comparison result is identical, a link identifier from the file identifier of the file to be deduplicated;
a content replacement module 403, configured to replace the file content of the file to be deduplicated with the link identifier and the storage address, in the storage node, of the identical stored file, and to store the result in the storage node as the key value of the file identifier of the file to be deduplicated.
Preferably, the link identifier calculation module is specifically configured to:
calculate a 32-character MD5 value from the file identifier of the file to be deduplicated, as the link identifier.
Preferably, the device further includes:
an index rewriting module 404, configured to rewrite the index file of the storage node according to each file identifier in the storage node and the storage location of its key value, after the file content of the file to be deduplicated has been replaced with the link identifier and the storage address, in the storage node, of the identical stored file and the result has been stored in the storage node as the key value of the file identifier of the file to be deduplicated.
Specifically, in offline mode, the fingerprint comparison module compares the file fingerprint of the file to be deduplicated with the file fingerprints of stored files, filters out the files to be deduplicated whose content is identical to that of a stored file, and obtains the corresponding stored file and its storage address in the data storage node. If the fingerprint comparison result is identical, the link identifier calculation module calculates a 32-character MD5 value from the file identifier Key of the file to be deduplicated and uses it as the link identifier, which marks files that have been deduplicated. The content replacement module replaces the file content of the file to be deduplicated with the link identifier and the storage address, in the storage node, of the identical stored file, and stores the result in the storage node as the key value of the file identifier of the file to be deduplicated. The index rewriting module rewrites the index file of the storage node according to each file identifier in the storage node and the storage location of its key value.
Preferably, as shown in Figure 4A, the device further includes:
File storage module 405, configured to store received files in a designated region of the storage node and mark that region as a non-deduplicated region, before the file fingerprint of the file to be deduplicated is compared with the file fingerprint of a stored file;
File acquisition module 406, configured to obtain files one by one from the non-deduplicated region as files to be deduplicated.
Preferably, the file storage module includes:
Primary key generation unit, configured to generate a primary key for each received file, as its file identifier;
Content conversion unit, configured to convert the file content of the file into binary data and store it, in correspondence with the file identifier, in the designated region of the storage node.
Preferably, the file storage module is specifically configured to:
store received files in different designated regions of the storage node according to the date on which each file was received.
Specifically, the file access module continuously receives files and caches them. When the space occupied by the cache reaches a capacity threshold, or the receiving time reaches a preset time limit, the system writes the received unstructured files, in the order received and according to their date of receipt, into the mapped file of each storage node using concurrent threads. The capacity threshold may be set between 128 MB and 2 GB, and the preset time limit between 5 and 20 minutes. The primary key generation unit generates a primary key Key for each received file as its file identifier; the content conversion unit converts the file content into binary data, which serves as the value Value corresponding to the primary key Key; and the primary key Key and its value Value are stored as a key-value pair in the mapped file of the storage node. The file acquisition module then obtains files one by one from the non-deduplicated region as files to be deduplicated.
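The caching and flushing behavior described above can be sketched as below. This is a single-threaded illustration under assumptions: the class name `IngestBuffer`, the use of a random UUID as the primary key, and the dictionary of per-date regions are all hypothetical stand-ins, and the concurrent multi-threaded write to mapped files is left out.

```python
import time
import uuid
from collections import defaultdict


class IngestBuffer:
    """Cache received files and flush them on a size or time trigger."""

    def __init__(self, capacity_bytes=128 * 1024 * 1024,
                 time_limit_s=5 * 60, clock=time.monotonic):
        self.capacity_bytes = capacity_bytes  # text suggests 128 MB to 2 GB
        self.time_limit_s = time_limit_s      # text suggests 5 to 20 minutes
        self.clock = clock
        self.pending = []                     # (date, key, content), in order received
        self.size = 0
        self.started = None                   # time the first cached file arrived
        self.regions = defaultdict(dict)      # received date -> {key: value}

    def receive(self, content: bytes, received_date: str) -> str:
        key = uuid.uuid4().hex                # assumption: random UUID as primary key
        if self.started is None:
            self.started = self.clock()
        self.pending.append((received_date, key, content))
        self.size += len(content)
        too_big = self.size >= self.capacity_bytes
        too_old = self.clock() - self.started >= self.time_limit_s
        if too_big or too_old:
            self.flush()
        return key

    def flush(self) -> None:
        # Write key-value pairs into the region for their received date,
        # preserving the order in which the files arrived.
        for received_date, key, content in self.pending:
            self.regions[received_date][key] = content
        self.pending.clear()
        self.size = 0
        self.started = None
```

In this sketch the time limit is only checked when a new file arrives; a real system would more likely use a timer so that a quiet period still triggers a flush.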
Preferably, as shown in Figure 4B, the device further includes:
File identifier reading module 407, configured to obtain the file identifier of the file to be read according to a received file read request;
Corresponding identifier calculation module 408, configured to calculate the corresponding link identifier from the file identifier;
Set-position data reading module 409, configured to read the set-position data of the corresponding value from the storage node according to the file identifier;
Matching module 410, configured to read the storage address from the value if the link identifier matches the set-position data;
File lookup module 411, configured to locate the corresponding file in the storage node according to the storage address, read it, and respond to the file read request.
Specifically, the file identifier reading module obtains the primary key Key of the file to be read according to the received file read request, and the corresponding identifier calculation module computes the MD5 value of that Key. The set-position data reading module reads the first 32 characters of the corresponding value from the storage node according to the Key, and the matching module compares the MD5 value serving as the link identifier with the 32 characters read. If they are identical, the file to be read has been deduplicated: its stored content is the storage address, in the storage node, of the stored file with identical content, not the actual content of the file itself. The first 32 characters are removed from the content, the file lookup module reads the storage address from the value, locates the corresponding file in the storage node according to that address, reads it, and responds to the file read request. If they are not identical, the file to be read has not been deduplicated; the file content is read from the value and the file read request is answered.
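The read path described above can be sketched as follows. As before, this is only an illustration: the storage node is a dictionary, the storage address is assumed to be the key of the stored copy, and the names `link_id` and `read_file` are hypothetical.

```python
import hashlib
from typing import Dict

LINK_ID_LEN = 32  # length of a hexadecimal MD5 digest


def link_id(key: str) -> str:
    """32-character hexadecimal MD5 digest of the file identifier."""
    return hashlib.md5(key.encode("utf-8")).hexdigest()


def read_file(store: Dict[str, bytes], key: str) -> bytes:
    """Return the real content for `key`, following a link if one is stored."""
    value = store[key]
    prefix = value[:LINK_ID_LEN]
    if prefix == link_id(key).encode("ascii"):
        # Deduplicated entry: strip the link identifier, the rest is the address.
        address = value[LINK_ID_LEN:].decode("utf-8")
        return store[address]  # assumption: the address is the stored copy's key
    # Not deduplicated: the value is the file content itself.
    return value
```

Because the link identifier is derived from the file's own key, the reader can tell a link from real content without a separate flag. The check could in principle misfire on a non-deduplicated file whose first 32 bytes equal the MD5 of its own key, which is extremely unlikely in practice.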
The HDFS-based deduplication device for stored files provided by Embodiment 4 of the present invention can effectively remove files with duplicate content, reducing the number of files, saving a large amount of storage space, freeing memory resources, and improving system performance, while still meeting the requirements of fast storage and correct reading.
The device provided by the embodiments of the present invention can execute the method provided by any embodiment of the present invention, and has the functional modules and beneficial effects corresponding to that method.
Note that the above are only preferred embodiments of the present invention and the technical principles applied. Those skilled in the art will understand that the present invention is not limited to the specific embodiments described here, and that various obvious changes, readjustments, and substitutions can be made without departing from the scope of protection of the present invention. Therefore, although the present invention has been described in some detail through the above embodiments, it is not limited to them; it may include other equivalent embodiments without departing from the inventive concept, and its scope is determined by the appended claims.

Claims (14)

1. An HDFS-based deduplication method for stored files, characterized by comprising:
comparing the file fingerprint of a file to be deduplicated with the file fingerprint of a stored file;
if the comparison result is identical, calculating a link identifier from the file identifier of the file to be deduplicated;
replacing the file content of the file to be deduplicated with the link identifier and the storage address of the identical stored file in a storage node, and storing the result in the storage node as the value for the file identifier of the file to be deduplicated.
2. The method according to claim 1, characterized in that, before comparing the file fingerprint of the file to be deduplicated with the file fingerprint of the stored file, the method further comprises:
storing received files in a designated region of the storage node, and marking that region as a non-deduplicated region;
obtaining files one by one from the non-deduplicated region as files to be deduplicated.
3. The method according to claim 2, characterized in that storing received files in a designated region of the storage node comprises:
generating a primary key for each received file, as its file identifier;
converting the file content of the file into binary data and storing it, in correspondence with the file identifier, in the designated region of the storage node.
4. The method according to claim 2, characterized in that storing received files in a designated region of the storage node comprises:
storing received files in different designated regions of the storage node according to the date on which each file was received.
5. The method according to claim 1, characterized in that calculating a link identifier from the file identifier of the file to be deduplicated comprises:
computing a 32-character MD5 value from the file identifier of the file to be deduplicated, as the link identifier.
6. The method according to claim 1, characterized in that, after replacing the file content of the file to be deduplicated with the link identifier and the storage address of the identical stored file in the storage node and storing the result in the storage node as the value for the file identifier of the file to be deduplicated, the method further comprises:
rewriting the index file of the storage node according to the storage locations of each file identifier and its corresponding value in the storage node.
7. The method according to any one of claims 1 to 6, characterized by further comprising:
obtaining the file identifier of a file to be read according to a received file read request;
calculating the corresponding link identifier from the file identifier;
reading the set-position data of the corresponding value from the storage node according to the file identifier;
if the link identifier matches the set-position data, reading the storage address from the value;
locating the corresponding file in the storage node according to the storage address, reading it, and responding to the file read request.
8. An HDFS-based deduplication device for stored files, characterized by comprising:
a fingerprint comparison module, configured to compare the file fingerprint of a file to be deduplicated with the file fingerprint of a stored file;
a link identifier calculation module, configured to calculate a link identifier from the file identifier of the file to be deduplicated if the comparison result is identical;
a content replacement module, configured to replace the file content of the file to be deduplicated with the link identifier and the storage address of the identical stored file in a storage node, and to store the result in the storage node as the value for the file identifier of the file to be deduplicated.
9. The device according to claim 8, characterized in that the device further comprises:
a file storage module, configured to store received files in a designated region of the storage node and mark that region as a non-deduplicated region, before the file fingerprint of the file to be deduplicated is compared with the file fingerprint of the stored file;
a file acquisition module, configured to obtain files one by one from the non-deduplicated region as files to be deduplicated.
10. The device according to claim 9, characterized in that the file storage module comprises:
a primary key generation unit, configured to generate a primary key for each received file, as its file identifier;
a content conversion unit, configured to convert the file content of the file into binary data and store it, in correspondence with the file identifier, in the designated region of the storage node.
11. The device according to claim 9, characterized in that the file storage module is specifically configured to:
store received files in different designated regions of the storage node according to the date on which each file was received.
12. The device according to claim 8, characterized in that the link identifier calculation module is specifically configured to:
compute a 32-character MD5 value from the file identifier of the file to be deduplicated, as the link identifier.
13. The device according to claim 8, characterized by further comprising:
an index rewriting module, configured to rewrite the index file of the storage node according to the storage locations of each file identifier and its corresponding value in the storage node, after the file content of the file to be deduplicated has been replaced with the link identifier and the storage address of the identical stored file in the storage node and stored in the storage node as the value for the file identifier of the file to be deduplicated.
14. The device according to any one of claims 8 to 13, characterized in that the device further comprises:
a file identifier reading module, configured to obtain the file identifier of a file to be read according to a received file read request;
a corresponding identifier calculation module, configured to calculate the corresponding link identifier from the file identifier;
a set-position data reading module, configured to read the set-position data of the corresponding value from the storage node according to the file identifier;
a matching module, configured to read the storage address from the value if the link identifier matches the set-position data;
a file lookup module, configured to locate the corresponding file in the storage node according to the storage address, read it, and respond to the file read request.
CN201611159251.XA 2016-12-15 2016-12-15 HDFS (Hadoop distributed File System) -based duplicate removal method and device for stored files Active CN106649676B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611159251.XA CN106649676B (en) 2016-12-15 2016-12-15 HDFS (Hadoop distributed File System) -based duplicate removal method and device for stored files

Publications (2)

Publication Number Publication Date
CN106649676A (en) 2017-05-10
CN106649676B (en) 2020-06-19

Family

ID=58822292

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant