CN106649676A - Duplication eliminating method and device based on HDFS storage file - Google Patents
- Publication number
- CN106649676A CN106649676A CN201611159251.XA CN201611159251A CN106649676A CN 106649676 A CN106649676 A CN 106649676A CN 201611159251 A CN201611159251 A CN 201611159251A CN 106649676 A CN106649676 A CN 106649676A
- Authority
- CN
- China
- Prior art keywords
- file
- identification
- storage
- memory node
- deduplicated
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/17—Details of further file system functions
- G06F16/174—Redundancy elimination performed by the file system
- G06F16/1748—De-duplication implemented within the file system, e.g. based on file segments
Abstract
The embodiment of the invention discloses a deduplication method and device for files stored in HDFS. The method includes: comparing the file fingerprint of a file awaiting deduplication with the file fingerprints of stored files; if the fingerprints are identical, calculating a link identification from the file identification of the file awaiting deduplication; and storing the link identification together with the storage address of the identical stored file in the storage node, in place of the file content, as the key value of the file identification of the file awaiting deduplication. The technical scheme effectively removes files with duplicated content, reduces the number of files, saves storage space, and improves system performance.
Description
Technical field
The embodiments of the present invention relate to unstructured data storage technology, and in particular to a deduplication method and device based on HDFS storage files.
Background technology
The Hadoop Distributed File System (HDFS) provides reliable storage for very large data sets. It is designed around a "write once, read many" workload and provides high-bandwidth input/output data streams to user applications. HDFS is highly fault tolerant and can run on clusters of inexpensive hardware. It uses a master/slave architecture: an HDFS cluster consists of one Namenode (management node) and multiple Datanode nodes (storage nodes). The management node is a central server responsible for the file system metadata and for client access to files. Because the management node holds the metadata of every file, its memory size limits the number of files the cluster can hold. By default HDFS splits files into blocks (storage blocks), e.g. 64 MB per block. Each block is stored on an HDFS storage node in key-value form, and the key-value mapping is kept in memory. Every file, storage block, and index entry is kept in memory as an object, and each object occupies about 150 bytes. For example, with 1,000,000 small files, each occupying one storage block, the management node needs at least 300 MB of memory; storing 100 million or more files would require 20 GB of memory or even more. One solution is to build an in-memory database with cluster support, but this increases system cost. When small files are too numerous, they consume excessive memory resources and degrade system performance, so small files need to be merged to reduce the file count.
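The memory arithmetic above can be sketched as follows, assuming (as the text states) roughly 150 bytes per in-memory object and that each small file contributes a file object plus one block object; the function name and the two-objects-per-file assumption are illustrative, not taken from the patent:

```python
def namenode_memory_bytes(num_files, objects_per_file=2, bytes_per_object=150):
    """Rough NameNode heap estimate: each small file typically contributes
    a file object and at least one block object, ~150 bytes apiece."""
    return num_files * objects_per_file * bytes_per_object

# 1,000,000 small files -> 300,000,000 bytes (~300 MB), matching the text.
print(namenode_memory_bytes(1_000_000))
```

Under the same assumptions, 100 million files would need roughly 30 GB, on the order of the "20 GB or even more" figure the text cites.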
In actual Internet applications, however, there are massive numbers of small files. The rise of blogs, microblogs, Facebook, and other social networking sites has changed how the Internet stores content. Users have essentially become the creators of Internet content, and their data is massive, diverse, and dynamically changing, producing huge numbers of small files such as status posts, user profiles, and avatars. By storage format, this data can be divided into structured and unstructured data. Structured data has a uniform schema and can be described with numbers or text. Other information cannot be digitized or represented uniformly — for example scanned images, faxes, photos, and computer-generated reports, word-processing documents, spreadsheets, presentations, audio, and video; these are unstructured data. After structured information is extracted from unstructured data, the original files still need to be preserved for subsequent use.
In many fields, unstructured data accounts for a significantly larger share than structured data. Unstructured data carries a very large volume of information; storing it directly in a database not only greatly enlarges the database but also reduces the efficiency of maintenance and application. Unstructured data obtained from the Internet is, in particular, often repetitive: a hot event can attract a large number of users in a short time, so a small amount of unstructured data is reused extensively within a short period and consumes system storage space. In the prior art, compression techniques shrink data by a certain ratio, but unstructured data lacks strict structure, is harder to standardize than structured information, and is harder to manage. Moreover, when the massive unstructured small files currently stored in HDFS are merged into large files with the Mapfile technique, no compression is applied, so the merged files still occupy considerable space. How to remove the repeated content in massive unstructured data and save storage space is therefore a problem to be urgently solved.
Summary of the invention
The embodiments of the present invention provide a deduplication method and device based on HDFS storage files, so that when HDFS handles massive unstructured small files it deduplicates effectively and saves storage space.
In a first aspect, an embodiment of the present invention provides a deduplication method based on HDFS storage files, including:
comparing the file fingerprint of a file awaiting deduplication with the file fingerprint of a stored file;
if the comparison result is identical, calculating a link identification according to the file identification of the file awaiting deduplication;
replacing the file content of the file awaiting deduplication with the link identification and the storage address of the identical stored file in the storage node, and storing this as the key value of the file identification of the file awaiting deduplication in the storage node.
Preferably, before comparing the file fingerprint of the file awaiting deduplication with the file fingerprint of the stored file, the method further includes:
storing received files into a set region in the storage node and marking it as an un-deduplicated region;
obtaining files one by one from the un-deduplicated region as files awaiting deduplication.
Preferably, storing received files into the set region in the storage node includes:
generating a primary key for each received file as its file identification;
converting the file content into binary data and storing it, in correspondence with the file identification, into the set region in the storage node.
Preferably, storing received files into the set region in the storage node includes:
storing the received files into different set regions in the storage node according to their dates of receipt.
Preferably, calculating the link identification according to the file identification of the file awaiting deduplication includes:
calculating a 32-character MD5 value of the file identification of the file awaiting deduplication as the link identification.
Preferably, after replacing the file content of the file awaiting deduplication with the link identification and the storage address of the identical stored file in the storage node, and storing this as the key value of the file identification in the storage node, the method further includes:
rewriting the index file of the storage node according to each file identification in the storage node and the storage location of its corresponding key value.
Preferably, the method further includes:
obtaining the file identification of the file to be read according to a received file read request;
calculating the corresponding link identification according to the file identification;
reading the set location data of the corresponding key value from the storage node according to the file identification;
if the link identification matches the set location data, reading the storage address from the key value;
locating and reading the corresponding file in the storage node according to the storage address, and responding to the file read request after reading.
In a second aspect, an embodiment of the present invention further provides a deduplication device based on HDFS storage files, including:
a fingerprint comparison module, configured to compare the file fingerprint of a file awaiting deduplication with the file fingerprint of a stored file;
a link identification calculation module, configured to calculate a link identification according to the file identification of the file awaiting deduplication if the comparison result is identical;
a content replacement module, configured to replace the file content of the file awaiting deduplication with the link identification and the storage address of the identical stored file in the storage node, and store this as the key value of the file identification of the file awaiting deduplication in the storage node.
Preferably, the device further includes:
a file storage module, configured to store received files into a set region in the storage node before the fingerprint comparison, and to mark it as an un-deduplicated region;
a file acquisition module, configured to obtain files one by one from the un-deduplicated region as files awaiting deduplication.
Preferably, the file storage module includes:
a primary key generation unit, configured to generate a primary key for each received file as its file identification;
a content conversion unit, configured to convert the file content into binary data and store it, in correspondence with the file identification, into the set region in the storage node.
Preferably, the file storage module is specifically configured to:
store the received files into different set regions in the storage node according to their dates of receipt.
Preferably, the link identification calculation module is specifically configured to:
calculate a 32-character MD5 value of the file identification of the file awaiting deduplication as the link identification.
Preferably, the device further includes:
an index rewriting module, configured to rewrite the index file of the storage node, after the content replacement is stored in the storage node, according to each file identification in the storage node and the storage location of its corresponding key value.
Preferably, the device further includes:
a file identification reading module, configured to obtain the file identification of the file to be read according to a received file read request;
a corresponding identification calculation module, configured to calculate the corresponding link identification according to the file identification;
a location data reading module, configured to read the set location data of the corresponding key value from the storage node according to the file identification;
a matching module, configured to read the storage address from the key value if the link identification matches the set location data;
a file searching module, configured to locate and read the corresponding file in the storage node according to the storage address, and to respond to the file read request after reading.
For massive unstructured files with identical content in HDFS, the embodiments of the present invention retain only one copy of each content: file content whose fingerprint matches a stored file is deleted and replaced with a link identification and a link address. This effectively removes files with repeated content, reduces the number of files, saves substantial storage space, releases memory resources, and improves system performance, while still meeting the requirements of fast storage and correct reading.
Description of the drawings
Figure 1A is a flow chart of a deduplication method based on HDFS storage files in embodiment one of the present invention;
Figure 1B is a schematic diagram of a deduplication method based on HDFS storage files in embodiment one of the present invention;
Fig. 2 is a flow chart of a deduplication method based on HDFS storage files in embodiment two of the present invention;
Fig. 3 is a flow chart of a deduplication method based on HDFS storage files in embodiment three of the present invention;
Fig. 4A is a structural diagram of a deduplication device based on HDFS storage files in embodiment four of the present invention;
Fig. 4B is a structural diagram of a deduplication device based on HDFS storage files in embodiment four of the present invention.
Specific embodiment
The present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are used only to explain the present invention and not to limit it. It should also be noted that, for convenience of description, the drawings show only the parts related to the present invention rather than the entire structure.
Embodiment one
Figure 1A is a flow chart of a deduplication method based on HDFS storage files provided by embodiment one of the present invention. This embodiment applies to a Hadoop distributed file system, which typically includes a management node and multiple storage nodes. The method can be performed by a deduplication device based on HDFS storage files; the device can be implemented in software and/or hardware and is typically integrated into the management node of the Hadoop distributed file system.
The method of embodiment one specifically includes:
S101. Compare the file fingerprint of the file awaiting deduplication with the file fingerprint of a stored file.
A file awaiting deduplication is a received file. The file can first be stored in a storage node and then deduplicated offline, or it can be deduplicated online as it is received. Because online deduplication consumes considerable resources, runs slowly, and has a long response time, offline deduplication is preferred: files that have not yet undergone deduplication are extracted from the storage node as files awaiting deduplication.
Specifically, the file fingerprint is calculated from the content of each file; no matter how the file name changes, as long as the content is unchanged the calculated fingerprint is identical. If the content of the file awaiting deduplication is identical to that of a stored file, the calculated fingerprints are identical. The fingerprint can be computed with Message-Digest Algorithm 5 (MD5), the Secure Hash Algorithm (SHA-1), or a cyclic redundancy check (CRC32). Among these, MD5 values are highly dispersed — a minor change in the source content causes a great change in the MD5 value — and highly reliable. In this embodiment, the MD5 value is preferably calculated over the first 1 KB and the last 1 KB of the file's binary data, and the result serves as the file fingerprint.
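The fingerprint described above — an MD5 value over the first and last 1 KB of the file's bytes — can be sketched as follows. This is a minimal illustration of the scheme as the text describes it; the function name and the exact way head and tail are concatenated are assumptions, not specified by the patent:

```python
import hashlib

def file_fingerprint(path, chunk=1024):
    """File fingerprint per the embodiment: MD5 over the first 1 KB and
    the last 1 KB of the file's binary data."""
    with open(path, "rb") as f:
        head = f.read(chunk)
        f.seek(0, 2)            # seek to end to learn the file size
        size = f.tell()
        f.seek(max(size - chunk, 0))
        tail = f.read(chunk)
    return hashlib.md5(head + tail).hexdigest()
```

Two copies of the same content under different names yield the same fingerprint, which is exactly the property the comparison in S101 relies on. (For files shorter than 2 KB the head and tail regions overlap; the patent does not address this case.)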
In the offline state, the file fingerprints of files awaiting deduplication are periodically compared with the fingerprints of stored files. After 0:00 each day, the MapReduce computation module of the Hadoop distributed file system compares the fingerprints offline, filters out the files awaiting deduplication whose content is identical to a stored file, and obtains the corresponding stored file and its storage address in the data storage node.
S102. If the comparison result is identical, calculate the link identification according to the file identification of the file awaiting deduplication.
Specifically, when a file is written into the Hadoop distributed file system, it is stored in a mapped file (Mapfile) in Key-Value form. The primary key Key is the file identification, a character string assigned when the file is stored that uniquely identifies it. The key value Value is the binary data corresponding to the Key, i.e. the full binary content of the file. If the fingerprint comparison result is identical, a link identification is calculated from the file identification Key of the file awaiting deduplication; the link identification specially marks files that have undergone deduplication. In the file reading stage, if what is read from the file's key value is a link identification rather than actual binary data, the file has been deduplicated. If the fingerprint comparison result differs, the file's content differs from that of the stored files, so its content is retained and no deduplication is performed.
Preferably, step S102 includes:
calculating a 32-character MD5 value of the file identification of the file awaiting deduplication as the link identification.
In this embodiment, a 32-character MD5 value is calculated from the file identification Key of the file awaiting deduplication and used as the link identification. Similar to an encryption process, the deduplicated file is marked by computing this 32-character MD5 value, which is checked when a file read request is served. In the file reading stage, the link identification can be recomputed from the file identification to recognize whether the file has been deduplicated.
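The link identification computation is then a single hash call — the 32-character hexadecimal MD5 digest of the Key string. A minimal sketch (the function name and UTF-8 encoding of the Key are assumptions):

```python
import hashlib

def link_identification(file_key: str) -> str:
    """32-character hex MD5 of the file identification (Key), used to
    mark a deduplicated entry's key value."""
    return hashlib.md5(file_key.encode("utf-8")).hexdigest()
```

Because the same Key always yields the same 32 characters, the reader can recompute this value at read time and compare it against the start of the stored key value to decide whether the entry was deduplicated.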
S103. Replace the file content of the file awaiting deduplication with the link identification and the storage address of the identical stored file in the storage node, and store this as the key value of the file identification of the file awaiting deduplication in the storage node.
In this embodiment, the key value of a deduplicated file no longer stores the binary data of the file content; it is replaced with the link identification and a storage address, and the content stored at that address is identical to the content of the deduplicated file. As shown in Figure 1B, suppose the file content corresponding to Key2 is identical to the content of a stored file. The binary data of the Key2 file content is read, and the link identification (the 32-character MD5 value) and the actual storage address of the identical file are written in its place, completing the replacement of the content of the file awaiting deduplication. The file contents corresponding to Key1 and Key3 differ from the stored files, so their binary data is retained.
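The replacement step can be sketched over an in-memory dict standing in for the Mapfile. Everything beyond the text is an assumption: the dict types, the use of the retained copy's Key as its "storage address", and the fingerprint-to-Key index used to find the retained copy:

```python
import hashlib

def deduplicate(store, fingerprints, key, fingerprint):
    """Sketch of step S103. `store` maps Key -> Value (bytes);
    `fingerprints` maps fingerprint -> Key of the retained copy.
    If another retained file shares the fingerprint, replace this
    key's raw content with link identification + storage address."""
    if fingerprint in fingerprints:
        original_key = fingerprints[fingerprint]
        link_id = hashlib.md5(key.encode("utf-8")).hexdigest()  # 32 chars
        store[key] = (link_id + original_key).encode("utf-8")
        return True                       # content replaced by link + address
    fingerprints[fingerprint] = key       # first copy: content is retained
    return False
```

After this runs over Key1, Key2, and Key3 as in Figure 1B, only the retained copy holds the binary data; the duplicate's value is the fixed-width link prefix followed by the address.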
Preferably, after step S103, the method further includes:
rewriting the index file of the storage node according to each file identification in the storage node and the storage location of its corresponding key value.
Specifically, the index file enables rapid positioning. Since the key value data corresponding to some file identifications in the storage node has been replaced, the original index file can no longer correctly express the new mapping relations. The index file of the storage node must therefore be rewritten according to the post-replacement storage locations of each file identification and its corresponding key value in the storage node.
The deduplication method based on HDFS storage files provided by embodiment one compares file fingerprints and deduplicates data in the offline state. By appropriately extending the processing time, it increases system reliability, saves memory resources, reduces the hardware requirements, and thus saves substantial equipment cost, while effectively removing files with repeated content, reducing the file count, and saving storage space.
Embodiment two
Fig. 2 is a flow chart of a deduplication method based on HDFS storage files provided by embodiment two of the present invention. Embodiment two is an optimized improvement on embodiment one and further describes the offline deduplication operation. As shown in Fig. 2, embodiment two specifically includes:
S201. Store received files into a set region in the storage node and mark it as an un-deduplicated region.
In this embodiment, the Hadoop distributed file system contains multiple mapped files, which are used to archive massive unstructured small files and to generate the corresponding mapping relations for the archived files. The system continuously receives files and buffers them; when the buffer reaches a capacity threshold or a preset reception time limit expires, the system writes the unstructured files, in order of receipt, into the mapped files of the storage nodes and marks the regions as un-deduplicated. The capacity threshold may be set between 128 MB and 2 GB, and the preset time limit between 5 and 20 minutes; writing may be performed by concurrent multithreading to guarantee write speed.
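The flush trigger — "capacity threshold reached, or time limit expired" — can be sketched as below. The class and its defaults are illustrative; only the threshold ranges (128 MB-2 GB, 5-20 minutes) come from the text:

```python
import time

class WriteBuffer:
    """Decide when buffered incoming files should be flushed to a
    mapped-file region: either the buffered size reaches the capacity
    threshold or the reception time limit expires."""
    def __init__(self, capacity_bytes=128 * 2**20, time_limit_s=5 * 60):
        self.capacity = capacity_bytes
        self.time_limit = time_limit_s
        self.buffered = 0
        self.started = time.monotonic()

    def add(self, nbytes):
        self.buffered += nbytes

    def should_flush(self):
        return (self.buffered >= self.capacity
                or time.monotonic() - self.started >= self.time_limit)
```

Either condition alone suffices, so a trickle of files is still flushed within the time limit while a burst is flushed as soon as the capacity threshold is hit.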
Preferably, step S201 includes:
generating a primary key for each received file as its file identification;
converting the file content into binary data and storing it, in correspondence with the file identification, into the set region in the storage node.
Specifically, the system generates a primary key Key for each received file as its file identification and indexes storage by this Key. The file content is converted into binary data as the key value Value corresponding to the Key, and the Key and its corresponding Value are stored as a key-value pair in the mapped file of the storage node.
Preferably, step S201 also includes:
storing the received files into different set regions in the storage node according to their dates of receipt.
Specifically, when storing files, the received files are stored into different mapped files of the storage node according to the date of receipt. A new directory is created in the Hadoop distributed file system for each day's written files, partitioning storage by day.
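A day-partitioned layout of this kind might look as follows; the directory names, base path, and the `non_dedup` marker are entirely hypothetical, since the patent specifies only "one directory per day":

```python
import datetime

def region_path(base_dir: str, received: datetime.date) -> str:
    """One directory per receipt day, holding that day's un-deduplicated
    mapped files. Layout is illustrative, not from the patent."""
    return f"{base_dir}/{received.isoformat()}/non_dedup"
```

With this layout, the offline job that runs after 0:00 simply walks yesterday's directory, leaving the current day's region untouched.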
S202. Obtain files one by one from the un-deduplicated region as files awaiting deduplication.
S203. Compare the file fingerprint of the file awaiting deduplication with the file fingerprint of a stored file.
S204. If the comparison result is identical, calculate the link identification according to the file identification of the file awaiting deduplication.
S205. Replace the file content of the file awaiting deduplication with the link identification and the storage address of the identical stored file in the storage node, and store this as the key value of the file identification of the file awaiting deduplication in the storage node.
The deduplication method based on HDFS storage files provided by embodiment two partitions file storage by date of receipt, which facilitates offline processing: files stored on the current day are temporarily not deduplicated. This guarantees storage efficiency, meets the demand for fast data storage, and improves the real-time performance of data storage.
Embodiment three
Fig. 3 is a flow chart of a deduplication method based on HDFS storage files provided by embodiment three of the present invention. Embodiment three is an optimized improvement on embodiment two and further describes how file content is obtained after deduplication. As shown in Fig. 3, embodiment three specifically includes:
S301. Obtain the file identification of the file to be read according to the received file read request.
S302. Calculate the corresponding link identification according to the file identification.
S303. Read the set location data of the corresponding key value from the storage node according to the file identification.
S304. If the link identification matches the set location data, read the storage address from the key value.
S305. Locate and read the corresponding file in the storage node according to the storage address, and respond to the file read request after reading.
In this embodiment, the internal process of obtaining file content is shielded from the user reading the file. The system obtains the file identification (primary key Key) of the file to be read from the received read request and calculates the link identification corresponding to the Key, which can be obtained by computing its MD5 value. It then reads the first 32 characters of the corresponding key value from the storage node according to the Key and compares them with the link identification's MD5 value. If they are consistent, the file to be read has been deduplicated, and what is stored as its content is the storage address of the content-identical stored file in the storage node rather than the real content of the file itself: the first 32 characters are removed from the content, the storage address is read from the key value, the corresponding file is located and read in the storage node according to the storage address, and the file read request is answered after reading. If the comparison of the link identification's MD5 value with the 32 read characters is inconsistent, the file to be read has not been deduplicated, so the file content is read from the key value and the file read request is answered.
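The read-side resolution described above pairs with the replacement sketch for S103: recompute the 32-character link identification from the Key, compare it against the key value's prefix, and either follow the stored address or return the content directly. In this in-memory sketch the dict again stands in for the Mapfile, and the "storage address" is the retained copy's Key (an assumption, not the patent's addressing scheme):

```python
import hashlib

def read_file(store, key):
    """Sketch of S301-S305. If the value's first 32 bytes equal the MD5
    of the key, the entry was deduplicated: the remainder is the address
    of the retained copy. Otherwise the value is the content itself."""
    value = store[key]
    link_id = hashlib.md5(key.encode("utf-8")).hexdigest().encode("utf-8")
    if value[:32] == link_id:
        address = value[32:].decode("utf-8")
        return store[address]   # follow the link to the original content
    return value
```

Note the scheme relies on un-deduplicated content being astronomically unlikely to start with exactly the 32-character MD5 of its own Key; the patent treats the prefix match as decisive.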
The deduplication method based on HDFS storage files provided by embodiment three stores only the corresponding storage address for repeated unstructured files and shields the internal process from the visitor when a file is read. It meets the demand for correct reading, saves storage space, and improves system performance.
Example IV
Fig. 4A is a structural diagram of a deduplication device based on HDFS storage files in embodiment four of the present invention. The device applies to a Hadoop distributed file system. As shown in Fig. 4A, the device includes:
a fingerprint comparison module 401, configured to compare the file fingerprint of a file awaiting deduplication with the file fingerprint of a stored file;
a link identification calculation module 402, configured to calculate a link identification according to the file identification of the file awaiting deduplication if the comparison result is identical;
a content replacement module 403, configured to replace the file content of the file awaiting deduplication with the link identification and the storage address of the identical stored file in the storage node, and store this as the key value of the file identification of the file awaiting deduplication in the storage node.
Preferably, the link identification calculation module is specifically configured to:
calculate a 32-character MD5 value of the file identification of the file awaiting deduplication as the link identification.
Preferably, the device further includes:
an index rewriting module 404, configured to rewrite the index file of the storage node, after the content replacement is stored in the storage node, according to each file identification in the storage node and the storage location of its corresponding key value.
Specifically, in the offline state, the fingerprint comparison module compares the file fingerprint of the file awaiting deduplication with the fingerprints of stored files, filters out the files awaiting deduplication whose content is identical to a stored file, and obtains the corresponding stored file and its storage address in the data storage node. If the fingerprint comparison result is identical, the link identification calculation module calculates a 32-character MD5 value from the file identification Key of the file awaiting deduplication as the link identification, which marks files that have been deduplicated. The content replacement module replaces the file content of the file awaiting deduplication with the link identification and the storage address of the identical stored file in the storage node, and stores this as the key value of the file identification in the storage node. The index rewriting module then rewrites the index file of the storage node according to each file identification in the storage node and the storage location of its corresponding key value.
Preferably, as shown in Fig. 4A, the device also includes:
a file storage module 405, configured to store received files into a set region in the storage node before the fingerprint comparison, and to mark it as an un-deduplicated region;
a file acquisition module 406, configured to obtain files one by one from the un-deduplicated region as files awaiting deduplication.
Preferably, the file storage module includes:
A primary key generation unit, configured to generate a primary key for each received file as its file identification;
A content conversion unit, configured to convert the file content of the file into binary data and store it into the set region in the storage node in correspondence with the file identification.
Preferably, the file storage module is specifically configured to:
Store received files into different set regions in the storage node according to their dates of receipt.
Specifically, the file access module continuously receives files and caches them. When the cache occupancy reaches a capacity threshold, or the receiving time reaches a preset limit, the system writes the received unstructured files into the mapped files of the storage nodes with multiple concurrent threads, in the order of receipt and grouped by the date of receipt. The capacity threshold may be set between 128 MB and 2 GB, and the preset time limit may be set between 5 and 20 minutes. The primary key generation unit generates a primary key (Key) for each received file as its file identification, and the content conversion unit converts the file content into binary data as the key value (Value) corresponding to the primary key; the primary key Key and its corresponding key value Value are stored as a key-value pair into the mapped file of the storage node. The file acquisition module then obtains files one by one from the non-deduplicated region as files to be deduplicated.
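The buffered, threshold-driven write described above can be sketched as follows. This is a single-threaded simplification; the dictionary standing in for the storage node's mapped file and the particular threshold values are assumptions (a real implementation would flush with multiple concurrent threads and partition regions by date of receipt):

```python
import time

CAPACITY_THRESHOLD = 128 * 1024 * 1024   # configurable between 128 MB and 2 GB
TIME_LIMIT = 5 * 60                      # configurable between 5 and 20 minutes

class FileBuffer:
    """Caches received files and flushes them to storage as key-value pairs."""

    def __init__(self, storage: dict):
        self.storage = storage           # stands in for the node's mapped file
        self.pending = []                # (primary key, binary content) in receipt order
        self.size = 0
        self.start = time.monotonic()

    def receive(self, key: str, content: bytes) -> None:
        self.pending.append((key, content))
        self.size += len(content)
        # Flush when either the capacity threshold or the time limit is reached.
        if (self.size >= CAPACITY_THRESHOLD
                or time.monotonic() - self.start >= TIME_LIMIT):
            self.flush()

    def flush(self) -> None:
        # Write buffered files in order of receipt as key-value pairs.
        for key, content in self.pending:
            self.storage[key] = content
        self.pending.clear()
        self.size = 0
        self.start = time.monotonic()
```

The primary key plays the role of the file identification, and the binary content is the key value stored against it.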
Preferably, as shown in Figure 4B, the device further includes:
A file identification reading module 407, configured to obtain the file identification of a file to be read according to a received file read request;
A corresponding identification calculation module 408, configured to calculate the corresponding link identification according to the file identification;
A location data reading module 409, configured to read the data at a set location of the corresponding key value from the storage node according to the file identification;
A matching module 410, configured to read the storage address from the key value if the link identification matches the data at the set location;
A file search module 411, configured to locate and search for the corresponding file in the storage node according to the storage address, and to respond to the file read request after reading.
Specifically, the file identification reading module obtains the file identification (primary key, Key) of the file to be read according to the received file read request, and the corresponding identification calculation module calculates the MD5 value corresponding to the primary key of the file to be read. The location data reading module reads the first 32 characters of the corresponding key value from the storage node according to the primary key, and the matching module compares the MD5 value of the link identification with the 32 characters read. If they are consistent, the file to be read has undergone deduplication, and what is stored as its file content is the storage address of a content-identical stored file in the storage node rather than the real content of the file itself; the first 32 characters are removed from the file content, the file search module reads the storage address from the key value, locates the corresponding file in the storage node according to the storage address, and responds to the file read request after reading. If they are inconsistent, the file to be read has not undergone deduplication, and the file content is read from the key value and returned in response to the file read request.
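A minimal sketch of this read path, under the same assumptions as above (a dictionary stands in for the storage node's key-value store, and the stored file's key serves as its storage address):

```python
import hashlib

def read_file(store: dict, file_key: str) -> bytes:
    value = store[file_key]
    # The link identification is the 32-character hex MD5 of the file key.
    link_id = hashlib.md5(file_key.encode()).hexdigest()
    if value[:32] == link_id.encode():
        # Deduplicated: the remainder of the value is the storage address
        # of the content-identical stored file; follow it to the real content.
        address = value[32:].decode()
        return store[address]
    # Not deduplicated: the value is the real file content.
    return value
```

Because the first 32 characters of a deduplicated key value always equal the MD5 of the file identification, the comparison alone decides whether a second lookup is needed.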
The deduplication device based on HDFS stored files provided by Embodiment 4 of the present invention can effectively remove files with duplicate content, reduce the number of files, save a large amount of storage space, release memory resources, and improve system performance, while also meeting the requirements of fast storage and correct reading.
The device provided by the embodiment of the present invention can execute the method provided by any embodiment of the present invention, and has the corresponding functional modules and beneficial effects for executing that method.
Note that the above are only preferred embodiments of the present invention and the technical principles applied. Those skilled in the art will appreciate that the present invention is not limited to the specific embodiments described herein, and that various obvious changes, readjustments, and substitutions can be made without departing from the protection scope of the present invention. Therefore, although the present invention has been described in further detail through the above embodiments, it is not limited to the above embodiments and may include other equivalent embodiments without departing from the inventive concept; the scope of the present invention is determined by the scope of the appended claims.
Claims (14)
1. A deduplication method based on HDFS stored files, characterized by comprising:
comparing a file fingerprint of a file to be deduplicated with file fingerprints of stored files;
if the comparison result is identical, calculating a link identification according to a file identification of the file to be deduplicated;
replacing the file content of the file to be deduplicated with the link identification and a storage address of the identical stored file in a storage node, and storing the result in the storage node as a key value of the file identification of the file to be deduplicated.
2. The method according to claim 1, characterized in that, before comparing the file fingerprint of the file to be deduplicated with the file fingerprints of stored files, the method further comprises:
storing received files into a set region in the storage node and marking the region as a non-deduplicated region;
obtaining files one by one from the non-deduplicated region as files to be deduplicated.
3. The method according to claim 2, characterized in that storing received files into the set region in the storage node comprises:
generating a primary key for each received file as its file identification;
converting the file content of the file into binary data, and storing it into the set region in the storage node in correspondence with the file identification.
4. The method according to claim 2, characterized in that storing received files into the set region in the storage node comprises:
storing received files into different set regions in the storage node according to their dates of receipt.
5. The method according to claim 1, characterized in that calculating the link identification according to the file identification of the file to be deduplicated comprises:
calculating a 32-character MD5 value from the file identification of the file to be deduplicated as the link identification.
6. The method according to claim 1, characterized in that, after replacing the file content of the file to be deduplicated with the link identification and the storage address of the identical stored file in the storage node and storing the result in the storage node as the key value of the file identification of the file to be deduplicated, the method further comprises:
rewriting an index file of the storage node according to each file identification in the storage node and the storage location of its corresponding key value.
7. The method according to any one of claims 1-6, characterized by further comprising:
obtaining a file identification of a file to be read according to a received file read request;
calculating a corresponding link identification according to the file identification;
reading data at a set location of the corresponding key value from the storage node according to the file identification;
if the link identification matches the data at the set location, reading the storage address from the key value;
locating and searching for the corresponding file in the storage node according to the storage address, and responding to the file read request after reading.
8. A deduplication device based on HDFS stored files, characterized by comprising:
a fingerprint comparison module, configured to compare a file fingerprint of a file to be deduplicated with file fingerprints of stored files;
a link identification calculation module, configured to, if the comparison result is identical, calculate a link identification according to a file identification of the file to be deduplicated;
a content replacement module, configured to replace the file content of the file to be deduplicated with the link identification and a storage address of the identical stored file in a storage node, and to store the result in the storage node as a key value of the file identification of the file to be deduplicated.
9. The device according to claim 8, characterized in that the device further comprises:
a file storage module, configured to store received files into a set region in the storage node before the file fingerprint of the file to be deduplicated is compared with the file fingerprints of stored files, and to mark that region as a non-deduplicated region;
a file acquisition module, configured to obtain files one by one from the non-deduplicated region as files to be deduplicated.
10. The device according to claim 9, characterized in that the file storage module comprises:
a primary key generation unit, configured to generate a primary key for each received file as its file identification;
a content conversion unit, configured to convert the file content of the file into binary data and store it into the set region in the storage node in correspondence with the file identification.
11. The device according to claim 9, characterized in that the file storage module is specifically configured to:
store received files into different set regions in the storage node according to their dates of receipt.
12. The device according to claim 8, characterized in that the link identification calculation module is specifically configured to:
calculate a 32-character MD5 value from the file identification of the file to be deduplicated as the link identification.
13. The device according to claim 8, characterized by further comprising:
an index rewriting module, configured to, after the file content of the file to be deduplicated has been replaced with the link identification and the storage address of the identical stored file in the storage node and the result has been stored in the storage node as the key value of the file identification of the file to be deduplicated, rewrite an index file of the storage node according to each file identification in the storage node and the storage location of its corresponding key value.
14. The device according to any one of claims 8-13, characterized in that the device further comprises:
a file identification reading module, configured to obtain a file identification of a file to be read according to a received file read request;
a corresponding identification calculation module, configured to calculate a corresponding link identification according to the file identification;
a location data reading module, configured to read data at a set location of the corresponding key value from the storage node according to the file identification;
a matching module, configured to read the storage address from the key value if the link identification matches the data at the set location;
a file search module, configured to locate and search for the corresponding file in the storage node according to the storage address, and to respond to the file read request after reading.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611159251.XA CN106649676B (en) | 2016-12-15 | 2016-12-15 | HDFS (Hadoop Distributed File System)-based duplicate removal method and device for stored files |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106649676A true CN106649676A (en) | 2017-05-10 |
CN106649676B CN106649676B (en) | 2020-06-19 |
Family
ID=58822292
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611159251.XA Active CN106649676B (en) | 2016-12-15 | 2016-12-15 | HDFS (Hadoop Distributed File System)-based duplicate removal method and device for stored files |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106649676B (en) |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101706825A (en) * | 2009-12-10 | 2010-05-12 | 华中科技大学 | Replicated data deleting method based on file content types |
US9367397B1 (en) * | 2011-12-20 | 2016-06-14 | Emc Corporation | Recovering data lost in data de-duplication system |
CN104410692A (en) * | 2014-11-28 | 2015-03-11 | 上海爱数软件有限公司 | Method and system for uploading duplicated files |
Cited By (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107590109A (en) * | 2017-07-24 | 2018-01-16 | 深圳市元征科技股份有限公司 | A kind of text handling method and electronic equipment |
CN108563649A (en) * | 2017-12-12 | 2018-09-21 | 南京富士通南大软件技术有限公司 | Offline De-weight method based on GlusterFS distributed file systems |
CN111522502B (en) * | 2019-02-01 | 2022-04-29 | 阿里巴巴集团控股有限公司 | Data deduplication method and device, electronic equipment and computer-readable storage medium |
CN111522502A (en) * | 2019-02-01 | 2020-08-11 | 阿里巴巴集团控股有限公司 | Data deduplication method and device, electronic equipment and computer-readable storage medium |
CN110413960A (en) * | 2019-06-19 | 2019-11-05 | 平安银行股份有限公司 | File control methods, device, computer equipment and computer readable storage medium |
CN110413960B (en) * | 2019-06-19 | 2023-03-28 | 平安银行股份有限公司 | File comparison method and device, computer equipment and computer readable storage medium |
CN110442845A (en) * | 2019-07-08 | 2019-11-12 | 新华三信息安全技术有限公司 | File repetitive rate calculation method and device |
CN110442845B (en) * | 2019-07-08 | 2022-12-20 | 新华三信息安全技术有限公司 | File repetition rate calculation method and device |
CN110535835A (en) * | 2019-08-09 | 2019-12-03 | 西藏宁算科技集团有限公司 | It is a kind of to support cloudy shared cloud storage method and system based on Message Digest 5 |
CN111522791A (en) * | 2020-04-30 | 2020-08-11 | 电子科技大学 | Distributed file repeating data deleting system and method |
CN111522791B (en) * | 2020-04-30 | 2023-05-30 | 电子科技大学 | Distributed file repeated data deleting system and method |
CN112084179A (en) * | 2020-09-02 | 2020-12-15 | 北京锐安科技有限公司 | Data processing method, device, equipment and storage medium |
CN112084179B (en) * | 2020-09-02 | 2023-11-07 | 北京锐安科技有限公司 | Data processing method, device, equipment and storage medium |
WO2023070462A1 (en) * | 2021-10-28 | 2023-05-04 | 华为技术有限公司 | File deduplication method and apparatus, and device |
CN115203159A (en) * | 2022-07-25 | 2022-10-18 | 北京字跳网络技术有限公司 | Data storage method and device, computer equipment and storage medium |
CN115203159B (en) * | 2022-07-25 | 2024-06-04 | 北京字跳网络技术有限公司 | Data storage method, device, computer equipment and storage medium |
Legal Events
Date | Code | Title | Description
---|---|---|---
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||