CN111522791A - Distributed file deduplication system and method - Google Patents

Distributed file deduplication system and method

Info

Publication number
CN111522791A
Authority
CN
China
Prior art keywords
file
data
meta
hdfs
data blocks
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010362251.XA
Other languages
Chinese (zh)
Other versions
CN111522791B (en)
Inventor
侯孟书
周立康
许佳欣
詹思瑜
周世杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China
Priority to CN202010362251.XA
Publication of CN111522791A
Application granted
Publication of CN111522791B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/174 Redundancy elimination performed by the file system
    • G06F16/1748 De-duplication implemented within the file system, e.g. based on file segments
    • G06F16/1752 De-duplication implemented within the file system based on file chunks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a distributed file deduplication system and method. The system comprises a meta-information service node, which manages the content addresses of data blocks; a meta-information table, which stores the content addresses of all data blocks in the HDFS system; and at least one HDFS client, each of which carries a meta-information service node and a meta-information table. When a file to be deduplicated is written, the HDFS client splits it into a plurality of data blocks, computes a fingerprint value for each block, calls the meta-information service node to query the meta-information table, removes the duplicate blocks, recombines the remaining blocks with the index data into a new index file, interacts with the NameNode to store the index file on HDFS, and saves the newly generated data fingerprints in the meta-information table of the client's database. With this system and method, an HDFS client can quickly complete both deduplication and distributed storage of a file.

Description

Distributed file deduplication system and method
Technical Field
The invention relates to the technical field of data deduplication, and in particular to a distributed file deduplication system and method.
Background
When Hadoop processes certain kinds of data, redundant copies within that data reduce the storage efficiency of the system and waste storage resources. Deduplication technology can effectively identify duplicate files or data blocks in the system, save storage space, and improve the effective utilization of system resources. Hadoop is the mainstream development platform in today's big-data field; applying deduplication technology to the Hadoop platform would therefore effectively promote the development of big data.
At present, deduplication designs for Hadoop do exist and have attracted much attention, but their analysis and design do not account for certain characteristics of Hadoop, which makes them poorly suited to deployment on the platform. The main drawbacks of current designs are:
1. only a single server can interact with the HDFS, so that server becomes a system bottleneck and the advantages of a distributed system are not fully exploited;
2. the client only provides file download and upload, and lacks streaming data access and storage;
3. they are incompatible with Hadoop's abstract file system, so application programs on Hadoop, such as MapReduce jobs, cannot be used directly, which severely limits the applicability of the deduplication system;
4. failure recovery is not considered: HBase and Redis offer neither isolation levels nor rollback, and are therefore unsuitable for managing the meta information.
Disclosure of Invention
The object of the invention is to overcome the defects of the prior art and provide a distributed file system with deduplication: the HDFS client performs file chunking, fingerprint computation, and deduplication, then recombines the deduplicated data blocks with the index data into a new index file that is written to the HDFS system.
The object of the invention is achieved through the following technical scheme:
a distributed file data de-duplication system comprises a meta-information service node, a meta-information table and at least one HDFS client. The meta information service node and the meta information table are provided on the HDFS client.
The meta-information service node is used for managing the content address of the data block; and the meta information table is used for storing the content addresses of all data blocks in the HDFS system.
When a file to be deduplicated is written to the HDFS client, the client reads the file (random access is supported), splits it into a plurality of data blocks, computes a fingerprint value for each block, calls the meta-information service node to query the meta-information table, removes the duplicate blocks, recombines the remaining blocks with the index data into a new index file, interacts with the NameNode to store the index file on HDFS, and saves the newly generated data fingerprints in the meta-information table of the database.
Specifically, the content address of a data block comprises the block's fingerprint value, its reference count, the path and name of the file holding it, the block's offset within that file, the block's size, and the file creation time.
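The fields of a content address can be sketched as a simple record; the field names below are illustrative, since the patent lists the fields but not their names.

```python
from dataclasses import dataclass

@dataclass
class BlockAddress:
    """Content address of a data block, per the field list above
    (names are assumptions, not taken from the patent)."""
    fingerprint: str   # hash digest of the block's content
    ref_count: int     # number of files referencing this block
    file_path: str     # path and name of the index file holding the block
    offset: int        # byte offset of the block within that file
    size: int          # block size in bytes
    create_time: int   # creation timestamp of the holding file
```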
Specifically, the HDFS client database is a MySQL database, and its transaction isolation level is set to read committed, to ensure that the fingerprint values of deduplicated data blocks can be written to the meta-information table concurrently.
Specifically, an id value is added to the meta-information table to replace the hash value as its primary key, so as to avoid primary-key conflicts when files are written concurrently.
A distributed file deduplication method comprises the following steps:
file reading: the HDFS client reads the file to be deduplicated (random access is supported), checks attribute information such as its file permissions, opens it if no error is found, and stores the file-open information in the meta-information table;
data segmentation: the HDFS client invokes a content-based chunking algorithm, in parallel, to split the opened file data into a plurality of data blocks;
fingerprint computation: the HDFS client computes the fingerprint value of each data block with a hash function;
data-block deduplication: using the computed fingerprints, the HDFS client queries, through the meta-information service node, the fingerprint values of all data blocks of the HDFS system stored in the meta-information table, compares against them, removes the duplicate blocks, and writes the fingerprints of the remaining blocks into the meta-information table of the HDFS client database;
file storage: the HDFS client recombines the remaining data blocks and the index data into a new index file, and stores the index file on HDFS.
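The pipeline of these steps can be sketched in a few lines. This is a minimal in-memory illustration, not the patented implementation: fixed-size chunking stands in for the content-based algorithm, a dict stands in for the MySQL meta-information table, and the function name is hypothetical.

```python
import hashlib

def dedup_write(data: bytes, meta_table: dict, chunk_size: int = 4) -> list:
    """Split data into chunks, fingerprint each, and keep only unique chunks
    in meta_table; return the fingerprint index in original order."""
    index = []                          # fingerprint of every chunk, in order
    for off in range(0, len(data), chunk_size):
        chunk = data[off:off + chunk_size]
        fp = hashlib.md5(chunk).hexdigest()
        if fp not in meta_table:        # unique block: store it as a "new fingerprint"
            meta_table[fp] = chunk
        index.append(fp)                # duplicates contribute only an index entry
    return index
```

With `b"abcdabcd"` and a 4-byte chunk, only one block is stored while the index records both occurrences.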
The invention has the beneficial effects that:
1. in addition to file download and upload, streaming data access and storage are supported;
2. the system is compatible with Hadoop's abstract file system, so application programs on Hadoop can be used directly;
3. it has failure recovery, and can interact with multiple clients simultaneously.
Drawings
FIG. 1 is a system block diagram of the present invention.
FIG. 2 is a diagram of a random read embodiment of the present invention.
FIG. 3 is a diagram of the MetaBlock logic architecture of the present invention.
Fig. 4 is a diagram of a file deletion embodiment of the present invention.
FIG. 5 is a diagram illustrating the structure of an index file according to the present invention.
FIG. 6 is a diagram of a logical storage structure of an index file according to the present invention.
FIG. 7 is a logical block diagram of the index portion of the index file of the present invention.
FIG. 8 is a flow chart of the writing of the index file according to the present invention.
FIG. 9 is a flowchart of the reading of the index file according to the present invention.
FIG. 10 is a flow chart of a method of the present invention.
Detailed Description
In order to more clearly understand the technical features, objects, and effects of the present invention, embodiments of the present invention will now be described with reference to the accompanying drawings.
In this embodiment, as shown in fig. 1, a distributed file deduplication system includes a meta-information service node, a meta-information table, and at least one HDFS client; the meta-information service node and the meta-information table reside on each HDFS client.
The meta-information service node manages the content addresses of data blocks; a content address comprises the block's fingerprint value, its reference count, the path and name of the file holding it, the block's offset within that file, the block's size, and the file creation time. The node exposes an interface, BlockMeta, covering insertion, update, and query of content addresses. Its main operations are: putBlock, which inserts a block's content address, the key information being the block's MD5 fingerprint digest and the content address; getBlock, which returns the content address stored in the meta-information table for a given MD5 digest; removeBlock, which deletes a content address; isExist, which tests whether a block is present; commit, which, after a file has been written successfully, commits the transaction and atomically increments the reference count of every content address in the index list; and abort, which rolls back the meta-information database transaction if an error occurs while the file is being written.
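The BlockMeta operations named above can be sketched against a relational table. This is an illustrative sketch only: it uses an in-memory SQLite database where the patent uses MySQL, and the method signatures and column names are assumptions.

```python
import sqlite3

class BlockMeta:
    """Sketch of the BlockMeta interface (signatures assumed, not from the patent)."""

    def __init__(self):
        self.db = sqlite3.connect(":memory:")
        self.db.execute("""CREATE TABLE meta (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            hash TEXT, path TEXT, offset INTEGER, size INTEGER,
            ctime INTEGER, refs INTEGER DEFAULT 0)""")

    def putBlock(self, md5, path, offset, size, ctime):
        # insert a block's content address, keyed by its MD5 digest
        self.db.execute("INSERT INTO meta (hash, path, offset, size, ctime) "
                        "VALUES (?, ?, ?, ?, ?)", (md5, path, offset, size, ctime))

    def getBlock(self, md5):
        # fetch the content address for a given fingerprint digest
        return self.db.execute(
            "SELECT path, offset, size, ctime FROM meta WHERE hash = ?",
            (md5,)).fetchone()

    def isExist(self, md5):
        return self.getBlock(md5) is not None

    def removeBlock(self, md5):
        self.db.execute("DELETE FROM meta WHERE hash = ?", (md5,))

    def commit(self, md5_list):
        # after a successful file write: atomically bump each referenced
        # block's reference count, then commit the transaction
        self.db.executemany("UPDATE meta SET refs = refs + 1 WHERE hash = ?",
                            [(m,) for m in md5_list])
        self.db.commit()

    def abort(self):
        # roll back the meta-information transaction on a failed write
        self.db.rollback()
```

Note the `id` autoincrement primary key: duplicate hash values are permitted, matching the design choice described for the meta-information table.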
The meta-information table stores the content addresses of all data blocks in the HDFS system. It has 7 fields: primary key id, hash value, file path and name, block offset within the file, block size, file creation time, and reference count. In a deduplication system the meta-information table must handle concurrent file updates, chiefly the storage problem posed by concurrent writes: concurrent file writes are concurrent transactions on the table, so an isolation level must be chosen for the database that stores it. The common transaction isolation levels are read uncommitted, read committed, repeatable read, and serializable. To keep the system stable under concurrent writes, the meta-information table is placed in a MySQL database whose transaction isolation level is set to read committed. In addition, the hash value is no longer used as the primary key; a separate id column serves as the primary key, and duplicate hash values, i.e. duplicate blocks, are permitted. This sacrifices a small amount of the system's disk space to reduce the risk of write failures during concurrent file writes, improving the stability of the system.
During a file write, the client must split the data stream of the file into blocks and deduplicate at block granularity. In general, the finer the split, the more likely duplicate data is found and the higher the deduplication ratio; the coarser the split, the lower the likelihood of finding duplicates and the lower the ratio. The system splits file data with a content-based chunking algorithm, using MIN and MAX bounds to guarantee minimum and maximum block sizes, preventing blocks from being too small or too large so that the system achieves the best deduplication ratio. After splitting succeeds, a hash function computes a fingerprint value for each block, which is used to remove duplicate blocks.
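Content-based chunking with MIN/MAX bounds can be sketched as follows. The patent does not specify the rolling hash or the bound values, so the boundary test, mask, and sizes below are assumptions for illustration.

```python
def cdc_chunks(data: bytes, min_size: int = 2048, max_size: int = 8192,
               mask: int = 0x1FFF) -> list:
    """Content-defined chunking sketch: cut where a rolling hash matches a
    boundary pattern, but never below min_size and never above max_size."""
    chunks, start, acc = [], 0, 0
    for i, b in enumerate(data):
        acc = (acc * 31 + b) & 0xFFFFFFFF   # toy rolling hash (assumption)
        length = i - start + 1
        # cut at a content-defined boundary once MIN is reached,
        # or force a cut at MAX so no block grows unboundedly
        if (length >= min_size and (acc & mask) == 0) or length >= max_size:
            chunks.append(data[start:i + 1])
            start, acc = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])          # final tail may be shorter than MIN
    return chunks
```

Every emitted chunk except the tail lies within [MIN, MAX], and the chunks concatenate back to the original data.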
When deleting duplicate data, the HDFS client first reads the input file with random access, implemented mainly through the file's input stream: the system does not recover all of the file's data up front, but dynamically recovers the required data according to the file position the user accesses and the amount of data accessed. A preferred embodiment of the random-read implementation is shown in fig. 2. A Bitmap area is added to the MetaBlock area of the file's index part. The Bitmap is an index over the IndexBlocks, storing the position of each IndexBlock ordered by LocalOffset, which effectively reduces the amount of data transmitted from the DataNodes during random reads; the MetaBlock logical structure with the added Bitmap is shown in fig. 3. During a random read the system does not read all the IndexBlocks: it reads the Bitmap information first, and because the LocalOffsets are laid out in fixed order, a binary search quickly locates the IndexBlock holding the content address of the target position. The data of the relevant block is then read through that content address, effectively avoiding unnecessary data transfer between the HDFS client and the data nodes.
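The binary search over the sorted LocalOffsets can be sketched directly; the function name and the ValueError behavior are assumptions for illustration.

```python
import bisect

def find_index_block(local_offsets: list, target_pos: int) -> int:
    """Given the sorted LocalOffset of each IndexBlock (the start position of
    its data in the original file), binary-search for the index of the
    IndexBlock covering target_pos, so only that block need be fetched."""
    i = bisect.bisect_right(local_offsets, target_pos) - 1
    if i < 0:
        raise ValueError("target position precedes the first block")
    return i
```

Only the one matching IndexBlock, rather than the whole index part, then has to be read from the DataNode.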
In the deduplication step, the HDFS client calls the meta-information service node to match the computed fingerprint values of the data blocks against the meta-information table in the database, finding the duplicate blocks. Fingerprint values that find no match are "new fingerprints": a block whose fingerprint matches nothing is proved unique, so the meta-information service node inserts its fingerprint into the meta-information table and returns it to the HDFS client.
The HDFS client deletes the duplicate data blocks according to the fingerprint values returned by the node and recombines the remaining blocks into the data part of the file. When the system deletes or renames files, the creation time of each file must be preserved, to distinguish files created at different points in time and to support the delete function. The system stores file creation times using Hadoop's extended-attribute feature: each written file receives an extended attribute whose key is user.createTime and whose value holds a 64-bit creation timestamp. When a file is deleted, moved, or renamed, it is moved under the /delete directory space: a directory whose path is "/delete + absolute file path + /file name" is created there, the file is renamed to the creation time from its extended attribute, and the file is moved into that directory. Naming the file by its creation timestamp distinguishes files deleted at different times. A concrete deletion example is shown in fig. 4: the system initially holds file1.txt; when the user deletes it, file1 is renamed to the timestamp 1584190386649 from its user.createTime attribute and moved under the /delete/user/file1.txt/ directory, so the operation does not interfere with later creation and deletion of files with the same name. When another file looks this file up by a data block's content address, the system first checks whether the file exists and whether its creation time matches; if the file does not exist, or the timestamp does not match, the file has been deleted, and the system then looks for the timestamp-named file under the "/delete + absolute file path + /file name" directory.
File deletion can be optimized further: if the file contains no new data blocks, i.e. all of its data is index part with no data part, the file can be deleted from HDFS directly without being moved into the /delete directory. File renaming, in which the user revises the file's path or name, uses HDFS's symbolic-link facility. Renaming is similar to deletion, with one extra step of creating a symbolic-link file: when the user renames a file, the system first performs the delete operation, then creates a new symbolic link under the new directory, pointing at the deleted file. For example, to rename /usr/local/file1.txt to /usr/local/data/file2.txt, the system first performs the delete operation, then creates the new directory /usr/local/data and creates the symbolic link file2.txt under it, pointing to /delete/usr/local/file1.txt; the renamed file on HDFS is then /usr/local/data/file2.txt (symlink -> /delete/usr/local/file1.txt).
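The path where a deleted file is parked follows directly from the scheme above; a sketch (the function name is hypothetical):

```python
def deleted_path(abs_path: str, create_time: int) -> str:
    """Per the deletion scheme described above: a deleted file is parked
    under /delete, in a directory named after its absolute path, and renamed
    to its creation timestamp so same-named files deleted at different
    times can coexist."""
    return f"/delete{abs_path}/{create_time}"
```

For the example in the text, deleting /user/file1.txt with creation time 1584190386649 parks it at /delete/user/file1.txt/1584190386649.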
After the client has deduplicated the file's data blocks, it creates an index file in HDFS, consisting of a data part and an index part. Because only the non-duplicate blocks of the original file are stored, the blocks are merged and stored sequentially, which makes it easy for the system to recover the file. In this index structure the data part holds the file's non-duplicate data blocks, and the index part holds the content addresses of all the original file's blocks. To recover file data correctly, the index part must contain the content address of every block together with its position in the original file; the content address is the information kept in the meta-information table, including the block's hash value and the path, offset, and size of the file holding it.
In the design of the index file, the front of the file is the data part and the rear is the index part. The resulting structure is shown in FIG. 5: the index data is appended directly behind the data, and an index entry may point either at data in other files or at data in the front part of the same file. With data and index sharing one file, the number of files is effectively reduced compared with the earlier design, avoiding the NameNode memory pressure caused by creating too many files. The data part holds the original file's non-duplicate blocks; the index part holds the content addresses of all its blocks. To recover file data correctly, the index part must contain every block's content address and its position in the original file; the content address is the information stored in the meta-information table, including the path, offset, size, file name, and creation time of the holding file.
The logical storage structure of the index file is shown in fig. 6. The DataBlock area stores the original file's non-duplicate data blocks, merged and stored sequentially. Each IndexBlock stores a data block's content address (without the block's hash value) and the block's position in the original file; the whole index part, except its final 8 bytes, is stored as character strings. The detailed index structure is shown in fig. 7. The MetaBlock is a fixed 8-byte number placed at the end of the index part, storing the position of the index part within the index file. In fig. 7, LocalOffset is the block's position in the original file, followed by the path, offset, and size of the file holding the block, taken from its content address and consistent with the meta-information table. Adjacent IndexBlocks are separated by the special character '\n', and the fields within an IndexBlock by the special separator '#'.
To construct the index file, the non-duplicate data blocks of the original file are merged and stored sequentially in the data part, while the index part stores each block's position in the original file, the block's content address, and the position of the index part within the index file; the generated index file is then stored on HDFS.
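The layout just described (data part, '#'-separated fields, '\n'-separated IndexBlocks, trailing 8-byte MetaBlock) can be sketched as a round-trip serializer. The function names and the big-endian encoding of the 8-byte MetaBlock are assumptions; the patent fixes only the 8-byte length and the separators.

```python
import struct

SEP, REC = "#", "\n"   # field and record separators named in the text

def build_index_file(data: bytes, blocks: list) -> bytes:
    """data: the merged non-duplicate blocks; blocks: tuples of
    (local_offset, path, offset, size) for every block of the original file.
    Returns data part + index part + 8-byte MetaBlock (index-part offset)."""
    index_str = REC.join(SEP.join(str(f) for f in b) for b in blocks)
    return data + index_str.encode() + struct.pack(">q", len(data))

def unpack_index(file_bytes: bytes) -> list:
    """Read the MetaBlock from the last 8 bytes, then parse the IndexBlocks."""
    index_off = struct.unpack(">q", file_bytes[-8:])[0]
    records = file_bytes[index_off:-8].decode().split(REC)
    return [tuple(r.split(SEP)) for r in records]
```

A round trip preserves every IndexBlock field (as strings, matching the character-string storage described above).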
In this embodiment, as shown in fig. 8, the file-writing flow under the index file is as follows: first, the system creates the file in HDFS and saves its creation time; next, the file data is split into blocks and deduplicated by comparison against the meta-information table; then the deduplicated data is written into the front data part of the index file, and the index data is written into the rear index part.
In this embodiment, as shown in fig. 9, the file-reading flow under the index file is as follows: the system opens the index file on HDFS and reads the index information in its rear half, obtaining the content address of each data block of the original file (path b_path, file name b_name, file offset b_offset, size b_size, and creation time), then reads the block data through that content address. When resolving b_name under b_path from the content address, three situations can arise: first, the file exists and its creation time equals the createTime saved in the content address, and the data is read directly; second, the file exists but its creation time differs from the content address, meaning the user deleted the file and created a new one with the same name; third, the file does not exist, meaning the user deleted it. In the latter two cases the system must look for the data under the deduplication system's deletion directory /delete, changing the file name to createTime and the search path to /delete/b_path/b_name.
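The three-case resolution above can be sketched as a small function. The names and the `stat` callback (which returns a path's creation time, or None if the path does not exist) are hypothetical stand-ins for the HDFS client calls.

```python
def resolve_block_path(b_path: str, b_name: str, b_ctime: int, stat) -> str:
    """Return the path from which a block's file should be read, per the
    three cases above. `stat` maps an HDFS path to its creation time,
    or None when the path does not exist (hypothetical helper)."""
    live = f"{b_path}/{b_name}"
    if stat(live) == b_ctime:
        return live                               # case 1: file present, timestamps match
    # cases 2 and 3: deleted (and possibly recreated) -> look under /delete,
    # where the file was renamed to its creation timestamp
    return f"/delete{b_path}/{b_name}/{b_ctime}"
```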
When the system recovers a file, it initializes an HDFS client to access the data on HDFS, then reads the last 8 bytes of the index file from HDFS to obtain the index-part offset IndexOffset; next it reads the index-part data indexData at offset IndexOffset, which contains a number of IndexBlocks; it then loops over indexData, reading each IndexBlock, goes to HDFS with the content address in each IndexBlock, and reads the block data blockData of the corresponding data block; finally, it reassembles the original file from the data blocks.
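The recovery loop above can be sketched end to end. The `read_block` callback is a hypothetical stand-in for the HDFS client fetching a block by content address, and the big-endian 8-byte offset is an assumption.

```python
import struct

def restore_file(index_file: bytes, read_block) -> bytes:
    """Read the trailing 8-byte IndexOffset, parse each IndexBlock
    (fields separated by '#', records by '\n'), fetch every referenced
    block via read_block(path, offset, size), and concatenate the blocks
    in original-file order."""
    index_off = struct.unpack(">q", index_file[-8:])[0]
    out = []
    for rec in index_file[index_off:-8].decode().split("\n"):
        local_off, path, offset, size = rec.split("#")
        out.append(read_block(path, int(offset), int(size)))
    return b"".join(out)
```

For a file whose index references the same 4-byte block twice, recovery yields the block's content twice, restoring the original data.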
In this embodiment, as shown in fig. 10, a distributed file deduplication method comprises the following steps:
file reading: the HDFS client reads the file to be deduplicated (random access is supported), checks attribute information such as its file permissions, opens it if no error is found, and stores the file-open information in the meta-information table;
data segmentation: the HDFS client invokes a content-based chunking algorithm, in parallel, to split the opened file data into a plurality of data blocks;
fingerprint computation: the HDFS client computes the fingerprint value of each data block with a hash function;
data-block deduplication: using the computed fingerprints, the HDFS client queries, through the meta-information service node, the fingerprint values of all data blocks of the HDFS system stored in the meta-information table, compares against them, removes the duplicate blocks, and writes the fingerprints of the remaining blocks into the meta-information table of the HDFS client database;
file storage: the HDFS client recombines the remaining data blocks and the index data into a new index file, and stores the index file on HDFS.
The foregoing shows and describes the general principles, principal features, and advantages of the present invention. It will be understood by those skilled in the art that the invention is not limited to the embodiments described above, which, together with the specification, merely illustrate its principle; various changes and modifications may be made without departing from the spirit and scope of the invention, and such changes and modifications fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and their equivalents.

Claims (5)

1. A distributed file deduplication system, comprising
The meta-information service node is used for managing the content address of the data block;
the meta information table is used for storing the content addresses of all data blocks in the HDFS system;
the HDFS comprises at least one HDFS client and at least one NameNode node, wherein the HDFS client comprises a meta-information service node and a meta-information table, a duplicate removal file is written in the HDFS client, the HDFS client divides the duplicate removal file into a plurality of data blocks, calculates a fingerprint value of each data block, calls the meta-information service node to inquire the meta-information table, removes repeated data blocks, recombines the rest data blocks and index data to generate a new index file, interacts with the NameNode node to store the index file on the HDFS, and stores the newly generated data fingerprint in the meta-information table of a database of the HDFS client.
2. The distributed file deduplication system of claim 1, wherein the content address of a data block comprises the block's fingerprint value, its reference count, a file path and name, the block's offset within the file, the block's size, and the file creation time.
3. The distributed file deduplication system of claim 1, wherein the database is a MySQL database, and the transaction isolation level is set to read committed, so as to ensure that the deduplicated data block fingerprint value can be written into the meta-information table concurrently.
4. The distributed file deduplication system of claim 1, wherein an id value is added to the meta-information table to replace the hash value as its primary key, so as to avoid primary-key conflicts when files are written concurrently.
5. A distributed file deduplication method, comprising the following steps:
file reading: an HDFS client reads any file to be deduplicated, checks its attribute information such as file permissions, opens the file once no error is found, and stores the file-open information in a meta-information table;
data segmentation: the HDFS client invokes a content-based chunking algorithm in parallel to split the opened file into a plurality of data blocks;
fingerprint calculation: the HDFS client computes the fingerprint value of each data block with a hash function;
data block deduplication: using the computed fingerprint values, the HDFS client queries the meta-information table through the meta-information service node, compares the computed values against the fingerprint values of all data blocks already in the HDFS, removes the duplicate data blocks, and writes the fingerprint values of the remaining data blocks into the meta-information table of the HDFS client's database;
file storage: the HDFS client recombines the remaining data blocks and the index data into a new index file and stores the index file on the HDFS.
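The steps of claim 5 can be sketched end to end. The rolling-hash chunker below is only a stand-in for the unnamed content-based segmentation algorithm, SHA-256 stands in for the unspecified hash function, and an in-memory dict plays the role of the meta-information table:

```python
import hashlib

def chunk_content_defined(data: bytes, boundary_bits: int = 10,
                          min_size: int = 256, max_size: int = 8192):
    """Split data at content-defined boundaries: a chunk ends where the
    low bits of a simple rolling hash are all ones (or at max_size)."""
    chunks, start, h = [], 0, 0
    mask = (1 << boundary_bits) - 1
    for i, b in enumerate(data):
        h = ((h << 1) + b) & 0xFFFFFFFF
        size = i - start + 1
        if (size >= min_size and (h & mask) == mask) or size >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])   # final partial chunk
    return chunks

def dedup_store(chunks, meta_table):
    """meta_table maps fingerprint -> block content. Returns the index
    (ordered fingerprint list) and the blocks that were actually new."""
    index, new_chunks = [], []
    for c in chunks:
        fp = hashlib.sha256(c).hexdigest()   # fingerprint calculation
        index.append(fp)
        if fp not in meta_table:             # duplicate block: dropped
            meta_table[fp] = c
            new_chunks.append(c)
    return index, new_chunks
```

Writing the same file twice stores no new blocks the second time, while the index file (the ordered fingerprint list) still reconstructs the full content.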
CN202010362251.XA 2020-04-30 2020-04-30 Distributed file repeated data deleting system and method Active CN111522791B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010362251.XA CN111522791B (en) 2020-04-30 2020-04-30 Distributed file repeated data deleting system and method

Publications (2)

Publication Number Publication Date
CN111522791A true CN111522791A (en) 2020-08-11
CN111522791B CN111522791B (en) 2023-05-30

Family

ID=71908355

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010362251.XA Active CN111522791B (en) 2020-04-30 2020-04-30 Distributed file repeated data deleting system and method

Country Status (1)

Country Link
CN (1) CN111522791B (en)

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8402250B1 (en) * 2010-02-03 2013-03-19 Applied Micro Circuits Corporation Distributed file system with client-side deduplication capacity
CN102456059A (en) * 2010-10-21 2012-05-16 英业达股份有限公司 Data deduplication processing system
CN103051671A (en) * 2012-11-22 2013-04-17 浪潮电子信息产业股份有限公司 Repeating data deletion method for cluster file system
CN103177111A (en) * 2013-03-29 2013-06-26 西安理工大学 System and method for deleting repeating data
CN103530201A (en) * 2013-07-17 2014-01-22 华中科技大学 Safety data repetition removing method and system applicable to backup system
CN103763362A (en) * 2014-01-13 2014-04-30 西安电子科技大学 Safe distributed duplicated data deletion method
CN103914522A (en) * 2014-03-20 2014-07-09 电子科技大学 Data block merging method applied to deleting duplicated data in cloud storage
CN104408111A (en) * 2014-11-24 2015-03-11 浙江宇视科技有限公司 Method and device for deleting duplicate data
EP3360033A1 (en) * 2015-10-07 2018-08-15 NEC Laboratories Europe GmbH Method for storing a data file
CN106294826A (en) * 2016-08-17 2017-01-04 北京北信源软件股份有限公司 A kind of company-data Query method in real time and system
CN106446099A (en) * 2016-09-13 2017-02-22 国家超级计算深圳中心(深圳云计算中心) Distributed cloud storage method and system and uploading and downloading method thereof
CN106649676A (en) * 2016-12-15 2017-05-10 北京锐安科技有限公司 Duplication eliminating method and device based on HDFS storage file
CN106940715A (en) * 2017-03-09 2017-07-11 星环信息科技(上海)有限公司 A kind of method and apparatus of the inquiry based on concordance list

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
付印金 et al.: "Research Progress on Key Technologies of Data Deduplication", Journal of Computer Research and Development *
俞善海: "Research on Data Deduplication Technology Based on Hadoop", China Master's Theses Full-text Database, Information Science and Technology *
刘青 et al.: "A Distributed Deduplication Storage System Based on the Hadoop Platform"
王瀚: "Design and Implementation of a Data Deduplication Engine for Backup and Disaster-Recovery Systems", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022048475A1 (en) * 2020-09-03 2022-03-10 中兴通讯股份有限公司 Data deduplication method, node, and computer readable storage medium
CN116107979A (en) * 2023-04-14 2023-05-12 大熊集团有限公司 Data distributed reading method and system
CN117194490A (en) * 2023-11-07 2023-12-08 长春金融高等专科学校 Financial big data storage query method based on artificial intelligence
CN117194490B (en) * 2023-11-07 2024-04-05 长春金融高等专科学校 Financial big data storage query method based on artificial intelligence

Also Published As

Publication number Publication date
CN111522791B (en) 2023-05-30

Similar Documents

Publication Publication Date Title
US7228299B1 (en) System and method for performing file lookups based on tags
US9922046B2 (en) Scalable distributed metadata file-system using key-value stores
US7752226B1 (en) Reverse pathname lookup by inode identifier
CN102629247B (en) Method, device and system for data processing
US9208031B2 (en) Log structured content addressable deduplicating storage
US8849759B2 (en) Unified local storage supporting file and cloud object access
US8200633B2 (en) Database backup and restore with integrated index reorganization
US8527556B2 (en) Systems and methods to update a content store associated with a search index
US7418544B2 (en) Method and system for log structured relational database objects
US8560500B2 (en) Method and system for removing rows from directory tables
US20120284317A1 (en) Scalable Distributed Metadata File System using Key-Value Stores
US7467163B1 (en) System and method to manipulate large objects on enterprise server data management system
CN111522791B (en) Distributed file repeated data deleting system and method
US7054887B2 (en) Method and system for object replication in a content management system
US7769719B2 (en) File system dump/restore by node numbering
US20150269213A1 (en) Compacting change logs using file content location identifiers
US11960363B2 (en) Write optimized, distributed, scalable indexing store
JP2013541057A (en) Map Reduce Instant Distributed File System
US20200097558A1 (en) System and method for bulk removal of records in a database
US11403024B2 (en) Efficient restoration of content
US20180276267A1 (en) Methods and system for efficiently performing eventual and transactional edits on distributed metadata in an object storage system
US20230394010A1 (en) File system metadata deduplication
US20220083504A1 (en) Managing snapshotting of a dataset using an ordered set of b+ trees
WO2020192663A1 (en) Data management method and related device
US20220222146A1 (en) Versioned backup on an object addressable storage system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant