CN111522791A - Distributed file deduplication system and method - Google Patents

Distributed file deduplication system and method

Info

Publication number
CN111522791A
Authority
CN
China
Prior art keywords
file
data
meta
hdfs
data blocks
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010362251.XA
Other languages
Chinese (zh)
Other versions
CN111522791B (en)
Inventor
侯孟书
周立康
许佳欣
詹思瑜
周世杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China
Priority to CN202010362251.XA
Publication of CN111522791A
Application granted
Publication of CN111522791B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/174 Redundancy elimination performed by the file system
    • G06F16/1748 De-duplication implemented within the file system, e.g. based on file segments
    • G06F16/1752 De-duplication implemented within the file system based on file chunks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a distributed file deduplication system and method. The system comprises a meta-information service node, which manages the content addresses of data blocks; a meta-information table, which stores the content addresses of all data blocks in the HDFS system; and at least one HDFS client, each of which carries a meta-information service node and a meta-information table. When a file to be deduplicated is written, the HDFS client splits it into a plurality of data blocks, computes a fingerprint value for each block, calls the meta-information service node to query the meta-information table, removes the duplicate blocks, recombines the remaining blocks with the index data into a new index file, interacts with the NameNode to store the index file on HDFS, and saves the newly generated data fingerprints in the meta-information table of the client's database. With this system and method, an HDFS client can quickly complete both deduplication and distributed storage of a file.

Description

Distributed file deduplication system and method
Technical Field
The invention relates to the technical field of data deduplication, and in particular to a distributed file deduplication system and method.
Background
When Hadoop processes certain kinds of data, redundant copies within that data reduce the storage efficiency of the system and waste storage resources. Deduplication technology can effectively identify duplicate files or data blocks in the system, save storage space, and improve the effective utilization of system resources. Hadoop is the mainstream development platform in today's big-data field; applying deduplication technology to the Hadoop platform would therefore effectively promote the development of big data.
At present, deduplication designs for Hadoop do exist and have attracted much attention, but their analysis and design do not account for certain characteristics of Hadoop, which makes them poorly suited to deployment on the platform. The main drawbacks of current designs are:
1. only a single server can interact with the HDFS, so that server becomes a system bottleneck and the advantages of a distributed system are not fully exploited;
2. the client only provides file download and upload, and lacks streaming data access and storage;
3. they are incompatible with Hadoop's abstract file system, so application programs on Hadoop, such as MapReduce jobs, cannot be used directly, which severely limits the applicability of the deduplication system;
4. failure recovery is not considered: HBase and Redis offer neither isolation levels nor rollback, and are therefore unsuitable for managing the meta information.
Disclosure of Invention
The object of the invention is to overcome the defects of the prior art and provide a distributed file system with deduplication: the HDFS client performs file chunking, fingerprint computation, and deduplication, then recombines the deduplicated data blocks with the index data into a new index file that is written to the HDFS system.
The object of the invention is achieved through the following technical scheme:
a distributed file data de-duplication system comprises a meta-information service node, a meta-information table and at least one HDFS client. The meta information service node and the meta information table are provided on the HDFS client.
The meta-information service node is used for managing the content address of the data block; and the meta information table is used for storing the content addresses of all data blocks in the HDFS system.
When a file to be deduplicated is written to the HDFS client, the client reads the file (random access is supported), splits it into a plurality of data blocks, computes a fingerprint value for each block, calls the meta-information service node to query the meta-information table, removes the duplicate blocks, recombines the remaining blocks with the index data into a new index file, interacts with the NameNode to store the index file on HDFS, and saves the newly generated data fingerprints in the meta-information table of the database.
Specifically, the content address of a data block comprises the block's fingerprint value, its reference count, the path and name of the file holding it, the block's offset within that file, the block's size, and the file creation time.
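The fields of a content address can be sketched as a simple record; the field names below are illustrative, since the patent lists the fields but not their names.

```python
from dataclasses import dataclass

@dataclass
class BlockAddress:
    """Content address of a data block, per the field list above
    (names are assumptions, not taken from the patent)."""
    fingerprint: str   # hash digest of the block's content
    ref_count: int     # number of files referencing this block
    file_path: str     # path and name of the index file holding the block
    offset: int        # byte offset of the block within that file
    size: int          # block size in bytes
    create_time: int   # creation timestamp of the holding file
```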
Specifically, the HDFS client database is a MySQL database, and its transaction isolation level is set to read committed, to ensure that the fingerprint values of deduplicated data blocks can be written to the meta-information table concurrently.
Specifically, an id value is added to the meta-information table to replace the hash value as its primary key, so as to avoid primary-key conflicts when files are written concurrently.
A distributed file deduplication method comprises the following steps:
file reading: the HDFS client reads the file to be deduplicated (random access is supported), checks attribute information such as its file permissions, opens it if no error is found, and stores the file-open information in the meta-information table;
data segmentation: the HDFS client invokes a content-based chunking algorithm, in parallel, to split the opened file data into a plurality of data blocks;
fingerprint computation: the HDFS client computes the fingerprint value of each data block with a hash function;
data-block deduplication: using the computed fingerprints, the HDFS client queries, through the meta-information service node, the fingerprint values of all data blocks of the HDFS system stored in the meta-information table, compares against them, removes the duplicate blocks, and writes the fingerprints of the remaining blocks into the meta-information table of the HDFS client database;
file storage: the HDFS client recombines the remaining data blocks and the index data into a new index file, and stores the index file on HDFS.
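The pipeline of these steps can be sketched in a few lines. This is a minimal in-memory illustration, not the patented implementation: fixed-size chunking stands in for the content-based algorithm, a dict stands in for the MySQL meta-information table, and the function name is hypothetical.

```python
import hashlib

def dedup_write(data: bytes, meta_table: dict, chunk_size: int = 4) -> list:
    """Split data into chunks, fingerprint each, and keep only unique chunks
    in meta_table; return the fingerprint index in original order."""
    index = []                          # fingerprint of every chunk, in order
    for off in range(0, len(data), chunk_size):
        chunk = data[off:off + chunk_size]
        fp = hashlib.md5(chunk).hexdigest()
        if fp not in meta_table:        # unique block: store it as a "new fingerprint"
            meta_table[fp] = chunk
        index.append(fp)                # duplicates contribute only an index entry
    return index
```

With `b"abcdabcd"` and a 4-byte chunk, only one block is stored while the index records both occurrences.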
The invention has the beneficial effects that:
1. in addition to file download and upload, streaming data access and storage are supported;
2. the system is compatible with Hadoop's abstract file system, so application programs on Hadoop can be used directly;
3. it has failure recovery, and can interact with multiple clients simultaneously.
Drawings
FIG. 1 is a system block diagram of the present invention.
FIG. 2 is a diagram of a random read embodiment of the present invention.
FIG. 3 is a diagram of the MetaBlock logic architecture of the present invention.
Fig. 4 is a diagram of a file deletion embodiment of the present invention.
FIG. 5 is a diagram illustrating the structure of an index file according to the present invention.
FIG. 6 is a diagram of a logical storage structure of an index file according to the present invention.
FIG. 7 is a logical block diagram of the index portion of the index file of the present invention.
FIG. 8 is a flow chart of the writing of the index file according to the present invention.
FIG. 9 is a flowchart of the reading of the index file according to the present invention.
FIG. 10 is a flow chart of a method of the present invention.
Detailed Description
In order to more clearly understand the technical features, objects, and effects of the present invention, embodiments of the present invention will now be described with reference to the accompanying drawings.
In this embodiment, as shown in fig. 1, a distributed file deduplication system includes a meta-information service node, a meta-information table, and at least one HDFS client; the meta-information service node and the meta-information table reside on each HDFS client.
The meta-information service node manages the content addresses of data blocks; a content address comprises the block's fingerprint value, its reference count, the path and name of the file holding it, the block's offset within that file, the block's size, and the file creation time. The node exposes an interface, BlockMeta, covering insertion, update, and query of content addresses. Its main operations are: putBlock, which inserts a block's content address, the key information being the block's MD5 fingerprint digest and the content address; getBlock, which returns the content address stored in the meta-information table for a given MD5 digest; removeBlock, which deletes a content address; isExist, which tests whether a block is present; commit, which, after a file has been written successfully, commits the transaction and atomically increments the reference count of every content address in the index list; and abort, which rolls back the meta-information database transaction if an error occurs while the file is being written.
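The BlockMeta operations named above can be sketched against a relational table. This is an illustrative sketch only: it uses an in-memory SQLite database where the patent uses MySQL, and the method signatures and column names are assumptions.

```python
import sqlite3

class BlockMeta:
    """Sketch of the BlockMeta interface (signatures assumed, not from the patent)."""

    def __init__(self):
        self.db = sqlite3.connect(":memory:")
        self.db.execute("""CREATE TABLE meta (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            hash TEXT, path TEXT, offset INTEGER, size INTEGER,
            ctime INTEGER, refs INTEGER DEFAULT 0)""")

    def putBlock(self, md5, path, offset, size, ctime):
        # insert a block's content address, keyed by its MD5 digest
        self.db.execute("INSERT INTO meta (hash, path, offset, size, ctime) "
                        "VALUES (?, ?, ?, ?, ?)", (md5, path, offset, size, ctime))

    def getBlock(self, md5):
        # fetch the content address for a given fingerprint digest
        return self.db.execute(
            "SELECT path, offset, size, ctime FROM meta WHERE hash = ?",
            (md5,)).fetchone()

    def isExist(self, md5):
        return self.getBlock(md5) is not None

    def removeBlock(self, md5):
        self.db.execute("DELETE FROM meta WHERE hash = ?", (md5,))

    def commit(self, md5_list):
        # after a successful file write: atomically bump each referenced
        # block's reference count, then commit the transaction
        self.db.executemany("UPDATE meta SET refs = refs + 1 WHERE hash = ?",
                            [(m,) for m in md5_list])
        self.db.commit()

    def abort(self):
        # roll back the meta-information transaction on a failed write
        self.db.rollback()
```

Note the `id` autoincrement primary key: duplicate hash values are permitted, matching the design choice described for the meta-information table.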
The meta-information table stores the content addresses of all data blocks in the HDFS system. It has 7 fields: primary key id, hash value, file path and name, block offset within the file, block size, file creation time, and reference count. In a deduplication system the meta-information table must handle concurrent file updates, chiefly the storage problem posed by concurrent writes: concurrent file writes are concurrent transactions on the table, so an isolation level must be chosen for the database that stores it. The common transaction isolation levels are read uncommitted, read committed, repeatable read, and serializable. To keep the system stable under concurrent writes, the meta-information table is placed in a MySQL database whose transaction isolation level is set to read committed. In addition, the hash value is no longer used as the primary key; a separate id column serves as the primary key, and duplicate hash values, i.e. duplicate blocks, are permitted. This sacrifices a small amount of the system's disk space to reduce the risk of write failures during concurrent file writes, improving the stability of the system.
During a file write, the client must split the data stream of the file into blocks and deduplicate at block granularity. In general, the finer the split, the more likely duplicate data is found and the higher the deduplication ratio; the coarser the split, the lower the likelihood of finding duplicates and the lower the ratio. The system splits file data with a content-based chunking algorithm, using MIN and MAX bounds to guarantee minimum and maximum block sizes, preventing blocks from being too small or too large so that the system achieves the best deduplication ratio. After splitting succeeds, a hash function computes a fingerprint value for each block, which is used to remove duplicate blocks.
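Content-based chunking with MIN/MAX bounds can be sketched as follows. The patent does not specify the rolling hash or the bound values, so the boundary test, mask, and sizes below are assumptions for illustration.

```python
def cdc_chunks(data: bytes, min_size: int = 2048, max_size: int = 8192,
               mask: int = 0x1FFF) -> list:
    """Content-defined chunking sketch: cut where a rolling hash matches a
    boundary pattern, but never below min_size and never above max_size."""
    chunks, start, acc = [], 0, 0
    for i, b in enumerate(data):
        acc = (acc * 31 + b) & 0xFFFFFFFF   # toy rolling hash (assumption)
        length = i - start + 1
        # cut at a content-defined boundary once MIN is reached,
        # or force a cut at MAX so no block grows unboundedly
        if (length >= min_size and (acc & mask) == 0) or length >= max_size:
            chunks.append(data[start:i + 1])
            start, acc = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])          # final tail may be shorter than MIN
    return chunks
```

Every emitted chunk except the tail lies within [MIN, MAX], and the chunks concatenate back to the original data.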
When deleting duplicate data, the HDFS client first reads the input file with random access, implemented mainly through the file's input stream: the system does not recover all of the file's data up front, but dynamically recovers the required data according to the file position the user accesses and the amount of data accessed. A preferred embodiment of the random-read implementation is shown in fig. 2. A Bitmap area is added to the MetaBlock area of the file's index part. The Bitmap is an index over the IndexBlocks, storing the position of each IndexBlock ordered by LocalOffset, which effectively reduces the amount of data transmitted from the DataNodes during random reads; the MetaBlock logical structure with the added Bitmap is shown in fig. 3. During a random read the system does not read all the IndexBlocks: it reads the Bitmap information first, and because the LocalOffsets are laid out in fixed order, a binary search quickly locates the IndexBlock holding the content address of the target position. The data of the relevant block is then read through that content address, effectively avoiding unnecessary data transfer between the HDFS client and the data nodes.
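The binary search over the sorted LocalOffsets can be sketched directly; the function name and the ValueError behavior are assumptions for illustration.

```python
import bisect

def find_index_block(local_offsets: list, target_pos: int) -> int:
    """Given the sorted LocalOffset of each IndexBlock (the start position of
    its data in the original file), binary-search for the index of the
    IndexBlock covering target_pos, so only that block need be fetched."""
    i = bisect.bisect_right(local_offsets, target_pos) - 1
    if i < 0:
        raise ValueError("target position precedes the first block")
    return i
```

Only the one matching IndexBlock, rather than the whole index part, then has to be read from the DataNode.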
In the deduplication step, the HDFS client calls the meta-information service node to match the computed fingerprint values of the data blocks against the meta-information table in the database, finding the duplicate blocks. Fingerprint values that find no match are "new fingerprints": a block whose fingerprint matches nothing is proved unique, so the meta-information service node inserts its fingerprint into the meta-information table and returns it to the HDFS client.
The HDFS client deletes the duplicate data blocks according to the fingerprint values returned by the node and recombines the remaining blocks into the data part of the file. When the system deletes or renames files, the creation time of each file must be preserved, to distinguish files created at different points in time and to support the delete function. The system stores file creation times using Hadoop's extended-attribute feature: each written file receives an extended attribute whose key is user.createTime and whose value holds a 64-bit creation timestamp. When a file is deleted, moved, or renamed, it is moved under the /delete directory space: a directory whose path is "/delete + absolute file path + /file name" is created there, the file is renamed to the creation time from its extended attribute, and the file is moved into that directory. Naming the file by its creation timestamp distinguishes files deleted at different times. A concrete deletion example is shown in fig. 4: the system initially holds file1.txt; when the user deletes it, file1 is renamed to the timestamp 1584190386649 from its user.createTime attribute and moved under the /delete/user/file1.txt/ directory, so the operation does not interfere with later creation and deletion of files with the same name. When another file looks this file up by a data block's content address, the system first checks whether the file exists and whether its creation time matches; if the file does not exist, or the timestamp does not match, the file has been deleted, and the system then looks for the timestamp-named file under the "/delete + absolute file path + /file name" directory.
File deletion can be optimized further: if the file contains no new data blocks, i.e. all of its data is index part with no data part, the file can be deleted from HDFS directly without being moved into the /delete directory. File renaming, in which the user revises the file's path or name, uses HDFS's symbolic-link facility. Renaming is similar to deletion, with one extra step of creating a symbolic-link file: when the user renames a file, the system first performs the delete operation, then creates a new symbolic link under the new directory, pointing at the deleted file. For example, to rename /usr/local/file1.txt to /usr/local/data/file2.txt, the system first performs the delete operation, then creates the new directory /usr/local/data and creates the symbolic link file2.txt under it, pointing to /delete/usr/local/file1.txt; the renamed file on HDFS is then /usr/local/data/file2.txt (symlink -> /delete/usr/local/file1.txt).
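The path where a deleted file is parked follows directly from the scheme above; a sketch (the function name is hypothetical):

```python
def deleted_path(abs_path: str, create_time: int) -> str:
    """Per the deletion scheme described above: a deleted file is parked
    under /delete, in a directory named after its absolute path, and renamed
    to its creation timestamp so same-named files deleted at different
    times can coexist."""
    return f"/delete{abs_path}/{create_time}"
```

For the example in the text, deleting /user/file1.txt with creation time 1584190386649 parks it at /delete/user/file1.txt/1584190386649.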
After the client has deduplicated the file's data blocks, it creates an index file in HDFS, consisting of a data part and an index part. Because only the non-duplicate blocks of the original file are stored, the blocks are merged and stored sequentially, which makes it easy for the system to recover the file. In this index structure the data part holds the file's non-duplicate data blocks, and the index part holds the content addresses of all the original file's blocks. To recover file data correctly, the index part must contain the content address of every block together with its position in the original file; the content address is the information kept in the meta-information table, including the block's hash value and the path, offset, and size of the file holding it.
In the design of the index file, the front of the file is the data part and the rear is the index part. The resulting structure is shown in FIG. 5: the index data is appended directly behind the data, and an index entry may point either at data in other files or at data in the front part of the same file. With data and index sharing one file, the number of files is effectively reduced compared with the earlier design, avoiding the NameNode memory pressure caused by creating too many files. The data part holds the original file's non-duplicate blocks; the index part holds the content addresses of all its blocks. To recover file data correctly, the index part must contain every block's content address and its position in the original file; the content address is the information stored in the meta-information table, including the path, offset, size, file name, and creation time of the holding file.
The logical storage structure of the index file is shown in fig. 6. The DataBlock area stores the original file's non-duplicate data blocks, merged and stored sequentially. Each IndexBlock stores a data block's content address (without the block's hash value) and the block's position in the original file; the whole index part, except its final 8 bytes, is stored as character strings. The detailed index structure is shown in fig. 7. The MetaBlock is a fixed 8-byte number placed at the end of the index part, storing the position of the index part within the index file. In fig. 7, LocalOffset is the block's position in the original file, followed by the path, offset, and size of the file holding the block, taken from its content address and consistent with the meta-information table. Adjacent IndexBlocks are separated by the special character '\n', and the fields within an IndexBlock by the special separator '#'.
To construct the index file, the non-duplicate data blocks of the original file are merged and stored sequentially in the data part, while the index part stores each block's position in the original file, the block's content address, and the position of the index part within the index file; the generated index file is then stored on HDFS.
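The layout just described (data part, '#'-separated fields, '\n'-separated IndexBlocks, trailing 8-byte MetaBlock) can be sketched as a round-trip serializer. The function names and the big-endian encoding of the 8-byte MetaBlock are assumptions; the patent fixes only the 8-byte length and the separators.

```python
import struct

SEP, REC = "#", "\n"   # field and record separators named in the text

def build_index_file(data: bytes, blocks: list) -> bytes:
    """data: the merged non-duplicate blocks; blocks: tuples of
    (local_offset, path, offset, size) for every block of the original file.
    Returns data part + index part + 8-byte MetaBlock (index-part offset)."""
    index_str = REC.join(SEP.join(str(f) for f in b) for b in blocks)
    return data + index_str.encode() + struct.pack(">q", len(data))

def unpack_index(file_bytes: bytes) -> list:
    """Read the MetaBlock from the last 8 bytes, then parse the IndexBlocks."""
    index_off = struct.unpack(">q", file_bytes[-8:])[0]
    records = file_bytes[index_off:-8].decode().split(REC)
    return [tuple(r.split(SEP)) for r in records]
```

A round trip preserves every IndexBlock field (as strings, matching the character-string storage described above).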
In this embodiment, as shown in fig. 8, the file-writing flow under the index file is as follows: first, the system creates the file in HDFS and saves its creation time; next, the file data is split into blocks and deduplicated by comparison against the meta-information table; then the deduplicated data is written into the front data part of the index file, and the index data is written into the rear index part.
In this embodiment, as shown in fig. 9, the file-reading flow under the index file is as follows: the system opens the index file on HDFS and reads the index information in its rear half, obtaining the content address of each data block of the original file (path b_path, file name b_name, file offset b_offset, size b_size, and creation time), then reads the block data through that content address. When resolving b_name under b_path from the content address, three situations can arise: first, the file exists and its creation time equals the createTime saved in the content address, and the data is read directly; second, the file exists but its creation time differs from the content address, meaning the user deleted the file and created a new one with the same name; third, the file does not exist, meaning the user deleted it. In the latter two cases the system must look for the data under the deduplication system's deletion directory /delete, changing the file name to createTime and the search path to /delete/b_path/b_name.
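The three-case resolution above can be sketched as a small function. The names and the `stat` callback (which returns a path's creation time, or None if the path does not exist) are hypothetical stand-ins for the HDFS client calls.

```python
def resolve_block_path(b_path: str, b_name: str, b_ctime: int, stat) -> str:
    """Return the path from which a block's file should be read, per the
    three cases above. `stat` maps an HDFS path to its creation time,
    or None when the path does not exist (hypothetical helper)."""
    live = f"{b_path}/{b_name}"
    if stat(live) == b_ctime:
        return live                               # case 1: file present, timestamps match
    # cases 2 and 3: deleted (and possibly recreated) -> look under /delete,
    # where the file was renamed to its creation timestamp
    return f"/delete{b_path}/{b_name}/{b_ctime}"
```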
When the system recovers a file, it initializes an HDFS client to access the data on HDFS, then reads the last 8 bytes of the index file from HDFS to obtain the index-part offset IndexOffset; next it reads the index-part data indexData at offset IndexOffset, which contains a number of IndexBlocks; it then loops over indexData, reading each IndexBlock, goes to HDFS with the content address in each IndexBlock, and reads the block data blockData of the corresponding data block; finally, it reassembles the original file from the data blocks.
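The recovery loop above can be sketched end to end. The `read_block` callback is a hypothetical stand-in for the HDFS client fetching a block by content address, and the big-endian 8-byte offset is an assumption.

```python
import struct

def restore_file(index_file: bytes, read_block) -> bytes:
    """Read the trailing 8-byte IndexOffset, parse each IndexBlock
    (fields separated by '#', records by '\n'), fetch every referenced
    block via read_block(path, offset, size), and concatenate the blocks
    in original-file order."""
    index_off = struct.unpack(">q", index_file[-8:])[0]
    out = []
    for rec in index_file[index_off:-8].decode().split("\n"):
        local_off, path, offset, size = rec.split("#")
        out.append(read_block(path, int(offset), int(size)))
    return b"".join(out)
```

For a file whose index references the same 4-byte block twice, recovery yields the block's content twice, restoring the original data.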
In this embodiment, as shown in fig. 10, a distributed file deduplication method comprises the following steps:
file reading: the HDFS client reads the file to be deduplicated (random access is supported), checks attribute information such as its file permissions, opens it if no error is found, and stores the file-open information in the meta-information table;
data segmentation: the HDFS client invokes a content-based chunking algorithm, in parallel, to split the opened file data into a plurality of data blocks;
fingerprint computation: the HDFS client computes the fingerprint value of each data block with a hash function;
data-block deduplication: using the computed fingerprints, the HDFS client queries, through the meta-information service node, the fingerprint values of all data blocks of the HDFS system stored in the meta-information table, compares against them, removes the duplicate blocks, and writes the fingerprints of the remaining blocks into the meta-information table of the HDFS client database;
file storage: the HDFS client recombines the remaining data blocks and the index data into a new index file, and stores the index file on HDFS.
The foregoing shows and describes the general principles, principal features, and advantages of the present invention. It will be understood by those skilled in the art that the invention is not limited to the embodiments described above, which, together with the specification, merely illustrate its principle; various changes and modifications may be made without departing from the spirit and scope of the invention, and such changes and modifications fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and their equivalents.

Claims (5)

1. A distributed file deduplication system, comprising
The meta-information service node is used for managing the content address of the data block;
the meta information table is used for storing the content addresses of all data blocks in the HDFS system;
the HDFS comprises at least one HDFS client and at least one NameNode node, wherein the HDFS client comprises a meta-information service node and a meta-information table, a duplicate removal file is written in the HDFS client, the HDFS client divides the duplicate removal file into a plurality of data blocks, calculates a fingerprint value of each data block, calls the meta-information service node to inquire the meta-information table, removes repeated data blocks, recombines the rest data blocks and index data to generate a new index file, interacts with the NameNode node to store the index file on the HDFS, and stores the newly generated data fingerprint in the meta-information table of a database of the HDFS client.
2. The distributed file deduplication system of claim 1, wherein the content address of a data block comprises the block's fingerprint value, its reference count, a file path and name, the block's offset within the file, the block's size, and the file creation time.
3. The distributed file deduplication system of claim 1, wherein the database is a MySQL database, and the transaction isolation level is set to read committed, so as to ensure that the deduplicated data block fingerprint value can be written into the meta-information table concurrently.
4. The distributed file deduplication system of claim 1, wherein an id value is added to the meta-information table to replace the hash value as its primary key, so as to avoid primary-key conflicts when files are written concurrently.
5. A distributed file deduplication method, comprising the following steps:
file reading: an HDFS client reads any file to be deduplicated, checks its attribute information such as file permissions, opens the file once no error is found, and stores the file-open information in a meta-information table;
data segmentation: the HDFS client invokes a content-based chunking algorithm in parallel to split the opened file into a plurality of data blocks;
fingerprint calculation: the HDFS client computes the fingerprint value of each data block with a hash function;
data block deduplication: using the computed fingerprint values, the HDFS client queries the meta-information table through the meta-information service node, compares the computed values against the fingerprint values of all data blocks already in the HDFS, removes the duplicate data blocks, and writes the fingerprint values of the remaining data blocks into the meta-information table of the HDFS client's database;
file storage: the HDFS client recombines the remaining data blocks and the index data into a new index file and stores the index file on the HDFS.
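The steps of claim 5 can be sketched end to end. The rolling-hash chunker below is only a stand-in for the unnamed content-based segmentation algorithm, SHA-256 stands in for the unspecified hash function, and an in-memory dict plays the role of the meta-information table:

```python
import hashlib

def chunk_content_defined(data: bytes, boundary_bits: int = 10,
                          min_size: int = 256, max_size: int = 8192):
    """Split data at content-defined boundaries: a chunk ends where the
    low bits of a simple rolling hash are all ones (or at max_size)."""
    chunks, start, h = [], 0, 0
    mask = (1 << boundary_bits) - 1
    for i, b in enumerate(data):
        h = ((h << 1) + b) & 0xFFFFFFFF
        size = i - start + 1
        if (size >= min_size and (h & mask) == mask) or size >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])   # final partial chunk
    return chunks

def dedup_store(chunks, meta_table):
    """meta_table maps fingerprint -> block content. Returns the index
    (ordered fingerprint list) and the blocks that were actually new."""
    index, new_chunks = [], []
    for c in chunks:
        fp = hashlib.sha256(c).hexdigest()   # fingerprint calculation
        index.append(fp)
        if fp not in meta_table:             # duplicate block: dropped
            meta_table[fp] = c
            new_chunks.append(c)
    return index, new_chunks
```

Writing the same file twice stores no new blocks the second time, while the index file (the ordered fingerprint list) still reconstructs the full content.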
CN202010362251.XA 2020-04-30 2020-04-30 Distributed file repeated data deleting system and method Active CN111522791B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010362251.XA CN111522791B (en) 2020-04-30 2020-04-30 Distributed file repeated data deleting system and method

Publications (2)

Publication Number Publication Date
CN111522791A true CN111522791A (en) 2020-08-11
CN111522791B CN111522791B (en) 2023-05-30

Family

ID=71908355

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010362251.XA Active CN111522791B (en) 2020-04-30 2020-04-30 Distributed file repeated data deleting system and method

Country Status (1)

Country Link
CN (1) CN111522791B (en)

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8402250B1 (en) * 2010-02-03 2013-03-19 Applied Micro Circuits Corporation Distributed file system with client-side deduplication capacity
CN102456059A (en) * 2010-10-21 2012-05-16 英业达股份有限公司 Data deduplication processing system
CN103051671A (en) * 2012-11-22 2013-04-17 浪潮电子信息产业股份有限公司 Repeating data deletion method for cluster file system
CN103177111A (en) * 2013-03-29 2013-06-26 西安理工大学 System and method for deleting repeating data
CN103530201A (en) * 2013-07-17 2014-01-22 华中科技大学 Safety data repetition removing method and system applicable to backup system
CN103763362A (en) * 2014-01-13 2014-04-30 西安电子科技大学 Safe distributed duplicated data deletion method
CN103914522A (en) * 2014-03-20 2014-07-09 电子科技大学 Data block merging method applied to deleting duplicated data in cloud storage
CN104408111A (en) * 2014-11-24 2015-03-11 浙江宇视科技有限公司 Method and device for deleting duplicate data
EP3360033A1 (en) * 2015-10-07 2018-08-15 NEC Laboratories Europe GmbH Method for storing a data file
CN106294826A (en) * 2016-08-17 2017-01-04 北京北信源软件股份有限公司 A kind of company-data Query method in real time and system
CN106446099A (en) * 2016-09-13 2017-02-22 国家超级计算深圳中心(深圳云计算中心) Distributed cloud storage method and system and uploading and downloading method thereof
CN106649676A (en) * 2016-12-15 2017-05-10 北京锐安科技有限公司 Duplication eliminating method and device based on HDFS storage file
CN106940715A (en) * 2017-03-09 2017-07-11 星环信息科技(上海)有限公司 A kind of method and apparatus of the inquiry based on concordance list

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
付印金 et al.: "Research Progress on Key Technologies of Data Deduplication", Journal of Computer Research and Development *
俞善海: "Research on Data Deduplication Technology Based on Hadoop", China Master's Theses Full-text Database, Information Science and Technology *
刘青 et al.: "A Distributed Deduplication Storage System Based on the Hadoop Platform"
王瀚: "Design and Implementation of a Data Deduplication Engine for Backup and Disaster-Recovery Systems", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022048475A1 (en) * 2020-09-03 2022-03-10 中兴通讯股份有限公司 Data deduplication method, node, and computer readable storage medium
CN116107979A (en) * 2023-04-14 2023-05-12 大熊集团有限公司 Data distributed reading method and system
CN117194490A (en) * 2023-11-07 2023-12-08 长春金融高等专科学校 Financial big data storage query method based on artificial intelligence
CN117194490B (en) * 2023-11-07 2024-04-05 长春金融高等专科学校 Financial big data storage query method based on artificial intelligence

Also Published As

Publication number Publication date
CN111522791B (en) 2023-05-30

Similar Documents

Publication Publication Date Title
US7228299B1 (en) System and method for performing file lookups based on tags
US9922046B2 (en) Scalable distributed metadata file-system using key-value stores
US7752226B1 (en) Reverse pathname lookup by inode identifier
CN102629247B (en) Method, device and system for data processing
US9208031B2 (en) Log structured content addressable deduplicating storage
US8849759B2 (en) Unified local storage supporting file and cloud object access
US8200633B2 (en) Database backup and restore with integrated index reorganization
US8527556B2 (en) Systems and methods to update a content store associated with a search index
US7418544B2 (en) Method and system for log structured relational database objects
US8560500B2 (en) Method and system for removing rows from directory tables
US20120284317A1 (en) Scalable Distributed Metadata File System using Key-Value Stores
US7467163B1 (en) System and method to manipulate large objects on enterprise server data management system
CN111522791B (en) Distributed file repeated data deleting system and method
US7054887B2 (en) Method and system for object replication in a content management system
US7769719B2 (en) File system dump/restore by node numbering
US20150269213A1 (en) Compacting change logs using file content location identifiers
US11960363B2 (en) Write optimized, distributed, scalable indexing store
JP2013541057A (en) Map Reduce Instant Distributed File System
US20200097558A1 (en) System and method for bulk removal of records in a database
US11403024B2 (en) Efficient restoration of content
US20180276267A1 (en) Methods and system for efficiently performing eventual and transactional edits on distributed metadata in an object storage system
US20230394010A1 (en) File system metadata deduplication
US20220083504A1 (en) Managing snapshotting of a dataset using an ordered set of b+ trees
WO2020192663A1 (en) Data management method and related device
US20220222146A1 (en) Versioned backup on an object addressable storage system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant