CN103279502B

CN103279502B - Architecture and method for a data deduplication file system combined with a parallel file system

Info

Publication number: CN103279502B
Application number: CN201310168444.1A
Authority: CN
Inventors: 周晓阳; 周游
Original assignee: BEIJING SCISTOR TECHNOLOGY Co Ltd
Current assignee: BEIJING SCISTOR TECHNOLOGY Co Ltd
Priority date: 2013-05-06
Filing date: 2013-05-06
Publication date: 2016-01-20
Anticipated expiration: 2033-05-06
Also published as: CN103279502A

Abstract

The present invention provides an architecture and method for a data deduplication file system combined with a parallel file system. In the architecture, a data deduplication file system access interface is deployed on the client device, and a data deduplication processing engine and a data migration system are deployed on the data deduplication gateway. The parallel file system provides an access interface to the client device and to the data deduplication gateway, and the data deduplication processing engine and the data migration system perform deduplication, restoration, and migration of file data. The method migrates data in the parallel file system that meets a migration condition into the data deduplication file system for storage, deduplicates the migrated data, and allows the client device to read data that has been migrated. The invention adds a data deduplication file system transparently to an existing parallel file system, which reduces the impact on the business system, saves storage space, and lowers the management and energy costs of the data center.

Description

Architecture and method for a data deduplication file system combined with a parallel file system

Technical field

The invention belongs to the technical field of data storage and relates to a scheme for transparently integrating a data deduplication file system, specifically an architecture and method for a data deduplication file system combined with a parallel file system.

Background art

Most existing parallel file systems, such as the cluster parallel file system Lustre and the Blue Whale cluster file system BWFS, have no built-in data deduplication function. Yet these centralized storage systems hold a large amount of redundant data. In some cases the amount of redundant data reaches tens or even hundreds of times, and it keeps growing over time. For example, in data backup and archiving systems, a large share of the file data rarely changes and may even exist in multiple copies, so repeated archiving produces a large amount of redundant data. In office automation systems, file copying and version revision are common: a file may be copied to several people and may exist in several versions, which produces a large amount of duplicate data. Mass mailing and mail forwarding also cause considerable information redundancy. The sharp growth in data volume greatly increases the management and energy costs of the data center. How to reduce the demand for storage space and lower the cost of data storage has therefore become an urgent problem.

Data deduplication technology (also known as redundancy elimination) can effectively identify and eliminate duplicate data and improve the utilization of storage resources, and has therefore gradually become a research hotspot.

However, modifying an existing system or application to support data deduplication is difficult and risky. How to integrate data deduplication technology transparently into an existing parallel file system has therefore also become a pressing problem.

Summary of the invention

To address the problem of how to integrate data deduplication technology transparently into an existing parallel file system, the present invention provides an architecture and method for a data deduplication file system combined with a parallel file system.

The architecture of a data deduplication file system combined with a parallel file system provided by the invention comprises a client device, a parallel file system cluster, a data deduplication gateway cluster, and a storage device. The client device runs the business system and generates data streams. The parallel file system is deployed on the parallel file system cluster and exposes a parallel file system access interface. The parallel file system cluster comprises one or more parallel file system devices, which are divided into metadata servers and data servers. The data deduplication gateway cluster comprises one or more data deduplication gateways; each gateway hosts the data deduplication file system and provides the deduplication function externally. Specifically, a data deduplication processing engine and a data migration system are deployed on the data deduplication gateway: the data deduplication processing engine performs deduplication and restoration on the data stored in the parallel file system, and the data migration system migrates data in the parallel file system that meets the migration condition into the data deduplication file system for storage. The storage device stores data and is interconnected with the parallel file system devices and the data deduplication gateways. The client device and the data deduplication gateway read, write, and delete data in the parallel file system through the parallel file system access interface.

The data deduplication processing engine processes data as follows. First, it reads the data of a file, divides the data into blocks, and calculates the fingerprint of each data block. It then looks up each fingerprint in the data block index table. If the fingerprint is found, the data block already exists and is not stored again; otherwise the data block is new, so it is stored in the data block warehouse and a corresponding tuple is generated in the data block index table. The data block index table is used for duplicate checking of data blocks, and its tuple format is <data block fingerprint, file containing the data block, offset of the data block in the file, data block length, data block reference count>. The data block warehouse stores the non-duplicate data blocks and is located in the storage device.
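As an illustration only, the index tuple and the blocking and fingerprinting steps described above might be sketched in Python as follows. The fixed block size, the use of SHA-256 as the fingerprint function, and the names `IndexEntry`, `split_blocks`, and `fingerprint` are assumptions made for this sketch; the patent does not specify a blocking scheme or a hash algorithm.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass

CHUNK_SIZE = 4096  # assumed fixed block size; not specified in the patent


@dataclass
class IndexEntry:
    """One tuple of the data block index table (field names are hypothetical)."""
    fingerprint: str  # data block fingerprint
    container: str    # file containing the data block
    offset: int       # offset of the data block in that file
    length: int       # data block length
    ref_count: int    # data block reference count


def split_blocks(data: bytes, size: int = CHUNK_SIZE) -> list[bytes]:
    """Divide file data into consecutive blocks of at most `size` bytes."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def fingerprint(block: bytes) -> str:
    """Fingerprint of a block; SHA-256 is an assumption of this sketch."""
    return hashlib.sha256(block).hexdigest()
```

Two blocks with identical content yield the same fingerprint, which is what allows the index table lookup to detect duplicates.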

The data migration system periodically scans the files in the parallel file system through the parallel file system access interface, migrates files that meet the migration condition into the data deduplication file system, and establishes in the parallel file system an association between the original file and the migrated file. A migrated file is stored in the data deduplication file system after processing by the data deduplication processing engine, and a tuple corresponding to the file is generated in the data block mapping table. Each tuple has the format <file unique identifier, ChunkFP_1, ChunkFP_2, ..., ChunkFP_i, ...>, where ChunkFP_i denotes the fingerprint of the i-th data block.

Through the data deduplication file system access interface, the client device accesses files that have been migrated from the parallel file system devices into the data deduplication file system. Specifically: in the parallel file system, the request is redirected, according to the association between the original file and the migrated file, to the migrated file in the data deduplication file system; the fingerprints of the data blocks that the file comprises are found in the data block mapping table; the storage address of each data block is found in the data block index table according to its fingerprint; the data is read from the data block warehouse; and the data read is returned to the client device through the data deduplication file system access interface.

Based on the above architecture, the data deduplication method combined with a parallel file system provided by the invention mainly comprises the following three aspects:

First aspect: the data migration system periodically scans the parallel file system and obtains a list of files that have not yet been migrated. For each file in the list, it judges whether the file meets the migration condition; if so, it migrates the file into the data deduplication file system and establishes in the parallel file system an association between the original file and the migrated file.

Second aspect: the data deduplication processing engine processes each file to be migrated from the parallel file system. It reads the data of the file, divides the data into blocks, calculates the fingerprint of each data block, and looks up each fingerprint in the data block index table. If the fingerprint is found, the data block is not stored again; otherwise the data block is new, so it is stored and a corresponding tuple is generated in the data block index table. The data block index table is used for duplicate checking of data blocks, and its tuple format is <data block fingerprint, file containing the data block, offset of the data block in the file, data block length, data block reference count>.

Third aspect: the client device reads a file that has been migrated from the parallel file system devices into the data deduplication file system. Specifically: in the parallel file system, the request is redirected, according to the association between the original file f and the migrated file f', to the migrated file f' in the data deduplication file system; the fingerprints of the data blocks that f' comprises are found in the data block mapping table; the physical storage address of each data block is found in the data block index table according to its fingerprint; the data blocks are fetched from the data block warehouse and copied into the request buffer; and the data of file f' in the request buffer is returned to the client device through the data deduplication file system access interface.

The architecture and method provided by the invention add data deduplication support transparently to an existing parallel file system and business system, which reduces the impact on the business system. Using the data deduplication file system as the secondary storage system of the parallel file system saves storage space, lowers the management and energy costs of the data center, and improves storage efficiency. The technical scheme of the invention is therefore highly practical and has a wide range of applications.

Brief description of the drawings

Fig. 1 is a physical module diagram of the architecture of the data deduplication file system combined with a parallel file system according to the invention;

Fig. 2 is a logical module diagram of the architecture of the data deduplication file system combined with a parallel file system according to the invention;

Fig. 3 is a flowchart of migrating files from the parallel file system to the data deduplication file system;

Fig. 4 is a flowchart of the data deduplication operation in the data deduplication file system;

Fig. 5 is a flowchart of reading, from the client device, a file that has been migrated to the data deduplication file system.

Detailed description of the embodiments

The technical scheme of the invention is described in further detail below with reference to the drawings and embodiments.

Fig. 1 and Fig. 2 show the physical connection diagram and the logical composition diagram of the architecture of the data deduplication file system combined with a parallel file system according to the invention. The data deduplication file system combined with a parallel file system provided in the embodiment is implemented on the Linux operating system and solves the problem of how to combine a parallel file system and a data deduplication file system transparently.

As shown in Fig. 1, the physical modules of the architecture provided by the invention comprise a client device 1, a parallel file system cluster 2, a data deduplication gateway cluster 3, and a storage device 4. The client device 1 runs the business system and generates data streams. The parallel file system cluster 2 comprises several parallel file system devices, generally several metadata servers 21 and several data servers 22. The data deduplication gateway cluster 3 comprises several data deduplication gateways 31; each data deduplication gateway 31 hosts the data deduplication file system and provides the deduplication function externally. The storage device 4 stores data and is interconnected with the parallel file system cluster 2 and the data deduplication gateway cluster 3.

As shown in Fig. 2, the architecture of the invention also comprises several logical modules: the data deduplication processing engine 6 and the data migration system 7, deployed on the data deduplication gateway 31, and the data deduplication file system access interface 5, deployed on the client device 1. The parallel file system is deployed on the parallel file system cluster 2; it exposes a parallel file system access interface and internally provides metadata storage and data storage. The client device 1 and the data deduplication gateway 31 read, write, and delete data in the parallel file system through the parallel file system access interface. The client device 1 accesses data that has been migrated from the parallel file system into the data deduplication file system through the data deduplication file system access interface 5.

The data deduplication processing engine 6 and the data migration system 7 are deployed on the data deduplication gateway 31. The data deduplication processing engine 6 performs deduplication and restoration on the data stored on the parallel file system devices of the parallel file system cluster 2. The data migration system 7 migrates data in the parallel file system that meets the migration condition into the data deduplication file system for storage.

The data deduplication gateway 31 processes the data in the parallel file system cluster 2 through the data deduplication processing engine 6 as follows: it reads the data in the parallel file system, divides the data into blocks, calculates the fingerprint of each data block, and looks up each fingerprint in the data block index table. If the fingerprint is found, the data block already exists and is not stored again, which eliminates the duplicate data. If the fingerprint is not found, the data block is new; the new data is stored in the data block warehouse and a corresponding tuple is generated in the data block index table. The data block index table is created when the data deduplication file system is used for the first time and is used for duplicate checking of data blocks. Its tuples generally have the form <data block fingerprint, file containing the data block, offset of the data block in the file, data block length, data block reference count>, and a tuple is stored whenever a new data block fingerprint appears. The data block warehouse is created when the data deduplication file system is used for the first time and stores the data processed by the data deduplication processing engine 6, that is, the non-duplicate data blocks. Each file generally consists of multiple data blocks; in practice, each file is generally between tens of MB and 2 GB in size. The offset of the data block in the file

The data migration system 7 periodically scans the files on the parallel file system devices through the parallel file system access interface and migrates files in the parallel file system that meet the migration condition into the data deduplication file system. The migration condition is specified by the user according to business requirements and is generally set in terms of the file's last access time, file size, file extension, and so on. When a file is migrated, an association between the original file and the migrated file is established in the parallel file system. The migrated file is stored in the data deduplication file system after processing by the data deduplication processing engine, and the tuple corresponding to the file is recorded in the data block mapping table. Each file corresponds to multiple data blocks, and each tuple has the format <file unique identifier (inode), ChunkFP_1, ChunkFP_2, ChunkFP_3, ..., ChunkFP_i, ...>, where ChunkFP_1, ChunkFP_2, and ChunkFP_3 denote the fingerprints of the 1st, 2nd, and 3rd data blocks, and ChunkFP_i denotes the fingerprint of the i-th data block.
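A user-specified migration condition of this kind could be expressed, for illustration, as a small predicate. The following Python sketch is not taken from the patent; the names `MigrationPolicy` and `meets_condition` are hypothetical, and the default values (extension `.dat`, idle for at least 10 days, larger than 1 GB) mirror the example condition the description gives for step 104.

```python
import time
from dataclasses import dataclass


@dataclass
class MigrationPolicy:
    """Hypothetical container for a user-specified migration condition."""
    extensions: tuple = (".dat",)              # file extensions to select
    min_idle_seconds: float = 10 * 24 * 3600   # last access at least 10 days ago
    min_size_bytes: int = 1 << 30              # file larger than 1 GB


def meets_condition(path, size, atime, policy, now=None):
    """Return True if a file with the given size and last access time qualifies."""
    now = time.time() if now is None else now
    return (path.endswith(policy.extensions)
            and now - atime >= policy.min_idle_seconds
            and size > policy.min_size_bytes)
```

In a real scan the `size` and `atime` arguments would typically come from `os.stat`; they are passed explicitly here so the predicate stays independent of the file system.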

In the embodiment of the invention, when the business system reads a file from the client device 1, the file is first looked up in the parallel file system cluster 2. If the file has been migrated, the request is automatically redirected to the data deduplication file system according to the association between the original file and the migrated file. The fingerprints of the data blocks that the file comprises are found in the data block mapping table, the storage address of each data block is found in the data block index table according to its fingerprint, and the data is then read from the data block warehouse at that storage address. The data read is returned to the client device 1 through the data deduplication file system access interface 5. In the embodiment, the storage address of a data block is obtained from the file containing the data block, the offset of the data block in that file, and the data block length recorded in the tuple of the data block index table.

With the architecture of the invention, operations such as redirection and fingerprint lookup during the whole file read process are transparent to the business system on the client device. During data archiving, the migration takes place on the data deduplication gateway 31: the data migration system 7 migrates the data that meets the migration condition into the data deduplication file system and automatically establishes the association from the parallel file system to the data deduplication file system. This process is also completely transparent to the business system on the client device. The architecture of the invention therefore achieves a transparent combination of the parallel file system and the data deduplication file system.

With reference to the system shown in Fig. 1 and Fig. 2, the data deduplication method combined with a parallel file system provided by the invention mainly comprises the following three aspects:

First aspect: data migration, in which data on the parallel file system devices that meets the migration condition is migrated into the data deduplication file system. The data migration system 7 periodically scans the parallel file system and obtains a list of files that have not yet been migrated. For each file in the list, it judges whether the file meets the migration condition; if so, it migrates the file into the data deduplication file system and establishes in the parallel file system an association between the original file and the migrated file.

Second aspect: data deduplication of the migrated data. The data deduplication processing engine 6 processes each file to be migrated from the parallel file system: it reads the data of the file, divides the data into blocks, calculates the fingerprint of each data block, and looks up each fingerprint in the data block index table. If the fingerprint is found, the data block is not stored again; otherwise the data block is new, so it is stored and a corresponding tuple is generated in the data block index table.

Third aspect: reading, from the client device, data that has been migrated to the data deduplication file system. In the parallel file system, the request is redirected, according to the association between the original file f and the migrated file f', to the migrated file f' in the data deduplication file system; the fingerprints of the data blocks that f' comprises are found in the data block mapping table; the physical storage address of each data block is found in the data block index table according to its fingerprint; the data blocks are fetched from the data block warehouse and copied into the request buffer; and the data of file f' in the request buffer is returned to the client device through the data deduplication file system access interface.

Fig. 3 shows the data migration flow, which comprises the data migration itself and the establishment of the file association between the parallel file system and the data deduplication file system. The specific steps are as follows:

Step 101: the data migration system 7 scans the parallel file system and obtains a list of files that have not yet been migrated; each file in the list is then judged and processed.

Step 102: judge whether any unprocessed files remain; if so, perform step 103; otherwise this migration run is complete and the flow ends;

Step 103: obtain a file f from the list;

Step 104: judge whether file f meets the migration condition. The condition is generally set by the user and generally includes the file extension, file path filter conditions, the file's last access time, the file size, and so on. An example migration condition is: the extension is dat, the last access time is more than 10 days ago, and the file size is greater than 1 GB. If the migration condition is met, perform step 105; otherwise perform step 102;

Step 105: migrate file f into the data deduplication file system, forming file f'. File f is processed by the data deduplication processing engine 6 to form the migrated file f'. After the migration, the data of the original file f is deleted from the parallel file system;

Step 106: establish the association from file f to file f' in the parallel file system so that the business system can read f transparently. The association between the original file f and the migrated file f' established by the invention is a symbolic link created for the original file f in the parallel file system, whose target is set to the path of the migrated file f'.
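Steps 101 to 106 can be illustrated with a minimal Python sketch that uses ordinary directories in place of the two file systems. This is a sketch under stated assumptions, not the patented implementation: `migrate_file`, `migrate_all`, and the `should_migrate` callback are hypothetical names, a plain file copy stands in for processing by the data deduplication processing engine, and files are assumed to have unique base names.

```python
import os
import shutil


def migrate_file(src, dedup_root):
    """Steps 105-106: move src into dedup_root and leave a symbolic link at src."""
    os.makedirs(dedup_root, exist_ok=True)
    dst = os.path.join(dedup_root, os.path.basename(src))
    shutil.copyfile(src, dst)  # stand-in for the deduplication engine (assumption)
    os.remove(src)             # delete the original data in the parallel file system
    os.symlink(dst, src)       # link target is the path of the migrated file f'
    return dst


def migrate_all(pfs_root, dedup_root, should_migrate):
    """Steps 101-104: scan pfs_root and migrate every file that qualifies."""
    migrated = []
    for name in sorted(os.listdir(pfs_root)):   # step 101: list candidate files
        path = os.path.join(pfs_root, name)
        if os.path.islink(path) or not os.path.isfile(path):
            continue                            # already migrated, or not a regular file
        if should_migrate(path):                # step 104: check the migration condition
            migrated.append(migrate_file(path, dedup_root))
    return migrated
```

Because the original path becomes a symbolic link, a program that opens the original path afterwards reads the migrated content without any change on its side, which is the transparency property the description relies on.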

Fig. 4 shows the flow in which the data deduplication processing engine 6 deduplicates the data to be migrated from the parallel file system. The specific steps are as follows:

Step 201: suppose that file f in the parallel file system is currently to be migrated into the data deduplication file system, generating the corresponding migrated file f'. The data deduplication processing engine 6 reads file f from the parallel file system according to the path of file f, divides the data of file f into blocks, and then reads the data blocks one at a time;

Step 202: calculate the fingerprint of the data block;

Step 203: judge whether the fingerprint exists in the data block index table. If it exists, the data block is a duplicate; perform step 206. Otherwise the data block is new; perform step 204;

Step 204: store the data block in the data block warehouse, and generate the tuple corresponding to the data block in the data block information table. The data block warehouse is established in the storage device 4;

Step 205: generate the tuple corresponding to the data block in the data block index table, then go to step 207. The tuple format of the data block index table is <data block fingerprint, file containing the data block, offset of the data block in the file, data block length, data block reference count>;

Step 206: update the data block reference count in the tuple corresponding to the data block in the data block index table;

Step 207: record the fingerprint of the data block in the tuple corresponding to file f' in the data block mapping table;

Step 208: judge whether all data blocks of file f have been read. If so, delete the data corresponding to file f and end this deduplication operation; otherwise, read the next data block and go to step 202.
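The loop in steps 201 to 208 might look roughly like the following in-memory Python sketch. The class name `DedupStore`, the fixed block size, the SHA-256 fingerprint, and the use of a single byte array as the data block warehouse are assumptions for illustration; the index entry is reduced to offset, length, and reference count because there is only one warehouse container here.

```python
import hashlib


class DedupStore:
    """Toy in-memory model of the warehouse, index table, and mapping table."""

    def __init__(self, block_size=4):
        self.block_size = block_size   # assumed fixed block size
        self.warehouse = bytearray()   # data block warehouse
        self.index = {}                # fingerprint -> [offset, length, ref_count]
        self.mapping = {}              # file id -> ordered list of fingerprints

    def ingest(self, file_id, data):
        fingerprints = []
        for i in range(0, len(data), self.block_size):    # step 201: next block
            block = data[i:i + self.block_size]
            fp = hashlib.sha256(block).hexdigest()         # step 202: fingerprint
            entry = self.index.get(fp)                     # step 203: index lookup
            if entry is None:
                offset = len(self.warehouse)
                self.warehouse += block                    # step 204: store new block
                self.index[fp] = [offset, len(block), 1]   # step 205: new index tuple
            else:
                entry[2] += 1                              # step 206: bump ref count
            fingerprints.append(fp)                        # step 207: mapping tuple
        self.mapping[file_id] = fingerprints
        return fingerprints
```

For example, ingesting `b"AAAABBBBAAAA"` with a block size of 4 stores only the blocks `AAAA` and `BBBB` in the warehouse, and the index entry for `AAAA` ends up with a reference count of 2.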

When the business system reads file f from the client device 1, file f is first looked up in the parallel file system cluster 2. If file f has been migrated into the data deduplication file system, and the migrated file is f', the request is redirected to file f' in the data deduplication file system according to the association between file f and file f'. Fig. 5 shows the flow of reading, from the client device 1, a file that has been migrated to the data deduplication file system. The specific steps are as follows:

Step 301: according to the association between file f and file f', redirect to file f' in the data deduplication file system;

Step 302: find the tuple corresponding to file f' in the data block mapping table; this tuple records the fingerprints of the data blocks that file f' comprises;

Step 303: read the data block fingerprints one at a time;

Step 304: look up the data block fingerprint in the data block index table, and obtain the physical storage address of the data block from the file containing the data block, the offset of the data block in that file, and the data block length recorded in the tuple found;

Step 305: read the corresponding data block from the data block warehouse according to the physical storage address, and copy the data block into the request buffer;

Step 306: judge whether all data blocks of file f' have been read; if not, perform step 303; otherwise, perform step 307;

Step 307: return the data of file f' in the request buffer to the client device 1 through the data deduplication file system access interface 5.
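The read path in steps 302 to 307 amounts to reassembling a file from its ordered fingerprint list. The following Python sketch shows one way this could work; the function name `read_file` and the sample tables are hypothetical, the fingerprints are short placeholder strings rather than real hashes, and the index maps a fingerprint directly to an (offset, length) pair within a single warehouse.

```python
def read_file(file_id, mapping, index, warehouse):
    """Rebuild a migrated file from the mapping table, index table, and warehouse."""
    buffer = bytearray()                             # request buffer
    for fp in mapping[file_id]:                      # steps 302-303: each fingerprint
        offset, length = index[fp]                   # step 304: resolve storage address
        buffer += warehouse[offset:offset + length]  # step 305: copy block to buffer
    return bytes(buffer)                             # step 307: return the file data


# Hypothetical sample tables for a file made of blocks A, B, A, C.
warehouse = b"AAAABBBBCC"
index = {"fpA": (0, 4), "fpB": (4, 4), "fpC": (8, 2)}
mapping = {"f_prime": ["fpA", "fpB", "fpA", "fpC"]}

print(read_file("f_prime", mapping, index, warehouse))  # b'AAAABBBBAAAACC'
```

The block `AAAA` is stored once but appears twice in the restored file, because the mapping tuple lists its fingerprint twice.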

It should be noted and understood that various modifications and improvements can be made to the invention described in detail above without departing from the spirit and scope of the invention as set out in the appended claims. The scope of the claimed technical scheme is therefore not limited by any specific exemplary teaching given.

Claims

1. An architecture of a data deduplication file system combined with a parallel file system, comprising a client device, a parallel file system cluster, and a storage device, wherein the client device runs a business system and generates data streams, a parallel file system is deployed on the parallel file system cluster and exposes a parallel file system access interface, and the storage device stores data; characterized in that the data deduplication file system combined with a parallel file system further comprises a data deduplication gateway cluster; the data deduplication gateway cluster comprises one or more data deduplication gateways; a data deduplication processing engine and a data migration system are deployed on the data deduplication gateway; the data deduplication processing engine performs deduplication and restoration on the data stored in the parallel file system; the data migration system migrates data in the parallel file system that meets a migration condition into the data deduplication file system for storage; and the client device and the data deduplication gateway read, write, and delete the data stored in the parallel file system through the parallel file system access interface;

the data deduplication processing engine processes data as follows: first, it reads the data, divides the data into blocks, and calculates the fingerprint of each data block; then it looks up the fingerprint of each data block in the data block index table; if the fingerprint is found, the data block already exists and is not stored again; otherwise, the data block is a new data block, the data block is stored in the data block warehouse, and a corresponding tuple is generated in the data block index table; the data block index table is used for duplicate checking of data blocks, and its tuple format is <data block fingerprint, file containing the data block, offset of the data block in the file, data block length, data block reference count>; the data block warehouse is located on the storage device;

the data migration system periodically scans the files in the parallel file system through the parallel file system access interface, migrates files that meet the migration condition into the data deduplication file system, and establishes in the parallel file system an association between the original file and the migrated file; the migrated file is stored after processing by the data deduplication processing engine, and a tuple corresponding to the migrated file is generated in the data block mapping table; each tuple has the format <file unique identifier, ChunkFP_1, ChunkFP_2, ..., ChunkFP_i, ...>, where ChunkFP_i denotes the fingerprint of the i-th data block; through the data deduplication file system access interface, the client device accesses files that have been migrated from the parallel file system devices into the data deduplication file system, specifically: in the parallel file system, the request is redirected, according to the association between the original file and the migrated file, to the migrated file in the data deduplication file system; the fingerprints of the data blocks that the migrated file comprises are found in the data block mapping table; the storage address of each data block is found in the data block index table according to its fingerprint; the data is read from the data block warehouse; and the data read is returned to the client device through the data deduplication file system access interface.

2. A data deduplication method combined with a parallel file system, based on the architecture of claim 1, characterized in that it comprises the following three aspects:

First aspect: the data migration system periodically scans the parallel file system and obtains a list of files that have not yet been migrated; for each file in the list, it judges whether the file meets the migration condition; if so, it migrates the file into the data deduplication file system and establishes in the parallel file system an association between the original file and the migrated file;

Second aspect: the data deduplication processing engine processes each file to be migrated from the parallel file system: it reads the data of the file, divides the data into blocks, calculates the fingerprint of each data block, and looks up each fingerprint in the data block index table; if the fingerprint is found, the data block is not stored again; otherwise the data block is a new data block, the data block is stored, and a corresponding tuple is generated in the data block index table; the data block index table is used for duplicate checking of data blocks, and its tuple format is <data block fingerprint, file containing the data block, offset of the data block in the file, data block length, data block reference count>;

Third aspect: the client device reads a file that has been migrated from the parallel file system devices into the data deduplication file system, specifically: in the parallel file system, the request is redirected, according to the association between the original file f and the migrated file f', to the migrated file f' in the data deduplication file system; the fingerprints of the data blocks that f' comprises are found in the data block mapping table; the physical storage address of each data block is found in the data block index table according to its fingerprint; the data blocks are fetched from the data block warehouse and copied into the request buffer; and the data of file f' in the request buffer is returned to the client device through the data deduplication file system access interface.