CN104679772A

CN104679772A - Method, device, equipment and system for deleting files in distributed data warehouse

Info

Publication number: CN104679772A
Application number: CN201310628849.9A
Authority: CN
Inventors: 庄虔玉; 鲍春健; 麦艺华; 翟艳堂
Original assignee: Shenzhen Tencent Computer Systems Co Ltd
Current assignee: Shenzhen Tencent Computer Systems Co Ltd
Priority date: 2013-11-29
Filing date: 2013-11-29
Publication date: 2015-06-03
Anticipated expiration: 2033-11-29
Also published as: CN104679772B; US20160253362A1; US9830327B2; WO2015078370A1

Abstract

The invention discloses a method, device, equipment and system for deleting files in a distributed data warehouse, and belongs to the field of data warehouse management. The method comprises: sending a heartbeat report to a management node; receiving a deletion instruction, sent by the management node, that carries a data block identifier; storing the data block identifier in a delay queue; and deleting, under a specified condition, the data block corresponding to the data block identifier stored in the delay queue. Because the data node queues the identifier and deletes the block only once the specified condition is met, the invention solves the prior-art problem that the recycle bin provided in the NameNode cannot undo mistaken deletions in certain circumstances, which reduces the data security of a Hadoop system. The data security of the Hadoop system is thereby ensured to a large extent.

Description

Method, device, equipment and system for deleting files in a distributed data warehouse

Technical field

The present invention relates to the field of data warehouse management, and in particular to a method, device, equipment and system for deleting files in a distributed data warehouse.

Background

Hadoop is a distributed system architecture that makes full use of a cluster for high-speed computation and storage, and it implements a distributed file system (HDFS, Hadoop Distributed File System). An HDFS deployment may comprise one management node (NameNode) and multiple data nodes (DataNodes). A file stored in HDFS may be divided into multiple data blocks, which are distributed across different DataNodes.

To prevent files from being lost when a user deletes them by mistake, existing Hadoop provides a recycle bin (Trash) function in the NameNode. When a user deletes a file, the NameNode changes the file's directory to the recycle bin directory. At this point the DataNodes do not physically delete the file; only its directory is changed to the recycle bin directory. The NameNode no longer returns the file's metadata and block mappings to the user, so the user cannot see the file. If the user discovers that the deletion was a mistake and wants to recover the file, the NameNode moves the file name from the recycle bin directory back to the original directory, which completes the recovery. When a client then needs to view the file, it can read the file's data blocks from the relevant DataNodes according to the metadata and block mappings returned by the NameNode.

In the course of making the present invention, the inventors found that the prior art has at least the following problem: the recycle bin provided in the NameNode cannot undo a mistaken deletion in certain situations. For example, when a user empties the recycle bin, the NameNode determines which data blocks stored in the DataNodes belong to the files in the recycle bin and issues deletion instructions for those blocks to the DataNodes, and the DataNodes delete the blocks accordingly. As another example, if the metadata in the NameNode is deleted or corrupted by mistake before the cluster starts, the NameNode's memory contains no metadata when the NameNode starts. After the DataNodes start, they report their data block information to the NameNode; because the NameNode has no record of these blocks, it issues deletion instructions for them, and the DataNodes delete the blocks. Because a DataNode deletes the relevant data blocks immediately upon receiving a deletion instruction, the deleted blocks cannot be recovered even if the user notices the mistaken deletion shortly afterwards, which reduces the data security of the Hadoop system.

Summary of the invention

To solve the prior-art problem that the recycle bin provided in the NameNode cannot undo mistaken deletions in certain situations, which reduces the data security of a Hadoop system, embodiments of the present invention provide a method, a device and electronic equipment for deleting files in a distributed data warehouse. The technical solutions are as follows:

In a first aspect, a method for deleting files in a distributed data warehouse is provided, the method comprising:

A data node sends a heartbeat report to a management node, where the heartbeat report comprises the data block identifiers of all data blocks stored by the data node and enables the management node to determine, according to the heartbeat report, the mapping from the data block identifiers to the data node;

The data node receives a heartbeat response and obtains from the received heartbeat response a deletion instruction, sent by the management node, that carries a data block identifier;

The data node stores the data block identifier in a delay queue and records the storage time;

Under a specified condition, the data node deletes the data block in the data node that corresponds to the data block identifier stored in the delay queue.

In a second aspect, a method for deleting files in a distributed data warehouse is provided, the method comprising:

A management node receives a file deletion instruction, sent by a client, that indicates a specified file to be deleted;

The management node receives a heartbeat report sent by a data node and determines, according to the heartbeat report, the mapping from data block identifiers to the data node, where the heartbeat report comprises the data block identifiers of all data blocks stored by the data node;

For each data node, the management node determines the data blocks that are stored in the data node and belong to the specified file, according to the pre-stored correspondence between files and data block identifiers and the mapping from data block identifiers to the data node determined from the heartbeat report;

The management node adds a deletion instruction carrying the data block identifiers of those data blocks to the heartbeat response sent to the data node, so that the data node receives the deletion instruction, stores the data block identifiers in a delay queue, and deletes, under a specified condition, the data blocks corresponding to the data block identifiers stored in the delay queue.

In a third aspect, a method for deleting files in a distributed data warehouse is provided, the method comprising:

A client sends to a management node a file deletion instruction indicating a specified file to be deleted, so that the management node, after receiving the file deletion instruction, determines for each data node the data blocks that are stored in the data node and belong to the specified file, according to the pre-stored correspondence between files and data block identifiers and the mapping from data block identifiers to data nodes determined from the heartbeat reports sent by the data nodes, and adds a deletion instruction carrying the data block identifiers of those data blocks to the heartbeat response sent to the data node, so that the data node receives the deletion instruction, stores the data block identifiers in a delay queue, and deletes, under a specified condition, the data blocks corresponding to the data block identifiers stored in the delay queue.

In a fourth aspect, a device for deleting files in a distributed data warehouse is provided. The device is applied in a data node and comprises:

A heartbeat sending module, configured to send a heartbeat report to a management node, where the heartbeat report comprises the data block identifiers of all data blocks stored by the data node and enables the management node to determine, according to the heartbeat report, the mapping from the data block identifiers to the data node;

An acquisition module, configured to receive a heartbeat response and obtain from it a deletion instruction, sent by the management node, that carries a data block identifier;

A storing module, configured to store the data block identifier in a delay queue and record the storage time;

A deletion module, configured to delete, under a specified condition, the data block in the data node that corresponds to the data block identifier stored in the delay queue.

In a fifth aspect, a device for deleting files in a distributed data warehouse is provided. The device is applied in a management node and comprises:

A third receiving module, configured to receive a file deletion instruction, sent by a client, that indicates a specified file to be deleted;

A fourth receiving module, configured to receive a heartbeat report sent by a data node and determine, according to the heartbeat report, the mapping from data block identifiers to the data node, where the heartbeat report comprises the data block identifiers of all data blocks stored by the data node;

A second determination module, configured to determine, for each data node, the data blocks that are stored in the data node and belong to the specified file, according to the pre-stored correspondence between files and data block identifiers and the mapping from data block identifiers to the data node determined from the heartbeat report;

A second sending module, configured to add a deletion instruction carrying the data block identifiers of those data blocks to the heartbeat response sent to the data node, so that the data node receives the deletion instruction, stores the data block identifiers in a delay queue, and deletes, under a specified condition, the data blocks corresponding to the data block identifiers stored in the delay queue.

In a sixth aspect, a device for deleting files in a distributed data warehouse is provided. The device is applied in a client and comprises:

A third sending module, configured to send to a management node a file deletion instruction indicating a specified file to be deleted, so that the management node, after receiving the file deletion instruction, determines for each data node the data blocks that are stored in the data node and belong to the specified file, according to the pre-stored correspondence between files and data block identifiers and the mapping from data block identifiers to data nodes determined from the heartbeat reports sent by the data nodes, and adds a deletion instruction carrying the data block identifiers of those data blocks to the heartbeat response sent to the data node, so that the data node receives the deletion instruction, stores the data block identifiers in a delay queue, and deletes, under a specified condition, the data blocks corresponding to the data block identifiers stored in the delay queue.

In a seventh aspect, a data node is provided, the data node comprising the device for deleting files in a distributed data warehouse described in the fourth aspect.

In an eighth aspect, a management node is provided, the management node comprising the device for deleting files in a distributed data warehouse described in the fifth aspect.

In a ninth aspect, a client is provided, the client comprising the device for deleting files in a distributed data warehouse described in the sixth aspect.

In a tenth aspect, a system for deleting files in a distributed data warehouse is provided, the system comprising a client, a management node and at least one data node;

The client comprises the device for deleting files in a distributed data warehouse described in the sixth aspect;

The management node comprises the device for deleting files in a distributed data warehouse described in the fifth aspect;

The data node comprises the device for deleting files in a distributed data warehouse described in the fourth aspect.

The technical solutions provided by the embodiments of the present invention have the following beneficial effects:

The data node receives the deletion instruction, sent by the management node, that carries a data block identifier, stores the data block identifier in a delay queue, and deletes the data block corresponding to the stored identifier only under a specified condition. This solves the prior-art problem that the recycle bin provided in the NameNode cannot undo mistaken deletions in certain situations, which reduces the data security of a Hadoop system. Because the data node does not delete the specified data blocks immediately after receiving the deletion instruction but delays the deletion for a period of time, a user who discovers a mistaken deletion within that period can recover the data blocks, so the security of the data in the Hadoop system is ensured to a large extent.

Brief description of the drawings

To describe the technical solutions in the embodiments of the present invention more clearly, the accompanying drawings needed for describing the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 is a schematic diagram of the implementation environment involved in the method for deleting files in a distributed data warehouse provided in some embodiments of the present invention;

Fig. 2 is a flowchart of the method for deleting files in a distributed data warehouse provided in one embodiment of the present invention;

Fig. 3 is a flowchart of the method for deleting files in a distributed data warehouse provided in another embodiment of the present invention;

Fig. 4A is a flowchart of the method for deleting files in a distributed data warehouse provided in yet another embodiment of the present invention;

Fig. 4B is a schematic diagram of the method for deleting files in a distributed data warehouse provided in some embodiments of the present invention;

Fig. 5 is a flowchart of the method for deleting files in a distributed data warehouse provided in still another embodiment of the present invention;

Fig. 6 is a schematic structural diagram of the system for deleting files in a distributed data warehouse provided in one embodiment of the present invention;

Fig. 7 is a schematic structural diagram of the system for deleting files in a distributed data warehouse provided in another embodiment of the present invention.

Detailed description of the embodiments

To make the objectives, technical solutions and advantages of the present invention clearer, the embodiments of the present invention are described in further detail below with reference to the accompanying drawings.

Referring to Fig. 1, which shows a schematic diagram of the implementation environment involved in the method for deleting files in a distributed data warehouse provided in some embodiments of the present invention. The implementation environment is a Hadoop cluster, which may comprise a client 102, a management node 104 and at least one data node 106.

The client 102 can issue instructions such as saving, deleting and recovering files to the management node 104 or the data node 106.

The management node 104 is the NameNode in the Hadoop cluster. The NameNode manages the namespace of the file system and keeps the metadata of all files and directories in a file system tree; this metadata is also persisted on disk as the namespace image file FSImage and the edit log file EditLog. The NameNode also records, through the block map BlocksMap, the DataNodes on which each data block of each file resides; the block map is the mapping from data blocks to DataNodes that the NameNode generates from the data block information periodically reported by the DataNodes. Metadata here means the information needed to maintain all files and directories in the file system, for example the attributes of the files and directories themselves (such as file name, directory name, parent directory information, file size, creation time and modification time), information about how file content is stored (such as how the file is divided into blocks, the number of replicas and the DataNodes holding each replica), and information recording all DataNodes in the HDFS.

The data node 106 is a DataNode in the Hadoop cluster. A DataNode stores and retrieves data blocks as scheduled by the client or the NameNode, and periodically reports to the NameNode the information of the data blocks it stores and the storage space those blocks occupy; this reporting process is the heartbeat report of the data node 106.

In this way, the client 102 can read the data blocks of a specified file from the data nodes 106 according to the metadata of the specified file in the management node 104 and the mapping between data blocks and data nodes 106, thereby reading the specified file.

Referring to Fig. 2, which shows a flowchart of the method for deleting files in a distributed data warehouse provided in one embodiment of the present invention. This embodiment is described mainly with the method applied in the data node 106 of the implementation environment shown in Fig. 1. The method may comprise:

201. Send a heartbeat report to the management node, where the heartbeat report comprises the data block identifiers of all data blocks stored by the data node;

The data node sends a heartbeat report to the management node. The heartbeat report enables the management node to determine, according to the heartbeat report, the mapping from data block identifiers to the data node.

202. Receive a heartbeat response, and obtain from the received heartbeat response a deletion instruction, sent by the management node, that carries a data block identifier;

The data node receives the heartbeat response and obtains from it the deletion instruction, sent by the management node, that carries a data block identifier.

In practice, the management node can carry the deletion instruction in its heartbeat response. The management node can also issue the deletion instruction to the data node separately.

A data node usually holds many data blocks, and these blocks may belong to different files. Therefore, when a file needs to be deleted, the deletion instruction must carry the data block identifiers of the blocks of that file that are stored in the data node, so that the data node, after receiving the deletion instruction, knows which of its stored data blocks need to be deleted.
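Steps 201 and 202 can be sketched as follows. This is a minimal illustration, not Hadoop's actual wire protocol: the dictionary message layout, the `"DELETE"` command type and the function names `build_heartbeat_report` and `extract_delete_ids` are all hypothetical. Ignoring identifiers that the node does not store is also an assumption made for the sketch.

```python
def build_heartbeat_report(node_id, stored_blocks):
    """Step 201: report the identifiers of all blocks this data node stores.

    The message layout is hypothetical, chosen only for illustration.
    """
    return {"node_id": node_id, "block_ids": sorted(stored_blocks)}


def extract_delete_ids(heartbeat_response, stored_blocks):
    """Step 202: pull the block identifiers out of any deletion instruction
    piggybacked on the heartbeat response.

    Identifiers the node does not actually store are ignored (an assumption).
    """
    ids = []
    for command in heartbeat_response.get("commands", []):
        if command.get("type") == "DELETE":
            ids.extend(b for b in command["block_ids"] if b in stored_blocks)
    return ids
```

The identifiers returned by `extract_delete_ids` are the ones the data node would then place in its delay queue rather than delete at once.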

203. Store the data block identifier in a delay queue and record the storage time;

The data node stores the data block identifier in the delay queue.

The delay queue is a queue set up in the memory of the data node to hold the identifiers of the data blocks that need to be deleted.

The delay queue may also be a queue set up on the disk of the data node or on another storage device.

Each time a data block identifier is stored in the delay queue, the time at which it was stored is recorded.

204. Under a specified condition, delete the data block in the data node that corresponds to the data block identifier stored in the delay queue.

The data blocks corresponding to the identifiers stored in the delay queue may be referred to as delayed-deletion data blocks.

The data node deletes the data blocks corresponding to the identifiers stored in the delay queue under a specified condition. That is, the data node deletes those blocks only when the specified condition is met. The data node therefore does not delete the blocks whose identifiers are carried in the deletion instruction as soon as it receives the instruction, but delays their deletion.

In a first possible implementation of this embodiment,

Deleting, under a specified condition, the data block in the data node that corresponds to the data block identifier stored in the delay queue comprises:

When the length of time for which the data block identifier has been stored in the delay queue reaches a predetermined time threshold, deleting the data block in the data node that corresponds to the data block identifier;

Or,

When a clearing instruction issued by the client is received, which instructs the data node to clear the data blocks corresponding to all data block identifiers in the delay queue, deleting the data blocks in the data node that correspond to all data block identifiers in the delay queue.
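The delay queue of step 203 and the two deletion conditions above can be sketched as below. This is a minimal in-memory illustration under stated assumptions: the class name `DelayQueue`, its method names and the injectable `clock` parameter are hypothetical, and the caller is assumed to perform the physical deletion of the blocks whose identifiers are returned.

```python
import time
from collections import OrderedDict


class DelayQueue:
    """Hypothetical in-memory delay queue kept by a data node."""

    def __init__(self, threshold_seconds, clock=time.time):
        self.threshold_seconds = threshold_seconds  # predetermined time threshold
        self._clock = clock                         # injectable so tests can fake time
        self._entries = OrderedDict()               # block_id -> storage time

    def put(self, block_id):
        # Record the storage time only the first time an identifier is queued.
        self._entries.setdefault(block_id, self._clock())

    def pop_expired(self):
        """Condition 1: identifiers stored for at least the time threshold."""
        now = self._clock()
        expired = [b for b, stored_at in self._entries.items()
                   if now - stored_at >= self.threshold_seconds]
        for block_id in expired:
            del self._entries[block_id]
        return expired

    def pop_all(self):
        """Condition 2: a clearing instruction empties the whole queue."""
        drained = list(self._entries)
        self._entries.clear()
        return drained

    def pending(self):
        """Identifiers still awaiting deletion, in insertion order."""
        return list(self._entries)
```

A data node would call `pop_expired()` periodically and `pop_all()` when a clearing instruction arrives, deleting from disk only the blocks whose identifiers are returned.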

In a second possible implementation of this embodiment,

After storing the data block identifier in the delay queue and recording the storage time, the method further comprises:

Receiving a recovery instruction sent by the management node, which instructs the data node to recover the data blocks corresponding to the data block identifiers stored in the delay queue;

Sending to the management node a heartbeat report carrying the data block identifiers of all data blocks stored by the data node, so that the management node builds the mapping from data block identifiers to the data node according to the data block identifiers in the received heartbeat report.

In a third possible implementation of this embodiment,

Before deleting the data blocks in the data node that correspond to all data block identifiers in the delay queue upon receiving the clearing instruction issued by the client, the method further comprises:

Determining the data blocks in the data node that correspond to all data block identifiers in the delay queue;

Calculating the occupancy parameter of those data blocks in the data node;

Sending the occupancy parameter to the management node, so that once the management node has received the occupancy parameter and the client has viewed it, the client can determine whether to issue to the data node a clearing instruction for clearing the data blocks in the data node that correspond to all data block identifiers in the delay queue.

In a fourth possible implementation of this embodiment, the occupancy parameter comprises a delayed-deletion storage space and a delayed-deletion percentage, and calculating the occupancy parameter of the data blocks in the data node comprises:

Calculating the storage space of the data node occupied by the data blocks, and taking that storage space as the delayed-deletion storage space;

Calculating the percentage of the total storage space of the data node that the delayed-deletion storage space occupies, and taking that percentage as the delayed-deletion percentage.
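The two occupancy figures can be computed as in the following sketch. The function name `occupancy_parameters` and its inputs (a per-block size table and the node's total capacity in bytes) are assumptions made for illustration; the source does not specify how a data node obtains these values.

```python
def occupancy_parameters(block_sizes, queued_block_ids, total_capacity_bytes):
    """Return (delayed-deletion storage space, delayed-deletion percentage).

    block_sizes: hypothetical mapping of block identifier -> size in bytes.
    queued_block_ids: identifiers currently held in the delay queue.
    """
    delayed_bytes = sum(block_sizes[b] for b in queued_block_ids
                        if b in block_sizes)
    if total_capacity_bytes <= 0:
        return delayed_bytes, 0.0  # avoid dividing by zero on a bad capacity value
    return delayed_bytes, 100.0 * delayed_bytes / total_capacity_bytes
```

For example, with blocks of 128 and 64 bytes queued on a node of 1024 bytes total capacity, the delayed-deletion storage space is 192 bytes and the delayed-deletion percentage is 18.75.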

In a fifth possible implementation of this embodiment,

The method for deleting files in a distributed data warehouse further comprises:

Receiving a time configuration instruction, sent by the client, that carries a specified duration, where the time configuration instruction is used to dynamically configure the predetermined time threshold;

Updating the predetermined time threshold to the specified duration according to the time configuration instruction.

In summary, in the method for deleting files in a distributed data warehouse provided in this embodiment of the present invention, the data node receives the deletion instruction, sent by the management node, that carries a data block identifier, stores the data block identifier in a delay queue, and deletes the data block corresponding to the stored identifier only under a specified condition. This solves the prior-art problem that the recycle bin provided in the NameNode cannot undo mistaken deletions in certain situations, which reduces the data security of a Hadoop system. Because the data node does not delete the specified data blocks immediately after receiving the deletion instruction but delays the deletion for a period of time, a user who discovers a mistaken deletion within that period can recover the data blocks, so the security of the data in the Hadoop system is ensured to a large extent.

Referring to Fig. 3, which shows a flowchart of the method for deleting files in a distributed data warehouse provided in another embodiment of the present invention. This embodiment is described mainly with the method applied in the management node 104 of the implementation environment shown in Fig. 1. The method may comprise:

301. Receive a file deletion instruction, sent by the client, that indicates a specified file to be deleted;

The management node receives the file deletion instruction, sent by the client, that indicates the specified file to be deleted.

302. Receive a heartbeat report sent by a data node, and determine the mapping from data block identifiers to the data node according to the heartbeat report, where the heartbeat report comprises the data block identifiers of all data blocks stored by the data node;

The management node receives the heartbeat report sent by the data node and determines the mapping from data block identifiers to the data node according to the heartbeat report, where the heartbeat report comprises the data block identifiers of all data blocks stored by the data node.

In general, a data node periodically sends heartbeat reports to the management node, and the management node sends a heartbeat response to the data node in return. A heartbeat report usually contains information such as the data block identifiers of all data blocks stored by the data node and the storage space those blocks occupy.

303. For each data node, determine the data blocks that are stored in the data node and belong to the specified file, according to the pre-stored correspondence between files and data block identifiers and the mapping from data block identifiers to the data node determined from the heartbeat report;

For each data node, the management node determines the data blocks that are stored in the data node and belong to the specified file, according to the pre-stored correspondence between files and data block identifiers and the mapping from data block identifiers to the data node determined from the heartbeat report.

In practice, the management node usually holds the metadata of the files stored in the Hadoop system. This metadata describes information such as a file's permissions, its directory and its data blocks. For example, the metadata may record a file together with the data block identifiers of the data blocks the file comprises; that is, the management node stores the correspondence between a file and the identifiers of its data blocks. The management node also holds the mapping BlocksMap, which contains the correspondence between each data block and the data nodes that store it, that is, the mapping from data blocks to data nodes. This mapping is determined from the information that data nodes report in their heartbeat reports. The reported information usually comprises the data block identifiers of the blocks the data node stores and the storage space those blocks occupy, so the management node can map the reported data block identifiers to the reporting data node.

The data blocks that are stored in a data node and belong to the specified file can thus be determined from the correspondence between files and data block identifiers pre-stored in the management node and the pre-stored correspondence between data block identifiers and data nodes. For example, suppose the correspondence between the specified file and its data block identifiers is <specified file, data block 1, data block 2, data block 4>, that is, the specified file comprises data block 1, data block 2 and data block 4. If the information a data node reports in its heartbeat report comprises the identifiers of data block 2, data block 3 and data block 4, the management node can determine that the blocks of the specified file stored in this data node are data block 2 and data block 4.
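The worked example above amounts to intersecting the file's block list with the blocks a data node reported. A minimal sketch follows; the function name `blocks_of_file_on_node` is hypothetical, and plain integers stand in for data block identifiers.

```python
def blocks_of_file_on_node(file_block_ids, reported_block_ids):
    """Return the blocks of the file that the reporting data node stores,
    keeping the order in which the file lists them."""
    reported = set(reported_block_ids)
    return [b for b in file_block_ids if b in reported]


# The example from the text: the file comprises blocks 1, 2 and 4,
# and the data node reports blocks 2, 3 and 4.
print(blocks_of_file_on_node([1, 2, 4], [2, 3, 4]))  # prints [2, 4]
```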

304. Add a deletion instruction carrying the data block identifiers of the data blocks to the heartbeat response sent to the data node, so that the data node receives the deletion instruction, stores the data block identifiers in a delay queue, and deletes, under a specified condition, the data blocks corresponding to the data block identifiers stored in the delay queue.

The management node adds the deletion instruction carrying the data block identifiers of the data blocks to the heartbeat response sent to the data node, so that the data node receives the deletion instruction, stores the data block identifiers in the delay queue, and deletes, under the specified condition, the data blocks corresponding to the identifiers stored in the delay queue.

In general, a data node periodically sends heartbeat reports to the management node, and the management node sends a heartbeat response to the data node in return.

The management node adds the deletion instruction carrying the data block identifiers to the heartbeat response sent to the data node, so that the data node can obtain the deletion instruction from the heartbeat response. For example, when the management node determines in step 303 that a data node stores at least one data block of the specified file, the deletion instruction sent to that data node contains the identifiers of those data blocks. After receiving the deletion instruction, the data node can store these identifiers in its delay queue, that is, temporarily retain the data blocks that are to be deleted, and delete the data blocks corresponding to the stored identifiers only under the specified condition.

In practice, the management node can deliver the deletion instruction to the data node through the heartbeat response. The management node can also send the deletion instruction to the data node separately.

For example, when a specified file to be deleted comprises multiple data blocks that are distributed across multiple data nodes, the management node determines which data blocks of the specified file each data node stores and sends a different deletion instruction to each of these data nodes. Each deletion instruction contains the identifiers of the data blocks of the specified file that are stored in the corresponding data node. In this way, every data node that stores data blocks of the specified file receives a deletion instruction specific to that data node.
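The per-node fan-out described above can be sketched as follows. The function name `build_delete_instructions`, the dictionary form of the block map and the `"DELETE"` instruction layout are hypothetical and serve only to illustrate grouping a file's blocks by the data nodes that hold them.

```python
def build_delete_instructions(file_block_ids, blocks_map):
    """Group the blocks of a file by data node, one instruction per node.

    blocks_map: hypothetical mapping of block identifier -> set of node ids,
    standing in for the management node's BlocksMap.
    """
    per_node = {}
    for block_id in file_block_ids:
        for node_id in blocks_map.get(block_id, ()):
            per_node.setdefault(node_id, []).append(block_id)
    # Only nodes that hold at least one block of the file get an instruction.
    return {node_id: {"type": "DELETE", "block_ids": sorted(ids)}
            for node_id, ids in per_node.items()}
```

Each resulting instruction would then be attached to the next heartbeat response sent to the corresponding data node, or sent separately.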

In a first possible implementation of this embodiment,

the method for deleting files in the distributed data warehouse further comprises:

receiving a file recovery instruction, sent by the client, indicating that the specified file is to be recovered;

recovering the qualifying first correspondence, which is the first correspondence that was backed up before the delete instruction was sent and whose backup time is closest to the time at which the delete instruction was sent, the first correspondence comprising the specified file and the data block identifiers of the data blocks that the specified file comprises;

recovering the second correspondence, which comprises the data block identifiers of the data blocks and the data nodes that store those data blocks.

In a second possible implementation of this embodiment,

recovering the first correspondence comprises:

obtaining the qualifying first correspondence;

reading the first correspondence into the memory of the management node.

In a third possible implementation of this embodiment,

recovering the second correspondence comprises:

sending, to the data node, a recovery instruction indicating that the data blocks corresponding to the data block identifiers stored in the delay queue are to be recovered, so that the data node, after receiving the recovery instruction, sends the management node a heartbeat report carrying the data block identifiers of all the data blocks it stores;

receiving the heartbeat report sent by the data node;

building the mapping from data block identifiers to data nodes according to the data block identifiers in the received heartbeat report.

In a fourth possible implementation of this embodiment,

the method for deleting files in the distributed data warehouse further comprises:

receiving occupancy parameters sent by the data node, the occupancy parameters comprising a delayed deletion storage space and a delayed deletion percentage, where the delayed deletion storage space is the storage space occupied by the data blocks corresponding to all the data block identifiers in the delay queue of the data node, and the delayed deletion percentage is the percentage of the data node's total storage space that the delayed deletion storage space occupies, so that the client, after viewing the occupancy parameters, can determine whether to issue to the data node a flush instruction indicating that the data blocks corresponding to all the data block identifiers in the delay queue of the data node are to be emptied.

In summary, in the method for deleting files in a distributed data warehouse provided in this embodiment of the present invention, a delete instruction carrying data block identifiers is sent to the data node, so that the data node stores the data block identifiers in a delay queue and deletes the corresponding data blocks when a specified condition is met. This solves the problem in the prior art that the recycle bin provided in the NameNode cannot undo a mistaken deletion in some cases, which reduces the data security of the Hadoop system. Because the data node does not delete the specified data blocks immediately after receiving the delete instruction but delays the deletion for a period of time, a user who discovers a mistaken deletion during that period can recover the data blocks, which ensures the security of the data in the Hadoop system to a large extent.

In practice, the client can issue a delete instruction to the management node to delete one or more specified files. When the client discovers that the operation corresponding to this delete instruction was a mistake, it can issue a recovery instruction to the management node within a short time to recover the mistakenly deleted files. The specific deletion and recovery process is shown in Fig. 4A.

Refer to Fig. 4A, which shows a flowchart of a method for deleting files in a distributed data warehouse provided in another embodiment of the present invention. This embodiment is described mainly with the method applied in the implementation environment shown in Fig. 1. The method for deleting files in the distributed data warehouse may comprise:

401, the client sends the management node a file deletion instruction indicating that a specified file is to be deleted;

In practice, the file deletion instruction can name one, two or more specified files.

402, the management node receives the file deletion instruction;

403, the data node sends a heartbeat report to the management node;

The heartbeat report comprises the data block identifiers of all the data blocks stored on the data node.

404, the management node receives the heartbeat report and determines the mapping from data block identifiers to data nodes according to the heartbeat report;

405, for each data node, the management node determines the data blocks that are stored on the data node and belong to the specified file, according to the prestored correspondence between files and data block identifiers and the mapping from data block identifiers to data nodes;

In practice, the management node usually holds the metadata of the files stored in the Hadoop system. The metadata describes a file, for example its permission information, its directory and its data blocks. In particular, the metadata of the management node can record each file together with the data block identifiers of the data blocks the file comprises; that is, the management node stores the correspondence between files and the data block identifiers of their data blocks. The management node also holds a mapping, BlocksMap, that records for each data block the data nodes storing it, that is, the mapping from data blocks to data nodes. This mapping is determined from the heartbeat reports sent by the data nodes. A heartbeat report usually carries information such as the data block identifiers of the data blocks stored on the data node and the storage space those data blocks occupy, so the management node can map the data block identifiers in a heartbeat report to the data node that sent it.

According to the correspondence between files and data block identifiers and the correspondence between data block identifiers and data nodes, both prestored in the management node, the management node can determine the data blocks that are stored on a data node and belong to the specified file. For example, suppose there is one specified file and its correspondence with data block identifiers is <specified file, data block 1, data block 2, data block 4>, that is, the specified file comprises data block 1, data block 2 and data block 4. If the heartbeat report of a data node carries the data block identifiers of data block 2, data block 3 and data block 4, the management node can determine that the data blocks of the specified file stored on this data node are data block 2 and data block 4. As another example, suppose there are two specified files, with the correspondences <specified file 1, data block 1, data block 2, data block 4> and <specified file 2, data block 3, data block 5, data block 6>. If the heartbeat report sent by a data node carries data block 2, data block 3, data block 6 and data block 8, the management node can determine that data block 2, data block 3 and data block 6 of the two specified files are stored on this data node.
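The two worked examples above amount to intersecting the blocks of the specified files with the blocks a data node reports. The following sketch reproduces them; the function and variable names are hypothetical, and block identifiers are modelled as plain integers for brevity.

```python
def blocks_to_delete(specified_files, file_to_blocks, reported_blocks):
    """Return the reported blocks that belong to any of the specified files.

    file_to_blocks models the prestored file -> block ID correspondence;
    reported_blocks models the block IDs carried in one heartbeat report.
    """
    wanted = set()
    for name in specified_files:
        wanted.update(file_to_blocks[name])
    return sorted(wanted & set(reported_blocks))


file_to_blocks = {
    "specified file 1": [1, 2, 4],
    "specified file 2": [3, 5, 6],
}

# First example: one specified file, data node reports blocks 2, 3 and 4.
one_file = blocks_to_delete(["specified file 1"], file_to_blocks, [2, 3, 4])

# Second example: two specified files, data node reports blocks 2, 3, 6 and 8.
two_files = blocks_to_delete(
    ["specified file 1", "specified file 2"], file_to_blocks, [2, 3, 6, 8]
)
```

Here `one_file` holds data blocks 2 and 4, and `two_files` holds data blocks 2, 3 and 6, matching the results stated in the text.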

406, the management node adds a delete instruction carrying the data block identifiers of the data blocks to the heartbeat response that it sends to the data node;

The management node sends the data node a delete instruction carrying the data block identifiers of the data blocks. For example, when the management node determines in step 405 that a data node stores at least one data block of the specified file, the delete instruction sent to this data node includes the data block identifiers of those data blocks.

Generally, when a specified file to be deleted comprises multiple data blocks that are distributed across multiple data nodes, the management node determines which data blocks of the specified file each data node stores and sends a different delete instruction to each of these data nodes. Each delete instruction includes the data block identifiers of the data blocks of the specified file stored on the receiving data node. In this way, every data node that stores data blocks of the specified file receives its own delete instruction and deletes the corresponding data blocks according to the predetermined condition.

407, the data node receives the heartbeat response and obtains from it the delete instruction, carrying data block identifiers, sent by the management node;

The heartbeat response received by the data node includes the delete instruction, and the delete instruction includes data block identifiers. After receiving the delete instruction, the data node can determine that the data blocks corresponding to these data block identifiers are the data blocks of the file that the client asked to delete.

408, the data node stores the data block identifiers in the delay queue;

After receiving the delete instruction, the data node stores these data block identifiers in the delay queue. That is, it temporarily retains the data blocks to be deleted and deletes the data blocks corresponding to the data block identifiers stored in the delay queue only when a specified condition is met.

In one case, the data block corresponding to a data block identifier is deleted when the time the identifier has spent in the delay queue reaches a predetermined time threshold. That is, a timestamp is recorded when the data block identifier is stored in the delay queue, and when the time elapsed since this timestamp reaches the predetermined time threshold, the data block corresponding to the data block identifier is deleted. In practice, the client can issue to the data node a time configuration instruction carrying a specified duration; the time configuration instruction is used to configure the predetermined time threshold dynamically. The data node receives the time configuration instruction sent by the client and updates the predetermined time threshold to the specified duration.

In another case, the data node deletes the data blocks corresponding to all the data block identifiers in the delay queue when it receives a flush instruction, issued by the client, indicating that these data blocks are to be emptied. In practice, the data node determines the data blocks corresponding to all the data block identifiers in the delay queue, calculates the storage space these data blocks occupy and takes it as the delayed deletion storage space. The data node then calculates the percentage of its total storage space that the delayed deletion storage space occupies and takes it as the delayed deletion percentage. The data node sends the delayed deletion storage space and the delayed deletion percentage to the management node, which receives them. After viewing the delayed deletion storage space and the delayed deletion percentage, the client determines whether it needs to issue a flush instruction to the data node, and issues the flush instruction if so. After receiving the flush instruction, the data node deletes the data blocks corresponding to all the data block identifiers in the delay queue, that is, all the data blocks awaiting delayed deletion.
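The two deletion conditions (time threshold and flush instruction) can be sketched with a small in-memory queue. This is an illustrative sketch under stated assumptions, not the data node's actual code: the class `DelayQueue`, its method names and the injectable clock are hypothetical, and the methods only return the identifiers whose blocks would be physically deleted.

```python
import time


class DelayQueue:
    """Holds block IDs awaiting deletion, with the time each was stored."""

    def __init__(self, threshold_sec, clock=time.monotonic):
        self.threshold_sec = threshold_sec  # predetermined time threshold
        self._clock = clock                 # injectable so the demo is deterministic
        self._stored_at = {}                # block ID -> timestamp when stored

    def put(self, block_id):
        # Record the timestamp only the first time an identifier is stored.
        self._stored_at.setdefault(block_id, self._clock())

    def set_threshold(self, seconds):
        # Models the time configuration instruction carrying a specified duration.
        self.threshold_sec = seconds

    def pop_expired(self):
        # Case 1: identifiers whose time in the queue has reached the threshold.
        now = self._clock()
        expired = [b for b, t in self._stored_at.items()
                   if now - t >= self.threshold_sec]
        for b in expired:
            del self._stored_at[b]
        return expired

    def flush(self):
        # Case 2: a flush instruction empties the whole queue at once.
        everything = list(self._stored_at)
        self._stored_at.clear()
        return everything


fake_now = [0.0]
queue = DelayQueue(threshold_sec=60, clock=lambda: fake_now[0])
queue.put("blk_2")
fake_now[0] = 30.0
queue.put("blk_4")
fake_now[0] = 61.0
first_batch = queue.pop_expired()   # blk_2 has waited 61 s, blk_4 only 31 s
queue.set_threshold(10)
second_batch = queue.pop_expired()  # blk_4 now exceeds the shorter threshold
queue.put("blk_6")
flushed = queue.flush()
```

Keeping only identifiers and timestamps in the queue leaves the blocks themselves untouched on disk until one of the two conditions fires, which is what makes recovery possible in the meantime.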

409, the client sends the management node a file recovery instruction indicating that the specified file is to be recovered;

When the client discovers that the file deletion instruction it issued was wrong, that is, the operation corresponding to the file deletion instruction was a mistake, the client sends the management node a file recovery instruction indicating that the specified file is to be recovered.

410, the management node receives the file recovery instruction;

After receiving the file recovery instruction, the management node recovers the qualifying first correspondence and the second correspondence. The first correspondence comprises the specified file and the data block identifiers of the data blocks that the specified file comprises; the second correspondence comprises the mapping from data block identifiers to data nodes. Recovery of the first correspondence is described in steps 411 to 412 below, and recovery of the second correspondence is described in steps 413 to 417 below.

411, the management node obtains the qualifying first correspondence;

After receiving the file recovery instruction sent by the client, the management node obtains the qualifying first correspondence.

The qualifying first correspondence is the first correspondence that was backed up before the delete instruction was sent and whose backup time is closest to the time at which the delete instruction was sent. The first correspondence comprises the specified file and the data block identifiers of the data blocks that the specified file comprises.

In practice, the management node can back up its metadata at predetermined time intervals, and each backup of the metadata is packaged and saved. The metadata includes files and the data block identifiers of the data blocks the files comprise. Therefore, after receiving the file recovery instruction sent by the client, the management node can obtain the qualifying backup metadata, namely the metadata that was backed up before the delete instruction was sent and whose backup time is closest to the time at which the delete instruction was sent. This backup metadata contains the correspondence between the specified file and the data block identifiers of the data blocks that the specified file comprises.
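Selecting the qualifying backup reduces to picking the latest backup time that precedes the delete instruction. A minimal sketch follows; the function name and the use of plain numeric timestamps are assumptions made for illustration.

```python
def pick_qualifying_backup(backup_times, delete_time):
    """Return the latest backup time strictly before delete_time, or None.

    backup_times: timestamps at which metadata backups were taken.
    delete_time: timestamp at which the delete instruction was sent.
    """
    earlier = [t for t in backup_times if t < delete_time]
    return max(earlier) if earlier else None


backup_times = [100, 200, 300]
chosen = pick_qualifying_backup(backup_times, delete_time=250)
none_available = pick_qualifying_backup(backup_times, delete_time=50)
```

With backups at times 100, 200 and 300 and a delete instruction at time 250, the backup taken at 200 is chosen: the backup at 300 was taken after the deletion and no longer contains the specified file.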

412, the management node reads the first correspondence into its memory;

In practice, the management node can read the backup metadata into its memory after restarting. Because the first correspondence is stored in the qualifying backup metadata, the management node thereby reads the first correspondence in the backup metadata into its memory.

In one possible implementation, once the management node has received the file recovery instruction sent by the client and obtained the qualifying first correspondence, it can read the first correspondence into memory directly; restarting the management node is not required.

413, the management node sends the data node a recovery instruction indicating that the data blocks corresponding to the data block identifiers stored in the delay queue are to be recovered;

When a client wants to read a file in Hadoop, it needs to obtain from the management node all the data block identifiers of the file and the mapping between these data block identifiers and data nodes; only then can it read the data blocks of the file from the data nodes that store them. Therefore, after reading the first correspondence into its memory, the management node also needs to establish the mapping between the data block identifiers of the specified file and the data nodes. To do so, the management node sends the data node a recovery instruction indicating that the data blocks corresponding to the data block identifiers stored in the delay queue are to be recovered.

414, the data node receives the recovery instruction;

415, the data node sends the management node a heartbeat report carrying the data block identifiers of all the data blocks it stores;

After receiving the recovery instruction sent by the management node, the data node sends the management node a heartbeat report carrying the data block identifiers of all the data blocks it stores. The data blocks corresponding to the data block identifiers in the delay queue may not have been physically deleted yet, so the heartbeat report also includes the data block identifiers of these data blocks awaiting delayed deletion.

In practice, the data node can restart after receiving the recovery instruction issued by the management node. The delay queue is usually held in the memory of the data node, so it is lost when the data node restarts. The memory of the restarted data node no longer contains the delay queue, and the data blocks corresponding to the data block identifiers that were in the delay queue are therefore not deleted.

416, the management node receives the heartbeat report sent by the data node;

417, the management node builds the mapping from data block identifiers to data nodes according to the data block identifiers in the received heartbeat report.

At this point the specified file and the data block identifiers it comprises have been read into the memory of the management node (the first correspondence has been recovered), and the mapping between these data block identifiers and the data nodes has been established (the second correspondence has been recovered). When the client needs to read the specified file, the management node can return to the client the data block identifiers corresponding to the specified file and their mapping to data nodes. The client can then read the data blocks of the specified file from the corresponding data nodes according to this mapping, which completes the read operation on the specified file.
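Rebuilding the second correspondence from heartbeat reports, and then answering a read request, can be sketched as follows. The function names, node names and integer block identifiers are hypothetical and serve only to illustrate the data flow of steps 415 to 417.

```python
def rebuild_block_map(heartbeat_reports):
    """Build block ID -> set of data nodes from per-node heartbeat reports."""
    block_map = {}
    for node, block_ids in heartbeat_reports.items():
        for block_id in block_ids:
            block_map.setdefault(block_id, set()).add(node)
    return block_map


def locate_file(name, file_to_blocks, block_map):
    """Return, for each block of the file, the data nodes that hold it."""
    return {b: sorted(block_map.get(b, ())) for b in file_to_blocks[name]}


# First correspondence, restored from the qualifying backup metadata.
file_to_blocks = {"specified file": [1, 2, 4]}

# Heartbeat reports sent after the recovery instruction; they still list
# blocks that were waiting in the delay queue and were not physically deleted.
heartbeat_reports = {"dn-a": [1, 2], "dn-b": [2, 4, 7]}

block_map = rebuild_block_map(heartbeat_reports)
locations = locate_file("specified file", file_to_blocks, block_map)
```

The resulting `locations` is what the management node would hand back to a client that wants to read the recovered file.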

In summary, in the method for deleting files in a distributed data warehouse provided in this embodiment of the present invention, the data node receives the delete instruction, carrying data block identifiers, sent by the management node, stores the data block identifiers in a delay queue, and deletes the data blocks corresponding to the data block identifiers stored in the delay queue when a specified condition is met. This solves the problem in the prior art that the recycle bin provided in the NameNode cannot undo a mistaken deletion in some cases, which reduces the data security of the Hadoop system. Because the data node does not delete the specified data blocks immediately after receiving the delete instruction but delays the deletion for a period of time, a user who discovers a mistaken deletion during that period can recover the data blocks, which ensures the security of the data in the Hadoop system to a large extent.

In one possible implementation, refer to Fig. 4B, which shows a schematic diagram of the method for deleting files in a distributed data warehouse provided in some embodiments of the present invention. The method may comprise:

41, the client sends the management node a delete instruction indicating that a specified file is to be deleted;

42, the management node looks up and deletes the metadata corresponding to the specified file and the mapping between the data blocks of the specified file and the data nodes;

43, the management node adds the data block identifiers of the data blocks that the specified file comprises to an invalidation queue, which can be recentinvalidateSets;

44, the management node receives the heartbeat report sent by a data node, the heartbeat report including the data block identifiers of all the data blocks stored on that data node;

45, the management node marks the data block identifiers in the invalidation queue that belong to this data node as invalid; the mark can be Invalid;

46, the management node sends the data node a delete instruction that includes the data block identifiers that are marked invalid and appear in the heartbeat report;

47, the data node receives the delete instruction;

48, the data node adds the data block identifiers of the data blocks to be deleted to the delay queue;

49, when the time since a data block identifier was added to the delay queue exceeds the predetermined time threshold, the data node physically deletes the data block corresponding to that data block identifier.
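The management-node side of this flow (adding identifiers to the invalidation queue and answering a heartbeat report with a delete instruction) can be sketched as below. The class and method names are hypothetical. The sketch also assumes that identifiers stay in the invalidation set after a delete instruction is produced, since the text does not say when they are removed.

```python
class ManagementNodeSketch:
    """Toy model of the invalidation queue kept by the management node."""

    def __init__(self):
        self.invalid_blocks = set()  # stands in for the invalidation queue

    def delete_file(self, block_ids):
        # Add the block IDs of the specified file to the invalidation queue.
        self.invalid_blocks.update(block_ids)

    def on_heartbeat(self, reported_block_ids):
        # Block IDs that are invalid and stored on the reporting data node
        # form the delete instruction returned to that node.
        return sorted(self.invalid_blocks & set(reported_block_ids))


node = ManagementNodeSketch()
node.delete_file([1, 2, 4])
delete_instruction = node.on_heartbeat([2, 3, 4])
nothing_to_delete = node.on_heartbeat([5, 8])
```

A data node that reports none of the invalid blocks receives an empty delete instruction, so only nodes that actually hold blocks of the deleted file are asked to queue anything.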

In practice, the data node can also report to the management node the storage space occupied by its data blocks awaiting delayed deletion. The client can then query, for each data node, the storage space occupied by these data blocks and the percentage of the data node's storage space that it represents, and can decide from this information on which data nodes to empty the data blocks awaiting delayed deletion. See the description of Fig. 5 for details.

Refer to Fig. 5, which shows a flowchart of a method for deleting files in a distributed data warehouse provided in yet another embodiment of the present invention. This embodiment is described mainly with the method applied in the implementation environment shown in Fig. 1. The method may comprise:

501, the data node determines the data blocks corresponding to all the data block identifiers in the delay queue;

502, the data node calculates the storage space occupied by these data blocks and takes it as the delayed deletion storage space;

503, the data node calculates the percentage of its total storage space that the delayed deletion storage space occupies and takes it as the delayed deletion percentage;

504, the data node sends the delayed deletion storage space and the delayed deletion percentage to the management node;

505, the management node receives the delayed deletion storage space and the delayed deletion percentage sent by the data node;

506, the client views the delayed deletion storage space and the delayed deletion percentage that each data node sent to the management node;

The client can query the storage space (the delayed deletion storage space) and the percentage (the delayed deletion percentage) of the data blocks awaiting delayed deletion in the Hadoop system, down to the storage status of each individual data node. For example, Table 1 below shows the data storage status of one data node.

Table 1

Here Delay Deleting and Delay Deleting% are added for the data node; they represent the delayed deletion storage space and the delayed deletion percentage respectively. DFS Used represents the physical storage space actually used on the data node, including the storage space of the data blocks awaiting delayed deletion, so the effective storage space used by the system is DFS Used - Delay Deleting.

The client can also view the data storage status of each data node of the Hadoop system with the report command, to which Delay Deleting and Delay Deleting% are likewise added to represent the delayed deletion storage space and the delayed deletion percentage respectively.

For example, the command for viewing the data storage status of each data node of the Hadoop system is hadoop dfsadmin -report, and its output is as follows:

“Datanodes available: 3 (3 total, 0 dead)

Name: 10.136.138.225:50010

Decommission Status: Normal

Configured Capacity: 950674255872 (885.38 GB)

DFS Used: 730422607872 (680.25 GB)

DFS Remaining: 220251648000 (205.13 GB)

DFS Delay Deleting: 2684354560 (2.5 GB)

DFS Used%: 76.83%

DFS Remaining%: 23.17%

DFS Delay Deleting%: 0%

Last contact: Wed Nov 06 10:50:49 CST 2013

Name: 10.185.1.159:50010

Decommission Status: Normal

Configured Capacity: 950674255872 (885.38 GB)

DFS Used: 730422607872 (680.25 GB)

DFS Remaining: 220251648000 (205.13 GB)

DFS Delay Deleting: 2684354560 (2.5 GB)

DFS Used%: 76.83%

DFS Remaining%: 23.17%

DFS Delay Deleting%: 0%

Last contact: Wed Nov 06 10:50:49 CST 2013

Name: 10.185.1.160:50010

Decommission Status: Normal

Configured Capacity: 950674255872 (885.38 GB)

DFS Used: 730422607872 (680.25 GB)

DFS Remaining: 220251648000 (205.13 GB)

DFS Delay Deleting: 2684354560 (2.5 GB)

DFS Used%: 76.83%

DFS Remaining%: 23.17%

DFS Delay Deleting%: 0%

Last contact: Wed Nov 06 10:50:49 CST 2013”
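As a worked example, the effective storage space and the delayed deletion percentage can be computed from the byte counts in the report above. The variable names are hypothetical, and the sketch assumes that the "total storage space" used for the percentage is the Configured Capacity. It keeps the unrounded percentage; how the report itself rounds the figure is not specified in the source.

```python
# Byte counts taken from one data node entry in the report.
configured_capacity = 950674255872
dfs_used = 730422607872
delay_deleting = 2684354560

# Effective storage space: DFS Used minus the delayed deletion storage space.
effective_used = dfs_used - delay_deleting

# Delayed deletion percentage relative to total capacity (assumption).
delay_deleting_pct = 100.0 * delay_deleting / configured_capacity

# The delayed deletion storage space expressed in units of 1024**3 bytes.
delay_deleting_gb = delay_deleting / 1024**3

print(effective_used, round(delay_deleting_pct, 2), delay_deleting_gb)
```

The subtraction shows how much of DFS Used is live data, as opposed to blocks that are merely waiting out the delay interval.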

507, the client determines whether it needs to issue, to at least one of the data nodes, a flush instruction indicating that the data blocks corresponding to all the data block identifiers in the delay queue of the data node are to be emptied;

In practice, an administrator can decide from the delayed deletion storage space and the delayed deletion percentage of a data node whether the data blocks awaiting delayed deletion on that data node need to be cleared. For example, when both values are large, such as when the delayed deletion storage space exceeds a first predetermined threshold and the delayed deletion percentage exceeds a second predetermined threshold, the administrator can decide to delete the data blocks awaiting delayed deletion on that data node.
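The two-threshold rule in this example can be written as a small selection function. The names, the sample figures and the threshold values below are hypothetical; the source names the two thresholds but does not give values for them.

```python
def nodes_to_flush(occupancy, space_threshold, percent_threshold):
    """Select data nodes whose delayed deletion space AND percentage are large.

    occupancy: node name -> (delayed deletion bytes, delayed deletion percent).
    """
    return [
        node
        for node, (space, percent) in sorted(occupancy.items())
        if space > space_threshold and percent > percent_threshold
    ]


GB = 1024**3
occupancy = {
    "dn-a": (50 * GB, 12.0),  # large in both respects
    "dn-b": (2 * GB, 0.3),    # small in both respects
    "dn-c": (80 * GB, 4.0),   # large space but low percentage
}
selected = nodes_to_flush(occupancy, space_threshold=10 * GB, percent_threshold=5.0)
```

Only `dn-a` exceeds both thresholds, so it is the only node selected to receive a flush instruction.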

508, if the client determines that a flush instruction needs to be issued to at least one of the data nodes, it issues the flush instruction to the data nodes so determined;

In practice, the client clears the data blocks awaiting delayed deletion on data nodes with a command of the following form:

hadoop dfsadmin -clearDelayDeletedFile [datanodeIp]

Here datanodeIp is optional and is the IP address of the data node to be cleared. When datanodeIp is empty, the data blocks awaiting delayed deletion on all data nodes in the Hadoop cluster are emptied.

That is, an administrator can view on the client the storage space and percentage occupied by the data blocks awaiting delayed deletion on each data node, and can choose according to the actual situation to delete those data blocks on one data node, on some data nodes, or on all data nodes. When the administrator chooses to delete those data blocks on one data node, the client sends the flush instruction to that data node.

509, the data node receives the flush instruction sent by the client;

510, the data node deletes the data blocks corresponding to all the data block identifiers in the delay queue.

In addition, the client can issue to the data node a time configuration instruction carrying a specified duration. The time configuration instruction is used to configure the predetermined time threshold of the data node dynamically, so that the data node updates the predetermined time threshold to the specified duration. The data node then deletes the data block corresponding to a data block identifier when the time the identifier has spent in the delay queue reaches the updated threshold.

In practice, a delayed deletion switch can be added to the configuration file conf/hdfs-site.xml, for example:

“<property>

<name>dfs.delaydeletion.time.sec</name>

<value>value</value>

</property>”

When value is greater than 0, delayed deletion is switched on and value is the delay interval, whose unit can be seconds, minutes or hours. When value is less than or equal to 0, delayed deletion is switched off.

That is, a delayed deletion switch can be provided. When the switch is on, the data node clears the data block corresponding to a data block identifier once the identifier has been in the delay queue for the predetermined time threshold (the delay interval set by value). When the switch is off, the data blocks corresponding to the data block identifiers in the delay queue are not cleared on this time-delay basis.
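The switch semantics can be captured in a few lines. This sketch assumes the configured value is an integer number of seconds (suggested by the property name dfs.delaydeletion.time.sec but not required by the text); the function name is hypothetical.

```python
def parse_delay_setting(raw_value):
    """Interpret the configured value as (switch_on, delay_seconds).

    A value greater than 0 switches delayed deletion on and gives the delay
    interval; a value less than or equal to 0 switches it off.
    """
    seconds = int(raw_value)
    if seconds > 0:
        return True, seconds
    return False, 0


switched_on = parse_delay_setting("3600")
switched_off = parse_delay_setting("0")
negative = parse_delay_setting("-1")
```

Using a single value as both the switch and the interval means that one setting controls whether delayed deletion applies and for how long.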

The Hadoop system reads this value by default at startup. The user can also update the configuration dynamically, without restarting the Hadoop system, by entering the following command:

hadoop dfsadmin -setDelayDeletedTimeSec value

Here value is the new delay interval, whose unit can be seconds, minutes or hours.

Note that in practice a user who deletes something by mistake usually realizes the mistake within a very short time, so the unit of value can be set to seconds.

In summary, in the method for deleting files in a distributed data warehouse provided in this embodiment of the present invention, the data node determines the delayed deletion storage space and the delayed deletion percentage and sends them to the management node. After viewing the delayed deletion storage space and the delayed deletion percentage of each data node, the client can decide whether to clear the data blocks awaiting delayed deletion on some or all of the data nodes, so that the data nodes to be cleared can be chosen selectively.

Refer to Fig. 6, which shows a schematic structural diagram of a system for deleting files in a distributed data warehouse provided in one embodiment of the present invention. This embodiment is described mainly with the system applied in the implementation environment shown in Fig. 1. The system may comprise a client 620, a management node 640 and at least one data node 660.

The client 620 may comprise a device for deleting files in a distributed data warehouse, and this device may comprise a third sending module 621.

The third sending module 621 can be used to send the management node 640 a file deletion instruction indicating that a specified file is to be deleted, so that the management node 640, after receiving the file deletion instruction, determines for each data node 660 the data blocks that are stored on the data node 660 and belong to the specified file, according to the prestored correspondence between files and data block identifiers and the prestored correspondence between data block identifiers and data nodes 660, and adds a delete instruction carrying the data block identifiers of these data blocks to the heartbeat response sent to the data node 660, so that the data node 660 receives the delete instruction, stores the data block identifiers in the delay queue, and deletes the data blocks corresponding to the data block identifiers stored in the delay queue when a specified condition is met.

The management node 640 may comprise a device for deleting files in a distributed data warehouse, and this device may comprise a third receiving module 641, a fourth receiving module 642, a second determining module 643 and a second sending module 644.

The third receiving module 641 is used to receive the file deletion instruction, sent by the third sending module 621 of the client 620, indicating that a specified file is to be deleted;

The fourth receiving module 642 is used to receive the heartbeat report sent by the data node and to determine the mapping from data block identifiers to data nodes according to the heartbeat report, the heartbeat report comprising the data block identifiers of all the data blocks stored on the data node;

The second determining module 643 is used to determine, for each data node 660, the data blocks that are stored on the data node 660 and belong to the specified file, according to the prestored correspondence between files and data block identifiers and the mapping from data block identifiers to data nodes 660 determined from the heartbeat report;

The second sending module 644 is used to add a delete instruction carrying the data block identifiers of the data blocks to the heartbeat response sent to the data node 660, so that the data node 660 receives the delete instruction, stores the data block identifiers in the delay queue, and deletes the data blocks corresponding to the data block identifiers stored in the delay queue when a specified condition is met.

Data node 660 may comprise an apparatus for deleting files in a distributed data warehouse, and this apparatus may comprise: heartbeat sending module 661, acquisition module 662, storing module 663, and deletion module 664.

Heartbeat sending module 661 is configured to send a heartbeat report to fourth receiving module 642 of the management node, the heartbeat report comprising the data block identifiers of all data blocks stored by the data node, the heartbeat report enabling the management node to determine the mapping from data block identifiers to the data node;

Acquisition module 662 is configured to receive the heartbeat response returned by second sending module 644 of management node 640 and to obtain, from the heartbeat response, the delete instruction carrying data block identifiers sent by second sending module 644 of management node 640;

Storing module 663 is configured to store the data block identifiers in a delay queue and to record the time at which they were stored;

Deletion module 664 is configured to delete, under a specified condition, the data blocks in the data node corresponding to the data block identifiers stored in the delay queue.

In summary, in the system for deleting files in a distributed data warehouse provided in this embodiment of the present invention, the data node receives the delete instruction carrying data block identifiers sent by the management node, stores the data block identifiers in a delay queue, and deletes, under a specified condition, the data blocks corresponding to the data block identifiers stored in the delay queue. This solves the problem in the prior art that the recycle bin provided in the NameNode cannot undo a mistaken deletion in some cases, which reduces the data security of a Hadoop system. Because the data node does not delete the specified data blocks immediately after receiving the delete instruction but delays the deletion for a period of time, a user who discovers during this period that a deletion was made by mistake can recover these data blocks, thereby ensuring the security of data in the Hadoop system to a large extent.
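The data-node behaviour summarised above can be illustrated with a minimal Python sketch. It is not the implementation of the embodiment: the class names `DelayQueue` and `DataNode`, the dictionary layout of the heartbeat response, and the injectable `clock` are all assumptions made for illustration.

```python
import time
from collections import OrderedDict


class DelayQueue:
    """Hypothetical delay queue: block ID -> time the ID was enqueued."""

    def __init__(self, threshold_seconds, clock=time.time):
        self.threshold_seconds = threshold_seconds
        self._clock = clock
        self._entries = OrderedDict()

    def enqueue(self, block_id):
        # Record the enqueue time instead of deleting the block right away.
        self._entries.setdefault(block_id, self._clock())

    def pop_expired(self):
        # Remove and return the IDs whose waiting time has reached the threshold.
        now = self._clock()
        expired = [b for b, t in self._entries.items()
                   if now - t >= self.threshold_seconds]
        for block_id in expired:
            del self._entries[block_id]
        return expired

    def pending(self):
        return list(self._entries)


class DataNode:
    """Hypothetical data node holding blocks and a delay queue."""

    def __init__(self, blocks, queue):
        self.blocks = dict(blocks)  # block ID -> stored bytes
        self.queue = queue

    def handle_heartbeat_response(self, response):
        # Delete instructions arrive piggybacked on the heartbeat response.
        for block_id in response.get("delete", []):
            if block_id in self.blocks:
                self.queue.enqueue(block_id)

    def purge_expired(self):
        # Physically delete only the blocks whose delay has elapsed.
        for block_id in self.queue.pop_expired():
            self.blocks.pop(block_id, None)
```

Until `purge_expired` removes a block, the block is still present on the data node, which is what makes recovery from a mistaken deletion possible in this sketch.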

Referring to Fig. 7, which shows a schematic structural diagram of a system for deleting files in a distributed data warehouse provided in another embodiment of the present invention. This embodiment is described mainly by taking as an example the application of the system in the implementation environment shown in Fig. 1. The system may comprise: client 720, management node 740, and at least one data node 760.

Data node 760 may comprise an apparatus for deleting files in a distributed data warehouse, and this apparatus may comprise: heartbeat sending module 7611, acquisition module 761, storing module 762, and deletion module 763.

Heartbeat sending module 7611 is configured to send a heartbeat report to the management node, the heartbeat report comprising the data block identifiers of all data blocks stored by the data node, the heartbeat report enabling the management node to determine the mapping from data block identifiers to the data node;

Acquisition module 761 may be configured to receive a heartbeat response and to obtain, from the heartbeat response, the delete instruction carrying data block identifiers sent by management node 740;

Storing module 762 may be configured to store the data block identifiers in a delay queue and to record the time at which they were stored;

Deletion module 763 may be configured to delete, under a specified condition, the data blocks in the data node corresponding to the data block identifiers stored in the delay queue.

In a first possible implementation of this embodiment, deletion module 763 may comprise: first deletion unit 763a and second deletion unit 763b.

First deletion unit 763a may be configured to delete the data block in the data node corresponding to a data block identifier when the time for which the data block identifier has been stored in the delay queue reaches a predetermined time threshold;

Or,

Second deletion unit 763b may be configured to delete the data blocks in the data node corresponding to all data block identifiers in the delay queue upon receiving a flush instruction issued by client 720, the flush instruction indicating that the data blocks corresponding to all data block identifiers in the delay queue are to be cleared.
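The two deletion triggers described above (expiry of the predetermined time threshold, or a flush instruction) can be sketched as a single decision function. The function name `blocks_to_delete` and its parameters are hypothetical and are used only to illustrate the logic.

```python
def blocks_to_delete(enqueue_times, now, threshold, flush_requested=False):
    """Return the block IDs that should be physically deleted now.

    enqueue_times: dict mapping block ID -> time the ID entered the delay queue.
    """
    if flush_requested:
        # Flush instruction: every queued block is deleted regardless of age.
        return sorted(enqueue_times)
    # Otherwise only blocks that have waited at least `threshold` are deleted.
    return sorted(b for b, t in enqueue_times.items() if now - t >= threshold)
```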

In a second possible implementation of this embodiment, the apparatus for deleting files in a distributed data warehouse in data node 760 may further comprise first receiving module 764 and reporting module 765.

First receiving module 764 may be configured to receive a recovery instruction sent by management node 740, the recovery instruction indicating that the data blocks corresponding to the data block identifiers stored in the delay queue are to be recovered;

Reporting module 765 may be configured to send, to management node 740, a heartbeat report carrying the data block identifiers of all data blocks stored by the data node, so that management node 740 builds the mapping from data block identifiers to the data node according to the data block identifiers in the received heartbeat report.

In a third possible implementation of this embodiment, the apparatus for deleting files in a distributed data warehouse in data node 760 may further comprise first determination module 766, computing module 767, and first sending module 768.

First determination module 766 may be configured to determine the data blocks corresponding to all data block identifiers in the delay queue;

Computing module 767 may be configured to calculate an occupancy parameter of these data blocks in the data node;

First sending module 768 may be configured to send the occupancy parameter calculated by computing module 767 to the management node, so that management node 740 receives the occupancy parameter and client 720, after viewing the occupancy parameter, determines whether a flush instruction needs to be issued to the data node, the flush instruction indicating that the data blocks corresponding to all data block identifiers in the delay queue of data node 760 are to be cleared.

In a fourth possible implementation of this embodiment, the occupancy parameter comprises a delayed-deletion storage space and a delayed-deletion percentage, and computing module 767 may comprise:

First computing unit 767a, which may be configured to calculate the storage space of the data node occupied by the data blocks and to take this storage space as the delayed-deletion storage space;

Second computing unit 767b, which may be configured to calculate the percentage of the total storage space of the data node occupied by the delayed-deletion storage space and to take this percentage as the delayed-deletion percentage.
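As a worked illustration of the two occupancy values just defined, the following Python sketch computes the delayed-deletion storage space as the sum of the sizes of the queued blocks and the delayed-deletion percentage as that sum relative to the node's total storage. The function name and the use of bytes as the unit are assumptions.

```python
def occupancy_parameters(queued_block_sizes, total_capacity_bytes):
    """Return (delayed_deletion_space, delayed_deletion_percent).

    queued_block_sizes: dict mapping block ID -> size in bytes, for the blocks
    whose identifiers are currently in the delay queue (hypothetical layout).
    """
    if total_capacity_bytes <= 0:
        raise ValueError("total capacity must be positive")
    space = sum(queued_block_sizes.values())
    percent = 100.0 * space / total_capacity_bytes
    return space, percent
```

For example, two queued blocks of 64 and 36 bytes on a node with 1000 bytes of total storage give a delayed-deletion storage space of 100 bytes and a delayed-deletion percentage of 10%.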

In a fifth possible implementation of this embodiment, the apparatus for deleting files in a distributed data warehouse in data node 760 may further comprise second receiving module 769 and update module 7610.

Second receiving module 769 may be configured to receive a time configuration instruction, sent by client 720, carrying a specified duration, the time configuration instruction being used to dynamically configure the predetermined time threshold;

Update module 7610 may be configured to update the predetermined time threshold to the specified duration according to the time configuration instruction received by second receiving module 769.

Management node 740 may comprise an apparatus for deleting files in a distributed data warehouse, and this apparatus may comprise: third receiving module 748, fourth receiving module 741, second determination module 742, and second sending module 743.

Third receiving module 748 may be configured to receive the file deletion instruction, sent by client 720, indicating that a specified file is to be deleted;

Fourth receiving module 741 is configured to receive the heartbeat report sent by heartbeat sending module 7611 of data node 760 and to determine, according to the heartbeat report, the mapping from data block identifiers to the data node, the heartbeat report comprising the data block identifiers of all data blocks stored by the data node;

Second determination module 742 may be configured to determine, for each data node 760, the data blocks stored in the data node that belong to the specified file, according to the pre-stored correspondence between files and data block identifiers and according to the mapping from data block identifiers to data nodes determined from the heartbeat report;

Second sending module 743 may be configured to add, to the heartbeat response sent to acquisition module 761 of data node 760, a delete instruction carrying the data block identifiers of the data blocks, so that data node 760 receives the delete instruction, stores the data block identifiers in a delay queue, and deletes, under a specified condition, the data blocks corresponding to the data block identifiers stored in the delay queue.
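The management-node steps above (resolve the specified file to data block identifiers, resolve each identifier to the data nodes that hold it, then piggyback the delete instruction on the next heartbeat response) can be sketched as follows. The function names and the dictionary-based message layout are assumptions, not part of the embodiment.

```python
from collections import defaultdict


def plan_delete_instructions(file_to_blocks, block_to_nodes, filename):
    """Group the block IDs of `filename` by the data nodes that store them."""
    per_node = defaultdict(list)
    for block_id in file_to_blocks.get(filename, []):
        for node_id in block_to_nodes.get(block_id, ()):
            per_node[node_id].append(block_id)
    return dict(per_node)


def heartbeat_response(pending, node_id):
    """Build the response to one node's heartbeat, draining its pending deletes."""
    return {"delete": pending.pop(node_id, [])}
```

In this sketch the delete instruction is sent only once per node: after `heartbeat_response` has drained a node's pending list, later responses to the same node carry an empty list.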

In a sixth possible implementation of this embodiment, the apparatus for deleting files in a distributed data warehouse in management node 740 may comprise: fifth receiving module 744, first recovery module 745, and second recovery module 746.

Fifth receiving module 744 may be configured to receive a file recovery instruction, sent by client 720, indicating that the specified file is to be recovered;

First recovery module 745 may be configured to recover a qualifying first correspondence, the qualifying first correspondence being the first correspondence that was backed up before the delete instruction was sent and whose backup time is closest to the time at which the delete instruction was sent, the first correspondence comprising the specified file and the data block identifiers of the data blocks that the specified file comprises;

Second recovery module 746 may be configured to recover a second correspondence, the second correspondence being the mapping from the data block identifier of a data block to the data node storing that data block.

In a seventh possible implementation of this embodiment, first recovery module 745 may comprise: acquiring unit 745a and reading unit 745b.

Acquiring unit 745a is configured to obtain the qualifying first correspondence;

Reading unit 745b may be configured to read the first correspondence obtained by acquiring unit 745a into the memory of the management node.
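Selecting the qualifying first correspondence amounts to picking, among the backups taken before the delete instruction was sent, the one with the latest backup time. The sketch below assumes, purely for illustration, that backups are available as `(backup_time, mapping)` pairs and that "before" means strictly earlier.

```python
def select_backup(backups, delete_time):
    """Return the latest (backup_time, mapping) pair taken before delete_time.

    backups: iterable of (backup_time, file_to_blocks_mapping) pairs
    (hypothetical layout). Returns None if no backup precedes delete_time.
    """
    candidates = [b for b in backups if b[0] < delete_time]
    if not candidates:
        return None
    return max(candidates, key=lambda b: b[0])
```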

In an eighth possible implementation of this embodiment, second recovery module 746 may comprise: sending unit 746a, receiving unit 746b, and building unit 746c.

Sending unit 746a may be configured to send, to first receiving module 764 of data node 760, a recovery instruction indicating that the data blocks corresponding to the data block identifiers stored in the delay queue are to be recovered, so that, after first receiving module 764 of data node 760 receives the recovery instruction, reporting module 765 sends to management node 740 a heartbeat report carrying the data block identifiers of all stored data blocks;

Receiving unit 746b may be configured to receive the heartbeat report sent by reporting module 765 of data node 760;

Building unit 746c may be configured to build the mapping from data block identifiers to data nodes according to the data block identifiers in the heartbeat report received by receiving unit 746b.
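Rebuilding the second correspondence from heartbeat reports can be sketched as inverting a node-to-blocks listing into a block-to-nodes map. The input layout (a dict from node ID to the list of block IDs that node reports) is an assumption for illustration.

```python
def build_block_map(heartbeat_reports):
    """Invert {node ID: [block IDs]} into {block ID: set of node IDs}."""
    block_to_nodes = {}
    for node_id, block_ids in heartbeat_reports.items():
        for block_id in block_ids:
            block_to_nodes.setdefault(block_id, set()).add(node_id)
    return block_to_nodes
```

Because blocks whose identifiers are still in the delay queue have not yet been physically deleted, they remain among the blocks a data node stores and would therefore reappear in such a report.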

In a ninth possible implementation of this embodiment, the apparatus for deleting files in a distributed data warehouse in management node 740 may comprise: sixth receiving module 747.

Sixth receiving module 747 may be configured to receive the occupancy parameter sent by first sending module 768 of data node 760, the occupancy parameter comprising a delayed-deletion storage space and a delayed-deletion percentage, the delayed-deletion storage space being the storage space occupied by the data blocks corresponding to all data block identifiers in the delay queue of the data node, and the delayed-deletion percentage being the percentage of the total storage space of the data node occupied by the delayed-deletion storage space, so that client 720, after viewing the occupancy parameter, determines whether to issue to the data node a flush instruction indicating that the data blocks corresponding to all data block identifiers in the delay queue of the data node are to be cleared.

Client 720 may comprise an apparatus for deleting files in a distributed data warehouse, and this apparatus may comprise: third sending module 721.

Third sending module 721 is configured to send, to third receiving module 748 of management node 740, a file deletion instruction indicating that a specified file is to be deleted, so that, after third receiving module 748 of management node 740 receives the file deletion instruction: second determination module 742 determines, for each data node, the data blocks stored in data node 760 that belong to the specified file, according to the pre-stored correspondence between files and data block identifiers and according to the mapping from data block identifiers to data nodes determined from the heartbeat reports sent by the data nodes; second sending module 743 adds, to the heartbeat response sent to acquisition module 761 of data node 760, a delete instruction carrying the data block identifiers of the data blocks, so that acquisition module 761 of data node 760 receives the delete instruction; storing module 762 stores the data block identifiers in a delay queue; and deletion module 763 deletes, under a specified condition, the data blocks corresponding to the data block identifiers stored in the delay queue.

In a tenth possible implementation of this embodiment, the apparatus for deleting files in a distributed data warehouse in client 720 may further comprise: viewing module 722, third determination module 723, and first issuing module 724.

Viewing module 722 is configured to view the occupancy parameter sent by each data node 760 to sixth receiving module 747 of management node 740, the occupancy parameter comprising a delayed-deletion storage space and a delayed-deletion percentage, the delayed-deletion storage space being the storage space occupied by the data blocks corresponding to all data block identifiers in the delay queue of the data node, and the delayed-deletion percentage being the percentage of the total storage space of the data node occupied by the delayed-deletion storage space;

Third determination module 723 is configured to determine whether a flush instruction needs to be issued to at least one data node 760 among the data nodes, the flush instruction indicating that the data blocks corresponding to all data block identifiers in the delay queue of data node 760 are to be cleared;

First issuing module 724 is configured to issue the flush instruction to second deletion unit 763b of the determined data node 760 if third determination module 723 determines that the flush instruction needs to be issued to at least one data node 760 among the data nodes.
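The text leaves the flush decision to whoever views the occupancy parameter at the client. As one hypothetical policy, a client could flush every data node whose delayed-deletion percentage exceeds a chosen limit. The function name `nodes_to_flush` and the `max_percent` limit are assumptions introduced for this sketch and do not appear in the embodiment.

```python
def nodes_to_flush(occupancy_by_node, max_percent):
    """Return the IDs of data nodes whose delayed-deletion percentage is too high.

    occupancy_by_node: dict mapping node ID -> (delayed_deletion_space,
    delayed_deletion_percent), as reported via the management node.
    max_percent: hypothetical limit chosen by the operator.
    """
    return sorted(node_id
                  for node_id, (_space, percent) in occupancy_by_node.items()
                  if percent > max_percent)
```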

In an eleventh possible implementation of this embodiment, the apparatus for deleting files in a distributed data warehouse in client 720 may further comprise: second issuing module 725.

Second issuing module 725 is configured to issue, to second receiving module 769 of data node 760, a time configuration instruction carrying a specified duration, the time configuration instruction being used to dynamically configure the predetermined time threshold, so that update module 7610 of data node 760 updates the predetermined time threshold to the specified duration according to the time configuration instruction.

In a twelfth possible implementation of this embodiment, the apparatus for deleting files in a distributed data warehouse in client 720 may further comprise: fourth sending module 726.

Fourth sending module 726 is configured to send, to fifth receiving module 744 of management node 740, a file recovery instruction indicating that the specified file is to be recovered, so that, after fifth receiving module 744 of management node 740 receives the file recovery instruction: first recovery module 745 recovers a qualifying first correspondence, the qualifying first correspondence being the first correspondence that was backed up before the delete instruction was sent and whose backup time is closest to the time at which the delete instruction was sent, the first correspondence comprising the specified file and the data block identifiers of the data blocks that the specified file comprises; and second recovery module 746 recovers a second correspondence, the second correspondence being the mapping from the data block identifiers that the specified file comprises to data node 760.

It should be noted that client 720, management node 740, and data node 760 described above may each be implemented separately.

It should be noted that, when the apparatus for deleting files in a distributed data warehouse provided in the above embodiments deletes files in the distributed data warehouse, the division into the functional modules described above is used only as an example. In practical applications, the above functions may be assigned to different functional modules as required; that is, the internal structure of the management node, the data node, and the client may be divided into different functional modules to complete all or part of the functions described above. In addition, the apparatus for deleting files in a distributed data warehouse provided in the above embodiments and the method embodiments for deleting files in a distributed data warehouse belong to the same concept; for the specific implementation process, refer to the method embodiments, which are not repeated here.

The sequence numbers of the above embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.

A person of ordinary skill in the art will understand that all or part of the steps of the above embodiments may be implemented by hardware, or by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, and the storage medium may be a read-only memory (ROM), a magnetic disk, an optical disc, or the like.

The foregoing describes only preferred embodiments of the present invention and is not intended to limit the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims

1. A method for deleting files in a distributed data warehouse, characterized in that the method comprises:

sending a heartbeat report to a management node, the heartbeat report comprising the data block identifiers of all data blocks stored by a data node, the heartbeat report enabling the management node to determine, according to the heartbeat report, the mapping from the data block identifiers to the data node;

receiving a heartbeat response, and obtaining, from the received heartbeat response, a delete instruction carrying a data block identifier sent by the management node;

2. The method according to claim 1, characterized in that the deleting, under a specified condition, of the data block in the data node corresponding to the data block identifier stored in the delay queue comprises:

deleting the data block in the data node corresponding to the data block identifier when the time for which the data block identifier has been stored in the delay queue reaches a predetermined time threshold;

or,

deleting the data blocks in the data node corresponding to all data block identifiers in the delay queue upon receiving a flush instruction issued by a client, the flush instruction indicating that the data blocks corresponding to all data block identifiers in the delay queue are to be cleared.

3. The method according to claim 2, characterized in that, after the storing of the data block identifier in the delay queue and the recording of the storage time, the method further comprises:

receiving a recovery instruction sent by the management node, the recovery instruction indicating that the data block corresponding to the data block identifier stored in the delay queue is to be recovered;

sending, to the management node, a heartbeat report carrying the data block identifiers of all data blocks stored by the data node, so that the management node builds the mapping from the data block identifiers to the data node according to the data block identifiers in the received heartbeat report.

4. The method according to claim 2 or 3, characterized in that, before the deleting of the data blocks in the data node corresponding to all data block identifiers in the delay queue upon receiving the flush instruction issued by the client, the method further comprises:

determining the data blocks in the data node corresponding to all data block identifiers in the delay queue;

calculating an occupancy parameter of the data blocks in the data node;

sending the occupancy parameter to the management node, so that the management node receives the occupancy parameter and the client, after viewing the occupancy parameter, determines whether a flush instruction needs to be issued to the data node, the flush instruction indicating that the data blocks corresponding to all data block identifiers in the delay queue of the data node are to be cleared.

5. The method according to claim 4, characterized in that the occupancy parameter comprises a delayed-deletion storage space and a delayed-deletion percentage, and the calculating of the occupancy parameter of the data blocks in the data node comprises:

calculating the storage space of the data node occupied by the data blocks, and taking this storage space as the delayed-deletion storage space;

calculating the percentage of the total storage space of the data node occupied by the delayed-deletion storage space, and taking this percentage as the delayed-deletion percentage.

6. The method according to claim 2 or 3, characterized in that the method further comprises:

receiving a time configuration instruction, sent by the client, carrying a specified duration, the time configuration instruction being used to dynamically configure the predetermined time threshold;

updating the predetermined time threshold to the specified duration according to the time configuration instruction.

7. A method for deleting files in a distributed data warehouse, characterized in that the method comprises:

receiving a file deletion instruction, sent by a client, indicating that a specified file is to be deleted;

8. The method according to claim 7, characterized in that the method further comprises:

receiving a file recovery instruction, sent by the client, indicating that the specified file is to be recovered;

recovering a qualifying first correspondence, the qualifying first correspondence being the first correspondence that was backed up before the delete instruction was sent and whose backup time is closest to the time at which the delete instruction was sent, the first correspondence comprising the specified file and the data block identifiers of the data blocks that the specified file comprises;

recovering a second correspondence, the second correspondence being the mapping from the data block identifier of a data block to the data node storing the data block.

9. The method according to claim 8, characterized in that the recovering of the qualifying first correspondence comprises:

obtaining the qualifying first correspondence;

reading the first correspondence into the memory of the management node.

10. The method according to claim 8, characterized in that the recovering of the second correspondence comprises:

sending, to the data node, a recovery instruction indicating that the data block corresponding to the data block identifier stored in the delay queue is to be recovered, so that the data node, after receiving the recovery instruction, sends to the management node a heartbeat report carrying the data block identifiers of all stored data blocks;

receiving the heartbeat report sent by the data node;

building the mapping from the data block identifiers to the data node according to the data block identifiers in the received heartbeat report.

11. The method according to any one of claims 7 to 10, characterized in that the method further comprises:

receiving an occupancy parameter sent by the data node, the occupancy parameter comprising a delayed-deletion storage space and a delayed-deletion percentage, the delayed-deletion storage space being the storage space occupied by the data blocks corresponding to all data block identifiers in the delay queue of the data node, and the delayed-deletion percentage being the percentage of the total storage space of the data node occupied by the delayed-deletion storage space, so that the client, after viewing the occupancy parameter, determines whether to issue to the data node a flush instruction indicating that the data blocks corresponding to all data block identifiers in the delay queue of the data node are to be cleared.

12. A method for deleting files in a distributed data warehouse, characterized in that the method comprises:

sending, to a management node, a file deletion instruction indicating that a specified file is to be deleted, so that the management node, after receiving the file deletion instruction: determines, for each data node, the data blocks stored in the data node that belong to the specified file, according to the pre-stored correspondence between files and data block identifiers and according to the mapping from data block identifiers to data nodes determined from the heartbeat reports sent by the data nodes; and adds, to the heartbeat response sent to the data node, a delete instruction carrying the data block identifiers of the data blocks, so that the data node receives the delete instruction, stores the data block identifiers in a delay queue, and deletes, under a specified condition, the data blocks corresponding to the data block identifiers stored in the delay queue.

13. The method according to claim 12, characterized in that the method further comprises:

viewing the occupancy parameter sent by each data node to the management node, the occupancy parameter comprising a delayed-deletion storage space and a delayed-deletion percentage, the delayed-deletion storage space being the storage space occupied by the data blocks corresponding to all data block identifiers in the delay queue of the data node, and the delayed-deletion percentage being the percentage of the total storage space of the data node occupied by the delayed-deletion storage space;

determining whether a flush instruction needs to be issued to at least one data node among the data nodes, the flush instruction indicating that the data blocks corresponding to all data block identifiers in the delay queue of the data node are to be cleared;

if it is determined that the flush instruction needs to be issued to at least one data node among the data nodes, issuing the flush instruction to the determined data node.

14. The method according to claim 12, characterized in that the method further comprises:

issuing, to the data node, a time configuration instruction carrying a specified duration, the time configuration instruction being used to dynamically configure the predetermined time threshold, so that the data node updates the predetermined time threshold to the specified duration according to the time configuration instruction.

15. The method according to claim 12, characterized in that the method further comprises:

sending, to the management node, a file recovery instruction indicating that the specified file is to be recovered, so that the management node, after receiving the file recovery instruction: recovers a qualifying first correspondence, the qualifying first correspondence being the first correspondence that was backed up before the delete instruction was sent and whose backup time is closest to the time at which the delete instruction was sent, the first correspondence comprising the specified file and the data block identifiers of the data blocks that the specified file comprises; and recovers a second correspondence, the second correspondence being the mapping from the data block identifiers that the specified file comprises to the data node.

16. An apparatus for deleting files in a distributed data warehouse, characterized in that the apparatus comprises:

a heartbeat sending module, configured to send a heartbeat report to a management node, the heartbeat report comprising the data block identifiers of all data blocks stored by a data node, the heartbeat report enabling the management node to determine, according to the heartbeat report, the mapping from the data block identifiers to the data node;

an acquisition module, configured to receive a heartbeat response and to obtain, from the heartbeat response, a delete instruction carrying a data block identifier sent by the management node;

17. The apparatus according to claim 16, characterized in that the deletion module comprises:

a first deletion unit, configured to delete the data block in the data node corresponding to the data block identifier when the time for which the data block identifier has been stored in the delay queue reaches a predetermined time threshold;

or,

a second deletion unit, configured to delete the data blocks in the data node corresponding to all data block identifiers in the delay queue upon receiving a flush instruction issued by a client, the flush instruction indicating that the data blocks corresponding to all data block identifiers in the delay queue are to be cleared.

18. The apparatus according to claim 17, characterized in that the apparatus further comprises:

a first receiving module, configured to receive a recovery instruction sent by the management node, the recovery instruction indicating that the data block corresponding to the data block identifier stored in the delay queue is to be recovered;

a reporting module, configured to send, to the management node, a heartbeat report carrying the data block identifiers of all data blocks stored by the data node, so that the management node builds the mapping from the data block identifiers to the data node according to the data block identifiers in the received heartbeat report.

19. The apparatus according to claim 17 or 18, characterized in that the apparatus further comprises:

a first determining module, configured to determine the data blocks on the data node corresponding to all data block identifiers in the delay queue;

a computing module, configured to calculate an occupancy parameter of the data blocks on the data node;

a first sending module, configured to send the occupancy parameter calculated by the computing module to the management node, so that, after the management node receives the occupancy parameter, the client determines, after viewing the occupancy parameter, whether to issue to the data node a clear instruction for instructing that the data blocks corresponding to all data block identifiers in the delay queue of the data node be cleared.

20. The apparatus according to claim 19, characterized in that the occupancy parameter comprises a delayed-deletion storage space and a delayed-deletion percentage, and the computing module comprises:

a first computing unit, configured to calculate the storage space of the data node occupied by the data blocks and to determine that storage space as the delayed-deletion storage space;

a second computing unit, configured to calculate the percentage of the total storage space of the data node that is occupied by the delayed-deletion storage space and to determine that percentage as the delayed-deletion percentage.
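
As a worked illustration of the two quantities in claim 20, the following sketch computes both from the sizes of the queued blocks. The function name `occupancy_parameters` and the representation of block sizes as a dictionary of byte counts are assumptions made for the example, not part of the claim.

```python
def occupancy_parameters(queued_block_sizes, total_capacity_bytes):
    """Return (delayed_deletion_bytes, delayed_deletion_percent).

    queued_block_sizes: dict mapping each block identifier in the delay
    queue to the bytes it occupies on this data node (assumed layout).
    """
    if total_capacity_bytes <= 0:
        raise ValueError("total capacity must be positive")
    # Delayed-deletion storage space: space held by blocks awaiting deletion.
    delayed_bytes = sum(queued_block_sizes.values())
    # Delayed-deletion percentage: that space as a share of total capacity.
    delayed_percent = 100.0 * delayed_bytes / total_capacity_bytes
    return delayed_bytes, delayed_percent
```

For example, two queued blocks of 64 and 36 bytes on a node with 400 bytes of total storage give a delayed-deletion storage space of 100 bytes and a delayed-deletion percentage of 25.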

21. The apparatus according to claim 17 or 18, characterized in that the apparatus further comprises:

a second receiving module, configured to receive a time configuration instruction that is sent by the client and carries a specified duration, the time configuration instruction being used to dynamically configure the predetermined time threshold;

an update module, configured to update the predetermined time threshold to the specified duration according to the time configuration instruction received by the second receiving module.

22. An apparatus for deleting files in a distributed data warehouse, characterized in that the apparatus comprises:

23. The apparatus according to claim 22, characterized in that the apparatus further comprises:

a fifth receiving module, configured to receive a file recovery instruction, sent by the client, for instructing recovery of the specified file;

a first recovery module, configured to recover a qualifying first correspondence, the qualifying first correspondence being a first correspondence that was backed up before the delete instruction was sent and whose backup time is closest to the time at which the delete instruction was sent, the first correspondence comprising the specified file and the data block identifiers of the data blocks that the specified file comprises;

a second recovery module, configured to recover a second correspondence, the second correspondence being the mapping from the data block identifier of a data block to the data node storing that data block.
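
The selection rule for the qualifying first correspondence in claim 23 (backed up before the delete instruction, and closest to it in time) can be sketched as follows. The function name `select_backup` and the representation of each backup as a `(backup_time, file_to_blocks)` tuple are hypothetical, and "before" is assumed here to mean strictly earlier.

```python
def select_backup(backups, delete_time):
    """Pick the backup taken before delete_time that is closest to it.

    backups: list of (backup_time, file_to_blocks) tuples, where
    file_to_blocks maps a file path to its data block identifiers.
    Returns None when no backup precedes the delete instruction.
    """
    # Only backups made before the delete instruction still contain the file.
    candidates = [b for b in backups if b[0] < delete_time]
    if not candidates:
        return None
    # Among those, the latest one is nearest in time to the delete.
    return max(candidates, key=lambda b: b[0])
```

A backup taken after the delete instruction is excluded because it would already reflect the deletion and so could not restore the file-to-block correspondence.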

24. The apparatus according to claim 23, characterized in that the first recovery module comprises:

an acquiring unit, configured to obtain the qualifying first correspondence;

a reading unit, configured to read the first correspondence obtained by the acquiring unit into the memory of the management node.

25. The apparatus according to claim 23, characterized in that the second recovery module comprises:

a sending unit, configured to send to the data node a recovery instruction for instructing recovery of the data blocks corresponding to the data block identifiers stored in the delay queue, so that the data node, after receiving the recovery instruction, sends to the management node a heartbeat report carrying the data block identifiers of all data blocks it stores;

a receiving unit, configured to receive the heartbeat report sent by the data node;

a building unit, configured to build the mapping from the data block identifiers to the data node according to the data block identifiers in the heartbeat report received by the receiving unit.
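
How the building unit of claim 25 might rebuild the second correspondence from heartbeat reports can be sketched as below. The function name `build_block_map` and the shape of a heartbeat report (a data node identifier paired with the block identifiers it stores) are assumptions for illustration only.

```python
from collections import defaultdict


def build_block_map(heartbeat_reports):
    """Rebuild the mapping from block identifier to the data nodes storing it.

    heartbeat_reports: dict mapping a data node identifier to the block
    identifiers carried in that node's heartbeat report (assumed layout).
    """
    block_map = defaultdict(set)
    for node_id, block_ids in heartbeat_reports.items():
        for block_id in block_ids:
            # A block may be replicated, so one identifier can map to many nodes.
            block_map[block_id].add(node_id)
    # Sorted lists make the result deterministic and easy to compare.
    return {block_id: sorted(nodes) for block_id, nodes in block_map.items()}
```

Because the data nodes keep delayed blocks on disk, the reports sent after a recovery instruction still list those blocks, which is what lets the management node reconstruct the mapping.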

26. The apparatus according to any one of claims 22 to 25, characterized in that the apparatus further comprises:

a sixth receiving module, configured to receive an occupancy parameter sent by the data node, the occupancy parameter comprising a delayed-deletion storage space and a delayed-deletion percentage, the delayed-deletion storage space being the storage space occupied by the data blocks corresponding to all data block identifiers in the delay queue of the data node, and the delayed-deletion percentage being the percentage of the total storage space of the data node that is occupied by the delayed-deletion storage space, so that the client determines, after viewing the occupancy parameter, whether to issue to the data node a clear instruction for instructing that the data blocks corresponding to all data block identifiers in the delay queue of the data node be cleared.

27. An apparatus for deleting files in a distributed data warehouse, characterized in that the apparatus comprises:

28. The apparatus according to claim 27, characterized in that the apparatus further comprises:

a viewing module, configured to view the occupancy parameter sent by each data node to the management node, the occupancy parameter comprising a delayed-deletion storage space and a delayed-deletion percentage, the delayed-deletion storage space being the storage space occupied by the data blocks corresponding to all data block identifiers in the delay queue of the data node, and the delayed-deletion percentage being the percentage of the total storage space of the data node that is occupied by the delayed-deletion storage space;

a third determining module, configured to determine whether a clear instruction, for instructing that the data blocks corresponding to all data block identifiers in the delay queue of a data node be cleared, needs to be issued to at least one of the data nodes;

a first issuing module, configured to issue the clear instruction to the determined data node when the third determining module determines that the clear instruction needs to be issued to at least one of the data nodes.
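
Claim 28 leaves the decision criterion to whoever views the occupancy parameters. Purely as an illustration, the sketch below assumes one possible policy: issue the clear instruction to every data node whose delayed-deletion percentage reaches a chosen limit. The function name `nodes_to_clear` and the `max_percent` parameter are hypothetical and do not appear in the claims.

```python
def nodes_to_clear(occupancy_by_node, max_percent):
    """Return the data nodes that should receive a clear instruction.

    occupancy_by_node: dict mapping a data node identifier to its
    (delayed_deletion_bytes, delayed_deletion_percent) pair.
    max_percent: assumed policy limit; the claims do not fix any criterion.
    """
    return sorted(
        node_id
        for node_id, (_delayed_bytes, percent) in occupancy_by_node.items()
        if percent >= max_percent
    )
```

An operator could equally decide by absolute storage space or by manual inspection; the limit-based rule is only one way to turn the viewed parameters into a decision.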

29. The apparatus according to claim 27, characterized in that the apparatus further comprises:

a second issuing module, configured to issue to the data node a time configuration instruction carrying a specified duration, the time configuration instruction being used to dynamically configure the predetermined time threshold, so that the data node updates the predetermined time threshold to the specified duration according to the time configuration instruction.

30. The apparatus according to claim 27, characterized in that the apparatus further comprises:

a fourth sending module, configured to send to the management node a file recovery instruction for instructing recovery of the specified file, so that the management node, after receiving the file recovery instruction, recovers a qualifying first correspondence, the qualifying first correspondence being a first correspondence that was backed up before the delete instruction was sent and whose backup time is closest to the time at which the delete instruction was sent, the first correspondence comprising the specified file and the data block identifiers of the data blocks that the specified file comprises; and recovers a second correspondence, the second correspondence being the mapping from the data block identifiers comprised in the specified file to the data nodes.

31. A data node, characterized in that the data node comprises the apparatus for deleting files in a distributed data warehouse according to any one of claims 16 to 21.

32. A management node, characterized in that the management node comprises the apparatus for deleting files in a distributed data warehouse according to any one of claims 22 to 26.

33. A client, characterized in that the client comprises the apparatus for deleting files in a distributed data warehouse according to any one of claims 27 to 30.

34. A system for deleting files in a distributed data warehouse, characterized in that the system comprises a client, a management node and at least one data node;

the client comprises the apparatus for deleting files in a distributed data warehouse according to any one of claims 27 to 30;

the management node comprises the apparatus for deleting files in a distributed data warehouse according to any one of claims 22 to 26;

the data node comprises the apparatus for deleting files in a distributed data warehouse according to any one of claims 16 to 21.