CN110188076A

CN110188076A - A method for the concurrent, high-speed complete deletion of data in the Hadoop file system

Info

Publication number: CN110188076A
Application number: CN201910448716.0A
Authority: CN
Inventors: 蔡剑怀; 成林
Original assignee: Xiamen Digital Certificate Technology Co Ltd; China Information Technology Security Evaluation Center
Current assignee: Xiamen Digital Certificate Technology Co Ltd; China Information Technology Security Evaluation Center
Priority date: 2019-05-28
Filing date: 2019-05-28
Publication date: 2019-08-30
Anticipated expiration: 2039-05-28
Also published as: CN110188076B

Abstract

The invention discloses a method for the concurrent, high-speed complete deletion of data in the Hadoop file system. The method completely deletes Hadoop file system data, removes the data quickly, effectively, and thoroughly, safeguards the user's data security, effectively prevents data leakage, and avoids the harm and economic loss that such leakage causes.

Description

A method for the concurrent, high-speed complete deletion of data in the Hadoop file system

Technical field

The present invention relates to the field of data security, and more particularly to a method for the concurrent, high-speed complete deletion of data in the Hadoop file system.

Background technique

Data security is the main problem facing cloud computing, so research on data security has become one of the key research directions in cloud computing. Hadoop is a widely used data processing platform in cloud computing, but at run time it provides no effective protection for data confidentiality, in particular for deleted data, and offers no safe and effective data deletion mechanism. Data leakage may occur for the following reasons:

1) If a DataNode fails, the data stored on that DataNode cannot be thoroughly protected, which leads to the security problem of data leakage.

2) Because of the recycle bin mechanism of the Hadoop file system, a file deleted by the user is not completely erased but is moved to the recycle bin. Files in the recycle bin can therefore still be retrieved.

3) The Hadoop file system is deployed on top of the Linux file system. A file delete operation in the Linux file system does not thoroughly destroy the deleted data on disk; until new content is written to the disk, the residual data remains stored there. The data in HDFS is therefore not actually destroyed, and the file contents can probably be recovered using data recovery techniques.

Given the data leakage that the above reasons can cause, a malicious user or cloud service provider can use common techniques to recover data that the user has deleted, damaging the user's interests or causing even greater harm. To address the problem that data in the Hadoop file system cannot be thoroughly destroyed, the present invention therefore provides a method for the concurrent, high-speed complete deletion of data in the Hadoop file system.

Summary of the invention

The object of the present invention is to solve the above problems by providing a method for the concurrent, high-speed complete deletion of data in the Hadoop file system that deletes data thoroughly and offers a high degree of security.

The present invention achieves the above object through the following technical solution:

The present invention comprises a user client and the name node and data nodes of Hadoop. The Hadoop name node responds to the service request of the user client and, after processing it, forwards it to the service programs in the data nodes.

a. The client selects the name of the file to be completely deleted and the deletion level, and sends the service request to the Hadoop name node;

b. After receiving the service request sent by the user client, the Hadoop name node locks the task list for exclusive access and queries it. If the deletion request sent by the user client is not in the task list, the task is added to the task list and the task list lock is then released. If the deletion request is already in the task list, the name node informs the user client that the task is already in the task list and then releases the task list lock;

c. The service program in the Hadoop name node packages the file name and the data block names into a data stream and sends it to all data nodes that store data blocks to be deleted;

d. The service program in each data node responds to the service request of the Hadoop name node and obtains from the request the names of the data blocks to be deleted and the deletion level;

e. After obtaining the data block names and the deletion level, the data node locks its task list for exclusive access and queries it. If a data block to be deleted is not in the task list, the data block is added to the task list and the task list lock is then released. If the data block is already in the task list, the data node informs the service program in the name node that the task is already in the task list and then releases the task list lock;

f. The service program in the data node traverses the task list and performs overwrite operations on the data blocks, and the data node sends the number of deleted data blocks to the Hadoop name node in real time;

g. After the complete deletion operation has finished, the service program in the Hadoop name node calls the Hadoop API to delete the file, so that the file is removed from the Hadoop file directory;

h. The service program in the Hadoop name node aggregates the deletion progress sent by the service programs in all data nodes that have deletion tasks, calculates the overall deletion progress of the file, and sends it to the user client in real time.
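The lock-then-query handling of the task list in steps b and e can be illustrated with a minimal Python sketch. The class name `TaskList`, the method `add_if_absent`, and the helper `submit_concurrently` are hypothetical names chosen for illustration; the patent does not specify an implementation language or data structure, so this is only one possible reading of the described behavior.

```python
import threading


class TaskList:
    """Lock-protected collection of pending deletion tasks (hypothetical helper)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._tasks = {}  # task key (file or block name) -> deletion level

    def add_if_absent(self, key, level):
        """Return True if the task was added, False if it was already queued."""
        with self._lock:  # exclusive access; the lock is released on exit
            if key in self._tasks:
                return False  # caller reports "task already in the task list"
            self._tasks[key] = level
            return True

    def remove(self, key):
        """Drop a finished task from the list."""
        with self._lock:
            self._tasks.pop(key, None)

    def __len__(self):
        with self._lock:
            return len(self._tasks)


def submit_concurrently(task_list, key, level, n_threads=8):
    """Submit the same request from several threads and collect each outcome."""
    results = []
    results_lock = threading.Lock()

    def worker():
        added = task_list.add_if_absent(key, level)
        with results_lock:
            results.append(added)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because the membership check and the insertion happen under one lock, concurrent duplicate requests for the same file produce exactly one queued task; every other caller receives the "already in the task list" answer.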

Specifically, after the complete deletion task is added to the task list, the Hadoop name node obtains the file information according to the file name of the task.

Preferably, the file information mainly includes the file size, the number of file replicas, the names of the file's data blocks, the number of the file's data blocks, and the IP addresses of the data nodes that store the file's data blocks.

Specifically, the service program in the data node looks up the storage path of the data according to the name of the data.

Preferably, the data node determines the number of overwrites of a data block from the deletion level.

Specifically, after the service program in the data node completes one overwrite operation on each data block, it checks the deletion level and the number of overwrites already performed, and judges whether the required number of overwrites for that data block has been reached.

Preferably, if the service program in the data node has completed the required number of overwrites for the data block, it removes the data block from the task list and then continues traversing the task list, overwriting the next data block in the list; if it has not completed the required number of overwrites for the data block, it continues traversing the task list and overwrites the next data block in the list.

Specifically, each time the service program in the data node completes one one-way traversal of the task list, it performs one disk flush operation, so that the preceding overwrite operations on the data blocks are committed and carried out on disk.

The beneficial effects of the present invention are:

Deploying and maintaining the invention does not affect a running Hadoop cluster and requires no modification of the existing cluster; deployment is convenient and maintenance is simple, avoiding the economic loss and risk of redeployment. In addition, the method can respond to the deletion requests of a large number of users in the cluster at the same time and handle them properly, keeping data information synchronized, improving processing efficiency, and achieving highly concurrent processing. Furthermore, during data destruction the service program in the data node parses the storage information of the data on disk and uses disk scheduling to perform the destruction operations sequentially, which greatly increases the speed of data destruction and achieves high-speed destruction of data.

The method proposed by the present invention completely deletes Hadoop file system data, removes the data quickly, effectively, and thoroughly, safeguards the user's data security, effectively prevents data leakage, and avoids the harm and economic loss that such leakage causes.

Description of the drawings

Fig. 1 is the system architecture diagram of the method for the concurrent, high-speed complete deletion of Hadoop file system data in an embodiment of the present invention.

Fig. 2 is the implementation flow chart of HCCDClient in the method for the concurrent, high-speed complete deletion of Hadoop file system data in an embodiment of the present invention.

Fig. 3 is the implementation flow chart of HCCDNServer in the method for the concurrent, high-speed complete deletion of Hadoop file system data in an embodiment of the present invention.

Fig. 4 is the implementation flow chart of HCCDDServer in the method for the concurrent, high-speed complete deletion of Hadoop file system data in an embodiment of the present invention.

Specific embodiment

The present invention will be further explained below with reference to the attached drawings:

As shown in Fig. 1, the present invention comprises a user client and the name node and data nodes of Hadoop. The Hadoop name node responds to the service request of the user client and, after processing it, forwards it to the service programs in the data nodes.

a. The client selects the name of the file to be completely deleted and the deletion level, and sends the service request to the Hadoop name node.

e. After obtaining the data block names and the deletion level, the data node locks its task list for exclusive access and queries it. If a data block to be deleted is not in the task list, the data block is added to the task list and the task list lock is then released. If the data block is already in the task list, the data node informs the service program in the name node that the task is already in the task list and then releases the task list lock.

Fig. 1 shows the system architecture diagram of an embodiment of the present invention.

The invention mainly comprises three service programs: HCCDClient, which runs on the user's computer; HCCDNServer, which runs on the NameNode; and HCCDDServer, which runs on every DataNode.

HCCDClient provides a graphical interface through which the user can browse directory and file information and perform complete deletion operations.

HCCDNServer is the core service program of the invention. It receives the delete commands from each HCCDClient, processes the deletion requests of the HCCDClients concurrently, obtains the data information from the NameNode, and sends the deletion information to HCCDDServer; at the same time it reads the deletion progress and feeds it back to HCCDClient.

HCCDDServer is the program that finally executes the complete deletion operation. It receives the complete delete command from HCCDNServer and feeds the complete deletion progress back to HCCDNServer.

The system operates as follows:

HCCDClient, HCCDNServer, and HCCDDServer are started and complete their initialization.

The user browses the file directory through HCCDClient and selects the file to be completely deleted and the deletion level; HCCDClient packages this deletion information as a data stream and sends it to HCCDNServer.

HCCDNServer accesses the name node according to the deletion information, obtains the file information, and then packages the deletion information as a data stream and sends it to the corresponding HCCDDServer.

HCCDDServer parses the data stream sent by HCCDNServer and performs the overwrite operations on the data blocks.

HCCDDServer sends the data block deletion progress (the number of deleted data blocks) to HCCDNServer every 10 ms. HCCDNServer calculates the file deletion progress (a percentage) from the data block deletion progress of each HCCDDServer, and sends the file deletion progress to HCCDClient every 10 ms.

After HCCDDServer has finished the complete deletion operation, HCCDNServer calls the Hadoop API to delete the file, so that the file is removed from the Hadoop file directory.

Fig. 2 shows the implementation flow chart of HCCDClient in an embodiment of the present invention.

The detailed operation flow of HCCDClient is as follows:

HCCDClient starts, reads the IP address and communication port of the name node and the thread pool configuration from the configuration file, and completes its initialization.

The user browses the file directory through HCCDClient and selects the file to be completely deleted.

The user selects the complete deletion level.

The complete deletion levels are defined according to different deletion standards. There are three levels, WEAK, STRONG, and THOROUGH: WEAK corresponds to the DoD 5220.22-M simple overwrite standard, STRONG corresponds to the DoD 5220.22-M 7-pass erasure standard, and THOROUGH corresponds to the Gutmann [17] standard.
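The mapping from deletion level to overwrite count might be represented as follows. This is a sketch under stated assumptions: the text gives 7 passes for STRONG; the 35 passes for THOROUGH reflect the commonly described Gutmann method; the 3 passes for WEAK are an assumption, since the text only calls it the "simple overwrite" standard. The names `DeletionLevel`, `OVERWRITE_PASSES`, and `passes_for` are hypothetical.

```python
from enum import Enum


class DeletionLevel(Enum):
    """The three complete deletion levels named in the description."""
    WEAK = "WEAK"
    STRONG = "STRONG"
    THOROUGH = "THOROUGH"


# Overwrite passes per level.
# STRONG = 7 is stated in the text; THOROUGH = 35 follows the Gutmann method;
# WEAK = 3 is an ASSUMPTION (the text does not give a pass count for it).
OVERWRITE_PASSES = {
    DeletionLevel.WEAK: 3,
    DeletionLevel.STRONG: 7,
    DeletionLevel.THOROUGH: 35,
}


def passes_for(level_name):
    """Return the overwrite count for a level name such as "STRONG".

    Raises ValueError for an unknown level name.
    """
    return OVERWRITE_PASSES[DeletionLevel(level_name)]
```

Keeping the pass counts in one table means the data node only needs the level name from the request to decide how many times each data block must be overwritten.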

The name of the file to be completely deleted and the deletion level are packaged as a data stream.

HCCDClient establishes communication with HCCDNServer.

HCCDClient sends the deletion data stream to HCCDNServer.

HCCDClient receives the data stream sent by HCCDNServer.

HCCDClient parses the data stream to obtain the file name and the corresponding file deletion progress.
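The patent does not define the wire format of the data stream exchanged between HCCDClient and HCCDNServer. As one possible illustration, the sketch below frames a JSON payload with a 4-byte big-endian length prefix; the format, the field names `file` and `level`, and the functions `pack_request` and `unpack_request` are all assumptions made for this example.

```python
import json
import struct


def pack_request(file_name, level):
    """Encode a deletion request as a length-prefixed JSON byte stream."""
    payload = json.dumps({"file": file_name, "level": level}).encode("utf-8")
    # 4-byte big-endian unsigned length, followed by the payload itself.
    return struct.pack(">I", len(payload)) + payload


def unpack_request(stream):
    """Decode a byte stream produced by pack_request into (file_name, level)."""
    (length,) = struct.unpack(">I", stream[:4])
    body = json.loads(stream[4:4 + length].decode("utf-8"))
    return body["file"], body["level"]
```

A length prefix lets the receiver know where one request ends on a stream socket, so several requests can share a single connection.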

Fig. 3 shows the implementation flow chart of HCCDNServer in an embodiment of the present invention.

The detailed operation flow of HCCDNServer is as follows:

HCCDNServer starts, reads the configuration file, and completes initialization.

HCCDNServer listens on the communication port shared with HCCDClient and monitors for complete deletion requests from any HCCDClient.

HCCDNServer receives the data stream from HCCDClient and parses out the file name and the deletion level.

HCCDNServer locks the task list for exclusive access.

HCCDNServer queries the task list to determine whether the file is in it.

If the file is in the task list, HCCDNServer informs HCCDClient that the file is already in the deletion task list.

If the file is not in the task list, HCCDNServer adds the task to the task list.

HCCDNServer releases the task list lock. This keeps the data synchronized and solves the problem of concurrent access.

HCCDNServer queries the NameNode for the file's data block information according to the file name. This mainly includes the number of data blocks, the number of replicas of each data block, the IP addresses of the DataNodes where the data blocks reside, and the names of the data blocks held at each IP address.

HCCDNServer sends the deletion progress to HCCDClient every 10 ms.

HCCDNServer establishes communication with all DataNodes that store data blocks to be deleted.

HCCDNServer packages the corresponding data block names and the deletion level as a data stream and sends it to the HCCDDServer running on each such DataNode.

HCCDNServer receives the data streams sent by the HCCDDServers of all DataNodes that store data blocks to be deleted. Each data stream mainly contains the file name and the number of deleted data blocks.

From the data block deletion progress (the number of deleted data blocks) fed back by each HDFS DataNode, HCCDNServer calculates the file deletion progress (a percentage) to be fed back to HCCDClient.

HCCDNServer packages the file name and the corresponding file deletion progress (percentage) as a data stream and sends it to HCCDClient.
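The text does not give the formula for the file deletion percentage. A plausible sketch, assuming that the total work equals the number of data blocks multiplied by the number of replicas and that each DataNode reports a running count of deleted block replicas, is shown below; the function name `file_progress` and the example IP addresses are hypothetical.

```python
def file_progress(deleted_per_datanode, block_count, replica_count):
    """Return the overall deletion progress of one file as a percentage.

    deleted_per_datanode maps a DataNode address to the number of block
    replicas that node has reported as deleted so far.
    ASSUMPTION: total work = block_count * replica_count.
    """
    total = block_count * replica_count
    if total == 0:
        return 100.0  # nothing to delete
    deleted = sum(deleted_per_datanode.values())
    # Clamp so that late or duplicate reports never exceed 100 percent.
    return min(100.0, 100.0 * deleted / total)
```

For example, a file with 4 data blocks and 3 replicas has 12 block replicas in total; if three DataNodes have reported 2, 1, and 3 deleted replicas, `file_progress({"10.0.0.1": 2, "10.0.0.2": 1, "10.0.0.3": 3}, 4, 3)` yields 50.0.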

Fig. 4 shows the implementation flow chart of HCCDDServer in an embodiment of the present invention.

The detailed operation flow of HCCDDServer is as follows:

After starting, HCCDDServer invokes the Linux console command "whereis Hadoop" to obtain the installation location of Hadoop.

From the installation location of Hadoop, it finds Hadoop's hdfs-site.xml configuration file. This configuration file stores information such as the data block size, the number of replicas, and the data block storage location.
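Reading these settings from hdfs-site.xml could look like the sketch below. The property names `dfs.blocksize`, `dfs.replication`, and `dfs.datanode.data.dir` are standard HDFS configuration keys, but the patent does not say which keys HCCDDServer actually reads, so their use here is an assumption; the function name and the sample values are hypothetical.

```python
import xml.etree.ElementTree as ET


def read_hdfs_properties(xml_text):
    """Parse hdfs-site.xml content into a {name: value} dict of strings."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            props[name.strip()] = (value or "").strip()
    return props


# Hypothetical sample configuration used only for illustration.
SAMPLE_HDFS_SITE = """<configuration>
  <property>
    <name>dfs.blocksize</name>
    <value>134217728</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/data/hdfs/dn</value>
  </property>
</configuration>"""
```

Values are returned as strings because real configurations may use forms such as a size suffix for the block size or a comma-separated list of storage directories; a property that is absent from the file falls back to Hadoop's built-in default, which this sketch does not resolve.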

According to the size information in the configuration file, HCCDDServer initializes three static character arrays in memory, filled respectively with the character '0', random numbers, and the character 'F'.
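A minimal sketch of preparing the three overwrite patterns once, so that they can be reused for every data block, follows. It takes the text literally and fills the buffers with the ASCII characters '0' and 'F'; whether the original intends these characters or the byte values 0x00 and 0xFF is not clear from the text, so that choice is an assumption, as is the function name `make_pattern_buffers`.

```python
import os


def make_pattern_buffers(block_size):
    """Build the three overwrite patterns, each block_size bytes long.

    ASSUMPTION: '0' and 'F' are taken as ASCII characters, as written in
    the description; random data comes from the OS random source.
    """
    zeros = b"0" * block_size
    randoms = os.urandom(block_size)
    effs = b"F" * block_size
    return zeros, randoms, effs
```

Allocating the buffers once at start-up avoids regenerating a pattern of block size for every overwrite pass.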

HCCDDServer listens on the communication port shared with HCCDNServer and waits for delete command information.

HCCDDServer detects a communication request from HCCDNServer and establishes communication with it.

HCCDDServer receives the data stream sent by HCCDNServer.

HCCDDServer parses the data stream to obtain the name of the file to be completely deleted, the data block names, and the deletion level.

HCCDDServer locks the task list for exclusive access.

HCCDDServer queries the task list by the received data block name to determine whether the task is already in it.

If the task is already in the task list, no deletion task is created.

If the task is not in the task list, it is added to the task list and the following process is executed.

HCCDDServer releases the task list lock. This keeps the data synchronized and solves the problem of concurrent access.

HCCDDServer sends the deletion progress to HCCDNServer every 10 ms.

Using the data block name of the received task, HCCDDServer finds the path under which the data block is stored on the storage machine.
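One straightforward way to locate a block file is to walk the DataNode storage directory and match the file name. HDFS stores each block replica as a file named `blk_<id>` in nested subdirectories of the configured data directory; the function name `find_block_path` and the linear search strategy are assumptions, since the patent does not describe how the lookup is performed.

```python
import os


def find_block_path(data_dir, block_name):
    """Return the full path of the block file named block_name, or None.

    Walks data_dir recursively; block_name is expected to look like
    "blk_1073741825" (the accompanying .meta file has a different name
    and is therefore not matched).
    """
    for dirpath, _dirnames, filenames in os.walk(data_dir):
        if block_name in filenames:
            return os.path.join(dirpath, block_name)
    return None
```

On a DataNode holding many blocks, a full directory walk per lookup is slow; an implementation aiming at high speed would more likely build an index of block names once and reuse it.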

HCCDDServer traverses the task list and performs overwrite operations on the data blocks.

The number of overwrites for a data block is determined by the deletion level.

After HCCDDServer completes one overwrite operation on each data block, it checks the deletion level and the number of overwrites already performed, and judges whether the required number of overwrites for that data block has been reached.

If HCCDDServer has completed the required number of overwrites for the data block, it removes the data block from the task list and then continues traversing the task list, overwriting the next data block in the list.

If HCCDDServer has not completed the required number of overwrites for the data block, it continues traversing the task list and overwrites the next data block in the list.

Each time HCCDDServer completes one one-way traversal of the task list, it performs one disk flush operation, so that the preceding overwrite operations on the data blocks are committed and carried out on disk. It then begins the next traversal of the task list and continues overwriting the data blocks.
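The traversal described above (one overwrite per data block per pass over the task list, one flush to disk per traversal, and removal of a block once its pass count is reached) can be sketched as follows. The function name `overwrite_round_robin`, the use of `os.fsync` as the flush operation, and the cycling through the patterns in order are assumptions for illustration; the patent does not specify them.

```python
import os


def overwrite_round_robin(tasks, patterns):
    """Overwrite block files in place, one pass per traversal of the task list.

    tasks    maps a block file path to its required number of overwrite passes.
    patterns is a sequence of non-empty byte strings, cycled per pass.
    Returns a dict mapping each path to the number of passes performed.
    """
    if not patterns or any(len(p) == 0 for p in patterns):
        raise ValueError("patterns must be non-empty byte strings")
    completed = {path: 0 for path in tasks}
    pending = [path for path in tasks if tasks[path] > 0]
    while pending:
        open_files = []
        try:
            for path in pending:
                size = os.path.getsize(path)
                pattern = patterns[completed[path] % len(patterns)]
                f = open(path, "r+b")  # in-place write, no truncation
                open_files.append(f)
                written = 0
                while written < size:
                    chunk = pattern[: size - written]
                    f.write(chunk)
                    written += len(chunk)
                f.flush()  # hand the data to the operating system
                completed[path] += 1
            # One flush to disk per traversal of the task list.
            for f in open_files:
                os.fsync(f.fileno())
        finally:
            for f in open_files:
                f.close()
        # Blocks that reached their pass count leave the task list.
        pending = [p for p in pending if completed[p] < tasks[p]]
    return completed
```

Deferring the sync to the end of each traversal lets the operating system batch and order the writes of many blocks, which is one way to read the "disk scheduling" speed-up claimed in the description. In-place overwriting of this kind may not destroy every physical copy on journaling or copy-on-write file systems or on flash storage.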

The above are merely preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A method for the concurrent, high-speed complete deletion of data in the Hadoop file system, characterized by comprising a user client and the name node and data nodes of Hadoop, wherein the Hadoop name node responds to the service request of the user client and, after processing it, forwards it to the service programs in the data nodes;

a. the client selects the name of the file to be completely deleted and the deletion level, and sends the service request to the Hadoop name node;

b. after receiving the service request sent by the user client, the Hadoop name node locks the task list for exclusive access and queries it; if the deletion request sent by the user client is not in the task list, the task is added to the task list and the task list lock is then released; if the deletion request is already in the task list, the name node informs the user client that the task is already in the task list and then releases the task list lock;

c. the service program in the Hadoop name node packages the file name and the data block names into a data stream and sends it to all data nodes that store data blocks to be deleted;

d. the service program in each data node responds to the service request of the Hadoop name node and obtains from the request the names of the data blocks to be deleted and the deletion level;

e. after obtaining the data block names and the deletion level, the data node locks its task list for exclusive access and queries it; if a data block to be deleted is not in the task list, the data block is added to the task list and the task list lock is then released; if the data block is already in the task list, the data node informs the service program in the name node that the task is already in the task list and then releases the task list lock;

f. the service program in the data node traverses the task list and performs overwrite operations on the data blocks, and the data node sends the number of deleted data blocks to the Hadoop name node in real time;

g. after the complete deletion operation has finished, the service program in the Hadoop name node calls the Hadoop API to delete the file, so that the file is removed from the Hadoop file directory;

h. the service program in the Hadoop name node aggregates the deletion progress sent by the service programs in all data nodes that have deletion tasks, calculates the overall deletion progress of the file, and sends it to the user client in real time.

2. The method for the concurrent, high-speed complete deletion of data in the Hadoop file system according to claim 1, characterized in that: after the complete deletion task is added to the task list, the Hadoop name node obtains the file information according to the file name of the task.

3. The method for the concurrent, high-speed complete deletion of data in the Hadoop file system according to claim 2, characterized in that: the file information mainly includes the file size, the number of file replicas, the names of the file's data blocks, the number of the file's data blocks, and the IP addresses of the data nodes that store the file's data blocks.

4. The method for the concurrent, high-speed complete deletion of data in the Hadoop file system according to claim 1, characterized in that: the service program in the data node looks up the storage path of the data according to the name of the data.

5. The method for the concurrent, high-speed complete deletion of data in the Hadoop file system according to claim 1, characterized in that: the data node determines the number of overwrites of a data block from the deletion level.

6. The method for the concurrent, high-speed complete deletion of data in the Hadoop file system according to claim 1, characterized in that: after the service program in the data node completes one overwrite operation on each data block, it checks the deletion level and the number of overwrites already performed, and judges whether the required number of overwrites for that data block has been reached.

7. The method for the concurrent, high-speed complete deletion of data in the Hadoop file system according to claim 1, characterized in that: if the service program in the data node has completed the required number of overwrites for the data block, it removes the data block from the task list and then continues traversing the task list, overwriting the next data block in the list; if the service program in the data node has not completed the required number of overwrites for the data block, it continues traversing the task list and overwrites the next data block in the list.

8. The method for the concurrent, high-speed complete deletion of data in the Hadoop file system according to claim 1, characterized in that: each time the service program in the data node completes one one-way traversal of the task list, it performs one disk flush operation, so that the preceding overwrite operations on the data blocks are committed and carried out on disk.