CN112540968A - Garbage recycling method and device of HDFS - Google Patents

Garbage recycling method and device of HDFS

Info

Publication number
CN112540968A
Authority
CN
China
Prior art keywords
file
deleted
instruction
deleting
obtaining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011437627.5A
Other languages
Chinese (zh)
Inventor
崔丽珺
杨全文
张洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Citic Bank Corp Ltd
Original Assignee
China Citic Bank Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Citic Bank Corp Ltd
Priority to CN202011437627.5A
Publication of CN112540968A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/16 File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • G06F16/162 Delete operations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/174 Redundancy elimination performed by the file system
    • G06F16/1748 De-duplication implemented within the file system, e.g. based on file segments
    • G06F16/1752 De-duplication implemented within the file system, e.g. based on file segments based on file chunks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a garbage recycling method and device for HDFS (Hadoop Distributed File System). The method comprises the following steps: obtaining a first deletion instruction; obtaining, according to the first deletion instruction, size information of a first file to be deleted; obtaining a predetermined file size threshold; judging whether the size of the first file to be deleted is larger than the predetermined file size threshold; if the size of the first file to be deleted is not larger than the predetermined file size threshold, obtaining a first execution instruction; and executing the first deletion instruction according to the first execution instruction, thereby deleting the first file to be deleted. The method addresses the technical problems in the prior art that a single node must delete a large amount of data and many nodes are involved in a deletion operation, which degrades NameNode processing performance and, in turn, cluster performance.

Description

Garbage recycling method and device of HDFS
Technical Field
The invention relates to the technical field of big data, in particular to a garbage recycling method and device of an HDFS.
Background
As services grow, data volume increases exponentially, and so does the need to delete a large number of files at one time, for example when disks are gradually filling up and data must be cleaned up in bulk, or when the cluster is unstable and files must be released in bulk. When the garbage collection (trash) mechanism is enabled, the default deletion strategy of HDFS is to rename the file to be deleted, moving it into the trash directory, and to delete it automatically once the configured retention time has elapsed. The prior art includes methods that reconstruct the file system directory structure from logs to recover lost files, thereby recovering files deleted from HDFS, as well as methods, devices, and storage media that improve the security of HDFS files by controlling access permissions.
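The default trash behaviour described above can be illustrated with a small in-memory model. This is a sketch only: the class ToyTrash, its method names, the "/.Trash" prefix, and the dictionary used as a namespace are hypothetical and do not reflect HDFS internals. In a real cluster the retention time is set through the fs.trash.interval property.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTrash:
    """Toy model of delete-with-trash: rename now, purge after retention."""
    retention_s: float                           # hypothetical retention time, seconds
    files: dict = field(default_factory=dict)    # path -> size in bytes
    trashed: dict = field(default_factory=dict)  # trash path -> (size, deleted_at)

    def delete(self, path: str, now: float) -> str:
        """Move `path` into the trash directory instead of removing it."""
        size = self.files.pop(path)
        trash_path = "/.Trash" + path
        self.trashed[trash_path] = (size, now)
        return trash_path

    def purge_expired(self, now: float) -> list:
        """Permanently remove entries whose retention time has elapsed."""
        expired = [p for p, (_, t) in self.trashed.items()
                   if now - t >= self.retention_s]
        for p in expired:
            del self.trashed[p]
        return expired


fs = ToyTrash(retention_s=3600, files={"/data/a.log": 10, "/data/b.log": 20})
fs.delete("/data/a.log", now=0)
print(fs.purge_expired(now=1800))   # [] (still within the retention time)
print(fs.purge_expired(now=3600))   # ['/.Trash/data/a.log']
```

In this model every entry trashed at about the same time also expires at about the same time, which is the kind of burst the method described in this document sets out to smooth.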
In the process of implementing the technical scheme of the invention in the embodiment of the present application, the inventor of the present application finds that the above-mentioned technology has at least the following technical problems:
a single node must delete a large amount of data and many nodes are involved in each deletion operation, so DataNode service performance degrades, NameNode processing performance degrades, and cluster performance degrades as a result.
Disclosure of Invention
The embodiments of the present application provide a garbage recycling method and device for HDFS (Hadoop Distributed File System). They solve the technical problems in the prior art that a single node must delete a large amount of data and many nodes are involved in a deletion operation, which degrades NameNode processing performance and, in turn, cluster performance. By deleting small files in batches, they reduce the number of nodes executing operations, improve DataNode service performance, improve NameNode processing performance, and thereby improve cluster performance.
The embodiment of the application provides a garbage recycling method of an HDFS (Hadoop distributed File System), wherein the method comprises the following steps: obtaining a first deleting instruction; according to the first deleting instruction, obtaining size information of a first file to be deleted; obtaining a predetermined file size threshold; judging whether the size information of the first file to be deleted is larger than the preset file size threshold value or not; if the size information of the first file to be deleted is not larger than the preset file size threshold value, a first execution instruction is obtained; and executing the first deleting instruction according to the first executing instruction, and deleting the first file to be deleted.
In another aspect, the present application further provides a garbage recycling device for HDFS, wherein the device includes: a first obtaining unit, configured to obtain a first deletion instruction; a second obtaining unit, configured to obtain size information of a first file to be deleted according to the first deletion instruction; a third obtaining unit, configured to obtain a predetermined file size threshold; a first judging unit, configured to judge whether the size of the first file to be deleted is larger than the predetermined file size threshold; a fourth obtaining unit, configured to obtain a first execution instruction if the size of the first file to be deleted is not larger than the predetermined file size threshold; and a first execution unit, configured to execute the first deletion instruction according to the first execution instruction, thereby deleting the first file to be deleted.
On the other hand, an embodiment of the present application further provides a garbage collection apparatus for an HDFS, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the steps of the method according to the first aspect when executing the program.
One or more technical solutions provided in the embodiments of the present application have at least the following technical effects or advantages:
because, through reasonable parameter configuration, the trash's one-time deletion of a large number of files is converted into batched deletions of small files within the load-bearing capacity of the cluster, fewer blocks are added to the InvalidateBlocks set, the InvalidateBlocks set is correspondingly smaller, and fewer DataNodes need to receive and execute deletion operations through the heartbeat mechanism. The probability that blocks to be deleted are still unprocessed before the next BlockReport is reduced, and so is the probability that invalid blocks affect cluster performance. With fewer blocks to delete per node, the node's processing time and lock-holding time are correspondingly shortened, which helps improve NameNode processing performance. By deleting small files in batches, the method reduces the number of nodes executing operations, improves DataNode service performance, improves NameNode processing performance, and thereby improves cluster performance.
The foregoing is a summary of the present disclosure, and embodiments of the present disclosure are described below to make the technical means of the present disclosure more clearly understood.
Drawings
Fig. 1 is a schematic flow chart of a garbage recycling method of an HDFS according to an embodiment of the present disclosure;
fig. 2 is a schematic structural diagram of a garbage recycling device of an HDFS according to an embodiment of the present disclosure;
fig. 3 is a schematic structural diagram of an exemplary electronic device according to an embodiment of the present application.
Description of reference numerals: a first obtaining unit 11, a second obtaining unit 12, a third obtaining unit 13, a first judging unit 14, a fourth obtaining unit 15, a first executing unit 16, a bus 300, a receiver 301, a processor 302, a transmitter 303, a memory 304, and a bus interface 305.
Detailed Description
The embodiments of the present application provide a garbage recycling method and device for HDFS (Hadoop Distributed File System). They solve the technical problems in the prior art that a single node must delete a large amount of data and many nodes are involved in a deletion operation, which degrades NameNode processing performance and, in turn, cluster performance. By deleting small files in batches, they reduce the number of nodes executing operations, improve DataNode service performance, improve NameNode processing performance, and thereby improve cluster performance. Hereinafter, example embodiments of the present application are described in detail with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present application, and it should be understood that the present application is not limited to the example embodiments described herein.
Summary of the application
As services grow, data volume increases exponentially, and so does the need to delete a large number of files at one time, for example when disks are gradually filling up and data must be cleaned up in bulk, or when the cluster is unstable and files must be released in bulk. When the garbage collection (trash) mechanism is enabled, the default deletion strategy of HDFS is to rename the file to be deleted, moving it into the trash directory, and to delete it automatically once the configured retention time has elapsed. The prior art includes methods that reconstruct the file system directory structure from logs to recover lost files, thereby recovering files deleted from HDFS, as well as methods, devices, and storage media that improve the security of HDFS files by controlling access permissions. However, the prior art still has the technical problems that a single node must delete a large amount of data and many nodes are involved in each deletion operation, so DataNode service performance, NameNode processing performance, and cluster performance all degrade.
In view of the above technical problems, the technical solution provided by the present application has the following general idea:
the embodiment of the application provides a garbage recycling method of an HDFS (Hadoop distributed File System), wherein the method comprises the following steps: obtaining a first deleting instruction; according to the first deleting instruction, obtaining size information of a first file to be deleted; obtaining a predetermined file size threshold; judging whether the size information of the first file to be deleted is larger than the preset file size threshold value or not; if the size information of the first file to be deleted is not larger than the preset file size threshold value, a first execution instruction is obtained; and executing the first deleting instruction according to the first executing instruction, and deleting the first file to be deleted.
Having thus described the general principles of the present application, various non-limiting embodiments thereof will now be described in detail with reference to the accompanying drawings.
Example one
As shown in fig. 1, an embodiment of the present application provides a method for recycling garbage of an HDFS, where the method includes:
Step S100: obtaining a first deletion instruction;
Specifically, an HDFS cluster has two roles, NameNode and DataNode. The NameNode manages the metadata of the whole file system, and the DataNodes manage the users' file data blocks. To write data to HDFS, the client first communicates with the NameNode to confirm that the file can be written and to obtain the DataNodes that will receive the file blocks. After the client sends the first deletion instruction, the NameNode obtains the blocks of the data to be deleted and adds them to the InvalidateBlocks set; the DataNodes then receive and execute the deletion operation through the heartbeat mechanism.
Step S200: obtaining, according to the first deletion instruction, size information of a first file to be deleted;
Step S300: obtaining a predetermined file size threshold;
Step S400: judging whether the size of the first file to be deleted is larger than the predetermined file size threshold;
Specifically, after the NameNode receives the first deletion instruction from the client, it obtains the information about the first file to be deleted transmitted by the client, including the file size, and determines whether that size exceeds the predetermined file size threshold. The predetermined file size threshold is a preset size for small files that is within the load-bearing capacity of the cluster.
Step S500: if the size of the first file to be deleted is not larger than the predetermined file size threshold, obtaining a first execution instruction;
Step S600: executing the first deletion instruction according to the first execution instruction, thereby deleting the first file to be deleted.
Specifically, each DataNode periodically reports the blocks it holds to the NameNode (through heartbeat reports). If the client determines that the size of the first file to be deleted is within the predetermined file size threshold, the NameNode, after receiving a heartbeat packet from a DataNode, sends that DataNode the information about the first file to be deleted, and the DataNode, on receiving this instruction, deletes the blocks of the first file to be deleted requested by the client.
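The hand-off between the InvalidateBlocks set and the heartbeat mechanism can be pictured with a toy model. The sketch below is an assumption-laden illustration, not NameNode code: the class ToyNameNode, its methods, and the per-heartbeat cap are hypothetical names introduced here.

```python
from collections import defaultdict, deque


class ToyNameNode:
    """Toy model: blocks queued for invalidation are handed out on heartbeats."""

    def __init__(self, per_heartbeat_limit: int):
        self.per_heartbeat_limit = per_heartbeat_limit   # hypothetical cap
        self.invalidate_blocks = defaultdict(deque)      # datanode -> block ids

    def delete_file(self, block_locations: dict) -> None:
        """Queue every replica of the file's blocks; nothing is deleted yet."""
        for block_id, datanodes in block_locations.items():
            for dn in datanodes:
                self.invalidate_blocks[dn].append(block_id)

    def heartbeat(self, datanode: str) -> list:
        """Return the next batch of blocks this DataNode should delete."""
        queue = self.invalidate_blocks[datanode]
        count = min(self.per_heartbeat_limit, len(queue))
        return [queue.popleft() for _ in range(count)]

    def pending(self) -> int:
        """Total number of block replicas still waiting to be deleted."""
        return sum(len(q) for q in self.invalidate_blocks.values())


nn = ToyNameNode(per_heartbeat_limit=2)
nn.delete_file({"blk_1": ["dn1", "dn2"], "blk_2": ["dn1"], "blk_3": ["dn1"]})
print(nn.heartbeat("dn1"))  # ['blk_1', 'blk_2']
print(nn.pending())         # 2
```

The larger a single delete request is, the more replicas sit in the pending queues and the more heartbeats are needed to drain them, which is why the method keeps each submitted deletion small.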
Further, after judging whether the size of the first file to be deleted is larger than the predetermined file size threshold, step S400 of this embodiment of the present application further includes:
Step S401: if the size of the first file to be deleted is larger than the predetermined file size threshold, obtaining a first judgment instruction;
Step S402: judging, according to the first judgment instruction, whether the first file to be deleted is a single file;
Step S403: if the first file to be deleted is a single file, obtaining a second deletion instruction;
Step S404: deleting the first file to be deleted according to the second deletion instruction.
Specifically, if the client determines that the size of the first file to be deleted is larger than the predetermined file size threshold and that the first file to be deleted is a single file, the client obtains the second deletion instruction, and the DataNodes delete the blocks of the first file to be deleted.
Further, after judging whether the first file to be deleted is a single file, step S402 of this embodiment of the present application further includes:
Step S4021: if the first file to be deleted is not a single file, obtaining a second judgment instruction;
Step S4022: judging, according to the second judgment instruction, whether the first file to be deleted contains a first file, wherein the first file is a single file whose size is larger than the predetermined file size threshold;
Step S4023: if the first file to be deleted contains the first file, obtaining a third deletion instruction;
Step S4024: deleting the first file according to the third deletion instruction.
Specifically, if the client determines that the first file to be deleted is not a single file, the client checks its contents one by one for a first file, that is, a single file whose size is larger than the predetermined file size threshold. If such a file exists, file integrity is sacrificed: the file is divided into groups using the threshold as the unit, the groups are submitted for deletion in sequence at a predetermined time interval, and the NameNode coordinates the cluster to delete them once the retention time is reached.
Further, after judging, according to the second judgment instruction, whether the first file to be deleted contains a single file whose size is larger than the predetermined file size threshold, step S4022 of this embodiment of the present application further includes:
Step S40221: if the first file to be deleted does not contain the first file, obtaining a first grouping principle;
Step S40222: dividing the first file to be deleted into N groups according to the first grouping principle, wherein N is an integer greater than 0;
Step S40223: obtaining first predetermined submission frequency information;
Step S40224: submitting the N groups of files to be deleted for deletion in sequence according to the first predetermined submission frequency information.
Specifically, if it is determined that the first file to be deleted does not contain the first file, file integrity is preserved and the first file to be deleted is divided into N groups according to a grouping formula (given in the original publication as image BDA0002829001780000091, which is not reproduced in this text). The groups are submitted for deletion in sequence according to the first predetermined submission frequency information, and the NameNode coordinates the cluster to delete the files once the retention time is reached.
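The grouping formula itself survives only as an image reference, so it cannot be reproduced here. As one plausible reading, and purely as an assumption, whole files can be packed greedily into groups whose combined size stays within the threshold. The function name group_files and the path-to-size mapping are hypothetical.

```python
def group_files(sizes: dict, threshold: int) -> list:
    """Pack whole files into groups whose total size stays within `threshold`.

    Assumes every file is individually no larger than `threshold`, which is
    the precondition of this branch (no oversized member file).
    """
    groups, current, current_size = [], [], 0
    for path, size in sizes.items():
        if size > threshold:
            raise ValueError(f"{path} exceeds the threshold; split it instead")
        # Close the current group when adding this file would overflow it.
        if current and current_size + size > threshold:
            groups.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        groups.append(current)
    return groups


files = {"/d/a": 40, "/d/b": 50, "/d/c": 30, "/d/d": 70}
print(group_files(files, threshold=100))
# [['/d/a', '/d/b'], ['/d/c', '/d/d']]
```

Here N is simply the length of the returned list. No file is split, so file integrity is preserved as the text requires.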
Further, before obtaining the first execution instruction, step S500 of this embodiment of the present application further includes:
Step S501: obtaining a third judgment instruction;
Step S502: judging, according to the third judgment instruction, whether the first file to be deleted is a single file;
Step S503: if the first file to be deleted is a single file, obtaining the first execution instruction;
Step S504: executing the first deletion instruction according to the first execution instruction, thereby deleting the first file to be deleted.
Specifically, if the first file to be deleted is determined to be a single file whose size is not larger than the predetermined threshold, the deletion command for the whole file is submitted directly to the NameNode, and the NameNode coordinates the cluster to delete the file once the retention time is reached.
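Taken together, the branches of steps S400 to S504 form a small decision tree, sketched below. The enum, the function name, and the representation of a delete request as a list of file sizes are hypothetical. The text only spells out the single-file case for requests within the threshold; treating a small multi-file request the same way is an assumption of this sketch.

```python
from enum import Enum


class Action(Enum):
    DELETE_DIRECTLY = "delete directly"                       # S500-S600
    SPLIT_SINGLE_FILE = "split the file into M groups"         # S403-S404
    SPLIT_OVERSIZED_MEMBERS = "split oversized member files"   # S4023-S4024
    GROUP_WHOLE_FILES = "group whole files into N groups"      # S40221-S40224


def choose_action(file_sizes: list, threshold: int) -> Action:
    """Pick the deletion strategy for one delete request.

    `file_sizes` holds the size of each file covered by the request:
    one entry for a single file, several for a directory.
    """
    total = sum(file_sizes)
    if total <= threshold:
        return Action.DELETE_DIRECTLY
    if len(file_sizes) == 1:
        return Action.SPLIT_SINGLE_FILE
    if any(size > threshold for size in file_sizes):
        return Action.SPLIT_OVERSIZED_MEMBERS
    return Action.GROUP_WHOLE_FILES


print(choose_action([10], threshold=100).value)       # delete directly
print(choose_action([60, 70], threshold=100).value)   # group whole files into N groups
```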
Further, in step S404 of this embodiment of the present application, deleting the first file to be deleted according to the second deletion instruction further includes:
Step S4041: obtaining a first predetermined segmentation threshold;
Step S4042: segmenting the first file to be deleted into M groups according to the first predetermined segmentation threshold, wherein M is an integer greater than 0;
Step S4043: obtaining second predetermined submission frequency information;
Step S4044: submitting the M groups for deletion in sequence according to the second predetermined submission frequency information.
Specifically, if the client determines that the size of the first file to be deleted is larger than the predetermined file size threshold and that the first file to be deleted is a single file, file integrity is sacrificed: the file is divided into M groups using the first predetermined segmentation threshold as the unit, the groups are submitted for deletion in sequence according to the second predetermined submission frequency information, and the NameNode coordinates the cluster to delete them once the retention time is reached. Here M = (filesize % threshold == 0) ? (filesize / threshold) : (filesize / threshold + 1), that is, filesize / threshold rounded up to an integer.
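Reading the expression for M as a ceiling division, the group count and the resulting pieces can be sketched as follows. The function names are hypothetical, and the byte ranges are illustrative only: the text does not say how a group maps onto HDFS blocks.

```python
def group_count(filesize: int, threshold: int) -> int:
    """Number of groups: filesize / threshold rounded up, in integer arithmetic."""
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    quotient, remainder = divmod(filesize, threshold)
    return quotient if remainder == 0 else quotient + 1


def split_ranges(filesize: int, threshold: int) -> list:
    """Byte ranges [start, end) of each group; the last one may be shorter."""
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    return [(start, min(start + threshold, filesize))
            for start in range(0, filesize, threshold)]


print(group_count(300, 100))   # 3
print(group_count(301, 100))   # 4
print(split_ranges(250, 100))  # [(0, 100), (100, 200), (200, 250)]
```

The same computation applies to P in the multi-file case, with the second predetermined segmentation threshold in place of the first.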
Further, in step S4024 of this embodiment of the present application, deleting the first file according to the third deletion instruction further includes:
Step S40241: obtaining a second predetermined segmentation threshold;
Step S40242: segmenting the first file into P groups according to the second predetermined segmentation threshold, wherein P is an integer greater than 0;
Step S40243: obtaining third predetermined submission frequency information;
Step S40244: submitting the P groups of the segmented first file for deletion in sequence according to the third predetermined submission frequency information.
Specifically, the client checks the contents of the first file to be deleted one by one for such a single file. If one exists, file integrity is sacrificed: the file is divided into P groups using the second predetermined segmentation threshold as the unit, where P is an integer greater than 0. The groups are submitted for deletion in sequence according to the third predetermined submission frequency information, and the NameNode coordinates the cluster to delete them once the retention time is reached. Here P = (filesize % threshold == 0) ? (filesize / threshold) : (filesize / threshold + 1).
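Submitting groups "in sequence according to the predetermined submission frequency" amounts to a paced loop. The sketch below assumes the frequency is expressed as a fixed interval in seconds. The function submit_in_batches and its parameters are hypothetical, and the submit callback stands in for whatever call actually sends a deletion to the NameNode.

```python
import time
from typing import Callable, Iterable, Sequence


def submit_in_batches(groups: Iterable[Sequence[str]],
                      submit: Callable[[Sequence[str]], None],
                      interval_s: float,
                      sleep: Callable[[float], None] = time.sleep) -> int:
    """Submit each group in order, pausing `interval_s` between submissions."""
    submitted = 0
    for group in groups:
        if submitted:            # no pause before the first group
            sleep(interval_s)
        submit(group)
        submitted += 1
    return submitted


log = []
count = submit_in_batches(
    groups=[["/d/a", "/d/b"], ["/d/c"]],
    submit=log.append,                          # stand-in for a real delete call
    interval_s=30,
    sleep=lambda s: log.append(f"wait {s}s"),   # record the pause instead of sleeping
)
print(count, log)
```

Passing the sleep function as a parameter keeps the pacing testable without real delays. A production version would also need to decide what to do when one submission fails, which the text does not address.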
To sum up, the garbage recycling method of the HDFS provided by the embodiment of the present application has the following technical effects:
because, through reasonable parameter configuration, the trash's one-time deletion of a large number of files is converted into batched deletions of small files within the load-bearing capacity of the cluster, fewer blocks are added to the InvalidateBlocks set, the InvalidateBlocks set is correspondingly smaller, and fewer DataNodes need to receive and execute deletion operations through the heartbeat mechanism. The probability that blocks to be deleted are still unprocessed before the next BlockReport is reduced, and so is the probability that invalid blocks affect cluster performance. With fewer blocks to delete per node, the node's processing time and lock-holding time are correspondingly shortened, which helps improve NameNode processing performance. By deleting small files in batches, the method reduces the number of nodes executing operations, improves DataNode service performance, improves NameNode processing performance, and thereby improves cluster performance.
Example two
Based on the same inventive concept as the garbage collection method of the HDFS in the foregoing embodiment, the present invention further provides a garbage collection device of the HDFS, as shown in fig. 2, the device includes:
a first obtaining unit 11, where the first obtaining unit 11 is configured to obtain a first deletion instruction;
a second obtaining unit 12, where the second obtaining unit 12 is configured to obtain size information of a first file to be deleted according to the first deleting instruction;
a third obtaining unit 13, wherein the third obtaining unit 13 is configured to obtain a predetermined file size threshold;
a first judging unit 14, where the first judging unit 14 is configured to judge whether the first file size information to be deleted is larger than the predetermined file size threshold;
a fourth obtaining unit 15, where the fourth obtaining unit 15 is configured to obtain a first execution instruction if the first file size information to be deleted is not greater than the predetermined file size threshold;
a first executing unit 16, where the first executing unit 16 is configured to execute the first deleting instruction according to the first executing instruction, and delete the first file to be deleted.
Further, the apparatus further comprises:
a fifth obtaining unit, configured to obtain a first judgment instruction if the size information of the first file to be deleted is greater than the predetermined file size threshold;
the second judging unit is used for judging whether the first file to be deleted is a single file or not according to the first judging instruction;
a sixth obtaining unit, configured to obtain a second deletion instruction if the first file to be deleted is a single file;
and the second execution unit is used for deleting the first file to be deleted according to the second deletion instruction.
Further, the apparatus further comprises:
a seventh obtaining unit, configured to obtain a second determination instruction if the first file to be deleted is not a single file;
a third determining unit, configured to determine, according to the second determining instruction, whether the first file to be deleted includes a first file, where the first file is a single file whose file size is greater than the predetermined file size threshold;
an eighth obtaining unit, configured to obtain a third deletion instruction if the first file to be deleted includes the first file;
and the third execution unit is used for deleting the first file according to the third deletion instruction.
Further, the apparatus further comprises:
a ninth obtaining unit, configured to obtain a first grouping principle if the first file to be deleted does not include the first file;
a tenth obtaining unit, configured to divide the first file to be deleted into N groups according to the first grouping principle, where N is an integer greater than 0;
an eleventh obtaining unit, configured to obtain first predetermined submission frequency information;
and a fourth execution unit, configured to submit the N groups of files to be deleted for deletion in sequence according to the first predetermined submission frequency information.
Further, the apparatus further comprises:
a twelfth obtaining unit, configured to obtain a third determination instruction;
a fourth judging unit, configured to judge whether the first file to be deleted is a single file according to the third judging instruction;
a thirteenth obtaining unit, configured to obtain a first execution instruction if the first file to be deleted is a single file;
and the fifth execution unit is used for executing the first deleting instruction according to the first execution instruction and deleting the first file to be deleted.
Further, the apparatus further comprises:
a fourteenth obtaining unit, configured to obtain a first predetermined segmentation threshold;
a fifteenth obtaining unit, configured to segment the first file to be deleted into M groups according to the first predetermined segmentation threshold, where M is an integer greater than 0;
a sixteenth obtaining unit, configured to obtain second predetermined submission frequency information;
and a sixth execution unit, configured to submit the M groups for deletion in sequence according to the second predetermined submission frequency information.
Further, the apparatus further comprises:
a seventeenth obtaining unit, configured to obtain a second predetermined segmentation threshold;
an eighteenth obtaining unit, configured to segment the first file into P groups according to the second predetermined segmentation threshold, where P is an integer greater than 0;
a nineteenth obtaining unit, configured to obtain third predetermined submission frequency information;
and a seventh execution unit, configured to submit the P groups of the segmented first file for deletion in sequence according to the third predetermined submission frequency information.
Various changes and specific examples of the garbage collection method of the HDFS in the first embodiment of fig. 1 are also applicable to the garbage collection device of the HDFS in this embodiment, and a person skilled in the art can clearly know the garbage collection device of the HDFS in this embodiment through the foregoing detailed description of the garbage collection method of the HDFS, so for the brevity of the description, detailed descriptions are omitted here.
Exemplary electronic device
The electronic device of the embodiment of the present application is described below with reference to fig. 3.
Fig. 3 illustrates a schematic structural diagram of an electronic device according to an embodiment of the present application.
Based on the inventive concept of the garbage recycling method of the HDFS in the foregoing embodiments, the present invention further provides a garbage recycling apparatus of the HDFS on which a computer program is stored; when executed by a processor, the program implements the steps of any one of the garbage recycling methods of the HDFS described above.
In fig. 3, the bus architecture is represented by bus 300. Bus 300 may include any number of interconnected buses and bridges, and it links together various circuits, including one or more processors, represented by processor 302, and memory, represented by memory 304. Bus 300 may also link together various other circuits, such as peripherals, voltage regulators, and power management circuits, which are well known in the art and are therefore not described further herein. A bus interface 305 provides an interface between bus 300 and the receiver 301 and transmitter 303. The receiver 301 and the transmitter 303 may be the same element, namely a transceiver, providing a means for communicating with various other apparatus over a transmission medium.
The processor 302 is responsible for managing the bus 300 and general processing, and the memory 304 may be used for storing data used by the processor 302 in performing operations.
The embodiment of the application provides a garbage recycling method of an HDFS (Hadoop Distributed File System), wherein the method comprises the following steps: obtaining a first deletion instruction; obtaining, according to the first deletion instruction, size information of a first file to be deleted; obtaining a predetermined file size threshold; judging whether the size information of the first file to be deleted is larger than the predetermined file size threshold; if the size information of the first file to be deleted is not larger than the predetermined file size threshold, obtaining a first execution instruction; and executing the first deletion instruction according to the first execution instruction, thereby deleting the first file to be deleted.
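The threshold check in the steps above can be sketched as follows. This is an illustrative sketch only. `DeleteRequest`, `should_delete_directly`, `handle_delete`, and `delete_fn` are hypothetical names that do not appear in the application, and the size is assumed to be expressed in bytes.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class DeleteRequest:
    """Hypothetical stand-in for the first deletion instruction."""
    path: str
    size_bytes: int  # size information of the first file to be deleted


def should_delete_directly(request: DeleteRequest, size_threshold_bytes: int) -> bool:
    """Return True when the size is not larger than the predetermined threshold."""
    return request.size_bytes <= size_threshold_bytes


def handle_delete(
    request: DeleteRequest,
    size_threshold_bytes: int,
    delete_fn: Callable[[str], object],
) -> bool:
    """Delete immediately if the size check passes; otherwise leave the file alone.

    Returns True if the file was deleted directly. Larger targets are left
    for the other branches of the method, which this sketch does not cover.
    """
    if should_delete_directly(request, size_threshold_bytes):
        delete_fn(request.path)
        return True
    return False
```

A size equal to the threshold counts as "not larger" and is therefore deleted directly.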
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention has been described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (devices), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks. While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (9)

1. A garbage collection method of an HDFS (Hadoop distributed File System), wherein the method comprises the following steps:
obtaining a first deleting instruction;
according to the first deleting instruction, obtaining size information of a first file to be deleted;
obtaining a predetermined file size threshold;
judging whether the size information of the first file to be deleted is larger than the predetermined file size threshold;
if the size information of the first file to be deleted is not larger than the predetermined file size threshold, obtaining a first execution instruction;
and executing the first deleting instruction according to the first execution instruction, and deleting the first file to be deleted.
2. The method of claim 1, wherein the determining whether the size information of the first file to be deleted is greater than the predetermined file size threshold comprises:
if the size information of the first file to be deleted is larger than the predetermined file size threshold, obtaining a first judgment instruction;
judging whether the first file to be deleted is a single file or not according to the first judgment instruction;
if the first file to be deleted is a single file, a second deletion instruction is obtained;
and deleting the first file to be deleted according to the second deleting instruction.
3. The method according to claim 2, wherein the determining whether the first file to be deleted is a single file according to the first determination instruction includes:
if the first file to be deleted is not a single file, a second judgment instruction is obtained;
judging whether the first file to be deleted contains a first file according to the second judgment instruction, wherein the first file is a single file with a file size larger than the predetermined file size threshold;
if the first file to be deleted comprises the first file, a third deleting instruction is obtained;
and deleting the first file according to the third deleting instruction.
4. The method according to claim 3, wherein said determining, according to the second determination instruction, whether the first file to be deleted includes a single file with a file size larger than the predetermined file size threshold includes:
if the first file to be deleted does not contain the first file, a first grouping principle is obtained;
according to the first grouping principle, dividing the first file to be deleted into N groups, wherein N is an integer greater than 0;
obtaining first predetermined submission frequency information;
and sequentially submitting and deleting the N groups of files to be deleted according to the first predetermined submission frequency information.
5. The method of claim 1, wherein obtaining the first execution instruction is preceded by:
obtaining a third judgment instruction;
judging whether the first file to be deleted is a single file or not according to the third judgment instruction;
if the first file to be deleted is a single file, obtaining a first execution instruction;
and executing the first deleting instruction according to the first execution instruction, and deleting the first file to be deleted.
6. The method of claim 2, wherein the deleting the first file to be deleted according to the second deletion instruction comprises:
obtaining a first predetermined segmentation threshold;
according to the first predetermined segmentation threshold, segmenting the first file to be deleted into M groups, wherein M is an integer larger than 0;
obtaining second predetermined submission frequency information;
and sequentially submitting and deleting the M groups of the first file to be deleted according to the second predetermined submission frequency information.
7. The method of claim 3, wherein the deleting the first file according to the third deletion instruction comprises:
obtaining a second predetermined segmentation threshold;
according to the second predetermined segmentation threshold, segmenting the first file into P groups, wherein P is an integer larger than 0;
obtaining third predetermined submission frequency information;
and sequentially submitting and deleting the P groups of the segmented first file according to the third predetermined submission frequency information.
8. A garbage collection apparatus of an HDFS, wherein the apparatus comprises:
a first obtaining unit configured to obtain a first deletion instruction;
the second obtaining unit is used for obtaining the size information of the first file to be deleted according to the first deleting instruction;
a third obtaining unit configured to obtain a predetermined file size threshold;
a first judging unit, configured to judge whether the size information of the first file to be deleted is larger than the predetermined file size threshold;
a fourth obtaining unit, configured to obtain a first execution instruction if the size information of the first file to be deleted is not greater than the predetermined file size threshold;
and the first execution unit is used for executing the first deleting instruction according to the first execution instruction and deleting the first file to be deleted.
9. A garbage collection apparatus of an HDFS, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the method of any one of claims 1 to 7 when executing the program.
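
The branching laid out in claims 1 to 4 can be summarized in a short Python sketch. It is an illustration under stated assumptions, not the claimed implementation. `Strategy` and `choose_strategy` are hypothetical names. The deletion target is assumed to be described by a list of member file sizes, and a single file is a list with one entry. The additional single-file check of claim 5 is not modeled. The claims do not say how the remaining members are handled once a large member is found, and the sketch does not decide that either.

```python
from enum import Enum
from typing import Sequence


class Strategy(Enum):
    DIRECT = "delete directly"                       # claim 1
    SPLIT_SINGLE_FILE = "segment the single file"    # claims 2 and 6
    SPLIT_LARGE_MEMBERS = "segment large members"    # claims 3 and 7
    GROUP_AND_THROTTLE = "group and submit in turn"  # claim 4


def choose_strategy(member_sizes: Sequence[int], threshold: int) -> Strategy:
    """Pick a deletion strategy from the member sizes of the deletion target.

    Assumption: the size information of the target is the sum of its members.
    """
    total = sum(member_sizes)
    if total <= threshold:
        return Strategy.DIRECT
    if len(member_sizes) == 1:
        return Strategy.SPLIT_SINGLE_FILE
    if any(size > threshold for size in member_sizes):
        return Strategy.SPLIT_LARGE_MEMBERS
    return Strategy.GROUP_AND_THROTTLE
```

Returning a strategy value rather than deleting inside the function keeps the decision logic separate from the file system calls, which makes each branch easy to check in isolation.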
CN202011437627.5A 2020-12-10 2020-12-10 Garbage recycling method and device of HDFS Pending CN112540968A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011437627.5A CN112540968A (en) 2020-12-10 2020-12-10 Garbage recycling method and device of HDFS

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011437627.5A CN112540968A (en) 2020-12-10 2020-12-10 Garbage recycling method and device of HDFS

Publications (1)

Publication Number Publication Date
CN112540968A true CN112540968A (en) 2021-03-23

Family

ID=75019946

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011437627.5A Pending CN112540968A (en) 2020-12-10 2020-12-10 Garbage recycling method and device of HDFS

Country Status (1)

Country Link
CN (1) CN112540968A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20010045586A (en) * 1999-11-05 2001-06-05 이계철 Device for security against deleting a file in UNIX series Operating System and Method thereof
JP2013088931A (en) * 2011-10-14 2013-05-13 Hitachi Solutions Ltd Retrieval device, document management method, and document retrieval system
CN108846032A (en) * 2018-05-28 2018-11-20 郑州云海信息技术有限公司 File delet method, device and equipment in a kind of storage system
CN109446162A (en) * 2018-10-22 2019-03-08 王梅 Determine the method and system of the data mode of destination mobile terminal in mobile Internet
CN110750503A (en) * 2019-09-27 2020-02-04 浪潮电子信息产业股份有限公司 File deletion speed control method, device, equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination