CN115640261A

CN115640261A - HDFS empty file positioning method, device, equipment and medium

Info

Publication number: CN115640261A
Application number: CN202211376114.7A
Authority: CN
Inventors: 穆纯进; 王云朋
Original assignee: China United Network Communications Group Co Ltd; Unicom Digital Technology Co Ltd
Current assignee: China United Network Communications Group Co Ltd; Unicom Digital Technology Co Ltd
Priority date: 2022-11-04
Filing date: 2022-11-04
Publication date: 2023-01-24

Abstract

The application provides a method, an apparatus, a device, and a medium for locating HDFS empty files, and aims to solve the problem that locating empty files in an HDFS affects normal data services and is inefficient. The method comprises the following steps: acquiring a binary metadata file of a distributed file system (HDFS); deserializing the binary metadata file to obtain a plaintext file; loading the plaintext file into a pre-established Hive table to obtain a Hive table of the plaintext file; and positioning the file storing the empty file list in the Hive table of the plaintext file to obtain positioning information. With this method, empty files can be located quickly and accurately, the efficiency of locating empty files is improved, support and convenience are provided for cleaning up empty files, and the stability of the HDFS is effectively improved.

Description

HDFS empty file positioning method, device, equipment and medium

Technical Field

The present application relates to the field of computer technologies, and in particular, to a method, an apparatus, a device, and a medium for locating an HDFS empty file.

Background

HDFS, short for Hadoop Distributed File System, is a distributed, highly available file system that is very widely used in the field of big data.

In practical use, the stability of HDFS may deteriorate as the number of files increases; a large number of useless empty files in particular can make the storage system even less stable. How to quickly locate empty files in HDFS is therefore worth studying.

In the related art, empty files in HDFS are mainly located by recursively traversing the directory tree of the entire HDFS to obtain a list of all files and then picking out the empty files from that list. This places a heavy load on the HDFS, disrupts normal data services, and takes a long time.

Disclosure of Invention

In view of the above problems, that is, the problem that the normal data service of the HDFS is affected and the efficiency is low in the HDFS empty file positioning process, the present application provides an HDFS empty file positioning method, apparatus, device, and medium.

In order to achieve the above purpose, the present application provides the following technical solutions:

According to an aspect of the present application, there is provided a method for locating an empty file of a distributed file system HDFS, including:

acquiring a binary metadata file of a distributed file system (HDFS);

deserializing the binary metadata file to obtain a plaintext file;

loading the plaintext file in a pre-established Hive table to obtain a Hive table of plaintext files; and positioning the file storing the empty file list in the Hive table of the plaintext file to obtain positioning information.

In one embodiment, after obtaining the binary metadata file of the distributed file system HDFS and before deserializing the binary metadata file, the method further includes:

pushing the binary metadata file to a server cluster irrelevant to a production environment;

the deserializing of the binary metadata file includes: deserializing the binary metadata file in the server cluster.

In one embodiment, acquiring the binary metadata file of the distributed file system HDFS includes:

extracting the binary metadata file from the metadata node NameNode of the distributed file system HDFS.

In one embodiment, the deserializing the binary metadata file comprises:

acquiring the structural information of the binary metadata file, and acquiring a corresponding deserialization program based on the structural information;

deserializing the binary metadata file based on the deserializer.

In an embodiment, positioning the file storing the empty file list in the Hive table of the plaintext file to obtain positioning information includes:

positioning the file storing the empty file list in the Hive table of the plaintext file based on the Hive query language HQL to obtain positioning information.

In one embodiment, the format of the plaintext file comprises at least one of: HDFS_DIR, the file path and directory path in the distributed file system HDFS; REPLICATION, the number of copies; MODIFICATION_TIME, the modification time; ACCESS_TIME, the access time.

In an embodiment, after positioning the file storing the empty file list in the Hive table of the plaintext file to obtain the positioning information, the method further includes: deleting all the located files in the empty file list based on the positioning information.

According to another aspect of the present application, there is provided a device for locating an empty file in a distributed file system HDFS, including:

an acquisition module configured to acquire a binary metadata file of a distributed file system HDFS;

a deserialization module configured to deserialize the binary metadata file to obtain a plaintext file;

the loading module is used for loading the plaintext file in a pre-established Hive table to obtain a Hive table of the plaintext file;

and the positioning module is used for positioning the file storing the empty file list in the Hive table of the plaintext file to obtain positioning information.

According to yet another aspect of the present application, there is provided an electronic device including: a memory and a processor;

the memory stores computer-executable instructions;

the processor executes computer execution instructions stored in the memory, so that the electronic equipment executes the method for positioning the empty file of the HDFS.

According to another aspect of the present application, a computer-readable storage medium is provided, where a computer executable instruction is stored, and when the computer executable instruction is executed by a processor, the computer executable instruction is used for implementing the method for locating an empty file in a distributed file system HDFS.

It can be understood that the method, the device, the equipment and the medium for positioning the HDFS empty file provided by the application acquire the binary metadata file of the HDFS of the distributed file system; deserializing the binary metadata file to obtain a plaintext file; loading the plaintext file in a pre-established Hive table to obtain a Hive table of the plaintext file; and positioning the file storing the empty file list in the Hive table of the plaintext file to obtain positioning information. By the method, the empty file can be quickly and accurately positioned, the positioning efficiency of the empty file is improved, support and convenience are provided for cleaning the empty file, and the stability of the HDFS is effectively improved.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and, together with the description, serve to explain the principles of the application.

FIG. 1a is a network architecture diagram of a distributed file system HDFS;

FIG. 1b is a diagram of a network architecture for improving the HDFS stability of a distributed file system in the related art;

FIG. 2 is a schematic diagram of a possible scenario provided in an embodiment of the present application;

FIG. 3 is a first schematic flowchart of a method for locating an empty file in a distributed file system HDFS according to an embodiment of the present application;

FIG. 4 is a second schematic flowchart of a method for locating an empty file in a distributed file system HDFS according to an embodiment of the present application;

FIG. 5 is a schematic structural diagram of an empty file positioning device of a distributed file system HDFS according to an embodiment of the present application;

FIG. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application.

With the above figures, there are shown specific embodiments of the present application, which will be described in more detail below. These drawings and written description are not intended to limit the scope of the inventive concepts in any manner, but rather to illustrate the inventive concepts to those skilled in the art by reference to specific embodiments.

Detailed Description

The HDFS may be deployed on a cluster composed of multiple machines and involves several basic concepts: the metadata node NameNode, the data node DataNode, and data blocks. The NameNode is responsible for metadata management of the entire distributed file system, that is, information such as file path names, data block IDs, and storage locations; it also records which nodes are part of the cluster and how many copies each block has, as shown in FIG. 1a.

It can thus be seen that the NameNode is a very important component whose stability and reliability are critical: if the NameNode goes down, the entire distributed file system becomes unavailable. The open source community has therefore done much work on the high availability of the HDFS. According to the NameNode high-availability principle shown in FIG. 1b, two NameNodes are started to eliminate the single point of failure, which improves the overall stability of the HDFS.

However, in actual use, the stability of the HDFS may degrade as the number of files grows over time, and a large number of useless empty files in particular makes the storage system even less stable. Moreover, the number of files in the HDFS storage system of a big data platform is usually in the tens of millions or even hundreds of millions. In the related art, it is only possible to know the number of files in the HDFS file system, the number of files in a given directory, and which files exist in a given small directory; it is not possible to quickly and accurately locate which empty files exist in the entire HDFS storage system.

In view of this, embodiments of the present application provide an HDFS empty file positioning method, apparatus, device, and medium. A binary metadata file of the HDFS is acquired and deserialized to obtain a plaintext file, the plaintext file is loaded into a Hive table, and the file storing the empty file list is then located in the Hive table of the plaintext file. Because this is done in a manner that does not intrude on the HDFS, normal data services of the HDFS are not affected; empty files can be located quickly and accurately, which improves the efficiency of locating empty files, provides support and convenience for subsequent empty file cleaning, and effectively improves the stability of the HDFS.

In order to make the objects, technical solutions and advantages of the present application clearer, the technical solutions in the embodiments of the present application will be described in more detail below with reference to the accompanying drawings in the embodiments of the present application. In the drawings, the same or similar reference numerals denote the same or similar components or components having the same or similar functions throughout. The described embodiments are a subset of the embodiments in the present application and not all embodiments in the present application. The embodiments described below with reference to the accompanying drawings are illustrative and intended to explain the present application and should not be construed as limiting the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

FIG. 2 is a schematic diagram of a possible scenario provided by an embodiment of the present application. As shown in FIG. 2, the scenario includes a distributed file system HDFS 210 and a terminal device 220; the HDFS 210 is deployed on a server, and the terminal device 220 and the server are connected to each other through a wired or wireless network. In some embodiments, the HDFS 210 is configured to provide a binary metadata file to the terminal device 220, and the terminal device 220 is configured to perform operations such as deserialization and empty file location based on the metadata file provided by the HDFS 210. Optionally, in the deserialization and empty file positioning processes, the terminal device 220 undertakes the main computation work, or undertakes the computation work alone.

The terminal device may include, but is not limited to, a computer, a smart phone, a tablet computer, an e-book reader, a Moving Picture Experts Group Audio Layer III (MP3) player, a Moving Picture Experts Group Audio Layer IV (MP4) player, a portable computer, a vehicle-mounted computer, a wearable device, a desktop computer, a set-top box, a smart television, and the like.

The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a Network service, cloud communication, middleware service, a domain name service, a security service, a Content Delivery Network (CDN), a big data and artificial intelligence platform, and the like.

Optionally, the number of the distributed file system HDFS210 and the terminal devices 220 may be more or less, and the embodiment of the present application does not limit this.

The foregoing briefly explains the scenario of the present application. The following describes in detail the method for locating an empty file in a distributed file system HDFS according to an embodiment of the present application, taking its application to the terminal device 220 in FIG. 2 as an example.

Referring to fig. 3, fig. 3 is a diagram illustrating an empty file location method of a distributed file system HDFS according to an embodiment of the present disclosure, including steps S301 to S304.

Step S301, obtaining a binary metadata file of the HDFS.

In the embodiment, the positioning of the HDFS empty file is not directly performed on the HDFS, but the non-invasive positioning of the HDFS empty file is realized by acquiring the HDFS metadata file and performing analysis positioning in subsequent steps.

It can be understood that, in the related art, the API of the distributed file system HDFS is used directly to scan for empty files. Since the number of files in an HDFS is generally in the tens of millions or even hundreds of millions, scanning and locating in this way intrudes on the performance of the HDFS and may even render the HDFS unusable.

In one embodiment, step S301 obtains a binary metadata file of the HDFS of the distributed file system, specifically: extracting binary metadata files from metadata nodes NameNode of a distributed file system HDFS.

It can be understood that the NameNode is responsible for metadata management of the entire distributed file system, including information such as file path names, data block IDs, and storage locations. By checking the monitoring page of the HDFS, it can be seen that the NameNode knows how many files and folders, how many blocks, and how many objects in total exist in the entire file system, as well as the size of any given file.
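As a hedged illustration of extracting the metadata file from the NameNode, the sketch below builds (and optionally runs) the standard Hadoop command `hdfs dfsadmin -fetchImage`, which downloads the most recent fsimage to a local directory. The embodiment does not name the tool it uses, so the choice of this command, the helper names `build_fetch_image_command` and `fetch_fsimage`, and the example directory are all assumptions.

```python
import subprocess
from pathlib import Path


def build_fetch_image_command(local_dir: str) -> list:
    """Build the command line that asks the NameNode for its latest fsimage.

    Assumption: the binary metadata file is the NameNode fsimage and is
    fetched with `hdfs dfsadmin -fetchImage <local directory>`.
    """
    return ["hdfs", "dfsadmin", "-fetchImage", local_dir]


def fetch_fsimage(local_dir: str, dry_run: bool = True) -> list:
    """Return the command; run it only when dry_run is False."""
    command = build_fetch_image_command(local_dir)
    if not dry_run:
        # Requires a configured Hadoop client on this machine.
        Path(local_dir).mkdir(parents=True, exist_ok=True)
        subprocess.run(command, check=True)
    return command


if __name__ == "__main__":
    # Hypothetical target directory on the analysis host.
    print(" ".join(fetch_fsimage("/tmp/fsimage_dump")))
```

Keeping the default as a dry run makes the sketch safe to execute on a machine without a Hadoop client; the downloaded file would then be copied to the analysis cluster by whatever transfer mechanism is available.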

Further, after acquiring the binary metadata file of the distributed file system HDFS in step S301 and before deserializing the binary metadata file in step S302, the method may further include the following step:

pushing the binary metadata file to a server cluster unrelated to the production environment.

In this embodiment, the service impact on the HDFS can be further reduced by parsing the HDFS metadata file on another server cluster unrelated to the HDFS, so that empty files are located without intrusion. In some embodiments, the metadata parsing may also be performed in the terminal device.

Step S302, deserializing the binary metadata file to obtain a plaintext file.

In this embodiment, step S302 deserializes the binary metadata file, specifically: deserializing the binary metadata file in the server cluster.

Further, the deserializing of the binary metadata file in step S302 may include the following steps:

acquiring the structural information of the binary metadata file, and acquiring a corresponding deserialization program based on the structural information;

deserializing the binary metadata file based on the deserialization program.

In this embodiment, by working out the structure of the metadata and writing a deserialization program in Java to deserialize the binary metadata file into a plaintext file, the metadata file can be parsed efficiently.

It will be appreciated that other information may be included in addition to the format information described above. Illustratively, in this embodiment, the format is as follows:

HDFS_DIR: the file path and directory path in the HDFS; REPLICATION: the number of copies; MODIFICATION_TIME: the modification time; ACCESS_TIME: the access time; PREFERRED_BLOCK_SIZE: the preferred data block size; BLOCK_COUNT: the number of data blocks; FILE_SIZE: the size of the file; NSQUOTA: the name quota; DSQUOTA: the space quota; PERMISSION: the permission; USER_NAME: the user name; GROUP_NAME: the group name.
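To make the plaintext format concrete, the following sketch parses one line of such a file into a record keyed by the field names listed above. The tab delimiter, the field order, the integer typing of the numeric fields, and the helper names `parse_record` and `is_empty_file` are assumptions for illustration; the deserialization program itself is not shown in this text.

```python
# Field order follows the list above; the tab delimiter is an assumption.
FIELD_NAMES = (
    "HDFS_DIR", "REPLICATION", "MODIFICATION_TIME", "ACCESS_TIME",
    "PREFERRED_BLOCK_SIZE", "BLOCK_COUNT", "FILE_SIZE", "NSQUOTA",
    "DSQUOTA", "PERMISSION", "USER_NAME", "GROUP_NAME",
)
INTEGER_FIELDS = {
    "REPLICATION", "PREFERRED_BLOCK_SIZE", "BLOCK_COUNT",
    "FILE_SIZE", "NSQUOTA", "DSQUOTA",
}


def parse_record(line: str, delimiter: str = "\t") -> dict:
    """Split one plaintext line into a dict, converting numeric fields to int."""
    values = line.rstrip("\r\n").split(delimiter)
    if len(values) != len(FIELD_NAMES):
        raise ValueError(f"expected {len(FIELD_NAMES)} fields, got {len(values)}")
    record = {}
    for name, value in zip(FIELD_NAMES, values):
        record[name] = int(value) if name in INTEGER_FIELDS else value
    return record


def is_empty_file(record: dict) -> bool:
    """An entry with at least one copy and zero size is treated as an empty file."""
    return record["REPLICATION"] > 0 and record["FILE_SIZE"] == 0


if __name__ == "__main__":
    # Made-up sample line, for illustration only.
    sample = "\t".join([
        "/data/a.txt", "3", "2022-11-01 10:00", "2022-11-02 11:00",
        "134217728", "0", "0", "0", "0", "rw-r--r--", "hive", "hadoop",
    ])
    print(is_empty_file(parse_record(sample)))
```

The same emptiness test is what the later query expresses over the whole table; checking it per record here is only meant to show how the fields are used.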

Step S303, loading the plaintext file into the pre-established Hive table to obtain a Hive table of the plaintext file.

As can be appreciated, Hive is a Hadoop-based data warehouse tool that provides a mechanism for storing, querying, and analyzing large-scale data stored in Hadoop. Hive can map structured data files to database tables (Hive tables), provides an SQL (Structured Query Language) query function, and converts SQL statements into MapReduce tasks for execution, enabling fast data queries. In this embodiment, the table-creation statement for the plaintext file Hive table is as follows:
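The field list given earlier suggests what such a statement might look like. The Python sketch below is an illustration only, not the statement used in this embodiment: it assembles a HiveQL `CREATE TABLE ... ROW FORMAT DELIMITED` statement and a `LOAD DATA INPATH` statement as strings. The table name `hdfs_table` and the columns `path`, `replication`, and `filesize` are taken from the example query in this description; the remaining column names, the column types, the tab delimiter, and the helper names are assumptions.

```python
# Hypothetical column layout; only path, replication and filesize appear
# in the example query of this description.
COLUMNS = [
    ("path", "STRING"),
    ("replication", "INT"),
    ("modification_time", "STRING"),
    ("access_time", "STRING"),
    ("preferred_block_size", "BIGINT"),
    ("block_count", "INT"),
    ("filesize", "BIGINT"),
    ("nsquota", "BIGINT"),
    ("dsquota", "BIGINT"),
    ("permission", "STRING"),
    ("user_name", "STRING"),
    ("group_name", "STRING"),
]


def build_create_table(table: str = "hdfs_table", delimiter: str = "\\t") -> str:
    """Return a HiveQL statement creating a delimited text table."""
    column_sql = ",\n  ".join(f"{name} {hive_type}" for name, hive_type in COLUMNS)
    return (
        f"CREATE TABLE IF NOT EXISTS {table} (\n  {column_sql}\n)\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'\n"
        "STORED AS TEXTFILE"
    )


def build_load_statement(hdfs_path: str, table: str = "hdfs_table") -> str:
    """Return a HiveQL statement loading the plaintext file into the table."""
    return f"LOAD DATA INPATH '{hdfs_path}' OVERWRITE INTO TABLE {table}"


if __name__ == "__main__":
    print(build_create_table())
    # Hypothetical location of the deserialized plaintext file.
    print(build_load_statement("/tmp/fsimage_plaintext.txt"))
```

Generating the statements from one column list keeps the table definition and the deserializer's field order in a single place, which is one way to avoid the two drifting apart.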

Step S304, positioning the file storing the empty file list in the Hive table of the plaintext file to obtain positioning information.

In this embodiment, for example, an HQL (Hive Query Language) query statement is used to query the plaintext file Hive table and quickly find the corresponding empty files. It can be understood that scanning through the API of the HDFS generally takes several hours, whereas the above scheme of this embodiment can accurately locate the empty files within minutes, which greatly improves the efficiency of locating empty files.

Specifically, in step S304, positioning the file storing the empty file list in the plaintext file Hive table to obtain positioning information includes: positioning the file storing the empty file list in the Hive table of the plaintext file based on the Hive query language HQL to obtain positioning information.

It will be appreciated that HQL, the Hive query language, is similar to SQL: a query is written against tables and columns like a conventional SQL query, and Hive converts it into distributed jobs for execution. An HQL distributed job is submitted to quickly and accurately compute the empty file list; an example query is as follows:

SELECT path
FROM hdfs_table
WHERE replication > 0 AND filesize = 0;
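The effect of this query can be tried without a Hive deployment. The sketch below uses Python's built-in `sqlite3` module purely as a stand-in for Hive (an assumption made so the example is self-contained) and runs the same filter on a few made-up rows; the `replication > 0` condition presumably excludes entries such as directories that carry no copies.

```python
import sqlite3


def find_empty_files(rows):
    """Load (path, replication, filesize) rows and return the empty-file paths."""
    connection = sqlite3.connect(":memory:")
    try:
        connection.execute(
            "CREATE TABLE hdfs_table (path TEXT, replication INTEGER, filesize INTEGER)"
        )
        connection.executemany("INSERT INTO hdfs_table VALUES (?, ?, ?)", rows)
        cursor = connection.execute(
            "SELECT path FROM hdfs_table "
            "WHERE replication > 0 AND filesize = 0 ORDER BY path"
        )
        return [row[0] for row in cursor]
    finally:
        connection.close()


# Made-up sample rows: two empty files, one non-empty file, one directory.
SAMPLE_ROWS = [
    ("/data/a.txt", 3, 0),
    ("/data/b.txt", 3, 1024),
    ("/data/dir", 0, 0),
    ("/logs/empty.log", 2, 0),
]

if __name__ == "__main__":
    print(find_empty_files(SAMPLE_ROWS))  # ['/data/a.txt', '/logs/empty.log']
```

In the embodiment the same filter runs as a distributed Hive job over tens of millions of rows; the in-memory database here only demonstrates the selection logic.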

In one embodiment, after the empty files are located, they are deleted and cleaned up according to the positioning information, so that the stability of the HDFS can be effectively guaranteed. Specifically, after step S304 positions the file storing the empty file list in the plaintext file Hive table to obtain the positioning information, the method may further include the following step: deleting all the located files in the empty file list based on the positioning information.

In one implementation, a deletion program may be generated and executed to delete the located target empty files in the HDFS. In other implementations, deletion instruction information carrying the positioning information may be generated, and the empty files are deleted in the HDFS according to that information.

To facilitate understanding of the embodiments of the present application, the following process is described with reference to FIG. 4:
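One way to picture the generated deletion program is as a list of shell commands derived from the located paths. The sketch below is a minimal illustration under stated assumptions: it uses the standard `hdfs dfs -rm` command, groups paths into batches, and only builds the command strings rather than executing them. The helper name `build_delete_commands` and the batch size are hypothetical.

```python
import shlex


def build_delete_commands(paths, batch_size=100):
    """Turn located empty-file paths into batched `hdfs dfs -rm` commands.

    Paths are shell-quoted so that names containing spaces stay intact.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    commands = []
    for start in range(0, len(paths), batch_size):
        batch = paths[start:start + batch_size]
        quoted = " ".join(shlex.quote(path) for path in batch)
        commands.append(f"hdfs dfs -rm {quoted}")
    return commands


if __name__ == "__main__":
    # Made-up positioning information, for illustration only.
    located = ["/data/a.txt", "/logs/empty file.log", "/tmp/zero.dat"]
    for command in build_delete_commands(located, batch_size=2):
        print(command)
```

Emitting commands instead of deleting directly leaves room for a review step before anything is removed from the production HDFS, and batching limits how many paths a single invocation touches.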

Step 1: collect the metadata file at the NameNode node, where the metadata file is a binary file, and upload it to one or more servers in a server cluster independent of the production environment.

Specifically, this process obtains the metadata file non-intrusively. The latest metadata of the NameNode is stored in full in memory, and reading the NameNode's memory directly would generate great pressure and affect data services; therefore, the serialized metadata file of the NameNode is obtained first and pushed to a cluster unrelated to the enterprise's production environment.

Step 2: deserialize the metadata file in the server cluster.

Specifically, by analyzing the structure of the metadata, a deserialization program is written in Java to deserialize the binary metadata file into a plaintext file.

Step 3: establish a Hive table and load the deserialized plaintext file into it.

Step 4: submit an HQL distributed job to quickly and accurately compute the empty file list.

The embodiment of the present application further provides a device for locating an empty file in a distributed file system HDFS, as shown in FIG. 5, which includes an obtaining module 51, a deserialization module 52, a loading module 53, and a positioning module 54, wherein:

an obtaining module 51 configured to obtain a binary metadata file of the HDFS;

a deserialization module 52 configured to deserialize the binary metadata file to obtain a plaintext file;

the loading module 53 is configured to load the plaintext file in a pre-established Hive table to obtain a Hive table of plaintext files;

and a positioning module 54 configured to position the file storing the empty file list in the plaintext file Hive table to obtain positioning information.

In one embodiment, the apparatus further comprises:

a push module configured to push the binary metadata file into a server cluster unrelated to a production environment;

the deserializing module 52 is specifically configured to deserialize the binary metadata file in the server cluster.

In an embodiment, the obtaining module 51 is specifically configured to extract a binary metadata file from a metadata node NameNode of the HDFS of the distributed file system.

In one embodiment, the deserialization module 52 includes:

a program acquisition unit configured to acquire structure information of the binary metadata file and acquire a corresponding deserialization program based on the structure information;

a deserialization unit configured to deserialize the binary metadata file based on the deserialization program.

In an embodiment, the positioning module 54 is specifically configured to position the file storing the empty file list in the Hive table of the plaintext file based on the Hive query language HQL, so as to obtain positioning information.

In one embodiment, the format of the plaintext file comprises at least one of: HDFS_DIR, the file path and directory path in the distributed file system HDFS; REPLICATION, the number of copies; MODIFICATION_TIME, the modification time; ACCESS_TIME, the access time.

In one embodiment, the apparatus further comprises: an empty file deleting module configured to delete all the located files in the empty file list based on the positioning information.

It should be noted that, the apparatus provided in the present application can correspondingly implement all the method steps implemented by the terminal device in the foregoing method embodiment, and can achieve the same technical effect, and detailed descriptions of the same parts and beneficial effects as those of the method embodiment in this embodiment are not repeated herein.

Correspondingly, an electronic device is further provided in an embodiment of the present application, as shown in fig. 6, including: a memory 61 and a processor 62;

the memory 61 stores computer execution instructions;

the processor 62 executes the computer-executable instructions stored in the memory 61, so that the electronic device executes the method for locating the empty file of the HDFS in the distributed file system.

It should be noted that, the electronic device provided in the present application can correspondingly implement all the method steps implemented by the terminal device in the foregoing method embodiment, and can achieve the same technical effect, and detailed descriptions of the same parts and beneficial effects as those of the method embodiment are not repeated here.

The embodiment of the present application correspondingly provides a computer-readable storage medium, where a computer-executable instruction is stored in the computer-readable storage medium, and when the computer-executable instruction is executed by a processor, the computer-executable instruction is used to implement the method for locating an empty file in a distributed file system HDFS.

The embodiment of the present application also provides a computer program product, where the computer program product includes computer program code, and when the computer program code runs on a computer, the computer is caused to execute the above method for locating an empty file in a distributed file system HDFS.

The embodiment of the present application correspondingly provides a chip, which includes a memory and a processor, where the memory is used to store a computer program, and the processor is used to call and run the computer program from the memory to execute the above method for locating an empty file in a distributed file system HDFS.

It should be noted that the medium, the program product, and the chip provided in the present application can correspondingly implement all the method steps implemented by the terminal device in the foregoing method embodiments, and can achieve the same technical effects, and detailed descriptions of the same parts and beneficial effects as those of the method embodiments in this embodiment are not repeated herein.

It will be understood by those of ordinary skill in the art that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. In a hardware implementation, the division between functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed by several physical components in cooperation. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media).

The term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data, as is well known to those of ordinary skill in the art. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer.

In addition, communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media as is well known to those skilled in the art.

In the description of the embodiments of the present application, the term "and/or" merely describes an association relationship between associated objects and means that three relationships may exist; for example, A and/or B may represent: A exists alone, A and B exist simultaneously, and B exists alone. In addition, the term "at least one" means any one of a plurality or any combination of at least two of a plurality; for example, including at least one of A, B, and C may mean including any one or more elements selected from the group consisting of A, B, and C. Further, the term "plurality" means two or more unless specifically stated otherwise.

In the description of the embodiments of the present application, the terms "first," "second," "third," "fourth," and the like (if any) are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present application.

Claims

1. A method for locating an empty file of a distributed file system (HDFS) is characterized by comprising the following steps:

acquiring a binary metadata file of a distributed file system (HDFS);

deserializing the binary metadata file to obtain a plaintext file;

loading the plaintext file in a pre-established Hive table to obtain a Hive table of the plaintext file; and positioning the file storing the empty file list in the Hive table of the plaintext file to obtain positioning information.

2. The method according to claim 1, wherein after acquiring the binary metadata file of the HDFS and before deserializing the binary metadata file, the method further comprises:

pushing the binary metadata file to a server cluster unrelated to a production environment;

the deserializing the binary metadata file includes: deserializing the binary metadata file in the server cluster.

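3. The method according to claim 1 or 2, wherein said acquiring a binary metadata file of a distributed file system HDFS comprises:

extracting the binary metadata file from the metadata node NameNode of the distributed file system HDFS.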

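4. The method of claim 1, wherein deserializing the binary metadata file comprises:

acquiring the structural information of the binary metadata file, and acquiring a corresponding deserialization program based on the structural information;

deserializing the binary metadata file based on the deserialization program.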

5. The method according to claim 1, wherein the positioning the file storing the empty file list in the Hive table of the plaintext file to obtain the positioning information comprises:

positioning the file storing the empty file list in the Hive table of the plaintext file based on the Hive query language HQL to obtain positioning information.

6. The method of claim 1, wherein the format of the plaintext file comprises at least one of: HDFS_DIR, the file path and directory path in the distributed file system HDFS; REPLICATION, the number of copies; MODIFICATION_TIME, the modification time; ACCESS_TIME, the access time.

7. The method according to claim 1, wherein after positioning the file storing the empty file list in the Hive table of the plaintext file to obtain the positioning information, the method further comprises: deleting all the located files in the empty file list based on the positioning information.

8. A distributed file system HDFS empty file positioning device is characterized by comprising:

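an acquisition module configured to acquire a binary metadata file of the distributed file system HDFS;

a deserialization module configured to deserialize the binary metadata file to obtain a plaintext file;

a loading module configured to load the plaintext file into a pre-established Hive table to obtain a Hive table of the plaintext file;

and a positioning module configured to position the file storing the empty file list in the Hive table of the plaintext file to obtain positioning information.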

9. An electronic device, comprising: a memory and a processor;

the memory stores computer-executable instructions;

the processor executes the computer-executable instructions stored in the memory to cause the electronic device to execute the method for locating an empty file of a distributed file system (HDFS) according to any one of claims 1-7.

10. A computer-readable storage medium having stored thereon computer-executable instructions for implementing the method of any one of claims 1-7 when executed by a processor.